Z Algorithm 一个和KMP,Boyer Moore并列的字符串搜索

KMP和Boyer Moore应该是最常见的两个子串搜索算法. 相比之下, KMP比较适合用在字符串基数少的string, 比如DNA, 然后BM相反. 和z相比, z的优势也就是短点…

今天看到GeeksforGeeks上介绍了Z function(Z algorithm).我都快忘了还有这个了. Z algorithm的时,空间复杂度和KMP类似, 采用的技术也类似.简单来说, Z algorithm采用了一个Z数组来保留已知子串信息. 和KMP一样, 我们需要预先计算Z数组.

如何计算Z数组?, Z数组的思想就是Longest Common Prefix (最长公共前缀,下面简称lcp). Z数组中在i位的元素表示lcp(s[0,s.length-1], s[i,s.length-1]). 就是包含i位的子串与串s的最长公共前缀. 比如

s: abbabbaa

z: [0, 0, 0, 4, 0, 0, 1, 1]

在第二个a的时候, z是4, 表示abba是和s的最长公共子串, 即:lcp(abbabbaa,abbaa) = |abba|= 4.

如何使用z数组呢? 其实很简单, 只要把目标串s和匹配串p放在一个串中, 中间加个特殊符号, 就可以做子串搜索了.比如:

s: abc

p:abaedfbabcabd

our string: s+’$’+p =abc$abaedfbabcabd

z: [0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 0, 3, 0, 0, 2, 0, 0]

这里因为我们知道s的长度是3, 所以数组中的3代表这完全匹配.即是答案.

下面是计算zfunction的code, 我发现GeeksforGeek上的code有点麻烦,写了一下自己的:

public int[] zAlgorithm(String s) {
		int[] z = new int[s.length()];
		int l = 0;
        int r = 0;
		for (int i = 1; i < z.length; ++i) {
			if (i <= r) // use to skip the calculations. because we already know that, in ith position of String s, if i is in [l,r], then its lcp should be either the substring
			// of [l,r](r-i+1), or its previous z value in [l,r] (z[i-l]), and we need the minimum of them, since either of them could trick us. 
				z[i] = Math.min(r - i + 1, z[i - l]);
			while (i + z[i] < z.length && s.charAt(z[i]) == s.charAt(i + z[i])) // increase r of [l,r]
				z[i]++;
			if (i + z[i] - 1 > r) { // update l and r
				l = i;
				r = i + z[i] - 1;
			}
		}
		return z;
	}

Z Algorithm 一个和KMP,Boyer Moore并列的字符串搜索

Leave A Comment

Leave a Reply

书脊

这青苔碧瓦堆, 俺曾睡风流觉, 将五十年兴亡看饱.

Categories