字符串搜索算法
Brute-Force 算法-一个简单粗暴的方法。
Boyer-Moore 算法-一种高效的字符串子串搜索算法(BM算法)。它不需要对被搜索的字符串中的字符进行逐一比较,而是根据一个查找表跳过其中的某些部分。
Knuth-Morris-Pratt 算法 - 一个线性复杂度字符串搜索算法(KMP算法),通过模式字符进行匹配返回所有的字符串的位置
Rabin-Karp 算法 - 通过哈希算法快速搜索
最长公共子序列算法-找到两个字符串中的最长公共子序列。
Z-Algorithm 在一个字符串中找到所有的模式字符,并返回模式字符在字符串中的开始位置
暴力字符串搜索(Brute-Force String Search)
如果不允许导入Foundation
并且不能使用NSString
的rangeOfString()
方法,那么如何在纯Swift中编写字符串搜索算法呢?
目标是在String
上实现indexOf(pattern: String)
扩展,返回第一次出现的搜索模式的String.Index
,如果在字符串中找不到模式,则返回nil
。
例子:
// Input:
let s = "Hello, World"
s.indexOf("World")
// Output:
<String.Index?> 7
// Input:
let animals = "?????????????????????"
animals.indexOf("?")
// Output:
<String.Index?> 6
注意: 牛的索引是6,而不是你想象的3,因为字符串为表情符号使用更多的存储空间。 String.Index
的实际值并不那么重要,只要它指向字符串中的正确字符。
这是暴力解决方案:
extension String {
func indexOf(_ pattern: String) -> String.Index? {
for i in self.indices {
var j = i
var found = true
for p in pattern.indices {
if j == self.endIndex || self[j] != pattern[p] {
found = false
break
} else {
j = self.index(after: j)
}
}
if found {
return i
}
}
return nil
}
}
这将依次查看源字符串中的每个字符。 如果字符等于搜索模式的第一个字符,则内部循环检查模式的其余部分是否匹配,如果未找到匹配项,则外循环将从中断处继续。 重复此过程直到找到完全匹配或到达源字符串的结尾。
暴力方法运行正常,但效率不高(或漂亮)。 不过,暴力方法应该可以在小字符串上正常工作。 对于使用大块文本更好的智能算法,请查看Boyer-Moore字符串搜索。
Boyer-Moore字符串搜索(Boyer-Moore String Search)
这个主题已经有教程 here
目标:在纯Swift中编写字符串搜索算法,而无需导入Foundation
或使用NSString
的rangeOfString()
方法。
换句话说,我们想在String
上实现一个indexOf(pattern:String)
扩展,它返回在字符串里面第一次出现搜索模式的String.Index
,如果找不到模式则返回nil
。
例子:
// Input:
let s = "Hello, World"
s.indexOf(pattern: "World")
// Output:
<String.Index?> 7
// Input:
let animals = "?????"
animals.indexOf(pattern: "?")
// Output:
<String.Index?> 6
注意: 牛的索引是6,而不是你想象的3,因为字符串为表情符号使用更多的存储空间。String.Index
的实际值并不那么重要,只是它指向字符串中的正确字符。
暴力方法工作正常,但效率不高,尤其是在大块文本上。 事实证明,您不需要从源字符串中查看 每个 字符 —— 通常可以跳过多个字符。
这种跳过算法被称为Boyer-Moore算法,它已存在很长时间了。它被认为是所有字符串搜索算法的基准。
以下是您在Swift中编写它的方法:
extension String {
func index(of pattern: String) -> Index? {
let patternLength = pattern.count
guard patternLength > 0, patternLength <= self.count else { return nil }
var skipTable = [Character: Int]()
for (i, c) in pattern.enumerated() {
skipTable[c] = patternLength - i - 1
}
let p = pattern.index(before: pattern.endIndex)
let lastChar = pattern[p]
var i = index(startIndex, offsetBy: patternLength - 1)
func backwards() -> Index? {
var q = p
var j = i
while q > pattern.startIndex {
j = index(before: j)
q = index(before: q)
if self[j] != pattern[q] { return nil }
}
return j
}
while i < endIndex {
let c = self[i]
if c == lastChar {
if let k = backwards() { return k }
i = index(after: i)
} else {
i = index(i, offsetBy: skipTable[c] ?? patternLength, limitedBy: endIndex) ?? endIndex
}
}
return nil
}
}
该算法的工作原理如下。 让源字符与搜索模式字符头部对齐排列,并查看字符串中的哪个字符与搜索模式的 最后 字符匹配:
source string: Hello, World
search pattern: World
^
有三种可能性:
这两个字符是相同的。 你找到了可能的匹配。
字符不相等,但源字符确实有出现在搜索模式其他位置中。
源字符完全没有出现在搜索模式中。
在示例中,字符o
和d
不匹配,但o
确实出现在搜索模式中。 这意味着我们可以跳过几个位置:
source string: Hello, World
search pattern: World
^
注意两个o
字符现在是如何对齐的。 再次,您将搜索模式的最后一个字符与搜索文本进行比较:W
vsd
。 它们是不相同的,但W
确实出现在搜索模式中。 因此,再次跳过一个位置,让两个W
字符在相同位置:
source string: Hello, World
search pattern: World
^
这次两个字符相等并且可能匹配。 要验证匹配,您需要进行暴力搜索,但是从搜索模式的末尾开始向前搜索。 这就是它的全部。
在任何给定时间跳过的数量由“跳过表”确定,“跳过表”是搜索模式中所有字符的字典和跳过的数量。 示例中的跳过表如下所示:
W: 4
o: 3
r: 2
l: 1
d: 0
字符越接近模式的末尾,跳过量越小。 如果某个字符在模式中出现多次,则最接近该模式结尾的字符将确定该字符的跳过值。
注意: 如果搜索模式只包含几个字符,则执行暴力搜索会更快。 在构建跳过表与为短模式执行暴力搜索之间需要进行权衡。
致谢:此代码基于1989年7月Dr Dobb's杂志发表的文章"Faster String Searches" by Costas Menico —— 对 ,1989年! 有时保留那些旧杂志是有用的。
扩展阅读:这个算法的详细分析
Boyer-Moore-Horspool 算法
上面算法的一个变体是Boyer-Moore-Horspool 算法。
像常规的Boyer-Moore算法一样,它使用skipTable
来跳过许多字符。 不同之处在于我们如何检查局部匹配。在上面的版本中,如果找到了部分匹配,但不是完全匹配,我们只跳过一个字符。在这个修订版本中,我们也在那种情况下使用跳过表。
这是Boyer-Moore-Horspool算法的一个实现:
extension String {
func index(of pattern: String, usingHorspoolImprovement: Bool = false) -> Index? {
let patternLength = pattern.count
guard patternLength > 0, patternLength <= self.count else {
return nil
}
var skipTable = [Character: Int]()
for (i, c) in pattern.enumerated() {
skipTable[c] = patternLength - i - 1
}
let p = pattern.index(before: pattern.endIndex)
let lastChar = pattern[p]
var i = index(startIndex, offsetBy: patternLength - 1)
func backwards() -> Index? {
var q = p
var j = i
while q > pattern.startIndex {
j = index(before: j)
q = index(before: q)
if self[j] != pattern[q] { return nil }
}
return j
}
while i < endIndex {
let c = self[i]
if c == lastChar {
if let k = backwards() { return k }
if !usingHorspoolImprovement {
i = index(after: i)
} else {
let jumpOffset = max(skipTable[c] ?? patternLength, 1)
i = index(i, offsetBy: jumpOffset, limitedBy: endIndex) ?? endIndex
}
} else {
i = index(i, offsetBy: skipTable[c] ?? patternLength, limitedBy: endIndex) ?? endIndex
}
}
return nil
}
}
在实践中,Horspool版本的算法往往比原始版本略好一些。 但是,这取决于你愿意做出什么样的权衡。
译注: 阮一峰老师的文章 字符串匹配的Boyer-Moore算法 讲的比较清晰
KMP算法(Knuth-Morris-Pratt String Search)
Goal: Write a linear-time string matching algorithm in Swift that returns the indexes of all the occurrencies of a given pattern.
目标:在Swift中编写线性时间字符串匹配算法,返回给定模式的所有出现的索引。
In other words, we want to implement an indexesOf(pattern: String)
extension on String
that returns an array [Int]
of integers, representing all occurrences' indexes of the search pattern, or nil
if the pattern could not be found inside the string.
换句话说,我们想在String
上实现了indixesOf(pattern:String)
方法,它返回一个整数的数组[Int]
,表示搜索模式的所有出现的索引,如果在字符串中找不到模式,则返回nil
。
For example:
例如:
let dna = "ACCCGGTTTTAAAGAACCACCATAAGATATAGACAGATATAGGACAGATATAGAGACAAAACCCCATACCCCAATATTTTTTTGGGGAGAAAAACACCACAGATAGATACACAGACTACACGAGATACGACATACAGCAGCATAACGACAACAGCAGATAGACGATCATAACAGCAATCAGACCGAGCGCAGCAGCTTTTAAGCACCAGCCCCACAAAAAACGACAATFATCATCATATACAGACGACGACACGACATATCACACGACAGCATA"
dna.indexesOf(ptnr: "CATA") // Output: [20, 64, 130, 140, 166, 234, 255, 270]
let concert = "?????????????"
concert.indexesOf(ptnr: "??") // Output: [6]
The Knuth-Morris-Pratt algorithm is considered one of the best algorithms for solving the pattern matching problem. Although in practice Boyer-Moore is usually preferred, the algorithm that we will introduce is simpler, and has the same (linear) running time.
KMP算法被认为是解决模式匹配问题的最佳算法之一。 虽然在实践中Boyer-Moore算法通常是首选,但KMP算法更简单,并且具有相同的(线性)运行时间。
The idea behind the algorithm is not too different from the naive string search procedure. As it, Knuth-Morris-Pratt aligns the text with the pattern and goes with character comparisons from left to right. But, instead of making a shift of one character when a mismatch occurs, it uses a more intelligent way to move the pattern along the text. In fact, the algorithm features a pattern pre-processing stage where it acquires all the informations that will make the algorithm skip redundant comparisons, resulting in larger shifts.
该算法背后的想法与暴力字符串搜索程序没有太大区别。 因此,KMP算法将文本与模式对齐,并从左到右进行字符比较。 但是,当发生不匹配时,不是使一个字符移位,而是使用更智能的方式沿着文本移动模式。 实际上,该算法具有模式预处理阶段,其中它获取将使算法跳过冗余比较的所有信息,从而导致更大的移位。
The pre-processing stage produces an array (called suffixPrefix
in the code) of integers in which every element suffixPrefix[i]
records the length of the longest proper suffix of P[0...i]
(where P
is the pattern) that matches a prefix of P
. In other words, suffixPrefix[i]
is the longest proper substring of P
that ends at position i
and that is a prefix of P
. Just a quick example. Consider P = "abadfryaabsabadffg"
, then suffixPrefix[4] = 0
, suffixPrefix[9] = 2
, suffixPrefix[14] = 4
.
预处理阶段产生一个整数的数组(在代码中称为suffixPrefix
),其中每个元素suffixPrefix[i]
记录最长的正确后缀P[0...i]
的长度(其中 P
是模式)匹配前缀P
。 换句话说,suffixPrefix[i]
是P
的最长的正确子串,它在位置i
结束并且是P
的前缀。 只是一个简单的例子。 考虑P = "abadfryaabsabadffg"
,然后suffixPrefix[4] = 0
,suffixPrefix 9] = 2
,suffixPrefix[14] = 4
。
There are different ways to obtain the values of SuffixPrefix
array. We will use the method based on the Z-Algorithm. This function takes in input the pattern and produces an array of integers. Each element represents the length of the longest substring starting at position i
of P
and that matches a prefix of P
. You can notice that the two arrays are similar, they record the same informations but on the different places. We only have to find a method to map Z[i]
to suffixPrefix[j]
. It is not that difficult and this is the code that will do for us:
有不同的方法来获取SuffixPrefix
数组的值。 我们将使用基于Z-Algorithm的方法。 此函数接受输入模式并生成整数数组。 每个元素表示从P
的位置i
开始并且与P
的前缀匹配的最长子串的长度。 你可以注意到两个数组是相似的,它们记录相同的信息,但是在不同的地方。 我们只需找到一种方法将Z[i]
映射到suffixPrefix[j]
。 这并不困难,这是为我们做的代码:
for patternIndex in (1 ..< patternLength).reversed() {
textIndex = patternIndex + zeta![patternIndex] - 1
suffixPrefix[textIndex] = zeta![patternIndex]
}
We are simply computing the index of the end of the substring starting at position i
(as we know matches a prefix of P
). The element of suffixPrefix
at that index then it will be set with the length of the substring.
我们只是简单地计算从位置i
开始的子串结束的索引(因为我们知道匹配P
的前缀)。 在该索引处的suffixPrefix
元素然后将使用子字符串的长度进行设置。
Once the shift-array suffixPrefix
is ready we can begin with pattern search stage. The algorithm first attempts to compare the characters of the text with those of the pattern. If it succeeds, it goes on until a mismatch occurs. When it happens, it checks if an occurrence of the pattern is present (and reports it). Otherwise, if no comparisons are made then the text cursor is moved forward, else the pattern is shifted to the right. The shift's amount is based on the suffixPrefix
array, and it guarantees that the prefix P[0...suffixPrefix[i]]
will match its opposing substring in the text. In this way, shifts of more than one character are often made and lot of comparisons can be avoided, saving a lot of time.
一旦移位数组suffixPrefix
准备就绪,我们就可以从模式搜索阶段开始。 该算法首先尝试将文本的字符与模式的字符进行比较。 如果成功,它会一直持续到发生不匹配为止。 当它发生时,它会检查是否存在模式(并报告)。 否则,如果没有进行比较,则文本光标向前移动,否则图案向右移动。 shift的数量基于suffixPrefix
数组,它保证前缀P[0...suffixPrefix[i]]
将匹配文本中相反的子字符串。 通过这种方式,通常可以进行多个字符的移位,并且可以避免大量的比较,从而节省大量时间。
Here is the code of the Knuth-Morris-Pratt algorithm:
KMP算法代码:
extension String {
func indexesOf(ptnr: String) -> [Int]? {
let text = Array(self.characters)
let pattern = Array(ptnr.characters)
let textLength: Int = text.count
let patternLength: Int = pattern.count
guard patternLength > 0 else {
return nil
}
var suffixPrefix: [Int] = [Int](repeating: 0, count: patternLength)
var textIndex: Int = 0
var patternIndex: Int = 0
var indexes: [Int] = [Int]()
/* Pre-processing stage: computing the table for the shifts (through Z-Algorithm) */
let zeta = ZetaAlgorithm(ptnr: ptnr)
for patternIndex in (1 ..< patternLength).reversed() {
textIndex = patternIndex + zeta![patternIndex] - 1
suffixPrefix[textIndex] = zeta![patternIndex]
}
/* Search stage: scanning the text for pattern matching */
textIndex = 0
patternIndex = 0
while textIndex + (patternLength - patternIndex - 1) < textLength {
while patternIndex < patternLength && text[textIndex] == pattern[patternIndex] {
textIndex = textIndex + 1
patternIndex = patternIndex + 1
}
if patternIndex == patternLength {
indexes.append(textIndex - patternIndex)
}
if patternIndex == 0 {
textIndex = textIndex + 1
} else {
patternIndex = suffixPrefix[patternIndex - 1]
}
}
guard !indexes.isEmpty else {
return nil
}
return indexes
}
}
Let's make an example reasoning with the code above. Let's consider the string P = ACTGACTA"
, the consequentially obtained suffixPrefix
array equal to [0, 0, 0, 0, 0, 0, 3, 1]
, and the text T = "GCACTGACTGACTGACTAG"
. The algorithm begins with the text and the pattern aligned like below. We have to compare T[0]
with P[0]
.
让我们用上面的代码作一个例子推理。 让我们考虑字符串P = ACTGACTA"
,结果获得的suffixPrefix
数组等于[0,0,0,0,0,0,1,1]
,文本T ="GCACTGACTGACTGACTAG"
算法从文本和模式开始,如下所示。我们必须比较T[0]
和P[0]
。
1
0123456789012345678
text: GCACTGACTGACTGACTAG
textIndex: ^
pattern: ACTGACTA
patternIndex: ^
x
suffixPrefix: 00000031
We have a mismatch and we move on comparing T[1]
and P[0]
. We have to check if a pattern occurrence is present but there is not. So, we have to shift the pattern right and by doing so we have to check suffixPrefix[1 - 1]
. Its value is 0
and we restart by comparing T[1]
with P[0]
. Again a mismath occurs, so we go on with T[2]
and P[0]
.
我们有一个不匹配,我们继续比较T[1]
和P[0]
。 我们必须检查模式是否存在但是没有。 所以,我们必须正确地改变模式,所以我们必须检查suffixPrefix[1 - 1]
。 它的值为0
,我们通过将T[1]
与P[0]
进行比较来重新启动。 再次出现一个错误,所以我们继续使用T[2]
和P[0]
。
1
0123456789012345678
text: GCACTGACTGACTGACTAG
textIndex: ^
pattern: ACTGACTA
patternIndex: ^
suffixPrefix: 00000031
This time we have a match. And it continues until position 8
. Unfortunately the length of the match is not equal to the pattern length, we cannot report an occurrence. But we are still lucky because we can use the values computed in the suffixPrefix
array now. In fact, the length of the match is 7
, and if we look at the element suffixPrefix[7 - 1]
we discover that is 3
. This information tell us that that the prefix of P
matches the suffix of the susbtring T[0...8]
. So the suffixPrefix
array guarantees us that the two substring match and that we do not have to compare their characters, so we can shift right the pattern for more than one character!
The comparisons restart from T[9]
and P[3]
.
这次我们有一个匹配。 它一直持续到位置8
。 不幸的是,匹配的长度不等于模式长度,我们无法报告发生的事件。 但我们仍然很幸运,因为我们现在可以使用suffixPrefix
数组中计算的值。 实际上,匹配的长度是7
,如果我们看一下元素suffixPrefix[7 - 1]
,我们发现它是3
。 这个信息告诉我们P
的前缀匹配susbtringT [0 ... 8]
的后缀。 所以suffixPrefix
数组保证我们两个子字符串匹配,并且我们不必比较它们的字符,所以我们可以将模式向右移动多个字符!
1
0123456789012345678
text: GCACTGACTGACTGACTAG
textIndex: ^
pattern: ACTGACTA
patternIndex: ^
suffixPrefix: 00000031
They match so we continue the compares until position 13
where a misatch occurs beetwen charcter G
and A
. Just like before, we are lucky and we can use the suffixPrefix
array to shift right the pattern.
它们匹配,所以我们继续比较,直到位置13
,其中发生了一个错误,发生在字母G
和A
之间。 就像以前一样,我们很幸运,我们可以使用suffixPrefix
数组向右移动模式。
1
0123456789012345678
text: GCACTGACTGACTGACTAG
textIndex: ^
pattern: ACTGACTA
patternIndex: ^
suffixPrefix: 00000031
Again, we have to compare. But this time the comparisons finally take us to an occurrence, at position 17 - 7 = 10
.
再次,我们必须比较。 但这次比较最终将我们发生在位置17 - 7 = 10
。
1
0123456789012345678
text: GCACTGACTGACTGACTAG
textIndex: ^
pattern: ACTGACTA
patternIndex: ^
suffixPrefix: 00000031
The algorithm than tries to compare T[18]
with P[1]
(because we used the element suffixPrefix[8 - 1] = 1
) but it fails and at the next iteration it ends its work.
该算法比试图比较T[18]
和P[1]
(因为我们使用元素suffixPrefix [8 - 1] = 1
)但它失败了,在下一次迭代它结束了它的工作。
The pre-processing stage involves only the pattern. The running time of the Z-Algorithm is linear and takes O(n)
, where n
is the length of the pattern P
. After that, the search stage does not "overshoot" the length of the text T
(call it m
). It can be be proved that number of comparisons of the search stage is bounded by 2 * m
. The final running time of the Knuth-Morris-Pratt algorithm is O(n + m)
.
预处理阶段仅涉及模式。 Z算法的运行时间是线性的,取O(n)
,其中n
是模式P
的长度。 之后,搜索阶段不会“超过”文本T
的长度(称之为m
)。 可以证明,搜索阶段的比较数量以2 * m
为界。 KMP算法的最终运行时间是O(n + m)
。
Note: To execute the code in the KnuthMorrisPratt.swift you have to copy the ZAlgorithm.swift file contained in the Z-Algorithm folder. The KnuthMorrisPratt.playground already includes the definition of theZeta
function.
注意: 要执行KnuthMorrisPratt.swift中的代码,您必须复制ZAlgorithm.swift文件中包含的 Z-Algorithm文件夹。 KnuthMorrisPratt.playground已经包含了Zeta
函数的定义。
Credits: This code is based on the handbook "Algorithm on String, Trees and Sequences: Computer Science and Computational Biology" by Dan Gusfield, Cambridge University Press, 1997.
致谢:此代码基于手册 “字符串,树和序列算法:计算机科学和计算生物学” Dan Gusfield,剑桥大学出版社,1997年。
Written for Swift Algorithm Club by Matteo Dunnhofer
Rabin-Karp string search algorithm
Rabin-Karp字符串搜索算法
The Rabin-Karp string search algorithm is used to search text for a pattern.
Rabin-Karp字符串搜索算法用于搜索模式的文本。
A practical application of the algorithm is detecting plagiarism. Given source material, the algorithm can rapidly search through a paper for instances of sentences from the source material, ignoring details such as case and punctuation. Because of the abundance of the sought strings, single-string searching algorithms are impractical.
该算法的实际应用是检测抄袭。 在给定源材料的情况下,该算法可以快速在纸张中搜索来自源材料的句子实例,忽略诸如大小写和标点符号之类的细节。 由于所搜索的字符串丰富,单字符串搜索算法是不切实际的。
Example
例子
Given a text of "The big dog jumped over the fox" and a search pattern of "ump" this will return 13.
It starts by hashing "ump" then hashing "The". If hashed don't match then it slides the window a character
at a time (e.g. "he ") and subtracts out the previous hash from the "T".
鉴于 "The big dog jumped over the fox" 的文字和“ump”的搜索模式,这将返回13。
它首先散列“ump”然后散列“The”。 如果散列不匹配,则它会在窗口中滑动一个字符
一次(例如“他”)并从“T”中减去先前的散列。
Algorithm
算法
The Rabin-Karp algorithm uses a sliding window the size of the search pattern. It starts by hashing the search pattern, then
hashing the first x characters of the text string where x is the length of the search pattern. It then slides the window one character over and uses
the previous hash value to calculate the new hash faster. Only when it finds a hash that matches the hash of the search pattern will it compare
the two strings it see if they are the same (to prevent a hash collision from producing a false positive).
Rabin-Karp算法使用与搜索模式大小相同的滑动窗口。 然后,它通过散列搜索模式开始
散列文本字符串的前x个字符,其中x是搜索模式的长度。 然后它将窗口滑动一个字符并使用
先前的哈希值可以更快地计算新哈希值。 只有当它找到与搜索模式的哈希匹配的哈希时才会进行比较
它们看到的两个字符串是否相同(以防止哈希冲突产生误报)。
The code
代码
The major search method is next. More implementation details are in rabin-karp.swift
接下来是主要的搜索方法。 更多实现细节在rabin-karp.swift中
public func search(text: String , pattern: String) -> Int {
// convert to array of ints
let patternArray = pattern.flatMap { $0.asInt }
let textArray = text.flatMap { $0.asInt }
if textArray.count < patternArray.count {
return -1
}
let patternHash = hash(array: patternArray)
var endIdx = patternArray.count - 1
let firstChars = Array(textArray[0...endIdx])
let firstHash = hash(array: firstChars)
if (patternHash == firstHash) {
// Verify this was not a hash collision
if firstChars == patternArray {
return 0
}
}
var prevHash = firstHash
// Now slide the window across the text to be searched
for idx in 1...(textArray.count - patternArray.count) {
endIdx = idx + (patternArray.count - 1)
let window = Array(textArray[idx...endIdx])
let windowHash = nextHash(prevHash: prevHash, dropped: textArray[idx - 1], added: textArray[endIdx], patternSize: patternArray.count - 1)
if windowHash == patternHash {
if patternArray == window {
return idx
}
}
prevHash = windowHash
}
return -1
}
This code can be tested in a playground using the following:
可以使用以下代码在 playground 测试此代码:
search(text: "The big dog jumped"", "ump")
This will return 13 since ump is in the 13 position of the zero based string.
这将返回13,因为ump处于基于零的字符串的13位置。
Additional Resources
扩展阅读
Longest Common Subsequence
最长公共子序列
The Longest Common Subsequence (LCS) of two strings is the longest sequence of characters that appear in the same order in both strings.
两个字符串的最长公共子序列(LCS)是在两个字符串中以相同顺序出现的最长字符序列。
For example the LCS of "Hello World"
and "Bonjour le monde"
is "oorld"
. If you go through both strings from left-to-right, you'll find that the characters o
, o
, r
, l
, d
appear in both strings in that order.
例如,"Hello World"
和 "Bonjour le monde"
的LCS是"oorld"
。 如果你从左到右遍历两个字符串,你会发现字符 o
, o
, r
, l
, d
按顺序出现在两个字符串中。
Other possible subsequences are "ed"
and "old"
, but these are all shorter than "oorld"
.
其他可能的子序列是"ed"
和"old"
,但这些都比"oorld"
更短。
Note: This should not be confused with the Longest Common Substring problem, where the characters must form a substring of both strings, i.e they have to be immediate neighbors. With a subsequence, it's OK if the characters are not right next to each other, but they must be in the same order.
注意: 这不应与最长公共子串问题混淆,其中字符必须形成两个字符串的子串,即它们必须是直接邻居。 如果字符不是彼此相邻,则可以使用子序列,但它们必须处于相同的顺序。
One way to find the LCS of two strings a
and b
is using dynamic programming and a backtracking strategy.
找到两个字符串a
和b
的LCS的一种方法是使用动态编程和回溯策略。
Finding the length of the LCS with dynamic programming
通过动态编程查找LCS的长度
First, we want to find the length of the longest common subsequence between strings a
and b
. We're not looking for the actual subsequence yet, only how long it is.
首先,我们想要找到字符串a
和b
之间最长公共子序列的长度。 我们还没有找到实际的后续序列,只需要多长时间。
To determine the length of the LCS between all combinations of substrings of a
and b
, we can use a dynamic programming technique. Dynamic programming basically means that you compute all possibilities and store them inside a look-up table.
为了确定a
和b
的所有子串组合之间LCS的长度,我们可以使用动态编程技术。 动态编程基本上意味着您计算所有可能性并将它们存储在查找表中。
Note: During the following explanation,n
is the length of stringa
, andm
is the length of stringb
.
注意: 在下面的解释中,n
是字符串a
的长度,m
是字符串b
的长度。
To find the lengths of all possible subsequences, we use a helper function, lcsLength(_:)
. This creates a matrix of size (n+1)
by (m+1)
, where matrix[x][y]
is the length of the LCS between the substrings a[0...x-1]
and b[0...y-1]
.
为了找到所有可能子序列的长度,我们使用辅助函数lcsLength(_:)
。 这创建了一个大小为(n + 1)
by(m + 1)
的矩阵,其中matrix[x][y]
是子串之间的LCS长度a[0 ... x- 1]
和b [0 ... y-1]
。
Given strings "ABCBX"
and "ABDCAB"
, the output matrix of lcsLength(_:)
is the following:
给定字符串"ABCBX"
和"ABDCAB"
,lcsLength(_ :)
的输出矩阵如下:
| | Ø | A | B | D | C | A | B |
| Ø | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| B | 0 | 1 | 2 | 2 | 2 | 2 | 2 |
| C | 0 | 1 | 2 | 2 | 3 | 3 | 3 |
| B | 0 | 1 | 2 | 2 | 3 | 3 | 4 |
| X | 0 | 1 | 2 | 2 | 3 | 3 | 4 |
In this example, if we look at matrix[3][4]
we find the value 3
. This means the length of the LCS between a[0...2]
and b[0...3]
, or between "ABC"
and "ABDC"
, is 3. That is correct, because these two substrings have the subsequence ABC
in common. (Note: the first row and column of the matrix are always filled with zeros.)
在这个例子中,如果我们看一下matrix[3][4]
,我们找到值3
。 这意味着a [0 ... 2]
和b[0 ... 3]
之间或"ABC"
和"ABDC"
之间的LCS长度为3.这是正确的 ,因为这两个子串具有共同的子序列ABC
。 (注意:矩阵的第一行和第一列始终用零填充。)
Here is the source code for lcsLength(_:)
; this lives in an extension on String
:
这是lcsLength(_ :)
的源代码。 它存在于String
的扩展中:
func lcsLength(_ other: String) -> [[Int]] {
var matrix = [[Int]](repeating: [Int](repeating: 0, count: other.characters.count+1), count: self.characters.count+1)
for (i, selfChar) in self.characters.enumerated() {
for (j, otherChar) in other.characters.enumerated() {
if otherChar == selfChar {
// Common char found, add 1 to highest lcs found so far.
matrix[i+1][j+1] = matrix[i][j] + 1
} else {
// Not a match, propagates highest lcs length found so far.
matrix[i+1][j+1] = max(matrix[i][j+1], matrix[i+1][j])
}
}
}
return matrix
}
First, this creates a new matrix -- really a 2-dimensional array -- and fills it with zeros. Then it loops through both strings, self
and other
, and compares their characters in order to fill in the matrix. If two characters match, we increment the length of the subsequence. However, if two characters are different, then we "propagate" the highest LCS length found so far.
首先,这会创建一个新的矩阵 - 实际上是一个二维数组 - 并用零填充它。 然后它循环遍历两个字符串,self
和other
,并比较它们的字符以填充矩阵。 如果两个字符匹配,我们增加子序列的长度。 但是,如果两个字符不同,那么我们“传播”到目前为止找到的最高LCS长度。
Let's say the following is the current situation:
让我们说以下是目前的情况:
| | Ø | A | B | D | C | A | B |
| Ø | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| B | 0 | 1 | 2 | 2 | 2 | 2 | 2 |
| C | 0 | 1 | 2 | * | | | |
| B | 0 | | | | | | |
| X | 0 | | | | | | |
The *
marks the two characters we're currently comparing, C
versus D
. These characters are not the same, so we propagate the highest length we've seen so far, which is 2
:*
表示我们正在比较的两个字符,C
与D
。 这些字符不一样,所以我们传播到目前为止我们见过的最高长度,即2
:
| | Ø | A | B | D | C | A | B |
| Ø | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| B | 0 | 1 | 2 | 2 | 2 | 2 | 2 |
| C | 0 | 1 | 2 | 2 | * | | |
| B | 0 | | | | | | |
| X | 0 | | | | | | |
Now we compare C
with C
. These are equal, and we increment the length to 3
:
现在我们将C
与C
进行比较。 这些是相等的,我们将长度增加到3
:
| | Ø | A | B | D | C | A | B |
| Ø | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| B | 0 | 1 | 2 | 2 | 2 | 2 | 2 |
| C | 0 | 1 | 2 | 2 | 3 | * | |
| B | 0 | | | | | | |
| X | 0 | | | | | | |
And so on... this is how lcsLength(_:)
fills in the entire matrix.
。。。这就是lcsLength(_ :)
填充整个矩阵的方式。
Backtracking to find the actual subsequence
回溯找到实际的子序列
So far we've calculated the length of every possible subsequence. The length of the longest subsequence is found in the bottom-right corner of matrix, at matrix[n+1][m+1]
. In the above example it is 4, so the LCS consists of 4 characters.
到目前为止,我们已经计算了每个可能的子序列的长度。 最长子序列的长度位于矩阵的右下角,在matrix[n + 1][m + 1]
处。 在上面的示例中,它是4,因此LCS由4个字符组成。
Having the length of every combination of substrings makes it possible to determine which characters are part of the LCS itself by using a backtracking strategy.
具有子串的每个组合的长度使得可以通过使用回溯策略来确定哪个字符是LCS本身的一部分。
Backtracking starts at matrix[n+1][m+1]
and walks up and left (in this priority) looking for changes that do not indicate a simple propagation.
回溯从matrix[n + 1][m + 1]
开始,向上和向左(在此优先级中)寻找不表示简单传播的变化。
| | Ø| A| B| D| C| A| B|
| Ø | 0| 0| 0| 0| 0| 0| 0|
| A | 0|↖ 1| 1| 1| 1| 1| 1|
| B | 0| 1|↖ 2|← 2| 2| 2| 2|
| C | 0| 1| 2| 2|↖ 3|← 3| 3|
| B | 0| 1| 2| 2| 3| 3|↖ 4|
| X | 0| 1| 2| 2| 3| 3|↑ 4|
Each ↖
move indicates a character (in row/column header) that is part of the LCS.
每个↖
移动表示一个字符(在行/列标题中),它是LCS的一部分。
If the number on the left and above are different than the number in the current cell, no propagation happened. In that case matrix[i][j]
indicates a common char between the strings a
and b
, so the characters at a[i - 1]
and b[j - 1]
are part of the LCS that we're looking for.
如果左侧和上方的数字与当前单元格中的数字不同,则不会发生传播。 在那种情况下matrix[i][j]
表示字符串a
和b
之间的公共字符,所以a[i - 1]
和b[j - 1]
的字符是一部分 我们正在寻找的LCS。
One thing to notice is, as it's running backwards, the LCS is built in reverse order. Before returning, the result is reversed to reflect the actual LCS.
需要注意的一点是,当它向后运行时,LCS以相反的顺序构建。 在返回之前,结果反转以反映实际的LCS。
Here is the backtracking code:
这是回溯代码:
func backtrack(_ matrix: [[Int]]) -> String {
var i = self.characters.count
var j = other.characters.count
var charInSequence = self.endIndex
var lcs = String()
while i >= 1 && j >= 1 {
// Indicates propagation without change: no new char was added to lcs.
if matrix[i][j] == matrix[i][j - 1] {
j -= 1
}
// Indicates propagation without change: no new char was added to lcs.
else if matrix[i][j] == matrix[i - 1][j] {
i -= 1
charInSequence = self.index(before: charInSequence)
}
// Value on the left and above are different than current cell.
// This means 1 was added to lcs length.
else {
i -= 1
j -= 1
charInSequence = self.index(before: charInSequence)
lcs.append(self[charInSequence])
}
}
return String(lcs.characters.reversed())
}
This backtracks from matrix[n+1][m+1]
(bottom-right corner) to matrix[1][1]
(top-left corner), looking for characters that are common to both strings. It adds those characters to a new string, lcs
.
这从matrix[n + 1][m + 1]
(右下角)回溯到matrix[1][1]
(左上角),寻找两个字符串共有的字符。 它将这些字符添加到一个新字符串lcs
中。
The charInSequence
variable is an index into the string given by self
. Initially this points to the last character of the string. Each time we decrement i
, we also move back charInSequence
. When the two characters are found to be equal, we add the character at self[charInSequence]
to the new lcs
string. (We can't just write self[i]
because i
may not map to the current position inside the Swift string.)charInSequence
变量是self
给出的字符串的索引。 最初,这指向字符串的最后一个字符。 每当我们减少i
时,我们也会移回charInSequence
。 当发现两个字符相等时,我们将self[charInSequence]
中的字符添加到新的lcs
字符串中。 (我们不能只写self[i]
因为i
可能不会映射到Swift字符串中的当前位置。)
Due to backtracking, characters are added in reverse order, so at the end of the function we call reversed()
to put the string in the right order. (Appending new characters to the end of the string and then reversing it once is faster than always inserting the characters at the front of the string.)
由于回溯,字符以相反的顺序添加,因此在函数的末尾我们调用reversed()
来将字符串按正确的顺序排列。 (将新字符附加到字符串的末尾,然后将其反转一次比始终在字符串前面插入字符要快。)
Putting it all together
把它们放在一起
To find the LCS between two strings, we first call lcsLength(_:)
and then backtrack(_:)
:
要在两个字符串之间找到LCS,我们首先调用lcsLength(_ :)
然后调用backtrack(_ :)
:
extension String {
public func longestCommonSubsequence(_ other: String) -> String {
func lcsLength(_ other: String) -> [[Int]] {
...
}
func backtrack(_ matrix: [[Int]]) -> String {
...
}
return backtrack(lcsLength(other))
}
}
To keep everything tidy, the two helper functions are nested inside the main longestCommonSubsequence()
function.
为了保持一切整洁,两个辅助函数嵌套在主longestCommonSubsequence()
函数中。
Here's how you could try it out in a Playground:
以下是您可以在Playground尝试的方法:
let a = "ABCBX"
let b = "ABDCAB"
a.longestCommonSubsequence(b) // "ABCB"
let c = "KLMK"
a.longestCommonSubsequence(c) // "" (no common subsequence)
"Hello World".longestCommonSubsequence("Bonjour le monde") // "oorld"
Z-Algorithm String Search
Z-算法字符串搜索
Goal: Write a simple linear-time string matching algorithm in Swift that returns the indexes of all the occurrencies of a given pattern.
目标:在Swift中编写一个简单的线性时间字符串匹配算法,返回给定模式的所有副本的索引。
In other words, we want to implement an indexesOf(pattern: String)
extension on String
that returns an array [Int]
of integers, representing all occurrences' indexes of the search pattern, or nil
if the pattern could not be found inside the string.
换句话说,我们想在String
上实现一个indicesOf(pattern: String)
扩展,它返回一个整数的数组[Int]
,表示搜索模式的所有出现的索引,如果是,则返回nil
在字符串中找不到模式。
For example:
例子:
let str = "Hello, playground!"
str.indexesOf(pattern: "ground") // Output: [11]
let traffic = "??????????????????????"
traffic.indexesOf(pattern: "?") // Output: [4, 21]
Many string search algorithms use a pre-processing function to compute a table that will be used in successive stage. This table can save some time during the pattern search stage because it allows to avoid un-needed characters comparisons. The [Z-Algorithm]() is one of these functions. It borns as a pattern pre-processing function (this is its role in the Knuth-Morris-Pratt algorithm and others) but, just like we will show here, it can be used also as a single string search algorithm.
许多字符串搜索算法使用预处理函数来计算将在连续阶段中使用的表。 此表可以在模式搜索阶段节省一些时间,因为它允许避免不需要的字符比较。 [Z-Algorithm]()是这些功能之一。 它具有图案预处理功能(这是它在Knuth-Morris-Pratt算法等中的作用)但是,就像我们将在这里展示的那样,它可以是 也用作单字符串搜索算法。
Z-Algorithm as pattern pre-processor
Z-算法作为模式预处理器
As we said, the Z-Algorithm is foremost an algorithm that process a pattern in order to calculate a skip-comparisons-table.
The computation of the Z-Algorithm over a pattern P
produces an array (called Z
in the literature) of integers in which each element, call it Z[i]
, represents the length of the longest substring of P
that starts at i
and matches a prefix of P
. In simpler words, Z[i]
records the longest prefix of P[i...|P|]
that matches a prefix of P
. As an example, let's consider P = "ffgtrhghhffgtggfredg"
. We have that Z[5] = 0 (f...h...)
, Z[9] = 4 (ffgtr...ffgtg...)
and Z[15] = 1 (ff...fr...)
.
在模式P
上计算Z-算法会产生一个整数的数组(在文献中称为Z
),其中每个元素称为Z[i]
,表示最长子串的长度。 P
从i
开始并匹配前缀P
。 简单来说,Z[i]
记录匹配前缀为P
的P [i ... | P |]
的最长前缀。 举个例子,让我们考虑P ="ffgtrhghhffgtggfredg"
。 我们有Z[5] = 0(f ... h ...)
,Z[9] = 4(ffgtr ... ffgtg ...)
和Z[15] = 1(ff...... FR)
。
But how do we compute Z
? Before we describe the algorithm we must indroduce the concept of Z-box. A Z-box is a pair (left, right)
used during the computation that records the substring of maximal length that occurs also as a prefix of P
. The two indices left
and right
represent, respectively, the left-end index and the right-end index of this substring.
The definition of the Z-Algorithm is inductive and it computes the elements of the array for every position k
in the pattern, starting from k = 1
. The following values (Z[k + 1]
, Z[k + 2]
, ...) are computed after Z[k]
. The idea behind the algorithm is that previously computed values can speed up the calculus of Z[k + 1]
, avoiding some character comparisons that were already done before. Consider this example: suppose we are at iteration k = 100
, so we are analyzing position 100
of the pattern. All the values between Z[1]
and Z[99]
were correctly computed and left = 70
and right = 120
. This means that there is a substring of length 51
starting at position 70
and ending at position 120
that matches the prefix of the pattern/string we are considering. Reasoning on it a little bit we can say that the substring of length 21
starting at position 100
matches the substring of length 21
starting at position 30
of the pattern (because we are inside a substring that matches a prefix of the pattern). So we can use Z[30]
to compute Z[100]
without additional character comparisons.
This a simple description of the idea that is behind this algorithm. There are a few cases to manage when the use of pre-computed values cannot be directly applied and some comparisons are to be made.
但是我们如何计算Z
?在我们描述算法之前,我们必须产生Z-box的概念。 Z-box是在计算过程中使用的一对(left, right)
,它记录了最大长度的子串,它也作为P
的前缀出现。两个索引left
和right
分别表示该子字符串的左端索引和右端索引。
Z算法的定义是归纳的,它计算模式中每个位置“k”的数组元素,从k = 1
开始。在Z [k]之后计算以下值(Z[k + 1]
,Z[k + 2]
,...)。该算法背后的想法是,先前计算的值可以加速Z[k + 1]
的计算,避免了之前已经完成的一些字符比较。考虑这个例子:假设我们处于迭代k = 100
,所以我们正在分析模式的位置100
。 Z[1]
和Z[99]
之间的所有值都被正确计算,left = 70
和right = 120
。这意味着有一个长度为51
的子字符串,从位置70
开始,到“120”结尾,与我们正在考虑的模式/字符串的前缀相匹配。稍微推理一下,我们可以说从位置100
开始的长度为21
的子串与从模式的位置“30”开始的长度为21
的子串相匹配(因为我们在匹配a的子串内)模式的前缀)。所以我们可以使用Z[30]
来计算Z[100]
而无需额外的字符比较。
这是对该算法背后的想法的简单描述。当无法直接应用预先计算的值的使用并且要进行一些比较时,有一些情况需要管理。
Here is the code of the function that computes the Z-array:
以下是计算Z数组的函数的代码:
func ZetaAlgorithm(ptrn: String) -> [Int]? {
let pattern = Array(ptrn.characters)
let patternLength: Int = pattern.count
guard patternLength > 0 else {
return nil
}
var zeta: [Int] = [Int](repeating: 0, count: patternLength)
var left: Int = 0
var right: Int = 0
var k_1: Int = 0
var betaLength: Int = 0
var textIndex: Int = 0
var patternIndex: Int = 0
for k in 1 ..< patternLength {
if k > right { // Outside a Z-box: compare the characters until mismatch
patternIndex = 0
while k + patternIndex < patternLength &&
pattern[k + patternIndex] == pattern[patternIndex] {
patternIndex = patternIndex + 1
}
zeta[k] = patternIndex
if zeta[k] > 0 {
left = k
right = k + zeta[k] - 1
}
} else { // Inside a Z-box
k_1 = k - left + 1
betaLength = right - k + 1
if zeta[k_1 - 1] < betaLength { // Entirely inside a Z-box: we can use the values computed before
zeta[k] = zeta[k_1 - 1]
} else if zeta[k_1 - 1] >= betaLength { // Not entirely inside a Z-box: we must proceed with comparisons too
textIndex = betaLength
patternIndex = right + 1
while patternIndex < patternLength && pattern[textIndex] == pattern[patternIndex] {
textIndex = textIndex + 1
patternIndex = patternIndex + 1
}
zeta[k] = patternIndex - k
left = k
right = patternIndex - 1
}
}
}
return zeta
}
Let's make an example reasoning with the code above. Let's consider the string P = “abababbb"
. The algorithm begins with k = 1
, left = right = 0
. So, no Z-box is "active" and thus, because k > right
we start with the character comparisons beetwen P[1]
and P[0]
.
让我们用上面的代码作一个例子推理。 让我们考虑字符串P ="abababbb"
。算法以k = 1
,left = right = 0
开头。所以,没有Z-box是“活跃的”因此,因为k > right
我们 从字符比较 P[1]
和P[0]
开始。
01234567
k: x
abababbb
x
Z: 00000000
left: 0
right: 0
We have a mismatch at the first comparison and so the substring starting at P[1]
does not match a prefix of P
. So, we put Z[1] = 0
and let left
and right
untouched. We begin another iteration with k = 2
, we have 2 > 0
and again we start comparing characters P[2]
with P[0]
. This time the characters match and so we continue the comparisons until a mismatch occurs. It happens at position 6
. The characters matched are 4
, so we put Z[2] = 4
and set left = k = 2
and right = k + Z[k] - 1 = 5
. We have our first Z-box that is the substring "abab"
(notice that it matches a prefix of P
) starting at position left = 2
.
我们在第一次比较时有不匹配,所以从P [1]
开始的子串与P
的前缀不匹配。 所以,我们把Z [1] = 0
并让'left和
right保持不变。 我们用
k = 2开始另一次迭代,我们有
2> 0并且我们再次开始将字符
P [2]'与P [0]
进行比较。 这次字符匹配,因此我们继续比较直到发生不匹配。 它发生在位置“6”。 匹配的字符是4
,所以我们把'Z[2] = 4并设置
left = k = 2和
right = k + Z [k] - 1 = 5。 我们有第一个Z-box,它是子串
"abab"(注意它匹配
P的前缀),从位置'left = 2
开始。
01234567
k: x
abababbb
x
Z: 00400000
left: 2
right: 5
We then proceed with k = 3
. We have 3 <= 5
. We are inside the Z-box previously found and inside a prefix of P
. So we can look for a position that has a previously computed value. We calculate k_1 = k - left = 1
that is the index of the prefix's character equal to P[k]
. We check Z[1] = 0
and 0 < (right - k + 1 = 3)
and we find that we are exactly inside the Z-box. We can use the previously computed value, so we put Z[3] = Z[1] = 0
, left
and right
remain unchanged.
At iteration k = 4
we initially execute the else
branch of the outer if
. Then in the inner if
we have that k_1 = 2
and (Z[2] = 4) >= 5 - 4 + 1
. So, the substring P[k...r]
matches for right - k + 1 = 2
chars the prefix of P
but it could not for the following characters. We must then compare the characters starting at r + 1 = 6
with those starting at right - k + 1 = 2
. We have P[6] != P[2]
and so we have to set Z[k] = 6 - 4 = 2
, left = 4
and right = 5
.
然后我们继续k = 3
。 我们有3 <= 5
。 我们在之前找到的Z-box里面,在P
的前缀里面。 因此,我们可以查找具有先前计算值的位置。 我们计算k_1 = k - left = 1
,它是前缀字符的索引,等于P[k]
。 我们检查Z [1] = 0
和0 <(right - k + 1 = 3)
我们发现我们正好在Z-box内。 我们可以使用先前计算的值,因此我们将Z [3] = Z [1] = 0
,left
和right
保持不变。
在迭代k = 4
时,我们最初执行外部if
的else
分支。 然后在内部if
中我们有k_1 = 2
和(Z [2] = 4)> = 5 - 4 + 1
。 因此,子串P [k ... r]
匹配right-k + 1 = 2
字符P
的前缀,但它不能用于后面的字符。 然后我们必须将从r + 1 = 6
开始的字符与从right-k + 1 = 2
开始的字符进行比较。 我们有P[6]!= P [2]
所以我们必须设置Z[k] = 6 - 4 = 2
,left = 4
和right = 5
。
01234567
k: x
abababbb
x
Z: 00402000
left: 4
right: 5
With iteration k = 5
we have k <= right
and then (Z[k_1] = 0) < (right - k + 1 = 1)
and so we set z[k] = 0
. In iteration 6
and 7
we execute the first branch of the outer if
but we only have mismatches, so the algorithms terminates returning the Z-array as Z = [0, 0, 4, 0, 2, 0, 0, 0]
.
在迭代“k = 5”时,我们有k <= right
然后(Z [k_1] = 0)<(right - k + 1 = 1)
所以我们设置z [k] = 0
。 在迭代6
和7
中,我们执行外部if
的第一个分支,但我们只有不匹配,所以算法终止返回Z数组为Z = [0,0,4,0,2, 0,0,0]
。
The Z-Algorithm runs in linear time. More specifically, the Z-Algorithm for a string P
of size n
has a running time of O(n)
.
Z算法以线性时间运行。 更具体地说,对于大小为n
的字符串P
的Z算法具有“O(n)”的运行时间。
The implementation of Z-Algorithm as string pre-processor is contained in the ZAlgorithm.swift file.
Z-Algorithm作为字符串预处理器的实现包含在ZAlgorithm.swift文件中。
Z-Algorithm as string search algorithm
Z-算法作为字符串搜索算法
The Z-Algorithm discussed above leads to the simplest linear-time string matching algorithm. To obtain it, we have to simply concatenate the pattern P
and text T
in a string S = P$T
where $
is a character that does not appear neither in P
nor T
. Then we run the algorithm on S
obtaining the Z-array. All we have to do now is scan the Z-array looking for elements equal to n
(which is the pattern length). When we find such value we can report an occurrence.
上面讨论的Z算法导致最简单的线性时间串匹配算法。 为了获得它,我们必须简单地在一个字符串S = P $ T
中连接模式P
和文本T
,其中$
是一个既不出现在'P也不出现'T
的字符。 然后我们在S
上运行算法获得Z阵列。 我们现在要做的就是扫描Z阵列,寻找等于“n”的元素(即模式长度)。 当我们找到这样的值时,我们可以报告一个事件。
extension String {
func indexesOf(pattern: String) -> [Int]? {
let patternLength: Int = pattern.characters.count
/* Let's calculate the Z-Algorithm on the concatenation of pattern and text */
let zeta = ZetaAlgorithm(ptrn: pattern + "?" + self)
guard zeta != nil else {
return nil
}
var indexes: [Int] = [Int]()
/* Scan the zeta array to find matched patterns */
for i in 0 ..< zeta!.count {
if zeta![i] == patternLength {
indexes.append(i - patternLength - 1)
}
}
guard !indexes.isEmpty else {
return nil
}
return indexes
}
}
Let's make an example. Let P = “CATA“
and T = "GAGAACATACATGACCAT"
be the pattern and the text. Let's concatenate them with the character $
. We have the string S = "CATA$GAGAACATACATGACCAT"
. After computing the Z-Algorithm on S
we obtain:
我们举个例子吧。 设P =“CATA”
和T =“GAGAACATACATGACCAT”
是模式和文本。 让我们将它们与字符$
连接起来。 我们有字符串S =“CATA $ GAGAACATACATGACCAT”
。 在迭代S
上计算Z算法后,我们得到:
1 2
01234567890123456789012
CATA$GAGAACATACATGACCAT
Z 00000000004000300001300
^
We scan the Z-array and at position 10
we find Z[10] = 4 = n
. So we can report a match occuring at text position 10 - n - 1 = 5
.
As said before, the complexity of this algorithm is linear. Defining n
and m
as pattern and text lengths, the final complexity we obtain is O(n + m + 1) = O(n + m)
.
我们扫描Z阵列,在位置“10”,我们发现Z[10] = 4 = n
。 所以我们可以报告在文本位置10 - n - 1 = 5
发生的匹配。
如前所述,该算法的复杂性是线性的。 将n
和m
定义为模式和文本长度,我们得到的最终复杂度是“O(n + m + 1)= O(n + m)”。
Credits: This code is based on the handbook "Algorithm on String, Trees and Sequences: Computer Science and Computational Biology" by Dan Gusfield, Cambridge University Press, 1997.
致谢:此代码基于手册[“字符串,树和序列算法:计算机科学和计算生物学”](books.google.it/book... )Dan Gusfield,剑桥大学出版社,1997年。
Written for Swift Algorithm Club by Matteo Dunnhofer