Data structure and algorithm -- string matching depth long text

Article directory


If you think that the writing is OK, please pay attention to WeChat official account: background development notes.

string matching

It is to find out whether there is string B in string A. at this time, string a is called the main string and string B is called the pattern string. Assuming that the length of string a is n and the length of string B is m, then n > = m is mainly used in search engines and sensitive word filtering at the industrial level

BF algorithm

Violence matching method, time complexity O(nm)

Although the complexity is high, it is often used in actual development. The reason is that in the actual development, the pattern string and the main string are not too long, and they will jump out if they match halfway, so the performance is relatively good. Besides, code implementation is simple and bug free.

Pseudo code implementation:

bool isSubstring(char *master, char *pattern) {
    for (int i = 0; i <= len(master) - len(pattern); i++) {	//Ergodic matching
    	bool flag = true;
    	for (int j = 0; j < len(pattern); j++) {
        	if (master[i + j] != pattern[j]) {
            	flag = false;
            	break;
        	}
    	}
    	if (flag) return true;	//Returns true directly if the exact match is successful
	}
    return false;
}

RK algorithm

Hash the n-m+1 substring of the main string respectively, and compare them with the hash value of the pattern string one by one. If they are the same, the matching is successful

One way

For a pure lowercase character string (case, case + letter, of course), we can think of it as a 26 base number to hash. The hash rule is: (last hash value% 26^(m-1)) × 26 + (character -'a ')

Features: high efficiency O(n), long mode string is not supported, otherwise, long long burst
‚Äč

Mode two

We can add each substring in the way of hash: in this way, there will be hash conflicts (the string hash values may be the same, but the strings are not equal). We need to check the strings with the same hash values, and if they are the same, the matching will succeed

Features: no explosion risk, the worst-case time complexity is O(nm), of course, the worst-case possibility is very low

Pseudo code:

//Further confirm whether it is the same
bool isSame(char *master, char *pattern, int startInx) {
	for (int i = 0; i < len(pattern); i++) {
		if (master[startInx + i] != pattern[i]) return false; 
	}
	return true;
}

bool isSubstring(char *master, char *pattern) {
	int hash = 0;				//Hash is the hash value of pattern string
	int tempHash = 0;			//hash value of the corresponding substring in the main string
	for (int i = 0; i < len(pattern); i++) {	
		hash += pattern[i] - 'a' + 1;
		tempHash += master[i] - 'a' + 1;
	}
	if (tempHash == hash) {	 //If the hash values are the same, verify that they are exactly the same
		if (isSame(master, pattern, 0)) {
			return true;	//Return if completely consistent
		}
	}
	//Otherwise, continue to traverse
    for (int i = len(pattern); i < len(master); i++) {
    	tempHash += master[i] - 'a' + 1;
   		tempHash -= master[i - len(pattern)] - 'a' + 1;
   		if (tempHash == hash) {	 //If the hash value is the same, check whether it is completely consistent
   			if (isSame(master, pattern, i - len(pattern) + 1)) {
   				return true;
   			}
   		}
	}
    return false;
}

BM algorithm

The essence of high efficiency of BM algorithm is: when encountering unmatched characters, according to some rules, you can slide the pattern string back a few more bits, and skip those cases that will not match for sure

In order to achieve the above purpose, BM algorithm has two rules: bad character rule and good suffix rule. We can calculate the number of digits of good suffix and bad character sliding backward respectively, and then take the maximum as the number of digits of pattern string sliding backward

PS: because the implementation of bad character rules consumes more memory, we can only use good suffix rules to implement BM algorithm

Bad character rule

  1. The pattern string is matched with the main string from the back to the front. The first mismatched main string character is the bad character
  2. Start from the pattern string character corresponding to the bad character and traverse the pattern string forward to find the first character that is the same as the bad character, and then align with the bad character (if not, move to the whole after the bad character)


Good Suffix

  1. The pattern string is matched with the main string from the back to the front. All characters after the first mismatched character are good suffixes
  2. Start from the pattern string character corresponding to the bad character and traverse the pattern string to find the first good suffix alignment (if not, move to the good suffix as a whole)


KMP algorithm

KMP is consistent with BM algorithm, which improves efficiency by moving more bits in case of mismatches and skipping certain mismatches. The time complexity is O(n+m)

Its approach is to find the longest good Prefix suffix substring that can match the pattern string prefix substring, and then move it to the corresponding position. See the figure for details:

How do we find the suffix substring of the longest good prefix? As shown in the picture:

In this way, we only need to preprocess the pattern string (because a good prefix must be the prefix substring of the pattern string). We preprocess the next array to save the ending subscript of the longest matching prefix substring, as shown in the figure:

next array method of optimized version

Its core is to use the next[i-1], next[i-2] To quickly deduce next[i], the derivation rules are:

  1. If next[i-1] = k, and pattern[k+1]==pattern[i], then next[i] = k+1
  2. If next[i-1] = k, but pattern[k+1]!=pattern[i], we will recursively search for the sub large prefix matching until pattern[k+1]==pattern[i] or until the end

Pseudo code

// n. M is the length of the main string and the mode string, respectively.
int kmp(char* master, char* pattern, int n, int m) {
  getNexts(pattern, m);	//Preprocess next array
  int j = 0;	//j represents the current matching position of pattern string
  for (int i = 0; i < n; ++i) {
    while (j > 0 && master[i] != pattern[j]) {	//Move mode string when bad character is encountered
      j = next[j - 1] + 1;
    }
    if (master[i] == pattern[j]) {
      ++j;
    }
    if (j == m) { //Find the fully matched substring in the main string and return the start and bottom corner of the substring
      return i - m + 1;
    }
  }
  return -1;
}

// Preprocess next array
void getNexts(char* pattern, int m) {
  next[0] = -1;		//A good prefix of length 1 has no longest matching prefix substring
  int k = -1;
  for (int i = 1; i < m; ++i) {
    //If pattern [K + 1]! = pattern [i], recursively search for the next largest prefix match
    //Until pattern[k+1]==pattern[i] or until the end
    while (k != -1 && pattern[k + 1] != pattern[i]) {
      k = next[k];
    }
    //If pattern[k+1] == pattern[i], then next[i] = k+1
    if (pattern[k + 1] == pattern[i]) {
      ++k;
    }
    next[i] = k;
  }
   return;
}

Dictionary tree

The above BF, RK, BM and KMP algorithms are single pattern string matching algorithms, which are applicable to finding one pattern string at a time in a main string; the next dictionary tree is a multi pattern matching algorithm, which can find multiple pattern strings in a main string at the same time. If we want to find out whether arm, hi, hill, pair, part, pen and pencil exist in a string of characters, we can use these words to build a dictionary tree:

In this way, we only need to start matching in the dictionary tree from the first character C of the string. When matching to the leaf node of the dictionary tree or encountering mismatched characters halfway, we start to match again in the dictionary tree from the next character of the character C (that is, the main string moves backward one bit). The time complexity is O(n x len), where n is the length of the main string and Len is the average length of the pattern string

The essence of the dictionary tree is to combine the common prefixes among strings and support multi pattern string matching queries. The disadvantage is that it consumes memory (each node has a pointer array of length 26)

For the optimization of dictionary tree memory, we can use the shrink point optimization, that is, for a node with only one child node, and this node is not the end node of the string, we can merge this node with the child node (although memory is saved, but operation difficulty is increased, such as inserting and deleting a word may involve node merging and splitting), as shown in the following figure:

AC automaton

From the dictionary tree matching, we can know that when matching to the leaf node of the dictionary tree or encountering mismatched characters in the middle, the main string starts to match and moves one bit later, and matches again in the dictionary tree. In fact, this method is not efficient. Can we use KMP like algorithm to move the main string back as many bits as possible each time?

AC automata, in fact, adds the next array similar to KMP on the basis of the dictionary tree, that is to say, running KMP on the dictionary tree. Its worst time complexity is O(n x len), which is close to O(n)

The construction of AC automata includes two operations:

  • Construct multiple pattern strings into a dictionary tree
  • Each node in the dictionary tree contains a failure pointer, whose function and construction process are similar to KMP's next array

If a suffix substring can match the prefix of a pattern string, we call it a matching suffix substring. As for the failure pointer, it is used to point to the last node corresponding to the pattern string prefix of the longest matching suffix substring. As shown in the figure, for abc, the longest matching suffix substring is bc:

Fast derivation failure pointer

Remember that a node's failure pointer can only appear on the upper layer of its layer, so we can deduce a deeper failure pointer by finding a deeper failure pointer (traversing the tree by layer)

(1) if the failure pointer of node p points to q, q has the same child node qc as pc character, then pc points to qc

(2) If the failure pointer of node p points to Q, Q does not have a child node with the same character as pc, let q = q - > fail continue to look up until it is found or root

Pseudo code

typedef struct AcNode {
  char data; 	//data
  AcNode* children[26];	//26 pointers to different letters
  bool isEndingChar = false; // End character is true
  int length = -1;	 	// Record mode string length when isEndingChar=true
  AcNode* fail; 			// Failure pointer
} AcNode;

//Failed pointer to initialize AC automaton (traversal by layer)
void buildFailurePointer() {
  Queue<AcNode*> queue;
  root->fail = null;
  queue.push(root);
  //bfs traverse AC automata by layer
  while (!queue.isEmpty()) {
    AcNode* p = queue.pop();
    //Failure pointer to maintain its child nodes
    for (int i = 0; i < 26; ++i) {
      AcNode* pc = p->children[i];
      if (pc == null) continue;
      //The failure pointer of root must be root (recursive termination condition)
      if (p == root) {
        pc->fail = root;
      } else {
        AcNode* q = p->fail;
        while (q != null) {
          AcNode* qc = q->children[pc->data - 'a'];
          if (qc != null) {
            pc->fail = qc;
            break;
          }
          q = q->fail;
        }
        //Special judgment continues up to root
        if (q == null) {
          pc.fail = root;
        }
      }
      queue.push(pc);
    }
  }
}

//AC automaton matching process
bool match(char* text) { // text is the main string
  int n = strlen(text);
  AcNode* p = root;
  //Traversing the main string
  for (int i = 0; i < n; ++i) {
    int idx = text[i] - 'a';
    //Loop up when matching cannot continue
    while (p->children[idx] == null && p != root) {
      p = p->fail; // Where the failure pointer works
    }
    p = p->children[idx];
    if (p == null) p = root; // If there is no match, rematch from root
    //If there is a match, look up one by one to see if its node is the end point
    AcNode* tmp = p;
    while (tmp != root) {
      if (tmp->isEndingChar == true) {
        return true;
      }
      tmp = tmp->fail;
    }
  }
  return false;
}
283 original articles published, 52 praised, 60000 visitors+
Private letter follow

Posted on Sat, 07 Mar 2020 23:21:31 -0800 by willwill100