Introduction to Pattern Matching

Given a text, find all occurrences of a given pattern in it.

 

 
For example,

For text T = “abcabaabcabac” and pattern P = “abaa”, the pattern occurs only once in the text, starting from index 3 (“abcabaabcabac”)

For text T = “abcabaabcabac” and pattern P = “cab”, the pattern occurs twice in the text, starting from index 2 and 8. (“abcabaabcabac” and “abcabaabcabac”)


Pattern matching is very common real life problem that arises frequently in text-editing programs such as MS word, notepad, notepad++, etc. String pattern matching algorithms are also used to search for particular patterns in DNA sequences.

The goal is to find all occurrences of the pattern P[1..m] of length m in given text T[1..n]of length n. We will discuss the naive string-matching algorithm in this post.


The naive algorithm finds all occurrences of pattern by using a loop that checks the condition P[1..m] = T[s+1..s+m] for each of the n-m+1 possible values of s. Below is the psedo-code for the algorithm.

Naive-Pattern-Matching(T, P)

n = length[T]
m = length[P]

for s = 0 to n – m
    if P[1 .. m] = T[s+1 .. s+m]
        print “Pattern found at position s”

 
C++ implementation –

Download   Run Code

Output:

Pattern occurs with shift 2
Pattern occurs with shift 8

 

The time complexity of above solution is O(nm). When the length of the pattern and text are roughly equal, the algorithm runs in the quadratic time.

One worst case is that text T has n number of A’s and the pattern P has (m-1) number of A’s followed by a single B. It will result in maximum comparisons. i.e.
T = AAAAAAAAAA, P = AAAB

The best case occurs when first character of the pattern P is not present in text. It will result in minimum comparisons. i.e.
T = ABCDEFGHIJKLMNO, P = ZOOM
 

There are many efficient string searching algorithms that performs way better than naive algorithm discussed above. Below are some of the famous algorithms –

  • Knuth–Morris–Pratt algorithm
    The KMP pattern matching algorithm has worst case time complexity of O(n + m).
     
  • Rabin-Karp algorithm 
    The worst-case running time of the Rabin-Karp algorithm is O((n – m + 1)m), but it has a good average-case running time. A practical application of this algorithm is detecting plagiarism. Given source material, the algorithm can rapidly search through a paper for instances of sentences from the source material.
     
  • The Boyer-Moore algorithm 
    If the pattern P is relatively long and the text is reasonably large, then this algorithm is likely to be the most efficient string-matching algorithm. It has worst-case running time of O(n + m) only if the pattern does not appear in the text else it runs in O(nm) time in the worst case.

References:
http://staff.ustc.edu.cn/~csli/graduate/algorithms/book6/chap34.htm

 
Thanks for reading.




Please use ideone or C++ Shell or any other online compiler link to post code in comments.
Like us? Please spread the word and help us grow. Happy coding 🙂