Introduction to Pattern Matching

Given a text, find all occurrences of a given pattern in it.

 
For example,

1. For text T = "ABCABAABCABAC" and pattern P = "ABAA",

The pattern occurs only once in the text, starting from index 3 ("ABCABAABCABAC")

 
2. For text T = "ABCABAABCABAC" and pattern P = "CAB",

The pattern occurs twice in the text, starting from index 2 and 8. ("ABCABAABCABAC" and "ABCABAABCABAC")

 
The pattern matching is very common real life problem that arises frequently in text-editing programs such as MS word, notepad, notepad++, etc. String pattern matching algorithms are also used to search for particular patterns in DNA sequences.

The goal is to find all occurrences of the pattern P[1..m] of length m in given text T[1..n] of length n. In this post, we will discuss the naive string-matching algorithm.

 
The naive algorithm finds all occurrences of pattern by using a loop that checks the condition
P[1..m] = T[s+1..s+m] for each of the n-m+1 possible values of s. Below is the psedo-code for the algorithm.

 
Naive-Pattern-Matching(T, P)

n = length[T]
m = length[P]

for s = 0 to n – m do
    if P[1 .. m] = T[s+1 .. s+m] do
        print "Pattern found at position s"
    end-if
end-for

 

C

Download   Run Code

Output:

Pattern occurs with shift 2
Pattern occurs with shift 8

C++

Download   Run Code

Output:

Pattern occurs with shift 2
Pattern occurs with shift 8

Java

Download   Run Code

Output:

Pattern occurs with shift 2
Pattern occurs with shift 8

 

Performance:

 
The time complexity of above solution is O(nm) where n is the length of the text and m is the length of the pattern. When the length of the pattern and text are roughly same, the algorithm runs in the quadratic time.

One worst case is that text T has n number of A’s and the pattern P has (m-1) number of A’s followed by a single B. It will result in maximum comparisons. i.e.

T = AAAAAAAAAA, P = AAAB

The best case occurs when first character of the pattern P is not present in text. It will result in minimum comparisons. i.e.

T = ABCDEFGHIJKLMNO, P = ZOOM
 

There are many efficient string searching algorithms that performs way better than naive algorithm discussed above. Below are some of the famous algorithms –

  • Knuth–Morris–Pratt algorithm
    The KMP pattern matching algorithm has worst case time complexity of O(n + m).
     
  • Rabin-Karp algorithm
    The worst-case running time of the Rabin-Karp algorithm is O((n – m + 1) * m), but it has a good average-case running time. A practical application of this algorithm is detecting plagiarism. Given source material, the algorithm can rapidly search through a paper for instances of sentences from the source material.
     
  • The Boyer-Moore algorithm
    If the pattern P is relatively long and the text is reasonably large, then this algorithm is likely to be the most efficient string-matching algorithm. It has worst-case running time of O(n + m) only if the pattern does not appear in the text else it runs in O(nm) time in the worst case.

 
References: http://staff.ustc.edu.cn/~csli/graduate/algorithms/book6/chap34.htm

 
Thanks for reading.




Please use ideone or C++ Shell or any other online compiler link to post code in comments.
Like us? Please spread the word and help us grow. Happy coding 🙂
 





Leave a Reply

Notify of
avatar
wpDiscuz