See more on github

In this article, I will show you some kinds of popular string matching algorithm and dynamic programming algorithm for wildcard matching.

String Matching algorithm

Rabin-Karp

We can view a string of k characters (digits) as a length-k decimal number. E.g., the string “31425” corresponds to the decimal number 31,425.

Time approximation: finding prefix function take O(m), matching takes O(m+n)

python implementation

#coding: utf-8
''' mbinary
#########################################################################
# File : KMP.py
# Author: mbinary
# Mail: zhuheqin1@gmail.com
# Blog: https://mbinary.github.io
# Github: https://github.com/mbinary
# Created Time: 2018-12-11  14:02
# Description:
#########################################################################
'''

def getPrefixFunc(s):
    '''return the list of prefix function of s'''
    length = 0
    i = 1
    n = len(s)
    ret = [0]
    while i<n:
        if s[i]==s[length]:
            length +=1
            ret.append(length)
            i+=1
        else:
            if length==0:
                ret.append(0)
                i+=1
            else:
                length = ret[length-1]
    return ret

def findAll(s,p):
    pre = getPrefixFunc(p)
    i = j  =0
    n,m = len(s),len(p)
    ret = []
    while i<n:
        if s[i]==p[j]:
            i+=1
            j+=1
            if j==m:
                ret.append(i-j)
                j=pre[j-1]
        else:
            if j==0: i+=1
            else: j = pre[j-1]
    return ret
def randStr(n=3):
    return [randint(ord('a'),ord('z')) for i in range(n)]

if __name__ =='__main__':
    from random import randint
    s = randStr(50)
    p = randStr(1)
    print(s)
    print(p)
    print(findAll(s,p))

Boyer-Moore

Sunday

features

The bad-character shift of the present algorithm is slightly modified to take into account the last character of x as follows: for c in Sigma, qsBc[c]=min{i : 0 < i leq m and x[m-i]=c} if c occurs in x, m+1 otherwise (thanks to Darko Brljak).

The preprocessing phase is in O(m+sigma) time and O(sigma) space complexity.

During the searching phase the comparisons between pattern and text characters during each attempt can be done in any order. The searching phase has a quadratic worst case time complexity but it has a good practical behaviour.

For instance, image.png

In this example, t0, …, t4 = a b c a b is the current text window that is compared with the pattern. Its suffix a b has matched, but the comparison c-a causes a mismatch. The bad-character heuristics of the Boyer-Moore algorithm (a) uses the “bad” text character c to determine the shift distance. The Horspool algorithm (b) uses the rightmost character b of the current text window. The Sunday algorithm (c) uses the character directly right of the text window, namely d in this example. Since d does not occur in the pattern at all, the pattern can be shifted past this position.

python implementation

''' mbinary
#########################################################################
# File : sunday.py
# Author: mbinary
# Mail: zhuheqin1@gmail.com
# Blog: https://mbinary.github.io
# Github: https://github.com/mbinary
# Created Time: 2018-07-11  15:26
# Description: 字符串模式匹配, sunday 算法, kmp 的改进
#               pattern matching for strings using sunday algorithm
#########################################################################
'''



def getPos(pattern):
    dic = {}
    for i,j in enumerate(pattern[::-1]):
        if j not in dic:
            dic[j]= i
    return dic
def find(s,p):
    dic = getPos(p)
    ps = pp = 0
    ns = len(s)
    np = len(p)
    while ps<ns and pp<np:
        if s[ps] == p[pp]:
            ps,pp = ps+1,pp+1
        else:
            idx = ps+ np-pp
            if idx >=ns:return -1
            ch = s[idx]
            if ch in dic:
                ps += dic[ch]+1-pp
            else:
                ps = idx+1
            pp = 0
    if pp==np:return ps-np
    else:
        return -1
def findAll(s,p):
    ns = len(s)
    np = len(p)
    i = 0
    ret = []
    while s:
        print(s,p)
        tmp = find(s,p)
        if tmp==-1: break
        ret.append(i+tmp)
        end = tmp+np
        i +=end
        s = s[end:]
    return ret



def randStr(n=3):
    return [randint(ord('a'),ord('z')) for i in range(n)]

def test(n):
    s = randStr(n)
    p = randStr(3)
    str_s = ''.join((chr(i) for i in s))
    str_p = ''.join((chr(i) for i in p))
    n1 = find(s,p)
    n2 = str_s.find(str_p) # 利用已有的 str find 算法检验
    if n1!=n2:
        print(n1,n2,str_p,str_s)
        return False
    return True
if __name__ =='__main__':
    from random import randint
    n = 1000
    suc = sum(test(n) for i in range(n))
    print('test {n} times, success {suc} times'.format(n=n,suc=suc))

WildCard matching

divider wild card:

Given a string s which doesn’t include wild card, and a pattern p which includes wild card,

Judge if they are matching.

Idea

Using dynamic programming.

n = length(s), m = length(p)

dp[m+1][n+1]: bool

i:n, j:m dp[j][i] represents if s[:i+1] matches p[:j+1]

initialzation : dp[0][0] = True, dp[0][i],dp[j][0] = False, only if p startswith ‘*’, dp[1][0] = True.

if p[j] = ‘*’: dp[j][i] = dp[j-1][i] or dp[j][i-1] elif p[j] = ‘?’: dp[j][i] = dp[j-1][i-1] else : dp[j][i] = dp[j-1][i-1] and s[i] == p[j]

Code

def isMatch(self, s, p):
    """
    :type s: str
    :type p: str   pattern str including wildcard
    :rtype: bool
    """
    n,m = len(s),len(p)
    last =  [False]*(n+1)
    last[0] = True
    for j in range(m):
        if p[j]=='*':
            for i in range(n):
                last[i+1] = last[i+1] or last[i]
        elif p[j]=='?':
            last.pop()
            last.insert(0,False)
        else:
            li = [False]
            for i in range(n):
                li.append( last[i] and p[j]==s[i])
            last = li
    return last[-1]

Reference:

  1. Xuyun, ppt, String matching
  2. Sunday-algorithm
  3. GeeksforGeeks, KMP Algorithm
  4. Augustliang, csdn, dynamic programming for wild card matching