I'm OCRing some text from two different sources. Each can make mistakes in different places, failing to recognize a letter or a group of letters. Anything that isn't recognized is replaced with a ?. For example, if the word is Roflcopter, one source might return Ro?copter and the other Roflcop?er. I'd like a function that returns whether two results might be equivalent, allowing for multiple ?s. Example:
match("Ro?copter", "Roflcop?er") --> True
match("Ro?copter", "Roflcopter") --> True
match("Roflcopter", "Roflcop?er") --> True
match("Ro?co?er", "Roflcop?er") --> True
So far I can match one OCR with a perfect one by using regular expressions:
>>> import re
>>> def match(tn1, tn2):
...     tn1re = tn1.replace("?", ".{0,4}")
...     tn2re = tn2.replace("?", ".{0,4}")
...     return bool(re.match(tn1re, tn2) or re.match(tn2re, tn1))
>>> match("Roflcopter", "Roflcop?er")
True
>>> match("R??lcopter", "Roflcopter")
True
But this doesn't work when they both have ?s in different places:
>>> match("R??lcopter", "Roflcop?er")
False
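A short sketch of why that last call fails, using the same `?` to `.{0,4}` substitution as `match` above: only the string that is turned into a pattern becomes flexible, while the other string is taken literally, so its `?` has to be matched as an ordinary character. The variable names `pattern1` and `pattern2` are just for illustration.

```python
import re

# Try each string as the pattern, with the other as the literal subject.
pattern1 = "R??lcopter".replace("?", ".{0,4}")
pattern2 = "Roflcop?er".replace("?", ".{0,4}")

print(pattern1)  # R.{0,4}.{0,4}lcopter
print(pattern2)  # Roflcop.{0,4}er

# The literal "t" in pattern1 lines up with the literal "?" in the subject.
print(bool(re.match(pattern1, "Roflcop?er")))  # False

# The literal "o" in pattern2 lines up with the literal "?" in the subject.
print(bool(re.match(pattern2, "R??lcopter")))  # False

# With a clean subject, the same pattern matches fine.
print(bool(re.match(pattern1, "Roflcopter")))  # True
```

Whichever string is used as the pattern, the `?` left in the other string blocks the match, which is why both sides need to be treated as wildcards at once.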
Thanks to Hamish Grubijan for this idea. Every ? in my OCR'd names can stand for anywhere from 0 to 3 letters. What I do is expand each string into a list of its possible expansions, with @ standing for exactly one unknown letter:
>>> list(expQuestions("?flcopt?"))
['flcopt', 'flcopt@', 'flcopt@@', 'flcopt@@@', '@flcopt', '@flcopt@', '@flcopt@@', '@flcopt@@@', '@@flcopt', '@@flcopt@', '@@flcopt@@', '@@flcopt@@@', '@@@flcopt', '@@@flcopt@', '@@@flcopt@@', '@@@flcopt@@@']
then I expand both and use his matching function, which I called matchats:
def matchOCR(l, r):
    for expl in expQuestions(l):
        for expr in expQuestions(r):
            if matchats(expl, expr):
                return True
    return False
Works as desired:
>>> matchOCR("Ro?co?er", "?flcopt?")
True
>>> matchOCR("Ro?co?er", "?flcopt?z")
False
>>> matchOCR("Ro?co?er", "?flc?pt?")
True
>>> matchOCR("Ro?co?e?", "?flc?pt?")
True
The matching function:
def matchats(l, r):
    """Match two strings with @ representing exactly 1 char"""
    if len(l) != len(r): return False
    for i, c1 in enumerate(l):
        c2 = r[i]
        if c1 == "@" or c2 == "@": continue
        if c1 != c2: return False
    return True
and the expanding function, where cartesian_product does just that:
def expQuestions(s):
    """For OCR w/ a questionmark in them, expand questions with
    @s for all possibilities"""
    numqs = s.count("?")
    blah = list(s)
    for expqs in cartesian_product([(0,1,2,3)]*numqs):
        newblah = blah[:]
        qi = 0
        for i,c in enumerate(newblah):
            if newblah[i] == '?':
                newblah[i] = '@'*expqs[qi]
                qi += 1
        yield "".join(newblah)
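The post doesn't show `cartesian_product`. One plausible stand-in, assuming it takes a list of choice tuples and yields one tuple per combination, is a thin wrapper around `itertools.product`. The sketch below wires that assumption together with a compact variant of `expQuestions` (rewritten with `split` for brevity, not the code above verbatim) and the matching functions, so the whole thing can be run end to end.

```python
from itertools import product

def cartesian_product(choices):
    """Assumed stand-in: one tuple per combination, one item per choice."""
    return product(*choices)

def expQuestions(s):
    """Expand each "?" into 0 to 3 "@" placeholders, in every combination."""
    parts = s.split("?")  # literal text between the question marks
    for counts in cartesian_product([(0, 1, 2, 3)] * (len(parts) - 1)):
        out = [parts[0]]
        for n, part in zip(counts, parts[1:]):
            out.append("@" * n + part)
        yield "".join(out)

def matchats(l, r):
    """Match two strings with "@" representing exactly one character."""
    if len(l) != len(r):
        return False
    return all(a == b or "@" in (a, b) for a, b in zip(l, r))

def matchOCR(l, r):
    """True if some expansion of l matches some expansion of r."""
    return any(matchats(el, er)
               for el in expQuestions(l)
               for er in expQuestions(r))

print(list(expQuestions("?flcopt?"))[:4])
# ['flcopt', 'flcopt@', 'flcopt@@', 'flcopt@@@']
print(matchOCR("Ro?co?er", "?flcopt?"))   # True
print(matchOCR("Ro?co?er", "?flcopt?z"))  # False
```

Note that a string with n question marks expands to 4**n candidates, so the nested loop in `matchOCR` compares up to 4**(n+m) pairs. That is fine for short names with a few ?s, but it grows quickly.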