python - How to modify Peter Norvig's spell checker to get more suggestions per word
I tried Peter Norvig's spell checker code from http://norvig.com/spell-correct.html. How can I modify it to return several suggestions instead of a single corrected spelling?
    import re
    from collections import Counter

    def words(text): return re.findall(r'\w+', text.lower())

    # Word frequencies taken from the training corpus.
    WORDS = Counter(words(open('big.txt').read()))

    def P(word, N=sum(WORDS.values())):
        "Probability of `word`."
        return WORDS[word] / N

    def correction(word):
        "Most probable spelling correction for word."
        return max(candidates(word), key=P)

    def candidates(word):
        "Generate possible spelling corrections for word."
        return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

    def known(words):
        "The subset of `words` that appear in the dictionary of WORDS."
        return set(w for w in words if w in WORDS)

    def edits1(word):
        "All edits that are one edit away from `word`."
        letters    = 'abcdefghijklmnopqrstuvwxyz'
        splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
        deletes    = [L + R[1:]               for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
        inserts    = [L + c + R               for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def edits2(word):
        "All edits that are two edits away from `word`."
        return (e2 for e1 in edits1(word) for e2 in edits1(e1))
You can use the candidates function.
It gives you:
- the original word, if it is already correct
- otherwise, the known words at an edit distance of 1 from the original word
- if there is no candidate at distance 1, the known words at distance 2
- if nothing was found in the previous cases, the original word
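The fallback order comes from Python's `or`, which returns its first truthy operand, and an empty set counts as false. A minimal sketch of that behaviour, using hand-written sets in place of real lookups (the words below are made up for illustration, not taken from big.txt):

```python
# Hypothetical lookup results for the misspelling "speling".
exact = set()                         # case 1: "speling" is not a known word
distance1 = {"spelling"}              # case 2: known words one edit away
distance2 = {"sapling", "spilling"}   # case 3: known words two edits away
fallback = ["speling"]                # case 4: the input itself

# `or` stops at the first non-empty operand, so later cases are never used.
result = exact or distance1 or distance2 or fallback
print(result)  # {'spelling'}
```

Because the chain stops at the first non-empty set, the distance 2 words are dropped as soon as anything is found at distance 1.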
If candidates are found in case 2 or 3, the returned set may contain more than one suggestion.
If the original word is returned, however, you don't know whether that is because it is correct (case 1) or because there are no close candidates (case 4).
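One way to get several suggestions is to rank everything candidates() returns by P instead of taking only the maximum. The sketch below is self-contained for illustration: the tiny WORDS corpus stands in for big.txt, and suggestions() is a hypothetical helper name, not part of Norvig's code. It also reports whether the input is itself a known word, which is one possible way to tell case 1 from case 4.

```python
from collections import Counter

# Tiny stand-in corpus (an assumption); the original builds WORDS from big.txt.
WORDS = Counter("the then them they there than the the then they".split())

def P(word, N=sum(WORDS.values())):
    "Probability of `word`."
    return WORDS[word] / N

def known(words):
    "The subset of `words` that appear in WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All strings one edit away from `word`."
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    "All strings two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

def candidates(word):
    "Known words at the smallest edit distance, or the word itself."
    return known([word]) or known(edits1(word)) or known(edits2(word)) or [word]

def suggestions(word, n=5):
    "Hypothetical helper: (is_known, up to n candidates, most probable first)."
    # Ties in probability are broken alphabetically to keep the order stable.
    ranked = sorted(candidates(word), key=lambda w: (-P(w), w))[:n]
    return word in WORDS, ranked

print(suggestions("thex", n=3))  # (False, ['the', 'then', 'they'])
print(suggestions("the"))        # (True, ['the'])
print(suggestions("zzzzzz"))     # (False, ['zzzzzz'])
```

The first element of the result is True only in case 1, so an unknown word that comes back unchanged can be recognised as case 4.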
However, this approach (the way edits1() is implemented) is brute force: it is inefficient for long words, and it gets worse if you add more characters (e.g. to support other languages). Consider something like SimString for efficiently retrieving words with similar spelling in a large collection.
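To put rough numbers on that cost, the sketch below counts how many strings edits1() generates before de-duplication for a word of length n and an alphabet of k letters. The function name edits1_count is hypothetical; the counts follow directly from the four list comprehensions in edits1().

```python
def edits1_count(n, k=26):
    "Strings generated by edits1() for word length n and alphabet size k."
    deletes = n                   # one per character
    transposes = max(n - 1, 0)    # one per adjacent pair
    replaces = n * k              # every character replaced by every letter
    inserts = (n + 1) * k         # every letter at every position
    return deletes + transposes + replaces + inserts

print(edits1_count(7))        # 403
print(edits1_count(14))       # 781
print(edits1_count(7, k=52))  # 793
```

edits2() applies edits1() to each of those strings again, so the work at distance 2 grows roughly with the square of these numbers.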