python - How to modify Peter Norvig's spell checker to get more suggestions per word


I tried Peter Norvig's spell checker code from http://norvig.com/spell-correct.html. How can I modify it to return several suggestions instead of a single corrected spelling?

    import re
    from collections import Counter

    def words(text):
        return re.findall(r'\w+', text.lower())

    # Word frequencies from the training corpus.
    WORDS = Counter(words(open('big.txt').read()))

    def P(word, N=sum(WORDS.values())):
        "Probability of `word`."
        return WORDS[word] / N

    def correction(word):
        "Most probable spelling correction for word."
        return max(candidates(word), key=P)

    def candidates(word):
        "Generate possible spelling corrections for word."
        return (known([word]) or known(edits1(word)) or known(edits2(word)) or
                [word])

    def known(words):
        "The subset of `words` that appear in the dictionary of WORDS."
        return set(w for w in words if w in WORDS)

    def edits1(word):
        "All edits that are one edit away from `word`."
        letters    = 'abcdefghijklmnopqrstuvwxyz'
        splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
        deletes    = [L + R[1:]               for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
        inserts    = [L + c + R               for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def edits2(word):
        "All edits that are two edits away from `word`."
        return (e2 for e1 in edits1(word) for e2 in edits1(e1))

You can use the candidates function.

It gives you:

  • the original word, if it is already correct
  • otherwise, the known words at an edit distance of 1 from the original word
  • if there is no candidate at distance 1, the candidates at distance 2
  • if nothing was found in the previous cases, the original word

If candidates are found in case 2 or 3, the returned set may contain more than one suggestion.

If the original word is returned, however, you don't know whether that is because it is correct (case 1) or because there are no close candidates (case 4).
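As a minimal sketch of that idea, the candidates can be ranked by probability and cut to the top few. The helper name `suggestions`, its parameter `n`, and the tiny `SAMPLE_TEXT` corpus are assumptions made for illustration; they are not part of Norvig's code, which builds WORDS from big.txt. Returning an empty list when nothing is found is one possible way to tell case 4 apart from case 1.

```python
import re
from collections import Counter

# Tiny stand-in corpus so the sketch runs on its own (an assumption);
# the original code builds WORDS from open('big.txt').read().
SAMPLE_TEXT = "the cat sat on the mat and the bat ate the hat that the cat had"
WORDS = Counter(re.findall(r'\w+', SAMPLE_TEXT.lower()))
N = sum(WORDS.values())

def P(word):
    "Probability of `word` in the sample corpus."
    return WORDS[word] / N

def known(words):
    "The subset of `words` that appear in WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All strings one edit away from `word` (same logic as the original)."
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    "All strings two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

def suggestions(word, n=5):
    """Hypothetical helper (not in the original code): up to `n` known
    candidates, most probable first; [] if nothing is close enough."""
    found = known([word]) or known(edits1(word)) or known(edits2(word))
    # Sort by descending probability, then alphabetically for stable output.
    return sorted(found, key=lambda w: (-P(w), w))[:n]

print(suggestions("xat", n=3))
```

With this sample corpus the call returns ['cat', 'bat', 'hat']. A word that is already known comes back as a one-element list, so checking `word in WORDS` first is enough to separate "correct" from "corrected".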

However, this approach (the way edits1() is implemented) is brute force: it is inefficient for long words, and it gets worse if you add more characters (e.g. to support other languages). Consider SimString for efficiently retrieving words with similar spelling from a large collection.
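To make the cost concrete, the sketch below counts how many raw candidates the edits1() logic generates for a given word length and alphabet size. The names `edits1_list`, `expected_count`, and the extended alphabet `wider` are hypothetical and only serve the illustration.

```python
def edits1_list(word, letters):
    "Same edits as edits1(), but kept as a list so duplicates are counted."
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return deletes + transposes + replaces + inserts

def expected_count(n, k):
    """Raw candidates for a word of length n over k letters:
    n deletes, n-1 transposes, k*n replaces, k*(n+1) inserts."""
    return n + max(n - 1, 0) + k * n + k * (n + 1)

latin = 'abcdefghijklmnopqrstuvwxyz'
wider = latin + 'äöüßéèàçñ'  # hypothetical extended alphabet (35 letters)

for word in ("cat", "spelling", "internationalization"):
    print(word, len(edits1_list(word, latin)), len(edits1_list(word, wider)))
```

For the 26-letter alphabet the count is 54n + 25 for a word of length n, so it grows linearly with word length and with alphabet size. edits2() applies edits1() again to every one of those strings, so its work grows roughly with the square of that number.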

