• Home
  • History
  • Annotate
Name Date Size #Lines LOC

..03-May-2022-

cmclient/H08-Mar-2019-

cmserver/H08-Mar-2019-

levenshtein/H08-Mar-2019-

test/H03-May-2022-

.travis.ymlH A D08-Mar-201925

LICENSEH A D08-Mar-20191 KiB

MakefileH A D08-Mar-201942

README.mdH A D08-Mar-20194.3 KiB

closestmatch.goH A D08-Mar-20198.7 KiB

closestmatch_test.goH A D08-Mar-20195.7 KiB

README.md

1
2# closestmatch :page_with_curl:
3
4<a href="#"><img src="https://img.shields.io/badge/version-2.1.0-brightgreen.svg?style=flat-square" alt="Version"></a>
5<a href="https://travis-ci.org/schollz/closestmatch"><img src="https://img.shields.io/travis/schollz/closestmatch.svg?style=flat-square" alt="Build Status"></a>
6<a href="http://gocover.io/github.com/schollz/closestmatch"><img src="https://img.shields.io/badge/coverage-98%25-brightgreen.svg?style=flat-square" alt="Code Coverage"></a>
7<a href="https://godoc.org/github.com/schollz/closestmatch"><img src="https://img.shields.io/badge/api-reference-blue.svg?style=flat-square" alt="GoDoc"></a>
8
9*closestmatch* is a simple and fast Go library for fuzzy matching an input string to a list of target strings. *closestmatch* is useful for handling input from a user where the input (which could be mispelled or out of order) needs to match a key in a database. *closestmatch* uses a [bag-of-words approach](https://en.wikipedia.org/wiki/Bag-of-words_model) to precompute character n-grams to represent each possible target string. The closest matches have highest overlap between the sets of n-grams. The precomputation scales well and is much faster and more accurate than Levenshtein for long strings.
10
11
12Getting Started
13===============
14
15## Install
16
17```
18go get -u -v github.com/schollz/closestmatch
19```
20
21## Use
22
23####  Create a *closestmatch* object from a list words
24
25```golang
26// Take a slice of keys, say band names that are similar
27// http://www.tonedeaf.com.au/412720/38-bands-annoyingly-similar-names.htm
28wordsToTest := []string{"King Gizzard", "The Lizard Wizard", "Lizzard Wizzard"}
29
30// Choose a set of bag sizes, more is more accurate but slower
31bagSizes := []int{2}
32
33// Create a closestmatch object
34cm := closestmatch.New(wordsToTest, bagSizes)
35```
36
37#### Find the closest match, or find the *N* closest matches
38
39```golang
40fmt.Println(cm.Closest("kind gizard"))
41// returns 'King Gizzard'
42
43fmt.Println(cm.ClosestN("kind gizard",3))
44// returns [King Gizzard Lizzard Wizzard The Lizard Wizard]
45```
46
47#### Calculate the accuracy
48
49```golang
50// Calculate accuracy
51fmt.Println(cm.AccuracyMutatingWords())
52// ~ 66 % (still way better than Levenshtein which hits 0% with this particular set)
53
54// Improve accuracy by adding more bags
55bagSizes = []int{2, 3, 4}
56cm = closestmatch.New(wordsToTest, bagSizes)
57fmt.Println(cm.AccuracyMutatingWords())
58// accuracy improves to ~ 76 %
59```
60
61#### Save/Load
62
63```golang
64// Save your current calculated bags
65cm.Save("closestmatches.gob")
66
67// Open it again
68cm2, _ := closestmatch.Load("closestmatches.gob")
69fmt.Println(cm2.Closest("lizard wizard"))
70// prints "The Lizard Wizard"
71```
72
73### Advantages
74
75*closestmatch* is more accurate than Levenshtein for long strings (like in the test corpus).
76
77*closestmatch* is ~20x faster than [a fast implementation of Levenshtein](https://groups.google.com/forum/#!topic/golang-nuts/YyH1f_qCZVc). Try it yourself with the benchmarks:
78
79```bash
80cd $GOPATH/src/github.com/schollz/closestmatch && go test -run=None -bench=. > closestmatch.bench
81cd $GOPATH/src/github.com/schollz/closestmatch/levenshtein && go test -run=None -bench=. > levenshtein.bench
82benchcmp levenshtein.bench ../closestmatch.bench
83```
84
85which gives the following benchmark (on Intel i7-3770 CPU @ 3.40GHz w/ 8 processors):
86
87```bash
88benchmark                 old ns/op     new ns/op     delta
89BenchmarkNew-8            1.47          1933870       +131555682.31%
90BenchmarkClosestOne-8     104603530     4855916       -95.36%
91```
92
93The `New()` function in *closestmatch* is so slower than *levenshtein* because there is precomputation needed.
94
95### Disadvantages
96
97*closestmatch* does worse for matching lists of single words, like a dictionary. For comparison:
98
99
100```
101$ cd $GOPATH/src/github.com/schollz/closestmatch && go test
102Accuracy with mutating words in book list:      90.0%
103Accuracy with mutating letters in book list:    100.0%
104Accuracy with mutating letters in dictionary:   38.9%
105```
106
107while levenshtein performs slightly better for a single-word dictionary (but worse for longer names, like book titles):
108
109```
110$ cd $GOPATH/src/github.com/schollz/closestmatch/levenshtein && go test
111Accuracy with mutating words in book list:      40.0%
112Accuracy with mutating letters in book list:    100.0%
113Accuracy with mutating letters in dictionary:   64.8%
114```
115
116## License
117
118MIT
119