• Home
  • History
  • Annotate
Name Date Size #Lines LOC

..03-May-2022-

.github/workflows/H03-May-2022-1816

bench/H03-May-2022-266217

docs/literature/H21-Aug-2021-7362

extras/H21-Aug-2021-5,8323,621

fuzzing/H21-Aug-2021-4333

rapidfuzz/H21-Aug-2021-12,2689,965

test/H03-May-2022-252202

tools/H21-Aug-2021-11870

.clang-formatH A D21-Aug-2021717 2625

.gitignoreH A D21-Aug-202144 77

DoxyfileH A D21-Aug-20213.4 KiB10667

LICENSEH A D21-Aug-20211.1 KiB2218

README.mdH A D21-Aug-20218.4 KiB243192

README.md

1  <h1 align="center">
2<img src="https://raw.githubusercontent.com/maxbachmann/rapidfuzz/master/docs/img/RapidFuzz.svg?sanitize=true" alt="RapidFuzz" width="400">
3</h1>
4<h4 align="center">Rapid fuzzy string matching in C++ using the Levenshtein Distance</h4>
5
6<p align="center">
7  <a href="https://github.com/maxbachmann/rapidfuzz/actions">
8    <img src="https://github.com/maxbachmann/rapidfuzz/workflows/Build/badge.svg"
9         alt="Continous Integration">
10  </a>
11  <a href="https://maxbachmann.github.io/rapidfuzz">
12    <img src="https://img.shields.io/badge/-documentation-blue"
13         alt="Documentation">
14  </a>
15  <a href="https://github.com/maxbachmann/rapidfuzz/blob/dev/LICENSE">
16    <img src="https://img.shields.io/github/license/maxbachmann/rapidfuzz"
17         alt="GitHub license">
18  </a>
19</p>
20
21<p align="center">
22  <a href="#description">Description</a> •
23  <a href="#installation">Installation</a> •
24  <a href="#usage">Usage</a> •
25  <a href="#license">License</a>
26</p>
27
28---
29## Description
30RapidFuzz is a fast string matching library for Python and C++, which is using the string similarity calculations from [FuzzyWuzzy](https://github.com/seatgeek/fuzzywuzzy). However there are two aspects that set RapidFuzz apart from FuzzyWuzzy:
311) It is MIT licensed so it can be used whichever License you might want to choose for your project, while you're forced to adopt the GPL license when using FuzzyWuzzy
322) It is mostly written in C++ and on top of this comes with a lot of Algorithmic improvements to make string matching even faster, while still providing the same results. More details on these performance improvements in form of benchmarks can be found [here](https://github.com/maxbachmann/rapidfuzz/blob/master/Benchmarks.md)
33
34The Library is splitted across multiple repositories for the different supported programming languages:
35- The C++ version is versioned in this repository
36- The Python version can be found at [maxbachmann/rapidfuzz](https://github.com/maxbachmann/rapidfuzz)
37
38
39## CMake Integration
40
41There are severals ways to integrate `rapidfuzz` in your CMake project.
42
43### By Installing it
44```bash
45git clone https://github.com/maxbachmann/rapidfuzz-cpp.git rapidfuzz-cpp
46cd rapidfuzz-cpp
47mkdir build && cd build
48cmake .. -DCMAKE_BUILD_TYPE=Release
49cmake --build .
50cmake --build . --target install
51```
52
53Then in your CMakeLists.txt:
54```cmake
55find_package(rapidfuzz REQUIRED)
56add_executable(foo main.cpp)
57target_link_libraries(foo rapidfuzz::rapidfuzz)
58```
59
60### Add this repository as a submodule
61```bash
62git submodule add https://github.com/maxbachmann/rapidfuzz-cpp.git 3rdparty/RapidFuzz
63```
64Then you can either:
65
661. include it as a subdirectory
67    ```cmake
68    add_subdirectory(3rdparty/RapidFuzz)
69    add_executable(foo main.cpp)
70    target_link_libraries(foo rapidfuzz::rapidfuzz)
71    ```
722. build it at configure time with `FetchContent`
73    ```cmake
74    FetchContent_Declare(
75      rapidfuzz
76      SOURCE_DIR ${CMAKE_SOURCE_DIR}/3rdparty/RapidFuzz
77      PREFIX ${CMAKE_CURRENT_BINARY_DIR}/rapidfuzz
78      CMAKE_ARGS -DCMAKE_INSTALL_PREFIX:PATH=<INSTALL_DIR> "${CMAKE_OPT_ARGS}"
79    )
80    FetchContent_MakeAvailable(rapidfuzz)
81    add_executable(foo main.cpp)
82    target_link_libraries(foo PRIVATE rapidfuzz::rapidfuzz)
83    ```
84### Download it at configure time
85
86If you don't want to add `rapidfuzz-cpp` as a submodule, you can also download it with `FetchContent`:
87```cmake
88FetchContent_Declare(rapidfuzz
89  GIT_REPOSITORY https://github.com/maxbachmann/rapidfuzz-cpp.git
90  GIT_TAG master)
91FetchContent_MakeAvailable(rapidfuzz)
92add_executable(foo main.cpp)
93target_link_libraries(foo PRIVATE rapidfuzz::rapidfuzz)
94```
95It will be downloaded each time you run CMake in a blank folder.
96
97## CMake option
98
99There are CMake options available:
100
1011. `BUILD_TESTS` : to build test (default OFF and requires [Catch2](https://github.com/catchorg/Catch2))
102
1032. `BUILD_BENCHMARKS` : to build benchmarks (default OFF and requires [Google Benchmark](https://github.com/google/benchmark))
104
105## Usage
106```cpp
107#include "rapidfuzz/fuzz.hpp"
108#include "rapidfuzz/utils.hpp"
109```
110
111### Simple Ratio
112```cpp
113using rapidfuzz::fuzz::ratio;
114
115// score is 96.55171966552734
116double score = rapidfuzz::fuzz::ratio("this is a test", "this is a test!");
117```
118
119### Partial Ratio
120```cpp
121// score is 100
122double score = rapidfuzz::fuzz::partial_ratio("this is a test", "this is a test!");
123```
124
125### Token Sort Ratio
126```cpp
127// score is 90.90908813476562
128double score = rapidfuzz::fuzz::ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
129
130// score is 100
131double score = rapidfuzz::fuzz::token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
132```
133
134### Token Set Ratio
135```cpp
136// score is 83.8709716796875
137double score = rapidfuzz::fuzz::token_sort_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
138
139// score is 100
140double score = rapidfuzz::fuzz::token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
141```
142
143### Process
144In the Python implementation there is a module process, which is used to compare e.g. a string to a list of strings.
145In Python this both saves the time to implement those features yourself and can be a lot more efficient than repeated type
146conversions between Python and C++. Implementing a similar function in C++ using templates is not easily possible and probably slower than implementing them on your own. Thats why this section describes how users can implement those features with a couple lines of code using the C++ library.
147
148### extract
149
150The following function compares a query string to all strings in a list of choices. It returns all
151elements with a similarity over score_cutoff. Generally make use of the cached implementations when comparing
152a string to multiple strings.
153
154
155```cpp
156template <typename Sentence1,
157          typename Iterable, typename Sentence2 = typename Iterable::value_type>
158std::vector<std::pair<Sentence2, percent>>
159extract(const Sentence1& query, const Iterable& choices, const percent score_cutoff = 0.0)
160{
161  std::vector<std::pair<Sentence2, percent>> results;
162
163  auto scorer = rapidfuzz::fuzz::CachedRatio<Sentence1>(query);
164
165  for (const auto& choice : choices) {
166    double score = scorer.ratio(choice, score_cutoff);
167
168    if (score >= score_cutoff) {
169      results.emplace_back(choice, score);
170    }
171  }
172
173  return results;
174}
175```
176
177### extractOne
178
179The following function compares a query string to all strings in a list of choices.
180
181```cpp
182template <typename Sentence1,
183          typename Iterable, typename Sentence2 = typename Iterable::value_type>
184std::optional<std::pair<Sentence2, percent>>
185extractOne(const Sentence1& query, const Iterable& choices, const percent score_cutoff = 0.0)
186{
187  bool match_found = false;
188  double best_score = score_cutoff;
189  Sentence2 best_match;
190
191  auto scorer = rapidfuzz::fuzz::CachedRatio<Sentence1>(query);
192
193  for (const auto& choice : choices) {
194    double score = scorer.ratio(choice, best_score);
195
196    if (score >= best_score) {
197      match_found = true;
198      best_score = score;
199      best_match = choice;
200    }
201  }
202
203  if (!match_found) {
204    return nullopt;
205  }
206
207  return std::make_pair(best_match, best_score);
208}
209```
210
211### multithreading
212
213It is very simple to use those scorers e.g. with open OpenMP to achieve better performance.
214
215```cpp
216template <typename Sentence1,
217          typename Iterable, typename Sentence2 = typename Iterable::value_type>
218std::vector<std::pair<Sentence2, percent>>
219extract(const Sentence1& query, const Iterable& choices, const percent score_cutoff = 0.0)
220{
221  std::vector<std::pair<Sentence2, percent>> results(choices.size());
222
223  auto scorer = rapidfuzz::fuzz::CachedRatio<Sentence1>(query);
224
225  #pragma omp parallel for
226  for (std::size_t i = 0; i < choices.size(); ++i) {
227    double score = scorer.ratio(choices[i], score_cutoff);
228    results[i] = std::make_pair(choices[i], score);
229  }
230
231  return results;
232}
233```
234
235Note that the scorers are not threadsafe and do not take ownership of the data you pass them.
236Internally they fully operate on string_views. This is especially important for the cached implementations,
237since it might be tempting to pass the first string to the constructor of the cached scorer and delete it afterwards.
238
239
240## License
241RapidFuzz is licensed under the MIT license since I believe that everyone should be able to use it without being forced to adopt the GPL license. Thats why the library is based on an older version of fuzzywuzzy that was MIT licensed as well.
242This old version of fuzzywuzzy can be found [here](https://github.com/seatgeek/fuzzywuzzy/tree/4bf28161f7005f3aa9d4d931455ac55126918df7).
243