1 <h1 align="center"> 2<img src="https://raw.githubusercontent.com/maxbachmann/rapidfuzz/master/docs/img/RapidFuzz.svg?sanitize=true" alt="RapidFuzz" width="400"> 3</h1> 4<h4 align="center">Rapid fuzzy string matching in C++ using the Levenshtein Distance</h4> 5 6<p align="center"> 7 <a href="https://github.com/maxbachmann/rapidfuzz/actions"> 8 <img src="https://github.com/maxbachmann/rapidfuzz/workflows/Build/badge.svg" 9 alt="Continous Integration"> 10 </a> 11 <a href="https://maxbachmann.github.io/rapidfuzz"> 12 <img src="https://img.shields.io/badge/-documentation-blue" 13 alt="Documentation"> 14 </a> 15 <a href="https://github.com/maxbachmann/rapidfuzz/blob/dev/LICENSE"> 16 <img src="https://img.shields.io/github/license/maxbachmann/rapidfuzz" 17 alt="GitHub license"> 18 </a> 19</p> 20 21<p align="center"> 22 <a href="#description">Description</a> • 23 <a href="#installation">Installation</a> • 24 <a href="#usage">Usage</a> • 25 <a href="#license">License</a> 26</p> 27 28--- 29## Description 30RapidFuzz is a fast string matching library for Python and C++, which is using the string similarity calculations from [FuzzyWuzzy](https://github.com/seatgeek/fuzzywuzzy). However there are two aspects that set RapidFuzz apart from FuzzyWuzzy: 311) It is MIT licensed so it can be used whichever License you might want to choose for your project, while you're forced to adopt the GPL license when using FuzzyWuzzy 322) It is mostly written in C++ and on top of this comes with a lot of Algorithmic improvements to make string matching even faster, while still providing the same results. More details on these performance improvements in form of benchmarks can be found [here](https://github.com/maxbachmann/rapidfuzz/blob/master/Benchmarks.md) 33 34The Library is splitted across multiple repositories for the different supported programming languages: 35- The C++ version is versioned in this repository 36- The Python version can be found at [maxbachmann/rapidfuzz](https://github.com/maxbachmann/rapidfuzz) 37 38 39## CMake Integration 40 41There are severals ways to integrate `rapidfuzz` in your CMake project. 42 43### By Installing it 44```bash 45git clone https://github.com/maxbachmann/rapidfuzz-cpp.git rapidfuzz-cpp 46cd rapidfuzz-cpp 47mkdir build && cd build 48cmake .. -DCMAKE_BUILD_TYPE=Release 49cmake --build . 50cmake --build . --target install 51``` 52 53Then in your CMakeLists.txt: 54```cmake 55find_package(rapidfuzz REQUIRED) 56add_executable(foo main.cpp) 57target_link_libraries(foo rapidfuzz::rapidfuzz) 58``` 59 60### Add this repository as a submodule 61```bash 62git submodule add https://github.com/maxbachmann/rapidfuzz-cpp.git 3rdparty/RapidFuzz 63``` 64Then you can either: 65 661. include it as a subdirectory 67 ```cmake 68 add_subdirectory(3rdparty/RapidFuzz) 69 add_executable(foo main.cpp) 70 target_link_libraries(foo rapidfuzz::rapidfuzz) 71 ``` 722. build it at configure time with `FetchContent` 73 ```cmake 74 FetchContent_Declare( 75 rapidfuzz 76 SOURCE_DIR ${CMAKE_SOURCE_DIR}/3rdparty/RapidFuzz 77 PREFIX ${CMAKE_CURRENT_BINARY_DIR}/rapidfuzz 78 CMAKE_ARGS -DCMAKE_INSTALL_PREFIX:PATH=<INSTALL_DIR> "${CMAKE_OPT_ARGS}" 79 ) 80 FetchContent_MakeAvailable(rapidfuzz) 81 add_executable(foo main.cpp) 82 target_link_libraries(foo PRIVATE rapidfuzz::rapidfuzz) 83 ``` 84### Download it at configure time 85 86If you don't want to add `rapidfuzz-cpp` as a submodule, you can also download it with `FetchContent`: 87```cmake 88FetchContent_Declare(rapidfuzz 89 GIT_REPOSITORY https://github.com/maxbachmann/rapidfuzz-cpp.git 90 GIT_TAG master) 91FetchContent_MakeAvailable(rapidfuzz) 92add_executable(foo main.cpp) 93target_link_libraries(foo PRIVATE rapidfuzz::rapidfuzz) 94``` 95It will be downloaded each time you run CMake in a blank folder. 96 97## CMake option 98 99There are CMake options available: 100 1011. `BUILD_TESTS` : to build test (default OFF and requires [Catch2](https://github.com/catchorg/Catch2)) 102 1032. `BUILD_BENCHMARKS` : to build benchmarks (default OFF and requires [Google Benchmark](https://github.com/google/benchmark)) 104 105## Usage 106```cpp 107#include "rapidfuzz/fuzz.hpp" 108#include "rapidfuzz/utils.hpp" 109``` 110 111### Simple Ratio 112```cpp 113using rapidfuzz::fuzz::ratio; 114 115// score is 96.55171966552734 116double score = rapidfuzz::fuzz::ratio("this is a test", "this is a test!"); 117``` 118 119### Partial Ratio 120```cpp 121// score is 100 122double score = rapidfuzz::fuzz::partial_ratio("this is a test", "this is a test!"); 123``` 124 125### Token Sort Ratio 126```cpp 127// score is 90.90908813476562 128double score = rapidfuzz::fuzz::ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear") 129 130// score is 100 131double score = rapidfuzz::fuzz::token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear") 132``` 133 134### Token Set Ratio 135```cpp 136// score is 83.8709716796875 137double score = rapidfuzz::fuzz::token_sort_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear") 138 139// score is 100 140double score = rapidfuzz::fuzz::token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear") 141``` 142 143### Process 144In the Python implementation there is a module process, which is used to compare e.g. a string to a list of strings. 145In Python this both saves the time to implement those features yourself and can be a lot more efficient than repeated type 146conversions between Python and C++. Implementing a similar function in C++ using templates is not easily possible and probably slower than implementing them on your own. Thats why this section describes how users can implement those features with a couple lines of code using the C++ library. 147 148### extract 149 150The following function compares a query string to all strings in a list of choices. It returns all 151elements with a similarity over score_cutoff. Generally make use of the cached implementations when comparing 152a string to multiple strings. 153 154 155```cpp 156template <typename Sentence1, 157 typename Iterable, typename Sentence2 = typename Iterable::value_type> 158std::vector<std::pair<Sentence2, percent>> 159extract(const Sentence1& query, const Iterable& choices, const percent score_cutoff = 0.0) 160{ 161 std::vector<std::pair<Sentence2, percent>> results; 162 163 auto scorer = rapidfuzz::fuzz::CachedRatio<Sentence1>(query); 164 165 for (const auto& choice : choices) { 166 double score = scorer.ratio(choice, score_cutoff); 167 168 if (score >= score_cutoff) { 169 results.emplace_back(choice, score); 170 } 171 } 172 173 return results; 174} 175``` 176 177### extractOne 178 179The following function compares a query string to all strings in a list of choices. 180 181```cpp 182template <typename Sentence1, 183 typename Iterable, typename Sentence2 = typename Iterable::value_type> 184std::optional<std::pair<Sentence2, percent>> 185extractOne(const Sentence1& query, const Iterable& choices, const percent score_cutoff = 0.0) 186{ 187 bool match_found = false; 188 double best_score = score_cutoff; 189 Sentence2 best_match; 190 191 auto scorer = rapidfuzz::fuzz::CachedRatio<Sentence1>(query); 192 193 for (const auto& choice : choices) { 194 double score = scorer.ratio(choice, best_score); 195 196 if (score >= best_score) { 197 match_found = true; 198 best_score = score; 199 best_match = choice; 200 } 201 } 202 203 if (!match_found) { 204 return nullopt; 205 } 206 207 return std::make_pair(best_match, best_score); 208} 209``` 210 211### multithreading 212 213It is very simple to use those scorers e.g. with open OpenMP to achieve better performance. 214 215```cpp 216template <typename Sentence1, 217 typename Iterable, typename Sentence2 = typename Iterable::value_type> 218std::vector<std::pair<Sentence2, percent>> 219extract(const Sentence1& query, const Iterable& choices, const percent score_cutoff = 0.0) 220{ 221 std::vector<std::pair<Sentence2, percent>> results(choices.size()); 222 223 auto scorer = rapidfuzz::fuzz::CachedRatio<Sentence1>(query); 224 225 #pragma omp parallel for 226 for (std::size_t i = 0; i < choices.size(); ++i) { 227 double score = scorer.ratio(choices[i], score_cutoff); 228 results[i] = std::make_pair(choices[i], score); 229 } 230 231 return results; 232} 233``` 234 235Note that the scorers are not threadsafe and do not take ownership of the data you pass them. 236Internally they fully operate on string_views. This is especially important for the cached implementations, 237since it might be tempting to pass the first string to the constructor of the cached scorer and delete it afterwards. 238 239 240## License 241RapidFuzz is licensed under the MIT license since I believe that everyone should be able to use it without being forced to adopt the GPL license. Thats why the library is based on an older version of fuzzywuzzy that was MIT licensed as well. 242This old version of fuzzywuzzy can be found [here](https://github.com/seatgeek/fuzzywuzzy/tree/4bf28161f7005f3aa9d4d931455ac55126918df7). 243