| Name | Date | Size | #Lines | LOC |
| --- | --- | --- | --- | --- |
| .. | 03-May-2022 | - | | |
| .github/workflows/ | 23-Mar-2020 | - | 31 | 29 |
| fingerprints/ | 23-Mar-2020 | - | 178 | 125 |
| tests/ | 23-Mar-2020 | - | 48 | 34 |
| tools/ | 23-Mar-2020 | - | 2,431 | 2,421 |
| .bumpversion.cfg | 23-Mar-2020 | 118 | 9 | 6 |
| .gitignore | 23-Mar-2020 | 74 | 9 | 9 |
| LICENSE | 23-Mar-2020 | 1.1 KiB | 22 | 17 |
| MANIFEST.in | 23-Mar-2020 | 50 | 3 | 2 |
| Makefile | 23-Mar-2020 | 335 | 16 | 11 |
| README.md | 23-Mar-2020 | 1.7 KiB | 40 | 25 |
| setup.cfg | 23-Mar-2020 | 61 | 6 | 4 |
| setup.py | 23-Mar-2020 | 1.1 KiB | 38 | 35 |

README.md

# fingerprints

![package](https://github.com/alephdata/fingerprints/workflows/package/badge.svg)

This library helps with the generation of fingerprints for entity data. A fingerprint
in this context is understood as a simplified entity identifier, derived from its
name or address and used for cross-referencing entities across different datasets.

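## Usage

```python
import fingerprints

# The honorific is dropped; the remaining tokens are lowercased and sorted.
fp = fingerprints.generate('Mr. Sherlock Holmes')
assert fp == 'holmes sherlock'

# The legal form is abbreviated (see "Company type names" below).
fp = fingerprints.generate('Siemens Aktiengesellschaft')
assert fp == 'ag siemens'

# Punctuation is removed and repeated tokens collapse.
fp = fingerprints.generate('New York, New York')
assert fp == 'new york'
```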
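
To illustrate the cross-referencing use case, here is a minimal sketch that pairs names from two datasets whenever they share the same key. So that it runs without the library installed, it uses `toy_key`, a hypothetical stand-in that only lowercases, strips punctuation, de-duplicates and sorts tokens; it is not how `fingerprints.generate` is implemented. `cross_reference` and the sample data are likewise assumptions made for this example.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def toy_key(name: str) -> str:
    """Hypothetical stand-in for fingerprints.generate (illustration only)."""
    # Replace every non-alphanumeric character with a space, then lowercase.
    cleaned = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    # De-duplicate and sort the tokens so word order no longer matters.
    return " ".join(sorted(set(cleaned.split())))


def cross_reference(
    left: Iterable[str],
    right: Iterable[str],
    key: Callable[[str], str] = toy_key,
) -> List[Tuple[str, str]]:
    """Return (left_name, right_name) pairs whose keys are identical."""
    index: Dict[str, List[str]] = defaultdict(list)
    for name in right:
        index[key(name)].append(name)
    return [(name, match) for name in left for match in index.get(key(name), [])]


dataset_a = ["Holmes, Sherlock", "New York, New York"]
dataset_b = ["sherlock HOLMES", "York New", "Dr. Watson"]
matches = cross_reference(dataset_a, dataset_b)
```

With the library installed, passing `key=fingerprints.generate` should give the same kind of grouping while also handling honorifics and company legal forms.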

## Company type names

A significant part of what `fingerprints` does is to recognize company legal form
names. For example, `fingerprints` can simplify `Общество с ограниченной ответственностью` to `ООО`, or `Aktiengesellschaft` to `AG`. The required database
is based on two different sources:

* A [Google Spreadsheet](https://docs.google.com/spreadsheets/d/1Cw2xQ3hcZOAgnnzejlY5Sv3OeMxKePTqcRhXQU8rCAw/edit?ts=5e7754cf#gid=0) created by OCCRP.
* The ISO 20275 [Entity Legal Forms Code List](https://www.gleif.org/en/about-lei/code-lists/iso-20275-entity-legal-forms-code-list).

Wikipedia also maintains an index of [types of business entity](https://en.wikipedia.org/wiki/Types_of_business_entity).
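
The legal-form simplification described in this section can be pictured as a lookup-and-replace over a table of known forms. The sketch below is an assumption-laden illustration, not the library's code: `LEGAL_FORMS` is a hypothetical two-entry table built from the examples above (the real database is far larger), and `simplify_legal_form` is a name invented for this example.

```python
import re

# Hypothetical two-entry table, taken from the examples in the text.
LEGAL_FORMS = {
    "Общество с ограниченной ответственностью": "ООО",
    "Aktiengesellschaft": "AG",
}

# Case-insensitive lookup from the long form to its abbreviation.
_LOOKUP = {form.lower(): abbr for form, abbr in LEGAL_FORMS.items()}

# Try longer phrases first and only match whole words.
_ALTERNATION = "|".join(
    re.escape(form) for form in sorted(LEGAL_FORMS, key=len, reverse=True)
)
_PATTERN = re.compile(r"(?<!\w)(?:" + _ALTERNATION + r")(?!\w)", re.IGNORECASE)


def simplify_legal_form(name: str) -> str:
    """Replace each known long legal form in `name` with its abbreviation."""
    return _PATTERN.sub(lambda m: _LOOKUP[m.group(0).lower()], name)


short_de = simplify_legal_form("Siemens Aktiengesellschaft")
short_ru = simplify_legal_form("Общество с ограниченной ответственностью Ромашка")
```

Here `short_de` is `"Siemens AG"` and `short_ru` is `"ООО Ромашка"`. Unlike the `generate` examples in the Usage section, this sketch leaves case and token order untouched.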

## See also

* [Clustering in Depth](https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth), part of the OpenRefine documentation discussing how to create collisions in data clustering.
* [probablepeople](https://github.com/datamade/probablepeople), a parser for Western names made by the brilliant folks at datamade.us.