gensim – Topic Modelling in Python
==================================

<!--
The following image URLs are obfuscated = proxied and cached through
Google because of GitHub's proxying issues. See:
https://github.com/RaRe-Technologies/gensim/issues/2805
-->

[![Build Status](https://github.com/RaRe-Technologies/gensim/actions/workflows/tests.yml/badge.svg?branch=develop)](https://github.com/RaRe-Technologies/gensim/actions)
[![GitHub release](https://img.shields.io/github/release/rare-technologies/gensim.svg?maxAge=3600)](https://github.com/RaRe-Technologies/gensim/releases)
[![Downloads](https://img.shields.io/pypi/dm/gensim?color=blue)](https://pepy.tech/project/gensim/)
[![DOI](https://zenodo.org/badge/DOI/10.13140/2.1.2393.1847.svg)](https://doi.org/10.13140/2.1.2393.1847)
[![Mailing List](https://img.shields.io/badge/-Mailing%20List-blue.svg)](https://groups.google.com/forum/#!forum/gensim)
[![Follow](https://img.shields.io/twitter/follow/gensim_py.svg?style=social&style=flat&logo=twitter&label=Follow&color=blue)](https://twitter.com/gensim_py)

Gensim is a Python library for *topic modelling*, *document indexing*
and *similarity retrieval* with large corpora. The target audience is the
*natural language processing* (NLP) and *information retrieval* (IR)
community.

<!--
## :pizza: Hacktoberfest 2019 :beer:

We are accepting PRs for Hacktoberfest!
See [here](HACKTOBERFEST.md) for details.
-->

Features
--------

-   All algorithms are **memory-independent** w.r.t. the corpus size
    (can process input larger than RAM, streamed, out-of-core).
-   **Intuitive interfaces**
    -   easy to plug in your own input corpus/datastream (trivial
        streaming API)
    -   easy to extend with other Vector Space algorithms (trivial
        transformation API)
-   Efficient multicore implementations of popular algorithms, such as
    online **Latent Semantic Analysis (LSA/LSI/SVD)**, **Latent
    Dirichlet Allocation (LDA)**, **Random Projections (RP)**,
    **Hierarchical Dirichlet Process (HDP)** and **word2vec**.
-   **Distributed computing**: can run *Latent Semantic Analysis* and
    *Latent Dirichlet Allocation* on a cluster of computers.
-   Extensive [documentation and Jupyter Notebook tutorials].
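
As a rough sketch of what the streaming and transformation interfaces above amount to, the plain-Python example below treats a corpus as any re-iterable sequence of sparse `(term_id, count)` vectors, and a transformation as an object that re-weights one vector at a time. The `TfidfSketch` class, its `apply` method and the toy corpus are hypothetical names made up for illustration; this is not gensim's actual implementation or API.

```python
import math
from collections import Counter


class TfidfSketch:
    """Hypothetical transformation: learns document frequencies in one
    streaming pass, then re-weights bag-of-words vectors on demand."""

    def __init__(self, corpus):
        self.num_docs = 0
        self.doc_freq = Counter()
        for bow in corpus:  # only one document is held in memory at a time
            self.num_docs += 1
            self.doc_freq.update(term_id for term_id, _ in bow)

    def __getitem__(self, bow):
        """Map one sparse vector [(term_id, count), ...] to tf-idf weights."""
        return [
            (term_id, count * math.log2(self.num_docs / self.doc_freq[term_id]))
            for term_id, count in bow
            if self.doc_freq.get(term_id, 0) > 0  # skip terms never seen in training
        ]

    def apply(self, corpus):
        """Lazily transform a whole corpus; nothing runs until iteration."""
        for bow in corpus:
            yield self[bow]


# Toy corpus (an assumption for this sketch): three documents as sparse vectors.
corpus = [
    [(0, 1), (1, 2)],
    [(1, 1), (2, 1)],
    [(1, 3), (3, 1)],
]
tfidf = TfidfSketch(corpus)
weighted = list(tfidf.apply(corpus))
```

The corpus here is a list only to keep the sketch short; any object that can be iterated more than once would work the same way.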

If this feature list left you scratching your head, you can first read
more about the [Vector Space Model] and [unsupervised document analysis]
on Wikipedia.

Installation
------------

This software depends on [NumPy and SciPy], two Python packages for
scientific computing. You must have them installed prior to installing
gensim.

We also recommend installing a fast BLAS library before installing
NumPy. This is optional, but using an optimized BLAS such as [ATLAS] or
[OpenBLAS] is known to improve performance by as much as an order of
magnitude. On macOS, NumPy picks up the BLAS that comes with it
automatically, so you don’t need to do anything special.

Install the latest version of gensim:

```bash
pip install --upgrade gensim
```

Alternatively, if you have downloaded and unpacked the [source tar.gz]
package:

```bash
python setup.py install
```

For alternative modes of installation, see the [documentation].
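
To check what ended up installed, a small standard-library script along the following lines may help. It is only a sketch: the helper names `installed_version` and `environment_report` are invented for this example, and it assumes Python 3.8 or newer, where `importlib.metadata` is part of the standard library.

```python
import sys
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def environment_report(packages=("numpy", "scipy", "gensim")):
    """Collect the interpreter version and each package's version (None if absent)."""
    report = {"python": "{}.{}.{}".format(*sys.version_info[:3])}
    for name in packages:
        report[name] = installed_version(name)
    return report


# Print one line per component, e.g. the interpreter and the three packages.
for name, version in environment_report().items():
    print(f"{name}: {version or 'not installed'}")
```

A `not installed` entry for NumPy or SciPy suggests the dependencies are missing from the current environment.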

Gensim is [continuously tested](https://github.com/RaRe-Technologies/gensim/actions) under Python 3.6, 3.7 and 3.8.
Support for Python 2.7 was dropped in gensim 4.0.0 – install gensim 3.8.3 if you must use Python 2.7.

How come gensim is so fast and memory efficient? Isn’t it pure Python, and isn’t Python slow and greedy?
--------------------------------------------------------------------------------------------------------

Many scientific algorithms can be expressed in terms of large matrix
operations (see the BLAS note above). Gensim taps into these low-level
BLAS libraries through its dependency on NumPy. So while
gensim-the-top-level-code is pure Python, it actually executes highly
optimized Fortran/C under the hood, including multithreading (if your
BLAS is so configured).

Memory-wise, gensim makes heavy use of Python’s built-in generators and
iterators for streamed data processing. Memory efficiency was one of
gensim’s [design goals], and is a central feature of gensim, rather than
something bolted on as an afterthought.
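
A minimal sketch of that streaming pattern, written in plain Python rather than with gensim's own classes, might look like the following. The `StreamedCorpus` class, the one-document-per-line file layout and the sample sentences are all assumptions made for illustration. The point is that `__iter__` is a generator that re-opens the file on every pass, so the corpus can be iterated repeatedly while only one line is in memory at a time.

```python
import os
import tempfile
from collections import Counter


class StreamedCorpus:
    """Hypothetical corpus: one document per line, read lazily from disk."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-opening the file on each call makes the corpus re-iterable,
        # while only the current line is held in memory.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                yield Counter(line.lower().split())


# Write a tiny sample file so the sketch is self-contained.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("human machine interface\n")
    tmp.write("graph of trees\n")
    tmp.write("graph minors survey graph\n")

corpus = StreamedCorpus(tmp.name)

# First pass: accumulate word counts document by document.
total_counts = Counter()
for doc in corpus:
    total_counts.update(doc)

# Second pass: works because __iter__ starts a fresh read of the file.
num_docs = sum(1 for _ in corpus)

os.remove(tmp.name)  # clean up the sample file
```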

Documentation
-------------

-   [QuickStart]
-   [Tutorials]
-   [Official API Documentation]

  [QuickStart]: https://radimrehurek.com/gensim/auto_examples/core/run_core_concepts.html
  [Tutorials]: https://radimrehurek.com/gensim/auto_examples/
  [Official Documentation and Walkthrough]: http://radimrehurek.com/gensim/
  [Official API Documentation]: http://radimrehurek.com/gensim/apiref.html

Support
-------

Ask open-ended or research questions on the [Gensim Mailing List](https://groups.google.com/forum/#!forum/gensim).

Report bugs on [GitHub](https://github.com/RaRe-Technologies/gensim/blob/develop/CONTRIBUTING.md), but **make sure you follow the [issue template](https://github.com/RaRe-Technologies/gensim/blob/develop/ISSUE_TEMPLATE.md)**. Issues that are not bugs or that fail to follow the issue template will be closed without inspection.

---------

Adopters
--------

| Company | Logo | Industry | Use of Gensim |
|---------|------|----------|---------------|
| [RARE Technologies](http://rare-technologies.com) | ![rare](docs/src/readme_images/rare.png) | ML & NLP consulting | Creators of Gensim – this is us! |
| [Amazon](http://www.amazon.com/) | ![amazon](docs/src/readme_images/amazon.png) | Retail | Document similarity. |
| [National Institutes of Health](https://github.com/NIHOPA/pipeline_word2vec) | ![nih](docs/src/readme_images/nih.png) | Health | Processing grants and publications with word2vec. |
| [Cisco Security](http://www.cisco.com/c/en/us/products/security/index.html) | ![cisco](docs/src/readme_images/cisco.png) | Security | Large-scale fraud detection. |
| [Mindseye](http://www.mindseyesolutions.com/) | ![mindseye](docs/src/readme_images/mindseye.png) | Legal | Similarities in legal documents. |
| [Channel 4](http://www.channel4.com/) | ![channel4](docs/src/readme_images/channel4.png) | Media | Recommendation engine. |
| [Talentpair](http://talentpair.com) | ![talent-pair](docs/src/readme_images/talent-pair.png) | HR | Candidate matching in high-touch recruiting. |
| [Juju](http://www.juju.com/) | ![juju](docs/src/readme_images/juju.png) | HR | Provide non-obvious related job suggestions. |
| [Tailwind](https://www.tailwindapp.com/) | ![tailwind](docs/src/readme_images/tailwind.png) | Media | Post interesting and relevant content to Pinterest. |
| [Issuu](https://issuu.com/) | ![issuu](docs/src/readme_images/issuu.png) | Media | Gensim's LDA module lies at the very core of the analysis we perform on each uploaded publication to figure out what it's all about. |
| [Search Metrics](http://www.searchmetrics.com/) | ![search-metrics](docs/src/readme_images/search-metrics.png) | Content Marketing | Gensim word2vec used for entity disambiguation in Search Engine Optimisation. |
| [12K Research](https://12k.co/) | ![12k](docs/src/readme_images/12k.png) | Media | Document similarity analysis on media articles. |
| [Stillwater Supercomputing](http://www.stillwater-sc.com/) | ![stillwater](docs/src/readme_images/stillwater.png) | Hardware | Document comprehension and association with word2vec. |
| [SiteGround](https://www.siteground.com/) | ![siteground](docs/src/readme_images/siteground.png) | Web hosting | An ensemble search engine which uses different embeddings models and similarities, including word2vec, WMD, and LDA. |
| [Capital One](https://www.capitalone.com/) | ![capitalone](docs/src/readme_images/capitalone.png) | Finance | Topic modeling for customer complaints exploration. |

-------

Citing gensim
------------

When [citing gensim in academic papers and theses], please use this
BibTeX entry:

    @inproceedings{rehurek_lrec,
          title = {{Software Framework for Topic Modelling with Large Corpora}},
          author = {Radim {\v R}eh{\r u}{\v r}ek and Petr Sojka},
          booktitle = {{Proceedings of the LREC 2010 Workshop on New
               Challenges for NLP Frameworks}},
          pages = {45--50},
          year = 2010,
          month = May,
          day = 22,
          publisher = {ELRA},
          address = {Valletta, Malta},
          note={\url{http://is.muni.cz/publication/884893/en}},
          language={English}
    }

  [citing gensim in academic papers and theses]: https://scholar.google.com/citations?view_op=view_citation&hl=en&user=9vG_kV0AAAAJ&citation_for_view=9vG_kV0AAAAJ:NaGl4SEjCO4C

  [Travis CI for automated testing]: https://travis-ci.org/RaRe-Technologies/gensim
  [design goals]: http://radimrehurek.com/gensim/about.html
  [RaRe Technologies]: http://rare-technologies.com/wp-content/uploads/2016/02/rare_image_only.png%20=10x20
  [rare\_tech]: //rare-technologies.com
  [Talentpair]: https://avatars3.githubusercontent.com/u/8418395?v=3&s=100
  [citing gensim in academic papers and theses]: https://scholar.google.cz/citations?view_op=view_citation&hl=en&user=9vG_kV0AAAAJ&citation_for_view=9vG_kV0AAAAJ:u-x6o8ySG0sC

  [documentation and Jupyter Notebook tutorials]: https://github.com/RaRe-Technologies/gensim/#documentation
  [Vector Space Model]: http://en.wikipedia.org/wiki/Vector_space_model
  [unsupervised document analysis]: http://en.wikipedia.org/wiki/Latent_semantic_indexing
  [NumPy and Scipy]: http://www.scipy.org/Download
  [ATLAS]: http://math-atlas.sourceforge.net/
  [OpenBLAS]: http://xianyi.github.io/OpenBLAS/
  [source tar.gz]: http://pypi.python.org/pypi/gensim
  [documentation]: http://radimrehurek.com/gensim/install.html