gensim – Topic Modelling in Python
==================================

<!--
The following image URLs are obfuscated = proxied and cached through
Google because of Github's proxying issues. See:
https://github.com/RaRe-Technologies/gensim/issues/2805
-->

[![Build Status](https://github.com/RaRe-Technologies/gensim/actions/workflows/tests.yml/badge.svg?branch=develop)](https://github.com/RaRe-Technologies/gensim/actions)
[![GitHub release](https://img.shields.io/github/release/rare-technologies/gensim.svg?maxAge=3600)](https://github.com/RaRe-Technologies/gensim/releases)
[![Downloads](https://img.shields.io/pypi/dm/gensim?color=blue)](https://pepy.tech/project/gensim/)
[![DOI](https://zenodo.org/badge/DOI/10.13140/2.1.2393.1847.svg)](https://doi.org/10.13140/2.1.2393.1847)
[![Mailing List](https://img.shields.io/badge/-Mailing%20List-blue.svg)](https://groups.google.com/forum/#!forum/gensim)
[![Follow](https://img.shields.io/twitter/follow/gensim_py.svg?style=social&style=flat&logo=twitter&label=Follow&color=blue)](https://twitter.com/gensim_py)

Gensim is a Python library for *topic modelling*, *document indexing*
and *similarity retrieval* with large corpora. Target audience is the
*natural language processing* (NLP) and *information retrieval* (IR)
community.

<!--
## :pizza: Hacktoberfest 2019 :beer:

We are accepting PRs for Hacktoberfest!
See [here](HACKTOBERFEST.md) for details.
-->

Features
--------

- All algorithms are **memory-independent** w.r.t.
  the corpus size
  (can process input larger than RAM, streamed, out-of-core),
- **Intuitive interfaces**
  - easy to plug in your own input corpus/datastream (trivial
    streaming API)
  - easy to extend with other Vector Space algorithms (trivial
    transformation API)
- Efficient multicore implementations of popular algorithms, such as
  online **Latent Semantic Analysis (LSA/LSI/SVD)**, **Latent
  Dirichlet Allocation (LDA)**, **Random Projections (RP)**,
  **Hierarchical Dirichlet Process (HDP)** or **word2vec deep
  learning**.
- **Distributed computing**: can run *Latent Semantic Analysis* and
  *Latent Dirichlet Allocation* on a cluster of computers.
- Extensive [documentation and Jupyter Notebook tutorials].

If this feature list left you scratching your head, you can first read
more about the [Vector Space Model] and [unsupervised document analysis]
on Wikipedia.

Installation
------------

This software depends on [NumPy and Scipy], two Python packages for
scientific computing. You must have them installed prior to installing
gensim.

It is also recommended you install a fast BLAS library before installing
NumPy. This is optional, but using an optimized BLAS such as [ATLAS] or
[OpenBLAS] is known to improve performance by as much as an order of
magnitude. On OS X, NumPy picks up the BLAS that comes with it
automatically, so you don’t need to do anything special.

Install the latest version of gensim:

```bash
pip install --upgrade gensim
```

Or, if you have instead downloaded and unzipped the [source tar.gz]
package:

```bash
python setup.py install
```

For alternative modes of installation, see the [documentation].

Gensim is being [continuously tested](https://travis-ci.org/RaRe-Technologies/gensim) under Python 3.6, 3.7 and 3.8.
Support for Python 2.7 was dropped in gensim 4.0.0 – install gensim 3.8.3 if you must use Python 2.7.
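
The "memory-independent" claim in the feature list rests on streaming: a
corpus is simply something you can iterate over, one document at a time.
The sketch below illustrates that idea in plain Python, without importing
gensim. The names `build_vocabulary` and `StreamedCorpus` are hypothetical
helpers made up for this illustration and are not part of gensim's API;
the in-memory `documents` list stands in for a file or database that
would normally be too large to load at once.

```python
from collections import Counter

# Hypothetical helpers for illustration only; not part of gensim's API.


def build_vocabulary(lines):
    """Map each distinct lowercase token to an integer id, in first-seen order."""
    vocab = {}
    for line in lines:
        for token in line.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


class StreamedCorpus:
    """Yield one sparse bag-of-words vector per document.

    Only the document currently being processed is held in memory.
    """

    def __init__(self, source, vocab):
        # `source` is a zero-argument callable returning a fresh iterable
        # of lines, so the corpus can be iterated more than once.
        self.source = source
        self.vocab = vocab

    def __iter__(self):
        for line in self.source():
            tokens = line.lower().split()
            counts = Counter(self.vocab[t] for t in tokens if t in self.vocab)
            # Sparse vector: sorted list of (token_id, count) pairs.
            yield sorted(counts.items())


# Stand-in for a large on-disk collection, one document per line.
documents = [
    "human machine interface",
    "graph of trees",
    "graph minors survey of graph",
]


def source():
    return iter(documents)


vocab = build_vocabulary(source())
corpus = StreamedCorpus(source, vocab)
vectors = list(corpus)  # materialised here only to inspect the result
```

In real use, `source` would open a file and yield its lines, and the
consumer would loop over `corpus` directly instead of building a list.
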

How come gensim is so fast and memory efficient? Isn’t it pure Python, and isn’t Python slow and greedy?
--------------------------------------------------------------------------------------------------------

Many scientific algorithms can be expressed in terms of large matrix
operations (see the BLAS note above). Gensim taps into these low-level
BLAS libraries, by means of its dependency on NumPy. So while
gensim-the-top-level-code is pure Python, it actually executes highly
optimized Fortran/C under the hood, including multithreading (if your
BLAS is so configured).

Memory-wise, gensim makes heavy use of Python’s built-in generators and
iterators for streamed data processing. Memory efficiency was one of
gensim’s [design goals], and is a central feature of gensim, rather than
something bolted on as an afterthought.

Documentation
-------------

- [QuickStart]
- [Tutorials]
- [Official API Documentation]

  [QuickStart]: https://radimrehurek.com/gensim/auto_examples/core/run_core_concepts.html
  [Tutorials]: https://radimrehurek.com/gensim/auto_examples/
  [Official Documentation and Walkthrough]: http://radimrehurek.com/gensim/
  [Official API Documentation]: http://radimrehurek.com/gensim/apiref.html

Support
-------

Ask open-ended or research questions on the [Gensim Mailing List](https://groups.google.com/forum/#!forum/gensim).

Raise bugs on [GitHub](https://github.com/RaRe-Technologies/gensim/blob/develop/CONTRIBUTING.md) but **make sure you follow the [issue template](https://github.com/RaRe-Technologies/gensim/blob/develop/ISSUE_TEMPLATE.md)**. Issues that are not bugs or fail to follow the issue template will be closed without inspection.


---------

Adopters
--------

| Company | Logo | Industry | Use of Gensim |
|---------|------|----------|---------------|
| [RARE Technologies](http://rare-technologies.com) | ![rare](docs/src/readme_images/rare.png) | ML & NLP consulting | Creators of Gensim – this is us! |
| [Amazon](http://www.amazon.com/) | ![amazon](docs/src/readme_images/amazon.png) | Retail | Document similarity. |
| [National Institutes of Health](https://github.com/NIHOPA/pipeline_word2vec) | ![nih](docs/src/readme_images/nih.png) | Health | Processing grants and publications with word2vec. |
| [Cisco Security](http://www.cisco.com/c/en/us/products/security/index.html) | ![cisco](docs/src/readme_images/cisco.png) | Security | Large-scale fraud detection. |
| [Mindseye](http://www.mindseyesolutions.com/) | ![mindseye](docs/src/readme_images/mindseye.png) | Legal | Similarities in legal documents. |
| [Channel 4](http://www.channel4.com/) | ![channel4](docs/src/readme_images/channel4.png) | Media | Recommendation engine. |
| [Talentpair](http://talentpair.com) | ![talent-pair](docs/src/readme_images/talent-pair.png) | HR | Candidate matching in high-touch recruiting. |
| [Juju](http://www.juju.com/) | ![juju](docs/src/readme_images/juju.png) | HR | Provide non-obvious related job suggestions. |
| [Tailwind](https://www.tailwindapp.com/) | ![tailwind](docs/src/readme_images/tailwind.png) | Media | Post interesting and relevant content to Pinterest. |
| [Issuu](https://issuu.com/) | ![issuu](docs/src/readme_images/issuu.png) | Media | Gensim's LDA module lies at the very core of the analysis we perform on each uploaded publication to figure out what it's all about. |
| [Search Metrics](http://www.searchmetrics.com/) | ![search-metrics](docs/src/readme_images/search-metrics.png) | Content Marketing | Gensim word2vec used for entity disambiguation in Search Engine Optimisation. |
| [12K Research](https://12k.co/) | ![12k](docs/src/readme_images/12k.png) | Media | Document similarity analysis on media articles. |
| [Stillwater Supercomputing](http://www.stillwater-sc.com/) | ![stillwater](docs/src/readme_images/stillwater.png) | Hardware | Document comprehension and association with word2vec. |
| [SiteGround](https://www.siteground.com/) | ![siteground](docs/src/readme_images/siteground.png) | Web hosting | An ensemble search engine which uses different embeddings models and similarities, including word2vec, WMD, and LDA. |
| [Capital One](https://www.capitalone.com/) | ![capitalone](docs/src/readme_images/capitalone.png) | Finance | Topic modeling for customer complaints exploration. |

-------

Citing gensim
-------------

When [citing gensim in academic papers and theses], please use this
BibTeX entry:

    @inproceedings{rehurek_lrec,
          title = {{Software Framework for Topic Modelling with Large Corpora}},
          author = {Radim {\v R}eh{\r u}{\v r}ek and Petr Sojka},
          booktitle = {{Proceedings of the LREC 2010 Workshop on New
               Challenges for NLP Frameworks}},
          pages = {45--50},
          year = 2010,
          month = May,
          day = 22,
          publisher = {ELRA},
          address = {Valletta, Malta},
          note={\url{http://is.muni.cz/publication/884893/en}},
          language={English}
    }

  [citing gensim in academic papers and theses]: https://scholar.google.com/citations?view_op=view_citation&hl=en&user=9vG_kV0AAAAJ&citation_for_view=9vG_kV0AAAAJ:NaGl4SEjCO4C

  [Travis CI for automated testing]: https://travis-ci.org/RaRe-Technologies/gensim
  [design goals]: http://radimrehurek.com/gensim/about.html
  [RaRe Technologies]: http://rare-technologies.com/wp-content/uploads/2016/02/rare_image_only.png%20=10x20
  [rare\_tech]: //rare-technologies.com
  [Talentpair]: https://avatars3.githubusercontent.com/u/8418395?v=3&s=100
  [citing gensim in academic papers and theses]:
  https://scholar.google.cz/citations?view_op=view_citation&hl=en&user=9vG_kV0AAAAJ&citation_for_view=9vG_kV0AAAAJ:u-x6o8ySG0sC

  [documentation and Jupyter Notebook tutorials]: https://github.com/RaRe-Technologies/gensim/#documentation
  [Vector Space Model]: http://en.wikipedia.org/wiki/Vector_space_model
  [unsupervised document analysis]: http://en.wikipedia.org/wiki/Latent_semantic_indexing
  [NumPy and Scipy]: http://www.scipy.org/Download
  [ATLAS]: http://math-atlas.sourceforge.net/
  [OpenBLAS]: http://xianyi.github.io/OpenBLAS/
  [source tar.gz]: http://pypi.python.org/pypi/gensim
  [documentation]: http://radimrehurek.com/gensim/install.html