==========================  ===========  =======  ======  ======
Name                        Date         Size     #Lines  LOC
==========================  ===========  =======  ======  ======
..                          03-May-2022  -
build/lib/snowballstemmer/  03-May-2022  -        23,121  21,232
snowballstemmer.egg-info/   03-May-2022  -        211     178
src/                        16-Nov-2021  -        23,252  21,347
COPYING                     15-Nov-2021  1.6 KiB  30      26
MANIFEST.in                 15-Nov-2021  126      8       7
NEWS                        15-Nov-2021  27 KiB   755     511
PKG-INFO                    16-Nov-2021  6.3 KiB  164     132
README.rst                  16-Nov-2021  3.8 KiB  101     72
modules.txt                 16-Nov-2021  3.2 KiB  64      59
setup.cfg                   16-Nov-2021  158      12      8
setup.py                    16-Nov-2021  2.6 KiB  82      70
==========================  ===========  =======  ======  ======

README.rst

Snowball stemming library collection for Python
===============================================

Python 3 (>= 3.3) is supported.  We no longer actively support Python 2 as
the Python developers stopped supporting it at the start of 2020.  Snowball
2.1.0 was the last release to officially support Python 2.

What is Stemming?
-----------------

Stemming maps different forms of the same word to a common "stem" - for
example, the English stemmer maps *connection*, *connections*, *connective*,
*connected*, and *connecting* to *connect*.  So a search for *connected*
would also find documents which only have the other forms.

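The search behaviour described above can be sketched as a small inverted
index keyed by stem.  This is only an illustration and not part of the
library: ``build_index``, ``search`` and ``demo`` are hypothetical helper
names, and the stemming function is passed in as a parameter so that any
callable mapping a word to its stem can be used.

.. code-block:: python

   from collections import defaultdict


   def build_index(documents, stem_word):
       """Map each stem to the ids of the documents containing it."""
       index = defaultdict(set)
       for doc_id, text in documents.items():
           for word in text.lower().split():
               index[stem_word(word)].add(doc_id)
       return index


   def search(index, query, stem_word):
       """Return the ids of documents sharing a stem with the query word."""
       return index.get(stem_word(query.lower()), set())


   def demo():
       # Needs the snowballstemmer package; imported here so the helpers
       # above stay usable with any other stemming callable.
       import snowballstemmer

       stem_word = snowballstemmer.stemmer('english').stemWord
       documents = {
           1: "a faulty connection",
           2: "connecting the cables",
           3: "an unrelated note",
       }
       index = build_index(documents, stem_word)
       return search(index, "connected", stem_word)

With **snowballstemmer** installed, ``demo()`` looks up *connected* and, given
the conflation described above, should match the documents containing
*connection* and *connecting* but not the unrelated one.
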
This stem form is often a word itself, but this is not always the case as this
is not a requirement for text search systems, which are the intended field of
use.  We also aim to conflate words with the same meaning, rather than all
words with a common linguistic root (so *awe* and *awful* don't have the same
stem), and over-stemming is more problematic than under-stemming so we tend not
to stem in cases that are hard to resolve.  If you want to always reduce words
to a root form and/or get a root form which is itself a word then Snowball's
stemming algorithms likely aren't the right answer.

How to use the library
----------------------

The ``snowballstemmer`` module has two functions.

The ``snowballstemmer.algorithms`` function returns a list of available
algorithm names.

The ``snowballstemmer.stemmer`` function takes an algorithm name and returns a
``Stemmer`` object.

``Stemmer`` objects have a ``Stemmer.stemWord(word)`` method and a
``Stemmer.stemWords(word[])`` method.

.. code-block:: python

   import snowballstemmer

   stemmer = snowballstemmer.stemmer('english')
   print(stemmer.stemWords("We are the world".split()))

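The two functions can be combined to check an algorithm name before asking
for a stemmer.  The following is a minimal sketch, assuming that the names
returned by ``algorithms()`` are lowercase (worth verifying against your
installed version).  ``make_stemmer`` and its ``backend`` parameter are
hypothetical; ``backend`` only exists so the helper can be exercised with a
stand-in object instead of the real module.

.. code-block:: python

   def make_stemmer(language, backend=None):
       """Return a stemmer for `language`, or raise ValueError if unknown."""
       if backend is None:
           # The real library, imported lazily.
           import snowballstemmer as backend

       name = language.lower()  # assumption: algorithm names are lowercase
       available = backend.algorithms()
       if name not in available:
           raise ValueError(
               "unknown algorithm %r; available: %s"
               % (language, ", ".join(sorted(available)))
           )
       return backend.stemmer(name)

For example, ``make_stemmer('english').stemWord('connections')`` would stem a
single word, while an unrecognised name fails early with a message listing
the valid choices.
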
Automatic Acceleration
----------------------

`PyStemmer <https://pypi.org/project/PyStemmer/>`_ is a wrapper module for
Snowball's ``libstemmer_c`` and should provide results 100% compatible with
**snowballstemmer**.

**PyStemmer** is faster because it wraps generated C versions of the stemmers;
**snowballstemmer** uses generated Python code and is slower but offers a pure
Python solution.

If PyStemmer is installed, ``snowballstemmer.stemmer`` returns a ``PyStemmer``
``Stemmer`` object which provides the same ``Stemmer.stemWord()`` and
``Stemmer.stemWords()`` methods.

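To see which implementation you are likely to get, you can check whether
PyStemmer is importable.  This is a minimal sketch, assuming PyStemmer
installs its extension under the top-level module name ``Stemmer`` (the name
it is normally imported by); ``pystemmer_available`` is a hypothetical helper
name.

.. code-block:: python

   import importlib.util


   def pystemmer_available():
       """Return True if a top-level module named 'Stemmer' can be found."""
       # Assumption: PyStemmer is importable as ``Stemmer``.
       return importlib.util.find_spec("Stemmer") is not None


   if pystemmer_available():
       print("PyStemmer found: the C stemmers should be used")
   else:
       print("PyStemmer not found: the pure Python stemmers will be used")
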
Benchmark
~~~~~~~~~

This is a crude benchmark which measures the time for running each stemmer on
every word in its sample vocabulary (10,787,583 words over 26 languages).  It's
not a realistic test of normal use as a real application would do much more
than just stemming.  It's also skewed towards the stemmers which do more work
per word and towards those with larger sample vocabularies.

* Python 2.7 + **snowballstemmer** : 13m00s (15.0 * PyStemmer)
* Python 3.7 + **snowballstemmer** : 12m19s (14.2 * PyStemmer)
* PyPy 7.1.1 (Python 2.7.13) + **snowballstemmer** : 2m14s (2.6 * PyStemmer)
* PyPy 7.1.1 (Python 3.6.1) + **snowballstemmer** : 1m46s (2.0 * PyStemmer)
* Python 2.7 + **PyStemmer** : 52s

For reference the equivalent test for C runs in 9 seconds.

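Each multiplier in the list above is the listed time divided by the 52s
**PyStemmer** baseline.  The arithmetic can be reproduced with a short script;
``to_seconds`` is a helper name introduced here purely for illustration.

.. code-block:: python

   import re


   def to_seconds(duration):
       """Convert a string such as '13m00s' or '52s' to whole seconds."""
       match = re.fullmatch(r"(?:(\d+)m)?(\d+)s", duration)
       if match is None:
           raise ValueError("unrecognised duration: %r" % duration)
       minutes = int(match.group(1) or 0)
       return minutes * 60 + int(match.group(2))


   # Timings copied from the list above.
   timings = {
       "Python 2.7": "13m00s",
       "Python 3.7": "12m19s",
       "PyPy 7.1.1 (Python 2.7.13)": "2m14s",
       "PyPy 7.1.1 (Python 3.6.1)": "1m46s",
   }
   baseline = to_seconds("52s")  # Python 2.7 + PyStemmer

   ratios = {
       name: round(to_seconds(value) / baseline, 1)
       for name, value in timings.items()
   }
   for name, ratio in ratios.items():
       print("%s: %.1f * PyStemmer" % (name, ratio))
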
These results are for Snowball 2.0.0.  They're likely to evolve over time as
the code Snowball generates for both Python and C continues to improve (for
a much older test over a different set of stemmers using Python 2.7,
**snowballstemmer** was 30 times slower than **PyStemmer**, or 9 times slower
with **PyPy**).

The message to take away is that if you're stemming a lot of words you should
either install **PyStemmer** (which **snowballstemmer** will then automatically
use for you as described above) or use PyPy.

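Before deciding, it may be worth measuring on your own vocabulary.  The
following is a minimal timing sketch rather than the benchmark used above:
``time_stemming`` is a hypothetical helper that takes any callable accepting a
list of words, such as a ``Stemmer.stemWords`` method.

.. code-block:: python

   import time


   def time_stemming(stem_words, words, repeats=3):
       """Return the best wall-clock time in seconds for stem_words(words)."""
       best = float("inf")
       for _ in range(repeats):
           start = time.perf_counter()
           stem_words(words)
           best = min(best, time.perf_counter() - start)
       return best

For example, ``time_stemming(snowballstemmer.stemmer('english').stemWords,
words)`` times the English stemmer, where ``words`` is a list of strings taken
from your own data.
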
The TestApp example
-------------------

The ``testapp.py`` example program allows you to run any of the stemmers
on a sample vocabulary.

Usage::

   testapp.py <algorithm> "sentences ... "

.. code-block:: bash

   $ python testapp.py English "sentences... "