scandir, a better directory iterator and faster os.walk()
=========================================================

.. image:: https://img.shields.io/pypi/v/scandir.svg
   :target: https://pypi.python.org/pypi/scandir
   :alt: scandir on PyPI (Python Package Index)

.. image:: https://travis-ci.org/benhoyt/scandir.svg?branch=master
   :target: https://travis-ci.org/benhoyt/scandir
   :alt: Travis CI tests (Linux)

.. image:: https://ci.appveyor.com/api/projects/status/github/benhoyt/scandir?branch=master&svg=true
   :target: https://ci.appveyor.com/project/benhoyt/scandir
   :alt: Appveyor tests (Windows)


``scandir()`` is a directory iteration function like ``os.listdir()``,
except that instead of returning a list of bare filenames, it yields
``DirEntry`` objects that include file type and stat information along
with the name. Using ``scandir()`` increases the speed of ``os.walk()``
by 2-20 times (depending on the platform and file system) by avoiding
unnecessary calls to ``os.stat()`` in most cases.


Now included in a Python near you!
----------------------------------

``scandir`` has been included in the Python 3.5 standard library as
``os.scandir()``, along with the related performance improvements to
``os.walk()``. If you're using Python 3.5 (released September 13, 2015)
or later, you get the benefit immediately. Otherwise,
`download this module from PyPI <https://pypi.python.org/pypi/scandir>`_,
install it with ``pip install scandir``, and then do something like
this in your code:

.. code-block:: python

    # Use the built-in version of scandir/walk if possible, otherwise
    # use the scandir module version
    try:
        from os import scandir, walk
    except ImportError:
        from scandir import scandir, walk

`PEP 471 <https://www.python.org/dev/peps/pep-0471/>`_, the PEP that
proposed including ``scandir`` in the Python standard library, was
`accepted <https://mail.python.org/pipermail/python-dev/2014-July/135561.html>`_
in July 2014 by Victor Stinner, the BDFL-delegate for the PEP.

This ``scandir`` module is intended to work on Python 2.6+ and Python
3.2+ (and it has been tested on those versions).


Background
----------

In Python versions before 3.5, the built-in ``os.walk()`` is significantly
slower than it needs to be, because -- in addition to calling ``listdir()`` on
each directory -- it calls ``stat()`` on each file to determine whether the
filename is a directory or not. But both ``FindFirstFile`` / ``FindNextFile``
on Windows and ``readdir`` on Linux/OS X already tell you whether the files
returned are directories or not, so no further ``stat`` system calls are
needed. In short, you can reduce the number of system calls from about 2N to N,
where N is the total number of files and directories in the tree.

In practice, removing all those extra system calls makes ``os.walk()`` about
**7-50 times as fast on Windows, and about 3-10 times as fast on Linux and Mac OS
X.** So we're not talking about micro-optimizations. See more benchmarks
in the "Benchmarks" section below.
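
The difference between the two approaches can be sketched as follows. This is
an illustrative comparison, not the actual ``os.walk()`` source; the helper
names ``split_with_listdir`` and ``split_with_scandir`` are hypothetical:

.. code-block:: python

    import os

    try:
        from os import scandir
    except ImportError:
        from scandir import scandir  # the module described in this README


    def split_with_listdir(path):
        """Older approach: one listdir() call plus one stat() per entry."""
        dirs, files = [], []
        for name in os.listdir(path):
            # os.path.isdir() makes a stat() system call for every name
            if os.path.isdir(os.path.join(path, name)):
                dirs.append(name)
            else:
                files.append(name)
        return sorted(dirs), sorted(files)


    def split_with_scandir(path):
        """scandir approach: the file type usually arrives with the listing."""
        dirs, files = [], []
        for entry in scandir(path):
            # is_dir() can normally answer from data the OS already returned
            if entry.is_dir():
                dirs.append(entry.name)
            else:
                files.append(entry.name)
        return sorted(dirs), sorted(files)

Both helpers should return the same result for a given directory; only the
number of system calls needed to get there differs.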

Somewhat relatedly, many people have also asked for a version of
``os.listdir()`` that yields filenames as it iterates instead of returning them
as one big list. This improves memory efficiency for iterating very large
directories.

So as well as a faster ``walk()``, scandir adds a new ``scandir()`` function.
They're pretty easy to use, but see "The API" below for the full docs.
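
As a rough illustration of the memory benefit, the sketch below reads only the
first few names from a directory without building a list of every entry. It
assumes Python 3.6 or later, where ``os.scandir()`` can be used in a ``with``
statement so the directory handle is closed promptly; ``first_names`` is a
hypothetical helper, not part of this module:

.. code-block:: python

    import itertools
    import os


    def first_names(path, limit):
        """Return up to `limit` entry names from `path`, in system-dependent order."""
        # scandir() yields entries lazily, so islice() stops after `limit`
        # of them instead of reading the whole directory first.
        with os.scandir(path) as entries:  # "with" support assumes Python 3.6+
            return [entry.name for entry in itertools.islice(entries, limit)]

With ``os.listdir()``, the same task would read every name in the directory
before any of them could be inspected.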


Benchmarks
----------

The table below shows how many times as fast ``scandir.walk()`` is compared
with ``os.walk()`` on various systems, measured by running ``benchmark.py``
with no arguments:

====================   ==============   =============
System version         Python version   Times as fast
====================   ==============   =============
Windows 7 64-bit       2.7.7 64-bit     10.4
Windows 7 64-bit SSD   2.7.7 64-bit     10.3
Windows 7 64-bit NFS   2.7.6 64-bit     36.8
Windows 7 64-bit SSD   3.4.1 64-bit     9.9
Windows 7 64-bit SSD   3.5.0 64-bit     9.5
CentOS 6.2 64-bit      2.6.6 64-bit     3.9
Ubuntu 14.04 64-bit    2.7.6 64-bit     5.8
Mac OS X 10.9.3        2.7.5 64-bit     3.8
====================   ==============   =============

All of the above tests were done using the fast C version of scandir
(source code in ``_scandir.c``).

Note that the gains are smaller than the above on smaller directories and
greater on larger directories. This is why ``benchmark.py`` creates a test
directory tree with a standardized size.


The API
-------

walk()
~~~~~~

The API for ``scandir.walk()`` is exactly the same as ``os.walk()``, so just
`read the Python docs <https://docs.python.org/3.5/library/os.html#os.walk>`_.
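
Since the interface matches ``os.walk()``, existing code should work unchanged.
As a short sketch (the ``count_files`` helper is hypothetical), this counts the
files in a tree while skipping directories whose names start with a dot:

.. code-block:: python

    try:
        from os import walk
    except ImportError:
        from scandir import walk


    def count_files(top):
        """Count files under `top`, ignoring directories that start with '.'."""
        total = 0
        for dirpath, dirnames, filenames in walk(top):
            # Editing dirnames in place prunes the walk (top-down mode only)
            dirnames[:] = [d for d in dirnames if not d.startswith('.')]
            total += len(filenames)
        return total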

scandir()
~~~~~~~~~

The full docs for ``scandir()`` and the ``DirEntry`` objects it yields are
available in the `Python documentation here <https://docs.python.org/3.5/library/os.html#os.scandir>`_.
A brief summary follows::

    scandir(path='.') -> iterator of DirEntry objects for given path

Like ``listdir``, ``scandir`` calls the operating system's directory
iteration system calls to get the names of the files in the given
``path``, but it's different from ``listdir`` in two ways:

* Instead of returning bare filename strings, it returns lightweight
  ``DirEntry`` objects that hold the filename string and provide
  simple methods that allow access to the additional data the
  operating system may have returned.

* It returns a generator instead of a list, so that ``scandir`` acts
  as a true iterator instead of returning the full list immediately.

``scandir()`` yields a ``DirEntry`` object for each file and
sub-directory in ``path``. Just like ``listdir``, the ``'.'``
and ``'..'`` pseudo-directories are skipped, and the entries are
yielded in system-dependent order. Each ``DirEntry`` object has the
following attributes and methods:

* ``name``: the entry's filename, relative to the scandir ``path``
  argument (corresponds to the return values of ``os.listdir``)

* ``path``: the entry's full path name (not necessarily an absolute
  path) -- the equivalent of ``os.path.join(scandir_path, entry.name)``

* ``is_dir(*, follow_symlinks=True)``: similar to
  ``pathlib.Path.is_dir()``, but the return value is cached on the
  ``DirEntry`` object; doesn't require a system call in most cases;
  does not follow symbolic links if ``follow_symlinks`` is False

* ``is_file(*, follow_symlinks=True)``: similar to
  ``pathlib.Path.is_file()``, but the return value is cached on the
  ``DirEntry`` object; doesn't require a system call in most cases;
  does not follow symbolic links if ``follow_symlinks`` is False

* ``is_symlink()``: similar to ``pathlib.Path.is_symlink()``, but the
  return value is cached on the ``DirEntry`` object; doesn't require a
  system call in most cases

* ``stat(*, follow_symlinks=True)``: like ``os.stat()``, but the
  return value is cached on the ``DirEntry`` object; does not require a
  system call on Windows (except for symlinks); does not follow symbolic
  links (like ``os.lstat()``) if ``follow_symlinks`` is False

* ``inode()``: return the inode number of the entry; the return value
  is cached on the ``DirEntry`` object

Here's a very simple example of ``scandir()`` showing use of the
``DirEntry.name`` attribute and the ``DirEntry.is_dir()`` method:

.. code-block:: python

    import os

    def subdirs(path):
        """Yield directory names not starting with '.' under given path."""
        for entry in os.scandir(path):
            if not entry.name.startswith('.') and entry.is_dir():
                yield entry.name

This ``subdirs()`` function will be significantly faster with scandir
than an equivalent written with ``os.listdir()`` and ``os.path.isdir()``,
on both Windows and POSIX systems, especially on medium-sized or large
directories.
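
The ``path`` attribute and the ``stat()`` method can be used in a similar way.
As a minimal sketch (``get_tree_size`` is an illustrative helper, not part of
this module's API), the following adds up file sizes in a tree without
following symbolic links:

.. code-block:: python

    try:
        from os import scandir
    except ImportError:
        from scandir import scandir


    def get_tree_size(path):
        """Return the total size in bytes of the files under `path`."""
        total = 0
        for entry in scandir(path):
            if entry.is_dir(follow_symlinks=False):
                # Recurse using the entry's joined path
                total += get_tree_size(entry.path)
            else:
                # The stat result is cached on the DirEntry object
                total += entry.stat(follow_symlinks=False).st_size
        return total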


Further reading
---------------

* `The Python docs for scandir <https://docs.python.org/3.5/library/os.html#os.scandir>`_
* `PEP 471 <https://www.python.org/dev/peps/pep-0471/>`_, the
  (now-accepted) Python Enhancement Proposal that proposed adding
  ``scandir`` to the standard library -- a lot of details here,
  including rejected ideas and previous discussion


Flames, comments, bug reports
-----------------------------

Please send flames, comments, and questions about scandir to Ben Hoyt:

http://benhoyt.com/

File bug reports for the version in the Python 3.5 standard library
`here <https://docs.python.org/3.5/bugs.html>`_, or file bug reports
or feature requests for this module at the GitHub project page:

https://github.com/benhoyt/scandir