=======================================================
smart_open — utils for streaming large files in Python
=======================================================


|License|_ |GHA|_ |Coveralls|_ |Downloads|_

.. |License| image:: https://img.shields.io/pypi/l/smart_open.svg
.. |GHA| image:: https://github.com/RaRe-Technologies/smart_open/workflows/Test/badge.svg
.. |Coveralls| image:: https://coveralls.io/repos/github/RaRe-Technologies/smart_open/badge.svg?branch=develop
.. |Downloads| image:: https://pepy.tech/badge/smart-open/month
.. _License: https://github.com/RaRe-Technologies/smart_open/blob/master/LICENSE
.. _GHA: https://github.com/RaRe-Technologies/smart_open/actions?query=workflow%3ATest
.. _Coveralls: https://coveralls.io/github/RaRe-Technologies/smart_open?branch=HEAD
.. _Downloads: https://pypi.org/project/smart-open/


What?
=====

``smart_open`` is a Python 3 library for **efficient streaming of very large files** from/to storages such as S3, GCS, Azure Blob Storage, HDFS, WebHDFS, HTTP, HTTPS, SFTP, or local filesystem. It supports transparent, on-the-fly (de-)compression for a variety of different formats.

``smart_open`` is a drop-in replacement for Python's built-in ``open()``: it can do anything ``open`` can (100% compatible, falls back to native ``open`` wherever possible), plus lots of nifty extra stuff on top.

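For local files, it behaves exactly like the built-in ``open``. A quick illustration (the file name below is just a placeholder):

.. code-block:: python

    from smart_open import open

    # reads and writes local files just like the built-in open()
    with open('example.txt', 'w', encoding='utf-8') as fout:
        chars_written = fout.write('hello world\n')
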
**Python 2.7 is no longer supported. If you need Python 2.7, please use** `smart_open 1.10.1 <https://github.com/RaRe-Technologies/smart_open/releases/tag/1.10.1>`_, **the last version to support Python 2.**

Why?
====

Working with large remote files, for example using Amazon's `boto3 <https://boto3.amazonaws.com/v1/documentation/api/latest/index.html>`_ Python library, is a pain.
``boto3``'s ``Object.upload_fileobj()`` and ``Object.download_fileobj()`` methods require gotcha-prone boilerplate to use successfully, such as constructing file-like object wrappers.
``smart_open`` shields you from that. It builds on boto3 and other remote storage libraries, but offers a **clean unified Pythonic API**. The result is less code for you to write and fewer bugs to make.

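To see the difference, compare a raw ``boto3`` download against the ``smart_open`` equivalent. A sketch, assuming a public bucket and key you can read:

.. code-block:: python

    import io
    import boto3
    from smart_open import open

    # plain boto3: create a client, allocate a buffer, download, rewind, wrap
    buf = io.BytesIO()
    boto3.client('s3').download_fileobj('commoncrawl', 'robots.txt', buf)
    buf.seek(0)
    first_line = io.TextIOWrapper(buf, encoding='utf-8').readline()

    # smart_open: one line, same result
    first_line = next(open('s3://commoncrawl/robots.txt'))
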
How?
=====

``smart_open`` is well-tested, well-documented, and has a simple Pythonic API:


.. _doctools_before_examples:

.. code-block:: python

  >>> from smart_open import open
  >>>
  >>> # stream lines from an S3 object
  >>> for line in open('s3://commoncrawl/robots.txt'):
  ...    print(repr(line))
  ...    break
  'User-Agent: *\n'

  >>> # stream from/to compressed files, with transparent (de)compression:
  >>> for line in open('smart_open/tests/test_data/1984.txt.gz', encoding='utf-8'):
  ...    print(repr(line))
  'It was a bright cold day in April, and the clocks were striking thirteen.\n'
  'Winston Smith, his chin nuzzled into his breast in an effort to escape the vile\n'
  'wind, slipped quickly through the glass doors of Victory Mansions, though not\n'
  'quickly enough to prevent a swirl of gritty dust from entering along with him.\n'

  >>> # can use context managers too:
  >>> with open('smart_open/tests/test_data/1984.txt.gz') as fin:
  ...    with open('smart_open/tests/test_data/1984.txt.bz2', 'w') as fout:
  ...        for line in fin:
  ...           fout.write(line)
  74
  80
  78
  79

  >>> # can use any IOBase operations, like seek
  >>> with open('s3://commoncrawl/robots.txt', 'rb') as fin:
  ...     for line in fin:
  ...         print(repr(line.decode('utf-8')))
  ...         break
  ...     offset = fin.seek(0)  # seek to the beginning
  ...     print(fin.read(4))
  'User-Agent: *\n'
  b'User'

  >>> # stream from HTTP
  >>> for line in open('http://example.com/index.html'):
  ...     print(repr(line))
  ...     break
  '<!doctype html>\n'

.. _doctools_after_examples:

Other examples of URLs that ``smart_open`` accepts::

    s3://my_bucket/my_key
    s3://my_key:my_secret@my_bucket/my_key
    s3://my_key:my_secret@my_server:my_port@my_bucket/my_key
    gs://my_bucket/my_blob
    azure://my_bucket/my_blob
    hdfs:///path/file
    hdfs://path/file
    webhdfs://host:port/path/file
    ./local/path/file
    ~/local/path/file
    local/path/file
    ./local/path/file.gz
    file:///home/user/file
    file:///home/user/file.bz2
    [ssh|scp|sftp]://username@host//path/file
    [ssh|scp|sftp]://username@host/path/file
    [ssh|scp|sftp]://username:password@host/path/file

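For example, the SSH/SFTP schemes from the list above work like any other. A sketch, assuming you have SSH access to the host and the SSH dependencies installed:

.. code-block:: python

    from smart_open import open

    # host, username and path below are placeholders
    for line in open('sftp://username@host/path/file'):
        print(line)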

Documentation
=============

Installation
------------

``smart_open`` supports a wide range of storage solutions, including AWS S3, Google Cloud and Azure.
Each individual solution has its own dependencies.
By default, ``smart_open`` does not install any dependencies, in order to keep the installation size small.
You can install these dependencies explicitly using::

    pip install smart_open[azure]  # Install Azure deps
    pip install smart_open[gcs]    # Install GCS deps
    pip install smart_open[s3]     # Install S3 deps

Or, if you don't mind installing a large number of third party libraries, you can install all dependencies using::

    pip install smart_open[all]

Be warned that this option increases the installation size significantly, e.g. by over 100MB.

If you're upgrading from ``smart_open`` versions 2.x and below, please check out the `Migration Guide <MIGRATING_FROM_OLDER_VERSIONS.rst>`_.

Built-in help
-------------

For detailed API info, see the online help:

.. code-block:: python

    help('smart_open')

or click `here <https://github.com/RaRe-Technologies/smart_open/blob/master/help.txt>`__ to view the help in your browser.

More examples
-------------

For the sake of simplicity, the examples below assume you have all the dependencies installed, i.e. you have done::

    pip install smart_open[all]

.. code-block:: python

    >>> import os, boto3
    >>>
    >>> # stream content *into* S3 (write mode) using a custom session
    >>> session = boto3.Session(
    ...     aws_access_key_id=os.environ['AWS_ACCESS_KEY_ID'],
    ...     aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY'],
    ... )
    >>> url = 's3://smart-open-py37-benchmark-results/test.txt'
    >>> with open(url, 'wb', transport_params={'client': session.client('s3')}) as fout:
    ...     bytes_written = fout.write(b'hello world!')
    ...     print(bytes_written)
    12

.. code-block:: python

    # in addition to smart_open's open, the examples below need:
    import os, boto3
    import azure.storage.blob

    # stream from HDFS
    for line in open('hdfs://user/hadoop/my_file.txt', encoding='utf8'):
        print(line)

    # stream from WebHDFS
    for line in open('webhdfs://host:port/user/hadoop/my_file.txt'):
        print(line)

    # stream content *into* HDFS (write mode):
    with open('hdfs://host:port/user/hadoop/my_file.txt', 'wb') as fout:
        fout.write(b'hello world')

    # stream content *into* WebHDFS (write mode):
    with open('webhdfs://host:port/user/hadoop/my_file.txt', 'wb') as fout:
        fout.write(b'hello world')

    # stream from a completely custom s3 server, like s3proxy:
    for line in open('s3u://user:secret@host:port@mybucket/mykey.txt'):
        print(line)

    # stream to a Digital Ocean Spaces bucket, providing credentials from a boto3 profile
    session = boto3.Session(profile_name='digitalocean')
    client = session.client('s3', endpoint_url='https://ams3.digitaloceanspaces.com')
    transport_params = {'client': client}
    with open('s3://bucket/key.txt', 'wb', transport_params=transport_params) as fout:
        fout.write(b'here we stand')

    # stream from GCS
    for line in open('gs://my_bucket/my_file.txt'):
        print(line)

    # stream content *into* GCS (write mode):
    with open('gs://my_bucket/my_file.txt', 'wb') as fout:
        fout.write(b'hello world')

    # stream from Azure Blob Storage
    connect_str = os.environ['AZURE_STORAGE_CONNECTION_STRING']
    transport_params = {
        'client': azure.storage.blob.BlobServiceClient.from_connection_string(connect_str),
    }
    for line in open('azure://mycontainer/myfile.txt', transport_params=transport_params):
        print(line)

    # stream content *into* Azure Blob Storage (write mode):
    connect_str = os.environ['AZURE_STORAGE_CONNECTION_STRING']
    transport_params = {
        'client': azure.storage.blob.BlobServiceClient.from_connection_string(connect_str),
    }
    with open('azure://mycontainer/my_file.txt', 'wb', transport_params=transport_params) as fout:
        fout.write(b'hello world')

Compression Handling
--------------------

The top-level ``compression`` parameter controls compression/decompression behavior when reading and writing.
The supported values for this parameter are:

- ``infer_from_extension`` (default behavior)
- ``disable``
- ``.gz``
- ``.bz2``

By default, ``smart_open`` determines the compression algorithm to use based on the file extension.

.. code-block:: python

    >>> from smart_open import open, register_compressor
    >>> with open('smart_open/tests/test_data/1984.txt.gz') as fin:
    ...     print(fin.read(32))
    It was a bright cold day in Apri

You can override this behavior to either disable compression, or explicitly specify the algorithm to use.
To disable compression:

.. code-block:: python

    >>> from smart_open import open, register_compressor
    >>> with open('smart_open/tests/test_data/1984.txt.gz', 'rb', compression='disable') as fin:
    ...     print(fin.read(32))
    b'\x1f\x8b\x08\x08\x85F\x94\\\x00\x031984.txt\x005\x8f=r\xc3@\x08\x85{\x9d\xe2\x1d@'


To specify the algorithm explicitly (e.g. for non-standard file extensions):

.. code-block:: python

    >>> from smart_open import open, register_compressor
    >>> with open('smart_open/tests/test_data/1984.txt.gzip', compression='.gz') as fin:
    ...     print(fin.read(32))
    It was a bright cold day in Apri

You can also easily add support for other file extensions and compression formats.
For example, to open xz-compressed files:

.. code-block:: python

    >>> import lzma, os
    >>> from smart_open import open, register_compressor

    >>> def _handle_xz(file_obj, mode):
    ...      return lzma.LZMAFile(filename=file_obj, mode=mode, format=lzma.FORMAT_XZ)

    >>> register_compressor('.xz', _handle_xz)

    >>> with open('smart_open/tests/test_data/1984.txt.xz') as fin:
    ...     print(fin.read(32))
    It was a bright cold day in Apri

``lzma`` is in the standard library since Python 3.3.


Transport-specific Options
--------------------------

``smart_open`` supports a wide range of transport options out of the box, including:

- S3
- HTTP, HTTPS (read-only)
- SSH, SCP and SFTP
- WebHDFS
- GCS
- Azure Blob Storage

Each option involves setting up its own set of parameters.
For example, for accessing S3, you often need to set up authentication, like API keys or a profile name.
``smart_open``'s ``open`` function accepts a keyword argument ``transport_params`` which accepts additional parameters for the transport layer.
Here are some examples of using this parameter:

.. code-block:: python

  >>> import boto3
  >>> fin = open('s3://commoncrawl/robots.txt', transport_params=dict(client=boto3.client('s3')))
  >>> fin = open('s3://commoncrawl/robots.txt', transport_params=dict(buffer_size=1024))

For the full list of keyword arguments supported by each transport option, see the documentation:

.. code-block:: python

  help('smart_open.open')

S3 Credentials
--------------

``smart_open`` uses the ``boto3`` library to talk to S3.
``boto3`` has several `mechanisms <https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html>`__ for determining the credentials to use.
By default, ``smart_open`` will defer to ``boto3`` and let the latter take care of the credentials.
There are several ways to override this behavior.

The first is to pass a ``boto3.Client`` object as a transport parameter to the ``open`` function.
You can customize the credentials when constructing the session for the client.
``smart_open`` will then use the session when talking to S3.

.. code-block:: python

    session = boto3.Session(
        aws_access_key_id=ACCESS_KEY,
        aws_secret_access_key=SECRET_KEY,
        aws_session_token=SESSION_TOKEN,
    )
    client = session.client('s3', endpoint_url=..., config=...)
    fin = open('s3://bucket/key', transport_params=dict(client=client))

Your second option is to specify the credentials within the S3 URL itself:

.. code-block:: python

    fin = open('s3://aws_access_key_id:aws_secret_access_key@bucket/key', ...)

*Important*: The two methods above are **mutually exclusive**. If you pass an AWS client *and* the URL contains credentials, ``smart_open`` will ignore the latter.

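For instance, a minimal sketch of that precedence rule (bucket and key names are placeholders):

.. code-block:: python

    import boto3
    from smart_open import open

    # the key/secret embedded in the URL are ignored,
    # because an explicit client is also passed
    fin = open(
        's3://url_key:url_secret@bucket/key',
        transport_params=dict(client=boto3.client('s3')),
    )
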
*Important*: ``smart_open`` ignores configuration files from the older ``boto`` library.
Port your old ``boto`` settings to ``boto3`` in order to use them with ``smart_open``.

Iterating Over an S3 Bucket's Contents
--------------------------------------

Since going over all (or select) keys in an S3 bucket is a very common operation, there's also an extra function ``smart_open.s3.iter_bucket()`` that does this efficiently, **processing the bucket keys in parallel** (using multiprocessing):

.. code-block:: python

  >>> from smart_open import s3
  >>> # get data corresponding to 2010 and later under "silo-open-data/annual/monthly_rain"
  >>> # we use workers=1 for reproducibility; you should use as many workers as you have cores
  >>> bucket = 'silo-open-data'
  >>> prefix = 'annual/monthly_rain/'
  >>> for key, content in s3.iter_bucket(bucket, prefix=prefix, accept_key=lambda key: '/201' in key, workers=1, key_limit=3):
  ...     print(key, round(len(content) / 2**20))
  annual/monthly_rain/2010.monthly_rain.nc 13
  annual/monthly_rain/2011.monthly_rain.nc 13
  annual/monthly_rain/2012.monthly_rain.nc 13

GCS Credentials
---------------

``smart_open`` uses the ``google-cloud-storage`` library to talk to GCS.
``google-cloud-storage`` uses the ``google-cloud`` package under the hood to handle authentication.
There are several `options <https://googleapis.dev/python/google-api-core/latest/auth.html>`__ to provide
credentials.
By default, ``smart_open`` will defer to ``google-cloud-storage`` and let it take care of the credentials.

To override this behavior, pass a ``google.cloud.storage.Client`` object as a transport parameter to the ``open`` function.
You can `customize the credentials <https://googleapis.dev/python/storage/latest/client.html>`__
when constructing the client. ``smart_open`` will then use the client when talking to GCS. To follow along with
the example below, `refer to Google's guide <https://cloud.google.com/storage/docs/reference/libraries#setting_up_authentication>`__
to setting up GCS authentication with a service account.

.. code-block:: python

    import os
    from google.cloud.storage import Client
    service_account_path = os.environ['GOOGLE_APPLICATION_CREDENTIALS']
    client = Client.from_service_account_json(service_account_path)
    fin = open('gs://gcp-public-data-landsat/index.csv.gz', transport_params=dict(client=client))

If you need more credential options, you can create an explicit ``google.auth.credentials.Credentials`` object
and pass it to the Client. To create an API token for use in the example below, refer to the
`GCS authentication guide <https://cloud.google.com/storage/docs/authentication#apiauth>`__.

.. code-block:: python

    import os
    # google.oauth2.credentials.Credentials is a concrete subclass of
    # the abstract google.auth.credentials.Credentials
    from google.oauth2.credentials import Credentials
    from google.cloud.storage import Client
    token = os.environ['GOOGLE_API_TOKEN']
    credentials = Credentials(token=token)
    client = Client(credentials=credentials)
    fin = open('gs://gcp-public-data-landsat/index.csv.gz', transport_params=dict(client=client))

Azure Credentials
-----------------

``smart_open`` uses the ``azure-storage-blob`` library to talk to Azure Blob Storage.
By default, ``smart_open`` will defer to ``azure-storage-blob`` and let it take care of the credentials.

Azure Blob Storage does not have any way of inferring credentials; therefore, passing an ``azure.storage.blob.BlobServiceClient``
object as a transport parameter to the ``open`` function is required.
You can `customize the credentials <https://docs.microsoft.com/en-us/azure/storage/common/storage-samples-python#authentication>`__
when constructing the client. ``smart_open`` will then use the client when talking to Azure Blob Storage. To follow along with
the example below, `refer to Azure's guide <https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-python#copy-your-credentials-from-the-azure-portal>`__
to setting up authentication.

.. code-block:: python

    import os
    from azure.storage.blob import BlobServiceClient
    azure_storage_connection_string = os.environ['AZURE_STORAGE_CONNECTION_STRING']
    client = BlobServiceClient.from_connection_string(azure_storage_connection_string)
    fin = open('azure://my_container/my_blob.txt', transport_params=dict(client=client))

If you need more credential options, refer to the
`Azure Storage authentication guide <https://docs.microsoft.com/en-us/azure/storage/common/storage-samples-python#authentication>`__.

File-like Binary Streams
------------------------

The ``open`` function also accepts file-like objects.
This is useful when you already have a `binary file <https://docs.python.org/3/glossary.html#term-binary-file>`_ open, and would like to wrap it with transparent decompression:


.. code-block:: python

    >>> import io, gzip
    >>>
    >>> # Prepare some gzipped binary data in memory, as an example.
    >>> # Any binary file will do; we're using BytesIO here for simplicity.
    >>> buf = io.BytesIO()
    >>> with gzip.GzipFile(fileobj=buf, mode='w') as fout:
    ...     _ = fout.write(b'this is a bytestring')
    >>> _ = buf.seek(0)
    >>>
    >>> # Use case starts here.
    >>> buf.name = 'file.gz'  # add a .name attribute so smart_open knows what compressor to use
    >>> import smart_open
    >>> smart_open.open(buf, 'rb').read()  # will gzip-decompress transparently!
    b'this is a bytestring'


In this case, ``smart_open`` relied on the ``.name`` attribute of our `binary I/O stream <https://docs.python.org/3/library/io.html#binary-i-o>`_ ``buf`` object to determine which decompressor to use.
If your file object doesn't have one, set the ``.name`` attribute to an appropriate value.
Furthermore, that value has to end with a **known** file extension (see the ``register_compressor`` function).
Otherwise, the transparent decompression will not occur.

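A minimal sketch of that failure mode, reusing the in-memory buffer from the example above: without a usable ``.name``, no compressor can be inferred, so the raw gzipped bytes should come back verbatim (``\x1f\x8b`` is the gzip magic number).

.. code-block:: python

    >>> _ = buf.seek(0)   # rewind the buffer from the previous example
    >>> del buf.name      # no name, so no compressor can be inferred
    >>> smart_open.open(buf, 'rb').read()[:2]  # raw gzip bytes, not decompressed
    b'\x1f\x8b'
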
Drop-in replacement of ``pathlib.Path.open``
--------------------------------------------

``smart_open.open`` can also be used with ``Path`` objects.
The built-in ``Path.open()`` is not able to read text from compressed files, so use ``patch_pathlib`` to replace it with ``smart_open.open()`` instead.
This can be helpful when, e.g., working with compressed files.

.. code-block:: python

    >>> from pathlib import Path
    >>> from smart_open.smart_open_lib import patch_pathlib
    >>>
    >>> _ = patch_pathlib()  # replace `Path.open` with `smart_open.open`
    >>>
    >>> path = Path("smart_open/tests/test_data/crime-and-punishment.txt.gz")
    >>>
    >>> with path.open("r") as infile:
    ...     print(infile.readline()[:41])
    В начале июля, в чрезвычайно жаркое время

How do I ...?
=============

See `this document <howto.md>`__.

Extending ``smart_open``
========================

See `this document <extending.md>`__.

Testing ``smart_open``
======================

``smart_open`` comes with a comprehensive suite of unit tests.
Before you can run the test suite, install the test dependencies::

    pip install -e .[test]

Now, you can run the unit tests::

    pytest smart_open

The tests are also run automatically with `GitHub Actions <https://github.com/RaRe-Technologies/smart_open/actions>`_ on every commit push & pull request.

Comments, bug reports
=====================

``smart_open`` lives on `GitHub <https://github.com/RaRe-Technologies/smart_open>`_. You can file
issues or pull requests there. Suggestions, pull requests and improvements welcome!

----------------

``smart_open`` is open source software released under the `MIT license <https://github.com/piskvorky/smart_open/blob/master/LICENSE>`_.
Copyright (c) 2015-now `Radim Řehůřek <https://radimrehurek.com>`_.
