======================================================
smart_open — utils for streaming large files in Python
======================================================


|License|_ |GHA|_ |Coveralls|_ |Downloads|_

.. |License| image:: https://img.shields.io/pypi/l/smart_open.svg
.. |GHA| image:: https://github.com/RaRe-Technologies/smart_open/workflows/Test/badge.svg
.. |Coveralls| image:: https://coveralls.io/repos/github/RaRe-Technologies/smart_open/badge.svg?branch=develop
.. |Downloads| image:: https://pepy.tech/badge/smart-open/month
.. _License: https://github.com/RaRe-Technologies/smart_open/blob/master/LICENSE
.. _GHA: https://github.com/RaRe-Technologies/smart_open/actions?query=workflow%3ATest
.. _Coveralls: https://coveralls.io/github/RaRe-Technologies/smart_open?branch=HEAD
.. _Downloads: https://pypi.org/project/smart-open/


What?
=====

``smart_open`` is a Python 3 library for **efficient streaming of very large files** from/to storage services such as S3, GCS, Azure Blob Storage, HDFS, WebHDFS, HTTP, HTTPS, SFTP, or the local filesystem. It supports transparent, on-the-fly (de-)compression for a variety of formats.

``smart_open`` is a drop-in replacement for Python's built-in ``open()``: it can do anything ``open`` can (100% compatible, falls back to native ``open`` wherever possible), plus lots of nifty extra stuff on top.

**Python 2.7 is no longer supported. If you need Python 2.7, please use** `smart_open 1.10.1 <https://github.com/RaRe-Technologies/smart_open/releases/tag/1.10.0>`_, **the last version to support Python 2.**

Why?
====

Working with large remote files, for example using Amazon's `boto3 <https://boto3.amazonaws.com/v1/documentation/api/latest/index.html>`_ Python library, is a pain.
``boto3``'s ``Object.upload_fileobj()`` and ``Object.download_fileobj()`` methods require gotcha-prone boilerplate to use successfully, such as constructing file-like object wrappers.
``smart_open`` shields you from that. It builds on boto3 and other remote storage libraries, but offers a **clean unified Pythonic API**. The result is less code for you to write and fewer bugs to make.


How?
====

``smart_open`` is well-tested, well-documented, and has a simple Pythonic API:


.. _doctools_before_examples:

.. code-block:: python

    >>> from smart_open import open
    >>>
    >>> # stream lines from an S3 object
    >>> for line in open('s3://commoncrawl/robots.txt'):
    ...     print(repr(line))
    ...     break
    'User-Agent: *\n'

    >>> # stream from/to compressed files, with transparent (de)compression:
    >>> for line in open('smart_open/tests/test_data/1984.txt.gz', encoding='utf-8'):
    ...     print(repr(line))
    'It was a bright cold day in April, and the clocks were striking thirteen.\n'
    'Winston Smith, his chin nuzzled into his breast in an effort to escape the vile\n'
    'wind, slipped quickly through the glass doors of Victory Mansions, though not\n'
    'quickly enough to prevent a swirl of gritty dust from entering along with him.\n'

    >>> # can use context managers too:
    >>> with open('smart_open/tests/test_data/1984.txt.gz') as fin:
    ...     with open('smart_open/tests/test_data/1984.txt.bz2', 'w') as fout:
    ...         for line in fin:
    ...             fout.write(line)
    74
    80
    78
    79

    >>> # can use any IOBase operations, like seek
    >>> with open('s3://commoncrawl/robots.txt', 'rb') as fin:
    ...     for line in fin:
    ...         print(repr(line.decode('utf-8')))
    ...         break
    ...     offset = fin.seek(0)  # seek to the beginning
    ...     print(fin.read(4))
    'User-Agent: *\n'
    b'User'

    >>> # stream from HTTP
    >>> for line in open('http://example.com/index.html'):
    ...     print(repr(line))
    ...     break
    '<!doctype html>\n'

.. _doctools_after_examples:

Other examples of URLs that ``smart_open`` accepts::

    s3://my_bucket/my_key
    s3://my_key:my_secret@my_bucket/my_key
    s3://my_key:my_secret@my_server:my_port@my_bucket/my_key
    gs://my_bucket/my_blob
    azure://my_bucket/my_blob
    hdfs:///path/file
    hdfs://path/file
    webhdfs://host:port/path/file
    ./local/path/file
    ~/local/path/file
    local/path/file
    ./local/path/file.gz
    file:///home/user/file
    file:///home/user/file.bz2
    [ssh|scp|sftp]://username@host//path/file
    [ssh|scp|sftp]://username@host/path/file
    [ssh|scp|sftp]://username:password@host/path/file


Documentation
=============

Installation
------------

``smart_open`` supports a wide range of storage solutions, including AWS S3, Google Cloud and Azure.
Each individual solution has its own dependencies.
By default, ``smart_open`` does not install any dependencies, in order to keep the installation size small.
You can install these dependencies explicitly using::

    pip install smart_open[azure]  # Install Azure deps
    pip install smart_open[gcs]    # Install GCS deps
    pip install smart_open[s3]     # Install S3 deps

Or, if you don't mind installing a large number of third party libraries, you can install all dependencies using::

    pip install smart_open[all]

Be warned that this option increases the installation size significantly, e.g. over 100MB.

If you're upgrading from ``smart_open`` versions 2.x and below, please check out the `Migration Guide <MIGRATING_FROM_OLDER_VERSIONS.rst>`_.

Built-in help
-------------

For detailed API info, see the online help:

.. code-block:: python

    help('smart_open')

or click `here <https://github.com/RaRe-Technologies/smart_open/blob/master/help.txt>`__ to view the help in your browser.
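
``smart_open`` decides how to handle each of the URLs listed earlier by looking at the scheme prefix. As a purely illustrative, stdlib-only sketch of that idea (the handler names below are hypothetical; ``smart_open``'s real transport registry works differently internally):

.. code-block:: python

    from urllib.parse import urlsplit

    # Hypothetical mapping for illustration only -- smart_open maintains
    # its own transport registry internally.
    HANDLERS = {
        's3': 's3 transport',
        'gs': 'gcs transport',
        'azure': 'azure transport',
        'file': 'local file transport',
        '': 'local file transport',  # plain paths have no scheme
    }

    def pick_transport(url):
        """Return the (made-up) handler name for the URL's scheme."""
        scheme = urlsplit(url).scheme
        return HANDLERS.get(scheme, 'no handler for scheme %r' % scheme)

    print(pick_transport('s3://my_bucket/my_key'))  # s3 transport
    print(pick_transport('local/path/file'))        # local file transport

A plain path has an empty scheme, which is why ``smart_open`` can fall back to the built-in ``open`` for local files.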

More examples
-------------

For the sake of simplicity, the examples below assume you have all the dependencies installed, i.e. you have done::

    pip install smart_open[all]

.. code-block:: python

    >>> import os, boto3
    >>>
    >>> # stream content *into* S3 (write mode) using a custom session
    >>> session = boto3.Session(
    ...     aws_access_key_id=os.environ['AWS_ACCESS_KEY_ID'],
    ...     aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY'],
    ... )
    >>> url = 's3://smart-open-py37-benchmark-results/test.txt'
    >>> with open(url, 'wb', transport_params={'client': session.client('s3')}) as fout:
    ...     bytes_written = fout.write(b'hello world!')
    ...     print(bytes_written)
    12

.. code-block:: python

    # stream from HDFS
    for line in open('hdfs://user/hadoop/my_file.txt', encoding='utf8'):
        print(line)

    # stream from WebHDFS
    for line in open('webhdfs://host:port/user/hadoop/my_file.txt'):
        print(line)

    # stream content *into* HDFS (write mode):
    with open('hdfs://host:port/user/hadoop/my_file.txt', 'wb') as fout:
        fout.write(b'hello world')

    # stream content *into* WebHDFS (write mode):
    with open('webhdfs://host:port/user/hadoop/my_file.txt', 'wb') as fout:
        fout.write(b'hello world')

    # stream from a completely custom s3 server, like s3proxy:
    for line in open('s3u://user:secret@host:port@mybucket/mykey.txt'):
        print(line)

    # stream to a Digital Ocean Spaces bucket, providing credentials from a boto3 profile
    session = boto3.Session(profile_name='digitalocean')
    client = session.client('s3', endpoint_url='https://ams3.digitaloceanspaces.com')
    transport_params = {'client': client}
    with open('s3://bucket/key.txt', 'wb', transport_params=transport_params) as fout:
        fout.write(b'here we stand')

    # stream from GCS
    for line in open('gs://my_bucket/my_file.txt'):
        print(line)

    # stream content *into* GCS (write mode):
    with open('gs://my_bucket/my_file.txt', 'wb') as fout:
        fout.write(b'hello world')

    # stream from Azure Blob Storage
    connect_str = os.environ['AZURE_STORAGE_CONNECTION_STRING']
    transport_params = {
        'client': azure.storage.blob.BlobServiceClient.from_connection_string(connect_str),
    }
    for line in open('azure://mycontainer/myfile.txt', transport_params=transport_params):
        print(line)

    # stream content *into* Azure Blob Storage (write mode):
    connect_str = os.environ['AZURE_STORAGE_CONNECTION_STRING']
    transport_params = {
        'client': azure.storage.blob.BlobServiceClient.from_connection_string(connect_str),
    }
    with open('azure://mycontainer/my_file.txt', 'wb', transport_params=transport_params) as fout:
        fout.write(b'hello world')

Compression Handling
--------------------

The top-level ``compression`` parameter controls compression/decompression behavior when reading and writing.
The supported values for this parameter are:

- ``infer_from_extension`` (default behavior)
- ``disable``
- ``.gz``
- ``.bz2``

By default, ``smart_open`` determines the compression algorithm to use based on the file extension.

.. code-block:: python

    >>> from smart_open import open, register_compressor
    >>> with open('smart_open/tests/test_data/1984.txt.gz') as fin:
    ...     print(fin.read(32))
    It was a bright cold day in Apri

You can override this behavior to either disable compression, or explicitly specify the algorithm to use.
To disable compression:

.. code-block:: python

    >>> from smart_open import open, register_compressor
    >>> with open('smart_open/tests/test_data/1984.txt.gz', 'rb', compression='disable') as fin:
    ...     print(fin.read(32))
    b'\x1f\x8b\x08\x08\x85F\x94\\\x00\x031984.txt\x005\x8f=r\xc3@\x08\x85{\x9d\xe2\x1d@'

To specify the algorithm explicitly (e.g. for non-standard file extensions):

.. code-block:: python

    >>> from smart_open import open, register_compressor
    >>> with open('smart_open/tests/test_data/1984.txt.gzip', compression='.gz') as fin:
    ...     print(fin.read(32))
    It was a bright cold day in Apri

You can also easily add support for other file extensions and compression formats.
For example, to open xz-compressed files:

.. code-block:: python

    >>> import lzma, os
    >>> from smart_open import open, register_compressor

    >>> def _handle_xz(file_obj, mode):
    ...     return lzma.LZMAFile(filename=file_obj, mode=mode, format=lzma.FORMAT_XZ)

    >>> register_compressor('.xz', _handle_xz)

    >>> with open('smart_open/tests/test_data/1984.txt.xz') as fin:
    ...     print(fin.read(32))
    It was a bright cold day in Apri

``lzma`` is in the standard library in Python 3.3 and greater.
For 2.7, use `backports.lzma`_.

.. _backports.lzma: https://pypi.org/project/backports.lzma/

Transport-specific Options
--------------------------

``smart_open`` supports a wide range of transport options out of the box, including:

- S3
- HTTP, HTTPS (read-only)
- SSH, SCP and SFTP
- WebHDFS
- GCS
- Azure Blob Storage

Each option involves setting up its own set of parameters.
For example, for accessing S3, you often need to set up authentication, like API keys or a profile name.
``smart_open``'s ``open`` function accepts a keyword argument ``transport_params`` which accepts additional parameters for the transport layer.
Here are some examples of using this parameter:

.. code-block:: python

    >>> import boto3
    >>> fin = open('s3://commoncrawl/robots.txt', transport_params=dict(client=boto3.client('s3')))
    >>> fin = open('s3://commoncrawl/robots.txt', transport_params=dict(buffer_size=1024))

For the full list of keyword arguments supported by each transport option, see the documentation:

.. code-block:: python

    help('smart_open.open')

S3 Credentials
--------------

``smart_open`` uses the ``boto3`` library to talk to S3.
``boto3`` has several `mechanisms <https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html>`__ for determining the credentials to use.
By default, ``smart_open`` will defer to ``boto3`` and let the latter take care of the credentials.
There are several ways to override this behavior.

The first is to pass a ``boto3.Client`` object as a transport parameter to the ``open`` function.
You can customize the credentials when constructing the session for the client.
``smart_open`` will then use the session when talking to S3.

.. code-block:: python

    session = boto3.Session(
        aws_access_key_id=ACCESS_KEY,
        aws_secret_access_key=SECRET_KEY,
        aws_session_token=SESSION_TOKEN,
    )
    client = session.client('s3', endpoint_url=..., config=...)
    fin = open('s3://bucket/key', transport_params=dict(client=client))

Your second option is to specify the credentials within the S3 URL itself:

.. code-block:: python

    fin = open('s3://aws_access_key_id:aws_secret_access_key@bucket/key', ...)

*Important*: The two methods above are **mutually exclusive**. If you pass an AWS client *and* the URL contains credentials, ``smart_open`` will ignore the latter.

*Important*: ``smart_open`` ignores configuration files from the older ``boto`` library.
Port your old ``boto`` settings to ``boto3`` in order to use them with ``smart_open``.
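
The URL-embedded form follows the standard URL authority layout, so the access key, secret, bucket and object key occupy the username, password, host and path slots. A quick stdlib illustration of that layout (this is not ``smart_open``'s actual parsing code):

.. code-block:: python

    from urllib.parse import urlsplit

    # Illustration only: urlsplit makes the layout of a credentialed
    # S3-style URL visible; smart_open does its own parsing.
    parts = urlsplit('s3://my_key:my_secret@my_bucket/my_key')
    print(parts.username)  # my_key     (aws_access_key_id)
    print(parts.password)  # my_secret  (aws_secret_access_key)
    print(parts.hostname)  # my_bucket  (bucket name)
    print(parts.path)      # /my_key    (object key, with a leading slash)

This also shows why URL-embedded credentials are best avoided in logs and shell history: they are plainly visible in the URL string.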

Iterating Over an S3 Bucket's Contents
--------------------------------------

Since going over all (or select) keys in an S3 bucket is a very common operation, there's also an extra function ``smart_open.s3.iter_bucket()`` that does this efficiently, **processing the bucket keys in parallel** (using multiprocessing):

.. code-block:: python

    >>> from smart_open import s3
    >>> # get data corresponding to 2010 and later under "silo-open-data/annual/monthly_rain"
    >>> # we use workers=1 for reproducibility; you should use as many workers as you have cores
    >>> bucket = 'silo-open-data'
    >>> prefix = 'annual/monthly_rain/'
    >>> for key, content in s3.iter_bucket(bucket, prefix=prefix, accept_key=lambda key: '/201' in key, workers=1, key_limit=3):
    ...     print(key, round(len(content) / 2**20))
    annual/monthly_rain/2010.monthly_rain.nc 13
    annual/monthly_rain/2011.monthly_rain.nc 13
    annual/monthly_rain/2012.monthly_rain.nc 13

GCS Credentials
---------------

``smart_open`` uses the ``google-cloud-storage`` library to talk to GCS.
``google-cloud-storage`` uses the ``google-cloud`` package under the hood to handle authentication.
There are several `options <https://googleapis.dev/python/google-api-core/latest/auth.html>`__ to provide credentials.
By default, ``smart_open`` will defer to ``google-cloud-storage`` and let it take care of the credentials.

To override this behavior, pass a ``google.cloud.storage.Client`` object as a transport parameter to the ``open`` function.
You can `customize the credentials <https://googleapis.dev/python/storage/latest/client.html>`__ when constructing the client.
``smart_open`` will then use the client when talking to GCS.
To follow along with the example below, `refer to Google's guide <https://cloud.google.com/storage/docs/reference/libraries#setting_up_authentication>`__ to setting up GCS authentication with a service account.

.. code-block:: python

    import os
    from google.cloud.storage import Client
    service_account_path = os.environ['GOOGLE_APPLICATION_CREDENTIALS']
    client = Client.from_service_account_json(service_account_path)
    fin = open('gs://gcp-public-data-landsat/index.csv.gz', transport_params=dict(client=client))

If you need more credential options, you can create an explicit ``google.auth.credentials.Credentials`` object and pass it to the Client.
To create an API token for use in the example below, refer to the `GCS authentication guide <https://cloud.google.com/storage/docs/authentication#apiauth>`__.

.. code-block:: python

    import os
    from google.auth.credentials import Credentials
    from google.cloud.storage import Client
    token = os.environ['GOOGLE_API_TOKEN']
    credentials = Credentials(token=token)
    client = Client(credentials=credentials)
    fin = open('gs://gcp-public-data-landsat/index.csv.gz', transport_params=dict(client=client))

Azure Credentials
-----------------

``smart_open`` uses the ``azure-storage-blob`` library to talk to Azure Blob Storage.
By default, ``smart_open`` will defer to ``azure-storage-blob`` and let it take care of the credentials.

Azure Blob Storage has no mechanism for inferring credentials; therefore, passing an ``azure.storage.blob.BlobServiceClient`` object as a transport parameter to the ``open`` function is required.
You can `customize the credentials <https://docs.microsoft.com/en-us/azure/storage/common/storage-samples-python#authentication>`__ when constructing the client.
``smart_open`` will then use the client when talking to Azure Blob Storage.
To follow along with the example below, `refer to Azure's guide <https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-python#copy-your-credentials-from-the-azure-portal>`__ to setting up authentication.

.. code-block:: python

    import os
    from azure.storage.blob import BlobServiceClient
    azure_storage_connection_string = os.environ['AZURE_STORAGE_CONNECTION_STRING']
    client = BlobServiceClient.from_connection_string(azure_storage_connection_string)
    fin = open('azure://my_container/my_blob.txt', transport_params=dict(client=client))

If you need more credential options, refer to the `Azure Storage authentication guide <https://docs.microsoft.com/en-us/azure/storage/common/storage-samples-python#authentication>`__.

File-like Binary Streams
------------------------

The ``open`` function also accepts file-like objects.
This is useful when you already have a `binary file <https://docs.python.org/3/glossary.html#term-binary-file>`_ open, and would like to wrap it with transparent decompression:


.. code-block:: python

    >>> import io, gzip
    >>>
    >>> # Prepare some gzipped binary data in memory, as an example.
    >>> # Any binary file will do; we're using BytesIO here for simplicity.
    >>> buf = io.BytesIO()
    >>> with gzip.GzipFile(fileobj=buf, mode='w') as fout:
    ...     _ = fout.write(b'this is a bytestring')
    >>> _ = buf.seek(0)
    >>>
    >>> # Use case starts here.
    >>> buf.name = 'file.gz'  # add a .name attribute so smart_open knows which compressor to use
    >>> import smart_open
    >>> smart_open.open(buf, 'rb').read()  # will gzip-decompress transparently!
    b'this is a bytestring'


In this case, ``smart_open`` relied on the ``.name`` attribute of our `binary I/O stream <https://docs.python.org/3/library/io.html#binary-i-o>`_ ``buf`` object to determine which decompressor to use.
If your file object doesn't have one, set the ``.name`` attribute to an appropriate value.
Furthermore, that value has to end with a **known** file extension (see the ``register_compressor`` function).
Otherwise, the transparent decompression will not occur.
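
The extension-based lookup described above can be pictured as a small registry keyed on the file suffix. The sketch below is a simplified, stdlib-only stand-in, not ``smart_open``'s implementation, and the helper ``open_maybe_compressed`` is made up for this example:

.. code-block:: python

    import bz2
    import gzip
    import os
    import tempfile

    # Simplified sketch of extension-keyed (de)compression dispatch;
    # smart_open manages its real registry via register_compressor().
    COMPRESSORS = {
        '.gz': gzip.open,
        '.bz2': bz2.open,
    }

    def open_maybe_compressed(path, mode='rt'):
        ext = os.path.splitext(path)[1]
        opener = COMPRESSORS.get(ext, open)  # unknown extension: plain built-in open
        return opener(path, mode)

    # Round-trip through a gzip file to show the dispatch in action.
    path = os.path.join(tempfile.mkdtemp(), 'example.txt.gz')
    with open_maybe_compressed(path, 'wt') as fout:
        fout.write('hello world')
    with open_maybe_compressed(path) as fin:
        print(fin.read())  # hello world

This is also why the ``.name`` attribute matters: the suffix is the only signal the registry has, so an unknown or missing extension falls through to an uncompressed read.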

Drop-in replacement of ``pathlib.Path.open``
--------------------------------------------

``smart_open.open`` can also be used with ``Path`` objects.
The built-in ``Path.open()`` is not able to read text from compressed files, so use ``patch_pathlib`` to replace it with ``smart_open.open()`` instead.
This can be helpful when e.g. working with compressed files.

.. code-block:: python

    >>> from pathlib import Path
    >>> from smart_open.smart_open_lib import patch_pathlib
    >>>
    >>> _ = patch_pathlib()  # replace `Path.open` with `smart_open.open`
    >>>
    >>> path = Path("smart_open/tests/test_data/crime-and-punishment.txt.gz")
    >>>
    >>> with path.open("r") as infile:
    ...     print(infile.readline()[:41])
    В начале июля, в чрезвычайно жаркое время

How do I ...?
=============

See `this document <howto.md>`__.

Extending ``smart_open``
========================

See `this document <extending.md>`__.

Testing ``smart_open``
======================

``smart_open`` comes with a comprehensive suite of unit tests.
Before you can run the test suite, install the test dependencies::

    pip install -e .[test]

Now, you can run the unit tests::

    pytest smart_open

The tests are also run automatically with `Travis CI <https://travis-ci.org/RaRe-Technologies/smart_open>`_ on every commit push & pull request.

Comments, bug reports
=====================

``smart_open`` lives on `Github <https://github.com/RaRe-Technologies/smart_open>`_. You can file
issues or pull requests there. Suggestions, pull requests and improvements welcome!

----------------

``smart_open`` is open source software released under the `MIT license <https://github.com/piskvorky/smart_open/blob/master/LICENSE>`_.
Copyright (c) 2015-now `Radim Řehůřek <https://radimrehurek.com>`_.