.. _topics-exporters:

==============
Item Exporters
==============

.. module:: scrapy.exporters
   :synopsis: Item Exporters

Once you have scraped your items, you often want to persist or export those
items, to use the data in some other application. That is, after all, the whole
purpose of the scraping process.

For this purpose Scrapy provides a collection of Item Exporters for different
output formats, such as XML, CSV or JSON.

Using Item Exporters
====================

If you are in a hurry and just want to use an Item Exporter to output scraped
data, see :ref:`topics-feed-exports`. Otherwise, if you want to know how
Item Exporters work or need more custom functionality (not covered by the
default exports), continue reading below.

In order to use an Item Exporter, you must instantiate it with its required
arguments. Each Item Exporter requires different arguments, so check the
documentation of each exporter in :ref:`topics-exporters-reference`. After you
have instantiated your exporter, you have to:

1. call the method :meth:`~BaseItemExporter.start_exporting` in order to
   signal the beginning of the exporting process

2. call the :meth:`~BaseItemExporter.export_item` method for each item you want
   to export

3. and finally call the :meth:`~BaseItemExporter.finish_exporting` method to
   signal the end of the exporting process

Here you can see an :doc:`Item Pipeline <item-pipeline>` which uses multiple
Item Exporters to group scraped items into different files according to the
value of one of their fields::

    from itemadapter import ItemAdapter
    from scrapy.exporters import XmlItemExporter

    class PerYearXmlExportPipeline:
        """Distribute items across multiple XML files according to their 'year' field"""

        def open_spider(self, spider):
            # Maps each year to an (exporter, open file) pair
            self.year_to_exporter = {}

        def close_spider(self, spider):
            # Write the XML footer of every file and release the file handles
            for exporter, xml_file in self.year_to_exporter.values():
                exporter.finish_exporting()
                xml_file.close()

        def _exporter_for_item(self, item):
            adapter = ItemAdapter(item)
            year = adapter['year']
            if year not in self.year_to_exporter:
                # First item seen for this year: create its file and exporter
                xml_file = open(f'{year}.xml', 'wb')
                exporter = XmlItemExporter(xml_file)
                exporter.start_exporting()
                self.year_to_exporter[year] = (exporter, xml_file)
            return self.year_to_exporter[year][0]

        def process_item(self, item, spider):
            exporter = self._exporter_for_item(item)
            exporter.export_item(item)
            return item


.. _topics-exporters-field-serialization:

Serialization of item fields
============================

By default, the field values are passed unmodified to the underlying
serialization library, and the decision of how to serialize them is delegated
to each particular serialization library.

However, you can customize how each field value is serialized *before it is
passed to the serialization library*.

There are two ways to customize how a field will be serialized, which are
described next.

.. _topics-exporters-serializers:

1. Declaring a serializer in the field
--------------------------------------

If you use :class:`~.Item` you can declare a serializer in the
:ref:`field metadata <topics-items-fields>`. The serializer must be
a callable which receives a value and returns its serialized form.

Example::

    import scrapy

    def serialize_price(value):
        return f'$ {str(value)}'

    class Product(scrapy.Item):
        name = scrapy.Field()
        price = scrapy.Field(serializer=serialize_price)


2. Overriding the serialize_field() method
------------------------------------------

You can also override the :meth:`~BaseItemExporter.serialize_field()` method to
customize how your field value will be exported.

Make sure you call the base class :meth:`~BaseItemExporter.serialize_field()` method
after your custom code.

Example::

      from scrapy.exporters import XmlItemExporter

      class ProductXmlExporter(XmlItemExporter):

          def serialize_field(self, field, name, value):
              # *name* is the field name; *field* holds the field metadata
              if name == 'price':
                  return f'$ {str(value)}'
              return super().serialize_field(field, name, value)

.. _topics-exporters-reference:

Built-in Item Exporters reference
=================================

Here is a list of the Item Exporters bundled with Scrapy. Some of them contain
output examples, which assume you're exporting these two items::

    Item(name='Color TV', price='1200')
    Item(name='DVD player', price='200')

BaseItemExporter
----------------

.. class:: BaseItemExporter(fields_to_export=None, export_empty_fields=False, encoding='utf-8', indent=0, dont_fail=False)

   This is the (abstract) base class for all Item Exporters. It provides
   support for common features used by all (concrete) Item Exporters, such as
   defining what fields to export, whether to export empty fields, or which
   encoding to use.

   These features can be configured through the ``__init__`` method arguments which
   populate their respective instance attributes: :attr:`fields_to_export`,
   :attr:`export_empty_fields`, :attr:`encoding`, :attr:`indent`.

   .. versionadded:: 2.0
      The *dont_fail* parameter.

   .. method:: export_item(item)

      Exports the given item. This method must be implemented in subclasses.

   .. method:: serialize_field(field, name, value)

      Return the serialized value for the given field. You can override this
      method (in your custom Item Exporters) if you want to control how a
      particular field or value will be serialized/exported.

      By default, this method looks for a serializer :ref:`declared in the item
      field <topics-exporters-serializers>` and returns the result of applying
      that serializer to the value. If no serializer is found, it returns the
      value unchanged.

      :param field: the field being serialized. If the source :ref:`item object
          <item-types>` does not define field metadata, *field* is an empty
          :class:`dict`.
      :type field: :class:`~scrapy.item.Field` object or a :class:`dict` instance

      :param name: the name of the field being serialized
      :type name: str

      :param value: the value being serialized

   .. method:: start_exporting()

      Signal the beginning of the exporting process. Some exporters may use
      this to generate some required header (for example, the
      :class:`XmlItemExporter`). You must call this method before exporting any
      items.

   .. method:: finish_exporting()

      Signal the end of the exporting process. Some exporters may use this to
      generate some required footer (for example, the
      :class:`XmlItemExporter`). You must always call this method after you
      have no more items to export.

   .. attribute:: fields_to_export

      A list with the names of the fields that will be exported, or ``None`` if
      you want to export all fields. Defaults to ``None``.

      Some exporters (like :class:`CsvItemExporter`) respect the order of the
      fields defined in this attribute.

      When using :ref:`item objects <item-types>` that do not expose all their
      possible fields, exporters that do not support exporting a different
      subset of fields per item will only export the fields found in the first
      item exported. Use ``fields_to_export`` to define all the fields to be
      exported.

   .. attribute:: export_empty_fields

      Whether to include empty/unpopulated item fields in the exported data.
      Defaults to ``False``. Some exporters (like :class:`CsvItemExporter`)
      ignore this attribute and always export all empty fields.

      This option is ignored for dict items.

   .. attribute:: encoding

      The output character encoding.

   .. attribute:: indent

      Amount of spaces used to indent the output on each level. Defaults to ``0``.

      * ``indent=None`` selects the most compact representation,
        all items in the same line with no indentation
      * ``indent<=0`` each item on its own line, no indentation
      * ``indent>0`` each item on its own line, indented with the provided numeric value

PythonItemExporter
------------------

.. autoclass:: PythonItemExporter


.. highlight:: none

XmlItemExporter
---------------

.. class:: XmlItemExporter(file, item_element='item', root_element='items', **kwargs)

   Exports items in XML format to the specified file object.

   :param file: the file-like object to use for exporting the data. Its ``write`` method should
                accept ``bytes`` (a disk file opened in binary mode, a ``io.BytesIO`` object, etc)

   :param root_element: The name of root element in the exported XML.
   :type root_element: str

   :param item_element: The name of each item element in the exported XML.
   :type item_element: str

   The additional keyword arguments of this ``__init__`` method are passed to the
   :class:`BaseItemExporter` ``__init__`` method.
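
   As a rough illustration of the parameters above, the following sketch
   exports two plain ``dict`` items to an in-memory buffer using custom
   element names. The element names ``products`` and ``product`` and the
   sample items are made up for this example. The result is parsed back with
   the standard library rather than compared as text, since the exact
   whitespace depends on the ``indent`` setting:

   .. code-block:: python

       import io
       import xml.etree.ElementTree as ET

       from scrapy.exporters import XmlItemExporter

       # Hypothetical sample items, mirroring the ones used in this reference
       items = [
           {"name": "Color TV", "price": "1200"},
           {"name": "DVD player", "price": "200"},
       ]

       buffer = io.BytesIO()  # any object whose write() accepts bytes should work
       exporter = XmlItemExporter(
           buffer, item_element="product", root_element="products"
       )

       exporter.start_exporting()  # writes the XML declaration and root element
       for item in items:
           exporter.export_item(item)
       exporter.finish_exporting()  # closes the root element

       # Parse the exported bytes to inspect the resulting structure
       root = ET.fromstring(buffer.getvalue())
       names = [product.findtext("name") for product in root.findall("product")]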

   A typical output of this exporter would be::

       <?xml version="1.0" encoding="utf-8"?>
       <items>
         <item>
           <name>Color TV</name>
           <price>1200</price>
         </item>
         <item>
           <name>DVD player</name>
           <price>200</price>
         </item>
       </items>

   Unless overridden in the :meth:`serialize_field` method, multi-valued fields are
   exported by serializing each value inside a ``<value>`` element. This is for
   convenience, as multi-valued fields are very common.

   For example, the item::

        Item(name=['John', 'Doe'], age='23')

   Would be serialized as::

       <?xml version="1.0" encoding="utf-8"?>
       <items>
         <item>
           <name>
             <value>John</value>
             <value>Doe</value>
           </name>
           <age>23</age>
         </item>
       </items>

CsvItemExporter
---------------

.. class:: CsvItemExporter(file, include_headers_line=True, join_multivalued=',', errors=None, **kwargs)

   Exports items in CSV format to the given file-like object. If the
   :attr:`fields_to_export` attribute is set, it will be used to define the
   CSV columns and their order. The :attr:`export_empty_fields` attribute has
   no effect on this exporter.

   :param file: the file-like object to use for exporting the data. Its ``write`` method should
                accept ``bytes`` (a disk file opened in binary mode, a ``io.BytesIO`` object, etc)

   :param include_headers_line: If enabled, makes the exporter output a header
      line with the field names taken from
      :attr:`BaseItemExporter.fields_to_export` or the first exported item fields.
   :type include_headers_line: bool

   :param join_multivalued: The char (or chars) that will be used for joining
      multi-valued fields, if found.
   :type join_multivalued: str

   :param errors: The optional string that specifies how encoding and decoding
      errors are to be handled. For more information see
      :class:`io.TextIOWrapper`.
   :type errors: str

   The additional keyword arguments of this ``__init__`` method are passed to the
   :class:`BaseItemExporter` ``__init__`` method, and the leftover arguments to the
   :func:`csv.writer` function, so you can use any :func:`csv.writer` function
   argument to customize this exporter.

   A typical output of this exporter would be::

      name,price
      Color TV,1200
      DVD player,200

PickleItemExporter
------------------

.. class:: PickleItemExporter(file, protocol=0, **kwargs)

   Exports items in pickle format to the given file-like object.

   :param file: the file-like object to use for exporting the data. Its ``write`` method should
                accept ``bytes`` (a disk file opened in binary mode, a ``io.BytesIO`` object, etc)

   :param protocol: The pickle protocol to use.
   :type protocol: int

   For more information, see :mod:`pickle`.

   The additional keyword arguments of this ``__init__`` method are passed to the
   :class:`BaseItemExporter` ``__init__`` method.

   Pickle isn't a human-readable format, so no output examples are provided.

PprintItemExporter
------------------

.. class:: PprintItemExporter(file, **kwargs)

   Exports items in pretty print format to the specified file object.

   :param file: the file-like object to use for exporting the data. Its ``write`` method should
                accept ``bytes`` (a disk file opened in binary mode, a ``io.BytesIO`` object, etc)

   The additional keyword arguments of this ``__init__`` method are passed to the
   :class:`BaseItemExporter` ``__init__`` method.

   A typical output of this exporter would be::

        {'name': 'Color TV', 'price': '1200'}
        {'name': 'DVD player', 'price': '200'}

   Longer lines (when present) are pretty-formatted.

JsonItemExporter
----------------

.. class:: JsonItemExporter(file, **kwargs)

   Exports items in JSON format to the specified file-like object, writing all
   objects as a list of objects. The additional ``__init__`` method arguments are
   passed to the :class:`BaseItemExporter` ``__init__`` method, and the leftover
   arguments to the :class:`~json.JSONEncoder` ``__init__`` method, so you can use any
   :class:`~json.JSONEncoder` ``__init__`` method argument to customize this exporter.

   :param file: the file-like object to use for exporting the data. Its ``write`` method should
                accept ``bytes`` (a disk file opened in binary mode, a ``io.BytesIO`` object, etc)

   A typical output of this exporter would be::

        [{"name": "Color TV", "price": "1200"},
        {"name": "DVD player", "price": "200"}]

   .. _json-with-large-data:

   .. warning:: JSON is a very simple and flexible serialization format, but it
      doesn't scale well for large amounts of data, since incremental (also
      known as stream-mode) parsing is not well supported (if at all) among
      JSON parsers (in any language), and most of them just parse the entire
      object in memory. If you want the power and simplicity of JSON with a
      more stream-friendly format, consider using
      :class:`JsonLinesItemExporter` instead, or splitting the output into
      multiple chunks.

JsonLinesItemExporter
---------------------

.. class:: JsonLinesItemExporter(file, **kwargs)

   Exports items in JSON format to the specified file-like object, writing one
   JSON-encoded item per line. The additional ``__init__`` method arguments are passed
   to the :class:`BaseItemExporter` ``__init__`` method, and the leftover arguments to
   the :class:`~json.JSONEncoder` ``__init__`` method, so you can use any
   :class:`~json.JSONEncoder` ``__init__`` method argument to customize this exporter.

   :param file: the file-like object to use for exporting the data. Its ``write`` method should
                accept ``bytes`` (a disk file opened in binary mode, a ``io.BytesIO`` object, etc)

   A typical output of this exporter would be::

        {"name": "Color TV", "price": "1200"}
        {"name": "DVD player", "price": "200"}

   Unlike the one produced by :class:`JsonItemExporter`, the format produced by
   this exporter is well suited for serializing large amounts of data.

MarshalItemExporter
-------------------

.. autoclass:: MarshalItemExporter
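
As a minimal sketch of how this exporter might be used, the snippet below
writes two ``dict`` items to an in-memory buffer and reads them back with
:func:`marshal.load`. It assumes that the exporter writes one
marshal-serialized dictionary per item, one after another, which is not
spelled out on this page. The sample items are made up for illustration:

.. code-block:: python

    import io
    import marshal

    from scrapy.exporters import MarshalItemExporter

    # Hypothetical sample items, mirroring the ones used in this reference
    items = [
        {"name": "Color TV", "price": "1200"},
        {"name": "DVD player", "price": "200"},
    ]

    buffer = io.BytesIO()  # any object whose write() accepts bytes should work
    exporter = MarshalItemExporter(buffer)

    exporter.start_exporting()
    for item in items:
        exporter.export_item(item)
    exporter.finish_exporting()

    # Assumption: one marshal value was written per item, so read them back
    # one at a time from the start of the buffer
    buffer.seek(0)
    restored = [marshal.load(buffer) for _ in items]

Like pickle, marshal is a Python-specific binary format, so the exported data
is only meant to be read back from Python.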