.. _news:

Release notes
=============

.. _release-2.5.1:

Scrapy 2.5.1 (2021-10-05)
-------------------------

* **Security bug fix:**

  If you use
  :class:`~scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware`
  (i.e. the ``http_user`` and ``http_pass`` spider attributes) for HTTP
  authentication, any request exposes your credentials to the request target.

  To prevent exposing your credentials to unintended domains, you must now
  also set a new spider attribute, ``http_auth_domain``, and point it to the
  specific domain to which the authentication credentials must be sent.

  If the ``http_auth_domain`` spider attribute is not set, the domain of the
  first request will be considered the HTTP authentication target, and
  authentication credentials will only be sent in requests targeting that
  domain.

  If you need to send the same HTTP authentication credentials to multiple
  domains, you can use :func:`w3lib.http.basic_auth_header` instead to set
  the value of the ``Authorization`` header of your requests.

  If you *really* want your spider to send the same HTTP authentication
  credentials to any domain, set the ``http_auth_domain`` spider attribute
  to ``None``.

  Finally, if you are a user of `scrapy-splash`_, know that this version of
  Scrapy breaks compatibility with scrapy-splash 0.7.2 and earlier. You will
  need to upgrade scrapy-splash to a later version for it to continue to
  work.

.. _scrapy-splash: https://github.com/scrapy-plugins/scrapy-splash


.. _release-2.5.0:

Scrapy 2.5.0 (2021-04-06)
-------------------------

Highlights:

- Official Python 3.9 support

- Experimental :ref:`HTTP/2 support <http2>`

- New :func:`~scrapy.downloadermiddlewares.retry.get_retry_request` function
  to retry requests from spider callbacks

- New :class:`~scrapy.signals.headers_received` signal that allows stopping
  downloads early

- New :class:`Response.protocol <scrapy.http.Response.protocol>` attribute

Deprecation removals
~~~~~~~~~~~~~~~~~~~~

- Removed all code that :ref:`was deprecated in 1.7.0 <1.7-deprecations>` and
  had not :ref:`already been removed in 2.4.0 <2.4-deprecation-removals>`.
  (:issue:`4901`)

- Removed support for the ``SCRAPY_PICKLED_SETTINGS_TO_OVERRIDE`` environment
  variable, :ref:`deprecated in 1.8.0 <1.8-deprecations>`. (:issue:`4912`)


Deprecations
~~~~~~~~~~~~

- The :mod:`scrapy.utils.py36` module is now deprecated in favor of
  :mod:`scrapy.utils.asyncgen`. (:issue:`4900`)


New features
~~~~~~~~~~~~

- Experimental :ref:`HTTP/2 support <http2>` through a new download handler
  that can be assigned to the ``https`` protocol in the
  :setting:`DOWNLOAD_HANDLERS` setting.
  (:issue:`1854`, :issue:`4769`, :issue:`5058`, :issue:`5059`, :issue:`5066`)

- The new :func:`scrapy.downloadermiddlewares.retry.get_retry_request`
  function may be used from spider callbacks or middlewares to handle the
  retrying of a request beyond the scenarios that
  :class:`~scrapy.downloadermiddlewares.retry.RetryMiddleware` supports.
  (:issue:`3590`, :issue:`3685`, :issue:`4902`)

- The new :class:`~scrapy.signals.headers_received` signal gives early access
  to response headers and allows :ref:`stopping downloads
  <topics-stop-response-download>`.
  (:issue:`1772`, :issue:`4897`)

- The new :attr:`Response.protocol <scrapy.http.Response.protocol>`
  attribute gives access to the string that identifies the protocol used to
  download a response. (:issue:`4878`)

- :ref:`Stats <topics-stats>` now include the following entries that indicate
  the number of successes and failures in storing
  :ref:`feeds <topics-feed-exports>`::

      feedexport/success_count/<storage type>
      feedexport/failed_count/<storage type>

  Where ``<storage type>`` is the feed storage backend class name, such as
  :class:`~scrapy.extensions.feedexport.FileFeedStorage` or
  :class:`~scrapy.extensions.feedexport.FTPFeedStorage`.

  (:issue:`3947`, :issue:`4850`)

- The :class:`~scrapy.spidermiddlewares.urllength.UrlLengthMiddleware` spider
  middleware now logs ignored URLs with ``INFO`` :ref:`logging level
  <levels>` instead of ``DEBUG``, and it now adds the following entry to
  :ref:`stats <topics-stats>` to keep track of the number of ignored URLs::

      urllength/request_ignored_count

  (:issue:`5036`)

- The
  :class:`~scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware`
  downloader middleware now logs the number of decompressed responses and the
  total count of resulting bytes::

      httpcompression/response_bytes
      httpcompression/response_count

  (:issue:`4797`, :issue:`4799`)


Bug fixes
~~~~~~~~~

- Fixed installation on PyPy, which installed PyDispatcher in addition to
  PyPyDispatcher and could prevent Scrapy from working depending on which
  package got imported. (:issue:`4710`, :issue:`4814`)

- When inspecting a callback to check if it is a generator that also returns
  a value, an exception is no longer raised if the callback has a docstring
  with lower indentation than the following code.
  (:issue:`4477`, :issue:`4935`)

- The `Content-Length <https://tools.ietf.org/html/rfc2616#section-14.13>`_
  header is no longer omitted from responses when using the default, HTTP/1.1
  download handler (see :setting:`DOWNLOAD_HANDLERS`).
  (:issue:`5009`, :issue:`5034`, :issue:`5045`, :issue:`5057`, :issue:`5062`)

- Setting the :reqmeta:`handle_httpstatus_all` request meta key to ``False``
  now has the same effect as not setting it at all, instead of having the
  same effect as setting it to ``True``.
  (:issue:`3851`, :issue:`4694`)


Documentation
~~~~~~~~~~~~~

- Added instructions to :ref:`install Scrapy in Windows using pip
  <intro-install-windows>`.
  (:issue:`4715`, :issue:`4736`)

- Logging documentation now includes :ref:`additional ways to filter logs
  <topics-logging-advanced-customization>`.
  (:issue:`4216`, :issue:`4257`, :issue:`4965`)

- Covered how to deal with long lists of allowed domains in the :ref:`FAQ
  <faq>`. (:issue:`2263`, :issue:`3667`)

- Covered scrapy-bench_ in :ref:`benchmarking`.
  (:issue:`4996`, :issue:`5016`)

- Clarified that one :ref:`extension <topics-extensions>` instance is created
  per crawler.
  (:issue:`5014`)

- Fixed some errors in examples.
  (:issue:`4829`, :issue:`4830`, :issue:`4907`, :issue:`4909`,
  :issue:`5008`)

- Fixed some external links, typos, and so on.
  (:issue:`4892`, :issue:`4899`, :issue:`4936`, :issue:`4942`, :issue:`5005`,
  :issue:`5063`)

- The :ref:`list of Request.meta keys <topics-request-meta>` is now sorted
  alphabetically.
  (:issue:`5061`, :issue:`5065`)

- Updated references to Scrapinghub, which is now called Zyte.
  (:issue:`4973`, :issue:`5072`)

- Added a mention of contributors in the README. (:issue:`4956`)

- Reduced the top margin of lists.
  (:issue:`4974`)


Quality assurance
~~~~~~~~~~~~~~~~~

- Made Python 3.9 support official (:issue:`4757`, :issue:`4759`)

- Extended typing hints (:issue:`4895`)

- Fixed deprecated uses of the Twisted API.
  (:issue:`4940`, :issue:`4950`, :issue:`5073`)

- Made our tests run with the new pip resolver.
  (:issue:`4710`, :issue:`4814`)

- Added tests to ensure that :ref:`coroutine support <coroutine-support>`
  works as expected. (:issue:`4987`)

- Migrated from Travis CI to GitHub Actions. (:issue:`4924`)

- Fixed CI issues.
  (:issue:`4986`, :issue:`5020`, :issue:`5022`, :issue:`5027`, :issue:`5052`,
  :issue:`5053`)

- Implemented code refactorings, style fixes and cleanups.
  (:issue:`4911`, :issue:`4982`, :issue:`5001`, :issue:`5002`, :issue:`5076`)


.. _release-2.4.1:

Scrapy 2.4.1 (2020-11-17)
-------------------------

- Fixed :ref:`feed exports <topics-feed-exports>` overwrite support
  (:issue:`4845`, :issue:`4857`, :issue:`4859`)

- Fixed the AsyncIO event loop handling, which could make code hang
  (:issue:`4855`, :issue:`4872`)

- Fixed the IPv6-capable DNS resolver
  :class:`~scrapy.resolver.CachingHostnameResolver` for download handlers
  that call
  :meth:`reactor.resolve <twisted.internet.interfaces.IReactorCore.resolve>`
  (:issue:`4802`, :issue:`4803`)

- Fixed the output of the :command:`genspider` command showing placeholders
  instead of the import path of the generated spider module (:issue:`4874`)

- Migrated Windows CI from Azure Pipelines to GitHub Actions (:issue:`4869`,
  :issue:`4876`)


.. _release-2.4.0:

Scrapy 2.4.0 (2020-10-11)
-------------------------

Highlights:

* Python 3.5 support has been dropped.

* The ``file_path`` method of :ref:`media pipelines <topics-media-pipeline>`
  can now access the source :ref:`item <topics-items>`.

  This allows you to set a download file path based on item data.

* The new ``item_export_kwargs`` key of the :setting:`FEEDS` setting allows
  defining keyword parameters to pass to :ref:`item exporter classes
  <topics-exporters>`.

* You can now choose whether :ref:`feed exports <topics-feed-exports>`
  overwrite or append to the output file.

  For example, when using the :command:`crawl` or :command:`runspider`
  commands, you can use the ``-O`` option instead of ``-o`` to overwrite the
  output file.

* Zstd-compressed responses are now supported if zstandard_ is installed.

* In settings, where the import path of a class is required, it is now
  possible to pass a class object instead.

Modified requirements
~~~~~~~~~~~~~~~~~~~~~

* Python 3.6 or greater is now required; support for Python 3.5 has been
  dropped.

  As a result:

  - When using PyPy, PyPy 7.2.0 or greater :ref:`is now required
    <faq-python-versions>`

  - For Amazon S3 storage support in :ref:`feed exports
    <topics-feed-storage-s3>` or :ref:`media pipelines
    <media-pipelines-s3>`, botocore_ 1.4.87 or greater is now required

  - To use the :ref:`images pipeline <images-pipeline>`, Pillow_ 4.0.0 or
    greater is now required

  (:issue:`4718`, :issue:`4732`, :issue:`4733`, :issue:`4742`, :issue:`4743`,
  :issue:`4764`)


Backward-incompatible changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* :class:`~scrapy.downloadermiddlewares.cookies.CookiesMiddleware` once again
  discards cookies defined in :attr:`Request.headers
  <scrapy.http.Request.headers>`.

  We decided to revert this bug fix, introduced in Scrapy 2.2.0, because it
  was reported that the current implementation could break existing code.

  If you need to set cookies for a request, use the :class:`Request.cookies
  <scrapy.http.Request>` parameter.

  A future version of Scrapy will include a new, better implementation of the
  reverted bug fix.

  (:issue:`4717`, :issue:`4823`)


.. _2.4-deprecation-removals:

Deprecation removals
~~~~~~~~~~~~~~~~~~~~

* :class:`scrapy.extensions.feedexport.S3FeedStorage` no longer reads the
  values of ``access_key`` and ``secret_key`` from the running project
  settings when they are not passed to its ``__init__`` method; you must
  either pass those parameters to its ``__init__`` method or use
  :class:`S3FeedStorage.from_crawler
  <scrapy.extensions.feedexport.S3FeedStorage.from_crawler>`
  (:issue:`4356`, :issue:`4411`, :issue:`4688`)

* :attr:`Rule.process_request <scrapy.spiders.crawl.Rule.process_request>`
  no longer accepts callables which expect a single ``request`` parameter,
  rather than both ``request`` and ``response`` (:issue:`4818`)


Deprecations
~~~~~~~~~~~~

* In custom :ref:`media pipelines <topics-media-pipeline>`, signatures that
  do not accept a keyword-only ``item`` parameter in any of the methods that
  :ref:`now support this parameter <media-pipeline-item-parameter>` are now
  deprecated (:issue:`4628`, :issue:`4686`)

* In custom :ref:`feed storage backend classes <topics-feed-storage>`,
  ``__init__`` method signatures that do not accept a keyword-only
  ``feed_options`` parameter are now deprecated (:issue:`547`, :issue:`716`,
  :issue:`4512`)

* The :class:`scrapy.utils.python.WeakKeyCache` class is now deprecated
  (:issue:`4684`, :issue:`4701`)

* The :func:`scrapy.utils.boto.is_botocore` function is now deprecated, use
  :func:`scrapy.utils.boto.is_botocore_available` instead (:issue:`4734`,
  :issue:`4776`)


New features
~~~~~~~~~~~~

.. _media-pipeline-item-parameter:

* The following methods of :ref:`media pipelines <topics-media-pipeline>` now
  accept an ``item`` keyword-only parameter containing the source
  :ref:`item <topics-items>`:

  - In :class:`scrapy.pipelines.files.FilesPipeline`:

    - :meth:`~scrapy.pipelines.files.FilesPipeline.file_downloaded`

    - :meth:`~scrapy.pipelines.files.FilesPipeline.file_path`

    - :meth:`~scrapy.pipelines.files.FilesPipeline.media_downloaded`

    - :meth:`~scrapy.pipelines.files.FilesPipeline.media_to_download`

  - In :class:`scrapy.pipelines.images.ImagesPipeline`:

    - :meth:`~scrapy.pipelines.images.ImagesPipeline.file_downloaded`

    - :meth:`~scrapy.pipelines.images.ImagesPipeline.file_path`

    - :meth:`~scrapy.pipelines.images.ImagesPipeline.get_images`

    - :meth:`~scrapy.pipelines.images.ImagesPipeline.image_downloaded`

    - :meth:`~scrapy.pipelines.images.ImagesPipeline.media_downloaded`

    - :meth:`~scrapy.pipelines.images.ImagesPipeline.media_to_download`

  (:issue:`4628`, :issue:`4686`)

* The new ``item_export_kwargs`` key of the :setting:`FEEDS` setting allows
  defining keyword parameters to pass to :ref:`item exporter classes
  <topics-exporters>` (:issue:`4606`, :issue:`4768`)

* :ref:`Feed exports <topics-feed-exports>` gained overwrite support:

  * When using the :command:`crawl` or :command:`runspider` commands, you
    can use the ``-O`` option instead of ``-o`` to overwrite the output
    file

  * You can use the ``overwrite`` key in the :setting:`FEEDS` setting to
    configure whether to overwrite the output file (``True``) or append to
    its content (``False``)

  * The ``__init__`` and ``from_crawler`` methods of :ref:`feed storage
    backend classes <topics-feed-storage>` now receive a new keyword-only
    parameter, ``feed_options``, which is a dictionary of :ref:`feed
    options <feed-options>`
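A hedged sketch of a ``settings.py`` fragment using the new ``overwrite`` key (the file name is illustrative):

```python
# settings.py fragment (illustrative): write items.json from scratch
# on every run instead of appending to it.
FEEDS = {
    "items.json": {
        "format": "json",
        "overwrite": True,
    },
}
```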
  (:issue:`547`, :issue:`716`, :issue:`4512`)

* Zstd-compressed responses are now supported if zstandard_ is installed
  (:issue:`4831`)

* In settings, where the import path of a class is required, it is now
  possible to pass a class object instead (:issue:`3870`, :issue:`3873`).

  This also includes settings where only part of the value is made of an
  import path, such as :setting:`DOWNLOADER_MIDDLEWARES` or
  :setting:`DOWNLOAD_HANDLERS`.

* :ref:`Downloader middlewares <topics-downloader-middleware>` can now
  override :class:`response.request <scrapy.http.Response.request>`.

  If a :ref:`downloader middleware <topics-downloader-middleware>` returns
  a :class:`~scrapy.http.Response` object from
  :meth:`~scrapy.downloadermiddlewares.DownloaderMiddleware.process_response`
  or
  :meth:`~scrapy.downloadermiddlewares.DownloaderMiddleware.process_exception`
  with a custom :class:`~scrapy.http.Request` object assigned to
  :class:`response.request <scrapy.http.Response.request>`:

  - The response is handled by the callback of that custom
    :class:`~scrapy.http.Request` object, instead of being handled by the
    callback of the original :class:`~scrapy.http.Request` object

  - That custom :class:`~scrapy.http.Request` object is now sent as the
    ``request`` argument to the :signal:`response_received` signal, instead
    of the original :class:`~scrapy.http.Request` object

  (:issue:`4529`, :issue:`4632`)

* When using the :ref:`FTP feed storage backend <topics-feed-storage-ftp>`:

  - It is now possible to set the new ``overwrite`` :ref:`feed option
    <feed-options>` to ``False`` to append to an existing file instead of
    overwriting it

  - The FTP password can now be omitted if it is not necessary

  (:issue:`547`, :issue:`716`, :issue:`4512`)

* The ``__init__`` method of :class:`~scrapy.exporters.CsvItemExporter` now
  supports an ``errors`` parameter to indicate how to handle encoding errors
  (:issue:`4755`)

* When :ref:`using asyncio <using-asyncio>`, it is now possible to
  :ref:`set a custom asyncio loop <using-custom-loops>` (:issue:`4306`,
  :issue:`4414`)

* Serialized requests (see :ref:`topics-jobs`) now support callbacks that are
  spider methods that delegate to another callable (:issue:`4756`)

* When a response is larger than :setting:`DOWNLOAD_MAXSIZE`, the logged
  message is now a warning, instead of an error (:issue:`3874`,
  :issue:`3886`, :issue:`4752`)


Bug fixes
~~~~~~~~~

* The :command:`genspider` command no longer overwrites existing files
  unless the ``--force`` option is used (:issue:`4561`, :issue:`4616`,
  :issue:`4623`)

* Cookies with an empty value are no longer considered invalid cookies
  (:issue:`4772`)

* The :command:`runspider` command now supports files with the ``.pyw`` file
  extension (:issue:`4643`, :issue:`4646`)

* The :class:`~scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware`
  middleware now simply ignores unsupported proxy values (:issue:`3331`,
  :issue:`4778`)

* Checks for generator callbacks with a ``return`` statement no longer warn
  about ``return`` statements in nested functions (:issue:`4720`,
  :issue:`4721`)

* The system file mode creation mask no longer affects the permissions of
  files generated using the :command:`startproject` command (:issue:`4722`)

* :func:`scrapy.utils.iterators.xmliter` now supports namespaced node names
  (:issue:`861`, :issue:`4746`)

* :class:`~scrapy.Request` objects can now have ``about:`` URLs, which can
  work when using a headless browser (:issue:`4835`)


Documentation
~~~~~~~~~~~~~

* The :setting:`FEED_URI_PARAMS` setting is now documented (:issue:`4671`,
  :issue:`4724`)

* Improved the documentation of
  :ref:`link extractors <topics-link-extractors>` with a usage example from
  a spider callback and reference documentation for the
  :class:`~scrapy.link.Link` class (:issue:`4751`, :issue:`4775`)

* Clarified the impact of :setting:`CONCURRENT_REQUESTS` when using the
  :class:`~scrapy.extensions.closespider.CloseSpider` extension
  (:issue:`4836`)

* Removed references to Python 2’s ``unicode`` type (:issue:`4547`,
  :issue:`4703`)

* We now have an :ref:`official deprecation policy <deprecation-policy>`
  (:issue:`4705`)

* Our :ref:`documentation policies <documentation-policies>` now cover usage
  of Sphinx’s :rst:dir:`versionadded` and :rst:dir:`versionchanged`
  directives, and we have removed usages referencing Scrapy 1.4.0 and earlier
  versions (:issue:`3971`, :issue:`4310`)

* Other documentation cleanups (:issue:`4090`, :issue:`4782`, :issue:`4800`,
  :issue:`4801`, :issue:`4809`, :issue:`4816`, :issue:`4825`)


Quality assurance
~~~~~~~~~~~~~~~~~

* Extended typing hints (:issue:`4243`, :issue:`4691`)

* Added tests for the :command:`check` command (:issue:`4663`)

* Fixed test failures on Debian (:issue:`4726`, :issue:`4727`, :issue:`4735`)

* Improved Windows test coverage (:issue:`4723`)

* Switched to :ref:`formatted string literals <f-strings>` where possible
  (:issue:`4307`, :issue:`4324`, :issue:`4672`)

* Modernized :func:`super` usage (:issue:`4707`)

* Other code and test cleanups (:issue:`1790`, :issue:`3288`, :issue:`4165`,
  :issue:`4564`, :issue:`4651`, :issue:`4714`, :issue:`4738`, :issue:`4745`,
  :issue:`4747`, :issue:`4761`, :issue:`4765`, :issue:`4804`, :issue:`4817`,
  :issue:`4820`, :issue:`4822`, :issue:`4839`)


.. _release-2.3.0:

Scrapy 2.3.0 (2020-08-04)
-------------------------

Highlights:

* :ref:`Feed exports <topics-feed-exports>` now support :ref:`Google Cloud
  Storage <topics-feed-storage-gcs>` as a storage backend

* The new :setting:`FEED_EXPORT_BATCH_ITEM_COUNT` setting allows delivering
  output items in batches of up to the specified number of items.

  It also serves as a workaround for :ref:`delayed file delivery
  <delayed-file-delivery>`, which causes Scrapy to only start item delivery
  after the crawl has finished when using certain storage backends
  (:ref:`S3 <topics-feed-storage-s3>`, :ref:`FTP <topics-feed-storage-ftp>`,
  and now :ref:`GCS <topics-feed-storage-gcs>`).

* The base implementation of :ref:`item loaders <topics-loaders>` has been
  moved into a separate library, :doc:`itemloaders <itemloaders:index>`,
  allowing usage from outside Scrapy and a separate release schedule

Deprecation removals
~~~~~~~~~~~~~~~~~~~~

* Removed the following classes and their parent modules from
  ``scrapy.linkextractors``:

  * ``htmlparser.HtmlParserLinkExtractor``
  * ``regex.RegexLinkExtractor``
  * ``sgml.BaseSgmlLinkExtractor``
  * ``sgml.SgmlLinkExtractor``

  Use
  :class:`LinkExtractor <scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor>`
  instead (:issue:`4356`, :issue:`4679`)


Deprecations
~~~~~~~~~~~~

* The ``scrapy.utils.python.retry_on_eintr`` function is now deprecated
  (:issue:`4683`)


New features
~~~~~~~~~~~~

* :ref:`Feed exports <topics-feed-exports>` support :ref:`Google Cloud
  Storage <topics-feed-storage-gcs>` (:issue:`685`, :issue:`3608`)

* New :setting:`FEED_EXPORT_BATCH_ITEM_COUNT` setting for batch deliveries
  (:issue:`4250`, :issue:`4434`)

* The :command:`parse` command now allows specifying an output file
  (:issue:`4317`, :issue:`4377`)

* :meth:`Request.from_curl <scrapy.http.Request.from_curl>` and
  :func:`~scrapy.utils.curl.curl_to_request_kwargs` now also support
  ``--data-raw`` (:issue:`4612`)

* A ``parse`` callback may now be used in built-in spider subclasses, such
  as :class:`~scrapy.spiders.CrawlSpider` (:issue:`712`, :issue:`732`,
  :issue:`781`, :issue:`4254`)


Bug fixes
~~~~~~~~~

* Fixed the :ref:`CSV exporting <topics-feed-format-csv>` of
  :ref:`dataclass items <dataclass-items>` and :ref:`attr.s items
  <attrs-items>` (:issue:`4667`, :issue:`4668`)

* :meth:`Request.from_curl <scrapy.http.Request.from_curl>` and
  :func:`~scrapy.utils.curl.curl_to_request_kwargs` now set the request
  method to ``POST`` when a request body is specified and no request method
  is specified (:issue:`4612`)

* The processing of ANSI escape sequences is now enabled on Windows
  10.0.14393 and later, where it is required for colored output
  (:issue:`4393`, :issue:`4403`)


Documentation
~~~~~~~~~~~~~

* Updated the `OpenSSL cipher list format`_ link in the documentation about
  the :setting:`DOWNLOADER_CLIENT_TLS_CIPHERS` setting (:issue:`4653`)

* Simplified the code example in :ref:`topics-loaders-dataclass`
  (:issue:`4652`)

.. _OpenSSL cipher list format: https://www.openssl.org/docs/manmaster/man1/openssl-ciphers.html#CIPHER-LIST-FORMAT


Quality assurance
~~~~~~~~~~~~~~~~~

* The base implementation of :ref:`item loaders <topics-loaders>` has been
  moved into :doc:`itemloaders <itemloaders:index>` (:issue:`4005`,
  :issue:`4516`)

* Fixed a silenced error in some scheduler tests (:issue:`4644`,
  :issue:`4645`)

* Renewed the localhost certificate used for SSL tests (:issue:`4650`)

* Removed cookie-handling code specific to Python 2 (:issue:`4682`)

* Stopped using Python 2 unicode literal syntax (:issue:`4704`)

* Stopped using a backslash for line continuation (:issue:`4673`)

* Removed unneeded entries from the MyPy exception list (:issue:`4690`)

* Automated tests now pass on Windows as part of our continuous integration
  system (:issue:`4458`)

* Automated tests now pass on the latest PyPy version for supported Python
  versions in our continuous integration system (:issue:`4504`)


.. _release-2.2.1:

Scrapy 2.2.1 (2020-07-17)
-------------------------

* The :command:`startproject` command no longer makes unintended changes to
  the permissions of files in the destination folder, such as removing
  execution permissions (:issue:`4662`, :issue:`4666`)


.. _release-2.2.0:

Scrapy 2.2.0 (2020-06-24)
-------------------------

Highlights:

* Python 3.5.2+ is required now
* :ref:`dataclass objects <dataclass-items>` and
  :ref:`attrs objects <attrs-items>` are now valid :ref:`item types
  <item-types>`
* New :meth:`TextResponse.json <scrapy.http.TextResponse.json>` method
* New :signal:`bytes_received` signal that allows canceling response download
* :class:`~scrapy.downloadermiddlewares.cookies.CookiesMiddleware` fixes

Backward-incompatible changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* Support for Python 3.5.0 and 3.5.1 has been dropped; Scrapy now refuses to
  run with a Python version lower than 3.5.2, which introduced
  :class:`typing.Type` (:issue:`4615`)


Deprecations
~~~~~~~~~~~~

* :meth:`TextResponse.body_as_unicode
  <scrapy.http.TextResponse.body_as_unicode>` is now deprecated, use
  :attr:`TextResponse.text <scrapy.http.TextResponse.text>` instead
  (:issue:`4546`, :issue:`4555`, :issue:`4579`)

* :class:`scrapy.item.BaseItem` is now deprecated, use
  :class:`scrapy.item.Item` instead (:issue:`4534`)


New features
~~~~~~~~~~~~

* :ref:`dataclass objects <dataclass-items>` and
  :ref:`attrs objects <attrs-items>` are now valid :ref:`item types
  <item-types>`, and a new itemadapter_ library makes it easy to
  write code that :ref:`supports any item type <supporting-item-types>`
  (:issue:`2749`, :issue:`2807`, :issue:`3761`, :issue:`3881`, :issue:`4642`)

* A new :meth:`TextResponse.json <scrapy.http.TextResponse.json>` method
  allows deserializing JSON responses (:issue:`2444`, :issue:`4460`,
  :issue:`4574`)

* A new :signal:`bytes_received` signal allows monitoring response download
  progress and :ref:`stopping downloads <topics-stop-response-download>`
  (:issue:`4205`, :issue:`4559`)

* The dictionaries in the result list of a :ref:`media pipeline
  <topics-media-pipeline>` now include a new key, ``status``, which indicates
  if the file was downloaded or, if the file was not downloaded, why it was
  not downloaded; see :meth:`FilesPipeline.get_media_requests
  <scrapy.pipelines.files.FilesPipeline.get_media_requests>` for more
  information (:issue:`2893`, :issue:`4486`)

* When using :ref:`Google Cloud Storage <media-pipeline-gcs>` for
  a :ref:`media pipeline <topics-media-pipeline>`, a warning is now logged if
  the configured credentials do not grant the required permissions
  (:issue:`4346`, :issue:`4508`)

* :ref:`Link extractors <topics-link-extractors>` are now serializable,
  as long as you do not use :ref:`lambdas <lambda>` for parameters; for
  example, you can now pass link extractors in :attr:`Request.cb_kwargs
  <scrapy.http.Request.cb_kwargs>` or
  :attr:`Request.meta <scrapy.http.Request.meta>` when :ref:`persisting
  scheduled requests <topics-jobs>` (:issue:`4554`)

* Upgraded the :ref:`pickle protocol <pickle-protocols>` that Scrapy uses
  from protocol 2 to protocol 4, improving serialization capabilities and
  performance (:issue:`4135`, :issue:`4541`)

* :func:`scrapy.utils.misc.create_instance` now raises a :exc:`TypeError`
  exception if the resulting instance is ``None`` (:issue:`4528`,
  :issue:`4532`)

.. _itemadapter: https://github.com/scrapy/itemadapter


Bug fixes
~~~~~~~~~

* :class:`~scrapy.downloadermiddlewares.cookies.CookiesMiddleware` no longer
  discards cookies defined in :attr:`Request.headers
  <scrapy.http.Request.headers>` (:issue:`1992`, :issue:`2400`)

* :class:`~scrapy.downloadermiddlewares.cookies.CookiesMiddleware` no longer
  re-encodes cookies defined as :class:`bytes` in the ``cookies`` parameter
  of the ``__init__`` method of :class:`~scrapy.http.Request`
  (:issue:`2400`, :issue:`3575`)

* When :setting:`FEEDS` defines multiple URIs, :setting:`FEED_STORE_EMPTY` is
  ``False`` and the crawl yields no items, Scrapy no longer stops feed
  exports after the first URI (:issue:`4621`, :issue:`4626`)

* :class:`~scrapy.spiders.Spider` callbacks defined using :doc:`coroutine
  syntax <topics/coroutines>` no longer need to return an iterable, and may
  instead return a :class:`~scrapy.http.Request` object, an
  :ref:`item <topics-items>`, or ``None`` (:issue:`4609`)

* The :command:`startproject` command now ensures that the generated project
  folders and files have the right permissions (:issue:`4604`)

* Fixed a :exc:`KeyError` exception that was sometimes raised from
  :class:`scrapy.utils.datatypes.LocalWeakReferencedCache` (:issue:`4597`,
  :issue:`4599`)

* When :setting:`FEEDS` defines multiple URIs, log messages about items being
  stored now contain information from the corresponding feed, instead of
  always containing information about only one of the feeds (:issue:`4619`,
  :issue:`4629`)


Documentation
~~~~~~~~~~~~~

* Added a new section about :ref:`accessing cb_kwargs from errbacks
  <errback-cb_kwargs>` (:issue:`4598`, :issue:`4634`)

* Covered chompjs_ in :ref:`topics-parsing-javascript` (:issue:`4556`,
  :issue:`4562`)

* Removed from :doc:`topics/coroutines` the warning about the API being
  experimental
  (:issue:`4511`, :issue:`4513`)

* Removed references to unsupported versions of :doc:`Twisted
  <twisted:index>` (:issue:`4533`)

* Updated the description of the :ref:`screenshot pipeline example
  <ScreenshotPipeline>`, which now uses :doc:`coroutine syntax
  <topics/coroutines>` instead of returning a
  :class:`~twisted.internet.defer.Deferred` (:issue:`4514`, :issue:`4593`)

* Removed a misleading import line from the
  :func:`scrapy.utils.log.configure_logging` code example (:issue:`4510`,
  :issue:`4587`)

* The display-on-hover behavior of internal documentation references now also
  covers links to :ref:`commands <topics-commands>`, :attr:`Request.meta
  <scrapy.http.Request.meta>` keys, :ref:`settings <topics-settings>` and
  :ref:`signals <topics-signals>` (:issue:`4495`, :issue:`4563`)

* It is again possible to download the documentation for offline reading
  (:issue:`4578`, :issue:`4585`)

* Removed backslashes preceding ``*args`` and ``**kwargs`` in some function
  and method signatures (:issue:`4592`, :issue:`4596`)

.. _chompjs: https://github.com/Nykakin/chompjs


Quality assurance
~~~~~~~~~~~~~~~~~

* Adjusted the code base further to our :ref:`style guidelines
  <coding-style>` (:issue:`4237`, :issue:`4525`, :issue:`4538`,
  :issue:`4539`, :issue:`4540`, :issue:`4542`, :issue:`4543`, :issue:`4544`,
  :issue:`4545`, :issue:`4557`, :issue:`4558`, :issue:`4566`, :issue:`4568`,
  :issue:`4572`)

* Removed remnants of Python 2 support (:issue:`4550`, :issue:`4553`,
  :issue:`4568`)

* Improved code sharing between the :command:`crawl` and :command:`runspider`
  commands (:issue:`4548`, :issue:`4552`)

* Replaced ``chain(*iterable)`` with ``chain.from_iterable(iterable)``
  (:issue:`4635`)

* You may now run the :mod:`asyncio` tests with Tox on any Python version
  (:issue:`4521`)

* Updated test requirements to reflect an incompatibility with pytest 5.4 and
  5.4.1 (:issue:`4588`)

* Improved :class:`~scrapy.spiderloader.SpiderLoader` test coverage for
  scenarios involving duplicate spider names (:issue:`4549`, :issue:`4560`)

* Configured Travis CI to also run the tests with Python 3.5.2
  (:issue:`4518`, :issue:`4615`)

* Added a `Pylint <https://www.pylint.org/>`_ job to Travis CI
  (:issue:`3727`)

* Added a `Mypy <http://mypy-lang.org/>`_ job to Travis CI (:issue:`4637`)

* Made use of set literals in tests (:issue:`4573`)

* Cleaned up the Travis CI configuration (:issue:`4517`, :issue:`4519`,
  :issue:`4522`, :issue:`4537`)


.. _release-2.1.0:

Scrapy 2.1.0 (2020-04-24)
-------------------------

Highlights:

* New :setting:`FEEDS` setting to export to multiple feeds
* New :attr:`Response.ip_address <scrapy.http.Response.ip_address>` attribute

Backward-incompatible changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* :exc:`AssertionError` exceptions triggered by :ref:`assert <assert>`
  statements have been replaced by new exception types, to support running
  Python in optimized mode (see :option:`-O`) without changing Scrapy’s
  behavior in any unexpected ways.

  If you catch an :exc:`AssertionError` exception from Scrapy, update your
  code to catch the corresponding new exception.

  (:issue:`4440`)


Deprecation removals
~~~~~~~~~~~~~~~~~~~~

* The ``LOG_UNSERIALIZABLE_REQUESTS`` setting is no longer supported, use
  :setting:`SCHEDULER_DEBUG` instead (:issue:`4385`)

* The ``REDIRECT_MAX_METAREFRESH_DELAY`` setting is no longer supported, use
  :setting:`METAREFRESH_MAXDELAY` instead (:issue:`4385`)

* The :class:`~scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware`
  middleware has been removed, including the entire
  :class:`scrapy.downloadermiddlewares.chunked` module; chunked transfers
  work out of the box (:issue:`4431`)

* The ``spiders`` property has been removed from
  :class:`~scrapy.crawler.Crawler`, use :class:`CrawlerRunner.spider_loader
  <scrapy.crawler.CrawlerRunner.spider_loader>` or instantiate
  :setting:`SPIDER_LOADER_CLASS` with your settings instead (:issue:`4398`)

* The ``MultiValueDict``, ``MultiValueDictKeyError``, and ``SiteNode``
  classes have been removed from :mod:`scrapy.utils.datatypes`
  (:issue:`4400`)


Deprecations
~~~~~~~~~~~~

* The ``FEED_FORMAT`` and ``FEED_URI`` settings have been deprecated in
  favor of the new :setting:`FEEDS` setting (:issue:`1336`, :issue:`3858`,
  :issue:`4507`)


New features
~~~~~~~~~~~~

* A new setting, :setting:`FEEDS`, allows configuring multiple output feeds
  with different settings each (:issue:`1336`, :issue:`3858`, :issue:`4507`)

* The :command:`crawl` and :command:`runspider` commands now support multiple
  ``-o`` parameters (:issue:`1336`, :issue:`3858`, :issue:`4507`)

* The :command:`crawl` and :command:`runspider` commands now support
  specifying an output format by appending ``:<format>`` to the output file
  (:issue:`1336`, :issue:`3858`, :issue:`4507`)

* The new :attr:`Response.ip_address <scrapy.http.Response.ip_address>`
  attribute gives access to the IP address that originated a response
  (:issue:`3903`, :issue:`3940`)

* A warning is now issued when a value in
  :attr:`~scrapy.spiders.Spider.allowed_domains` includes a port
  (:issue:`50`, :issue:`3198`, :issue:`4413`)

* Zsh completion now excludes used option aliases from the completion list
  (:issue:`4438`)


Bug fixes
~~~~~~~~~

* :ref:`Request serialization <request-serialization>` no longer breaks for
  callbacks that are spider attributes which are assigned a function with a
  different name (:issue:`4500`)

* ``None`` values in :attr:`~scrapy.spiders.Spider.allowed_domains` no longer
  cause a :exc:`TypeError` exception (:issue:`4410`)

* Zsh completion no longer allows options after arguments (:issue:`4438`)

* zope.interface 5.0.0 and later versions are now supported
  (:issue:`4447`, :issue:`4448`)

* :meth:`Spider.make_requests_from_url
  <scrapy.spiders.Spider.make_requests_from_url>`, deprecated in Scrapy
  1.4.0, now issues a warning when used (:issue:`4412`)


Documentation
~~~~~~~~~~~~~

* Improved the documentation about signals that allow their handlers to
  return a :class:`~twisted.internet.defer.Deferred` (:issue:`4295`,
  :issue:`4390`)

* Our PyPI entry now includes links for our
documentation, our source code 994 repository and our issue tracker (:issue:`4456`) 995 996* Covered the `curl2scrapy <https://michael-shub.github.io/curl2scrapy/>`_ 997 service in the documentation (:issue:`4206`, :issue:`4455`) 998 999* Removed references to the Guppy library, which only works in Python 2 1000 (:issue:`4285`, :issue:`4343`) 1001 1002* Extended use of InterSphinx to link to Python 3 documentation 1003 (:issue:`4444`, :issue:`4445`) 1004 1005* Added support for Sphinx 3.0 and later (:issue:`4475`, :issue:`4480`, 1006 :issue:`4496`, :issue:`4503`) 1007 1008 1009Quality assurance 1010~~~~~~~~~~~~~~~~~ 1011 1012* Removed warnings about using old, removed settings (:issue:`4404`) 1013 1014* Removed a warning about importing 1015 :class:`~twisted.internet.testing.StringTransport` from 1016 ``twisted.test.proto_helpers`` in Twisted 19.7.0 or newer (:issue:`4409`) 1017 1018* Removed outdated Debian package build files (:issue:`4384`) 1019 1020* Removed :class:`object` usage as a base class (:issue:`4430`) 1021 1022* Removed code that added support for old versions of Twisted that we no 1023 longer support (:issue:`4472`) 1024 1025* Fixed code style issues (:issue:`4468`, :issue:`4469`, :issue:`4471`, 1026 :issue:`4481`) 1027 1028* Removed :func:`twisted.internet.defer.returnValue` calls (:issue:`4443`, 1029 :issue:`4446`, :issue:`4489`) 1030 1031 1032.. _release-2.0.1: 1033 1034Scrapy 2.0.1 (2020-03-18) 1035------------------------- 1036 1037* :meth:`Response.follow_all <scrapy.http.Response.follow_all>` now supports 1038 an empty URL iterable as input (:issue:`4408`, :issue:`4420`) 1039 1040* Removed top-level :mod:`~twisted.internet.reactor` imports to prevent 1041 errors about the wrong Twisted reactor being installed when setting a 1042 different Twisted reactor using :setting:`TWISTED_REACTOR` (:issue:`4401`, 1043 :issue:`4406`) 1044 1045* Fixed tests (:issue:`4422`) 1046 1047 1048.. 
_release-2.0.0: 1049 1050Scrapy 2.0.0 (2020-03-03) 1051------------------------- 1052 1053Highlights: 1054 1055* Python 2 support has been removed 1056* :doc:`Partial <topics/coroutines>` :ref:`coroutine syntax <async>` support 1057 and :doc:`experimental <topics/asyncio>` :mod:`asyncio` support 1058* New :meth:`Response.follow_all <scrapy.http.Response.follow_all>` method 1059* :ref:`FTP support <media-pipeline-ftp>` for media pipelines 1060* New :attr:`Response.certificate <scrapy.http.Response.certificate>` 1061 attribute 1062* IPv6 support through :setting:`DNS_RESOLVER` 1063 1064Backward-incompatible changes 1065~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1066 1067* Python 2 support has been removed, following `Python 2 end-of-life on 1068 January 1, 2020`_ (:issue:`4091`, :issue:`4114`, :issue:`4115`, 1069 :issue:`4121`, :issue:`4138`, :issue:`4231`, :issue:`4242`, :issue:`4304`, 1070 :issue:`4309`, :issue:`4373`) 1071 1072* Retry gaveups (see :setting:`RETRY_TIMES`) are now logged as errors instead 1073 of as debug information (:issue:`3171`, :issue:`3566`) 1074 1075* File extensions that 1076 :class:`LinkExtractor <scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor>` 1077 ignores by default now also include ``7z``, ``7zip``, ``apk``, ``bz2``, 1078 ``cdr``, ``dmg``, ``ico``, ``iso``, ``tar``, ``tar.gz``, ``webm``, and 1079 ``xz`` (:issue:`1837`, :issue:`2067`, :issue:`4066`) 1080 1081* The :setting:`METAREFRESH_IGNORE_TAGS` setting is now an empty list by 1082 default, following web browser behavior (:issue:`3844`, :issue:`4311`) 1083 1084* The 1085 :class:`~scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware` 1086 now includes spaces after commas in the value of the ``Accept-Encoding`` 1087 header that it sets, following web browser behavior (:issue:`4293`) 1088 1089* The ``__init__`` method of custom download handlers (see 1090 :setting:`DOWNLOAD_HANDLERS`) or subclasses of the following downloader 1091 handlers no longer receives a ``settings`` 
parameter: 1092 1093 * :class:`scrapy.core.downloader.handlers.datauri.DataURIDownloadHandler` 1094 1095 * :class:`scrapy.core.downloader.handlers.file.FileDownloadHandler` 1096 1097 Use the ``from_settings`` or ``from_crawler`` class methods to expose such 1098 a parameter to your custom download handlers. 1099 1100 (:issue:`4126`) 1101 1102* We have refactored the :class:`scrapy.core.scheduler.Scheduler` class and 1103 related queue classes (see :setting:`SCHEDULER_PRIORITY_QUEUE`, 1104 :setting:`SCHEDULER_DISK_QUEUE` and :setting:`SCHEDULER_MEMORY_QUEUE`) to 1105 make it easier to implement custom scheduler queue classes. See 1106 :ref:`2-0-0-scheduler-queue-changes` below for details. 1107 1108* Overridden settings are now logged in a different format. This is more in 1109 line with similar information logged at startup (:issue:`4199`) 1110 1111.. _Python 2 end-of-life on January 1, 2020: https://www.python.org/doc/sunset-python-2/ 1112 1113 1114Deprecation removals 1115~~~~~~~~~~~~~~~~~~~~ 1116 1117* The :ref:`Scrapy shell <topics-shell>` no longer provides a `sel` proxy 1118 object, use :meth:`response.selector <scrapy.http.Response.selector>` 1119 instead (:issue:`4347`) 1120 1121* LevelDB support has been removed (:issue:`4112`) 1122 1123* The following functions have been removed from :mod:`scrapy.utils.python`: 1124 ``isbinarytext``, ``is_writable``, ``setattr_default``, ``stringify_dict`` 1125 (:issue:`4362`) 1126 1127 1128Deprecations 1129~~~~~~~~~~~~ 1130 1131* Using environment variables prefixed with ``SCRAPY_`` to override settings 1132 is deprecated (:issue:`4300`, :issue:`4374`, :issue:`4375`) 1133 1134* :class:`scrapy.linkextractors.FilteringLinkExtractor` is deprecated, use 1135 :class:`scrapy.linkextractors.LinkExtractor 1136 <scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor>` instead (:issue:`4045`) 1137 1138* The ``noconnect`` query string argument of proxy URLs is deprecated and 1139 should be removed from proxy URLs (:issue:`4198`) 1140 
* The :meth:`next <scrapy.utils.python.MutableChain.next>` method of
  :class:`scrapy.utils.python.MutableChain` is deprecated, use the global
  :func:`next` function or :meth:`MutableChain.__next__
  <scrapy.utils.python.MutableChain.__next__>` instead (:issue:`4153`)


New features
~~~~~~~~~~~~

* Added :doc:`partial support <topics/coroutines>` for Python's
  :ref:`coroutine syntax <async>` and :doc:`experimental support
  <topics/asyncio>` for :mod:`asyncio` and :mod:`asyncio`-powered libraries
  (:issue:`4010`, :issue:`4259`, :issue:`4269`, :issue:`4270`, :issue:`4271`,
  :issue:`4316`, :issue:`4318`)

* The new :meth:`Response.follow_all <scrapy.http.Response.follow_all>`
  method offers the same functionality as
  :meth:`Response.follow <scrapy.http.Response.follow>` but supports an
  iterable of URLs as input and returns an iterable of requests
  (:issue:`2582`, :issue:`4057`, :issue:`4286`)

* :ref:`Media pipelines <topics-media-pipeline>` now support :ref:`FTP
  storage <media-pipeline-ftp>` (:issue:`3928`, :issue:`3961`)

* The new :attr:`Response.certificate <scrapy.http.Response.certificate>`
  attribute exposes the SSL certificate of the server as a
  :class:`twisted.internet.ssl.Certificate` object for HTTPS responses
  (:issue:`2726`, :issue:`4054`)

* A new :setting:`DNS_RESOLVER` setting allows enabling IPv6 support
  (:issue:`1031`, :issue:`4227`)

* A new :setting:`SCRAPER_SLOT_MAX_ACTIVE_SIZE` setting allows configuring
  the existing soft limit that pauses request downloads when the total
  response data being processed is too high (:issue:`1410`, :issue:`3551`)

* A new :setting:`TWISTED_REACTOR` setting allows customizing the
  :mod:`~twisted.internet.reactor` that Scrapy uses, making it possible to
  :doc:`enable asyncio support <topics/asyncio>` or deal with a
  :ref:`common macOS issue <faq-specific-reactor>` (:issue:`2905`,
  :issue:`4294`)

* Scheduler disk and memory queues may now use the class methods
  ``from_crawler`` or ``from_settings`` (:issue:`3884`)

* The new :attr:`Response.cb_kwargs <scrapy.http.Response.cb_kwargs>`
  attribute serves as a shortcut for :attr:`Response.request.cb_kwargs
  <scrapy.http.Request.cb_kwargs>` (:issue:`4331`)

* :meth:`Response.follow <scrapy.http.Response.follow>` now supports a
  ``flags`` parameter, for consistency with :class:`~scrapy.http.Request`
  (:issue:`4277`, :issue:`4279`)

* :ref:`Item loader processors <topics-loaders-processors>` can now be
  regular functions; they no longer need to be methods (:issue:`3899`)

* :class:`~scrapy.spiders.Rule` now accepts an ``errback`` parameter
  (:issue:`4000`)

* :class:`~scrapy.http.Request` no longer requires a ``callback`` parameter
  when an ``errback`` parameter is specified (:issue:`3586`, :issue:`4008`)

* :class:`~scrapy.logformatter.LogFormatter` now supports some additional
  methods:

  * :class:`~scrapy.logformatter.LogFormatter.download_error` for
    download errors

  * :class:`~scrapy.logformatter.LogFormatter.item_error` for exceptions
    raised during item processing by :ref:`item pipelines
    <topics-item-pipeline>`

  * :class:`~scrapy.logformatter.LogFormatter.spider_error` for exceptions
    raised from :ref:`spider callbacks <topics-spiders>`

  (:issue:`374`, :issue:`3986`, :issue:`3989`, :issue:`4176`, :issue:`4188`)

* The :setting:`FEED_URI` setting now supports :class:`pathlib.Path` values
  (:issue:`3731`, :issue:`4074`)

* A new :signal:`request_left_downloader` signal is sent when a request
  leaves the downloader (:issue:`4303`)

* Scrapy logs a warning when it detects a request callback or errback that
  uses ``yield`` but also returns a value, since the returned value would be
  lost (:issue:`3484`, :issue:`3869`)

* :class:`~scrapy.spiders.Spider` objects now raise an :exc:`AttributeError`
  exception if they do not have a :class:`~scrapy.spiders.Spider.start_urls`
  attribute nor reimplement :class:`~scrapy.spiders.Spider.start_requests`,
  but have a ``start_url`` attribute (:issue:`4133`, :issue:`4170`)

* :class:`~scrapy.exporters.BaseItemExporter` subclasses may now use
  ``super().__init__(**kwargs)`` instead of ``self._configure(kwargs)`` in
  their ``__init__`` method, passing ``dont_fail=True`` to the parent
  ``__init__`` method if needed, and accessing ``kwargs`` at ``self._kwargs``
  after calling their parent ``__init__`` method (:issue:`4193`,
  :issue:`4370`)

* A new ``keep_fragments`` parameter of
  :func:`scrapy.utils.request.request_fingerprint` allows generating
  different fingerprints for requests with different fragments in their URL
  (:issue:`4104`)

* Download handlers (see :setting:`DOWNLOAD_HANDLERS`) may now use the
  ``from_settings`` and ``from_crawler`` class methods that other Scrapy
  components already supported (:issue:`4126`)

* :class:`scrapy.utils.python.MutableChain.__iter__` now returns ``self``,
  `allowing it to be used as a sequence <https://lgtm.com/rules/4850080/>`_
  (:issue:`4153`)


Bug fixes
~~~~~~~~~

* The :command:`crawl` command now also exits with exit code 1 when an
  exception happens before the crawling starts (:issue:`4175`, :issue:`4207`)

* :class:`LinkExtractor.extract_links
  <scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor.extract_links>` no longer
  re-encodes the query string or URLs from non-UTF-8 responses in UTF-8
  (:issue:`998`, :issue:`1403`, :issue:`1949`, :issue:`4321`)

* The first spider middleware (see :setting:`SPIDER_MIDDLEWARES`) now also
  processes exceptions raised from callbacks that are generators
  (:issue:`4260`, :issue:`4272`)

* Redirects to URLs starting with 3 slashes (``///``) are now supported
  (:issue:`4032`, :issue:`4042`)

* :class:`~scrapy.http.Request` no longer accepts strings as ``url`` simply
  because they have a colon (:issue:`2552`, :issue:`4094`)

* The correct encoding is now used for attachment names in
  :class:`~scrapy.mail.MailSender` (:issue:`4229`, :issue:`4239`)

* :class:`~scrapy.dupefilters.RFPDupeFilter`, the default
  :setting:`DUPEFILTER_CLASS`, no longer writes an extra ``\r`` character on
  each line on Windows, which made the size of the ``requests.seen`` file
  unnecessarily large on that platform (:issue:`4283`)

* Z shell auto-completion now looks for ``.html`` files, not ``.http`` files,
  and covers the ``-h`` command-line switch (:issue:`4122`, :issue:`4291`)

* Adding items to a :class:`scrapy.utils.datatypes.LocalCache` object
  without a ``limit`` defined no longer raises a :exc:`TypeError` exception
  (:issue:`4123`)

* Fixed a typo in the message of the :exc:`ValueError` exception raised when
  :func:`scrapy.utils.misc.create_instance` gets both ``settings`` and
  ``crawler`` set to ``None`` (:issue:`4128`)


Documentation
~~~~~~~~~~~~~

* API documentation now links to an online, syntax-highlighted view of the
  corresponding source code (:issue:`4148`)

* Links to nonexistent documentation pages now allow access to the sidebar
  (:issue:`4152`, :issue:`4169`)

* Cross-references within our documentation now display a tooltip when
  hovered (:issue:`4173`, :issue:`4183`)

* Improved the documentation about :meth:`LinkExtractor.extract_links
  <scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor.extract_links>` and
  simplified :ref:`topics-link-extractors` (:issue:`4045`)

* Clarified how :class:`ItemLoader.item <scrapy.loader.ItemLoader.item>`
  works (:issue:`3574`, :issue:`4099`)

* Clarified that :func:`logging.basicConfig` should not be used when also
  using :class:`~scrapy.crawler.CrawlerProcess` (:issue:`2149`,
  :issue:`2352`, :issue:`3146`, :issue:`3960`)

* Clarified the requirements for :class:`~scrapy.http.Request` objects
  :ref:`when using persistence <request-serialization>` (:issue:`4124`,
  :issue:`4139`)

* Clarified how to install a :ref:`custom image pipeline
  <media-pipeline-example>` (:issue:`4034`, :issue:`4252`)

* Fixed the signatures of the ``file_path`` method in :ref:`media pipeline
  <topics-media-pipeline>` examples (:issue:`4290`)

* Covered a backward-incompatible change in Scrapy 1.7.0 affecting custom
  :class:`scrapy.core.scheduler.Scheduler` subclasses (:issue:`4274`)

* Improved the ``README.rst`` and ``CODE_OF_CONDUCT.md`` files
  (:issue:`4059`)

* Documentation examples are now checked as part of our test suite and we
  have fixed some of the issues detected (:issue:`4142`, :issue:`4146`,
  :issue:`4171`, :issue:`4184`, :issue:`4190`)

* Fixed logic issues, broken links and typos (:issue:`4247`, :issue:`4258`,
  :issue:`4282`, :issue:`4288`, :issue:`4305`, :issue:`4308`, :issue:`4323`,
  :issue:`4338`, :issue:`4359`, :issue:`4361`)

* Improved consistency when referring to the ``__init__`` method of an object
  (:issue:`4086`, :issue:`4088`)

* Fixed an inconsistency between code and output in :ref:`intro-overview`
  (:issue:`4213`)

* Extended :mod:`~sphinx.ext.intersphinx` usage (:issue:`4147`,
  :issue:`4172`, :issue:`4185`, :issue:`4194`, :issue:`4197`)

* We now use a recent version of Python to build the documentation
  (:issue:`4140`, :issue:`4249`)

* Cleaned up documentation (:issue:`4143`, :issue:`4275`)


Quality assurance
~~~~~~~~~~~~~~~~~

* Re-enabled proxy ``CONNECT`` tests (:issue:`2545`, :issue:`4114`)

* Added Bandit_ security checks to our test suite (:issue:`4162`,
  :issue:`4181`)

* Added Flake8_ style checks to our test suite and applied many of the
  corresponding changes (:issue:`3944`, :issue:`3945`, :issue:`4137`,
  :issue:`4157`, :issue:`4167`, :issue:`4174`, :issue:`4186`, :issue:`4195`,
  :issue:`4238`, :issue:`4246`, :issue:`4355`, :issue:`4360`, :issue:`4365`)

* Improved test coverage (:issue:`4097`, :issue:`4218`, :issue:`4236`)

* Started reporting the slowest tests, and improved the performance of some
  of them (:issue:`4163`, :issue:`4164`)

* Fixed broken tests and refactored some tests (:issue:`4014`, :issue:`4095`,
  :issue:`4244`, :issue:`4268`, :issue:`4372`)

* Modified the :doc:`tox <tox:index>` configuration to allow running tests
  with any Python version, run Bandit_ and Flake8_ tests by default, and
  enforce a minimum tox version programmatically (:issue:`4179`)

* Cleaned up code (:issue:`3937`, :issue:`4208`, :issue:`4209`,
  :issue:`4210`, :issue:`4212`, :issue:`4369`, :issue:`4376`, :issue:`4378`)

.. _Bandit: https://bandit.readthedocs.io/
.. _Flake8: https://flake8.pycqa.org/en/latest/


.. _2-0-0-scheduler-queue-changes:

Changes to scheduler queue classes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following changes may impact any custom queue classes of all types:

* The ``push`` method no longer receives a second positional parameter
  containing ``request.priority * -1``. If you need that value, get it
  from the first positional parameter, ``request``, instead, or use
  the new :meth:`~scrapy.core.scheduler.ScrapyPriorityQueue.priority`
  method in :class:`scrapy.core.scheduler.ScrapyPriorityQueue`
  subclasses.

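
The new ``push`` contract can be sketched with a toy stand-in (the class and
attribute names below are hypothetical illustrations, not the real Scrapy
classes referenced above):

```python
class FakeRequest:
    """Minimal stand-in for scrapy.http.Request, for illustration only."""

    def __init__(self, url, priority=0):
        self.url = url
        self.priority = priority


class ToyPriorityQueue:
    """Illustrates the new contract: ``push`` receives only the request."""

    def priority(self, request):
        # The same derivation the new ``priority`` method performs.
        return -request.priority

    def push(self, request):
        # No second positional parameter anymore; derive the value here
        # from the request itself.
        return self.priority(request)


queue = ToyPriorityQueue()
assert queue.push(FakeRequest("https://example.com", priority=5)) == -5
```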
The following changes may impact custom priority queue classes:

* In the ``__init__`` method or the ``from_crawler`` or ``from_settings``
  class methods:

  * The parameter that used to contain a factory function,
    ``qfactory``, is now passed as a keyword parameter named
    ``downstream_queue_cls``.

  * A new keyword parameter has been added: ``key``. It is a string
    that is always an empty string for memory queues and indicates the
    :setting:`JOB_DIR` value for disk queues.

  * The parameter for disk queues that contains data from the previous
    crawl, ``startprios`` or ``slot_startprios``, is now passed as a
    keyword parameter named ``startprios``.

  * The ``serialize`` parameter is no longer passed. The disk queue
    class must take care of request serialization on its own before
    writing to disk, using the
    :func:`~scrapy.utils.reqser.request_to_dict` and
    :func:`~scrapy.utils.reqser.request_from_dict` functions from the
    :mod:`scrapy.utils.reqser` module.

The following changes may impact custom disk and memory queue classes:

* The signature of the ``__init__`` method is now
  ``__init__(self, crawler, key)``.

The following changes affect specifically the
:class:`~scrapy.core.scheduler.ScrapyPriorityQueue` and
:class:`~scrapy.core.scheduler.DownloaderAwarePriorityQueue` classes from
:mod:`scrapy.core.scheduler` and may affect subclasses:

* In the ``__init__`` method, most of the changes described above apply.

  ``__init__`` may still receive all parameters as positional parameters,
  however:

  * ``downstream_queue_cls``, which replaced ``qfactory``, must be
    instantiated differently.

    ``qfactory`` was instantiated with a priority value (integer).

    Instances of ``downstream_queue_cls`` should be created using
    the new
    :meth:`ScrapyPriorityQueue.qfactory <scrapy.core.scheduler.ScrapyPriorityQueue.qfactory>`
    or
    :meth:`DownloaderAwarePriorityQueue.pqfactory <scrapy.core.scheduler.DownloaderAwarePriorityQueue.pqfactory>`
    methods.

  * The new ``key`` parameter displaced the ``startprios``
    parameter one position to the right.

* The following class attributes have been added:

  * :attr:`~scrapy.core.scheduler.ScrapyPriorityQueue.crawler`

  * :attr:`~scrapy.core.scheduler.ScrapyPriorityQueue.downstream_queue_cls`
    (details above)

  * :attr:`~scrapy.core.scheduler.ScrapyPriorityQueue.key` (details above)

* The ``serialize`` attribute has been removed (details above)

The following changes affect specifically the
:class:`~scrapy.core.scheduler.ScrapyPriorityQueue` class and may affect
subclasses:

* A new :meth:`~scrapy.core.scheduler.ScrapyPriorityQueue.priority`
  method has been added which, given a request, returns
  ``request.priority * -1``.

  It is used in :meth:`~scrapy.core.scheduler.ScrapyPriorityQueue.push`
  to make up for the removal of its ``priority`` parameter.

* The ``spider`` attribute has been removed. Use
  :attr:`crawler.spider <scrapy.core.scheduler.ScrapyPriorityQueue.crawler>`
  instead.

The following changes affect specifically the
:class:`~scrapy.core.scheduler.DownloaderAwarePriorityQueue` class and may
affect subclasses:

* A new :attr:`~scrapy.core.scheduler.DownloaderAwarePriorityQueue.pqueues`
  attribute offers a mapping of downloader slot names to the
  corresponding instances of
  :attr:`~scrapy.core.scheduler.DownloaderAwarePriorityQueue.downstream_queue_cls`.

(:issue:`3884`)


.. _release-1.8.0:

Scrapy 1.8.0 (2019-10-28)
-------------------------

Highlights:

* Dropped Python 3.4 support and updated minimum requirements; made Python
  3.8 support official
* New :meth:`Request.from_curl <scrapy.http.Request.from_curl>` class method
* New :setting:`ROBOTSTXT_PARSER` and :setting:`ROBOTSTXT_USER_AGENT` settings
* New :setting:`DOWNLOADER_CLIENT_TLS_CIPHERS` and
  :setting:`DOWNLOADER_CLIENT_TLS_VERBOSE_LOGGING` settings

Backward-incompatible changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* Python 3.4 is no longer supported, and some of the minimum requirements of
  Scrapy have also changed:

  * :doc:`cssselect <cssselect:index>` 0.9.1
  * cryptography_ 2.0
  * lxml_ 3.5.0
  * pyOpenSSL_ 16.2.0
  * queuelib_ 1.4.2
  * service_identity_ 16.0.0
  * six_ 1.10.0
  * Twisted_ 17.9.0 (16.0.0 with Python 2)
  * zope.interface_ 4.1.3

  (:issue:`3892`)

* ``JSONRequest`` is now called :class:`~scrapy.http.JsonRequest` for
  consistency with similar classes (:issue:`3929`, :issue:`3982`)

* If you are using a custom context factory
  (:setting:`DOWNLOADER_CLIENTCONTEXTFACTORY`), its ``__init__`` method must
  accept two new parameters: ``tls_verbose_logging`` and ``tls_ciphers``
  (:issue:`2111`, :issue:`3392`, :issue:`3442`, :issue:`3450`)

* :class:`~scrapy.loader.ItemLoader` now turns the values of its input item
  into lists:

  >>> item = MyItem()
  >>> item['field'] = 'value1'
  >>> loader = ItemLoader(item=item)
  >>> item['field']
  ['value1']

  This is needed to allow adding values to existing fields
  (``loader.add_value('field', 'value2')``).

  (:issue:`3804`, :issue:`3819`, :issue:`3897`, :issue:`3976`, :issue:`3998`,
  :issue:`4036`)

See also :ref:`1.8-deprecation-removals` below.
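
The list coercion described above can be mimicked without Scrapy itself. The
following is a toy stand-in for illustration only, not the real
:class:`~scrapy.loader.ItemLoader` (which also applies input and output
processors):

```python
class ToyLoader:
    """Toy illustration of the 1.8 behavior: input item values become
    lists so that later ``add_value`` calls can append to them."""

    def __init__(self, item):
        # Wrap every pre-existing scalar value in a list, as ItemLoader
        # now does with the values of its input item.
        self.item = {
            key: value if isinstance(value, list) else [value]
            for key, value in item.items()
        }

    def add_value(self, field, value):
        # Appending works because the field already holds a list.
        self.item.setdefault(field, []).append(value)


loader = ToyLoader({"field": "value1"})
loader.add_value("field", "value2")
assert loader.item["field"] == ["value1", "value2"]
```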

New features
~~~~~~~~~~~~

* A new :meth:`Request.from_curl <scrapy.http.Request.from_curl>` class
  method allows :ref:`creating a request from a cURL command
  <requests-from-curl>` (:issue:`2985`, :issue:`3862`)

* A new :setting:`ROBOTSTXT_PARSER` setting allows choosing which robots.txt_
  parser to use. It includes built-in support for
  :ref:`RobotFileParser <python-robotfileparser>`,
  :ref:`Protego <protego-parser>` (default), :ref:`Reppy <reppy-parser>`, and
  :ref:`Robotexclusionrulesparser <rerp-parser>`, and allows you to
  :ref:`implement support for additional parsers
  <support-for-new-robots-parser>` (:issue:`754`, :issue:`2669`,
  :issue:`3796`, :issue:`3935`, :issue:`3969`, :issue:`4006`)

* A new :setting:`ROBOTSTXT_USER_AGENT` setting allows defining a separate
  user agent string to use for robots.txt_ parsing (:issue:`3931`,
  :issue:`3966`)

* :class:`~scrapy.spiders.Rule` no longer requires a :class:`LinkExtractor
  <scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor>` parameter
  (:issue:`781`, :issue:`4016`)

* Use the new :setting:`DOWNLOADER_CLIENT_TLS_CIPHERS` setting to customize
  the TLS/SSL ciphers used by the default HTTP/1.1 downloader (:issue:`3392`,
  :issue:`3442`)

* Set the new :setting:`DOWNLOADER_CLIENT_TLS_VERBOSE_LOGGING` setting to
  ``True`` to enable debug-level messages about TLS connection parameters
  after establishing HTTPS connections (:issue:`2111`, :issue:`3450`)

* Callbacks that receive keyword arguments
  (see :attr:`Request.cb_kwargs <scrapy.http.Request.cb_kwargs>`) can now be
  tested using the new :class:`@cb_kwargs
  <scrapy.contracts.default.CallbackKeywordArgumentsContract>`
  :ref:`spider contract <topics-contracts>` (:issue:`3985`, :issue:`3988`)

* When a :class:`@scrapes <scrapy.contracts.default.ScrapesContract>` spider
  contract fails, all missing fields are now reported (:issue:`766`,
  :issue:`3939`)

* :ref:`Custom log formats <custom-log-formats>` can now drop messages by
  having the corresponding methods of the configured :setting:`LOG_FORMATTER`
  return ``None`` (:issue:`3984`, :issue:`3987`)

* A much improved completion definition is now available for Zsh_
  (:issue:`4069`)


Bug fixes
~~~~~~~~~

* :meth:`ItemLoader.load_item() <scrapy.loader.ItemLoader.load_item>` no
  longer makes later calls to :meth:`ItemLoader.get_output_value()
  <scrapy.loader.ItemLoader.get_output_value>` or
  :meth:`ItemLoader.load_item() <scrapy.loader.ItemLoader.load_item>` return
  empty data (:issue:`3804`, :issue:`3819`, :issue:`3897`, :issue:`3976`,
  :issue:`3998`, :issue:`4036`)

* Fixed :class:`~scrapy.statscollectors.DummyStatsCollector` raising a
  :exc:`TypeError` exception (:issue:`4007`, :issue:`4052`)

* :meth:`FilesPipeline.file_path
  <scrapy.pipelines.files.FilesPipeline.file_path>` and
  :meth:`ImagesPipeline.file_path
  <scrapy.pipelines.images.ImagesPipeline.file_path>` no longer choose
  file extensions that are not `registered with IANA`_ (:issue:`1287`,
  :issue:`3953`, :issue:`3954`)

* When using botocore_ to persist files in S3, all botocore-supported headers
  are properly mapped now (:issue:`3904`, :issue:`3905`)

* FTP passwords in :setting:`FEED_URI` containing percent-escaped characters
  are now properly decoded (:issue:`3941`)

* A memory-handling and error-handling issue in
  :func:`scrapy.utils.ssl.get_temp_key_info` has been fixed (:issue:`3920`)


Documentation
~~~~~~~~~~~~~

* The documentation now covers how to define and configure a :ref:`custom log
  format <custom-log-formats>` (:issue:`3616`, :issue:`3660`)

* API documentation added for :class:`~scrapy.exporters.MarshalItemExporter`
  and :class:`~scrapy.exporters.PythonItemExporter` (:issue:`3973`)

* API documentation added for :class:`~scrapy.item.BaseItem` and
  :class:`~scrapy.item.ItemMeta` (:issue:`3999`)

* Minor documentation fixes (:issue:`2998`, :issue:`3398`, :issue:`3597`,
  :issue:`3894`, :issue:`3934`, :issue:`3978`, :issue:`3993`, :issue:`4022`,
  :issue:`4028`, :issue:`4033`, :issue:`4046`, :issue:`4050`, :issue:`4055`,
  :issue:`4056`, :issue:`4061`, :issue:`4072`, :issue:`4071`, :issue:`4079`,
  :issue:`4081`, :issue:`4089`, :issue:`4093`)


.. _1.8-deprecation-removals:

Deprecation removals
~~~~~~~~~~~~~~~~~~~~

* ``scrapy.xlib`` has been removed (:issue:`4015`)


.. _1.8-deprecations:

Deprecations
~~~~~~~~~~~~

* The LevelDB_ storage backend
  (``scrapy.extensions.httpcache.LeveldbCacheStorage``) of
  :class:`~scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware` is
  deprecated (:issue:`4085`, :issue:`4092`)

* Use of the undocumented ``SCRAPY_PICKLED_SETTINGS_TO_OVERRIDE`` environment
  variable is deprecated (:issue:`3910`)

* ``scrapy.item.DictItem`` is deprecated, use :class:`~scrapy.item.Item`
  instead (:issue:`3999`)


Other changes
~~~~~~~~~~~~~

* Minimum versions of optional Scrapy requirements that are covered by
  continuous integration tests have been updated:

  * botocore_ 1.3.23
  * Pillow_ 3.4.2

  Lower versions of these optional requirements may work, but it is not
  guaranteed (:issue:`3892`)

* GitHub templates for bug reports and feature requests (:issue:`3126`,
  :issue:`3471`, :issue:`3749`, :issue:`3754`)

* Continuous integration fixes (:issue:`3923`)

* Code cleanup (:issue:`3391`, :issue:`3907`, :issue:`3946`, :issue:`3950`,
  :issue:`4023`, :issue:`4031`)


.. _release-1.7.4:

Scrapy 1.7.4 (2019-10-21)
-------------------------

Revert the fix for :issue:`3804` (:issue:`3819`), which has a few undesired
side effects (:issue:`3897`, :issue:`3976`).

As a result, when an item loader is initialized with an item,
:meth:`ItemLoader.load_item() <scrapy.loader.ItemLoader.load_item>` once again
makes later calls to :meth:`ItemLoader.get_output_value()
<scrapy.loader.ItemLoader.get_output_value>` or :meth:`ItemLoader.load_item()
<scrapy.loader.ItemLoader.load_item>` return empty data.


.. _release-1.7.3:

Scrapy 1.7.3 (2019-08-01)
-------------------------

Enforce lxml 4.3.5 or lower for Python 3.4 (:issue:`3912`, :issue:`3918`).


.. _release-1.7.2:

Scrapy 1.7.2 (2019-07-23)
-------------------------

Fix Python 2 support (:issue:`3889`, :issue:`3893`, :issue:`3896`).


.. _release-1.7.1:

Scrapy 1.7.1 (2019-07-18)
-------------------------

Re-packaging of Scrapy 1.7.0, which was missing some changes in PyPI.


.. _release-1.7.0:

Scrapy 1.7.0 (2019-07-18)
-------------------------

.. note:: Make sure you install Scrapy 1.7.1. The Scrapy 1.7.0 package in PyPI
          is the result of an erroneous commit tagging and does not include
          all the changes described below.

Highlights:

* Improvements for crawls targeting multiple domains
* A cleaner way to pass arguments to callbacks
* A new class for JSON requests
* Improvements for rule-based spiders
* New features for feed exports

Backward-incompatible changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* ``429`` is now part of the :setting:`RETRY_HTTP_CODES` setting by default.

  This change is **backward incompatible**. If you don't want to retry
  ``429``, you must override :setting:`RETRY_HTTP_CODES` accordingly.

1761 1762* :class:`~scrapy.crawler.Crawler`, 1763 :class:`CrawlerRunner.crawl <scrapy.crawler.CrawlerRunner.crawl>` and 1764 :class:`CrawlerRunner.create_crawler <scrapy.crawler.CrawlerRunner.create_crawler>` 1765 no longer accept a :class:`~scrapy.spiders.Spider` subclass instance, they 1766 only accept a :class:`~scrapy.spiders.Spider` subclass now. 1767 1768 :class:`~scrapy.spiders.Spider` subclass instances were never meant to 1769 work, and they were not working as one would expect: instead of using the 1770 passed :class:`~scrapy.spiders.Spider` subclass instance, their 1771 :class:`~scrapy.spiders.Spider.from_crawler` method was called to generate 1772 a new instance. 1773 1774* Non-default values for the :setting:`SCHEDULER_PRIORITY_QUEUE` setting 1775 may stop working. Scheduler priority queue classes now need to handle 1776 :class:`~scrapy.http.Request` objects instead of arbitrary Python data 1777 structures. 1778 1779* An additional ``crawler`` parameter has been added to the ``__init__`` 1780 method of the :class:`~scrapy.core.scheduler.Scheduler` class. Custom 1781 scheduler subclasses which don't accept arbitrary parameters in their 1782 ``__init__`` method might break because of this change. 1783 1784 For more information, see :setting:`SCHEDULER`. 1785 1786See also :ref:`1.7-deprecation-removals` below. 
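On the first of the changes above: if you do not want ``429`` responses
retried, a ``settings.py`` override along these lines should work. This is a
sketch; the listed codes mirror the documented pre-1.7 default, so
double-check them against the Scrapy version you run::

```python
# settings.py -- sketch: opt out of the new default retrying of 429.
# These codes mirror the pre-1.7 RETRY_HTTP_CODES default; verify them
# against the settings documentation of your Scrapy version.
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408]
```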

New features
~~~~~~~~~~~~

* A new scheduler priority queue,
  ``scrapy.pqueues.DownloaderAwarePriorityQueue``, may be
  :ref:`enabled <broad-crawls-scheduler-priority-queue>` for a significant
  scheduling improvement on crawls targeting multiple web domains, at the
  cost of no :setting:`CONCURRENT_REQUESTS_PER_IP` support (:issue:`3520`)

* A new :attr:`Request.cb_kwargs <scrapy.http.Request.cb_kwargs>` attribute
  provides a cleaner way to pass keyword arguments to callback methods
  (:issue:`1138`, :issue:`3563`)

* A new :class:`JSONRequest <scrapy.http.JsonRequest>` class offers a more
  convenient way to build JSON requests (:issue:`3504`, :issue:`3505`)

* A ``process_request`` callback passed to the :class:`~scrapy.spiders.Rule`
  ``__init__`` method now receives the :class:`~scrapy.http.Response` object that
  originated the request as its second argument (:issue:`3682`)

* A new ``restrict_text`` parameter for the
  :attr:`LinkExtractor <scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor>`
  ``__init__`` method allows filtering links by linking text (:issue:`3622`,
  :issue:`3635`)

* A new :setting:`FEED_STORAGE_S3_ACL` setting allows defining a custom ACL
  for feeds exported to Amazon S3 (:issue:`3607`)

* A new :setting:`FEED_STORAGE_FTP_ACTIVE` setting allows using FTP's active
  connection mode for feeds exported to FTP servers (:issue:`3829`)

* A new :setting:`METAREFRESH_IGNORE_TAGS` setting allows overriding which
  HTML tags are ignored when searching a response for HTML meta tags that
  trigger a redirect (:issue:`1422`, :issue:`3768`)

* A new :reqmeta:`redirect_reasons` request meta key exposes the reason
  (status code, meta refresh) behind every followed redirect (:issue:`3581`,
  :issue:`3687`)

* The ``SCRAPY_CHECK`` variable is now set to the ``true`` string during runs
  of the :command:`check` command, which allows :ref:`detecting contract
  check runs from code <detecting-contract-check-runs>` (:issue:`3704`,
  :issue:`3739`)

* A new :meth:`Item.deepcopy() <scrapy.item.Item.deepcopy>` method makes it
  easier to :ref:`deep-copy items <copying-items>` (:issue:`1493`,
  :issue:`3671`)

* :class:`~scrapy.extensions.corestats.CoreStats` also logs
  ``elapsed_time_seconds`` now (:issue:`3638`)

* Exceptions from :class:`~scrapy.loader.ItemLoader` :ref:`input and output
  processors <topics-loaders-processors>` are now more verbose
  (:issue:`3836`, :issue:`3840`)

* :class:`~scrapy.crawler.Crawler`,
  :class:`CrawlerRunner.crawl <scrapy.crawler.CrawlerRunner.crawl>` and
  :class:`CrawlerRunner.create_crawler <scrapy.crawler.CrawlerRunner.create_crawler>`
  now fail gracefully if they receive a :class:`~scrapy.spiders.Spider`
  subclass instance instead of the subclass itself (:issue:`2283`,
  :issue:`3610`, :issue:`3872`)


Bug fixes
~~~~~~~~~

* :meth:`~scrapy.spidermiddlewares.SpiderMiddleware.process_spider_exception`
  is now also invoked for generators (:issue:`220`, :issue:`2061`)

* System exceptions like KeyboardInterrupt_ are no longer caught
  (:issue:`3726`)

* :meth:`ItemLoader.load_item() <scrapy.loader.ItemLoader.load_item>` no
  longer makes later calls to :meth:`ItemLoader.get_output_value()
  <scrapy.loader.ItemLoader.get_output_value>` or
  :meth:`ItemLoader.load_item() <scrapy.loader.ItemLoader.load_item>` return
  empty data (:issue:`3804`, :issue:`3819`)

* The images pipeline (:class:`~scrapy.pipelines.images.ImagesPipeline`) no
  longer ignores these Amazon S3 settings: :setting:`AWS_ENDPOINT_URL`,
  :setting:`AWS_REGION_NAME`, :setting:`AWS_USE_SSL`, :setting:`AWS_VERIFY`
  (:issue:`3625`)

* Fixed a memory leak in ``scrapy.pipelines.media.MediaPipeline`` affecting,
  for example, non-200 responses and exceptions from custom middlewares
  (:issue:`3813`)

* Requests with private callbacks are now correctly unserialized from disk
  (:issue:`3790`)

* :meth:`FormRequest.from_response() <scrapy.http.FormRequest.from_response>`
  now handles invalid methods like major web browsers (:issue:`3777`,
  :issue:`3794`)


Documentation
~~~~~~~~~~~~~

* A new topic, :ref:`topics-dynamic-content`, covers recommended approaches
  to read dynamically-loaded data (:issue:`3703`)

* :ref:`topics-broad-crawls` now features information about memory usage
  (:issue:`1264`, :issue:`3866`)

* The documentation of :class:`~scrapy.spiders.Rule` now covers how to access
  the text of a link when using :class:`~scrapy.spiders.CrawlSpider`
  (:issue:`3711`, :issue:`3712`)

* A new section, :ref:`httpcache-storage-custom`, covers writing a custom
  cache storage backend for
  :class:`~scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware`
  (:issue:`3683`, :issue:`3692`)

* A new :ref:`FAQ <faq>` entry, :ref:`faq-split-item`, explains what to do
  when you want to split an item into multiple items from an item pipeline
  (:issue:`2240`, :issue:`3672`)

* Updated the :ref:`FAQ entry about crawl order <faq-bfo-dfo>` to explain why
  the first few requests rarely follow the desired order (:issue:`1739`,
  :issue:`3621`)

* The :setting:`LOGSTATS_INTERVAL` setting (:issue:`3730`), the
  :meth:`FilesPipeline.file_path <scrapy.pipelines.files.FilesPipeline.file_path>`
  and
  :meth:`ImagesPipeline.file_path <scrapy.pipelines.images.ImagesPipeline.file_path>`
  methods (:issue:`2253`, :issue:`3609`) and the
  :meth:`Crawler.stop() <scrapy.crawler.Crawler.stop>` method (:issue:`3842`)
  are now documented

* Some parts of the documentation that were confusing or misleading are now
  clearer (:issue:`1347`, :issue:`1789`, :issue:`2289`, :issue:`3069`,
  :issue:`3615`, :issue:`3626`, :issue:`3668`, :issue:`3670`, :issue:`3673`,
  :issue:`3728`, :issue:`3762`, :issue:`3861`, :issue:`3882`)

* Minor documentation fixes (:issue:`3648`, :issue:`3649`, :issue:`3662`,
  :issue:`3674`, :issue:`3676`, :issue:`3694`, :issue:`3724`, :issue:`3764`,
  :issue:`3767`, :issue:`3791`, :issue:`3797`, :issue:`3806`, :issue:`3812`)

.. _1.7-deprecation-removals:

Deprecation removals
~~~~~~~~~~~~~~~~~~~~

The following deprecated APIs have been removed (:issue:`3578`):

* ``scrapy.conf`` (use :attr:`Crawler.settings
  <scrapy.crawler.Crawler.settings>`)

* From ``scrapy.core.downloader.handlers``:

  * ``http.HttpDownloadHandler`` (use ``http10.HTTP10DownloadHandler``)

* ``scrapy.loader.ItemLoader._get_values`` (use ``_get_xpathvalues``)

* ``scrapy.loader.XPathItemLoader`` (use :class:`~scrapy.loader.ItemLoader`)

* ``scrapy.log`` (see :ref:`topics-logging`)

* From ``scrapy.pipelines``:

  * ``files.FilesPipeline.file_key`` (use ``file_path``)

  * ``images.ImagesPipeline.file_key`` (use ``file_path``)

  * ``images.ImagesPipeline.image_key`` (use ``file_path``)

  * ``images.ImagesPipeline.thumb_key`` (use ``thumb_path``)

* From both ``scrapy.selector`` and ``scrapy.selector.lxmlsel``:

  * ``HtmlXPathSelector`` (use :class:`~scrapy.selector.Selector`)

  * ``XmlXPathSelector`` (use :class:`~scrapy.selector.Selector`)

  * ``XPathSelector`` (use :class:`~scrapy.selector.Selector`)

  * ``XPathSelectorList`` (use :class:`~scrapy.selector.Selector`)

* From ``scrapy.selector.csstranslator``:

  * ``ScrapyGenericTranslator`` (use parsel.csstranslator.GenericTranslator_)

  * ``ScrapyHTMLTranslator`` (use parsel.csstranslator.HTMLTranslator_)

  * ``ScrapyXPathExpr`` (use parsel.csstranslator.XPathExpr_)

* From :class:`~scrapy.selector.Selector`:

  * ``_root`` (both the ``__init__`` method argument and the object property, use
    ``root``)

  * ``extract_unquoted`` (use ``getall``)

  * ``select`` (use ``xpath``)

* From :class:`~scrapy.selector.SelectorList`:

  * ``extract_unquoted`` (use ``getall``)

  * ``select`` (use ``xpath``)

  * ``x`` (use ``xpath``)

* ``scrapy.spiders.BaseSpider`` (use :class:`~scrapy.spiders.Spider`)

* From :class:`~scrapy.spiders.Spider` (and subclasses):

  * ``DOWNLOAD_DELAY`` (use :ref:`download_delay
    <spider-download_delay-attribute>`)

  * ``set_crawler`` (use :meth:`~scrapy.spiders.Spider.from_crawler`)

* ``scrapy.spiders.spiders`` (use :class:`~scrapy.spiderloader.SpiderLoader`)

* ``scrapy.telnet`` (use :mod:`scrapy.extensions.telnet`)

* From ``scrapy.utils.python``:

  * ``str_to_unicode`` (use ``to_unicode``)

  * ``unicode_to_str`` (use ``to_bytes``)

* ``scrapy.utils.response.body_or_str``

The following deprecated settings have also been removed (:issue:`3578`):

* ``SPIDER_MANAGER_CLASS`` (use :setting:`SPIDER_LOADER_CLASS`)


.. _1.7-deprecations:

Deprecations
~~~~~~~~~~~~

* The ``queuelib.PriorityQueue`` value for the
  :setting:`SCHEDULER_PRIORITY_QUEUE` setting is deprecated. Use
  ``scrapy.pqueues.ScrapyPriorityQueue`` instead.

* ``process_request`` callbacks passed to :class:`~scrapy.spiders.Rule` that
  do not accept two arguments are deprecated.

* The following modules are deprecated:

  * ``scrapy.utils.http`` (use `w3lib.http`_)

  * ``scrapy.utils.markup`` (use `w3lib.html`_)

  * ``scrapy.utils.multipart`` (use `urllib3`_)

* The ``scrapy.utils.datatypes.MergeDict`` class is deprecated for Python 3
  code bases.
  Use :class:`~collections.ChainMap` instead. (:issue:`3878`)

* The ``scrapy.utils.gz.is_gzipped`` function is deprecated. Use
  ``scrapy.utils.gz.gzip_magic_number`` instead.

.. _urllib3: https://urllib3.readthedocs.io/en/latest/index.html
.. _w3lib.html: https://w3lib.readthedocs.io/en/latest/w3lib.html#module-w3lib.html
.. _w3lib.http: https://w3lib.readthedocs.io/en/latest/w3lib.html#module-w3lib.http


Other changes
~~~~~~~~~~~~~

* It is now possible to run all tests from the same tox_ environment in
  parallel; the documentation now covers :ref:`this and other ways to run
  tests <running-tests>` (:issue:`3707`)

* It is now possible to generate an API documentation coverage report
  (:issue:`3806`, :issue:`3810`, :issue:`3860`)

* The :ref:`documentation policies <documentation-policies>` now require
  docstrings_ (:issue:`3701`) that follow `PEP 257`_ (:issue:`3748`)

* Internal fixes and cleanup (:issue:`3629`, :issue:`3643`, :issue:`3684`,
  :issue:`3698`, :issue:`3734`, :issue:`3735`, :issue:`3736`, :issue:`3737`,
  :issue:`3809`, :issue:`3821`, :issue:`3825`, :issue:`3827`, :issue:`3833`,
  :issue:`3857`, :issue:`3877`)

.. _release-1.6.0:

Scrapy 1.6.0 (2019-01-30)
-------------------------

Highlights:

* better Windows support;
* Python 3.7 compatibility;
* big documentation improvements, including a switch
  from the ``.extract_first()`` + ``.extract()`` API to the ``.get()`` +
  ``.getall()`` API;
* feed exports, FilePipeline and MediaPipeline improvements;
* better extensibility: :signal:`item_error` and
  :signal:`request_reached_downloader` signals; ``from_crawler`` support
  for feed exporters, feed storages and dupefilters;
* ``scrapy.contracts`` fixes and new features;
* telnet console security improvements, first released as a
  backport in :ref:`release-1.5.2`;
* clean-up of deprecated code;
* various bug fixes, small new features and usability improvements across
  the codebase.

Selector API changes
~~~~~~~~~~~~~~~~~~~~

While these are not changes in Scrapy itself, but rather in the parsel_
library which Scrapy uses for XPath/CSS selectors, they are worth mentioning
here. Scrapy now depends on parsel >= 1.5, and the Scrapy documentation has
been updated to follow recent ``parsel`` API conventions.

The most visible change is that the ``.get()`` and ``.getall()`` selector
methods are now preferred over ``.extract_first()`` and ``.extract()``.
We feel that these new methods result in more concise and readable code.
See :ref:`old-extraction-api` for more details.

.. note::
   There are currently **no plans** to deprecate the ``.extract()``
   and ``.extract_first()`` methods.

Another useful new feature is the introduction of the ``Selector.attrib`` and
``SelectorList.attrib`` properties, which make it easier to get
attributes of HTML elements. See :ref:`selecting-attributes`.

CSS selectors are cached in parsel >= 1.5, which makes them faster
when the same CSS path is used many times. This is very common in
Scrapy spiders: callbacks are usually called several times,
on different pages.

If you're using custom ``Selector`` or ``SelectorList`` subclasses,
a **backward incompatible** change in parsel may affect your code.
See the `parsel changelog`_ for a detailed description, as well as for the
full list of improvements.

.. _parsel changelog: https://parsel.readthedocs.io/en/latest/history.html

Telnet console
~~~~~~~~~~~~~~

**Backward incompatible**: Scrapy's telnet console now requires a username
and password. See :ref:`topics-telnetconsole` for more details. This change
fixes a **security issue**; see the :ref:`release-1.5.2` release notes for
details.

New extensibility features
~~~~~~~~~~~~~~~~~~~~~~~~~~

* ``from_crawler`` support is added to feed exporters and feed storages. This,
  among other things, makes it possible to access Scrapy settings from custom
  feed storages and exporters (:issue:`1605`, :issue:`3348`).
* ``from_crawler`` support is added to dupefilters (:issue:`2956`); this makes
  it possible to access e.g. settings or a spider from a dupefilter.
* :signal:`item_error` is fired when an error happens in a pipeline
  (:issue:`3256`);
* :signal:`request_reached_downloader` is fired when the Downloader gets
  a new Request; this signal can be useful e.g. for custom Schedulers
  (:issue:`3393`).
* The new SitemapSpider :meth:`~.SitemapSpider.sitemap_filter` method allows
  selecting sitemap entries based on their attributes in SitemapSpider
  subclasses (:issue:`3512`).
* Lazy loading of Downloader Handlers is now optional; this enables better
  initialization error handling in custom Downloader Handlers (:issue:`3394`).

New FilePipeline and MediaPipeline features
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* Expose more options for S3FilesStore: :setting:`AWS_ENDPOINT_URL`,
  :setting:`AWS_USE_SSL`, :setting:`AWS_VERIFY`, :setting:`AWS_REGION_NAME`.
  For example, this makes it possible to use alternative or self-hosted
  AWS-compatible providers (:issue:`2609`, :issue:`3548`).
* ACL support for Google Cloud Storage: :setting:`FILES_STORE_GCS_ACL` and
  :setting:`IMAGES_STORE_GCS_ACL` (:issue:`3199`).
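As a sketch of how the new S3FilesStore options might be combined, the
following ``settings.py`` fragment points the files pipeline at a self-hosted,
S3-compatible store; the bucket name and endpoint URL are hypothetical
placeholders, not recommended values:

```python
# settings.py -- sketch: self-hosted S3-compatible storage for FilesPipeline.
# "my-bucket" and the localhost endpoint are placeholders.
FILES_STORE = "s3://my-bucket/files/"
AWS_ENDPOINT_URL = "http://localhost:9000"  # custom S3-compatible endpoint
AWS_USE_SSL = False    # the placeholder endpoint speaks plain HTTP
AWS_VERIFY = False     # nothing to verify over plain HTTP
AWS_REGION_NAME = "us-east-1"
```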

``scrapy.contracts`` improvements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* Exceptions in contracts code are handled better (:issue:`3377`);
* ``dont_filter=True`` is used for contract requests, which allows testing
  different callbacks with the same URL (:issue:`3381`);
* The ``request_cls`` attribute in Contract subclasses allows using different
  Request classes in contracts, for example FormRequest (:issue:`3383`);
* Fixed errback handling in contracts, e.g. for cases where a contract
  is executed for a URL which returns a non-200 response (:issue:`3371`).

Usability improvements
~~~~~~~~~~~~~~~~~~~~~~

* more stats for RobotsTxtMiddleware (:issue:`3100`)
* INFO log level is used to show telnet host/port (:issue:`3115`)
* a message is added to IgnoreRequest in RobotsTxtMiddleware (:issue:`3113`)
* better validation of the ``url`` argument in ``Response.follow`` (:issue:`3131`)
* a non-zero exit code is returned from Scrapy commands when an error happens
  on spider initialization (:issue:`3226`)
* link extraction improvements: "ftp" is added to the scheme list (:issue:`3152`);
  "flv" is added to common video extensions (:issue:`3165`)
* better error message when an exporter is disabled (:issue:`3358`);
* ``scrapy shell --help`` mentions the syntax required for local files
  (``./file.html``) - :issue:`3496`.
* the Referer header value is added to RFPDupeFilter log messages (:issue:`3588`)

Bug fixes
~~~~~~~~~

* fixed an issue with extra blank lines in .csv exports under Windows
  (:issue:`3039`);
* proper handling of pickling errors in Python 3 when serializing objects
  for disk queues (:issue:`3082`)
* flags are now preserved when copying Requests (:issue:`3342`);
* FormRequest.from_response clickdata no longer ignores elements with
  ``input[type=image]`` (:issue:`3153`);
* FormRequest.from_response now preserves duplicate keys (:issue:`3247`)

Documentation improvements
~~~~~~~~~~~~~~~~~~~~~~~~~~

* Docs are re-written to suggest the .get/.getall API instead of
  .extract/.extract_first. Also, the :ref:`topics-selectors` docs are updated
  and re-structured to match the latest parsel docs; they now contain more
  topics, such as :ref:`selecting-attributes` or
  :ref:`topics-selectors-css-extensions` (:issue:`3390`).
* :ref:`topics-developer-tools` is a new tutorial which replaces the
  old Firefox and Firebug tutorials (:issue:`3400`).
* The SCRAPY_PROJECT environment variable is documented (:issue:`3518`);
* a troubleshooting section is added to the install instructions (:issue:`3517`);
* improved links to beginner resources in the tutorial
  (:issue:`3367`, :issue:`3468`);
* fixed :setting:`RETRY_HTTP_CODES` default values in the docs (:issue:`3335`);
* removed the unused ``DEPTH_STATS`` option from the docs (:issue:`3245`);
* other cleanups (:issue:`3347`, :issue:`3350`, :issue:`3445`, :issue:`3544`,
  :issue:`3605`).

Deprecation removals
~~~~~~~~~~~~~~~~~~~~

Compatibility shims for pre-1.0 Scrapy module names are removed
(:issue:`3318`):

* ``scrapy.command``
* ``scrapy.contrib`` (with all submodules)
* ``scrapy.contrib_exp`` (with all submodules)
* ``scrapy.dupefilter``
* ``scrapy.linkextractor``
* ``scrapy.project``
* ``scrapy.spider``
* ``scrapy.spidermanager``
* ``scrapy.squeue``
* ``scrapy.stats``
* ``scrapy.statscol``
* ``scrapy.utils.decorator``

See :ref:`module-relocations` for more information, or use suggestions
from Scrapy 1.5.x deprecation warnings to update your code.

Other deprecation removals:

* The deprecated ``scrapy.interfaces.ISpiderManager`` is removed; please use
  ``scrapy.interfaces.ISpiderLoader``.
* The deprecated ``CrawlerSettings`` class is removed (:issue:`3327`).
* The deprecated ``Settings.overrides`` and ``Settings.defaults`` attributes
  are removed (:issue:`3327`, :issue:`3359`).

Other improvements, cleanups
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* All Scrapy tests now pass on Windows; the Scrapy testing suite is executed
  in a Windows environment on CI (:issue:`3315`).
* Python 3.7 support (:issue:`3326`, :issue:`3150`, :issue:`3547`).
* Testing and CI fixes (:issue:`3526`, :issue:`3538`, :issue:`3308`,
  :issue:`3311`, :issue:`3309`, :issue:`3305`, :issue:`3210`, :issue:`3299`)
* ``scrapy.http.cookies.CookieJar.clear`` accepts "domain", "path" and "name"
  optional arguments (:issue:`3231`).
* additional files are included in the sdist (:issue:`3495`);
* code style fixes (:issue:`3405`, :issue:`3304`);
* an unneeded .strip() call is removed (:issue:`3519`);
* collections.deque is used to store MiddlewareManager methods instead
  of a list (:issue:`3476`)

.. _release-1.5.2:

Scrapy 1.5.2 (2019-01-22)
-------------------------

* *Security bug fix*: the telnet console extension could be easily exploited
  by rogue websites POSTing content to http://localhost:6023. We haven't found
  a way to exploit it from Scrapy, but it is very easy to trick a browser into
  doing so, which elevates the risk for local development environments.

  *The fix is backward incompatible*: it enables telnet user-password
  authentication by default with a randomly generated password. If you can't
  upgrade right away, please consider changing :setting:`TELNETCONSOLE_PORT`
  from its default value.

  See the :ref:`telnet console <topics-telnetconsole>` documentation for more
  info.

* Backported a fix for a CI build failure under the GCE environment caused by
  a boto import error.

.. _release-1.5.1:

Scrapy 1.5.1 (2018-07-12)
-------------------------

This is a maintenance release with important bug fixes, but no new features:

* an ``O(N^2)`` gzip decompression issue which affected Python 3 and PyPy
  is fixed (:issue:`3281`);
* skipping of TLS validation errors is improved (:issue:`3166`);
* Ctrl-C handling is fixed in Python 3.5+ (:issue:`3096`);
* testing fixes (:issue:`3092`, :issue:`3263`);
* documentation improvements (:issue:`3058`, :issue:`3059`, :issue:`3089`,
  :issue:`3123`, :issue:`3127`, :issue:`3189`, :issue:`3224`, :issue:`3280`,
  :issue:`3279`, :issue:`3201`, :issue:`3260`, :issue:`3284`, :issue:`3298`,
  :issue:`3294`).


.. _release-1.5.0:

Scrapy 1.5.0 (2017-12-29)
-------------------------

This release brings small new features and improvements across the codebase.
Some highlights:

* Google Cloud Storage is supported in FilesPipeline and ImagesPipeline.
* Crawling with proxy servers becomes more efficient, as connections
  to proxies can now be reused.
* Warning, exception and logging messages are improved to make debugging
  easier.
* The ``scrapy parse`` command now allows setting custom request meta via
  the ``--meta`` argument.
* Compatibility with Python 3.6, PyPy and PyPy3 is improved;
  PyPy and PyPy3 are now supported officially, by running tests on CI.
* Better default handling of HTTP 308, 522 and 524 status codes.
* Documentation is improved, as usual.

Backward Incompatible Changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* Scrapy 1.5 drops support for Python 3.3.
* The default Scrapy User-Agent now uses an https link to scrapy.org
  (:issue:`2983`). **This is technically backward-incompatible**; override
  :setting:`USER_AGENT` if you relied on the old value.
* Logging of settings overridden by ``custom_settings`` is fixed;
  **this is technically backward-incompatible** because the logger
  changes from ``[scrapy.utils.log]`` to ``[scrapy.crawler]``. If you're
  parsing Scrapy logs, please update your log parsers (:issue:`1343`).
* LinkExtractor now ignores the ``m4v`` extension by default; this is a
  change in behavior.
* 522 and 524 status codes are added to ``RETRY_HTTP_CODES`` (:issue:`2851`)

New features
~~~~~~~~~~~~

- Support ``<link>`` tags in ``Response.follow`` (:issue:`2785`)
- Support for the ``ptpython`` REPL (:issue:`2654`)
- Google Cloud Storage support for FilesPipeline and ImagesPipeline
  (:issue:`2923`).
- The new ``--meta`` option of the "scrapy parse" command allows passing
  additional request.meta (:issue:`2883`)
- Populate the spider variable when using ``shell.inspect_response``
  (:issue:`2812`)
- Handle HTTP 308 Permanent Redirect (:issue:`2844`)
- Add 522 and 524 to ``RETRY_HTTP_CODES`` (:issue:`2851`)
- Log version information at startup (:issue:`2857`)
- ``scrapy.mail.MailSender`` now works in Python 3 (it requires Twisted 17.9.0)
- Connections to proxy servers are reused (:issue:`2743`)
- Add a template for a downloader middleware (:issue:`2755`)
- Explicit message for NotImplementedError when the parse callback is not
  defined (:issue:`2831`)
- CrawlerProcess got an option to disable installation of the root log handler
  (:issue:`2921`)
- LinkExtractor now ignores the ``m4v`` extension by default
- Better log messages for responses over the :setting:`DOWNLOAD_WARNSIZE` and
  :setting:`DOWNLOAD_MAXSIZE` limits (:issue:`2927`)
- Show a warning when a URL is put into ``Spider.allowed_domains`` instead of
  a domain (:issue:`2250`).
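Regarding the last point: ``allowed_domains`` entries should be bare domains.
If your start URLs are handy, a small standard-library sketch like this
derives the domains instead of pasting URLs (the URLs below are hypothetical):

```python
# Sketch: derive bare domains for Spider.allowed_domains from start URLs,
# since Scrapy 1.5 warns when a full URL is used instead of a domain.
from urllib.parse import urlparse

start_urls = ["https://example.com/shop", "https://example.org/catalog"]
allowed_domains = sorted({urlparse(url).netloc for url in start_urls})
print(allowed_domains)  # ['example.com', 'example.org']
```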

Bug fixes
~~~~~~~~~

- Fix logging of settings overridden by ``custom_settings``;
  **this is technically backward-incompatible** because the logger
  changes from ``[scrapy.utils.log]`` to ``[scrapy.crawler]``, so please
  update your log parsers if needed (:issue:`1343`)
- The default Scrapy User-Agent now uses an https link to scrapy.org
  (:issue:`2983`). **This is technically backward-incompatible**; override
  :setting:`USER_AGENT` if you relied on the old value.
- Fix PyPy and PyPy3 test failures, support them officially
  (:issue:`2793`, :issue:`2935`, :issue:`2990`, :issue:`3050`, :issue:`2213`,
  :issue:`3048`)
- Fix the DNS resolver when ``DNSCACHE_ENABLED=False`` (:issue:`2811`)
- Add ``cryptography`` to the Debian Jessie tox test env (:issue:`2848`)
- Add verification to check whether the Request callback is callable
  (:issue:`2766`)
- Port ``extras/qpsclient.py`` to Python 3 (:issue:`2849`)
- Use getfullargspec under the hood for Python 3 to stop a DeprecationWarning
  (:issue:`2862`)
- Update deprecated test aliases (:issue:`2876`)
- Fix ``SitemapSpider`` support for alternate links (:issue:`2853`)

Docs
~~~~

- Added a missing bullet point for the ``AUTOTHROTTLE_TARGET_CONCURRENCY``
  setting (:issue:`2756`)
- Update the Contributing docs, document new support channels
  (:issue:`2762`, :issue:`3038`)
- Include references to the Scrapy subreddit in the docs
- Fix broken links; use https:// for external links
  (:issue:`2978`, :issue:`2982`, :issue:`2958`)
- Document the CloseSpider extension better (:issue:`2759`)
- Use ``pymongo.collection.Collection.insert_one()`` in the MongoDB example
  (:issue:`2781`)
- Fix spelling mistakes and typos
  (:issue:`2828`, :issue:`2837`, :issue:`2884`, :issue:`2924`)
- Clarify the ``CSVFeedSpider.headers`` documentation (:issue:`2826`)
- Document the ``DontCloseSpider`` exception and clarify ``spider_idle``
  (:issue:`2791`)
- Update the "Releases" section in the README (:issue:`2764`)
- Fix rst syntax in the ``DOWNLOAD_FAIL_ON_DATALOSS`` docs (:issue:`2763`)
- Small fix in the description of startproject arguments (:issue:`2866`)
- Clarify data types in the Response.body docs (:issue:`2922`)
- Add a note about ``request.meta['depth']`` to the DepthMiddleware docs
  (:issue:`2374`)
- Add a note about ``request.meta['dont_merge_cookies']`` to the
  CookiesMiddleware docs (:issue:`2999`)
- Up-to-date example of project structure (:issue:`2964`, :issue:`2976`)
- A better example of ItemExporters usage (:issue:`2989`)
- Document ``from_crawler`` methods for spider and downloader middlewares
  (:issue:`3019`)

.. _release-1.4.0:

Scrapy 1.4.0 (2017-05-18)
-------------------------

Scrapy 1.4 does not bring that many breathtaking new features,
but quite a few handy improvements nonetheless.

Scrapy now supports anonymous FTP sessions with a customizable user and
password via the new :setting:`FTP_USER` and :setting:`FTP_PASSWORD` settings.
And if you're using Twisted version 17.1.0 or above, FTP is now available
with Python 3.
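The new FTP settings are configured like any other Scrapy setting. A sketch
(the credentials below are illustrative placeholders, following the common
anonymous-FTP convention of supplying an email address as the password):

```python
# settings.py -- sketch: custom FTP credentials via the new 1.4 settings.
# Both values are placeholders; anonymous FTP servers conventionally
# accept any email address as the password.
FTP_USER = "anonymous"
FTP_PASSWORD = "guest@example.com"
```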
2425 2426There's a new :meth:`response.follow <scrapy.http.TextResponse.follow>` method 2427for creating requests; **it is now a recommended way to create Requests 2428in Scrapy spiders**. This method makes it easier to write correct 2429spiders; ``response.follow`` has several advantages over creating 2430``scrapy.Request`` objects directly: 2431 2432* it handles relative URLs; 2433* it works properly with non-ascii URLs on non-UTF8 pages; 2434* in addition to absolute and relative URLs it supports Selectors; 2435 for ``<a>`` elements it can also extract their href values. 2436 2437For example, instead of this:: 2438 2439 for href in response.css('li.page a::attr(href)').extract(): 2440 url = response.urljoin(href) 2441 yield scrapy.Request(url, self.parse, encoding=response.encoding) 2442 2443One can now write this:: 2444 2445 for a in response.css('li.page a'): 2446 yield response.follow(a, self.parse) 2447 2448Link extractors are also improved. They work similarly to what a regular 2449modern browser would do: leading and trailing whitespace are removed 2450from attributes (think ``href=" http://example.com"``) when building 2451``Link`` objects. This whitespace-stripping also happens for ``action`` 2452attributes with ``FormRequest``. 2453 2454**Please also note that link extractors do not canonicalize URLs by default 2455anymore.** This was puzzling users every now and then, and it's not what 2456browsers do in fact, so we removed that extra transformation on extracted 2457links. 2458 2459For those of you wanting more control on the ``Referer:`` header that Scrapy 2460sends when following links, you can set your own ``Referrer Policy``. 2461Prior to Scrapy 1.4, the default ``RefererMiddleware`` would simply and 2462blindly set it to the URL of the response that generated the HTTP request 2463(which could leak information on your URL seeds). 2464By default, Scrapy now behaves much like your regular browser does. 
And this policy is fully customizable with W3C standard values
(or with something really custom of your own if you wish).
See :setting:`REFERRER_POLICY` for details.

To make Scrapy spiders easier to debug, Scrapy logs more stats by default
in 1.4: memory usage stats, detailed retry stats, detailed HTTP error code
stats. Similarly, the HTTP cache path is now visible in the logs.

Last but not least, Scrapy now has the option to make JSON and XML items
more human-readable, with newlines between items and even custom indenting
offset, using the new :setting:`FEED_EXPORT_INDENT` setting.

Enjoy! (Or read on for the rest of the changes in this release.)

Deprecations and Backward Incompatible Changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Default to ``canonicalize=False`` in
  :class:`scrapy.linkextractors.LinkExtractor
  <scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor>`
  (:issue:`2537`, fixes :issue:`1941` and :issue:`1982`):
  **warning, this is technically backward-incompatible**
- Enable memusage extension by default (:issue:`2539`, fixes :issue:`2187`);
  **this is technically backward-incompatible** so please check if you have
  any non-default ``MEMUSAGE_***`` options set.
- ``EDITOR`` environment variable now takes precedence over ``EDITOR``
  option defined in settings.py (:issue:`1829`); Scrapy default settings
  no longer depend on environment variables. **This is technically a backward
  incompatible change**.
- ``Spider.make_requests_from_url`` is deprecated
  (:issue:`1728`, fixes :issue:`1495`).
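
To get a sense of what the new :setting:`FEED_EXPORT_INDENT` setting changes
in exported feeds, here is a rough stdlib-only illustration; it uses the
``json`` module directly rather than Scrapy's exporters, so it only
approximates the exporter output::

    import json

    items = [{'name': 'Example item', 'price': '13.50'}]

    # Without indentation, the whole feed ends up on a single line
    compact = json.dumps(items)

    # With an indent of 4 spaces, items span multiple lines, similar to a
    # feed exported with FEED_EXPORT_INDENT = 4
    pretty = json.dumps(items, indent=4)

    print(pretty)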
2496 2497New Features 2498~~~~~~~~~~~~ 2499 2500- Accept proxy credentials in :reqmeta:`proxy` request meta key (:issue:`2526`) 2501- Support `brotli`_-compressed content; requires optional `brotlipy`_ 2502 (:issue:`2535`) 2503- New :ref:`response.follow <response-follow-example>` shortcut 2504 for creating requests (:issue:`1940`) 2505- Added ``flags`` argument and attribute to :class:`Request <scrapy.http.Request>` 2506 objects (:issue:`2047`) 2507- Support Anonymous FTP (:issue:`2342`) 2508- Added ``retry/count``, ``retry/max_reached`` and ``retry/reason_count/<reason>`` 2509 stats to :class:`RetryMiddleware <scrapy.downloadermiddlewares.retry.RetryMiddleware>` 2510 (:issue:`2543`) 2511- Added ``httperror/response_ignored_count`` and ``httperror/response_ignored_status_count/<status>`` 2512 stats to :class:`HttpErrorMiddleware <scrapy.spidermiddlewares.httperror.HttpErrorMiddleware>` 2513 (:issue:`2566`) 2514- Customizable :setting:`Referrer policy <REFERRER_POLICY>` in 2515 :class:`RefererMiddleware <scrapy.spidermiddlewares.referer.RefererMiddleware>` 2516 (:issue:`2306`) 2517- New ``data:`` URI download handler (:issue:`2334`, fixes :issue:`2156`) 2518- Log cache directory when HTTP Cache is used (:issue:`2611`, fixes :issue:`2604`) 2519- Warn users when project contains duplicate spider names (fixes :issue:`2181`) 2520- ``scrapy.utils.datatypes.CaselessDict`` now accepts ``Mapping`` instances and 2521 not only dicts (:issue:`2646`) 2522- :ref:`Media downloads <topics-media-pipeline>`, with 2523 :class:`~scrapy.pipelines.files.FilesPipeline` or 2524 :class:`~scrapy.pipelines.images.ImagesPipeline`, can now optionally handle 2525 HTTP redirects using the new :setting:`MEDIA_ALLOW_REDIRECTS` setting 2526 (:issue:`2616`, fixes :issue:`2004`) 2527- Accept non-complete responses from websites using a new 2528 :setting:`DOWNLOAD_FAIL_ON_DATALOSS` setting (:issue:`2590`, fixes :issue:`2586`) 2529- Optional pretty-printing of JSON and XML items via 2530 
:setting:`FEED_EXPORT_INDENT` setting (:issue:`2456`, fixes :issue:`1327`) 2531- Allow dropping fields in ``FormRequest.from_response`` formdata when 2532 ``None`` value is passed (:issue:`667`) 2533- Per-request retry times with the new :reqmeta:`max_retry_times` meta key 2534 (:issue:`2642`) 2535- ``python -m scrapy`` as a more explicit alternative to ``scrapy`` command 2536 (:issue:`2740`) 2537 2538.. _brotli: https://github.com/google/brotli 2539.. _brotlipy: https://github.com/python-hyper/brotlipy/ 2540 2541Bug fixes 2542~~~~~~~~~ 2543 2544- LinkExtractor now strips leading and trailing whitespaces from attributes 2545 (:issue:`2547`, fixes :issue:`1614`) 2546- Properly handle whitespaces in action attribute in 2547 :class:`~scrapy.http.FormRequest` (:issue:`2548`) 2548- Buffer CONNECT response bytes from proxy until all HTTP headers are received 2549 (:issue:`2495`, fixes :issue:`2491`) 2550- FTP downloader now works on Python 3, provided you use Twisted>=17.1 2551 (:issue:`2599`) 2552- Use body to choose response type after decompressing content (:issue:`2393`, 2553 fixes :issue:`2145`) 2554- Always decompress ``Content-Encoding: gzip`` at :class:`HttpCompressionMiddleware 2555 <scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware>` stage (:issue:`2391`) 2556- Respect custom log level in ``Spider.custom_settings`` (:issue:`2581`, 2557 fixes :issue:`1612`) 2558- 'make htmlview' fix for macOS (:issue:`2661`) 2559- Remove "commands" from the command list (:issue:`2695`) 2560- Fix duplicate Content-Length header for POST requests with empty body (:issue:`2677`) 2561- Properly cancel large downloads, i.e. 
above :setting:`DOWNLOAD_MAXSIZE` (:issue:`1616`) 2562- ImagesPipeline: fixed processing of transparent PNG images with palette 2563 (:issue:`2675`) 2564 2565Cleanups & Refactoring 2566~~~~~~~~~~~~~~~~~~~~~~ 2567 2568- Tests: remove temp files and folders (:issue:`2570`), 2569 fixed ProjectUtilsTest on macOS (:issue:`2569`), 2570 use portable pypy for Linux on Travis CI (:issue:`2710`) 2571- Separate building request from ``_requests_to_follow`` in CrawlSpider (:issue:`2562`) 2572- Remove “Python 3 progress” badge (:issue:`2567`) 2573- Add a couple more lines to ``.gitignore`` (:issue:`2557`) 2574- Remove bumpversion prerelease configuration (:issue:`2159`) 2575- Add codecov.yml file (:issue:`2750`) 2576- Set context factory implementation based on Twisted version (:issue:`2577`, 2577 fixes :issue:`2560`) 2578- Add omitted ``self`` arguments in default project middleware template (:issue:`2595`) 2579- Remove redundant ``slot.add_request()`` call in ExecutionEngine (:issue:`2617`) 2580- Catch more specific ``os.error`` exception in 2581 ``scrapy.pipelines.files.FSFilesStore`` (:issue:`2644`) 2582- Change "localhost" test server certificate (:issue:`2720`) 2583- Remove unused ``MEMUSAGE_REPORT`` setting (:issue:`2576`) 2584 2585Documentation 2586~~~~~~~~~~~~~ 2587 2588- Binary mode is required for exporters (:issue:`2564`, fixes :issue:`2553`) 2589- Mention issue with :meth:`FormRequest.from_response 2590 <scrapy.http.FormRequest.from_response>` due to bug in lxml (:issue:`2572`) 2591- Use single quotes uniformly in templates (:issue:`2596`) 2592- Document :reqmeta:`ftp_user` and :reqmeta:`ftp_password` meta keys (:issue:`2587`) 2593- Removed section on deprecated ``contrib/`` (:issue:`2636`) 2594- Recommend Anaconda when installing Scrapy on Windows 2595 (:issue:`2477`, fixes :issue:`2475`) 2596- FAQ: rewrite note on Python 3 support on Windows (:issue:`2690`) 2597- Rearrange selector sections (:issue:`2705`) 2598- Remove ``__nonzero__`` from 
:class:`~scrapy.selector.SelectorList`
  docs (:issue:`2683`)
- Mention how to disable request filtering in documentation of
  :setting:`DUPEFILTER_CLASS` setting (:issue:`2714`)
- Add sphinx_rtd_theme to docs setup readme (:issue:`2668`)
- Open file in text mode in JSON item writer example (:issue:`2729`)
- Clarify ``allowed_domains`` example (:issue:`2670`)


.. _release-1.3.3:

Scrapy 1.3.3 (2017-03-10)
-------------------------

Bug fixes
~~~~~~~~~

- Make ``SpiderLoader`` raise ``ImportError`` again by default for missing
  dependencies and wrong :setting:`SPIDER_MODULES`.
  These exceptions were silenced as warnings since 1.3.0.
  A new setting is introduced to toggle between warning or exception if needed;
  see :setting:`SPIDER_LOADER_WARN_ONLY` for details.

.. _release-1.3.2:

Scrapy 1.3.2 (2017-02-13)
-------------------------

Bug fixes
~~~~~~~~~

- Preserve request class when converting to/from dicts (utils.reqser) (:issue:`2510`).
- Use consistent selectors for author field in tutorial (:issue:`2551`).
- Fix TLS compatibility in Twisted 17+ (:issue:`2558`)

.. _release-1.3.1:

Scrapy 1.3.1 (2017-02-08)
-------------------------

New features
~~~~~~~~~~~~

- Support ``'True'`` and ``'False'`` string values for boolean settings (:issue:`2519`);
  you can now do something like ``scrapy crawl myspider -s REDIRECT_ENABLED=False``.
- Support kwargs with ``response.xpath()`` to use :ref:`XPath variables <topics-selectors-xpath-variables>`
  and ad-hoc namespace declarations;
  this requires at least Parsel v1.1 (:issue:`2457`).
- Add support for Python 3.6 (:issue:`2485`).
- Run tests on PyPy (warning: some tests still fail, so PyPy is not supported yet).

Bug fixes
~~~~~~~~~

- Enforce ``DNS_TIMEOUT`` setting (:issue:`2496`).
- Fix :command:`view` command; it was a regression in v1.3.0 (:issue:`2503`).
- Fix tests regarding ``*_EXPIRES`` settings with Files/Images pipelines (:issue:`2460`).
- Fix name of generated pipeline class when using basic project template (:issue:`2466`).
- Fix compatibility with Twisted 17+ (:issue:`2496`, :issue:`2528`).
- Fix ``scrapy.Item`` inheritance on Python 3.6 (:issue:`2511`).
- Enforce numeric values for components order in ``SPIDER_MIDDLEWARES``,
  ``DOWNLOADER_MIDDLEWARES``, ``EXTENSIONS`` and ``SPIDER_CONTRACTS`` (:issue:`2420`).

Documentation
~~~~~~~~~~~~~

- Reword Code of Conduct section and upgrade to Contributor Covenant v1.4
  (:issue:`2469`).
- Clarify that passing spider arguments converts them to spider attributes
  (:issue:`2483`).
- Document ``formid`` argument on ``FormRequest.from_response()`` (:issue:`2497`).
- Add .rst extension to README files (:issue:`2507`).
- Mention LevelDB cache storage backend (:issue:`2525`).
- Use ``yield`` in sample callback code (:issue:`2533`).
- Add note about HTML entities decoding with ``.re()/.re_first()`` (:issue:`1704`).
- Typos (:issue:`2512`, :issue:`2534`, :issue:`2531`).

Cleanups
~~~~~~~~

- Remove redundant check in ``MetaRefreshMiddleware`` (:issue:`2542`).
- Faster checks in ``LinkExtractor`` for allow/deny patterns (:issue:`2538`).
- Remove dead code supporting old Twisted versions (:issue:`2544`).


.. _release-1.3.0:

Scrapy 1.3.0 (2016-12-21)
-------------------------

This release comes rather soon after 1.2.2 for one main reason:
it was found that releases since 0.18 up to 1.2.2 (included) use
some backported code from Twisted (``scrapy.xlib.tx.*``),
even if newer Twisted modules are available.
Scrapy now uses ``twisted.web.client`` and ``twisted.internet.endpoints`` directly.
(See also cleanups below.)
2694 2695As it is a major change, we wanted to get the bug fix out quickly 2696while not breaking any projects using the 1.2 series. 2697 2698New Features 2699~~~~~~~~~~~~ 2700 2701- ``MailSender`` now accepts single strings as values for ``to`` and ``cc`` 2702 arguments (:issue:`2272`) 2703- ``scrapy fetch url``, ``scrapy shell url`` and ``fetch(url)`` inside 2704 Scrapy shell now follow HTTP redirections by default (:issue:`2290`); 2705 See :command:`fetch` and :command:`shell` for details. 2706- ``HttpErrorMiddleware`` now logs errors with ``INFO`` level instead of ``DEBUG``; 2707 this is technically **backward incompatible** so please check your log parsers. 2708- By default, logger names now use a long-form path, e.g. ``[scrapy.extensions.logstats]``, 2709 instead of the shorter "top-level" variant of prior releases (e.g. ``[scrapy]``); 2710 this is **backward incompatible** if you have log parsers expecting the short 2711 logger name part. You can switch back to short logger names using :setting:`LOG_SHORT_NAMES` 2712 set to ``True``. 2713 2714Dependencies & Cleanups 2715~~~~~~~~~~~~~~~~~~~~~~~ 2716 2717- Scrapy now requires Twisted >= 13.1 which is the case for many Linux 2718 distributions already. 2719- As a consequence, we got rid of ``scrapy.xlib.tx.*`` modules, which 2720 copied some of Twisted code for users stuck with an "old" Twisted version 2721- ``ChunkedTransferMiddleware`` is deprecated and removed from the default 2722 downloader middlewares. 2723 2724.. _release-1.2.3: 2725 2726Scrapy 1.2.3 (2017-03-03) 2727------------------------- 2728 2729- Packaging fix: disallow unsupported Twisted versions in setup.py 2730 2731 2732.. 
_release-1.2.2: 2733 2734Scrapy 1.2.2 (2016-12-06) 2735------------------------- 2736 2737Bug fixes 2738~~~~~~~~~ 2739 2740- Fix a cryptic traceback when a pipeline fails on ``open_spider()`` (:issue:`2011`) 2741- Fix embedded IPython shell variables (fixing :issue:`396` that re-appeared 2742 in 1.2.0, fixed in :issue:`2418`) 2743- A couple of patches when dealing with robots.txt: 2744 2745 - handle (non-standard) relative sitemap URLs (:issue:`2390`) 2746 - handle non-ASCII URLs and User-Agents in Python 2 (:issue:`2373`) 2747 2748Documentation 2749~~~~~~~~~~~~~ 2750 2751- Document ``"download_latency"`` key in ``Request``'s ``meta`` dict (:issue:`2033`) 2752- Remove page on (deprecated & unsupported) Ubuntu packages from ToC (:issue:`2335`) 2753- A few fixed typos (:issue:`2346`, :issue:`2369`, :issue:`2369`, :issue:`2380`) 2754 and clarifications (:issue:`2354`, :issue:`2325`, :issue:`2414`) 2755 2756Other changes 2757~~~~~~~~~~~~~ 2758 2759- Advertize `conda-forge`_ as Scrapy's official conda channel (:issue:`2387`) 2760- More helpful error messages when trying to use ``.css()`` or ``.xpath()`` 2761 on non-Text Responses (:issue:`2264`) 2762- ``startproject`` command now generates a sample ``middlewares.py`` file (:issue:`2335`) 2763- Add more dependencies' version info in ``scrapy version`` verbose output (:issue:`2404`) 2764- Remove all ``*.pyc`` files from source distribution (:issue:`2386`) 2765 2766.. _conda-forge: https://anaconda.org/conda-forge/scrapy 2767 2768 2769.. _release-1.2.1: 2770 2771Scrapy 1.2.1 (2016-10-21) 2772------------------------- 2773 2774Bug fixes 2775~~~~~~~~~ 2776 2777- Include OpenSSL's more permissive default ciphers when establishing 2778 TLS/SSL connections (:issue:`2314`). 2779- Fix "Location" HTTP header decoding on non-ASCII URL redirects (:issue:`2321`). 2780 2781Documentation 2782~~~~~~~~~~~~~ 2783 2784- Fix JsonWriterPipeline example (:issue:`2302`). 
2785- Various notes: :issue:`2330` on spider names, 2786 :issue:`2329` on middleware methods processing order, 2787 :issue:`2327` on getting multi-valued HTTP headers as lists. 2788 2789Other changes 2790~~~~~~~~~~~~~ 2791 2792- Removed ``www.`` from ``start_urls`` in built-in spider templates (:issue:`2299`). 2793 2794 2795.. _release-1.2.0: 2796 2797Scrapy 1.2.0 (2016-10-03) 2798------------------------- 2799 2800New Features 2801~~~~~~~~~~~~ 2802 2803- New :setting:`FEED_EXPORT_ENCODING` setting to customize the encoding 2804 used when writing items to a file. 2805 This can be used to turn off ``\uXXXX`` escapes in JSON output. 2806 This is also useful for those wanting something else than UTF-8 2807 for XML or CSV output (:issue:`2034`). 2808- ``startproject`` command now supports an optional destination directory 2809 to override the default one based on the project name (:issue:`2005`). 2810- New :setting:`SCHEDULER_DEBUG` setting to log requests serialization 2811 failures (:issue:`1610`). 2812- JSON encoder now supports serialization of ``set`` instances (:issue:`2058`). 2813- Interpret ``application/json-amazonui-streaming`` as ``TextResponse`` (:issue:`1503`). 2814- ``scrapy`` is imported by default when using shell tools (:command:`shell`, 2815 :ref:`inspect_response <topics-shell-inspect-response>`) (:issue:`2248`). 2816 2817Bug fixes 2818~~~~~~~~~ 2819 2820- DefaultRequestHeaders middleware now runs before UserAgent middleware 2821 (:issue:`2088`). **Warning: this is technically backward incompatible**, 2822 though we consider this a bug fix. 2823- HTTP cache extension and plugins that use the ``.scrapy`` data directory now 2824 work outside projects (:issue:`1581`). **Warning: this is technically 2825 backward incompatible**, though we consider this a bug fix. 2826- ``Selector`` does not allow passing both ``response`` and ``text`` anymore 2827 (:issue:`2153`). 2828- Fixed logging of wrong callback name with ``scrapy parse`` (:issue:`2169`). 
- Fix for an odd gzip decompression bug (:issue:`1606`).
- Fix for selected callbacks when using ``CrawlSpider`` with :command:`scrapy parse <parse>`
  (:issue:`2225`).
- Fix for invalid JSON and XML files when spider yields no items (:issue:`872`).
- Implement ``flush()`` for ``StreamLogger``, avoiding a warning in logs (:issue:`2125`).

Refactoring
~~~~~~~~~~~

- ``canonicalize_url`` has been moved to `w3lib.url`_ (:issue:`2168`).

.. _w3lib.url: https://w3lib.readthedocs.io/en/latest/w3lib.html#w3lib.url.canonicalize_url

Tests & Requirements
~~~~~~~~~~~~~~~~~~~~

Scrapy's new requirements baseline is Debian 8 "Jessie". It was previously
Ubuntu 12.04 Precise.
What this means in practice is that we run continuous integration tests
with these (main) package versions at a minimum:
Twisted 14.0, pyOpenSSL 0.14, lxml 3.4.

Scrapy may very well work with older versions of these packages
(the code base still has switches for older Twisted versions, for example)
but it is not guaranteed (because it's not tested anymore).

Documentation
~~~~~~~~~~~~~

- Grammar fixes: :issue:`2128`, :issue:`1566`.
- Download stats badge removed from README (:issue:`2160`).
- New Scrapy :ref:`architecture diagram <topics-architecture>` (:issue:`2165`).
- Updated ``Response`` parameters documentation (:issue:`2197`).
- Reworded misleading :setting:`RANDOMIZE_DOWNLOAD_DELAY` description (:issue:`2190`).
- Add StackOverflow as a support channel (:issue:`2257`).

.. _release-1.1.4:

Scrapy 1.1.4 (2017-03-03)
-------------------------

- Packaging fix: disallow unsupported Twisted versions in setup.py

.. 
_release-1.1.3: 2873 2874Scrapy 1.1.3 (2016-09-22) 2875------------------------- 2876 2877Bug fixes 2878~~~~~~~~~ 2879 2880- Class attributes for subclasses of ``ImagesPipeline`` and ``FilesPipeline`` 2881 work as they did before 1.1.1 (:issue:`2243`, fixes :issue:`2198`) 2882 2883Documentation 2884~~~~~~~~~~~~~ 2885 2886- :ref:`Overview <intro-overview>` and :ref:`tutorial <intro-tutorial>` 2887 rewritten to use http://toscrape.com websites 2888 (:issue:`2236`, :issue:`2249`, :issue:`2252`). 2889 2890.. _release-1.1.2: 2891 2892Scrapy 1.1.2 (2016-08-18) 2893------------------------- 2894 2895Bug fixes 2896~~~~~~~~~ 2897 2898- Introduce a missing :setting:`IMAGES_STORE_S3_ACL` setting to override 2899 the default ACL policy in ``ImagesPipeline`` when uploading images to S3 2900 (note that default ACL policy is "private" -- instead of "public-read" -- 2901 since Scrapy 1.1.0) 2902- :setting:`IMAGES_EXPIRES` default value set back to 90 2903 (the regression was introduced in 1.1.1) 2904 2905.. 
_release-1.1.1: 2906 2907Scrapy 1.1.1 (2016-07-13) 2908------------------------- 2909 2910Bug fixes 2911~~~~~~~~~ 2912 2913- Add "Host" header in CONNECT requests to HTTPS proxies (:issue:`2069`) 2914- Use response ``body`` when choosing response class 2915 (:issue:`2001`, fixes :issue:`2000`) 2916- Do not fail on canonicalizing URLs with wrong netlocs 2917 (:issue:`2038`, fixes :issue:`2010`) 2918- a few fixes for ``HttpCompressionMiddleware`` (and ``SitemapSpider``): 2919 2920 - Do not decode HEAD responses (:issue:`2008`, fixes :issue:`1899`) 2921 - Handle charset parameter in gzip Content-Type header 2922 (:issue:`2050`, fixes :issue:`2049`) 2923 - Do not decompress gzip octet-stream responses 2924 (:issue:`2065`, fixes :issue:`2063`) 2925 2926- Catch (and ignore with a warning) exception when verifying certificate 2927 against IP-address hosts (:issue:`2094`, fixes :issue:`2092`) 2928- Make ``FilesPipeline`` and ``ImagesPipeline`` backward compatible again 2929 regarding the use of legacy class attributes for customization 2930 (:issue:`1989`, fixes :issue:`1985`) 2931 2932 2933New features 2934~~~~~~~~~~~~ 2935 2936- Enable genspider command outside project folder (:issue:`2052`) 2937- Retry HTTPS CONNECT ``TunnelError`` by default (:issue:`1974`) 2938 2939 2940Documentation 2941~~~~~~~~~~~~~ 2942 2943- ``FEED_TEMPDIR`` setting at lexicographical position (:commit:`9b3c72c`) 2944- Use idiomatic ``.extract_first()`` in overview (:issue:`1994`) 2945- Update years in copyright notice (:commit:`c2c8036`) 2946- Add information and example on errbacks (:issue:`1995`) 2947- Use "url" variable in downloader middleware example (:issue:`2015`) 2948- Grammar fixes (:issue:`2054`, :issue:`2120`) 2949- New FAQ entry on using BeautifulSoup in spider callbacks (:issue:`2048`) 2950- Add notes about Scrapy not working on Windows with Python 3 (:issue:`2060`) 2951- Encourage complete titles in pull requests (:issue:`2026`) 2952 2953Tests 2954~~~~~ 2955 2956- Upgrade py.test 
requirement on Travis CI and pin pytest-cov to 2.2.1 (:issue:`2095`)

.. _release-1.1.0:

Scrapy 1.1.0 (2016-05-11)
-------------------------

This 1.1 release brings a lot of interesting features and bug fixes:

- Scrapy 1.1 has beta Python 3 support (requires Twisted >= 15.5). See
  :ref:`news_betapy3` for more details and some limitations.
- Hot new features:

  - Item loaders now support nested loaders (:issue:`1467`).
  - ``FormRequest.from_response`` improvements (:issue:`1382`, :issue:`1137`).
  - Added setting :setting:`AUTOTHROTTLE_TARGET_CONCURRENCY` and improved
    AutoThrottle docs (:issue:`1324`).
  - Added ``response.text`` to get body as unicode (:issue:`1730`).
  - Anonymous S3 connections (:issue:`1358`).
  - Deferreds in downloader middlewares (:issue:`1473`). This enables better
    robots.txt handling (:issue:`1471`).
  - HTTP caching now follows RFC 2616 more closely, added settings
    :setting:`HTTPCACHE_ALWAYS_STORE` and
    :setting:`HTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS` (:issue:`1151`).
  - Selectors were extracted to the parsel_ library (:issue:`1409`). This means
    you can use Scrapy Selectors without Scrapy and also upgrade the
    selectors engine without needing to upgrade Scrapy.
  - HTTPS downloader now does TLS protocol negotiation by default,
    instead of forcing TLS 1.0. You can also set the SSL/TLS method
    using the new :setting:`DOWNLOADER_CLIENT_TLS_METHOD`.

- These bug fixes may require your attention:

  - Don't retry bad requests (HTTP 400) by default (:issue:`1289`).
    If you need the old behavior, add ``400`` to :setting:`RETRY_HTTP_CODES`.
  - Fix shell files argument handling (:issue:`1710`, :issue:`1550`).
    If you try ``scrapy shell index.html``, it will try to load the URL http://index.html;
    use ``scrapy shell ./index.html`` to load a local file.
  - Robots.txt compliance is now enabled by default for newly-created projects
    (:issue:`1724`). Scrapy will also wait for robots.txt to be downloaded
    before proceeding with the crawl (:issue:`1735`). If you want to disable
    this behavior, update :setting:`ROBOTSTXT_OBEY` in the ``settings.py`` file
    after creating a new project.
  - Exporters now work on unicode, instead of bytes by default (:issue:`1080`).
    If you use :class:`~scrapy.exporters.PythonItemExporter`, you may want to
    update your code to disable binary mode, which is now deprecated.
  - Accept XML node names containing dots as valid (:issue:`1533`).
  - When uploading files or images to S3 (with ``FilesPipeline`` or
    ``ImagesPipeline``), the default ACL policy is now "private" instead
    of "public". **Warning: backward incompatible!**
    You can use :setting:`FILES_STORE_S3_ACL` to change it.
  - We've reimplemented ``canonicalize_url()`` for more correct output,
    especially for URLs with non-ASCII characters (:issue:`1947`).
    This could change link extractors' output compared to previous Scrapy versions.
    This may also invalidate some cache entries you could still have from pre-1.1 runs.
    **Warning: backward incompatible!**

Keep reading for more details on other improvements and bug fixes.

.. _news_betapy3:

Beta Python 3 Support
~~~~~~~~~~~~~~~~~~~~~

We have been `hard at work to make Scrapy run on Python 3
<https://github.com/scrapy/scrapy/wiki/Python-3-Porting>`_. As a result, now
you can run spiders on Python 3.3, 3.4 and 3.5 (Twisted >= 15.5 required). Some
features are still missing (and some may never be ported).

Almost all builtin extensions/middlewares are expected to work.
3027However, we are aware of some limitations in Python 3: 3028 3029- Scrapy does not work on Windows with Python 3 3030- Sending emails is not supported 3031- FTP download handler is not supported 3032- Telnet console is not supported 3033 3034Additional New Features and Enhancements 3035~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 3036 3037- Scrapy now has a `Code of Conduct`_ (:issue:`1681`). 3038- Command line tool now has completion for zsh (:issue:`934`). 3039- Improvements to ``scrapy shell``: 3040 3041 - Support for bpython and configure preferred Python shell via 3042 ``SCRAPY_PYTHON_SHELL`` (:issue:`1100`, :issue:`1444`). 3043 - Support URLs without scheme (:issue:`1498`) 3044 **Warning: backward incompatible!** 3045 - Bring back support for relative file path (:issue:`1710`, :issue:`1550`). 3046 3047- Added :setting:`MEMUSAGE_CHECK_INTERVAL_SECONDS` setting to change default check 3048 interval (:issue:`1282`). 3049- Download handlers are now lazy-loaded on first request using their 3050 scheme (:issue:`1390`, :issue:`1421`). 3051- HTTPS download handlers do not force TLS 1.0 anymore; instead, 3052 OpenSSL's ``SSLv23_method()/TLS_method()`` is used allowing to try 3053 negotiating with the remote hosts the highest TLS protocol version 3054 it can (:issue:`1794`, :issue:`1629`). 3055- ``RedirectMiddleware`` now skips the status codes from 3056 ``handle_httpstatus_list`` on spider attribute 3057 or in ``Request``'s ``meta`` key (:issue:`1334`, :issue:`1364`, 3058 :issue:`1447`). 3059- Form submission: 3060 3061 - now works with ``<button>`` elements too (:issue:`1469`). 3062 - an empty string is now used for submit buttons without a value 3063 (:issue:`1472`) 3064 3065- Dict-like settings now have per-key priorities 3066 (:issue:`1135`, :issue:`1149` and :issue:`1586`). 3067- Sending non-ASCII emails (:issue:`1662`) 3068- ``CloseSpider`` and ``SpiderState`` extensions now get disabled if no relevant 3069 setting is set (:issue:`1723`, :issue:`1725`). 
3070- Added method ``ExecutionEngine.close`` (:issue:`1423`). 3071- Added method ``CrawlerRunner.create_crawler`` (:issue:`1528`). 3072- Scheduler priority queue can now be customized via 3073 :setting:`SCHEDULER_PRIORITY_QUEUE` (:issue:`1822`). 3074- ``.pps`` links are now ignored by default in link extractors (:issue:`1835`). 3075- temporary data folder for FTP and S3 feed storages can be customized 3076 using a new :setting:`FEED_TEMPDIR` setting (:issue:`1847`). 3077- ``FilesPipeline`` and ``ImagesPipeline`` settings are now instance attributes 3078 instead of class attributes, enabling spider-specific behaviors (:issue:`1891`). 3079- ``JsonItemExporter`` now formats opening and closing square brackets 3080 on their own line (first and last lines of output file) (:issue:`1950`). 3081- If available, ``botocore`` is used for ``S3FeedStorage``, ``S3DownloadHandler`` 3082 and ``S3FilesStore`` (:issue:`1761`, :issue:`1883`). 3083- Tons of documentation updates and related fixes (:issue:`1291`, :issue:`1302`, 3084 :issue:`1335`, :issue:`1683`, :issue:`1660`, :issue:`1642`, :issue:`1721`, 3085 :issue:`1727`, :issue:`1879`). 3086- Other refactoring, optimizations and cleanup (:issue:`1476`, :issue:`1481`, 3087 :issue:`1477`, :issue:`1315`, :issue:`1290`, :issue:`1750`, :issue:`1881`). 3088 3089.. _`Code of Conduct`: https://github.com/scrapy/scrapy/blob/master/CODE_OF_CONDUCT.md 3090 3091 3092Deprecations and Removals 3093~~~~~~~~~~~~~~~~~~~~~~~~~ 3094 3095- Added ``to_bytes`` and ``to_unicode``, deprecated ``str_to_unicode`` and 3096 ``unicode_to_str`` functions (:issue:`778`). 3097- ``binary_is_text`` is introduced, to replace use of ``isbinarytext`` 3098 (but with inverse return value) (:issue:`1851`) 3099- The ``optional_features`` set has been removed (:issue:`1359`). 3100- The ``--lsprof`` command line option has been removed (:issue:`1689`). 3101 **Warning: backward incompatible**, but doesn't break user code. 
3102- The following datatypes were deprecated (:issue:`1720`): 3103 3104 + ``scrapy.utils.datatypes.MultiValueDictKeyError`` 3105 + ``scrapy.utils.datatypes.MultiValueDict`` 3106 + ``scrapy.utils.datatypes.SiteNode`` 3107 3108- The previously bundled ``scrapy.xlib.pydispatch`` library was deprecated and 3109 replaced by `pydispatcher <https://pypi.org/project/PyDispatcher/>`_. 3110 3111 3112Relocations 3113~~~~~~~~~~~ 3114 3115- ``telnetconsole`` was relocated to ``extensions/`` (:issue:`1524`). 3116 3117 + Note: telnet is not enabled on Python 3 3118 (https://github.com/scrapy/scrapy/pull/1524#issuecomment-146985595) 3119 3120.. _parsel: https://github.com/scrapy/parsel 3121 3122 3123Bugfixes 3124~~~~~~~~ 3125 3126- Scrapy does not retry requests that got a ``HTTP 400 Bad Request`` 3127 response anymore (:issue:`1289`). **Warning: backward incompatible!** 3128- Support empty password for http_proxy config (:issue:`1274`). 3129- Interpret ``application/x-json`` as ``TextResponse`` (:issue:`1333`). 3130- Support link rel attribute with multiple values (:issue:`1201`). 3131- Fixed ``scrapy.http.FormRequest.from_response`` when there is a ``<base>`` 3132 tag (:issue:`1564`). 3133- Fixed :setting:`TEMPLATES_DIR` handling (:issue:`1575`). 3134- Various ``FormRequest`` fixes (:issue:`1595`, :issue:`1596`, :issue:`1597`). 3135- Makes ``_monkeypatches`` more robust (:issue:`1634`). 3136- Fixed bug on ``XMLItemExporter`` with non-string fields in 3137 items (:issue:`1738`). 3138- Fixed startproject command in macOS (:issue:`1635`). 3139- Fixed :class:`~scrapy.exporters.PythonItemExporter` and CSVExporter for 3140 non-string item types (:issue:`1737`). 3141- Various logging related fixes (:issue:`1294`, :issue:`1419`, :issue:`1263`, 3142 :issue:`1624`, :issue:`1654`, :issue:`1722`, :issue:`1726` and :issue:`1303`). 3143- Fixed bug in ``utils.template.render_templatefile()`` (:issue:`1212`). 
- Sitemaps extraction from ``robots.txt`` is now case-insensitive (:issue:`1902`).
- HTTPS+CONNECT tunnels could get mixed up when using multiple proxies
  to the same remote host (:issue:`1912`).

.. _release-1.0.7:

Scrapy 1.0.7 (2017-03-03)
-------------------------

- Packaging fix: disallow unsupported Twisted versions in setup.py

.. _release-1.0.6:

Scrapy 1.0.6 (2016-05-04)
-------------------------

- FIX: RetryMiddleware is now robust to non-standard HTTP status codes (:issue:`1857`)
- FIX: Filestorage HTTP cache was checking wrong modified time (:issue:`1875`)
- DOC: Support for Sphinx 1.4+ (:issue:`1893`)
- DOC: Consistency in selectors examples (:issue:`1869`)

.. _release-1.0.5:

Scrapy 1.0.5 (2016-02-04)
-------------------------

- FIX: [Backport] Ignore bogus links in LinkExtractors (fixes :issue:`907`, :commit:`108195e`)
- TST: Changed buildbot makefile to use 'pytest' (:commit:`1f3d90a`)
- DOC: Fixed typos in tutorial and media-pipeline (:commit:`808a9ea` and :commit:`803bd87`)
- DOC: Add AjaxCrawlMiddleware to DOWNLOADER_MIDDLEWARES_BASE in settings docs (:commit:`aa94121`)

.. _release-1.0.4:

Scrapy 1.0.4 (2015-12-30)
-------------------------

- Ignoring xlib/tx folder, depending on Twisted version. (:commit:`7dfa979`)
- Run on new travis-ci infra (:commit:`6e42f0b`)
- Spelling fixes (:commit:`823a1cc`)
- escape nodename in xmliter regex (:commit:`da3c155`)
- test xml nodename with dots (:commit:`4418fc3`)
- TST don't use broken Pillow version in tests (:commit:`a55078c`)
- disable log on version command. closes #1426 (:commit:`86fc330`)
- disable log on startproject command (:commit:`db4c9fe`)
- Add PyPI download stats badge (:commit:`df2b944`)
- don't run tests twice on Travis if a PR is made from a scrapy/scrapy branch (:commit:`a83ab41`)
- Add Python 3 porting status badge to the README (:commit:`73ac80d`)
- fixed RFPDupeFilter persistence (:commit:`97d080e`)
- TST a test to show that dupefilter persistence is not working (:commit:`97f2fb3`)
- explicit close file on file:// scheme handler (:commit:`d9b4850`)
- Disable dupefilter in shell (:commit:`c0d0734`)
- DOC: Add captions to toctrees which appear in sidebar (:commit:`aa239ad`)
- DOC Removed pywin32 from install instructions as it's already declared as dependency. (:commit:`10eb400`)
- Added installation notes about using Conda for Windows and other OSes. (:commit:`1c3600a`)
- Fixed minor grammar issues. (:commit:`7f4ddd5`)
- fixed a typo in the documentation. (:commit:`b71f677`)
- Version 1 now exists (:commit:`5456c0e`)
- fix another invalid xpath error (:commit:`0a1366e`)
- fix ValueError: Invalid XPath: //div/[id="not-exists"]/text() on selectors.rst (:commit:`ca8d60f`)
- Typos corrections (:commit:`7067117`)
- fix typos in downloader-middleware.rst and exceptions.rst, middlware -> middleware (:commit:`32f115c`)
- Add note to Ubuntu install section about Debian compatibility (:commit:`23fda69`)
- Replace alternative macOS install workaround with virtualenv (:commit:`98b63ee`)
- Reference Homebrew's homepage for installation instructions (:commit:`1925db1`)
- Add oldest supported tox version to contributing docs (:commit:`5d10d6d`)
- Note in install docs about pip being already included in python>=2.7.9 (:commit:`85c980e`)
- Add non-python dependencies to Ubuntu install section in the docs (:commit:`fbd010d`)
- Add macOS installation section to docs (:commit:`d8f4cba`)
- DOC(ENH): specify path to rtd theme explicitly (:commit:`de73b1a`)
- minor: scrapy.Spider docs grammar (:commit:`1ddcc7b`)
- Make common practices sample code match the comments (:commit:`1b85bcf`)
- nextcall repetitive calls (heartbeats). (:commit:`55f7104`)
- Backport fix compatibility with Twisted 15.4.0 (:commit:`b262411`)
- pin pytest to 2.7.3 (:commit:`a6535c2`)
- Merge pull request #1512 from mgedmin/patch-1 (:commit:`8876111`)
- Merge pull request #1513 from mgedmin/patch-2 (:commit:`5d4daf8`)
- Typo (:commit:`f8d0682`)
- Fix list formatting (:commit:`5f83a93`)
- fix Scrapy squeue tests after recent changes to queuelib (:commit:`3365c01`)
- Merge pull request #1475 from rweindl/patch-1 (:commit:`2d688cd`)
- Update tutorial.rst (:commit:`fbc1f25`)
- Merge pull request #1449 from rhoekman/patch-1 (:commit:`7d6538c`)
- Small grammatical change (:commit:`8752294`)
- Add openssl version to version command (:commit:`13c45ac`)

.. _release-1.0.3:

Scrapy 1.0.3 (2015-08-11)
-------------------------

- add service_identity to Scrapy install_requires (:commit:`cbc2501`)
- Workaround for travis#296 (:commit:`66af9cd`)

.. _release-1.0.2:

Scrapy 1.0.2 (2015-08-06)
-------------------------

- Twisted 15.3.0 does not raise PicklingError serializing lambda functions (:commit:`b04dd7d`)
- Minor method name fix (:commit:`6f85c7f`)
- minor: scrapy.Spider grammar and clarity (:commit:`9c9d2e0`)
- Put a blurb about support channels in CONTRIBUTING (:commit:`c63882b`)
- Fixed typos (:commit:`a9ae7b0`)
- Fix doc reference. (:commit:`7c8a4fe`)

.. _release-1.0.1:

Scrapy 1.0.1 (2015-07-01)
-------------------------

- Unquote request path before passing to FTPClient, it already escapes paths (:commit:`cc00ad2`)
- include tests/ to source distribution in MANIFEST.in (:commit:`eca227e`)
- DOC Fix SelectJmes documentation (:commit:`b8567bc`)
- DOC Bring Ubuntu and Archlinux outside of Windows subsection (:commit:`392233f`)
- DOC remove version suffix from Ubuntu package (:commit:`5303c66`)
- DOC Update release date for 1.0 (:commit:`c89fa29`)

.. _release-1.0.0:

Scrapy 1.0.0 (2015-06-19)
-------------------------

You will find a lot of new features and bugfixes in this major release. Make
sure to check our updated :ref:`overview <intro-overview>` to get a glance of
some of the changes, along with our refreshed :ref:`tutorial <intro-tutorial>`.

Support for returning dictionaries in spiders
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Declaring and returning Scrapy Items is no longer necessary to collect the
scraped data from your spider; you can now return explicit dictionaries
instead.

*Classic version*

::

    class MyItem(scrapy.Item):
        url = scrapy.Field()

    class MySpider(scrapy.Spider):
        def parse(self, response):
            return MyItem(url=response.url)

*New version*

::

    class MySpider(scrapy.Spider):
        def parse(self, response):
            return {'url': response.url}

Per-spider settings (GSoC 2014)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The last Google Summer of Code project accomplished an important redesign of
the mechanism used for populating settings, introducing explicit priorities
for overriding any given setting. As an extension of that goal, we included a
new priority level for settings that apply exclusively to a single spider,
allowing them to redefine project settings.
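The override logic can be sketched in plain Python. This is a simplified
illustration of priority-based resolution, not Scrapy's actual ``Settings``
implementation; the priority names and numeric values below are assumptions
made for the example:

```python
# Simplified sketch of priority-based settings resolution.
# Priority names/numbers are illustrative, not Scrapy's exact values.
SETTINGS_PRIORITIES = {"default": 0, "project": 20, "spider": 30, "cmdline": 40}


class PrioritySettings:
    def __init__(self):
        self._store = {}  # setting name -> (value, numeric priority)

    def set(self, name, value, priority):
        pri = SETTINGS_PRIORITIES[priority]
        stored = self._store.get(name)
        # A value only takes effect if its priority is at least the stored
        # one, so a spider-level setting overrides a project-level one.
        if stored is None or pri >= stored[1]:
            self._store[name] = (value, pri)

    def get(self, name, default=None):
        stored = self._store.get(name)
        return stored[0] if stored is not None else default


settings = PrioritySettings()
settings.set("DOWNLOAD_DELAY", 0, "default")
settings.set("DOWNLOAD_DELAY", 2.5, "project")
settings.set("DOWNLOAD_DELAY", 5.0, "spider")   # per-spider value wins
settings.set("DOWNLOAD_DELAY", 1.0, "project")  # ignored: lower priority
```

Here the final resolved ``DOWNLOAD_DELAY`` is ``5.0``: the later
project-level write is discarded because its priority is lower than the
spider-level one already stored.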
Start using it by defining a :attr:`~scrapy.spiders.Spider.custom_settings`
class variable in your spider::

    class MySpider(scrapy.Spider):
        custom_settings = {
            "DOWNLOAD_DELAY": 5.0,
            "RETRY_ENABLED": False,
        }

Read more about settings population: :ref:`topics-settings`

Python Logging
~~~~~~~~~~~~~~

Scrapy 1.0 has moved away from Twisted logging and now uses Python's built-in
``logging`` module as its default logging system. We're maintaining backward
compatibility for most of the old custom interface to call logging functions,
but you'll get warnings to switch to the Python logging API entirely.

*Old version*

::

    from scrapy import log
    log.msg('MESSAGE', log.INFO)

*New version*

::

    import logging
    logging.info('MESSAGE')

Logging with spiders remains the same, but on top of the
:meth:`~scrapy.spiders.Spider.log` method you'll have access to a custom
:attr:`~scrapy.spiders.Spider.logger` created for the spider to issue log
events:

::

    class MySpider(scrapy.Spider):
        def parse(self, response):
            self.logger.info('Response received')

Read more in the logging documentation: :ref:`topics-logging`

Crawler API refactoring (GSoC 2014)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Another milestone of the last Google Summer of Code was a refactoring of the
internal API, seeking simpler and easier usage. Check the new core interface
in: :ref:`topics-api`

A common situation where you will face these changes is while running Scrapy
from scripts.
Here's a quick example of how to run a Spider manually with the new API:

::

    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })
    process.crawl(MySpider)
    process.start()

Bear in mind this feature is still under development and its API may change
until it reaches a stable status.

See more examples for scripts running Scrapy: :ref:`topics-practices`

.. _module-relocations:

Module Relocations
~~~~~~~~~~~~~~~~~~

There's been a large rearrangement of modules aiming to improve the general
structure of Scrapy. The main changes were separating various subpackages into
new projects and dissolving both ``scrapy.contrib`` and ``scrapy.contrib_exp``
into top-level packages. Backward compatibility was kept among internal
relocations, while imports of deprecated modules emit warnings indicating
their new location.

Full list of relocations
************************

Outsourced packages

.. note::
    These extensions went through some minor changes, e.g. some setting names
    were changed. Please check the documentation in each new repository to
    get familiar with the new usage.

+-------------------------------------+-------------------------------------+
| Old location                        | New location                        |
+=====================================+=====================================+
| scrapy.commands.deploy              | `scrapyd-client <https://github.com |
|                                     | /scrapy/scrapyd-client>`_           |
|                                     | (See other alternatives here:       |
|                                     | :ref:`topics-deploy`)               |
+-------------------------------------+-------------------------------------+
| scrapy.contrib.djangoitem           | `scrapy-djangoitem <https://github. |
|                                     | com/scrapy-plugins/scrapy-djangoite |
|                                     | m>`_                                |
+-------------------------------------+-------------------------------------+
| scrapy.webservice                   | `scrapy-jsonrpc <https://github.com |
|                                     | /scrapy-plugins/scrapy-jsonrpc>`_   |
+-------------------------------------+-------------------------------------+

``scrapy.contrib_exp`` and ``scrapy.contrib`` dissolutions

+-------------------------------------+-------------------------------------+
| Old location                        | New location                        |
+=====================================+=====================================+
| scrapy.contrib\_exp.downloadermidd\ | scrapy.downloadermiddlewares.decom\ |
| leware.decompression                | pression                            |
+-------------------------------------+-------------------------------------+
| scrapy.contrib\_exp.iterators       | scrapy.utils.iterators              |
+-------------------------------------+-------------------------------------+
| scrapy.contrib.downloadermiddleware | scrapy.downloadermiddlewares        |
+-------------------------------------+-------------------------------------+
| scrapy.contrib.exporter             | scrapy.exporters                    |
+-------------------------------------+-------------------------------------+
| scrapy.contrib.linkextractors       | scrapy.linkextractors               |
+-------------------------------------+-------------------------------------+
| scrapy.contrib.loader               | scrapy.loader                       |
+-------------------------------------+-------------------------------------+
| scrapy.contrib.loader.processor     | scrapy.loader.processors            |
+-------------------------------------+-------------------------------------+
| scrapy.contrib.pipeline             | scrapy.pipelines                    |
+-------------------------------------+-------------------------------------+
| scrapy.contrib.spidermiddleware     | scrapy.spidermiddlewares            |
+-------------------------------------+-------------------------------------+
| scrapy.contrib.spiders              | scrapy.spiders                      |
+-------------------------------------+-------------------------------------+
| * scrapy.contrib.closespider        | scrapy.extensions.\*                |
| * scrapy.contrib.corestats          |                                     |
| * scrapy.contrib.debug              |                                     |
| * scrapy.contrib.feedexport         |                                     |
| * scrapy.contrib.httpcache          |                                     |
| * scrapy.contrib.logstats           |                                     |
| * scrapy.contrib.memdebug           |                                     |
| * scrapy.contrib.memusage           |                                     |
| * scrapy.contrib.spiderstate        |                                     |
| * scrapy.contrib.statsmailer        |                                     |
| * scrapy.contrib.throttle           |                                     |
+-------------------------------------+-------------------------------------+

Plural renames and Modules unification

+-------------------------------------+-------------------------------------+
| Old location                        | New location                        |
+=====================================+=====================================+
| scrapy.command                      | scrapy.commands                     |
+-------------------------------------+-------------------------------------+
| scrapy.dupefilter                   | scrapy.dupefilters                  |
+-------------------------------------+-------------------------------------+
| scrapy.linkextractor                | scrapy.linkextractors               |
+-------------------------------------+-------------------------------------+
| scrapy.spider                       | scrapy.spiders                      |
+-------------------------------------+-------------------------------------+
| scrapy.squeue                       | scrapy.squeues                      |
+-------------------------------------+-------------------------------------+
| scrapy.statscol                     | scrapy.statscollectors              |
+-------------------------------------+-------------------------------------+
| scrapy.utils.decorator              | scrapy.utils.decorators             |
+-------------------------------------+-------------------------------------+

Class renames

+-------------------------------------+-------------------------------------+
| Old location                        | New location                        |
+=====================================+=====================================+
| scrapy.spidermanager.SpiderManager  | scrapy.spiderloader.SpiderLoader    |
+-------------------------------------+-------------------------------------+

Settings renames

+-------------------------------------+-------------------------------------+
| Old location                        | New location                        |
+=====================================+=====================================+
| SPIDER\_MANAGER\_CLASS              | SPIDER\_LOADER\_CLASS               |
+-------------------------------------+-------------------------------------+

Changelog
~~~~~~~~~

New Features and Enhancements

- Python logging (:issue:`1060`, :issue:`1235`, :issue:`1236`, :issue:`1240`,
  :issue:`1259`, :issue:`1278`, :issue:`1286`)
- FEED_EXPORT_FIELDS option (:issue:`1159`, :issue:`1224`)
- Dns cache size and timeout options (:issue:`1132`)
- support namespace prefix in xmliter_lxml (:issue:`963`)
- Reactor threadpool max size setting (:issue:`1123`)
- Allow spiders to return dicts. (:issue:`1081`)
- Add Response.urljoin() helper (:issue:`1086`)
- look in ~/.config/scrapy.cfg for user config (:issue:`1098`)
- handle TLS SNI (:issue:`1101`)
- Selectorlist extract first (:issue:`624`, :issue:`1145`)
- Added JmesSelect (:issue:`1016`)
- add gzip compression to filesystem http cache backend (:issue:`1020`)
- CSS support in link extractors (:issue:`983`)
- httpcache dont_cache meta #19 #689 (:issue:`821`)
- add signal to be sent when request is dropped by the scheduler
  (:issue:`961`)
- avoid download large response (:issue:`946`)
- Allow to specify the quotechar in CSVFeedSpider (:issue:`882`)
- Add referer to "Spider error processing" log message (:issue:`795`)
- process robots.txt once (:issue:`896`)
- GSoC Per-spider settings (:issue:`854`)
- Add project name validation (:issue:`817`)
- GSoC API cleanup (:issue:`816`, :issue:`1128`, :issue:`1147`,
  :issue:`1148`, :issue:`1156`, :issue:`1185`, :issue:`1187`, :issue:`1258`,
  :issue:`1268`, :issue:`1276`, :issue:`1285`, :issue:`1284`)
- Be more responsive with IO operations (:issue:`1074` and :issue:`1075`)
- Do leveldb compaction for httpcache on closing (:issue:`1297`)

Deprecations and Removals

- Deprecate htmlparser link extractor (:issue:`1205`)
- remove deprecated code from FeedExporter (:issue:`1155`)
- a leftover for 0.15 compatibility (:issue:`925`)
- drop support for CONCURRENT_REQUESTS_PER_SPIDER (:issue:`895`)
- Drop old engine code (:issue:`911`)
- Deprecate SgmlLinkExtractor (:issue:`777`)

Relocations

- Move exporters/__init__.py to exporters.py (:issue:`1242`)
- Move base classes to their packages (:issue:`1218`, :issue:`1233`)
- Module relocation (:issue:`1181`, :issue:`1210`)
- rename SpiderManager to SpiderLoader (:issue:`1166`)
- Remove djangoitem (:issue:`1177`)
- remove scrapy deploy command (:issue:`1102`)
- dissolve contrib_exp (:issue:`1134`)
- Deleted bin folder from root, fixes #913 (:issue:`914`)
- Remove jsonrpc based webservice (:issue:`859`)
- Move Test cases under project root dir (:issue:`827`, :issue:`841`)
- Fix backward incompatibility for relocated paths in settings
  (:issue:`1267`)

Documentation

- CrawlerProcess documentation (:issue:`1190`)
- Favoring web scraping over screen scraping in the descriptions
  (:issue:`1188`)
- Some improvements for Scrapy tutorial (:issue:`1180`)
- Documenting Files Pipeline together with Images Pipeline (:issue:`1150`)
- deployment docs tweaks (:issue:`1164`)
- Added deployment section covering scrapyd-deploy and shub (:issue:`1124`)
- Adding more settings to project template (:issue:`1073`)
- some improvements to overview page (:issue:`1106`)
- Updated link in docs/topics/architecture.rst (:issue:`647`)
- DOC reorder topics (:issue:`1022`)
- updating list of Request.meta special keys (:issue:`1071`)
- DOC document download_timeout (:issue:`898`)
- DOC simplify extension docs (:issue:`893`)
- Leaks docs (:issue:`894`)
- DOC document from_crawler method for item pipelines (:issue:`904`)
- Spider_error doesn't support deferreds (:issue:`1292`)
- Corrections & Sphinx related fixes (:issue:`1220`, :issue:`1219`,
  :issue:`1196`, :issue:`1172`, :issue:`1171`, :issue:`1169`, :issue:`1160`,
  :issue:`1154`, :issue:`1127`, :issue:`1112`, :issue:`1105`, :issue:`1041`,
  :issue:`1082`, :issue:`1033`, :issue:`944`, :issue:`866`, :issue:`864`,
  :issue:`796`, :issue:`1260`, :issue:`1271`, :issue:`1293`, :issue:`1298`)

Bugfixes

- Item multi inheritance fix (:issue:`353`, :issue:`1228`)
- ItemLoader.load_item: iterate over copy of fields (:issue:`722`)
- Fix Unhandled error in Deferred (RobotsTxtMiddleware) (:issue:`1131`,
  :issue:`1197`)
- Force to read DOWNLOAD_TIMEOUT as int (:issue:`954`)
- scrapy.utils.misc.load_object should print full traceback (:issue:`902`)
- Fix bug for ".local" host name (:issue:`878`)
- Fix for Enabled extensions, middlewares, pipelines info not printed
  anymore (:issue:`879`)
- fix dont_merge_cookies bad behaviour when set to false on meta
  (:issue:`846`)

Python 3 In Progress Support

- disable scrapy.telnet if twisted.conch is not available (:issue:`1161`)
- fix Python 3 syntax errors in ajaxcrawl.py (:issue:`1162`)
- more python3 compatibility changes for urllib (:issue:`1121`)
- assertItemsEqual was renamed to assertCountEqual in Python 3.
  (:issue:`1070`)
- Import unittest.mock if available. (:issue:`1066`)
- updated deprecated cgi.parse_qsl to use six's parse_qsl (:issue:`909`)
- Prevent Python 3 port regressions (:issue:`830`)
- PY3: use MutableMapping for python 3 (:issue:`810`)
- PY3: use six.BytesIO and six.moves.cStringIO (:issue:`803`)
- PY3: fix xmlrpclib and email imports (:issue:`801`)
- PY3: use six for robotparser and urlparse (:issue:`800`)
- PY3: use six.iterkeys, six.iteritems, and tempfile (:issue:`799`)
- PY3: fix has_key and use six.moves.configparser (:issue:`798`)
- PY3: use six.moves.cPickle (:issue:`797`)
- PY3 make it possible to run some tests in Python3 (:issue:`776`)

Tests

- remove unnecessary lines from py3-ignores (:issue:`1243`)
- Fix remaining warnings from pytest while collecting tests (:issue:`1206`)
- Add docs build to travis (:issue:`1234`)
- TST don't collect tests from deprecated modules. (:issue:`1165`)
- install service_identity package in tests to prevent warnings
  (:issue:`1168`)
- Fix deprecated settings API in tests (:issue:`1152`)
- Add test for webclient with POST method and no body given (:issue:`1089`)
- py3-ignores.txt supports comments (:issue:`1044`)
- modernize some of the asserts (:issue:`835`)
- selector.__repr__ test (:issue:`779`)

Code refactoring

- CSVFeedSpider cleanup: use iterate_spider_output (:issue:`1079`)
- remove unnecessary check from scrapy.utils.spider.iter_spider_output
  (:issue:`1078`)
- Pydispatch pep8 (:issue:`992`)
- Removed unused 'load=False' parameter from walk_modules() (:issue:`871`)
- For consistency, use ``job_dir`` helper in ``SpiderState`` extension.
  (:issue:`805`)
- rename "sflo" local variables to less cryptic "log_observer" (:issue:`775`)

Scrapy 0.24.6 (2015-04-20)
--------------------------

- encode invalid xpath with unicode_escape under PY2 (:commit:`07cb3e5`)
- fix IPython shell scope issue and load IPython user config (:commit:`2c8e573`)
- Fix small typo in the docs (:commit:`d694019`)
- Fix small typo (:commit:`f92fa83`)
- Converted sel.xpath() calls to response.xpath() in Extracting the data (:commit:`c2c6d15`)


Scrapy 0.24.5 (2015-02-25)
--------------------------

- Support new _getEndpoint Agent signatures on Twisted 15.0.0 (:commit:`540b9bc`)
- DOC a couple more references are fixed (:commit:`b4c454b`)
- DOC fix a reference (:commit:`e3c1260`)
- t.i.b.ThreadedResolver is now a new-style class (:commit:`9e13f42`)
- S3DownloadHandler: fix auth for requests with quoted paths/query params (:commit:`cdb9a0b`)
- fixed the variable types in mailsender documentation (:commit:`bb3a848`)
- Reset items_scraped instead of item_count (:commit:`edb07a4`)
- Tentative attention message about what document to read for contributions (:commit:`7ee6f7a`)
- mitmproxy 0.10.1 needs netlib 0.10.1 too (:commit:`874fcdd`)
- pin mitmproxy 0.10.1 as >0.11 does not work with tests (:commit:`c6b21f0`)
- Test the parse command locally instead of against an external url (:commit:`c3a6628`)
- Patches Twisted issue while closing the connection pool on HTTPDownloadHandler (:commit:`d0bf957`)
- Updates documentation on dynamic item classes. (:commit:`eeb589a`)
- Merge pull request #943 from Lazar-T/patch-3 (:commit:`5fdab02`)
- typo (:commit:`b0ae199`)
- pywin32 is required by Twisted. closes #937 (:commit:`5cb0cfb`)
- Update install.rst (:commit:`781286b`)
- Merge pull request #928 from Lazar-T/patch-1 (:commit:`b415d04`)
- comma instead of fullstop (:commit:`627b9ba`)
- Merge pull request #885 from jsma/patch-1 (:commit:`de909ad`)
- Update request-response.rst (:commit:`3f3263d`)
- SgmlLinkExtractor - fix for parsing <area> tag with Unicode present (:commit:`49b40f0`)

Scrapy 0.24.4 (2014-08-09)
--------------------------

- pem file is used by mockserver and required by scrapy bench (:commit:`5eddc68`)
- scrapy bench needs scrapy.tests* (:commit:`d6cb999`)

Scrapy 0.24.3 (2014-08-09)
--------------------------

- no need to waste travis-ci time on py3 for 0.24 (:commit:`8e080c1`)
- Update installation docs (:commit:`1d0c096`)
- There is a trove classifier for Scrapy framework! (:commit:`4c701d7`)
- update other places where w3lib version is mentioned (:commit:`d109c13`)
- Update w3lib requirement to 1.8.0 (:commit:`39d2ce5`)
- Use w3lib.html.replace_entities() (remove_entities() is deprecated) (:commit:`180d3ad`)
- set zip_safe=False (:commit:`a51ee8b`)
- do not ship tests package (:commit:`ee3b371`)
- scrapy.bat is not needed anymore (:commit:`c3861cf`)
- Modernize setup.py (:commit:`362e322`)
- headers can not handle non-string values (:commit:`94a5c65`)
- fix ftp test cases (:commit:`a274a7f`)
- The sum up of travis-ci builds are taking like 50min to complete (:commit:`ae1e2cc`)
- Update shell.rst typo (:commit:`e49c96a`)
- removes weird indentation in the shell results (:commit:`1ca489d`)
- improved explanations, clarified blog post as source, added link for XPath string functions in the spec (:commit:`65c8f05`)
- renamed UserTimeoutError and ServerTimeouterror #583 (:commit:`037f6ab`)
- adding some xpath tips to selectors docs (:commit:`2d103e0`)
- fix tests to account for https://github.com/scrapy/w3lib/pull/23 (:commit:`f8d366a`)
- get_func_args maximum recursion fix #728 (:commit:`81344ea`)
- Updated input/output processor example according to #560. (:commit:`f7c4ea8`)
- Fixed Python syntax in tutorial. (:commit:`db59ed9`)
- Add test case for tunneling proxy (:commit:`f090260`)
- Bugfix for leaking Proxy-Authorization header to remote host when using tunneling (:commit:`d8793af`)
- Extract links from XHTML documents with MIME-Type "application/xml" (:commit:`ed1f376`)
- Merge pull request #793 from roysc/patch-1 (:commit:`91a1106`)
- Fix typo in commands.rst (:commit:`743e1e2`)
- better testcase for settings.overrides.setdefault (:commit:`e22daaf`)
- Using CRLF as line marker according to http 1.1 definition (:commit:`5ec430b`)

Scrapy 0.24.2 (2014-07-08)
--------------------------

- Use a mutable mapping to proxy deprecated settings.overrides and settings.defaults attribute (:commit:`e5e8133`)
- there is not support for python3 yet (:commit:`3cd6146`)
- Update python compatible version set to Debian packages (:commit:`fa5d76b`)
- DOC fix formatting in release notes (:commit:`c6a9e20`)

Scrapy 0.24.1 (2014-06-27)
--------------------------

- Fix deprecated CrawlerSettings and increase backward compatibility with
  .defaults attribute (:commit:`8e3f20a`)


Scrapy 0.24.0 (2014-06-26)
--------------------------

Enhancements
~~~~~~~~~~~~

- Improve Scrapy top-level namespace (:issue:`494`, :issue:`684`)
- Add selector shortcuts to responses (:issue:`554`, :issue:`690`)
- Add new lxml based LinkExtractor to replace unmaintained SgmlLinkExtractor
  (:issue:`559`, :issue:`761`, :issue:`763`)
- Cleanup settings API - part of per-spider settings **GSoC project** (:issue:`737`)
- Add UTF8 encoding header to templates (:issue:`688`, :issue:`762`)
- Telnet console now binds to 127.0.0.1 by default (:issue:`699`)
- Update Debian/Ubuntu install instructions (:issue:`509`, :issue:`549`)
- Disable smart strings in lxml XPath evaluations (:issue:`535`)
- Restore filesystem based cache as default for http
  cache middleware (:issue:`541`, :issue:`500`, :issue:`571`)
- Expose current crawler in Scrapy shell (:issue:`557`)
- Improve testsuite comparing CSV and XML exporters (:issue:`570`)
- New ``offsite/filtered`` and ``offsite/domains`` stats (:issue:`566`)
- Support process_links as generator in CrawlSpider (:issue:`555`)
- Verbose logging and new stats counters for DupeFilter (:issue:`553`)
- Add a mimetype parameter to ``MailSender.send()`` (:issue:`602`)
- Generalize file pipeline log messages (:issue:`622`)
- Replace unencodeable codepoints with html entities in SGMLLinkExtractor (:issue:`565`)
- Converted SEP documents to rst format (:issue:`629`, :issue:`630`,
  :issue:`638`, :issue:`632`, :issue:`636`, :issue:`640`, :issue:`635`,
  :issue:`634`, :issue:`639`, :issue:`637`, :issue:`631`, :issue:`633`,
  :issue:`641`, :issue:`642`)
- Tests and docs for clickdata's nr index in FormRequest (:issue:`646`, :issue:`645`)
- Allow to disable a downloader handler just like any other component (:issue:`650`)
- Log when a request is discarded after too many redirections (:issue:`654`)
- Log error responses if they are not handled by spider callbacks
  (:issue:`612`, :issue:`656`)
- Add content-type check to http compression mw (:issue:`193`, :issue:`660`)
- Run pypy tests using latest pypi from ppa (:issue:`674`)
- Run test suite using pytest instead of trial (:issue:`679`)
- Build docs and check for dead links in tox environment (:issue:`687`)
- Make scrapy.version_info a tuple of integers (:issue:`681`, :issue:`692`)
- Infer exporter's output format from filename extensions
  (:issue:`546`, :issue:`659`, :issue:`760`)
- Support case-insensitive domains in ``url_is_from_any_domain()`` (:issue:`693`)
- Remove pep8 warnings in project and spider templates (:issue:`698`)
- Tests and docs for ``request_fingerprint`` function (:issue:`597`)
- Update SEP-19 for GSoC project ``per-spider settings`` (:issue:`705`)
- Set exit code to non-zero when contracts fail (:issue:`727`)
- Add a setting to control what class is instantiated as Downloader component
  (:issue:`738`)
- Pass response in ``item_dropped`` signal (:issue:`724`)
- Improve ``scrapy check`` contracts command (:issue:`733`, :issue:`752`)
- Document ``spider.closed()`` shortcut (:issue:`719`)
- Document ``request_scheduled`` signal (:issue:`746`)
- Add a note about reporting security issues (:issue:`697`)
- Add LevelDB http cache storage backend (:issue:`626`, :issue:`500`)
- Sort spider list output of ``scrapy list`` command (:issue:`742`)
- Multiple documentation enhancements and fixes
  (:issue:`575`, :issue:`587`, :issue:`590`, :issue:`596`, :issue:`610`,
  :issue:`617`, :issue:`618`, :issue:`627`, :issue:`613`, :issue:`643`,
  :issue:`654`, :issue:`675`, :issue:`663`, :issue:`711`, :issue:`714`)

Bugfixes
~~~~~~~~

- Encode unicode URL value when creating Links in RegexLinkExtractor (:issue:`561`)
- Ignore None values in ItemLoader processors (:issue:`556`)
- Fix link text when there is an inner tag in SGMLLinkExtractor and
  HtmlParserLinkExtractor (:issue:`485`, :issue:`574`)
- Fix wrong checks on subclassing of deprecated classes
  (:issue:`581`, :issue:`584`)
- Handle errors caused by inspect.stack() failures (:issue:`582`)
- Fix a reference to a nonexistent engine attribute (:issue:`593`, :issue:`594`)
- Fix dynamic itemclass example usage of type() (:issue:`603`)
- Use lucasdemarchi/codespell to fix typos (:issue:`628`)
- Fix default value of attrs argument in SgmlLinkExtractor to be tuple (:issue:`661`)
- Fix XXE flaw in sitemap reader (:issue:`676`)
- Fix engine to support filtered start requests (:issue:`707`)
- Fix offsite middleware case on urls with no hostnames (:issue:`745`)
- Testsuite doesn't require PIL anymore (:issue:`585`)


Scrapy 0.22.2 (released 2014-02-14)
-----------------------------------

- fix a reference to nonexistent engine.slots. closes #593 (:commit:`13c099a`)
- downloaderMW doc typo (spiderMW doc copy remnant) (:commit:`8ae11bf`)
- Correct typos (:commit:`1346037`)

Scrapy 0.22.1 (released 2014-02-08)
-----------------------------------

- localhost666 can resolve under certain circumstances (:commit:`2ec2279`)
- test inspect.stack failure (:commit:`cc3eda3`)
- Handle cases when inspect.stack() fails (:commit:`8cb44f9`)
- Fix wrong checks on subclassing of deprecated classes. closes #581 (:commit:`46d98d6`)
- Docs: 4-space indent for final spider example (:commit:`13846de`)
- Fix HtmlParserLinkExtractor and tests after #485 merge (:commit:`368a946`)
- BaseSgmlLinkExtractor: Fixed the missing space when the link has an inner tag (:commit:`b566388`)
- BaseSgmlLinkExtractor: Added unit test of a link with an inner tag (:commit:`c1cb418`)
- BaseSgmlLinkExtractor: Fixed unknown_endtag() so that it only sets current_link=None when the end tag matches the opening tag (:commit:`7e4d627`)
- Fix tests for Travis-CI build (:commit:`76c7e20`)
- replace unencodable codepoints with html entities. fixes #562 and #285 (:commit:`5f87b17`)
- RegexLinkExtractor: encode URL unicode value when creating Links (:commit:`d0ee545`)
- Updated the tutorial crawl output with latest output. (:commit:`8da65de`)
- Updated shell docs with the crawler reference and fixed the actual shell output. (:commit:`875b9ab`)
- PEP8 minor edits. (:commit:`f89efaf`)
- Expose current crawler in the Scrapy shell. (:commit:`5349cec`)
- Unused re import and PEP8 minor edits. (:commit:`387f414`)
- Ignore None's values when using the ItemLoader. (:commit:`0632546`)
- DOC Fixed HTTPCACHE_STORAGE typo in the default value which is now Filesystem instead of Dbm. (:commit:`cde9a8c`)
- show Ubuntu setup instructions as literal code (:commit:`fb5c9c5`)
- Update Ubuntu installation instructions (:commit:`70fb105`)
- Merge pull request #550 from stray-leone/patch-1 (:commit:`6f70b6a`)
- modify the version of Scrapy Ubuntu package (:commit:`725900d`)
- fix 0.22.0 release date (:commit:`af0219a`)
- fix typos in news.rst and remove (not released yet) header (:commit:`b7f58f4`)

Scrapy 0.22.0 (released 2014-01-17)
-----------------------------------

Enhancements
~~~~~~~~~~~~

- [**Backward incompatible**] Switched HTTPCacheMiddleware backend to filesystem (:issue:`541`)
  To restore old backend set ``HTTPCACHE_STORAGE`` to ``scrapy.contrib.httpcache.DbmCacheStorage``
- Proxy \https:// urls using CONNECT method (:issue:`392`, :issue:`397`)
- Add a middleware to crawl ajax crawleable pages as defined by google (:issue:`343`)
- Rename scrapy.spider.BaseSpider to scrapy.spider.Spider (:issue:`510`, :issue:`519`)
- Selectors register EXSLT namespaces by default (:issue:`472`)
- Unify item loaders similar to selectors renaming (:issue:`461`)
- Make ``RFPDupeFilter`` class easily subclassable (:issue:`533`)
- Improve test coverage and forthcoming Python 3 support (:issue:`525`)
- Promote startup info on settings and middleware to INFO level (:issue:`520`)
- Support partials in ``get_func_args`` util (:issue:`506`, :issue:`504`)
- Allow running individual tests via tox (:issue:`503`)
- Update extensions ignored by link extractors (:issue:`498`)
- Add middleware methods to get files/images/thumbs paths (:issue:`490`)
- Improve offsite middleware tests (:issue:`478`)
- Add a way to skip default Referer header set by RefererMiddleware (:issue:`475`)
- Do not send ``x-gzip`` in default ``Accept-Encoding`` header (:issue:`469`)
- Support defining http error handling using settings (:issue:`466`)
- Use modern python idioms wherever you find legacies (:issue:`497`)
- Improve and correct documentation
  (:issue:`527`, :issue:`524`, :issue:`521`, :issue:`517`, :issue:`512`, :issue:`505`,
  :issue:`502`, :issue:`489`, :issue:`465`, :issue:`460`, :issue:`425`, :issue:`536`)

Fixes
~~~~~

- Update Selector class imports in CrawlSpider template (:issue:`484`)
- Fix nonexistent reference to ``engine.slots`` (:issue:`464`)
- Do not try to call ``body_as_unicode()`` on a non-TextResponse instance (:issue:`462`)
- Warn when subclassing XPathItemLoader, previously it only warned on
  instantiation. (:issue:`523`)
- Warn when subclassing XPathSelector, previously it only warned on
  instantiation. (:issue:`537`)
- Multiple fixes to memory stats (:issue:`531`, :issue:`530`, :issue:`529`)
- Fix overriding url in ``FormRequest.from_response()`` (:issue:`507`)
- Fix tests runner under pip 1.5 (:issue:`513`)
- Fix logging error when spider name is unicode (:issue:`479`)

Scrapy 0.20.2 (released 2013-12-09)
-----------------------------------

- Update CrawlSpider Template with Selector changes (:commit:`6d1457d`)
- fix method name in tutorial. closes GH-480 (:commit:`b4fc359`)

Scrapy 0.20.1 (released 2013-11-28)
-----------------------------------

- include_package_data is required to build wheels from published sources (:commit:`5ba1ad5`)
- process_parallel was leaking the failures on its internal deferreds.
closes #458 (:commit:`419a780`) 3892 3893Scrapy 0.20.0 (released 2013-11-08) 3894----------------------------------- 3895 3896Enhancements 3897~~~~~~~~~~~~ 3898 3899- New Selector's API including CSS selectors (:issue:`395` and :issue:`426`), 3900- Request/Response url/body attributes are now immutable 3901 (modifying them had been deprecated for a long time) 3902- :setting:`ITEM_PIPELINES` is now defined as a dict (instead of a list) 3903- Sitemap spider can fetch alternate URLs (:issue:`360`) 3904- ``Selector.remove_namespaces()`` now remove namespaces from element's attributes. (:issue:`416`) 3905- Paved the road for Python 3.3+ (:issue:`435`, :issue:`436`, :issue:`431`, :issue:`452`) 3906- New item exporter using native python types with nesting support (:issue:`366`) 3907- Tune HTTP1.1 pool size so it matches concurrency defined by settings (:commit:`b43b5f575`) 3908- scrapy.mail.MailSender now can connect over TLS or upgrade using STARTTLS (:issue:`327`) 3909- New FilesPipeline with functionality factored out from ImagesPipeline (:issue:`370`, :issue:`409`) 3910- Recommend Pillow instead of PIL for image handling (:issue:`317`) 3911- Added Debian packages for Ubuntu Quantal and Raring (:commit:`86230c0`) 3912- Mock server (used for tests) can listen for HTTPS requests (:issue:`410`) 3913- Remove multi spider support from multiple core components 3914 (:issue:`422`, :issue:`421`, :issue:`420`, :issue:`419`, :issue:`423`, :issue:`418`) 3915- Travis-CI now tests Scrapy changes against development versions of ``w3lib`` and ``queuelib`` python packages. 
- Add pypy 2.1 to continuous integration tests (:commit:`ecfa7431`)
- Pylinted, pep8 and removed old-style exceptions from source (:issue:`430`, :issue:`432`)
- Use importlib for parametric imports (:issue:`445`)
- Handle a regression introduced in Python 2.7.5 that affects XmlItemExporter (:issue:`372`)
- Bugfix crawling shutdown on SIGINT (:issue:`450`)
- Do not submit ``reset`` type inputs in FormRequest.from_response (:commit:`b326b87`)
- Do not silence download errors when request errback raises an exception (:commit:`684cfc0`)

Bugfixes
~~~~~~~~

- Fix tests under Django 1.6 (:commit:`b6bed44c`)
- Lots of bugfixes to retry middleware under disconnections using HTTP 1.1 download handler
- Fix inconsistencies among Twisted releases (:issue:`406`)
- Fix Scrapy shell bugs (:issue:`418`, :issue:`407`)
- Fix invalid variable name in setup.py (:issue:`429`)
- Fix tutorial references (:issue:`387`)
- Improve request-response docs (:issue:`391`)
- Improve best practices docs (:issue:`399`, :issue:`400`, :issue:`401`, :issue:`402`)
- Improve django integration docs (:issue:`404`)
- Document ``bindaddress`` request meta (:commit:`37c24e01d7`)
- Improve ``Request`` class documentation (:issue:`226`)

Other
~~~~~

- Dropped Python 2.6 support (:issue:`448`)
- Add :doc:`cssselect <cssselect:index>` python package as install dependency
- Drop libxml2 and multi selector's backend support, `lxml`_ is required from now on.
- Minimum Twisted version increased to 10.0.0, dropped Twisted 8.0 support.
- Running test suite now requires ``mock`` python library (:issue:`390`)


Thanks
~~~~~~

Thanks to everyone who contributed to this release!

List of contributors sorted by number of commits::

    69 Daniel Graña <dangra@...>
    37 Pablo Hoffman <pablo@...>
    13 Mikhail Korobov <kmike84@...>
     9 Alex Cepoi <alex.cepoi@...>
     9 alexanderlukanin13 <alexander.lukanin.13@...>
     8 Rolando Espinoza La fuente <darkrho@...>
     8 Lukasz Biedrycki <lukasz.biedrycki@...>
     6 Nicolas Ramirez <nramirez.uy@...>
     3 Paul Tremberth <paul.tremberth@...>
     2 Martin Olveyra <molveyra@...>
     2 Stefan <misc@...>
     2 Rolando Espinoza <darkrho@...>
     2 Loren Davie <loren@...>
     2 irgmedeiros <irgmedeiros@...>
     1 Stefan Koch <taikano@...>
     1 Stefan <cct@...>
     1 scraperdragon <dragon@...>
     1 Kumara Tharmalingam <ktharmal@...>
     1 Francesco Piccinno <stack.box@...>
     1 Marcos Campal <duendex@...>
     1 Dragon Dave <dragon@...>
     1 Capi Etheriel <barraponto@...>
     1 cacovsky <amarquesferraz@...>
     1 Berend Iwema <berend@...>

Scrapy 0.18.4 (released 2013-10-10)
-----------------------------------

- IPython refuses to update the namespace. fix #396 (:commit:`3d32c4f`)
- Fix AlreadyCalledError replacing a request in shell command. closes #407 (:commit:`b1d8919`)
- Fix start_requests laziness and early hangs (:commit:`89faf52`)

Scrapy 0.18.3 (released 2013-10-03)
-----------------------------------

- fix regression on lazy evaluation of start requests (:commit:`12693a5`)
- forms: do not submit reset inputs (:commit:`e429f63`)
- increase unittest timeouts to decrease travis false positive failures (:commit:`912202e`)
- backport master fixes to json exporter (:commit:`cfc2d46`)
- Fix permission and set umask before generating sdist tarball (:commit:`06149e0`)

Scrapy 0.18.2 (released 2013-09-03)
-----------------------------------

- Backport ``scrapy check`` command fixes and backward compatible multi
  crawler process (:issue:`339`)

Scrapy 0.18.1 (released 2013-08-27)
-----------------------------------

- remove extra import added by cherry picked changes (:commit:`d20304e`)
- fix crawling tests under twisted pre 11.0.0 (:commit:`1994f38`)
- py26 can not format zero length fields {} (:commit:`abf756f`)
- test PotentialDataLoss errors on unbound responses (:commit:`b15470d`)
- Treat responses without content-length or Transfer-Encoding as good responses (:commit:`c4bf324`)
- do not include ResponseFailed if http11 handler is not enabled (:commit:`6cbe684`)
- New HTTP client wraps connection lost in ResponseFailed exception. fix #373 (:commit:`1a20bba`)
- limit travis-ci build matrix (:commit:`3b01bb8`)
- Merge pull request #375 from peterarenot/patch-1 (:commit:`fa766d7`)
- Fixed so it refers to the correct folder (:commit:`3283809`)
- added Quantal & Raring to supported Ubuntu releases (:commit:`1411923`)
- fix retry middleware which didn't retry certain connection errors after the upgrade to http1 client, closes GH-373 (:commit:`bb35ed0`)
- fix XmlItemExporter in Python 2.7.4 and 2.7.5 (:commit:`de3e451`)
- minor updates to 0.18 release notes (:commit:`c45e5f1`)
- fix contributors list format (:commit:`0b60031`)

Scrapy 0.18.0 (released 2013-08-09)
-----------------------------------

- Lots of improvements to the testsuite run using Tox, including a way to test on pypy
- Handle GET parameters for AJAX crawleable urls (:commit:`3fe2a32`)
- Use lxml recover option to parse sitemaps (:issue:`347`)
- Bugfix cookie merging by hostname and not by netloc (:issue:`352`)
- Support disabling ``HttpCompressionMiddleware`` using a flag setting (:issue:`359`)
- Support xml namespaces using ``iternodes`` parser in ``XMLFeedSpider`` (:issue:`12`)
- Support ``dont_cache`` request meta flag (:issue:`19`)
- Bugfix ``scrapy.utils.gz.gunzip`` broken by changes in python 2.7.4 (:commit:`4dc76e`)
- Bugfix url encoding on ``SgmlLinkExtractor`` (:issue:`24`)
- Bugfix ``TakeFirst`` processor shouldn't discard zero (0) value (:issue:`59`)
- Support nested items in xml exporter (:issue:`66`)
- Improve cookies handling performance (:issue:`77`)
- Log dupe filtered requests once (:issue:`105`)
- Split redirection middleware into status and meta based middlewares (:issue:`78`)
- Use HTTP1.1 as default downloader handler (:issue:`109` and :issue:`318`)
- Support xpath form selection on ``FormRequest.from_response`` (:issue:`185`)
- Bugfix unicode decoding error on ``SgmlLinkExtractor`` (:issue:`199`)
- Bugfix signal dispatching on pypy interpreter (:issue:`205`)
- Improve request delay and concurrency handling (:issue:`206`)
- Add RFC2616 cache policy to ``HttpCacheMiddleware`` (:issue:`212`)
- Allow customization of messages logged by engine (:issue:`214`)
- Multiple improvements to ``DjangoItem`` (:issue:`217`, :issue:`218`, :issue:`221`)
- Extend Scrapy commands using setuptools entry points (:issue:`260`)
- Allow spider ``allowed_domains`` value to be set/tuple (:issue:`261`)
- Support ``settings.getdict`` (:issue:`269`)
- Simplify internal ``scrapy.core.scraper`` slot handling (:issue:`271`)
- Added ``Item.copy`` (:issue:`290`)
- Collect idle downloader slots (:issue:`297`)
- Add ``ftp://`` scheme downloader handler (:issue:`329`)
- Added downloader benchmark webserver and spider tools (see :ref:`benchmarking`)
- Moved persistent (on disk) queues to a separate project (queuelib_) which Scrapy now depends on
- Add Scrapy commands using external libraries (:issue:`260`)
- Added ``--pdb`` option to ``scrapy`` command line tool
- Added :meth:`XPathSelector.remove_namespaces <scrapy.selector.Selector.remove_namespaces>` which allows removing all namespaces from XML documents for convenience (to work with namespace-less XPaths). Documented in :ref:`topics-selectors`.
- Several improvements to spider contracts
- New default middleware named MetaRefreshMiddleware that handles meta-refresh html tag redirections
- MetaRefreshMiddleware and RedirectMiddleware have different priorities to address #62
- added from_crawler method to spiders
- added system tests with mock server
- more improvements to macOS compatibility (thanks Alex Cepoi)
- several more cleanups to singletons and multi-spider support (thanks Nicolas Ramirez)
- support custom download slots
- added --spider option to "shell" command.
- log overridden settings when Scrapy starts

Thanks to everyone who contributed to this release. Here is a list of
contributors sorted by number of commits::

    130 Pablo Hoffman <pablo@...>
     97 Daniel Graña <dangra@...>
     20 Nicolás Ramírez <nramirez.uy@...>
     13 Mikhail Korobov <kmike84@...>
     12 Pedro Faustino <pedrobandim@...>
     11 Steven Almeroth <sroth77@...>
      5 Rolando Espinoza La fuente <darkrho@...>
      4 Michal Danilak <mimino.coder@...>
      4 Alex Cepoi <alex.cepoi@...>
      4 Alexandr N Zamaraev (aka tonal) <tonal@...>
      3 paul <paul.tremberth@...>
      3 Martin Olveyra <molveyra@...>
      3 Jordi Llonch <llonchj@...>
      3 arijitchakraborty <myself.arijit@...>
      2 Shane Evans <shane.evans@...>
      2 joehillen <joehillen@...>
      2 Hart <HartSimha@...>
      2 Dan <ellisd23@...>
      1 Zuhao Wan <wanzuhao@...>
      1 whodatninja <blake@...>
      1 vkrest <v.krestiannykov@...>
      1 tpeng <pengtaoo@...>
      1 Tom Mortimer-Jones <tom@...>
      1 Rocio Aramberri <roschegel@...>
      1 Pedro <pedro@...>
      1 notsobad <wangxiaohugg@...>
      1 Natan L <kuyanatan.nlao@...>
      1 Mark Grey <mark.grey@...>
      1 Luan <luanpab@...>
      1 Libor Nenadál <libor.nenadal@...>
      1 Juan M Uys <opyate@...>
      1 Jonas Brunsgaard <jonas.brunsgaard@...>
      1 Ilya Baryshev <baryshev@...>
      1 Hasnain Lakhani <m.hasnain.lakhani@...>
      1 Emanuel Schorsch <emschorsch@...>
      1 Chris Tilden <chris.tilden@...>
      1 Capi Etheriel <barraponto@...>
      1 cacovsky <amarquesferraz@...>
      1 Berend Iwema <berend@...>


Scrapy 0.16.5 (released 2013-05-30)
-----------------------------------

- obey request method when Scrapy deploy is redirected to a new endpoint (:commit:`8c4fcee`)
- fix inaccurate downloader middleware documentation. refs #280 (:commit:`40667cb`)
- doc: remove links to diveintopython.org, which is no longer available. closes #246 (:commit:`bd58bfa`)
- Find form nodes in invalid html5 documents (:commit:`e3d6945`)
- Fix typo labeling attrs type bool instead of list (:commit:`a274276`)

Scrapy 0.16.4 (released 2013-01-23)
-----------------------------------

- fixes spelling errors in documentation (:commit:`6d2b3aa`)
- add doc about disabling an extension. refs #132 (:commit:`c90de33`)
- Fixed error message formatting. log.err() doesn't support cool formatting and when error occurred, the message was: "ERROR: Error processing %(item)s" (:commit:`c16150c`)
- lint and improve images pipeline error logging (:commit:`56b45fc`)
- fixed doc typos (:commit:`243be84`)
- add documentation topics: Broad Crawls & Common Practices (:commit:`1fbb715`)
- fix bug in Scrapy parse command when spider is not specified explicitly. closes #209 (:commit:`c72e682`)
- Update docs/topics/commands.rst (:commit:`28eac7a`)

Scrapy 0.16.3 (released 2012-12-07)
-----------------------------------

- Remove concurrency limitation when using download delays and still ensure inter-request delays are enforced (:commit:`487b9b5`)
- add error details when image pipeline fails (:commit:`8232569`)
- improve macOS compatibility (:commit:`8dcf8aa`)
- setup.py: use README.rst to populate long_description (:commit:`7b5310d`)
- doc: removed obsolete references to ClientForm (:commit:`80f9bb6`)
- correct docs for default storage backend (:commit:`2aa491b`)
- doc: removed broken proxyhub link from FAQ (:commit:`bdf61c4`)
- Fixed docs typo in SpiderOpenCloseLogging example (:commit:`7184094`)


Scrapy 0.16.2 (released 2012-11-09)
-----------------------------------

- Scrapy contracts: python2.6 compat (:commit:`a4a9199`)
- Scrapy contracts verbose option (:commit:`ec41673`)
- proper unittest-like output for Scrapy contracts (:commit:`86635e4`)
- added open_in_browser to debugging doc (:commit:`c9b690d`)
- removed reference to global Scrapy stats from settings doc (:commit:`dd55067`)
- Fix SpiderState bug in Windows platforms (:commit:`58998f4`)


Scrapy 0.16.1 (released 2012-10-26)
-----------------------------------

- fixed LogStats extension, which got broken after a wrong merge before the 0.16 release (:commit:`8c780fd`)
- better backward compatibility for scrapy.conf.settings (:commit:`3403089`)
- extended documentation on how to access crawler stats from extensions (:commit:`c4da0b5`)
- removed .hgtags (no longer needed now that Scrapy uses git) (:commit:`d52c188`)
- fix dashes under rst headers (:commit:`fa4f7f9`)
- set release date for 0.16.0 in news (:commit:`e292246`)


Scrapy 0.16.0 (released 2012-10-18)
-----------------------------------

Scrapy changes:

- added :ref:`topics-contracts`, a mechanism for testing spiders in a formal/reproducible way
- added options ``-o`` and ``-t`` to the :command:`runspider` command
- documented :doc:`topics/autothrottle` and added to extensions installed by default. You still need to enable it with :setting:`AUTOTHROTTLE_ENABLED`
- major Stats Collection refactoring: removed separation of global/per-spider stats, removed stats-related signals (``stats_spider_opened``, etc). Stats are much simpler now, backward compatibility is kept on the Stats Collector API and signals.
- added :meth:`~scrapy.spidermiddlewares.SpiderMiddleware.process_start_requests` method to spider middlewares
- dropped Signals singleton. Signals should now be accessed through the Crawler.signals attribute. See the signals documentation for more info.
- dropped Stats Collector singleton. Stats can now be accessed through the Crawler.stats attribute. See the stats collection documentation for more info.
- documented :ref:`topics-api`
- ``lxml`` is now the default selectors backend instead of ``libxml2``
- ported FormRequest.from_response() to use `lxml`_ instead of `ClientForm`_
- removed modules: ``scrapy.xlib.BeautifulSoup`` and ``scrapy.xlib.ClientForm``
- SitemapSpider: added support for sitemap urls ending in .xml and .xml.gz, even if they advertise a wrong content type (:commit:`10ed28b`)
- StackTraceDump extension: also dump trackref live references (:commit:`fe2ce93`)
- nested items now fully supported in JSON and JSONLines exporters
- added :reqmeta:`cookiejar` Request meta key to support multiple cookie sessions per spider
- decoupled encoding detection code to `w3lib.encoding`_, and ported Scrapy code to use that module
- dropped support for Python 2.5. See https://blog.scrapinghub.com/2012/02/27/scrapy-0-15-dropping-support-for-python-2-5/
- dropped support for Twisted 2.5
- added :setting:`REFERER_ENABLED` setting, to control referer middleware
- changed default user agent to: ``Scrapy/VERSION (+http://scrapy.org)``
- removed (undocumented) ``HTMLImageLinkExtractor`` class from ``scrapy.contrib.linkextractors.image``
- removed per-spider settings (to be replaced by instantiating multiple crawler objects)
- ``USER_AGENT`` spider attribute will no longer work, use ``user_agent`` attribute instead
- ``DOWNLOAD_TIMEOUT`` spider attribute will no longer work, use ``download_timeout`` attribute instead
- removed ``ENCODING_ALIASES`` setting, as encoding auto-detection has been moved to the `w3lib`_ library
- promoted :ref:`topics-djangoitem` to main contrib
- LogFormatter methods now return dicts (instead of strings) to support lazy formatting (:issue:`164`, :commit:`dcef7b0`)
- downloader handlers (:setting:`DOWNLOAD_HANDLERS` setting) now receive settings as the first argument of the ``__init__`` method
- replaced memory usage accounting with the (more portable) `resource`_ module, removed ``scrapy.utils.memory`` module
- removed signal: ``scrapy.mail.mail_sent``
- removed ``TRACK_REFS`` setting, now :ref:`trackrefs <topics-leaks-trackrefs>` is always enabled
- DBM is now the default storage backend for HTTP cache middleware
- the number of log messages (per level) is now tracked through Scrapy stats (stat name: ``log_count/LEVEL``)
- the number of received responses is now tracked through Scrapy stats (stat name: ``response_received_count``)
- removed ``scrapy.log.started`` attribute

Scrapy 0.14.4
-------------

- added precise to supported Ubuntu distros (:commit:`b7e46df`)
- fixed bug in json-rpc webservice reported in https://groups.google.com/forum/#!topic/scrapy-users/qgVBmFybNAQ/discussion. also removed no longer supported 'run' command from extras/scrapy-ws.py (:commit:`340fbdb`)
- meta tag attributes for content-type http equiv can be in any order. #123 (:commit:`0cb68af`)
- replace "import Image" by more standard "from PIL import Image". closes #88 (:commit:`4d17048`)
- return trial status as bin/runtests.sh exit value. #118 (:commit:`b7b2e7f`)

Scrapy 0.14.3
-------------

- forgot to include pydispatch license. #118 (:commit:`fd85f9c`)
- include egg files used by testsuite in source distribution. #118 (:commit:`c897793`)
- update docstring in project template to avoid confusion with genspider command, which may be considered as an advanced feature. refs #107 (:commit:`2548dcc`)
- added note to docs/topics/firebug.rst about google directory being shut down (:commit:`668e352`)
- don't discard slot when empty, just save in another dict in order to recycle it if needed again. (:commit:`8e9f607`)
- do not fail handling unicode xpaths in libxml2 backed selectors (:commit:`b830e95`)
- fixed minor mistake in Request objects documentation (:commit:`bf3c9ee`)
- fixed minor defect in link extractors documentation (:commit:`ba14f38`)
- removed some obsolete remaining code related to sqlite support in Scrapy (:commit:`0665175`)

Scrapy 0.14.2
-------------

- move buffer pointing to start of file before computing checksum. refs #92 (:commit:`6a5bef2`)
- Compute image checksum before persisting images. closes #92 (:commit:`9817df1`)
- remove leaking references in cached failures (:commit:`673a120`)
- fixed bug in MemoryUsage extension: get_engine_status() takes exactly 1 argument (0 given) (:commit:`11133e9`)
- fixed struct.error on http compression middleware. closes #87 (:commit:`1423140`)
- ajax crawling wasn't expanding for unicode urls (:commit:`0de3fb4`)
- Catch start_requests iterator errors. refs #83 (:commit:`454a21d`)
- Speed-up libxml2 XPathSelector (:commit:`2fbd662`)
- updated versioning doc according to recent changes (:commit:`0a070f5`)
- scrapyd: fixed documentation link (:commit:`2b4e4c3`)
- extras/makedeb.py: no longer obtaining version from git (:commit:`caffe0e`)

Scrapy 0.14.1
-------------

- extras/makedeb.py: no longer obtaining version from git (:commit:`caffe0e`)
- bumped version to 0.14.1 (:commit:`6cb9e1c`)
- fixed reference to tutorial directory (:commit:`4b86bd6`)
- doc: removed duplicated callback argument from Request.replace() (:commit:`1aeccdd`)
- fixed formatting of scrapyd doc (:commit:`8bf19e6`)
- Dump stacks for all running threads and fix engine status dumped by StackTraceDump extension (:commit:`14a8e6e`)
- added comment about why we disable ssl on boto images upload (:commit:`5223575`)
- SSL handshaking hangs when doing too many parallel connections to S3 (:commit:`63d583d`)
- change tutorial to follow changes on dmoz site (:commit:`bcb3198`)
- Avoid _disconnectedDeferred AttributeError exception in Twisted>=11.1.0 (:commit:`98f3f87`)
- allow spider to set autothrottle max concurrency (:commit:`175a4b5`)

Scrapy 0.14
-----------

New features and settings
~~~~~~~~~~~~~~~~~~~~~~~~~

- Support for `AJAX crawleable urls`_
- New persistent scheduler that stores requests on disk, allowing crawls to be suspended and resumed (:rev:`2737`)
- added ``-o`` option to ``scrapy crawl``, a shortcut for dumping scraped items into a file (or standard output using ``-``)
- Added support for passing custom settings to Scrapyd ``schedule.json`` api (:rev:`2779`, :rev:`2783`)
- New ``ChunkedTransferMiddleware`` (enabled by default) to support `chunked transfer encoding`_ (:rev:`2769`)
- Add boto 2.0 support for S3 downloader handler (:rev:`2763`)
- Added `marshal`_ to formats supported by feed exports (:rev:`2744`)
- In request errbacks, offending requests are now received in the ``failure.request`` attribute (:rev:`2738`)
- Big downloader refactoring to support per domain/ip concurrency limits (:rev:`2732`)

  - ``CONCURRENT_REQUESTS_PER_SPIDER`` setting has been deprecated and replaced by
    :setting:`CONCURRENT_REQUESTS`, :setting:`CONCURRENT_REQUESTS_PER_DOMAIN`, :setting:`CONCURRENT_REQUESTS_PER_IP`
  - check the documentation for more details

- Added builtin caching DNS resolver (:rev:`2728`)
- Moved Amazon AWS-related components/extensions (SQS spider queue, SimpleDB stats collector) to a separate project: `scaws <https://github.com/scrapinghub/scaws>`_ (:rev:`2706`, :rev:`2714`)
- Moved spider queues to scrapyd: ``scrapy.spiderqueue`` -> ``scrapyd.spiderqueue`` (:rev:`2708`)
- Moved sqlite utils to scrapyd: ``scrapy.utils.sqlite`` -> ``scrapyd.sqlite`` (:rev:`2781`)
- Real support for returning iterators in the ``start_requests()`` method. The iterator is now consumed during the crawl when the spider goes idle (:rev:`2704`)
- Added :setting:`REDIRECT_ENABLED` setting to quickly enable/disable the redirect middleware (:rev:`2697`)
- Added :setting:`RETRY_ENABLED` setting to quickly enable/disable the retry middleware (:rev:`2694`)
- Added ``CloseSpider`` exception to manually close spiders (:rev:`2691`)
- Improved encoding detection by adding support for HTML5 meta charset declaration (:rev:`2690`)
- Refactored close spider behavior to wait for all downloads to finish and be processed by spiders, before closing the spider (:rev:`2688`)
- Added ``SitemapSpider`` (see documentation in Spiders page) (:rev:`2658`)
- Added ``LogStats`` extension for periodically logging basic stats (like crawled pages and scraped items) (:rev:`2657`)
- Make handling of gzipped responses more robust (#319, :rev:`2643`). Now Scrapy will try and decompress as much as possible from a gzipped response, instead of failing with an ``IOError``.
- Simplified MemoryDebugger extension to use stats for dumping memory debugging info (:rev:`2639`)
- Added new command to edit spiders: ``scrapy edit`` (:rev:`2636`) and ``-e`` flag to ``genspider`` command that uses it (:rev:`2653`)
- Changed default representation of items to pretty-printed dicts (:rev:`2631`). This improves default logging by making logs more readable in the default case, for both Scraped and Dropped lines.
- Added :signal:`spider_error` signal (:rev:`2628`)
- Added :setting:`COOKIES_ENABLED` setting (:rev:`2625`)
- Stats are now dumped to the Scrapy log (the default value of the :setting:`STATS_DUMP` setting has been changed to ``True``). This is to make Scrapy users more aware of Scrapy stats and the data that is collected there.
4300- Added support for dynamically adjusting download delay and maximum concurrent requests (:rev:`2599`) 4301- Added new DBM HTTP cache storage backend (:rev:`2576`) 4302- Added ``listjobs.json`` API to Scrapyd (:rev:`2571`) 4303- ``CsvItemExporter``: added ``join_multivalued`` parameter (:rev:`2578`) 4304- Added namespace support to ``xmliter_lxml`` (:rev:`2552`) 4305- Improved cookies middleware by making ``COOKIES_DEBUG`` nicer and documenting it (:rev:`2579`) 4306- Several improvements to Scrapyd and Link extractors 4307 4308Code rearranged and removed 4309~~~~~~~~~~~~~~~~~~~~~~~~~~~ 4310 4311- Merged item passed and item scraped concepts, as they have often proved confusing in the past. This means: (:rev:`2630`) 4312 - original item_scraped signal was removed 4313 - original item_passed signal was renamed to item_scraped 4314 - old log lines ``Scraped Item...`` were removed 4315 - old log lines ``Passed Item...`` were renamed to ``Scraped Item...`` lines and downgraded to ``DEBUG`` level 4316- Reduced Scrapy codebase by striping part of Scrapy code into two new libraries: 4317 - `w3lib`_ (several functions from ``scrapy.utils.{http,markup,multipart,response,url}``, done in :rev:`2584`) 4318 - `scrapely`_ (was ``scrapy.contrib.ibl``, done in :rev:`2586`) 4319- Removed unused function: ``scrapy.utils.request.request_info()`` (:rev:`2577`) 4320- Removed googledir project from ``examples/googledir``. There's now a new example project called ``dirbot`` available on GitHub: https://github.com/scrapy/dirbot 4321- Removed support for default field values in Scrapy items (:rev:`2616`) 4322- Removed experimental crawlspider v2 (:rev:`2632`) 4323- Removed scheduler middleware to simplify architecture. 
Duplicates filter is now done in the scheduler itself, using the same dupe fltering class as before (``DUPEFILTER_CLASS`` setting) (:rev:`2640`) 4324- Removed support for passing urls to ``scrapy crawl`` command (use ``scrapy parse`` instead) (:rev:`2704`) 4325- Removed deprecated Execution Queue (:rev:`2704`) 4326- Removed (undocumented) spider context extension (from scrapy.contrib.spidercontext) (:rev:`2780`) 4327- removed ``CONCURRENT_SPIDERS`` setting (use scrapyd maxproc instead) (:rev:`2789`) 4328- Renamed attributes of core components: downloader.sites -> downloader.slots, scraper.sites -> scraper.slots (:rev:`2717`, :rev:`2718`) 4329- Renamed setting ``CLOSESPIDER_ITEMPASSED`` to :setting:`CLOSESPIDER_ITEMCOUNT` (:rev:`2655`). Backward compatibility kept. 4330 4331Scrapy 0.12 4332----------- 4333 4334The numbers like #NNN reference tickets in the old issue tracker (Trac) which is no longer available. 4335 4336New features and improvements 4337~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 4338 4339- Passed item is now sent in the ``item`` argument of the :signal:`item_passed 4340 <item_scraped>` (#273) 4341- Added verbose option to ``scrapy version`` command, useful for bug reports (#298) 4342- HTTP cache now stored by default in the project data dir (#279) 4343- Added project data storage directory (#276, #277) 4344- Documented file structure of Scrapy projects (see command-line tool doc) 4345- New lxml backend for XPath selectors (#147) 4346- Per-spider settings (#245) 4347- Support exit codes to signal errors in Scrapy commands (#248) 4348- Added ``-c`` argument to ``scrapy shell`` command 4349- Made ``libxml2`` optional (#260) 4350- New ``deploy`` command (#261) 4351- Added :setting:`CLOSESPIDER_PAGECOUNT` setting (#253) 4352- Added :setting:`CLOSESPIDER_ERRORCOUNT` setting (#254) 4353 4354Scrapyd changes 4355~~~~~~~~~~~~~~~ 4356 4357- Scrapyd now uses one process per spider 4358- It stores one log file per spider run, and rotate them keeping the lastest 5 logs per 
  spider (by default)
- A minimal web UI was added, available at http://localhost:6800 by default
- There is now a ``scrapy server`` command to start a Scrapyd server for the current project

Changes to settings
~~~~~~~~~~~~~~~~~~~

- Added the ``HTTPCACHE_ENABLED`` setting (``False`` by default) to enable the HTTP cache middleware
- Changed the ``HTTPCACHE_EXPIRATION_SECS`` semantics: now zero means "never expire"

Deprecated/obsoleted functionality
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Deprecated the ``runserver`` command in favor of the ``server`` command, which starts a Scrapyd server. See also: Scrapyd changes
- Deprecated the ``queue`` command in favor of using the Scrapyd ``schedule.json`` API. See also: Scrapyd changes
- Removed ``LxmlItemLoader`` (an experimental contrib that never graduated to main contrib)

Scrapy 0.10
-----------

The numbers like #NNN reference tickets in the old issue tracker (Trac), which is no longer available.

New features and improvements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- New Scrapy service called ``scrapyd`` for deploying Scrapy crawlers in production (#218) (documentation available)
- Simplified images pipeline usage: subclassing your own images pipeline is no longer required (#217)
- The Scrapy shell now shows the Scrapy log by default (#206)
- Refactored the execution queue into a common base code and pluggable backends called "spider queues" (#220)
- New persistent spider queue (based on SQLite) (#198), available by default, which allows starting Scrapy in server mode and then scheduling spiders to run.
- Added documentation for the Scrapy command-line tool and all its available sub-commands.
  (documentation available)
- Feed exporters with pluggable backends (#197) (documentation available)
- Deferred signals (#193)
- Added two new methods to item pipelines, ``open_spider()`` and ``close_spider()``, with deferred support (#195)
- Support for overriding default request headers per spider (#181)
- Replaced the default Spider Manager with one of similar functionality that does not depend on Twisted Plugins (#186)
- Split the Debian package into two packages: the library and the service (#187)
- Scrapy log refactoring (#188)
- New extension for keeping persistent spider contexts among different runs (#203)
- Added a ``dont_redirect`` request.meta key for avoiding redirects (#233)
- Added a ``dont_retry`` request.meta key for avoiding retries (#234)

Command-line tool changes
~~~~~~~~~~~~~~~~~~~~~~~~~

- New ``scrapy`` command which replaces the old ``scrapy-ctl.py`` (#199)

  - there is only one global ``scrapy`` command now, instead of one ``scrapy-ctl.py`` per project
  - Added a ``scrapy.bat`` script for running more conveniently from Windows

- Added bash completion to the command-line tool (#210)
- Renamed the ``start`` command to ``runserver`` (#209)

API changes
~~~~~~~~~~~

- The ``url`` and ``body`` attributes of Request objects are now read-only (#230)
- ``Request.copy()`` and ``Request.replace()`` now also copy their ``callback`` and ``errback`` attributes (#231)
- Removed ``UrlFilterMiddleware`` from ``scrapy.contrib`` (it was already disabled by default)
- The offsite middleware doesn't filter out any request coming from a spider that doesn't have an ``allowed_domains`` attribute (#225)
- Removed the Spider Manager ``load()`` method. Spiders are now loaded in the ``__init__`` method itself.
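As a purely illustrative aside on the ``dont_redirect``/``dont_retry`` request.meta keys added in this release, the sketch below shows how such a per-request opt-out flag is typically consumed. This is plain Python with a hypothetical ``should_retry`` helper, not Scrapy's actual middleware code:

```python
# Hypothetical sketch, NOT Scrapy's actual RetryMiddleware: illustrates
# how a per-request opt-out flag such as the ``dont_retry`` request.meta
# key (added in 0.10) is honored by middleware-style code.

RETRY_STATUSES = {500, 502, 503, 504}  # assumed set of retryable statuses

def should_retry(meta, status):
    """Return True if a response with *status* should be retried,
    unless the request opted out via the dont_retry meta key."""
    if meta.get("dont_retry", False):
        return False
    return status in RETRY_STATUSES

print(should_retry({}, 503))                    # True: 503 is retryable
print(should_retry({"dont_retry": True}, 503))  # False: request opted out
```

In real spider code the flag simply travels with the request (e.g. ``Request(url, meta={"dont_retry": True})``), and the middleware checks it as above before acting.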
- Changes to Scrapy Manager (now called "Crawler"):

  - the ``scrapy.core.manager.ScrapyManager`` class was renamed to ``scrapy.crawler.Crawler``
  - the ``scrapy.core.manager.scrapymanager`` singleton was moved to ``scrapy.project.crawler``

- Moved module: ``scrapy.contrib.spidermanager`` to ``scrapy.spidermanager``
- The Spider Manager singleton was moved from ``scrapy.spider.spiders`` to the ``spiders`` attribute of the ``scrapy.project.crawler`` singleton.
- Moved Stats Collector classes (#204):

  - ``scrapy.stats.collector.StatsCollector`` to ``scrapy.statscol.StatsCollector``
  - ``scrapy.stats.collector.SimpledbStatsCollector`` to ``scrapy.contrib.statscol.SimpledbStatsCollector``

- Default per-command settings are now specified in the ``default_settings`` attribute of the command object class (#201)
- Changed the arguments of the item pipeline ``process_item()`` method from ``(spider, item)`` to ``(item, spider)``

  - backward compatibility kept (with deprecation warning)

- Moved the ``scrapy.core.signals`` module to ``scrapy.signals``

  - backward compatibility kept (with deprecation warning)

- Moved the ``scrapy.core.exceptions`` module to ``scrapy.exceptions``

  - backward compatibility kept (with deprecation warning)

- Added a ``handles_request()`` class method to ``BaseSpider``
- Dropped the ``scrapy.log.exc()`` function (use ``scrapy.log.err()`` instead)
- Dropped the ``component`` argument of the ``scrapy.log.msg()`` function
- Dropped the ``scrapy.log.log_level`` attribute
- Added ``from_settings()`` class methods to the Spider Manager and Item Pipeline Manager

Changes to settings
~~~~~~~~~~~~~~~~~~~

- Added the ``HTTPCACHE_IGNORE_SCHEMES`` setting to ignore certain schemes on ``HttpCacheMiddleware`` (#225)
- Added the ``SPIDER_QUEUE_CLASS`` setting, which defines the spider queue to use (#220)
- Added the ``KEEP_ALIVE`` setting (#220)
- Removed the ``SERVICE_QUEUE`` setting (#220)
- Removed the ``COMMANDS_SETTINGS_MODULE`` setting
  (#201)
- Renamed ``REQUEST_HANDLERS`` to ``DOWNLOAD_HANDLERS`` and made download handlers classes (instead of functions)

Scrapy 0.9
----------

The numbers like #NNN reference tickets in the old issue tracker (Trac), which is no longer available.

New features and improvements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Added SMTP-AUTH support to ``scrapy.mail``
- New settings added: ``MAIL_USER``, ``MAIL_PASS`` (:rev:`2065` | #149)
- Added a new ``scrapy-ctl view`` command, to view a URL in the browser as seen by Scrapy (:rev:`2039`)
- Added a web service for controlling the Scrapy process (this also deprecates the web console) (:rev:`2053` | #167)
- Support for running Scrapy as a service, for production systems (:rev:`1988`, :rev:`2054`, :rev:`2055`, :rev:`2056`, :rev:`2057` | #168)
- Added a wrapper induction library (documentation only available in source code for now) (:rev:`2011`)
- Simplified and improved response encoding support (:rev:`1961`, :rev:`1969`)
- Added the ``LOG_ENCODING`` setting (:rev:`1956`, documentation available)
- Added the ``RANDOMIZE_DOWNLOAD_DELAY`` setting (enabled by default) (:rev:`1923`, doc available)
- ``MailSender`` is no longer IO-blocking (:rev:`1955` | #146)
- Link extractors and the new ``CrawlSpider`` now handle relative base tag URLs (:rev:`1960` | #148)
- Several improvements to Item Loaders and processors (:rev:`2022`, :rev:`2023`, :rev:`2024`, :rev:`2025`, :rev:`2026`, :rev:`2027`, :rev:`2028`, :rev:`2029`, :rev:`2030`)
- Added support for adding variables to the telnet console (:rev:`2047` | #165)
- Support for requests without callbacks (:rev:`2050` | #166)

API changes
~~~~~~~~~~~

- Changed ``Spider.domain_name`` to ``Spider.name`` (SEP-012, :rev:`1975`)
- ``Response.encoding`` is now the detected encoding (:rev:`1961`)
- ``HttpErrorMiddleware`` now returns None or raises an exception (:rev:`2006` | #157)
- ``scrapy.command`` modules
  relocation (:rev:`2035`, :rev:`2036`, :rev:`2037`)
- Added ``ExecutionQueue`` for feeding spiders to scrape (:rev:`2034`)
- Removed the ``ExecutionEngine`` singleton (:rev:`2039`)
- Ported ``S3ImagesStore`` (images pipeline) to use boto and threads (:rev:`2033`)
- Moved module: ``scrapy.management.telnet`` to ``scrapy.telnet`` (:rev:`2047`)

Changes to default settings
~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Changed the default ``SCHEDULER_ORDER`` to ``DFO`` (:rev:`1939`)

Scrapy 0.8
----------

The numbers like #NNN reference tickets in the old issue tracker (Trac), which is no longer available.

New features
~~~~~~~~~~~~

- Added the ``DEFAULT_RESPONSE_ENCODING`` setting (:rev:`1809`)
- Added a ``dont_click`` argument to the ``FormRequest.from_response()`` method (:rev:`1813`, :rev:`1816`)
- Added a ``clickdata`` argument to the ``FormRequest.from_response()`` method (:rev:`1802`, :rev:`1803`)
- Added support for HTTP proxies (``HttpProxyMiddleware``) (:rev:`1781`, :rev:`1785`)
- The offsite spider middleware now logs messages when filtering out requests (:rev:`1841`)

Backward-incompatible changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Changed the ``scrapy.utils.response.get_meta_refresh()`` signature (:rev:`1804`)
- Removed the deprecated ``scrapy.item.ScrapedItem`` class - use ``scrapy.item.Item`` instead (:rev:`1838`)
- Removed the deprecated ``scrapy.xpath`` module - use ``scrapy.selector`` instead (:rev:`1836`)
- Removed the deprecated ``core.signals.domain_open`` signal - use ``core.signals.domain_opened`` instead (:rev:`1822`)
- ``log.msg()`` now receives a ``spider`` argument (:rev:`1822`)

  - The old ``domain`` argument has been deprecated and will be removed in 0.9. For spiders, you should always use the ``spider`` argument and pass spider references. If you really want to pass a string, use the ``component`` argument instead.
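The porting advice above can be pictured with a small stand-in. The following is illustrative plain Python (a hypothetical ``msg`` function and spider class, not the real ``scrapy.log`` module):

```python
# Illustrative stand-in for the 0.8 logging change, NOT the real
# ``scrapy.log`` module: log calls now identify their source with a
# spider reference instead of a domain string.

class ExampleSpider:
    domain_name = "example.com"  # hypothetical spider

def msg(message, spider=None, component="scrapy"):
    """New-style call: pass a spider reference. A plain string source
    would go through the ``component`` argument instead."""
    source = spider.domain_name if spider is not None else component
    return f"[{source}] {message}"

print(msg("Crawled 10 pages", spider=ExampleSpider()))  # [example.com] Crawled 10 pages
```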
- Changed the core signals ``domain_opened``, ``domain_closed``, ``domain_idle``
- Changed the item pipeline to use spiders instead of domains

  - The ``domain`` argument of the ``process_item()`` item pipeline method was changed to ``spider``; the new signature is ``process_item(spider, item)`` (:rev:`1827` | #105)
  - To quickly port your code (to work with Scrapy 0.8) just use ``spider.domain_name`` where you previously used ``domain``.

- Changed the Stats API to use spiders instead of domains (:rev:`1849` | #113)

  - ``StatsCollector`` was changed to receive spider references (instead of domains) in its methods (``set_value``, ``inc_value``, etc.)
  - Added the ``StatsCollector.iter_spider_stats()`` method
  - Removed the ``StatsCollector.list_domains()`` method
  - Also, Stats signals were renamed and now pass around spider references (instead of domains)
  - To quickly port your code (to work with Scrapy 0.8) just use ``spider.domain_name`` where you previously used ``domain``. ``spider_stats`` contains exactly the same data as ``domain_stats``.

- The ``CloseDomain`` extension was moved to ``scrapy.contrib.closespider.CloseSpider`` (:rev:`1833`)

  - Its settings were also renamed:

    - ``CLOSEDOMAIN_TIMEOUT`` to ``CLOSESPIDER_TIMEOUT``
    - ``CLOSEDOMAIN_ITEMCOUNT`` to ``CLOSESPIDER_ITEMCOUNT``

- Removed the deprecated ``SCRAPYSETTINGS_MODULE`` environment variable - use ``SCRAPY_SETTINGS_MODULE`` instead (:rev:`1840`)
- Renamed setting: ``REQUESTS_PER_DOMAIN`` to ``CONCURRENT_REQUESTS_PER_SPIDER`` (:rev:`1830`, :rev:`1844`)
- Renamed setting: ``CONCURRENT_DOMAINS`` to ``CONCURRENT_SPIDERS`` (:rev:`1830`)
- The HTTP Cache middleware has been heavily refactored, retaining the same functionality except for the domain sectorization, which was removed.
  (:rev:`1843`)
- Renamed exception: ``DontCloseDomain`` to ``DontCloseSpider`` (:rev:`1859` | #120)
- Renamed extension: ``DelayedCloseDomain`` to ``SpiderCloseDelay`` (:rev:`1861` | #121)
- Removed the obsolete ``scrapy.utils.markup.remove_escape_chars`` function - use ``scrapy.utils.markup.replace_escape_chars`` instead (:rev:`1865`)

Scrapy 0.7
----------

First release of Scrapy.


.. _AJAX crawleable urls: https://developers.google.com/search/docs/ajax-crawling/docs/getting-started?csw=1
.. _botocore: https://github.com/boto/botocore
.. _chunked transfer encoding: https://en.wikipedia.org/wiki/Chunked_transfer_encoding
.. _ClientForm: http://wwwsearch.sourceforge.net/old/ClientForm/
.. _Creating a pull request: https://help.github.com/en/articles/creating-a-pull-request
.. _cryptography: https://cryptography.io/en/latest/
.. _docstrings: https://docs.python.org/3/glossary.html#term-docstring
.. _KeyboardInterrupt: https://docs.python.org/3/library/exceptions.html#KeyboardInterrupt
.. _LevelDB: https://github.com/google/leveldb
.. _lxml: https://lxml.de/
.. _marshal: https://docs.python.org/2/library/marshal.html
.. _parsel.csstranslator.GenericTranslator: https://parsel.readthedocs.io/en/latest/parsel.html#parsel.csstranslator.GenericTranslator
.. _parsel.csstranslator.HTMLTranslator: https://parsel.readthedocs.io/en/latest/parsel.html#parsel.csstranslator.HTMLTranslator
.. _parsel.csstranslator.XPathExpr: https://parsel.readthedocs.io/en/latest/parsel.html#parsel.csstranslator.XPathExpr
.. _PEP 257: https://www.python.org/dev/peps/pep-0257/
.. _Pillow: https://python-pillow.org/
.. _pyOpenSSL: https://www.pyopenssl.org/en/stable/
.. _queuelib: https://github.com/scrapy/queuelib
.. _registered with IANA: https://www.iana.org/assignments/media-types/media-types.xhtml
.. _resource: https://docs.python.org/2/library/resource.html
.. _robots.txt: https://www.robotstxt.org/
.. _scrapely: https://github.com/scrapy/scrapely
.. _scrapy-bench: https://github.com/scrapy/scrapy-bench
.. _service_identity: https://service-identity.readthedocs.io/en/stable/
.. _six: https://six.readthedocs.io/
.. _tox: https://pypi.org/project/tox/
.. _Twisted: https://twistedmatrix.com/trac/
.. _w3lib: https://github.com/scrapy/w3lib
.. _w3lib.encoding: https://github.com/scrapy/w3lib/blob/master/w3lib/encoding.py
.. _What is cacheable: https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.9.1
.. _zope.interface: https://zopeinterface.readthedocs.io/en/latest/
.. _Zsh: https://www.zsh.org/
.. _zstandard: https://pypi.org/project/zstandard/