==================
pdfrw 0.4
==================

:Author: Patrick Maupin

.. contents::
    :backlinks: none

.. sectnum::

Introduction
============

**pdfrw** is a Python library and utility that reads and writes PDF files:

* Version 0.4 is tested and works on Python 2.6, 2.7, 3.3, 3.4, 3.5, and 3.6
* Operations include subsetting, merging, rotating, modifying metadata, etc.
* The fastest pure Python PDF parser available
* Has been used for years by a printer in pre-press production
* Can be used with rst2pdf to faithfully reproduce vector images
* Can be used either standalone, or in conjunction with `reportlab`__
  to reuse existing PDFs in new ones
* Permissively licensed

__ http://www.reportlab.org/


pdfrw will faithfully reproduce vector formats without
rasterization, so the rst2pdf package has used pdfrw
for PDF and SVG images by default since March 2010.

pdfrw can also be used in conjunction with reportlab, in order
to re-use portions of existing PDFs in new PDFs created with
reportlab.


Examples
=========

The library comes with several examples that show operation both with
and without reportlab.


All examples
------------------

The examples directory has a few scripts which use the library.
Note that if these examples do not work with your PDF, you should
try using pdftk to uncompress and/or decrypt it first.

* `4up.py`__ will shrink pages down and place 4 of them on
  each output page.
* `alter.py`__ shows an example of modifying metadata, without
  altering the structure of the PDF.
* `booklet.py`__ shows an example of creating a 2-up output
  suitable for printing and folding (e.g. on tabloid size paper).
* `cat.py`__ shows an example of concatenating multiple PDFs together.
* `extract.py`__ will extract images and Form XObjects (embedded pages)
  from existing PDFs to make them easier to use and refer to from
  new PDFs (e.g. with reportlab or rst2pdf).
* `poster.py`__ increases the size of a PDF so it can be printed
  as a poster.
* `print_two.py`__ allows creation of 8.5 X 5.5" booklets by slicing
  8.5 X 11" paper apart after printing.
* `rotate.py`__ rotates all or selected pages in a PDF.
* `subset.py`__ creates a new PDF with only a subset of pages from the
  original.
* `unspread.py`__ takes a 2-up PDF, and splits out pages.
* `watermark.py`__ adds a watermark PDF image over or under all the pages
  of a PDF.
* `rl1/4up.py`__ is another 4up example, using reportlab canvas for output.
* `rl1/booklet.py`__ is another booklet example, using reportlab canvas for
  output.
* `rl1/subset.py`__ is another subsetting example, using reportlab canvas for
  output.
* `rl1/platypus_pdf_template.py`__ is another watermarking example, using
  reportlab canvas and generated output for the document.  Contributed
  by user asannes.
* `rl2`__ contains experimental code for parsing graphics.  Needs work.
* `subset_booklets.py`__ shows an example of creating a full printable PDF
  version in a more professional and practical way; for background on
  binding, take a look at http://www.wikihow.com/Bind-a-Book

__ https://github.com/pmaupin/pdfrw/tree/master/examples/4up.py
__ https://github.com/pmaupin/pdfrw/tree/master/examples/alter.py
__ https://github.com/pmaupin/pdfrw/tree/master/examples/booklet.py
__ https://github.com/pmaupin/pdfrw/tree/master/examples/cat.py
__ https://github.com/pmaupin/pdfrw/tree/master/examples/extract.py
__ https://github.com/pmaupin/pdfrw/tree/master/examples/poster.py
__ https://github.com/pmaupin/pdfrw/tree/master/examples/print_two.py
__ https://github.com/pmaupin/pdfrw/tree/master/examples/rotate.py
__ https://github.com/pmaupin/pdfrw/tree/master/examples/subset.py
__ https://github.com/pmaupin/pdfrw/tree/master/examples/unspread.py
__ https://github.com/pmaupin/pdfrw/tree/master/examples/watermark.py
__ https://github.com/pmaupin/pdfrw/tree/master/examples/rl1/4up.py
__ https://github.com/pmaupin/pdfrw/tree/master/examples/rl1/booklet.py
__ https://github.com/pmaupin/pdfrw/tree/master/examples/rl1/subset.py
__ https://github.com/pmaupin/pdfrw/tree/master/examples/rl1/platypus_pdf_template.py
__ https://github.com/pmaupin/pdfrw/tree/master/examples/rl2/
__ https://github.com/pmaupin/pdfrw/tree/master/examples/subset_booklets.py

Notes on selected examples
------------------------------------

Reorganizing pages and placing them two-up
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A print shop with a fancy printer and/or a full copy of Acrobat can
easily turn your small PDF into a little booklet (for example, by printing
4 letter-sized pages on a single 11" x 17" sheet).

But that assumes several things, including that the personnel know how
to operate the hardware and software. `booklet.py`__ lets you turn your PDF
into a preformatted booklet, to give them fewer chances to mess it up.

__ https://github.com/pmaupin/pdfrw/tree/master/examples/booklet.py
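
The page reordering behind a booklet like this can be pictured with a
few lines of plain Python.  This is only a sketch of the general idea,
not the code in booklet.py: ``booklet_order`` is a hypothetical helper
that works on page numbers rather than real pages, and it assumes a
single folded stack of sheets with two pages per side, padded with
blanks to a multiple of four::

    def booklet_order(num_pages):
        """Return (left, right) page numbers for each side of each sheet.

        None marks a blank page added as padding.
        """
        pages = list(range(1, num_pages + 1))
        pages += [None] * (-len(pages) % 4)   # pad to a multiple of four
        sides = []
        while pages:
            # Front of the sheet: last remaining page beside the first.
            sides.append((pages.pop(), pages.pop(0)))
            # Back of the sheet: next page beside the new last page.
            sides.append((pages.pop(0), pages.pop()))
        return sides

    eight = booklet_order(8)

For an eight-page document the sides come out as (8, 1), (2, 7), (6, 3)
and (4, 5), which is the order needed for the folded sheets to read
1 through 8.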

Adding or modifying metadata
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The `cat.py`__ example will accept multiple input files on the command
line, concatenate them and output them to output.pdf, after adding some
nonsensical metadata to the output PDF file.

__ https://github.com/pmaupin/pdfrw/tree/master/examples/cat.py

The `alter.py`__ example alters a single metadata item in a PDF,
and writes the result to a new PDF.

__ https://github.com/pmaupin/pdfrw/tree/master/examples/alter.py


One difference is that, since **cat** is creating a new PDF structure,
and **alter** is attempting to modify an existing PDF structure, the
PDF produced by alter (and also by watermark.py) *should* be
more faithful to the original (except for the desired changes).

For example, with alter.py the document's navigation should be left
intact, whereas with cat.py it will be stripped.


Rotating and doubling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you ever want to print something that is like a small booklet, but
needs to be spiral bound, you either have to do some fancy rearranging,
or just waste half your paper.

The `print_two.py`__ example program will, for example, place two
side-by-side copies of each page of your PDF on each output sheet.

__ https://github.com/pmaupin/pdfrw/tree/master/examples/print_two.py

But every other page is flipped, so that you can print double-sided and
the pages will line up properly and be pre-collated.

Graphics stream parsing proof of concept
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The `copy.py`__ script shows a simple example of reading in a PDF, and
using the decodegraphics.py module to try to write the same information
out to a new PDF through a reportlab canvas. (If you know about reportlab,
you know that if you can faithfully render a PDF to a reportlab canvas, you
can do pretty much anything else with that PDF you want.) This kind of
low level manipulation should be done only if you really need to.
decodegraphics is really more of a proof of concept than anything
else. For most cases, just use the Form XObject capability, as shown in
the examples/rl1/booklet.py demo.

__ https://github.com/pmaupin/pdfrw/tree/master/examples/rl2/copy.py

pdfrw philosophy
==================

Core library
-------------

The philosophy of the library portion of pdfrw is to provide intuitive
functions to read, manipulate, and write PDF files.  There should be
minimal leakage between abstraction layers, although getting useful
work done makes "pure" functionality separation difficult.

A key concept supported by the library is the use of Form XObjects,
which allow easy embedding of pieces of one PDF into another.

Addition of core support to the library is typically done carefully
and thoughtfully, so as not to clutter it up with too many special
cases.

There are a lot of incorrectly formatted PDFs floating around; support
for these is added in some cases.  The decision is often based on what
acroread and okular do with the PDFs; if they can display them properly,
then eventually pdfrw should, too, if it is not too difficult or costly.

Contributions are welcome; one user has contributed some decompression
filters and the ability to process PDF 1.5 stream objects.  Additional
functionality that would obviously be useful includes additional
decompression filters, the ability to process password-protected PDFs,
and the ability to output linearized PDFs.

Examples
--------

The philosophy of the examples is to provide small, easily-understood
examples that showcase pdfrw functionality.


PDF files and Python
======================

Introduction
------------

In general, PDF files conceptually map quite well to Python. The major
objects to think about are:

-  **strings**. Most things are strings. These also often decompose
   naturally into
-  **lists of tokens**. Tokens can be combined to create higher-level
   objects like
-  **arrays** and
-  **dictionaries** and
-  **Contents streams** (which can be more streams of tokens)

Difficulties
------------

The apparent primary difficulty in mapping PDF files to Python is the
PDF file concept of "indirect objects."  Indirect objects provide
the efficiency of allowing a single piece of data to be referred to
from more than one containing object, but probably more importantly,
indirect objects provide a way to get around the chicken and egg
problem of circular object references when mapping arbitrary data
structures to files. To flatten out a circular reference, an indirect
object is *referred to* instead of being *directly included* in another
object. PDF files have a global mechanism for locating indirect objects,
and each indirect object is identified by two numbers (an object number
and a "generation" number, the latter in case you wanted to append to
the PDF file rather than just rewriting the whole thing).

pdfrw automatically handles indirect references on reading in a PDF
file. When pdfrw encounters an indirect PDF file object, the
corresponding Python object it creates will have an 'indirect' attribute
with a value of True. When writing a PDF file, if you have created
arbitrary data, you just need to make sure that circular references are
broken up by putting an attribute named 'indirect' which evaluates to
True on at least one object in every cycle.
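
To see why one indirect object per cycle is enough, here is a small
standard-library sketch of the idea.  It is not how pdfrw's writer is
implemented; ``Node``, ``serialize`` and the helper names are made up
for illustration, and the generation number is simply assumed to be 0::

    class Node(dict):
        """A dictionary that can be flagged for storage as an indirect object."""
        indirect = False

    def serialize(root):
        """Return (text for root, {object number: body}) for a graph of Nodes."""
        numbers = {}   # id(node) -> object number
        bodies = {}    # object number -> serialized dictionary

        def fmt(obj):
            if isinstance(obj, list):
                return '[%s]' % ' '.join(fmt(item) for item in obj)
            if isinstance(obj, Node):
                return reference(obj) if obj.indirect else inline(obj)
            return str(obj)

        def inline(node):
            return '<< %s >>' % ' '.join(
                '%s %s' % (key, fmt(value)) for key, value in node.items())

        def reference(node):
            if id(node) not in numbers:
                # Assign the number before descending, so that a cycle
                # leading back here finds it and stops recursing.
                number = numbers[id(node)] = len(numbers) + 1
                bodies[number] = inline(node)
            return '%d 0 R' % numbers[id(node)]

        return fmt(root), bodies

    pages = Node({'/Type': '/Pages'})
    pages.indirect = True
    page = Node({'/Type': '/Page', '/Parent': pages})
    page.indirect = True
    pages['/Kids'] = [page]        # page -> pages -> page: a cycle

    top, bodies = serialize(pages)

Here ``top`` is ``1 0 R``, and the two bodies refer to each other by
number.  If neither object were marked indirect, ``inline`` would keep
recursing until Python gave up.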

Another PDF file concept that doesn't quite map to regular Python is a
"stream". Streams are dictionaries which each have an associated
unformatted data block. pdfrw handles streams by placing a special
attribute on a subclassed dictionary.

Usage Model
-----------

The usage model for pdfrw treats most objects as strings (it takes their
string representation when writing them to a file). The two main
exceptions are the PdfArray object and the PdfDict object.

PdfArray is a subclass of list with two special features.  First,
an 'indirect' attribute allows a PdfArray to be written out as
an indirect PDF object.  Second, pdfrw reads files lazily, so
PdfArray knows about, and resolves references to other indirect
objects on an as-needed basis.

PdfDict is a subclass of dict that has the same two features: an
indirect attribute and lazy reference resolution.  (And the subclassed
IndirectPdfDict has indirect automatically set True).

But PdfDict also has an optional associated stream. The stream object
defaults to None, but if you assign a stream to the dict, it will
automatically set the PDF /Length attribute for the dictionary.

Finally, since PdfDict instances are indexed by PdfName objects (which
always start with a /) and since most (all?) standard Adobe PdfName
objects use names formatted like "/CamelCase", it makes sense to allow
access to dictionary elements via object attribute accesses as well as
object index accesses. So usage of PdfDict objects is normally via
attribute access, although non-standard names (though still with a
leading slash) can be accessed via dictionary index lookup.
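
A toy version of this access pattern, written with nothing but the
standard library, might look like the following.  ``AttrDict`` is a
made-up name used only for this sketch, and the real PdfDict does
considerably more (key checking, proxies, streams)::

    class AttrDict(dict):
        """Toy dictionary keyed by '/Name' strings, with attribute access."""

        def __getattr__(self, name):
            # Only called when normal attribute lookup fails.
            # Missing keys give None rather than raising an exception.
            return self.get('/' + name)

        def __setattr__(self, name, value):
            self['/' + name] = value

    page = AttrDict({'/Type': '/Page'})
    page.Rotate = 90           # same as page['/Rotate'] = 90
    page['/My-Key'] = 1        # not a valid identifier, so use indexing

Reading ``page.Type`` gives ``'/Page'``, and reading a key that was
never set, such as ``page.MediaBox``, gives None.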

Reading PDFs
~~~~~~~~~~~~~~~

The PdfReader object is a subclass of PdfDict, which allows easy access
to an entire document::

    >>> from pdfrw import PdfReader
    >>> x = PdfReader('source.pdf')
    >>> x.keys()
    ['/Info', '/Size', '/Root']
    >>> x.Info
    {'/Producer': '(cairo 1.8.6 (http://cairographics.org))',
     '/Creator': '(cairo 1.8.6 (http://cairographics.org))'}
    >>> x.Root.keys()
    ['/Type', '/Pages']

Info, Size, and Root are retrieved from the trailer of the PDF file.

In addition to the tree structure, pdfrw creates a special attribute
named *pages*, that is a list of all the pages in the document. pdfrw
creates the *pages* attribute as a simplification for the user, because
the PDF format allows arbitrarily complicated nested dictionaries to
describe the page order. Each entry in the *pages* list is the PdfDict
object for one of the pages in the file, in order.

::

    >>> len(x.pages)
    1
    >>> x.pages[0]
    {'/Parent': {'/Kids': [{...}], '/Type': '/Pages', '/Count': '1'},
     '/Contents': {'/Length': '11260', '/Filter': None},
     '/Resources': ... (Lots more stuff snipped)
    >>> x.pages[0].Contents
    {'/Length': '11260', '/Filter': None}
    >>> x.pages[0].Contents.stream
    'q\n1 1 1 rg /a0 gs\n0 0 0 RG 0.657436
      w\n0 J\n0 j\n[] 0.0 d\n4 M q' ... (Lots more stuff snipped)
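
The flattening that turns a nested page tree into a simple list can be
sketched as a short recursive walk.  This uses plain dicts rather than
pdfrw objects, ``flatten_pages`` is a hypothetical helper, and the
``label`` key exists only so the example has something to print::

    def flatten_pages(node):
        """Yield the leaf /Page dictionaries under node, in document order."""
        if node.get('/Type') == '/Pages':
            for kid in node.get('/Kids', []):
                for page in flatten_pages(kid):
                    yield page
        else:
            yield node

    tree = {'/Type': '/Pages', '/Kids': [
        {'/Type': '/Page', 'label': 'first'},
        {'/Type': '/Pages', '/Kids': [
            {'/Type': '/Page', 'label': 'second'},
            {'/Type': '/Page', 'label': 'third'},
        ]},
    ]}
    labels = [page['label'] for page in flatten_pages(tree)]

However deeply the /Pages nodes are nested, the result is a flat list
in reading order.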

Writing PDFs
~~~~~~~~~~~~~~~

As you can see, it is quite easy to dig down into a PDF document. But
what about when it's time to write it out?

::

    >>> from pdfrw import PdfWriter
    >>> y = PdfWriter()
    >>> y.addpage(x.pages[0])
    >>> y.write('result.pdf')

That's all it takes to create a new PDF. You may still need to read the
`Adobe PDF reference manual`__ to figure out what needs to go *into*
the PDF, but at least you don't have to sweat actually building it
and getting the file offsets right.

__ http://www.adobe.com/devnet/acrobat/pdfs/pdf_reference_1-7.pdf
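
To give a feel for the bookkeeping being avoided, here is a rough
sketch of laying out objects and recording their byte offsets in a
cross-reference table.  It is not pdfrw's writer and makes no attempt
to be a complete one; ``assemble`` is a hypothetical helper, object 1
is assumed to be the document catalog, and Python 3.5 or later is
assumed for bytes formatting::

    def assemble(bodies):
        """Join object bodies (bytes) into a file image with an xref table.

        Returns the file contents and the byte offset of each object.
        """
        out = bytearray(b'%PDF-1.3\n')
        offsets = []
        for number, body in enumerate(bodies, 1):
            offsets.append(len(out))          # where "N 0 obj" starts
            out += b'%d 0 obj\n%s\nendobj\n' % (number, body)
        xref_start = len(out)
        out += b'xref\n0 %d\n' % (len(bodies) + 1)
        out += b'0000000000 65535 f \n'       # entry for object 0 (free)
        for offset in offsets:
            out += b'%010d 00000 n \n' % offset
        out += b'trailer\n<< /Size %d /Root 1 0 R >>\n' % (len(bodies) + 1)
        out += b'startxref\n%d\n%%%%EOF\n' % xref_start
        return bytes(out), offsets

    data, offsets = assemble([
        b'<< /Type /Catalog /Pages 2 0 R >>',
        b'<< /Type /Pages /Kids [] /Count 0 >>',
    ])

Every offset has to match the final byte layout exactly, which is why
it is nice to have a library do this part.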

Manipulating PDFs in memory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For the most part, pdfrw tries to be agnostic about the contents of
PDF files, and support them as containers, but to do useful work,
something a little higher-level is required, so pdfrw works to
understand a bit about the contents of the containers.  For example:

-  PDF pages. pdfrw knows enough to find the pages in PDF files you read
   in, and to write a set of pages back out to a new PDF file.
-  Form XObjects. pdfrw can take any page or rectangle on a page, and
   convert it to a Form XObject, suitable for use inside another PDF
   file.  It knows enough about these to perform scaling, rotation,
   and positioning.
-  reportlab objects. pdfrw can recursively create a set of reportlab
   objects from its internal object format. This allows, for example,
   Form XObjects to be used inside reportlab, so that you can reuse
   content from an existing PDF file when building a new PDF with
   reportlab.

There are several examples that demonstrate these features in
the example code directory.

Missing features
~~~~~~~~~~~~~~~~~~~~~~~

Even as a pure PDF container library, pdfrw comes up a bit short. It
does not currently support:

-  Most compression/decompression filters
-  Encryption

`pdftk`__ is a wonderful command-line
tool that can convert your PDFs to remove encryption and compression.
However, in most cases, you can do a lot of useful work with PDFs
without actually removing compression, because only certain elements
inside PDFs are actually compressed.

__ https://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/

Library internals
==================

Introduction
------------

**pdfrw** currently consists of 19 modules organized into a main
package and one sub-package.

The `__init__.py`__ module does the usual thing of importing a few
major attributes from some of the submodules, and the `errors.py`__
module supports logging and exception generation.

__ https://github.com/pmaupin/pdfrw/tree/master/pdfrw/__init__.py
__ https://github.com/pmaupin/pdfrw/tree/master/pdfrw/errors.py


PDF object model support
--------------------------

The `objects`__ sub-package contains one module for each of the
internal representations of the kinds of basic objects that exist
in a PDF file, with the `objects/__init__.py`__ module in that
package simply gathering them up and making them available to the
main pdfrw package.

One feature that all the PDF object classes have in common is the
inclusion of an 'indirect' attribute. If 'indirect' exists and evaluates
to True, then when the object is written out, it is written out as an
indirect object. That is to say, it is addressable in the PDF file, and
could be referenced by any number (including zero) of container objects.
This indirect object capability saves space in PDF files by allowing
objects such as fonts to be referenced from multiple pages, and also
allows PDF files to contain internal circular references.  This latter
capability is used, for example, when each page object has a "parent"
object in its dictionary.

__ https://github.com/pmaupin/pdfrw/tree/master/pdfrw/objects/
__ https://github.com/pmaupin/pdfrw/tree/master/pdfrw/objects/__init__.py

Ordinary objects
~~~~~~~~~~~~~~~~

The `objects/pdfobject.py`__ module contains the PdfObject class, which is
a subclass of str, and is the catch-all object for any PDF file elements
that are not explicitly represented by other objects, as described below.

__ https://github.com/pmaupin/pdfrw/tree/master/pdfrw/objects/pdfobject.py

Name objects
~~~~~~~~~~~~

The `objects/pdfname.py`__ module contains the PdfName singleton object,
which will convert a string into a PDF name by prepending a slash. It can
be used either by calling it or getting an attribute, e.g.::

    PdfName.Rotate == PdfName('Rotate') == PdfObject('/Rotate')

In the example above, there is a slight difference between the objects
returned from PdfName, and the object returned from PdfObject.  The
PdfName objects are actually objects of class "BasePdfName".  This
is important, because only these may be used as keys in PdfDict objects.

__ https://github.com/pmaupin/pdfrw/tree/master/pdfrw/objects/pdfname.py
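
The call-or-attribute pattern is easy to imitate in plain Python.  The
sketch below is an illustration only, not the contents of pdfname.py;
``BaseName``, ``NameFactory`` and ``Name`` are stand-in names chosen
for this example::

    class BaseName(str):
        """A string that is known to be a PDF name, such as '/Rotate'."""

    class NameFactory(object):
        """Build names either by call or by attribute access."""

        def __call__(self, name):
            return BaseName('/' + name)

        def __getattr__(self, name):
            return self(name)

    Name = NameFactory()   # single shared instance

    rotate = Name.Rotate

The plain string ``'/Rotate'`` compares equal to ``rotate``, but only
``rotate`` is an instance of ``BaseName``, so a container that wants
to accept nothing but real names can tell the two apart with
``isinstance``.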

String objects
~~~~~~~~~~~~~~

The `objects/pdfstring.py`__
module contains the PdfString class, which is a subclass of str that is
used to represent encoded strings in a PDF file. The class has encode
and decode methods for the strings.

__ https://github.com/pmaupin/pdfrw/tree/master/pdfrw/objects/pdfstring.py


Array objects
~~~~~~~~~~~~~

The `objects/pdfarray.py`__
module contains the PdfArray class, which is a subclass of list that is
used to represent arrays in a PDF file. A regular list could be used
instead, but use of the PdfArray class allows for an indirect attribute
to be set, and also allows for proxying of unresolved indirect objects
(that haven't been read in yet) in a manner that is transparent to pdfrw
clients.

__ https://github.com/pmaupin/pdfrw/tree/master/pdfrw/objects/pdfarray.py

Dict objects
~~~~~~~~~~~~

The `objects/pdfdict.py`__
module contains the PdfDict class, which is a subclass of dict that is
used to represent dictionaries in a PDF file. A regular dict could be
used instead, but the PdfDict class matches the requirements of PDF
files more closely:

* Transparent (from the library client's viewpoint) proxying
  of unresolved indirect objects
* Return of None for non-existent keys (like dict.get)
* Mapping of attribute accesses to the dict itself
  (pdfdict.Foo == pdfdict[PdfName('Foo')])
* Automatic management of following stream and /Length attributes
  for content dictionaries
* Indirect attribute
* Other attributes may be set for private internal use of the
  library and/or its clients.
* Support for searching parent dictionaries for PDF "inheritable"
  attributes.

__ https://github.com/pmaupin/pdfrw/tree/master/pdfrw/objects/pdfdict.py

If a PdfDict has an associated data stream in the PDF file, the stream
is accessed via the 'stream' (all lower-case) attribute.  Setting the
stream attribute on the PdfDict will automatically set the /Length attribute
as well.  If that is not what is desired (for example if the stream
is compressed), then _stream (same name with an underscore) may be used
to associate the stream with the PdfDict without setting the length.
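
The difference between the two attributes can be mimicked with a
property.  This is a minimal sketch of the behaviour described above,
not the PdfDict implementation; ``StreamDict`` is a hypothetical class
and the stream contents are arbitrary::

    class StreamDict(dict):
        """Toy dictionary with an attached data stream."""
        _stream = None

        @property
        def stream(self):
            return self._stream

        @stream.setter
        def stream(self, data):
            self._stream = data
            self['/Length'] = str(len(data))   # keep /Length in step

    contents = StreamDict()
    contents.stream = 'q 1 0 0 1 72 72 cm Q'   # /Length is filled in

    # The caller has already supplied /Length, so bypass the setter.
    packed = StreamDict({'/Filter': '/FlateDecode', '/Length': '23'})
    packed._stream = 'already compressed data'

Reading ``packed.stream`` still returns the data; only the automatic
/Length update is skipped.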

To set private attributes (that will not be written out to a new PDF
file) on a dictionary, use the 'private' attribute::

    mydict.private.foo = 1

Once the attribute is set, it may be accessed directly as an attribute
of the dictionary::

    foo = mydict.foo

Some attributes of PDF pages are "inheritable."  That is, they may
belong to a parent dictionary (or a parent of a parent dictionary, etc.)
The "inheritable" attribute allows for easy discovery of these::

    mediabox = mypage.inheritable.MediaBox
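
The lookup behind an inheritable attribute amounts to walking up the
/Parent chain until the key turns up.  A sketch with plain dicts, where
``inherited`` is a hypothetical function rather than part of pdfrw::

    def inherited(page, key):
        """Return page[key], searching /Parent dictionaries when it is absent."""
        node, seen = page, set()
        while node is not None and id(node) not in seen:
            if key in node:
                return node[key]
            seen.add(id(node))          # guard against malformed parent loops
            node = node.get('/Parent')
        return None

    root = {'/Type': '/Pages', '/MediaBox': [0, 0, 612, 792]}
    middle = {'/Type': '/Pages', '/Parent': root, '/Rotate': 90}
    leaf = {'/Type': '/Page', '/Parent': middle}

    mediabox = inherited(leaf, '/MediaBox')

The page itself has no /MediaBox here, so the value comes from the
root of the tree; a key that nobody defines comes back as None.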


Proxy objects
~~~~~~~~~~~~~

The `objects/pdfindirect.py`__
module contains the PdfIndirect class, which is a non-transparent proxy
object for PDF objects that have not yet been read in and resolved from
a file. Although these are non-transparent inside the library, client code
should never see one of these -- they exist inside the PdfArray and PdfDict
container types, but are resolved before being returned to a client of
those types.

__ https://github.com/pmaupin/pdfrw/tree/master/pdfrw/objects/pdfindirect.py
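
The general shape of this arrangement, a container that swaps
placeholders for real objects on first access, can be sketched as
follows.  The classes are illustrative stand-ins, not pdfrw's, and the
sketch only handles integer indexing (not slices or iteration)::

    class Proxy(object):
        """Stand-in for an object that has not been read from the file yet."""

        def __init__(self, loader):
            self.loader = loader

        def real_value(self):
            return self.loader()

    class LazyList(list):
        """List that resolves Proxy items on integer indexing only."""

        def __getitem__(self, index):
            value = list.__getitem__(self, index)
            if isinstance(value, Proxy):
                value = value.real_value()
                list.__setitem__(self, index, value)   # remember the result
            return value

    loads = []

    def load_page():
        loads.append('read')          # stands in for parsing from disk
        return {'/Type': '/Page'}

    kids = LazyList([Proxy(load_page), 'plain value'])
    first = kids[0]
    again = kids[0]

The loader runs once, on the first access, and the client only ever
sees the resolved dictionary.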


File reading, tokenization and parsing
--------------------------------------

`pdfreader.py`__
contains the PdfReader class, which can read a PDF file (or be passed a
file object or already-read string) and parse it. It uses the PdfTokens
class in `tokens.py`__  for low-level tokenization.

__ https://github.com/pmaupin/pdfrw/tree/master/pdfrw/pdfreader.py
__ https://github.com/pmaupin/pdfrw/tree/master/pdfrw/tokens.py
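
For readers who have not looked at PDF syntax, low-level tokenization
means roughly the following.  This regular expression is a simplified
sketch written for this explanation, not the one in tokens.py: it
ignores nested parentheses in strings, skips anything it does not
recognize, and works on text rather than bytes::

    import re

    TOKEN_RE = re.compile(r'''
        %[^\r\n]*                  # comment (dropped by tokenize)
      | <<|>>|\[|\]                # dictionary and array delimiters
      | \((?:\\.|[^\\()])*\)       # literal string, no nested parentheses
      | <[0-9A-Fa-f\s]*>           # hexadecimal string
      | /[^\s()<>\[\]{}/%]*        # name
      | [^\s()<>\[\]{}/%]+         # number, keyword or operator
    ''', re.VERBOSE)

    def tokenize(text):
        """Split PDF source text into a flat list of token strings."""
        return [tok for tok in TOKEN_RE.findall(text)
                if not tok.startswith('%')]

    sample = r'<< /Type /Page /MediaBox [0 0 612 792] /Title (a\)b) >> % note'
    tokens = tokenize(sample)

A parser then groups the flat token list into dictionaries, arrays and
indirect references.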


The PdfReader class does not, in general, parse into containers (e.g.
inside the content streams). There is a proof of concept for doing that
inside the examples/rl2 subdirectory, but that is slow and not well-developed,
and not useful for most applications.

An instance of the PdfReader class is an instance of a PdfDict -- the
trailer dictionary of the PDF file, to be exact.  It will have a private
attribute set on it that is named 'pages' that is a list containing all
the pages in the file.

When instantiating a PdfReader object, there are options available
for decompressing all the objects in the file.  pdfrw does not currently
have very many options for decompression, so this is not all that useful,
except in the specific case of compressed object streams.

Also, there are no options for decryption yet.  If you have PDF files
that are encrypted or heavily compressed, you may find that using another
program like pdftk on them can make them readable by pdfrw.

In general, the objects are read from the file lazily, but this is not
currently true with compressed object streams -- all of these are decompressed
and read in when the PdfReader is instantiated.


File output
-----------

`pdfwriter.py`__
contains the PdfWriter class, which can create and output a PDF file.

__ https://github.com/pmaupin/pdfrw/tree/master/pdfrw/pdfwriter.py

There are a few options available when creating and using this class.

In the simplest case, an instance of PdfWriter is instantiated, and
then pages are added to it from one or more source files (or created
programmatically), and then the write method is called to dump the
results out to a file.

If you have a source PDF and do not want to disturb the structure
of it too badly, then you may pass its trailer directly to PdfWriter
rather than letting PdfWriter construct one for you.  There is an
example of this (alter.py) in the examples directory.


Advanced features
-----------------

`buildxobj.py`__
contains functions to build Form XObjects out of pages or rectangles on
pages.  These may be reused in new PDFs essentially as if they were images.

buildxobj is careful to cache any page used so that it only appears in
the output once.

__ https://github.com/pmaupin/pdfrw/tree/master/pdfrw/buildxobj.py


`toreportlab.py`__
provides the makerl function, which will translate pdfrw objects into a
format which can be used with `reportlab <http://www.reportlab.org/>`__.
It is normally used in conjunction with buildxobj, to be able to reuse
parts of existing PDFs when using reportlab.

__ https://github.com/pmaupin/pdfrw/tree/master/pdfrw/toreportlab.py


`pagemerge.py`__ builds on the foundation laid by buildxobj.  It
contains classes to create a new page (or overlay an existing page)
using one or more rectangles from other pages.  There are examples
showing its use for watermarking, scaling, 4-up output, splitting
each page in 2, etc.

__ https://github.com/pmaupin/pdfrw/tree/master/pdfrw/pagemerge.py
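
The geometry behind something like 4-up output is simple enough to
work out by hand.  The sketch below only computes scale factors and
offsets; it does not use pagemerge, ``four_up`` is a hypothetical
helper, and it assumes that the output sheet is the same size as the
input pages::

    def four_up(width, height):
        """Return (scale, x, y) for four pages on one sheet of the same size.

        Placements run left to right, top to bottom; the PDF origin is the
        bottom-left corner, so the top row has the larger y offset.
        """
        scale = 0.5
        return [(scale, column * width * scale, row * height * scale)
                for row in (1, 0) for column in (0, 1)]

    letter = four_up(612, 792)

For letter-sized pages (612 x 792 points) the first page lands at
(0, 396) and the last at (306, 0), each at half size.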

`findobjs.py`__ contains code that can find specific kinds of objects
inside a PDF file.  The extract.py example uses this module to create
a new PDF that places each image and Form XObject from a source PDF onto
its own page, e.g. for easy reuse with some of the other examples or
with reportlab.

__ https://github.com/pmaupin/pdfrw/tree/master/pdfrw/findobjs.py


Miscellaneous
----------------

`compress.py`__ and `uncompress.py`__
contain compression and decompression functions. Very few filters are
currently supported, so an external tool like pdftk might be good if you
require the ability to decompress (or, for that matter, decrypt) PDF
files.

__ https://github.com/pmaupin/pdfrw/tree/master/pdfrw/compress.py
__ https://github.com/pmaupin/pdfrw/tree/master/pdfrw/uncompress.py


`py23_diffs.py`__ contains code to help manage the differences between
Python 2 and Python 3.

__ https://github.com/pmaupin/pdfrw/tree/master/pdfrw/py23_diffs.py

Testing
===============

The tests associated with pdfrw require a large number of PDFs,
which are not distributed with the library.

To run the tests:

* Download or clone the full package from github.com/pmaupin/pdfrw
* cd into the tests directory, and then clone the package
  github.com/pmaupin/static_pdfs into a subdirectory (also named
  static_pdfs).
* Now the tests may be run from that directory using unittest, or
  py.test, or nose.
* Travis CI is used at GitHub, and runs the tests with py.test

Other libraries
=====================

Pure Python
-----------

-  `reportlab <http://www.reportlab.org/>`__

    reportlab is must-have software if you want to programmatically
    generate arbitrary PDFs.

-  `pyPdf <https://github.com/mstamy2/PyPDF2>`__

    pyPdf is, in some ways, very full-featured. It can do decompression
    and decryption and seems to know a lot about items inside at least
    some kinds of PDF files. In comparison, pdfrw knows less about
    specific PDF file features (such as metadata), but focuses on trying
    to have a more Pythonic API for mapping the PDF file container
    syntax to Python, and (IMO) has a simpler and better PDF file
    parser.  The Form XObject capability of pdfrw means that, in many
    cases, it does not actually need to decompress objects -- they
    can be left compressed.

-  `pdftools <http://www.boddie.org.uk/david/Projects/Python/pdftools/index.html>`__

    pdftools feels large and I fell asleep trying to figure out how it
    all fit together, but many others have done useful things with it.

-  `pagecatcher <http://www.reportlab.com/docs/pagecatcher-ds.pdf>`__

    My understanding is that pagecatcher would have done exactly what I
    wanted when I built pdfrw. But I was on a zero budget, so I've never
    had the pleasure of experiencing pagecatcher. I do, however, use and
    like `reportlab <http://www.reportlab.org/>`__ (open source, from
    the people who make pagecatcher) so I'm sure pagecatcher is great,
    better documented and much more full-featured than pdfrw.

-  `pdfminer <http://www.unixuser.org/~euske/python/pdfminer/index.html>`__

    This looks like a useful, actively-developed program. It is quite
    large, but then, it is trying to actively comprehend a full PDF
    document. From the website:

    "PDFMiner is a suite of programs that help extracting and analyzing
    text data of PDF documents. Unlike other PDF-related tools, it
    allows to obtain the exact location of texts in a page, as well as
    other extra information such as font information or ruled lines. It
    includes a PDF converter that can transform PDF files into other
    text formats (such as HTML). It has an extensible PDF parser that
    can be used for other purposes instead of text analysis."

non-pure-Python libraries
-------------------------

-  `pyPoppler <https://launchpad.net/poppler-python/>`__ can read PDF
   files.
-  `pycairo <http://www.cairographics.org/pycairo/>`__ can write PDF
   files.
-  `PyMuPDF <https://github.com/rk700/PyMuPDF>`_ offers high-performance
   rendering of PDF, (Open)XPS, CBZ and EPUB files.

Other tools
-----------

-  `pdftk <https://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/>`__ is a wonderful command
   line tool for basic PDF manipulation. It complements pdfrw extremely
   well, supporting many operations such as decryption and decompression
   that pdfrw cannot do.
-  `MuPDF <http://www.mupdf.com/>`_ is a free, high-performance PDF, (Open)XPS,
   CBZ and EPUB rendering library that also comes with some command line
   tools. One of those, ``mutool``, overlaps heavily with pdftk in
   functionality, except that it is up to 10 times faster.

Release information
=======================

Revisions:

0.4 -- Released 18 September, 2017

    - Python 3.6 added to test matrix
    - Proper unicode support for text strings in PDFs added
    - buildxobj fixes allow better support creating form XObjects
      out of compressed pages in some cases
    - Compression fixes for Python 3+
    - New subset_booklets.py example
    - Bug with non-compressed indices into compressed object streams fixed
    - Bug with distinguishing compressed object stream first objects fixed
    - Better error reporting added for some invalid PDFs (e.g. when reading
      past the end of file)
    - Better scrubbing of old bookmark information when writing PDFs, to
      remove dangling references
    - Refactoring of pdfwriter, including updating API, to allow future
      enhancements for things like incremental writing
    - Minor tokenizer speedup
    - Some flate decompressor bugs fixed
    - Compression and decompression tests added
    - Tests for new unicode handling added
    - PdfReader.readpages() recursion error (issue #92) fixed.
    - Initial crypt filter support added


0.3 -- Released 19 October, 2016.

    - Python 3.5 added to test matrix
    - Better support under Python 3.x for in-memory PDF file-like objects
    - Some pagemerge and Unicode patches added
    - Changes to logging allow better coexistence with other packages
    - Fix for "from pdfrw import \*"
    - New fancy_watermark.py example shows off capabilities of pagemerge.py
    - metadata.py example renamed to cat.py


0.2 -- Released 21 June, 2015.  Supports Python 2.6, 2.7, 3.3, and 3.4.

    - Several bugs have been fixed
    - New regression test functionally tests core with dozens of
      PDFs, and also tests examples.
    - Core has been ported and tested on Python3 by round-tripping
      several difficult files and observing binary matching results
      across the different Python versions.
    - Still only minimal support for compression and no support
      for encryption or newer PDF features.  (pdftk is useful
      to put PDFs in a form that pdfrw can use.)

0.1 -- Released to PyPI in 2012.  Supports Python 2.5 - 2.7