.. testsetup::

    import sys
    from pdfminer.high_level import extract_text_to_fp, extract_text

.. _tutorial_highlevel:

Extract text from a PDF using Python
************************************

The high-level API covers common tasks such as extracting text from a PDF.

The simplest way to extract text from a PDF is to use
:ref:`api_extract_text`:

.. doctest::

    >>> text = extract_text('samples/simple1.pdf')
    >>> print(repr(text))
    'Hello \n\nWorld\n\nHello \n\nWorld\n\nH e l l o  \n\nW o r l d\n\nH e l l o  \n\nW o r l d\n\n\x0c'
    >>> print(text)
    ... # doctest: +NORMALIZE_WHITESPACE
    Hello
    <BLANKLINE>
    World
    <BLANKLINE>
    Hello
    <BLANKLINE>
    World
    <BLANKLINE>
    H e l l o
    <BLANKLINE>
    W o r l d
    <BLANKLINE>
    H e l l o
    <BLANKLINE>
    W o r l d
    <BLANKLINE>

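The form feed (``\x0c``) that ends the extracted text above can serve as a page
delimiter. The following is a minimal sketch, assuming one form feed is written
after every page; this matches the sample output shown here, but it is worth
checking against your own files. ``split_pages`` is a hypothetical helper and
not part of pdfminer:

.. code-block:: python

    def split_pages(text):
        """Split extracted text into one string per page.

        Assumes each page is terminated by a form feed character (U+000C).
        """
        pages = text.split("\x0c")
        # A trailing form feed leaves an empty final element; drop it.
        if pages and pages[-1] == "":
            pages.pop()
        return pages

    # Shortened stand-in for the text returned by extract_text.
    sample = "Hello \n\nWorld\n\n\x0c"
    pages = split_pages(sample)

With the shortened sample, ``pages`` holds a single string containing the text
of the only page.
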
To read text from a PDF and print it on the command line:

.. doctest::

    >>> if sys.version_info > (3, 0):
    ...     from io import StringIO
    ... else:
    ...     from io import BytesIO as StringIO
    >>> output_string = StringIO()
    >>> with open('samples/simple1.pdf', 'rb') as fin:
    ...     extract_text_to_fp(fin, output_string)
    >>> print(output_string.getvalue().strip())
    Hello WorldHello WorldHello WorldHello World

Or to convert it to HTML and use layout analysis:

.. doctest::

    >>> if sys.version_info > (3, 0):
    ...     from io import StringIO
    ... else:
    ...     from io import BytesIO as StringIO
    >>> from pdfminer.layout import LAParams
    >>> output_string = StringIO()
    >>> with open('samples/simple1.pdf', 'rb') as fin:
    ...     extract_text_to_fp(fin, output_string, laparams=LAParams(),
    ...                        output_type='html', codec=None)

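The HTML example above leaves its result in ``output_string`` without showing
what to do with it. The following is a minimal sketch of saving such a buffer
to disk using only the standard library. ``save_buffer`` and the file name are
hypothetical, and a short literal string stands in for the converter output:

.. code-block:: python

    import tempfile
    from io import StringIO
    from pathlib import Path

    def save_buffer(buffer, path):
        """Write the text held in a StringIO buffer to a UTF-8 file."""
        Path(path).write_text(buffer.getvalue(), encoding="utf-8")

    # Stand-in for the buffer that extract_text_to_fp would fill.
    output_string = StringIO()
    output_string.write("<html><body>Hello World</body></html>")

    # Save into a temporary directory so the sketch leaves no files behind.
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "simple1.html"
        save_buffer(output_string, target)
        saved = target.read_text(encoding="utf-8")

Calling ``getvalue()`` returns the whole buffer regardless of the current
stream position, so no ``seek(0)`` is needed before saving.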