=================================
How to use olefile - API overview
=================================

This page is part of the documentation for
`olefile <http://olefile.readthedocs.io/en/latest/>`__. It
explains how to use all its features to parse and write OLE files. For
more information about OLE files, see :doc:`OLE_Overview`.

olefile can be used as an independent module or with PIL/Pillow. The
main functions and methods are explained below.

For more information, see also the :doc:`olefile`, sample code at
the end of the module itself, and docstrings within the code.

Import olefile
--------------

When the :py:mod:`olefile` package has been installed, it can be imported in
Python applications with this statement:

::

    import olefile

As of version 0.30, the code has been changed to be compatible with
Python 3.x. As a consequence, compatibility with Python 2.5 or older is
not provided anymore.

Test if a file is an OLE container
----------------------------------

Use :py:func:`olefile.isOleFile` to check whether the first bytes of a file contain the
magic number for OLE files before opening it. ``isOleFile`` returns ``True`` if it is
an OLE file, ``False`` otherwise (new in v0.16).

::

    assert olefile.isOleFile('myfile.doc')

The argument of ``isOleFile`` can be (new in v0.41):

-  the path of the file to open on disk (bytes or unicode string smaller
   than 1536 bytes),
-  or a bytes string containing the file in memory (bytes string longer
   than 1535 bytes),
-  or a file-like object (with read and seek methods).
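
To illustrate what a magic number check involves, here is a minimal sketch
in plain Python. It is not olefile's actual implementation:
``has_ole_magic`` is a hypothetical helper, and the 8-byte signature is the
one given in the [MS-CFB] specification.

::

    # 8-byte signature found at offset 0 of an OLE file (per MS-CFB).
    OLE_MAGIC = b'\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1'

    def has_ole_magic(data):
        """Return True if the bytes string starts with the OLE signature.

        Hypothetical helper for illustration, not part of olefile.
        """
        return data[:len(OLE_MAGIC)] == OLE_MAGIC

    fake_header = OLE_MAGIC + b'\x00' * 504   # placeholder 512-byte header
    zip_header = b'PK\x03\x04'                # start of a ZIP file, not OLE

    print(has_ole_magic(fake_header))
    print(has_ole_magic(zip_header))

A matching signature only shows that the file claims to be an OLE container.
It does not guarantee that the rest of the file is well formed.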

Open an OLE file from disk
--------------------------

Create an :py:class:`olefile.OleFileIO` object with the file path as parameter:

::

    ole = olefile.OleFileIO('myfile.doc')

Since olefile v0.46, the recommended way to open an OLE file is to use
OleFileIO as a context manager, using the "with" clause:

::

    with olefile.OleFileIO('myfile.doc') as ole:
        # perform all operations on the ole object
        print(ole.listdir())

This guarantees that the OleFileIO object is closed when exiting
the with block, even if an exception is triggered.
It will call :py:meth:`olefile.OleFileIO.close` automatically.

(new in v0.46)

Open an OLE file from a bytes string
------------------------------------

This is useful if the file is already stored in memory as a bytes
string.

::

    ole = olefile.OleFileIO(s)

Note: olefile checks the size of the string provided as argument to
determine if it is a file path or the content of an OLE file. An OLE
file cannot be smaller than 1536 bytes. If the string is larger than
1535 bytes, then it is expected to contain an OLE file, otherwise it is
expected to be a file path.

(new in v0.41)
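
The size rule in the note above can be sketched as follows. This is only an
illustration of the rule as stated, not the code used by olefile:
``classify_argument`` and ``MIN_OLE_SIZE`` are hypothetical names chosen for
this example.

::

    import io

    MIN_OLE_SIZE = 1536  # smallest possible OLE file, as stated above

    def classify_argument(arg):
        """Guess how an argument would be interpreted (hypothetical helper)."""
        if hasattr(arg, 'read'):
            return 'file-like object'
        if isinstance(arg, bytes) and len(arg) >= MIN_OLE_SIZE:
            return 'file content'
        return 'file path'

    print(classify_argument('myfile.doc'))
    print(classify_argument(b'\x00' * 1536))
    print(classify_argument(io.BytesIO(b'')))

One consequence of this rule is that a truncated file held in memory (fewer
than 1536 bytes) would be treated as a path, so it may be worth checking the
length of the data before passing it on.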

Open an OLE file from a file-like object
----------------------------------------

This is useful if the file is not on disk but only available as a
file-like object (with read, seek and tell methods).

::

    ole = olefile.OleFileIO(f)

If the file-like object does not have seek or tell methods, the easiest
solution is to read the file entirely in a bytes string before parsing:

::

    data = f.read()
    ole = olefile.OleFileIO(data)

How to handle malformed OLE files
---------------------------------

By default, the parser is configured to be as robust and permissive as
possible, so that it can parse most malformed OLE files. Only fatal errors
will raise an exception. It is possible to tell the parser to be more
strict in order to raise exceptions for files that do not fully conform
to the OLE specifications, using the ``raise_defects`` option (new in
v0.14):

::

    ole = olefile.OleFileIO('myfile.doc', raise_defects=olefile.DEFECT_INCORRECT)

When the parsing is done, the list of non-fatal issues detected is
available in the :py:attr:`olefile.OleFileIO.parsing_issues` attribute of the OleFileIO
object (new in v0.25):

::

    print('Non-fatal issues raised during parsing:')
    if ole.parsing_issues:
        for exctype, msg in ole.parsing_issues:
            print('- %s: %s' % (exctype.__name__, msg))
    else:
        print('None')

Open an OLE file in write mode
------------------------------

Before using the write features, the OLE file must be opened in
read/write mode, by using the option ``write_mode=True``:

::

    ole = olefile.OleFileIO('test.doc', write_mode=True)

(new in v0.40)

The code for write features is new and it has not been thoroughly tested
yet. See `issue #6 <https://github.com/decalage2/olefile/issues/6>`__
for the roadmap and the implementation status. If you encounter any
issue, please send me your `feedback <http://www.decalage.info/en/contact>`__
or `report issues <https://github.com/decalage2/olefile/issues>`__.

Syntax for stream and storage paths
-----------------------------------

Two different syntaxes are allowed for methods that need or return the
path of streams and storages:

1) Either a **list of strings** including all the storages from the root
   up to the stream/storage name. For example a stream called
   "WordDocument" at the root will have ``['WordDocument']`` as full path. A
   stream called "ThisDocument" located in the storage "Macros/VBA" will
   be ``['Macros', 'VBA', 'ThisDocument']``. This is the original syntax
   from PIL. While hard to read and not very convenient, this syntax
   works in all cases.

2) Or a **single string with slashes** to separate storage and stream
   names (similar to the Unix path syntax). The previous examples would
   be ``'WordDocument'`` and ``'Macros/VBA/ThisDocument'``. This syntax is
   easier, but may fail if a stream or storage name contains a slash
   (which is normally not allowed, according to the Microsoft
   specifications [MS-CFB]). (new in v0.15)

Both are case-insensitive.

Switching between the two is easy:

::

    slash_path = '/'.join(list_path)
    list_path  = slash_path.split('/')
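
Building on the two conversions above, the following sketch shows how an
application might accept either syntax and compare paths without regard to
case. Both helpers are hypothetical and assume Python 3 unicode strings. The
comparison uses ``str.lower``, which only approximates the case-insensitive
matching done by olefile.

::

    def to_list_path(path):
        """Accept either syntax and return the list-of-strings form."""
        if isinstance(path, str):
            return path.split('/')
        return list(path)

    def same_entry(path_a, path_b):
        """Compare two paths ignoring case (approximation using str.lower)."""
        names_a = [name.lower() for name in to_list_path(path_a)]
        names_b = [name.lower() for name in to_list_path(path_b)]
        return names_a == names_b

    print(to_list_path('Macros/VBA/ThisDocument'))
    print(same_entry('macros/vba/thisdocument',
                     ['Macros', 'VBA', 'ThisDocument']))

As noted above, splitting on slashes gives wrong results for the rare names
that contain a slash, so the list form is the safer choice for such files.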

**Encoding**:

-  Stream and Storage names are stored in Unicode format in OLE files,
   which means they may contain special characters (e.g. Greek,
   Cyrillic, Japanese, etc) that applications must support to avoid
   exceptions.
-  **On Python 2.x**, all stream and storage paths are handled by
   olefile in bytes strings, using the **UTF-8 encoding** by default. If
   you need to use Unicode instead, add the option
   ``path_encoding=None`` when creating the OleFileIO object. This is
   new in v0.42. Olefile was using the Latin-1 encoding until v0.41,
   therefore special characters were not supported.
-  **On Python 3.x**, all stream and storage paths are handled by
   olefile in unicode strings, without encoding.

Get the list of streams
-----------------------

:py:meth:`olefile.OleFileIO.listdir` returns a list of all the streams contained in the OLE file,
including those stored in storages. Each stream is listed itself as a
list, as described above.

::

    print(ole.listdir())

Sample result:

::

    [['\x01CompObj'], ['\x05DocumentSummaryInformation'], ['\x05SummaryInformation']
    , ['1Table'], ['Macros', 'PROJECT'], ['Macros', 'PROJECTwm'], ['Macros', 'VBA',
    'Module1'], ['Macros', 'VBA', 'ThisDocument'], ['Macros', 'VBA', '_VBA_PROJECT']
    , ['Macros', 'VBA', 'dir'], ['ObjectPool'], ['WordDocument']]
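
Because each entry is a plain list of names, the result is easy to filter
with ordinary Python. The sketch below reuses the sample result above as
literal data. ``streams_in`` is a hypothetical helper, and its comparison is
case-sensitive, unlike the olefile methods.

::

    # Sample result copied from the listing above.
    entries = [['\x01CompObj'], ['\x05DocumentSummaryInformation'],
               ['\x05SummaryInformation'], ['1Table'], ['Macros', 'PROJECT'],
               ['Macros', 'PROJECTwm'], ['Macros', 'VBA', 'Module1'],
               ['Macros', 'VBA', 'ThisDocument'],
               ['Macros', 'VBA', '_VBA_PROJECT'],
               ['Macros', 'VBA', 'dir'], ['ObjectPool'], ['WordDocument']]

    def streams_in(entries, storage):
        """Return the names of entries located directly in the given storage."""
        depth = len(storage)
        return [entry[-1] for entry in entries
                if len(entry) == depth + 1 and entry[:depth] == storage]

    vba_streams = streams_in(entries, ['Macros', 'VBA'])
    print(vba_streams)

Passing an empty list as the storage selects the entries located at the root.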

As an option it is possible to choose if storages should also be listed,
with or without streams (new in v0.26):

::

    ole.listdir(streams=False, storages=True)

Test if known streams/storages exist:
-------------------------------------

:py:meth:`olefile.OleFileIO.exists` checks if a given stream or storage exists in the OLE file
(new in v0.16). The provided path is case-insensitive.

::

    if ole.exists('worddocument'):
        print("This is a Word document.")
        if ole.exists('macros/vba'):
            print("This document seems to contain VBA macros.")

Read data from a stream
-----------------------

:py:meth:`olefile.OleFileIO.openstream` opens a stream as a file-like object. The provided path
is case-insensitive.

The following example extracts the "Pictures" stream from a PPT file:

::

    pics = ole.openstream('Pictures')
    data = pics.read()

Get information about a stream/storage
--------------------------------------

Several methods can provide the size, type and timestamps of a given
stream/storage:

:py:meth:`olefile.OleFileIO.get_size` returns the size of a stream in bytes (new in v0.16):

::

    s = ole.get_size('WordDocument')

:py:meth:`olefile.OleFileIO.get_type` returns the type of a stream/storage, as one of the
following constants: :py:data:`olefile.STGTY_STREAM` for a stream, :py:data:`olefile.STGTY_STORAGE` for a
storage, :py:data:`olefile.STGTY_ROOT` for the root entry, and ``False`` for a nonexistent
path (new in v0.15).

::

    t = ole.get_type('WordDocument')

:py:meth:`olefile.OleFileIO.getctime` and :py:meth:`olefile.OleFileIO.getmtime` return the creation and
modification timestamps of a stream/storage, as a Python datetime object
with UTC timezone. Please note that these timestamps are only present if
the application that created the OLE file explicitly stored them, which
is rarely the case. When not present, these methods return None (new in
v0.26).

::

    c = ole.getctime('WordDocument')
    m = ole.getmtime('WordDocument')

The root storage is a special case: you can get its creation and
modification timestamps using the ``OleFileIO.root`` attribute (new in
v0.26):

::

    c = ole.root.getctime()
    m = ole.root.getmtime()

Note: all these methods are case-insensitive.

Overwriting a sector
--------------------

The :py:meth:`olefile.OleFileIO.write_sect` method can overwrite any sector of the file. If the
provided data is smaller than the sector size (normally 512 bytes,
sometimes 4KB), data is padded with null characters. (new in v0.40)

Here is an example:

::

    ole.write_sect(0x17, b'TEST')

Note: following the `MS-CFB
specifications <http://msdn.microsoft.com/en-us/library/dd942138.aspx>`__,
sector 0 is actually the second sector of the file. You may use -1 as
index to write the first sector.
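
The sector numbering and the padding described above can be worked through
with a little arithmetic. This is a sketch under the assumption that the
header occupies exactly one sector at the start of the file.
``sector_offset`` and ``pad_sector`` are hypothetical helpers, not olefile
functions.

::

    def sector_offset(sect, sector_size=512):
        """Byte offset of a sector; index -1 is the first sector of the file."""
        return (sect + 1) * sector_size

    def pad_sector(data, sector_size=512):
        """Pad data with null bytes up to the sector size."""
        if len(data) > sector_size:
            raise ValueError('data is larger than one sector')
        return data + b'\x00' * (sector_size - len(data))

    print(sector_offset(-1))            # first sector of the file
    print(sector_offset(0))             # second sector of the file
    print(hex(sector_offset(0x17)))     # sector used in the example above
    print(len(pad_sector(b'TEST')))

With 512-byte sectors, the ``write_sect(0x17, b'TEST')`` call above would
therefore affect the 512 bytes starting at offset 0x3000.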

Overwriting a stream
--------------------

The :py:meth:`olefile.OleFileIO.write_stream` method can overwrite an existing stream in the file.
The new stream data must be exactly the same size as the existing one. Since v0.45,
this works for existing streams of any size, whether they are stored in the
main FAT or the MiniFAT.

For example, you may change text in a MS Word document:

::

    ole = olefile.OleFileIO('test.doc', write_mode=True)
    data = ole.openstream('WordDocument').read()
    data = data.replace(b'foo', b'bar')
    ole.write_stream('WordDocument', data)
    ole.close()

(new in v0.40)
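
Since the new data must have the same size as the existing stream, it can
help to check the length before writing. The following sketch wraps the
``replace`` step of the example above in a hypothetical helper,
``replace_same_size``, which refuses any substitution that changes the
length. It works on plain bytes strings and does not call olefile.

::

    def replace_same_size(data, old, new):
        """Replace old with new, refusing any change of the total length."""
        result = data.replace(old, new)
        if len(result) != len(data):
            raise ValueError('replacement would change the stream size '
                             '(%d -> %d bytes)' % (len(data), len(result)))
        return result

    stream = b'some foo text, more foo'
    patched = replace_same_size(stream, b'foo', b'bar')
    print(patched)

Replacing ``b'foo'`` with a longer or shorter value raises ``ValueError``
here, before anything is written to the file.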

Extract metadata
----------------

:py:meth:`olefile.OleFileIO.get_metadata` will check if standard property streams exist, parse all
the properties they contain, and return an :py:class:`olefile.OleMetadata` object with the
found properties as attributes (new in v0.24).

::

    meta = ole.get_metadata()
    print('Author:', meta.author)
    print('Title:', meta.title)
    print('Creation date:', meta.create_time)
    # print all metadata:
    meta.dump()

Available attributes include:

::

    codepage, title, subject, author, keywords, comments, template,
    last_saved_by, revision_number, total_edit_time, last_printed, create_time,
    last_saved_time, num_pages, num_words, num_chars, thumbnail,
    creating_application, security, codepage_doc, category, presentation_target,
    bytes, lines, paragraphs, slides, notes, hidden_slides, mm_clips,
    scale_crop, heading_pairs, titles_of_parts, manager, company, links_dirty,
    chars_with_spaces, unused, shared_doc, link_base, hlinks, hlinks_changed,
    version, dig_sig, content_type, content_status, language, doc_version

See the source code of the :py:class:`olefile.OleMetadata` class for more information.

Parse a property stream
-----------------------

:py:meth:`olefile.OleFileIO.getproperties` can be used to parse any property stream that is
not handled by get_metadata. It returns a dictionary indexed by
integers. Each integer is the index of the property, pointing to its
value. For example in the standard property stream
``'\x05SummaryInformation'``, the document title is property
#2, and the subject is #3.

::

    p = ole.getproperties('specialprops')
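
A dictionary indexed by integers is easier to read once the indexes are
mapped to names. The sketch below uses a made-up dictionary with the same
shape as the result described above. The values and the
``label_properties`` helper are hypothetical. The three names correspond to
properties 2, 3 and 4 of ``'\x05SummaryInformation'``.

::

    # Names for a few property indexes of '\x05SummaryInformation'.
    SUMMARY_NAMES = {2: 'title', 3: 'subject', 4: 'author'}

    # Hypothetical dictionary, shaped like the result described above.
    props = {2: 'Quarterly report', 3: 'Finance', 4: 'Alice', 0x1234: 42}

    def label_properties(props, names):
        """Return a dict keyed by name, or by hex index when unknown."""
        return {names.get(index, hex(index)): value
                for index, value in props.items()}

    labelled = label_properties(props, SUMMARY_NAMES)
    for name, value in labelled.items():
        print('%s: %r' % (name, value))

Indexes without a known name are kept under their hexadecimal form, so no
property is silently dropped.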

By default, as in the original PIL version, timestamp properties are
converted into a number of seconds since January 1, 1601. With the option
``convert_time``, you can obtain more convenient Python datetime objects
(UTC timezone). If some time properties should not be converted (such as
total editing time in ``'\x05SummaryInformation'``), the list
of indexes can be passed as ``no_conversion`` (new in v0.25):

::

    p = ole.getproperties('specialprops', convert_time=True, no_conversion=[10])
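
For values left as a number of seconds since January 1, 1601, the conversion
to a datetime can also be done by hand. This is a minimal sketch using only
the standard library. ``seconds_since_1601_to_datetime`` is a hypothetical
helper, and olefile's own conversion may differ in details such as timezone
handling.

::

    import datetime

    # Reference date mentioned above: January 1, 1601 (UTC).
    EPOCH_1601 = datetime.datetime(1601, 1, 1, tzinfo=datetime.timezone.utc)

    def seconds_since_1601_to_datetime(seconds):
        """Convert a number of seconds since 1601-01-01 to a UTC datetime."""
        return EPOCH_1601 + datetime.timedelta(seconds=seconds)

    # Number of seconds between 1601-01-01 and the Unix epoch (1970-01-01).
    UNIX_EPOCH_OFFSET = 11644473600

    print(seconds_since_1601_to_datetime(UNIX_EPOCH_OFFSET))

A duration such as the total editing time is not a date, which is why
converting it this way would be meaningless and why ``no_conversion`` exists.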

Close the OLE file
------------------

Unless your application is a simple script that terminates after
processing an OLE file, do not forget to close each OleFileIO object
after parsing to close the file on disk. (new in v0.22)

::

    ole.close()

Enable logging
--------------

See :py:func:`olefile.enable_logging`.

Use olefile as a script for testing/debugging
---------------------------------------------

olefile can also be used as a script from the command-line to display
the structure of an OLE file and its metadata, for example:

::

    olefile.py myfile.doc

You can use the option ``-c`` to check that all streams can be read fully,
and ``-d`` to generate very verbose debugging information.

You may also add the option ``-l debug`` to display debugging messages
(very verbose).