=================================
How to use olefile - API overview
=================================

This page is part of the documentation for
`olefile <http://olefile.readthedocs.io/en/latest/>`__. It
explains how to use all its features to parse and write OLE files. For
more information about OLE files, see :doc:`OLE_Overview`.

olefile can be used as an independent module or with PIL/Pillow. The
main functions and methods are explained below.

For more information, see also the :doc:`olefile`, sample code at
the end of the module itself, and docstrings within the code.

Import olefile
--------------

When the :py:mod:`olefile` package has been installed, it can be imported in
Python applications with this statement:

::

    import olefile

As of version 0.30, the code has been changed to be compatible with
Python 3.x. As a consequence, compatibility with Python 2.5 or older is
not provided anymore.

Test if a file is an OLE container
----------------------------------

Use :py:func:`olefile.isOleFile` to check if the first bytes of the file contain the
magic for OLE files, before opening it. isOleFile returns True if it is
an OLE file, False otherwise (new in v0.16).

::

    assert olefile.isOleFile('myfile.doc')

The argument of isOleFile can be (new in v0.41):

- the path of the file to open on disk (bytes or unicode string smaller
  than 1536 bytes),
- or a bytes string containing the file in memory (bytes string longer
  than 1535 bytes),
- or a file-like object (with read and seek methods).
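
The three argument types listed above can be illustrated with a small
standard-library sketch. It is not olefile's implementation: ``sniff_ole``,
``OLE_MAGIC`` and ``MIN_OLE_SIZE`` are hypothetical names chosen for this
example, and the function only mimics the dispatch rule (file-like object,
long bytes string, or path) before comparing the first 8 bytes with the OLE
signature.

```python
import io

# Signature found in the first 8 bytes of an OLE file.
OLE_MAGIC = b'\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1'
# Size threshold described above (constant name chosen for this sketch).
MIN_OLE_SIZE = 1536


def sniff_ole(source):
    """Illustrative stand-in for the argument handling described above."""
    if hasattr(source, 'read') and hasattr(source, 'seek'):
        # File-like object: read the header from the start, then rewind.
        source.seek(0)
        header = source.read(len(OLE_MAGIC))
        source.seek(0)
    elif isinstance(source, bytes) and len(source) >= MIN_OLE_SIZE:
        # Long bytes string: treated as the file content itself.
        header = source[:len(OLE_MAGIC)]
    else:
        # Short string: treated as a path on disk.
        with open(source, 'rb') as handle:
            header = handle.read(len(OLE_MAGIC))
    return header == OLE_MAGIC


# A fake buffer with the right signature and the minimal size.
fake_ole = OLE_MAGIC + b'\x00' * (MIN_OLE_SIZE - len(OLE_MAGIC))
print(sniff_ole(fake_ole))               # in-memory content: True
print(sniff_ole(io.BytesIO(fake_ole)))   # file-like object: True
print(sniff_ole(b'x' * 2000))            # long enough, wrong signature: False
```

In real code, :py:func:`olefile.isOleFile` should be called directly; the
sketch only shows why a short string is interpreted as a path while a long
one is interpreted as file content.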

Open an OLE file from disk
--------------------------

Create an :py:class:`olefile.OleFileIO` object with the file path as parameter:

::

    ole = olefile.OleFileIO('myfile.doc')

Since olefile v0.46, the recommended way to open an OLE file is to use
OleFileIO as a context manager, using the "with" clause:

::

    with olefile.OleFileIO('myfile.doc') as ole:
        # perform all operations on the ole object, for example:
        print(ole.listdir())

This guarantees that the OleFileIO object is closed when exiting
the with block, even if an exception is triggered.
It will call :py:meth:`olefile.OleFileIO.close` automatically.

(new in v0.46)


Open an OLE file from a bytes string
------------------------------------

This is useful if the file is already stored in memory as a bytes
string.

::

    # s is a bytes string holding the whole content of the OLE file
    ole = olefile.OleFileIO(s)


Note: olefile checks the size of the string provided as argument to
determine if it is a file path or the content of an OLE file. An OLE
file cannot be smaller than 1536 bytes. If the string is larger than
1535 bytes, then it is expected to contain an OLE file, otherwise it is
expected to be a file path.

(new in v0.41)

Open an OLE file from a file-like object
----------------------------------------

This is useful if the file is not on disk but only available as a
file-like object (with read, seek and tell methods).

::

    # f is a file-like object opened in binary mode
    ole = olefile.OleFileIO(f)

If the file-like object does not have seek or tell methods, the easiest
solution is to read the file entirely in a bytes string before parsing:

::

    data = f.read()
    ole = olefile.OleFileIO(data)

How to handle malformed OLE files
---------------------------------

By default, the parser is configured to be as robust and permissive as
possible, so that most malformed OLE files can still be parsed. Only fatal
errors will raise an exception.
It is possible to tell the parser to be more
strict in order to raise exceptions for files that do not fully conform
to the OLE specifications, using the ``raise_defects`` option (new in
v0.14):

::

    ole = olefile.OleFileIO('myfile.doc', raise_defects=olefile.DEFECT_INCORRECT)

When the parsing is done, the non-fatal issues detected are
available as a list in the :py:attr:`olefile.OleFileIO.parsing_issues` attribute of the OleFileIO
object (new in 0.25):

::

    print('Non-fatal issues raised during parsing:')
    if ole.parsing_issues:
        for exctype, msg in ole.parsing_issues:
            print('- %s: %s' % (exctype.__name__, msg))
    else:
        print('None')

Open an OLE file in write mode
------------------------------

Before using the write features, the OLE file must be opened in
read/write mode, by using the option ``write_mode=True``:

::

    ole = olefile.OleFileIO('test.doc', write_mode=True)

(new in v0.40)

The code for write features is new and it has not been thoroughly tested
yet. See `issue #6 <https://github.com/decalage2/olefile/issues/6>`__
for the roadmap and the implementation status. If you encounter any
issue, please send me your `feedback <http://www.decalage.info/en/contact>`__
or `report issues <https://github.com/decalage2/olefile/issues>`__.

Syntax for stream and storage paths
-----------------------------------

Two different syntaxes are allowed for methods that need or return the
path of streams and storages:

1) Either a **list of strings** including all the storages from the root
   up to the stream/storage name. For example a stream called
   "WordDocument" at the root will have ``['WordDocument']`` as full path. A
   stream called "ThisDocument" located in the storage "Macros/VBA" will
   be ``['Macros', 'VBA', 'ThisDocument']``. This is the original syntax
   from PIL.
   While hard to read and not very convenient, this syntax
   works in all cases.

2) Or a **single string with slashes** to separate storage and stream
   names (similar to the Unix path syntax). The previous examples would
   be ``'WordDocument'`` and ``'Macros/VBA/ThisDocument'``. This syntax is
   easier, but may fail if a stream or storage name contains a slash
   (which is normally not allowed, according to the Microsoft
   specifications [MS-CFB]). (new in v0.15)

Both are case-insensitive.

Switching between the two is easy:

::

    slash_path = '/'.join(list_path)
    list_path = slash_path.split('/')

**Encoding**:

- Stream and Storage names are stored in Unicode format in OLE files,
  which means they may contain special characters (e.g. Greek,
  Cyrillic, Japanese, etc.) that applications must support to avoid
  exceptions.
- **On Python 2.x**, all stream and storage paths are handled by
  olefile in bytes strings, using the **UTF-8 encoding** by default. If
  you need to use Unicode instead, add the option
  ``path_encoding=None`` when creating the OleFileIO object. This is
  new in v0.42. olefile used the Latin-1 encoding until v0.41,
  therefore special characters were not supported.
- **On Python 3.x**, all stream and storage paths are handled by
  olefile in unicode strings, without encoding.

Get the list of streams
-----------------------

:py:meth:`olefile.OleFileIO.listdir` returns a list of all the streams contained in the OLE file,
including those stored in storages. Each stream is itself given as a
list of strings, as described above.

::

    print(ole.listdir())

Sample result:

::

    [['\x01CompObj'], ['\x05DocumentSummaryInformation'], ['\x05SummaryInformation']
    , ['1Table'], ['Macros', 'PROJECT'], ['Macros', 'PROJECTwm'], ['Macros', 'VBA',
    'Module1'], ['Macros', 'VBA', 'ThisDocument'], ['Macros', 'VBA', '_VBA_PROJECT']
    , ['Macros', 'VBA', 'dir'], ['ObjectPool'], ['WordDocument']]

As an option it is possible to choose if storages should also be listed,
with or without streams (new in v0.26):

::

    ole.listdir(streams=False, storages=True)

Test if known streams/storages exist
------------------------------------

:py:meth:`olefile.OleFileIO.exists` checks if a given stream or storage exists in the OLE file
(new in v0.16). The provided path is case-insensitive.

::

    if ole.exists('worddocument'):
        print("This is a Word document.")
        if ole.exists('macros/vba'):
            print("This document seems to contain VBA macros.")

Read data from a stream
-----------------------

:py:meth:`olefile.OleFileIO.openstream` opens a stream as a file-like object. The provided path
is case-insensitive.

The following example extracts the "Pictures" stream from a PPT file:

::

    pics = ole.openstream('Pictures')
    data = pics.read()

Get information about a stream/storage
--------------------------------------

Several methods can provide the size, type and timestamps of a given
stream/storage:

:py:meth:`olefile.OleFileIO.get_size` returns the size of a stream in bytes (new in v0.16):

::

    s = ole.get_size('WordDocument')

:py:meth:`olefile.OleFileIO.get_type` returns the type of a stream/storage, as one of the
following constants: :py:data:`olefile.STGTY_STREAM` for a stream, :py:data:`olefile.STGTY_STORAGE` for a
storage, :py:data:`olefile.STGTY_ROOT` for the root entry, and ``False`` for a nonexistent
path (new in v0.15).

::

    t = ole.get_type('WordDocument')

:py:meth:`olefile.OleFileIO.getctime` and :py:meth:`olefile.OleFileIO.getmtime` return the creation and
modification timestamps of a stream/storage, as a Python datetime object
with UTC timezone. Please note that these timestamps are only present if
the application that created the OLE file explicitly stored them, which
is rarely the case. When not present, these methods return None (new in
v0.26).

::

    c = ole.getctime('WordDocument')
    m = ole.getmtime('WordDocument')

The root storage is a special case: You can get its creation and
modification timestamps using the OleFileIO.root attribute (new in
v0.26):

::

    c = ole.root.getctime()
    m = ole.root.getmtime()

Note: all these methods are case-insensitive.

Overwriting a sector
--------------------

The :py:meth:`olefile.OleFileIO.write_sect` method can overwrite any sector of the file. If the
provided data is smaller than the sector size (normally 512 bytes,
sometimes 4KB), data is padded with null characters. (new in v0.40)

Here is an example:

::

    ole.write_sect(0x17, b'TEST')

Note: following the `MS-CFB
specifications <http://msdn.microsoft.com/en-us/library/dd942138.aspx>`__,
sector 0 is actually the second sector of the file. You may use -1 as
index to write the first sector.

Overwriting a stream
--------------------

The :py:meth:`olefile.OleFileIO.write_stream` method can overwrite an existing stream in the file.
The new stream data must be the exact same size as the existing one. Since v0.45,
this method can write streams of any size (stored in the main FAT or the MiniFAT).
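
Because the new data must have exactly the same size as the existing stream,
it can help to validate any substitution before writing. The sketch below
works on plain bytes strings only and does not call olefile;
``replace_same_size`` is a hypothetical helper introduced for this example,
not part of the olefile API.

```python
def replace_same_size(data, old, new):
    """Replace `old` with `new` in `data`, refusing any change of length."""
    if len(old) != len(new):
        raise ValueError('replacement must be %d bytes long, got %d'
                         % (len(old), len(new)))
    return data.replace(old, new)


# Stand-in for the bytes read from an existing stream.
stream = b'header foo trailer'
patched = replace_same_size(stream, b'foo', b'bar')
print(len(patched) == len(stream))

# A replacement of a different length is rejected before anything is written.
try:
    replace_same_size(stream, b'foo', b'quux')
except ValueError as error:
    print('rejected:', error)
```

The value returned by such a helper could then be passed to ``write_stream``
in place of the original data.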

For example, you may change text in a MS Word document:

::

    ole = olefile.OleFileIO('test.doc', write_mode=True)
    data = ole.openstream('WordDocument').read()
    data = data.replace(b'foo', b'bar')
    ole.write_stream('WordDocument', data)
    ole.close()

(new in v0.40)

Extract metadata
----------------

:py:meth:`olefile.OleFileIO.get_metadata` will check if standard property streams exist, parse all
the properties they contain, and return an :py:class:`olefile.OleFileIO.OleMetadata` object with the
found properties as attributes (new in v0.24).

::

    meta = ole.get_metadata()
    print('Author:', meta.author)
    print('Title:', meta.title)
    print('Creation date:', meta.create_time)
    # print all metadata:
    meta.dump()

Available attributes include:

::

    codepage, title, subject, author, keywords, comments, template,
    last_saved_by, revision_number, total_edit_time, last_printed, create_time,
    last_saved_time, num_pages, num_words, num_chars, thumbnail,
    creating_application, security, codepage_doc, category, presentation_target,
    bytes, lines, paragraphs, slides, notes, hidden_slides, mm_clips,
    scale_crop, heading_pairs, titles_of_parts, manager, company, links_dirty,
    chars_with_spaces, unused, shared_doc, link_base, hlinks, hlinks_changed,
    version, dig_sig, content_type, content_status, language, doc_version

See the source code of the :py:class:`olefile.OleFileIO.OleMetadata` class for more information.

Parse a property stream
-----------------------

:py:meth:`olefile.OleFileIO.getproperties` can be used to parse any property stream that is
not handled by get_metadata. It returns a dictionary indexed by
integers. Each integer is the index of the property, pointing to its
value. For example in the standard property stream
``'\x05SummaryInformation'``, the document title is property
#2, and the subject is #3.

::

    p = ole.getproperties('specialprops')

By default, as in the original PIL version, timestamp properties are
converted into a number of seconds since January 1, 1601. With the option
``convert_time``, you can obtain more convenient Python datetime objects
(UTC timezone). If some time properties should not be converted (such as
total editing time in ``'\x05SummaryInformation'``), the list
of indexes can be passed as ``no_conversion`` (new in v0.25):

::

    p = ole.getproperties('specialprops', convert_time=True, no_conversion=[10])

Close the OLE file
------------------

Unless your application is a simple script that terminates after
processing an OLE file, do not forget to close each OleFileIO object
after parsing to close the file on disk. (new in v0.22)

::

    ole.close()


Enable logging
--------------

See :py:func:`olefile.enable_logging`.

Use olefile as a script for testing/debugging
---------------------------------------------

olefile can also be used as a script from the command-line to display
the structure of an OLE file and its metadata, for example:

::

    olefile.py myfile.doc

You can use the option ``-c`` to check that all streams can be read fully,
and ``-d`` to generate very verbose debugging information.

You may also add the option ``-l debug`` to display debugging messages
(very verbose).