1/** \mainpage notitle 2 3\section Introduction 4 5FreeXL is an open source library to extract valid data from within an Excel 6(.xls) spreadsheet. 7 8The FreeXL design goals are: 9 - to be simple and lightweight 10 - to be stable, robust and efficient 11 - to be easily and universally portable. 12 - completely ignore any GUI-related oddity 13 14Note that the final goal means that FreeXL ignores at all fonts, sizes and 15alignments, and most formats. It ignores Pivot Table, Charts, Formulas, Visual 16Basic macros and so on. 17 18FreeXL is structurally simple and quite light-weight (typically 40-80K of object 19code, stripped). FreeXL has one key dependency - GNU libiconv, which is used for 20character set conversions. This is often provided as part of the C library on 21Linux systems, and is widely available. 22 23Building and installing FreeXL is straightforward: 24\verbatim 25./configure 26make 27make install 28\endverbatim 29 30Linking FreeXL to your own code is usually simple: 31\verbatim 32gcc my_program.c -o my_program -lfreexl 33\endverbatim 34 35On some systems you may have to provide a slightly more complex arrangement: 36\verbatim 37gcc -I/usr/local/include my_program.c -o my_program \ 38 -L/usr/local/lib -lfreexl -liconv -lm 39\endverbatim 40 41FreeXL also provides pkg-config support, so you can also do: 42\verbatim 43gcc -I/usr/local/include my_program.c -o my_program `pkg-config --libs freexl` 44\endverbatim 45 46I sincerely hope FreeXL could be useful to many of you. Excel *.xls spreadsheets 47are widespread, and although Microsoft itself is strongly pushing the new XML 48based formats, there is still a lot of legacy data stored in the older binary 49formats. 50 51So in an era of open data, a simple and easy way to extract data from .xls is 52surely useful. The original use of FreeXL was to support the SQLite / 53SpatiaLite VirtualXL driver (implementing direct access to .xls files via SQL). 54However there are many other possibilities, including use with shell scripts and 55simple wrappers for Python, Perl and other very high level languages. 56 57FreeXL is licensed under the MPL tri-license terms: you are free to choose the 58best-fit license between: 59 - the MPL 1.1 60 - the GPL v2.0 or any subsequent version 61 - the LGPL v2.1 or any subsequent version 62 63Enjoy, and happy coding 64*/ 65 66/** \page Origins The background story for FreeXL 67 68Where, when and why a new free software library was born... 69 70At the end of April 2011 Markus Neteler 71[http://en.wikipedia.org/wiki/Markus_Neteler] and I (Sandro) were in Udine in 72Northern Italy, attending the annual Italian gvSIG Users conference. So, on a 73lovely hot and sunny spring evening, accompanied by a picturesque sunset, we 74were sitting in the town centre in the pleasant Piazza Matteotti aka Piazza San 75Giacomo, peacefully drinking some spritz while eating chips and peanuts. 76 77\image html piazza.jpg 78\image latex piazza.eps "Piazza" width=12cm 79 80You'll have to admit, it was a really dangerous situation: as a general safety 81rule, never let two developers sit idle for too long. Some odd and crazy idea 82will inevitably occur to them (of course, drinking too much spritz can 83contribute as well). 84 85Markus was desperately attempting to convince me that implementing a 86VirtualTable driver for SpatiaLite supporting direct SQL reading ofExcel .xls 87files was a good initiative. Surely it would be useful for many users. I 88strongly resisted, fiercely fighting and rejecting such idea, on the basis that 89attempting to read proprietary closed formats such as Excel was a complete 90nonsense, and probably a very difficult if not impossible task. 91 92At the end of this very animated discussion, Markus pronounced the magic spell 93that suddenly convinced me about the absolute validity of his suggestions: 94<em>I can raise some funding for this project</em> After hearing such wise and 95significant words from Markus I immediately realized that developing a new 96driver for accessing Excel .xls documents surely was an exciting and useful task 97after all. 98 99So FreeXL / VirtualXL was conceived at that exact moment, and slowly 100began its development cycle. 101*/ 102 103/** \page Specs File format specifications and source information 104 105The .xls binary file format is extensively documented and is publicly available. 106The most authoritative source is made available by Microsoft at 107http://msdn.microsoft.com/en-us/library/cc313154%28v=office.12%29.aspx 108 109A simpler option is made available by Open Office: 110http://sc.openoffice.org/excelfileformat.pdf 111 112Searching the web you'll easily find several other valuable information sources. 113*/ 114 115/** \page Format About the .xls binary format 116 117What a .xls binary file really is 118 119(Prepare yourself to be continuously surprised by many unexpected revelations 120...) 121 122You may already know that there are many different versions of .xls files. 123Different versions have different capabilities. So we'll start by 124reviewing the Excel evolutionary history and and we'll introduce some Microsoft 125jargon because it's central to understanding the underlying operations. 126 127\section CFBF CFBF 128 129Unexpected Revelation #1: 130<em>There is no .xls file format. Its really a common file suffix applied to 131many different things.</em> 132 133Recent Microsoft Office document files are based on a common container layout 134named CFBF (Compound File Binary Format). This container format is the same for 135Excel (.xls), Word (.doc_ and PowerPoint (.ppt) amongst other applications. 136More information: 137 - http://en.wikipedia.org/wiki/Compound_File_Binary_Format 138 - http://msdn.microsoft.com/en-us/library/dd942138%28v=prot.13%29.aspx 139 140Unexpected Revelation #2: 141<em>CFBF is more of a file system than a file format</em> 142 143A CFBF file is divided into many equal-sized blocks named sectors. 144Such sectors cannot be directly accessed. In order to retrieve sectors in the 145expected logical order a FAT (File Allocation Table) is allocated within 146the CFBF file. A CFBF file is internally organized as if it was a raw physical 147disk. The design is based on Microsoft own FAT file-system as used by MS-DOS and 148early versions of Windows. The first sector of the CFBF file acts as if it was a 149kind-of MBR (Master Boot Record) - this first sector provides information about 150the layout and type of the CFBF file, such as block/sector size and version. 151A FAT chain allows a reader to re-assemble the sectors in the required logical 152order. There is a list of free block, and very large files may use a 153double indirection (DIFAT - Double Indirection FAT). A CFBF file always has at 154least a root directory: but a complete directory tree can be provided. 155 156A CFBF file can contain many and many distinct independent files. Just to make 157things a little clearer, Microsoft calls such pseudo-files (I mean: the many 158fake ones contained within the real CFBF file) <em>streams</em> 159 160The practical consequence is that any software tool attempting to access an 161Excel binary document must first be able to correctly access this CFBF 162container format. 163 164\section BIFF BIFF 165Unexpected Revelation #3: 166<em>An Excel document will contain a stream (pseudo-file) named Workbook in 167the root directory of the CFBF file (filesystem).</em> 168 169The Workbook stream is internally structured accordingly to the BIFF (Binary 170Interchange File Format) specifications. You can think of the BIFF as the real 171Excel binary format (following more conventional naming rules). Several BIFF 172versions were introduced during the years: and there are significant differences 173between them. 174 175An useful correspondence table relating corresponding Excel and BIFF versions: 176 177<table> 178 <tr> 179 <th> 180 <b>Excel Version</b> 181 </th> 182 <th> 183 <b>Commercial Name</b> 184 </th> 185 <th> 186 <b>BIFF Version</b> 187 </th> 188 <th> 189 <b>Release Year</b> 190 </th> 191 <th> 192 <b>Notes</b> 193 </th> 194 </tr> 195 <tr> 196 <td>2.x</td> 197 <td>Excel 2.0</td> 198 <td>BIFF2</td> 199 <td>1987</td> 200 <td>Before CFBF. File is the BIFF stream, containing a single 201 worksheet.</td> 202 </tr> 203 <tr> 204 <td>3.0</td> 205 <td>Excel 3.0</td> 206 <td>BIFF3</td> 207 <td>1990</td> 208 <td>Before CFBF. File is the BIFF stream, containing a single 209 worksheet.</td> 210 </tr> 211 <tr> 212 <td>4.0</td> 213 <td>Excel 4.0</td> 214 <td>BIFF4</td> 215 <td>1992</td> 216 <td>Before CFBF. File is the BIFF stream, containing a single 217 worksheet.</td> 218 </tr> 219 <tr> 220 <td>5.0</td> 221 <td>Excel 5.0</td> 222 <td>BIFF5</td> 223 <td>1993</td> 224 <td>Starting with BIFF5, a single Workbook can internally store many 225 individual Worksheets. The BIFF stream is stored in the CFBF file 226 container.</td> 227 </tr> 228 <tr> 229 <td>7.0</td> 230 <td>Excel 95</td> 231 <td>BIFF5</td> 232 <td>1995</td> 233 <td></td> 234 </tr> 235 <tr> 236 <td>8.0</td> 237 <td>Excel 98</td> 238 <td>BIFF8</td> 239 <td>1998</td> 240 <td></td> 241 </tr> 242 <tr> 243 <td>9.0</td> 244 <td>Excel 2000</td> 245 <td>BIFF8</td> 246 <td>1999</td> 247 <td></td> 248 </tr> 249 <tr> 250 <td>10.0</td> 251 <td>Excel XP</td> 252 <td>BIFF8</td> 253 <td>2001</td> 254 <td></td> 255 </tr> 256 <tr> 257 <td>11.0</td> 258 <td>Excel 2003</td> 259 <td>BIFF8</td> 260 <td>2003</td> 261 <td></td> 262 </tr> 263 <tr> 264 <td>12.0</td> 265 <td>Excel 2007</td> 266 <td>BIFF8</td> 267 <td>2007</td> 268 <td>Introduced alternate XML format, which is usually the default 269 for new files.</td> 270 </tr> 271 <tr> 272 <td>14.0</td> 273 <td>Excel 2010</td> 274 <td>BIFF8</td> 275 <td>2010</td> 276 <td>XML format is usually the default for new files.</td> 277 </tr> 278</table> 279 280Note that FreeXL does not support the new XML format which is a completely 281different and unrelated format. 282 283Perhaps you are now expecting that BIFF will simply and directly encode your 284spreadsheet data. Unfortunately, you should have know better given the steps 285we took to get here... 286 287Unexpected Revelation #4: 288<em>Any BIFF stream (pseudo-file stored within a CFBF container file) is 289internally organized as a collection of variable-length records.</em>. 290 291Each record starts with 292 - a 16 bit unsigned integer specifying the record type 293 - another 16 bit unsigned integer specifies the record data length (in bytes) 294 excluding the standard type-size prefix. 295 296Note that there are many different record types, and the record size / layout 297may differ for different BIFF versions. 298 299Three record types have an absolutely special meaning: 300 - a BOF [Beginning Of File] record marks starting of a different sub-stream. 301 - an EOF [End Of File] record marks ending of current sub-stream. 302 - a CONTINUE record means that the previous record exceeded the maximum size 303 for a record, and the previous record data payload will be spanned on 304 following CONTINUE records for as many CONTINUE records as are required to 305 store the full data size. 306 307Unexpected Revelation #5: 308<em>So a BIFF stream (pseudo-file) isn't really a file - it's more like a 309collection of individual sub-streams, each one of which is enclosed between BOF 310/ EOF markers.</em> 311 312The most recent BIFF8 requires that at least the following internal 313sub-streams are be defined: 314 - the first sub-stream contains workbook level global data and metadata, such 315 as author, password protection, styles, formats, window settings and so on 316 - list of individual worksheets included into the Workbook, where each 317 worksheet is identified by a name and by a type (data Worksheet, Chart, 318 Visual Basic module ... visible, hidden ...), and relative offset position of 319 the corresponding BOF record allows for fast positioning. 320 - any text string is stored here into the SST [Shared String Table], so 321 individual text cells simply refer the corresponding SST entry by index (in all 322 previous BIFF version text strings are directly stored into the appropriate 323 cell). 324 - any subsequent sub-stream represents a single Worksheet, and the most 325 relevant data stored at Worksheet level are dimension (number of valid rows and 326 columns) and any cell value data. 327 328We will now see how BIFF encodes individual data types with several further 329amazing surprises are still to come. Be prepared! 330 331\section BIFF Supported data types and encoding 332 333Leaving aside special values such as images, OLE, COM, Visual Basic related 334items and so on, the basic data types are supported in BIFF: 335 - text strings 336 - numbers (both integers and decimals) 337 - dates, date-times and times 338 - NULL (empty cell) 339 340Note that any multi-byte value is stored in BIFF accordingly using Little Endian 341byte ordering (i.e. least significant byte comes first, most significant byte 342comes last). 343 344<table> 345 <tr> 346 <th><b>BIFF Record Type</b></th> 347 <th><b>BIFF Version</b></th> 348 <th><b>Content Type</b></th> 349 </tr> 350 <tr> 351 <td>INTEGER</td> 352 <td>BIFF2</td> 353 <td>16 bit unsigned integer</td> 354 </tr> 355 <tr> 356 <td>NUMBER</td> 357 <td>BIFF2 BIFF3 BIFF4 BIFF5 BIFF8</td> 358 <td>64 bit floating point (double precision)</td> 359 </tr> 360 <tr> 361 <td>RK</td> 362 <td>BIFF3 BIFF4 BIFF5 BIFF8</td> 363 <td>number, variant-type: INTEGER FLOAT DATE DATETIME TIME 364 (please see the corresponding detailed description)</td> 365 </tr> 366 <tr> 367 <td>MULRK</td> 368 <td>BIFF5 BIFF8</td> 369 <td>a variable-sized array of elementary RK values. 370 associated to a range of consecutive cells on the same row </td> 371 </tr> 372 <tr> 373 <td>LABEL</td> 374 <td>BIFF2 BIFF3 BIFF4 BIFF5 BIFF8</td> 375 <td>text string, variable-length. 376 (please see the corresponding detailed description)</td> 377 </tr> 378 <tr> 379 <td>LABELSST</td> 380 <td>BIFF8</td> 381 <td>text string, variable-length. based on the global SST 382 [Shared String Table] stored at the workbook level, so a LABELSST simply 383 requires providing the corresponding SST index.</td> 384 </tr> 385</table> 386 387So the BIFF record type that is easy to handle is NUMBER, which is essentially 388a C-style <em>double</em>. Other record types require additional handling. 389 390\subsection RK RK values 391 392An RK value is a 32 bit value. 393 394The least significant two bits are a bit-mask (in little endian order, so the 395least two significant bits in the first byte that is read): 396 - if 0x02 is set the RK value represents a 30 bit signed integer, otherwise it 397 represents a 64 bit floating point double precision number requiring special 398 reconstruction. 399 - if 0x01 is set the corresponding value needs to be divided by 100, so even an 400 integer actually becomes a floating point double precision. 401 402When interpreting RK values as a signed integer, right shifting two bits is 403required: 404\code 405int value = rk_value >> 2; 406\endcode 407 408When interpreting RK values as a 64 bit floating point, two steps are required: 409 - the RK value requires appropriate masking: 410\code 411int value = rk_value & 0xfffffffc; 412\endcode 413 - then the 32 bit value will be copied into a 64 bit buffer, and the least 414 significant four bytes need to be initialized as zeroes: 0x00000000. 415 416As a final step, if 0x01 was set into the bit-mask, now we have to divide by 417100 before returning the effective cell value. So for 32 bit integers: 418\code 419double final_value = (double)value / 100.0; 420\endcode 421and for 64 bit floats: 422\code 423double final_value = value / 100.0; 424\endcode 425 426\subsection Text Text values 427 428Any BIFF version from BIFF2 to BIFF5 simply supports CodePage based character 429encoding, i.e. each character simply requires 8 bits to be represented (single 430byte). Correct representation of characters requires knowing which one CodePage 431table has to be applied. This can be determined from the workbook or 432worksheet metadata (it is the CODEPAGE record). 433 434BIFF8 is much more sophisticated, since any text 435string is usually encoded as Unicode in UTF-16 Little Endian [UTF-16LE] format. 436This encoding is a multi-byte encoding (two bytes are required to represent a 437single character), but being universal no character table is required. 438 439BIFF text strings are never null-terminated. The actual length is always 440explicitly stated, as an 8 bit unsigned int or as a 16 bit unsigned int 441(depending on BIFF versions). 442 443FreeXL is intended to be strictly interoperable with SQLite and SpatiaLite, so 444any text string has to be converted to UTF-8 encoding. GNU libiconv can easily 445handle any required charset conversion. So we can simply fetch the appropriate 446bytes, then call iconv() as appropriate, and we'll immediately get back the 447corresponding UTF-8 encoded text string. 448 449Converting Unicode based text strings is a little more complex, because 450each Unicode string is prefixed by a mask byte, specifying how the string is 451encoded: 452 - if 0x01 is set, then the string really is 16 bit per character Unicode, 453 otherwise a stripped notation is used instead. Stripped notation means 454 that the characters are actually represented as single bytes, so already 455 have the UTF8 equivalent. 456 - if 0x04 and/or 0x08 are set, than some further variable-length data 457 (providing information on text decoration such as italics, bold, underline) is 458 inserted immediately before and after the text string itself, so we must 459 carefully skip over this extra data so to maintain the right byte alignment. 460 461Note that the string length is expressed in characters, not in bytes, so the 462actual length in bytes is twice the indicated length. 463 464\subsection Dates Retrieving Date, DateTime and Time values. 465 466Dates, DateTimes and Time values are also a little complicated. Any Date is 467expressed as an Integer (number of days since the conventional reference day): 468 - for Windows Excel the reference day (day 0) is 1900, January 1 469 - for Mac Excel the reference day (day 0) is 1904, January 2 470 471There is no possible ambiguity, because the DATEMODE metadata record 472specifies tells which reference day is to be used. 473 474An odd bug affects Excel, which (incorrectly) treats 1900 as a leap year. 475Therefore, the non-existent 29 February 1900 has to be included in the days 476calculation so to get the expected Date value. 477 478Any Time is expressed as a Fraction (percent of seconds since midnight). 4790.5 corresponds to 12:00:00 (midday), 0.25 corresponds to 06:00:00, 0.75 480corresponds to 18:00:00 and so on. 481 482So a DateTime is simply the sum of a Date value and of a Time value. Dates 483can be represented by Integers: but Times and DateTimes require a floating point 484number. 485 486The complication with Dates, DateTimes and Time values is that the data-type 487does not specify when a cell values has to be interpreted as a Date or Time - 488it is simply an Integer or Float numbers like any other. A further indirection 489has to applied so to correctly recognize Dates, DateTimes and Times: 490 - each NUMBER, RK or MULRK value exposes an index referencing the XF 491(Extended Format) entry associated with the corresponding cell. 492 - each XF record specifies an unique combination of font, alignment, color 493 and so on, however a further indirection specifies the corresponding FORMAT 494 entry 495 - each FORMAT record specifies an output format, such as M/D/YY, h:mm:ss AM/PM 496 or M/D/YY h:mm: and this finally gives us a good chance to guess which cell 497 values are intended to represent Date/Time values. 498 499Both XF and FORMAT records are globally stored at the Workbook level, and 500represent ordered arrays. 501 502If you haven't yet given up, if you aren't yet become totally mind-boggled, 503and if you are still awake and conscious, then you now know how .xls files 504are internally organized and structured. 505 506Be happy and feel proud of yourself. 507*/ 508 509/** \page OtherTools Other tools and libraries 510There an impressively wide choice of Free and open source libraries and tools 511supporting the .xls format. A sample: 512 513 - Gnumeric [http://projects.gnome.org/gnumeric/] seems to be the pioneer of 514them all, and probably was the first FLOSS tool able to read and write .xls 515(circa 2001). Part of the Gnome Office project. 516 - KSpread / Caligra Tables [http://www.calligra-suite.org/tables/] / KCells 517[http://www.koffice.org/kcells/] are similar (but now distinct) programs 518from the KOffice and Calligra Office projects. 519 - Open Office Calc [http://www.openoffice.org/product/calc.html] and 520LibreOffice Calc [http://www.libreoffice.org/features/calc/] are similar 521(but now distinct) spreadsheet applications originally from the StartOffice 522code base. Probably the most comprehensive support in FLOSS. 523 - Apache POI-HSSF [http://poi.apache.org/spreadsheet/index.html] is a 524sophisticated Java library fully supporting .xls files. 525 - JExcelAPI [http://jexcelapi.sourceforge.net/] is another Java library 526(much simpler and lighter than POI-HSSF) supporting .xls files. 527 - several C or C++ libraries exist as well: quite curiously one is named 528libxls and another xlslib, but they are two absolutely distinct and unrelated 529packages 530 - There are other implementations are available based on .NET or PHP. 531 532A quick critical review: 533 - GUI tools implementations are difficult to re-use. They focus on import of 534 all formulas, GUI presentation and so on, which is really a different use. 535 - Java libraries seem to be really interesting, but Java is difficult to call 536 from a C or C++ program. 537 - Several C/C++ libraries exist, but none of them seems to be sufficient and 538 stable as required. Some are still marked to be “beta-stage” despite 539 being released some four or five years ago - project activity seems to be 540 very low, and download statistics are discouraging. 541 542Conclusion: a suitable C/C++ library supporting data extraction from .xls files 543doesn't seem to exists: or at least, there is no obvious reference choice. 544 545So we'll go on the hardest way, we'll develop yet another .xls reading 546library: FreeXL. 547*/ 548