1/** \mainpage notitle
2
3\section Introduction
4
5FreeXL is an open source library to extract valid data from within an Excel
6(.xls) spreadsheet.
7
8The FreeXL design goals are:
9 - to be simple and lightweight
10 - to be stable, robust and efficient
11 - to be easily and universally portable.
12 - completely ignore any GUI-related oddity
13
14Note that the final goal means that FreeXL ignores at all fonts, sizes and
15alignments, and most formats. It ignores Pivot Table, Charts, Formulas, Visual
16Basic macros and so on.
17
18FreeXL is structurally simple and quite light-weight (typically 40-80K of object
19code, stripped). FreeXL has one key dependency - GNU libiconv, which is used for
20character set conversions. This is often provided as part of the C library on
21Linux systems, and is widely available.
22
23Building and installing FreeXL is straightforward:
24\verbatim
25./configure
26make
27make install
28\endverbatim
29
30Linking FreeXL to your own code is usually simple:
31\verbatim
32gcc my_program.c -o my_program -lfreexl
33\endverbatim
34
35On some systems you may have to provide a slightly more complex arrangement:
36\verbatim
37gcc -I/usr/local/include my_program.c -o my_program \
38  -L/usr/local/lib -lfreexl -liconv -lm
39\endverbatim
40
41FreeXL also provides pkg-config support, so you can also do:
42\verbatim
43gcc -I/usr/local/include my_program.c -o my_program `pkg-config --libs freexl`
44\endverbatim
45
46I sincerely hope FreeXL could be useful to many of you. Excel *.xls spreadsheets
47are widespread, and although Microsoft itself is strongly pushing the new XML
48based formats, there is still a lot of legacy data stored in the older binary
49formats.
50
51So in an era of open data, a simple and easy way to extract data from .xls is
52surely useful. The original use of FreeXL was to support the SQLite /
53SpatiaLite VirtualXL driver (implementing direct access to .xls files via SQL).
54However there are many other possibilities, including use with shell scripts and
55simple wrappers for Python, Perl and other very high level languages.
56
57FreeXL is licensed under the MPL tri-license terms: you are free to choose the
58best-fit license between:
59 - the MPL 1.1
60 - the GPL v2.0 or any subsequent version
61 - the LGPL v2.1  or any subsequent version
62
63Enjoy, and happy coding
64*/
65
66/** \page Origins The background story for FreeXL
67
68Where, when and why a new free software library was born...
69
70At the end of April 2011 Markus Neteler
71[http://en.wikipedia.org/wiki/Markus_Neteler] and I (Sandro) were in Udine in
72Northern Italy, attending the annual Italian gvSIG Users conference. So, on a
73lovely hot and sunny spring evening, accompanied by a picturesque sunset, we
74were sitting in the town centre in the pleasant Piazza Matteotti aka Piazza San
75Giacomo, peacefully drinking some spritz while eating chips and peanuts.
76
77\image html piazza.jpg
78\image latex piazza.eps "Piazza" width=12cm
79
80You'll have to admit, it was a really dangerous situation: as a general safety
81rule, never let two developers sit idle for too long. Some odd and crazy idea
82will inevitably occur to them  (of course, drinking too much spritz can
83contribute as well).
84
85Markus was desperately attempting to convince me that implementing a
86VirtualTable driver for SpatiaLite supporting direct SQL reading ofExcel .xls
87files was a good initiative. Surely it would be useful for many users. I
88strongly resisted, fiercely fighting and rejecting such idea, on the basis that
89attempting to read proprietary closed formats such as Excel was a complete
90nonsense, and probably a very difficult if not impossible task.
91
92At the end of this very animated discussion, Markus pronounced the magic spell
93that suddenly convinced me about the absolute validity of his suggestions:
94<em>I can raise some funding for this project</em> After hearing such wise and
95significant words from Markus I immediately realized that developing a new
96driver for accessing Excel .xls documents surely was an exciting and useful task
97after all.
98
99So FreeXL / VirtualXL was conceived at that exact moment, and slowly
100began its development cycle.
101*/
102
103/** \page Specs File format specifications and source information
104
105The .xls binary file format is extensively documented and is publicly available.
106The most authoritative source is made available by Microsoft at
107http://msdn.microsoft.com/en-us/library/cc313154%28v=office.12%29.aspx
108
109A simpler option is made available by Open Office:
110http://sc.openoffice.org/excelfileformat.pdf
111
112Searching the web you'll easily find several other valuable information sources.
113*/
114
115/** \page Format About the .xls binary format
116
117What a .xls binary file really is
118
119(Prepare yourself to be continuously surprised by many unexpected revelations
120...)
121
122You may already know that there are many different versions of .xls files.
123Different versions have different capabilities. So we'll start by
124reviewing the Excel evolutionary history and and we'll introduce some Microsoft
125jargon because it's central to understanding the underlying operations.
126
127\section CFBF CFBF
128
129Unexpected Revelation #1:
130<em>There is no .xls file format. Its really a common file suffix applied to
131many different things.</em>
132
133Recent Microsoft Office document files are based on a common container layout
134named CFBF (Compound File Binary Format). This container format is the same for
135Excel (.xls), Word (.doc_ and PowerPoint (.ppt) amongst other applications.
136More information:
137 - http://en.wikipedia.org/wiki/Compound_File_Binary_Format
138 - http://msdn.microsoft.com/en-us/library/dd942138%28v=prot.13%29.aspx
139
140Unexpected Revelation #2:
141<em>CFBF  is more of a file system than a file format</em>
142
143A CFBF file is divided into many equal-sized blocks named sectors.
144Such sectors cannot be directly accessed. In order to retrieve sectors in the
145expected logical order a FAT (File Allocation Table) is allocated within
146the CFBF file. A CFBF file is internally organized as if it was a raw physical
147disk. The design is based on Microsoft own FAT file-system as used by MS-DOS and
148early versions of Windows. The first sector of the CFBF file acts as if it was a
149kind-of MBR (Master Boot Record) - this first sector provides information about
150the layout and type of the CFBF file, such as block/sector size and version.
151A FAT chain allows a reader to re-assemble the sectors in the required logical
152order. There is a list of free block, and very large files may use a
153double indirection (DIFAT - Double Indirection FAT). A CFBF file always has at
154least a root directory: but a complete directory tree can be provided.
155
156A CFBF file can contain many and many distinct independent files. Just to make
157things a little clearer, Microsoft calls such pseudo-files (I mean: the many
158fake ones contained within the real CFBF file) <em>streams</em>
159
160The practical consequence is that any software tool attempting to access an
161Excel binary document must first be able to correctly access this CFBF
162container format.
163
164\section BIFF BIFF
165Unexpected Revelation #3:
166<em>An Excel document will contain a stream (pseudo-file) named Workbook in
167the root directory of the CFBF file (filesystem).</em>
168
169The Workbook stream is internally structured accordingly to the BIFF (Binary
170Interchange File Format) specifications. You can think of the BIFF as the real
171Excel binary format (following more conventional naming rules). Several BIFF
172versions were introduced during the years: and there are significant differences
173between them.
174
175An useful correspondence table relating corresponding Excel and BIFF versions:
176
177<table>
178    <tr>
179	<th>
180	    <b>Excel Version</b>
181	</th>
182	<th>
183	    <b>Commercial Name</b>
184	</th>
185	<th>
186	    <b>BIFF Version</b>
187	</th>
188	<th>
189	    <b>Release Year</b>
190	</th>
191	<th>
192	    <b>Notes</b>
193	</th>
194    </tr>
195    <tr>
196	<td>2.x</td>
197	<td>Excel 2.0</td>
198	<td>BIFF2</td>
199	<td>1987</td>
200	<td>Before CFBF. File is the BIFF stream, containing a single
201	worksheet.</td>
202    </tr>
203    <tr>
204	<td>3.0</td>
205	<td>Excel 3.0</td>
206	<td>BIFF3</td>
207	<td>1990</td>
208	<td>Before CFBF. File is the BIFF stream, containing a single
209	worksheet.</td>
210    </tr>
211    <tr>
212	<td>4.0</td>
213	<td>Excel 4.0</td>
214	<td>BIFF4</td>
215	<td>1992</td>
216	<td>Before CFBF. File is the BIFF stream, containing a single
217	worksheet.</td>
218    </tr>
219    <tr>
220	<td>5.0</td>
221	<td>Excel 5.0</td>
222	<td>BIFF5</td>
223	<td>1993</td>
224	<td>Starting with BIFF5, a single Workbook can internally store many
225	individual Worksheets. The BIFF stream is stored in the CFBF file
226	container.</td>
227    </tr>
228    <tr>
229	<td>7.0</td>
230	<td>Excel 95</td>
231	<td>BIFF5</td>
232	<td>1995</td>
233	<td></td>
234    </tr>
235    <tr>
236	<td>8.0</td>
237	<td>Excel 98</td>
238	<td>BIFF8</td>
239	<td>1998</td>
240	<td></td>
241    </tr>
242    <tr>
243	<td>9.0</td>
244	<td>Excel 2000</td>
245	<td>BIFF8</td>
246	<td>1999</td>
247	<td></td>
248    </tr>
249    <tr>
250	<td>10.0</td>
251	<td>Excel XP</td>
252	<td>BIFF8</td>
253	<td>2001</td>
254	<td></td>
255    </tr>
256    <tr>
257	<td>11.0</td>
258	<td>Excel 2003</td>
259	<td>BIFF8</td>
260	<td>2003</td>
261	<td></td>
262    </tr>
263    <tr>
264	<td>12.0</td>
265	<td>Excel 2007</td>
266	<td>BIFF8</td>
267	<td>2007</td>
268	<td>Introduced alternate XML format, which is usually the default
269	for new files.</td>
270    </tr>
271    <tr>
272	<td>14.0</td>
273	<td>Excel 2010</td>
274	<td>BIFF8</td>
275	<td>2010</td>
276	<td>XML format is usually the default for new files.</td>
277    </tr>
278</table>
279
280Note that FreeXL does not support the new XML format which is a completely
281different and unrelated format.
282
283Perhaps you are now expecting that BIFF will simply and directly encode your
284spreadsheet data. Unfortunately, you should have know better given the steps
285we took to get here...
286
287Unexpected Revelation #4:
288<em>Any BIFF stream (pseudo-file stored within a CFBF container file) is
289internally organized as a collection of variable-length records.</em>.
290
291Each record starts with
292 - a 16 bit unsigned integer specifying the record type
293 - another 16 bit unsigned integer specifies the record data length (in bytes)
294 excluding the standard type-size prefix.
295
296Note that there are many different record types, and the record size / layout
297may differ for different BIFF versions.
298
299Three record types have an absolutely special meaning:
300 - a BOF [Beginning Of File] record marks starting of a different sub-stream.
301 - an EOF [End Of File] record marks ending of current sub-stream.
302 - a CONTINUE record means that the previous record exceeded the maximum size
303 for a record, and the previous record data payload will be spanned on
304 following CONTINUE records for as many CONTINUE records as are required to
305 store the full data size.
306
307Unexpected Revelation #5:
308<em>So a BIFF stream (pseudo-file) isn't really a file - it's more like a
309collection of individual sub-streams, each one of which is enclosed between BOF
310/ EOF markers.</em>
311
312The most recent BIFF8 requires that at least the following internal
313sub-streams are be defined:
314 - the first sub-stream contains workbook level global data and metadata, such
315 as author, password protection, styles, formats, window settings and so on
316 - list of individual worksheets included into the Workbook, where each
317 worksheet is identified by a name and by a type (data Worksheet, Chart,
318 Visual Basic module ... visible, hidden ...), and relative offset position of
319 the corresponding BOF record allows for fast positioning.
320 - any text string is stored here into the SST [Shared String Table], so
321 individual text cells simply refer the corresponding SST entry by index (in all
322 previous BIFF version text strings are directly stored into the appropriate
323 cell).
324 - any subsequent sub-stream represents a single Worksheet, and the most
325 relevant data stored at Worksheet level are dimension (number of valid rows and
326 columns) and any cell value data.
327
328We will now see how BIFF encodes individual data types with several further
329amazing surprises are still to come. Be prepared!
330
331\section BIFF Supported data types and encoding
332
333Leaving aside special values such as images, OLE, COM, Visual Basic related
334items and so on, the basic data types are supported in BIFF:
335 - text strings
336 - numbers (both integers and decimals)
337 - dates, date-times and times
338 - NULL (empty cell)
339
340Note that any multi-byte value is stored in BIFF accordingly using Little Endian
341byte ordering (i.e. least significant byte comes first, most significant byte
342comes last).
343
344<table>
345    <tr>
346	<th><b>BIFF Record Type</b></th>
347	<th><b>BIFF Version</b></th>
348	<th><b>Content Type</b></th>
349    </tr>
350    <tr>
351	<td>INTEGER</td>
352	<td>BIFF2</td>
353	<td>16 bit unsigned integer</td>
354    </tr>
355    <tr>
356	<td>NUMBER</td>
357	<td>BIFF2 BIFF3 BIFF4 BIFF5 BIFF8</td>
358	<td>64 bit floating point (double precision)</td>
359    </tr>
360    <tr>
361	<td>RK</td>
362	<td>BIFF3 BIFF4 BIFF5 BIFF8</td>
363	<td>number, variant-type: INTEGER FLOAT DATE DATETIME TIME
364	(please see the corresponding detailed description)</td>
365    </tr>
366    <tr>
367	<td>MULRK</td>
368	<td>BIFF5 BIFF8</td>
369	<td>a variable-sized array of elementary RK values.
370	associated to a range of consecutive cells on the same row </td>
371    </tr>
372    <tr>
373	<td>LABEL</td>
374	<td>BIFF2 BIFF3 BIFF4 BIFF5 BIFF8</td>
375	<td>text string, variable-length.
376	(please see the corresponding detailed description)</td>
377    </tr>
378    <tr>
379	<td>LABELSST</td>
380	<td>BIFF8</td>
381	<td>text string, variable-length. based on the global SST
382	[Shared String Table] stored at the workbook level, so a LABELSST simply
383	requires providing the corresponding SST index.</td>
384    </tr>
385</table>
386
387So the BIFF record type that is easy to handle is NUMBER, which is essentially
388a C-style <em>double</em>. Other record types require additional handling.
389
390\subsection RK RK values
391
392An RK value is a 32 bit value.
393
394The least significant two bits are a bit-mask (in little endian order, so the
395least two significant bits in the first byte that is read):
396 - if 0x02 is set the RK value represents a 30 bit signed integer, otherwise it
397 represents a 64 bit floating point double precision number requiring special
398 reconstruction.
399 - if 0x01 is set the corresponding value needs to be divided by 100, so even an
400 integer actually becomes a floating point double precision.
401
402When interpreting RK values as a signed integer, right shifting two bits is
403required:
404\code
405int value = rk_value >> 2;
406\endcode
407
408When interpreting RK values as a 64 bit floating point, two steps are required:
409 - the RK value requires appropriate  masking:
410\code
411int value = rk_value & 0xfffffffc;
412\endcode
413 - then the 32 bit value will be copied into a 64 bit buffer, and the least
414 significant four bytes need to be initialized as zeroes: 0x00000000.
415
416As a final step, if 0x01 was set into the bit-mask, now we have to divide by
417100 before returning the effective cell value. So for 32 bit integers:
418\code
419double final_value = (double)value / 100.0;
420\endcode
421and for 64 bit floats:
422\code
423double final_value = value / 100.0;
424\endcode
425
426\subsection Text Text values
427
428Any BIFF version from BIFF2 to BIFF5 simply supports CodePage based character
429encoding, i.e. each character simply requires 8 bits to be represented  (single
430byte). Correct representation of characters requires knowing which one CodePage
431table has to be applied. This can be determined from the workbook or
432worksheet metadata (it is the CODEPAGE record).
433
434BIFF8 is much more sophisticated, since any text
435string is usually encoded as Unicode in UTF-16 Little Endian [UTF-16LE] format.
436This encoding is a multi-byte encoding (two bytes are required to represent a
437single character), but being universal no character table is required.
438
439BIFF text strings are never null-terminated. The actual length is always
440explicitly stated, as an 8 bit unsigned int or as a 16 bit unsigned int
441(depending on BIFF versions).
442
443FreeXL is intended to be strictly interoperable with SQLite and SpatiaLite, so
444any text string has to be converted to UTF-8 encoding. GNU libiconv can easily
445handle any required charset conversion. So we can simply fetch the appropriate
446bytes, then call iconv() as appropriate, and we'll immediately get back the
447corresponding UTF-8 encoded text string.
448
449Converting Unicode based text strings is a little more complex, because
450each Unicode string is prefixed by a mask byte, specifying how the string is
451encoded:
452 - if 0x01 is set, then the string really is 16 bit per character Unicode,
453 otherwise a stripped notation is used instead. Stripped notation means
454 that the characters are actually represented as single bytes, so already
455 have the UTF8 equivalent.
456 - if 0x04 and/or 0x08 are set, than some further variable-length data
457 (providing information on text decoration such as italics, bold, underline) is
458 inserted immediately before and after the text string itself, so we must
459 carefully skip over this extra data so to maintain the right byte alignment.
460
461Note that the string length is expressed in characters, not in bytes, so the
462actual length in bytes is twice the indicated length.
463
464\subsection Dates Retrieving Date, DateTime and Time values.
465
466Dates, DateTimes and Time values are also a little complicated. Any Date is
467expressed as an Integer (number of days since the conventional reference day):
468 - for Windows Excel the reference day (day 0) is 1900, January 1
469 - for Mac Excel the reference day (day 0) is 1904, January 2
470
471There is no possible ambiguity, because the DATEMODE metadata record
472specifies tells which reference day is to be used.
473
474An odd bug affects Excel, which (incorrectly) treats 1900 as a leap year.
475Therefore, the non-existent 29 February 1900 has to be included in the days
476calculation so to get the expected Date value.
477
478Any Time is expressed as a Fraction (percent of seconds since midnight).
4790.5 corresponds to 12:00:00 (midday), 0.25 corresponds to 06:00:00, 0.75
480corresponds to 18:00:00 and so on.
481
482So a DateTime is simply the sum of a Date value and of a Time value. Dates
483can be represented by Integers: but Times and DateTimes require a floating point
484number.
485
486The complication with Dates, DateTimes and Time values is that the data-type
487does not specify when a cell values has to be interpreted as a Date or Time -
488it is simply an Integer or Float numbers like any other. A further indirection
489has to applied so to correctly recognize Dates, DateTimes and Times:
490 - each NUMBER, RK or MULRK value exposes an index referencing the XF
491(Extended Format) entry associated with the corresponding cell.
492 - each XF record specifies an unique combination of font, alignment, color
493 and so on, however a further indirection specifies the corresponding FORMAT
494 entry
495 - each FORMAT record specifies an output format, such as M/D/YY, h:mm:ss AM/PM
496 or M/D/YY h:mm: and this finally gives us a good chance to guess which cell
497 values are intended to represent Date/Time values.
498
499Both XF and FORMAT records are globally stored at the Workbook level, and
500represent ordered arrays.
501
502If you haven't yet given up, if you aren't yet become totally mind-boggled,
503and if you are still awake and conscious, then you now know how .xls files
504are internally organized and structured.
505
506Be happy and feel proud of yourself.
507*/
508
509/** \page OtherTools Other tools and libraries
510There an impressively wide choice of Free and open source libraries and tools
511supporting the .xls format. A sample:
512
513 - Gnumeric [http://projects.gnome.org/gnumeric/] seems to be the pioneer of
514them all, and probably was the first FLOSS tool able to read and write .xls
515(circa 2001). Part of the Gnome Office project.
516 - KSpread / Caligra Tables [http://www.calligra-suite.org/tables/] / KCells
517[http://www.koffice.org/kcells/] are similar (but now distinct) programs
518from the KOffice and Calligra Office projects.
519 - Open Office Calc [http://www.openoffice.org/product/calc.html] and
520LibreOffice Calc [http://www.libreoffice.org/features/calc/] are similar
521(but now distinct) spreadsheet applications originally from the StartOffice
522code base. Probably the most comprehensive support in FLOSS.
523 - Apache POI-HSSF [http://poi.apache.org/spreadsheet/index.html] is a
524sophisticated Java library fully supporting .xls files.
525 - JExcelAPI [http://jexcelapi.sourceforge.net/] is another Java library
526(much simpler and lighter than POI-HSSF) supporting .xls files.
527 - several C or C++ libraries exist as well: quite curiously one is named
528libxls and another xlslib, but they are two absolutely distinct and unrelated
529packages
530 - There are other implementations are available based on .NET or PHP.
531
532A quick critical review:
533 - GUI tools implementations are difficult to re-use. They focus on import of
534 all formulas, GUI presentation and so on, which is really a different use.
535 - Java libraries seem to be really interesting, but Java is difficult to call
536 from a C or C++ program.
537 - Several C/C++ libraries exist, but none of them seems to be sufficient and
538  stable as required. Some are still marked to be “beta-stage” despite
539  being released some four or five years ago - project activity seems to be
540  very low, and download statistics are discouraging.
541
542Conclusion: a suitable C/C++ library supporting data extraction from .xls files
543doesn't seem to exists: or at least, there is no obvious reference choice.
544
545So we'll go on the hardest way, we'll develop yet another .xls reading
546library: FreeXL.
547*/
548