:source: http://www.pytables.org/moin/FAQ
:revision: 95
:date: 2011-06-13 08:40:20
:author: FrancescAlted

.. py:currentmodule:: tables

===
FAQ
===


General questions
=================

What is PyTables?
-----------------

PyTables is a package for managing hierarchical datasets designed to
efficiently cope with extremely large amounts of data.

It is built on top of the HDF5_ library, the `Python language`_ and the
NumPy_ package.
It features an object-oriented interface that, combined with C extensions
for the performance-critical parts of the code, makes it a fast yet
extremely easy-to-use tool for interactively storing and retrieving very
large amounts of data.


What are PyTables' licensing terms?
-----------------------------------

PyTables is free for both commercial and non-commercial use, under the terms
of the BSD license.

.. todo::

    link to the BSD license http://opensource.org/licenses/BSD-3-Clause
    or to a local copy


I'm having problems. How can I get support?
-------------------------------------------

The most common and efficient way is to subscribe to the PyTables
`users mailing list`_ (remember that you *need* to subscribe before sending
messages) and send a brief description of your issue there, together with a
short script that can reproduce it, if possible.
Hopefully, someone on the list will be able to help you.
It is also a good idea to check the `archives of the user's list`_ (you may
want to check the `Gmane archives`_ instead) to see whether your question has
already been answered.


Why HDF5?
---------

HDF5_ is the underlying C library and file format that enables PyTables to
efficiently deal with the data.  It has been chosen for the following reasons:

* Designed to efficiently manage very large datasets.
* Lets you organize datasets hierarchically.
* Very flexible and well tested in scientific environments.
* Good maintenance and improvement rate.
* Technical excellence (`R&D 100 Award`_).
* **It's Open Source software.**


Why Python?
-----------

1. Python is interactive.

   People familiar with data processing understand how powerful command line
   interfaces are for exploring mathematical relationships and scientific data
   sets.  Python provides an interactive environment with the added benefit of
   a full featured programming language behind it.

2. Python is productive for beginners and experts alike.

   PyTables is targeted at engineers, scientists, system analysts, financial
   analysts, and others who consider programming a necessary evil.  Any time
   spent learning a language or tracking down bugs is time spent not solving
   their real problem.  Python has a short learning curve and most people can
   do real and useful work with it in a day of learning.  Its clean syntax and
   interactive nature facilitate this.

3. Python is data-handling friendly.

   Python comes with nice idioms that make access to data much easier:
   general slicing (i.e. ``data[start:stop:step]``), list comprehensions,
   iterators and generators are all constructs that make interacting with
   your data very easy (see the short sketch below).
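
Here is a minimal sketch of how those idioms look when applied to PyTables
datasets.  The file and node names are invented for illustration, and it uses
the PyTables 3 spelling of the API (``tables.open_file``; older releases used
``tables.openFile``)::

    import tables

    # Open an existing file (hypothetical name) in read-only mode.
    with tables.open_file("example.h5", mode="r") as h5file:
        arr = h5file.root.detector.pressure       # some Array/EArray node
        chunk = arr[10:100:2]                      # slicing -> NumPy array

        # List comprehensions and iterators work on tables too.
        times = [row["time"] for row in h5file.root.detector.readout
                 if row["pressure"] > 20]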


Why NumPy?
----------

NumPy_ is a Python package to efficiently deal with large datasets
**in-memory**, providing containers for homogeneous data, heterogeneous data,
and string arrays.
PyTables uses these NumPy containers as *in-memory buffers* to push the I/O
bandwidth towards the platform limits.
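
For example, reading a dataset hands you a NumPy array directly, and you can
even reuse a pre-allocated NumPy buffer via the *out* argument of
:meth:`tables.Array.read`.  The file and node names below are invented for
the sketch::

    import numpy as np
    import tables

    with tables.open_file("example.h5", mode="r") as h5file:
        arr = h5file.root.detector.pressure    # hypothetical Array node

        data = arr.read()                      # whole dataset as a NumPy array

        # Fill a pre-allocated buffer instead of creating a new array.
        buf = np.empty(50, dtype=arr.atom.dtype)
        arr.read(start=0, stop=50, out=buf)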


Where can PyTables be applied?
==============================

In all the scenarios where one needs to deal with large datasets:

* Industrial applications

  - Data acquisition in real time
  - Quality control
  - Fast data processing

* Scientific applications

  - Meteorology, oceanography
  - Numerical simulations
  - Medicine (biological sensors, general data gathering & processing)

* Information systems

  - System log monitoring & consolidation
  - Tracing of routing data
  - Alert systems in security


Is PyTables safe?
-----------------

Well, first of all, let me state that PyTables does not support transactional
features yet (we don't even know if we will ever be motivated to implement
this!), so there is always the risk that you can lose your data in case of an
unexpected event while writing (like a power outage, a system shutdown ...).
Having said that, if your typical scenario is *write once, read many*, then
using PyTables is perfectly safe, even when dealing with extremely large
amounts of data.


Can PyTables be used in concurrent access scenarios?
----------------------------------------------------

It depends. Concurrent reads are no problem at all. However, as soon as a
process (or thread) tries to write, problems will start to appear.  First,
PyTables doesn't support locking at any level, so several processes writing
concurrently to the same PyTables file will probably end up corrupting it, so
don't do this!  Even having only one process writing and the others reading is
a hairy thing, because the reading processes might be reading incomplete data
from a concurrent write operation.

The solution would be to lock the file while writing and unlock it after the
file has been flushed.  Also, in order to avoid cache (HDF5_, PyTables)
problems with reading applications, you would need to re-open your files
whenever you are going to issue a read operation.  If re-opening is
unacceptable in terms of speed, you may want to do all your I/O operations in
one single process (or thread) and communicate the results via sockets,
:class:`Queue.Queue` objects (in case of using threads), or whatever, with the
client process/thread.
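
A minimal sketch of that single-writer pattern is shown below.  It uses
:class:`multiprocessing.Queue` and invented file and table names; it is only
an illustration, not one of the bundled example scripts described next::

    import multiprocessing as mp
    import tables

    def writer(queue, filename):
        """Single process that owns the file; everyone else sends it data."""
        with tables.open_file(filename, mode="w") as h5file:
            table = h5file.create_table("/", "measurements",
                                        {"value": tables.Float64Col()})
            row = table.row
            while True:
                item = queue.get()
                if item is None:        # sentinel: stop writing
                    break
                row["value"] = item
                row.append()
                table.flush()           # make the data visible on disk

    if __name__ == "__main__":
        queue = mp.Queue()
        proc = mp.Process(target=writer, args=(queue, "shared.h5"))
        proc.start()
        for value in (1.0, 2.0, 3.0):   # any number of producers could do this
            queue.put(value)
        queue.put(None)
        proc.join()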

The examples directory contains two scripts demonstrating methods of accessing
a PyTables file from multiple processes.

The first, *multiprocess_access_queues.py*, uses a
:class:`multiprocessing.Queue` object to transfer read and write requests from
multiple *DataProcessor* processes to a single process responsible for all
access to the PyTables file.  The results of read requests are then transferred
back to the originating processes using other :class:`Queue` objects.

The second example script, *multiprocess_access_benchmarks.py*, demonstrates
and benchmarks four methods of transferring PyTables array data between
processes.  The four methods are:

 * Using :class:`multiprocessing.Pipe` from the Python standard library.
 * Using a memory mapped file that is shared between two processes.  The NumPy
   array associated with the file is passed as the *out* argument to the
   :meth:`tables.Array.read` method.
 * Using a Unix domain socket.  Note that this example uses the 'abstract
   namespace' and will only work under Linux.
 * Using an IPv4 socket.


What kind of containers does PyTables implement?
------------------------------------------------

PyTables supports a series of data containers that address the specific needs
of the user. Below is a brief description of them:

:class:`Table`:
    Lets you deal with heterogeneous datasets. Allows compression. Enlargeable.
    Supports nested types. Good performance for reading/writing data.
:class:`Array`:
    Provides quick and dirty array handling. No compression allowed.
    Not enlargeable. Can be used only with relatively small datasets (i.e.
    those that fit in memory). It provides the fastest I/O speed.
:class:`CArray`:
    Provides compressed array support. Not enlargeable. Good speed when
    reading/writing.
:class:`EArray`:
    Most general array support. Compressible and enlargeable. It is pretty
    fast at extending, and very good at reading.
:class:`VLArray`:
    Supports collections of homogeneous data with a variable number of entries.
    Compressible and enlargeable. I/O is not very fast.
:class:`Group`:
    The structural component.
    A hierarchically-addressable container for HDF5 nodes (each of these
    containers, including Group, is a node), similar to a directory in a
    UNIX filesystem.

Please refer to the :doc:`usersguide/libref` for more specific information.
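
As a quick, hedged illustration, here is how some of these containers can be
created.  The file name, node names, data and compression settings are
arbitrary choices for the example (and Blosc_ is only one of several available
compressors)::

    import numpy as np
    import tables

    class Particle(tables.IsDescription):
        name = tables.StringCol(16)       # fixed-size string column
        energy = tables.Float64Col()      # double-precision column

    with tables.open_file("containers.h5", mode="w") as h5file:
        group = h5file.create_group("/", "experiment")                # Group

        table = h5file.create_table(group, "particles", Particle)     # Table

        arr = h5file.create_array(group, "constants", np.arange(10))  # Array

        filters = tables.Filters(complevel=5, complib="blosc")
        earray = h5file.create_earray(group, "timeseries",
                                      tables.Float64Atom(), shape=(0,),
                                      filters=filters)                # EArray
        earray.append(np.arange(1000, dtype="float64"))

        vlarray = h5file.create_vlarray(group, "ragged",
                                        tables.Int32Atom())           # VLArray
        vlarray.append([1, 2, 3])
        vlarray.append([4, 5])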


Cool! I'd like to see some examples of use.
-------------------------------------------

Sure. Go to the HowToUse section to find simple examples that will help you
get started.
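
In the meantime, here is a small, self-contained taste.  The file, table and
column names are invented for the example::

    import tables

    class Reading(tables.IsDescription):
        time = tables.Float64Col()
        pressure = tables.Float32Col()

    with tables.open_file("quickstart.h5", mode="w") as h5file:
        table = h5file.create_table("/", "readings", Reading)

        # Append a few rows through the Row accessor.
        row = table.row
        for i in range(1000):
            row["time"] = float(i)
            row["pressure"] = i % 40
            row.append()
        table.flush()

        # Read everything back as a NumPy structured array ...
        data = table.read()

        # ... or iterate over just the rows matching a condition.
        recent = [r["time"] for r in table.where("pressure > 35")]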


Can you show me some screenshots?
---------------------------------

Well, PyTables is not a graphical library by itself.  However, you may want to
check out ViTables_, a GUI tool to browse and edit PyTables & HDF5_ files.


Is PyTables a replacement for a relational database?
----------------------------------------------------

No, by no means.  PyTables lacks many features that are standard in most
relational databases.  In particular, it does not have support for
relationships (beyond the hierarchical one, of course) between datasets, and it
does not have transactional features.  PyTables is more focused on speed and on
dealing with really large datasets than on implementing the above features.  In
that sense, PyTables can be best viewed as a *teammate* of a relational
database.

For example, if you have very large tables in your existing relational
database, they will take lots of space on disk, potentially reducing the
performance of the relational engine.  In such a case, you can move those huge
tables out of your existing relational database to PyTables, and let your
relational engine do what it does best (i.e. manage relatively small or medium
datasets with potentially complex relationships), and use PyTables for what it
has been designed for (i.e. manage large amounts of data which are loosely
related).


How can PyTables be fast if it is written in an interpreted language like Python?
----------------------------------------------------------------------------------

Actually, all of the critical I/O code in PyTables is a thin layer of code on
top of HDF5_, which is a very efficient C library.  Cython_ is used as the
*glue* language to generate "wrappers" around HDF5 calls so that they can be
used in Python.  Also, the use of an efficient numerical package such as NumPy_
makes the most costly operations effectively run at C speed.  Finally,
time-critical loops are usually implemented in Cython_ which, if used
properly, allows generating code that runs at almost pure C speed.


If it is designed to deal with very large datasets, then PyTables should consume a lot of memory, shouldn't it?
------------------------------------------------------------------------------------------------------------------

Well, you already know that PyTables sits on top of HDF5, Python and NumPy_,
and if we add its own logic (~7500 lines of code in Python, ~3000 in Cython and
~4000 in C), then we should conclude that PyTables isn't exactly a paradigm of
lightness.

Having said that, PyTables (as HDF5_ itself) tries very hard to optimize
memory consumption by implementing a series of features like dynamic
determination of buffer sizes, a *Least Recently Used* cache for keeping unused
nodes out of memory, and extensive use of compact NumPy_ data containers.
Moreover, PyTables is in a relatively mature state and most memory leaks have
already been addressed and fixed.

Just to give you an idea of what you can expect, a PyTables program can deal
with a table with around 30 columns and 1 million entries using as little as
13 MB of memory (on a 32-bit platform).  All in all, it is not that much, is it?


Why was PyTables born?
----------------------

Because, back in August 2002, one of its authors (`Francesc Alted`_) needed to
save lots of hierarchical data in an efficient way for later post-processing.
After trying out several approaches, he found that each of them presented its
own inconveniences.  For example, working with file sizes larger than, say,
100 MB was rather painful with ZODB (it took lots of memory with the version
available at that time).

The netCDF3_ interface provided by `Scientific Python`_ was great, but it did
not allow the data to be structured hierarchically; besides, netCDF3_ only
supports homogeneous datasets, not heterogeneous ones (i.e. tables).  (As an
aside, netCDF4_ overcomes many of the limitations of netCDF3_, although,
curiously enough, it is built on top of HDF5_, the library chosen as the base
for PyTables from the very beginning.)

So, he decided to give HDF5_ a try, started writing his own wrappers for it
and, voilà, this is how the first public release of PyTables (0.1) saw the
light in October 2002, three months after his itch started to eat him ;-).


Does PyTables have a client-server interface?
---------------------------------------------

Not by itself, but you may be interested in using PyTables through pydap_, a
Python implementation of the OPeNDAP_ protocol.  Have a look at the `PyTables
plugin`_ of pydap_.


How does PyTables compare with the h5py project?
------------------------------------------------

Well, they are similar in that both packages are Python interfaces to the HDF5_
library, but there are some important differences to be noted.  h5py_ is an
attempt to map the HDF5_ feature set to NumPy_ as closely as possible.  In
addition, it also provides access to nearly all of the HDF5_ C API.

In contrast, PyTables builds an additional abstraction layer on top of HDF5_
and NumPy_ where it implements things like an enhanced type system, an
:ref:`engine for enabling complex queries <searchOptim>`, an `efficient
computational kernel`_, `advanced indexing capabilities`_ or an undo/redo
feature, to name just a few.  This additional layer also allows PyTables to be
relatively independent of its underlying libraries (and their possible
limitations).  For example, PyTables can support HDF5_ data types like
`enumerated` or `time` that are available in the HDF5_ library but not in the
NumPy_ package; or even perform powerful complex queries that are not
implemented directly in either HDF5_ or NumPy_.

Furthermore, PyTables also tries hard to be a high performance interface to
HDF5/NumPy, implementing niceties like internal LRU caches for nodes and other
data and metadata, :ref:`automatic computation of optimal chunk sizes
<chunksizeFineTune>` for the datasets, and a variety of compressors, ranging
from slow but efficient (bzip2_) to extremely fast ones (Blosc_), in addition
to the standard `zlib`_.  Another difference is that PyTables makes use of
numexpr_ to accelerate internal computations (for example, the evaluation of
complex queries) as much as possible.

For a contrasting opinion, you may want to check the PyTables/h5py comparison
in a similar entry of the `FAQ of h5py`_.


I've found a bug.  What do I do?
--------------------------------

The PyTables development team works hard to make this eventuality as rare as
possible but, as in any software written by human beings, bugs do occur.  If
you find any bug, please tell us by filing a bug report in the `issue
tracker`_ on GitHub_.


Is it possible to get involved in PyTables development?
--------------------------------------------------------

Indeed. We are keen for more people to help out by contributing code, unit
tests and documentation, and by helping maintain this wiki. Drop us a mail on
the `users mailing list`_ and tell us in which area you would like to work.


How can I cite PyTables?
------------------------

The recommended way to cite PyTables in a paper or a presentation is as
follows:

* Author: Francesc Alted, Ivan Vilata and others
* Title: PyTables: Hierarchical Datasets in Python
* Year: 2002 -
* URL: http://www.pytables.org

Here's an example of a BibTeX entry::

    @Misc{,
      author =    {PyTables Developers Team},
      title =     {{PyTables}: Hierarchical Datasets in {Python}},
      year =      {2002--},
      url =       {http://www.pytables.org/}
    }


PyTables 2.x issues
===================

I'm having problems migrating my apps from PyTables 1.x into PyTables 2.x. Please, help!
-------------------------------------------------------------------------------------------

Sure.  However, you should first check out the :doc:`MIGRATING_TO_2.x`
document.
It should provide hints about the most frequently asked questions in this
regard.


For combined searches like `table.where('(x<5) & (x>3)')`, why was a `&` operator chosen instead of an `and`?
-----------------------------------------------------------------------------------------------------------------

Search expressions are in fact Python expressions written as strings, and they
are evaluated as such.  This has the advantage of not having to learn a new
syntax, but it also implies some limitations with the logical `and` and `or`
operators, namely that they cannot be overloaded in Python.  Thus, it is
impossible right now to get an element-wise operation out of an expression like
`'array1 and array2'`.  That's why some other operator had to be chosen, and
`&` and `|` are the most similar to their C counterparts `&&` and `||`, which
are not available in Python either.

You should be careful about expressions like `'x<5 & x>3'` and others like
`'3 < x < 5'`, which *won't work as expected*, because of the different
operator precedence and the absence of an overloaded logical `and` operator.
More on this in the appendix about condition syntax in the `HDF5 manual`_.
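
As a hedged illustration (reusing the invented ``readings`` table from the
quick-start sketch earlier in this document), the correct and the problematic
spellings look like this::

    import tables

    with tables.open_file("quickstart.h5", mode="r") as h5file:
        table = h5file.root.readings

        # Correct: parenthesize each comparison and combine with & (or |).
        hits = [r["time"]
                for r in table.where("(pressure > 3) & (pressure < 5)")]

        # These look reasonable but will NOT behave as element-wise conditions:
        #   table.where("pressure > 3 and pressure < 5")  # `and` not overloadable
        #   table.where("3 < pressure < 5")               # chained comparison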

There are quite a few packages affected by those limitations, including NumPy_
itself and SQLObject_, and there have been fairly long discussions about adding
the possibility of overloading logical operators to Python (see `PEP 335`_ and
`this thread`__ for more details).

__ https://mail.python.org/pipermail/python-dev/2004-September/048763.html


I cannot select rows using in-kernel queries with a condition that involves a UInt64Col. Why?
------------------------------------------------------------------------------------------------

This turns out to be a limitation of the numexpr_ package.  Internally,
numexpr_ uses a limited set of types for doing calculations, and unsigned
integers are always upcast to the smallest signed integer that can fit the
information.  The problem here is that there is no (standard) signed integer
type that can hold the full range of a 64-bit unsigned integer.

So, your best bet right now is to avoid `uint64` types if you can.  If you
absolutely need `uint64`, the only way of doing selections on such a column is
through regular Python selections.  For example, if your table has a `colM`
column which is declared as a `UInt64Col`, then you can still select rows
based on it with::

    [row['colN'] for row in table if row['colM'] < X]


However, this approach will generally be slow (especially on Win32 platforms,
where the values will be converted to Python `long` values).


I'm already using PyTables 2.x but I'm still getting numarray objects instead of NumPy ones!
------------------------------------------------------------------------------------------------

This is most probably due to the fact that you are using a file created with
the PyTables 1.x series.  By default, PyTables 1.x set an HDF5 attribute
`FLAVOR` with the value `'numarray'` on all leaves.  Now, PyTables 2.x sees
this attribute and obediently converts the internal object (truly a NumPy
object) into a `numarray` one.  For PyTables 2.x files the `FLAVOR` attribute
will only be saved when explicitly set via the `leaf.flavor` property (or when
passing data to an :class:`Array` or :class:`Table` at creation time), so you
will be able to distinguish default flavors from user-set ones by checking the
existence of the `FLAVOR` attribute.

Meanwhile, if you don't want to receive `numarray` objects when reading old
files, you have several possibilities:

* Remove the flavor for your datasets by hand::

     for leaf in h5file.walkNodes(classname='Leaf'):
         del leaf.flavor

* Use the :program:`ptrepack` utility with the flag `--upgrade-flavors`
  so as to convert all flavors in old files to the default (effectively by
  removing the `FLAVOR` attribute).
* Remove the `numarray` (and/or `Numeric`) package from your system.
  Then PyTables 2.x will return pure NumPy objects (it can't be
  otherwise!).


Installation issues
===================

Windows
-------

Error when importing tables
~~~~~~~~~~~~~~~~~~~~~~~~~~~

You have installed the binary installer for Windows and, when importing the
*tables* package, you are getting an error like::

    The command in "0x6714a822" refers to memory in "0x012011a0". The
    procedure "written" could not be executed.
    Click to ok to terminate.
    Click to abort to debug the program.

This problem can be due to a series of reasons, but the most probable one is
that a DLL library needed by PyTables is not at the correct version.  Please
double-check the versions of the required libraries for PyTables and install
newer versions, if needed.  In most cases, this solves the issue.

If you continue to have problems, be aware that other programs may install
libraries in the PATH that are **optional** for PyTables (for example BZIP2 or
LZO), but that will be used if they are found on your system (i.e. anywhere in
your :envvar:`PATH`).  So, if you find any of these libraries in your PATH,
upgrade them to the latest versions available (you don't need to re-install
PyTables).


Can't find LZO binaries for Windows
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Unfortunately, the LZO binaries for Windows seem to be unavailable from their
usual place at http://gnuwin32.sourceforge.net/packages/lzo.htm.  So, in order
to let people install this excellent compressor easily, we have packaged the
LZO binaries in a zip file available at:
http://www.pytables.org/download/lzo-win.  This zip file follows the same
structure as a typical GnuWin32_ package, so it is just a matter of unpacking
it in your ``GNUWIN32`` directory and following the :ref:`instructions
<prerequisitesBinInst>` in the `PyTables Manual`_.

Hopefully somebody will take care of maintaining the LZO binaries for Windows
again.


Testing issues
==============

Tests fail when running from IPython
------------------------------------

You may be getting errors related to doctests when running the test suite from
IPython.  This is a known limitation of IPython (see
http://lists.ipython.scipy.org/pipermail/ipython-dev/2007-April/002859.html).
Try running the test suite from the vanilla Python interpreter instead.


Tests fail when running from Python 2.5 and Numeric is installed
----------------------------------------------------------------

`Numeric` doesn't get along well with Python 2.5, even on 32-bit platforms.
This is a consequence of `Numeric` not being maintained anymore, and you should
consider migrating to NumPy as soon as possible.  To get rid of these errors,
just uninstall `Numeric`.


-----


.. target-notes::

.. _HDF5: http://www.hdfgroup.org/HDF5
.. _`Python language`: http://www.python.org
.. _NumPy: http://www.numpy.org
.. _`users mailing list`: https://groups.google.com/group/pytables-users
.. _`archives of the user's list`: https://sourceforge.net/p/pytables/mailman/pytables-users/
.. _`Gmane archives`: http://www.mail-archive.com/pytables-users@lists.sourceforge.net/
.. _`R&D 100 Award`: http://www.hdfgroup.org/HDF5/RD100-2002/
.. _ViTables: http://vitables.org
.. _Cython: http://www.cython.org
.. _`Francesc Alted`: http://www.pytables.org/moin/FrancescAlted
.. _netCDF3: http://www.unidata.ucar.edu/software/netcdf
.. _`Scientific Python`: http://dirac.cnrs-orleans.fr/plone/software/scientificpython
.. _netCDF4: http://www.unidata.ucar.edu/software/netcdf
.. _pydap: http://www.pydap.org
.. _OPeNDAP: http://opendap.org
.. _`PyTables plugin`: http://pydap.org/plugins/hdf5.html
.. _`PyTables Manual`: http://www.pytables.org/docs/manual
.. _h5py: http://www.h5py.org
.. _`efficient computational kernel`: http://www.pytables.org/moin/ComputingKernel
.. _`advanced indexing capabilities`: http://www.pytables.org/moin/PyTablesPro
.. _`automatic computation of optimal chunk sizes`: http://www.pytables.org/docs/manual/ch05.html#chunksizeFineTune
.. _bzip2: http://www.bzip.org
.. _Blosc: http://blosc.pytables.org
.. _`zlib`: http://zlib.net
.. _numexpr: https://github.com/pydata/numexpr
.. _`FAQ of h5py`: http://docs.h5py.org/en/latest/faq.html#what-s-the-difference-between-h5py-and-pytables
.. _`issue tracker`: https://github.com/PyTables/PyTables/issues
.. _GitHub: https://github.com
.. _`HDF5 manual`: http://www.hdfgroup.org/HDF5/doc/RM/RM_H5T.html
.. _SQLObject: http://sqlobject.org
.. _`PEP 335`: http://www.python.org/dev/peps/pep-0335
.. _GnuWin32: http://gnuwin32.sourceforge.net


.. todo:: fix links that point to wiki pages