.. sectionauthor:: Pierre Gerard-Marchant <pierregmcode@gmail.com>

*********************************************
Importing data with :func:`~numpy.genfromtxt`
*********************************************

NumPy provides several functions to create arrays from tabular data.
We focus here on the :func:`~numpy.genfromtxt` function.

In a nutshell, :func:`~numpy.genfromtxt` runs two main loops.  The first
loop converts each line of the file into a sequence of strings.  The second
loop converts each string to the appropriate data type.  This mechanism is
slower than a single loop, but gives more flexibility.  In particular,
:func:`~numpy.genfromtxt` is able to take missing data into account, when
other faster and simpler functions like :func:`~numpy.loadtxt` cannot.
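
The two passes can be pictured with a few lines of plain Python.  This is only
a conceptual sketch under simplifying assumptions (comma delimiter, float
columns, no missing data), not the actual implementation of
:func:`~numpy.genfromtxt`; the names ``rows`` and ``values`` are purely
illustrative::

   import numpy as np
   from io import StringIO

   text = "1, 2, 3\n4, 5, 6"

   # First loop: split each line into a sequence of strings.
   rows = [line.split(",") for line in text.splitlines()]

   # Second loop: convert each string to the target type (float here).
   values = [[float(item) for item in row] for row in rows]

   # For this simple input, the sketch agrees with genfromtxt.
   expected = np.genfromtxt(StringIO(text), delimiter=",")
   assert np.array_equal(np.array(values), expected)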

.. note::

   When giving examples, we will use the following conventions::

       >>> import numpy as np
       >>> from io import StringIO



Defining the input
==================

The only mandatory argument of :func:`~numpy.genfromtxt` is the source of
the data. It can be a string, a list of strings, a generator or an open
file-like object with a ``read`` method, for example, a file or
:class:`io.StringIO` object. If a single string is provided, it is assumed
to be the name of a local or remote file. If a list of strings or a generator
returning strings is provided, each string is treated as one line in a file.
When the URL of a remote file is passed, the file is automatically downloaded
to the current directory and opened.
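
As a small illustration of these input forms, the sketch below feeds the same
two lines to :func:`~numpy.genfromtxt` as a list of strings, as a generator
and as a file-like object.  The variable names are illustrative only::

   import numpy as np
   from io import StringIO

   lines = ["1 2 3", "4 5 6"]

   # A list of strings: each item is treated as one line.
   from_list = np.genfromtxt(lines)

   # A generator yielding strings behaves the same way.
   from_generator = np.genfromtxt(line for line in lines)

   # A file-like object with a ``read`` method.
   from_file_like = np.genfromtxt(StringIO("\n".join(lines)))

   assert np.array_equal(from_list, from_generator)
   assert np.array_equal(from_list, from_file_like)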

Recognized file types are text files and archives.  Currently, the function
recognizes ``gzip`` and ``bz2`` (``bzip2``) archives.  The type of
the archive is determined from the extension of the file: if the filename
ends with ``'.gz'``, a ``gzip`` archive is expected; if it ends with
``'.bz2'``, a ``bzip2`` archive is assumed.
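
A minimal sketch of reading a compressed file, assuming a throwaway file named
``example.csv.gz`` (a hypothetical name) written to a temporary directory::

   import gzip
   import os
   import tempfile

   import numpy as np

   with tempfile.TemporaryDirectory() as tmpdir:
       path = os.path.join(tmpdir, "example.csv.gz")

       # Write a small gzip-compressed text file.
       with gzip.open(path, "wt") as handle:
           handle.write("1,2,3\n4,5,6\n")

       # The '.gz' extension tells genfromtxt to read it as a gzip archive.
       arr = np.genfromtxt(path, delimiter=",")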



Splitting the lines into columns
================================

The ``delimiter`` argument
--------------------------

Once the file is defined and open for reading, :func:`~numpy.genfromtxt`
splits each non-empty line into a sequence of strings.  Empty or commented
lines are just skipped.  The ``delimiter`` keyword is used to define
how the splitting should take place.

Quite often, a single character marks the separation between columns.  For
example, comma-separated files (CSV) use a comma (``,``) or a semicolon
(``;``) as delimiter::

   >>> data = u"1, 2, 3\n4, 5, 6"
   >>> np.genfromtxt(StringIO(data), delimiter=",")
   array([[ 1.,  2.,  3.],
          [ 4.,  5.,  6.]])

Another common separator is ``"\t"``, the tabulation character.  However,
we are not limited to a single character; any string will do.  By default,
:func:`~numpy.genfromtxt` assumes ``delimiter=None``, meaning that the line
is split along white spaces (including tabs) and that consecutive white
spaces are considered as a single white space.
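
The sketch below contrasts an explicit tab delimiter with the default
``delimiter=None``; the sample strings are made up for illustration::

   import numpy as np
   from io import StringIO

   # Columns separated by exactly one tab character.
   tabbed = "1\t2\t3\n4\t5\t6"
   from_tabs = np.genfromtxt(StringIO(tabbed), delimiter="\t")

   # Default delimiter: any run of spaces and tabs separates two columns.
   spaced = "1   2 \t 3\n4 5      6"
   from_spaces = np.genfromtxt(StringIO(spaced))

   assert np.array_equal(from_tabs, from_spaces)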

Alternatively, we may be dealing with a fixed-width file, where columns are
defined as a given number of characters.  In that case, we need to set
``delimiter`` to a single integer (if all the columns have the same
size) or to a sequence of integers (if columns can have different sizes)::

   >>> data = u"  1  2  3\n  4  5 67\n890123  4"
   >>> np.genfromtxt(StringIO(data), delimiter=3)
   array([[   1.,    2.,    3.],
          [   4.,    5.,   67.],
          [ 890.,  123.,    4.]])
   >>> data = u"123456789\n   4  7 9\n   4567 9"
   >>> np.genfromtxt(StringIO(data), delimiter=(4, 3, 2))
   array([[ 1234.,   567.,    89.],
          [    4.,     7.,     9.],
          [    4.,   567.,     9.]])


The ``autostrip`` argument
--------------------------

By default, when a line is decomposed into a series of strings, the
individual entries are not stripped of leading or trailing white spaces.
This behavior can be overridden by setting the optional argument
``autostrip`` to a value of ``True``::

   >>> data = u"1, abc , 2\n 3, xxx, 4"
   >>> # Without autostrip
   >>> np.genfromtxt(StringIO(data), delimiter=",", dtype="|U5")
   array([['1', ' abc ', ' 2'],
          ['3', ' xxx', ' 4']], dtype='<U5')
   >>> # With autostrip
   >>> np.genfromtxt(StringIO(data), delimiter=",", dtype="|U5", autostrip=True)
   array([['1', 'abc', '2'],
          ['3', 'xxx', '4']], dtype='<U5')


The ``comments`` argument
-------------------------

The optional argument ``comments`` is used to define a character
string that marks the beginning of a comment.  By default,
:func:`~numpy.genfromtxt` assumes ``comments='#'``.  The comment marker may
occur anywhere on the line.  Any character present after the comment
marker(s) is simply ignored::

   >>> data = u"""#
   ... # Skip me !
   ... # Skip me too !
   ... 1, 2
   ... 3, 4
   ... 5, 6 #This is the third line of the data
   ... 7, 8
   ... # And here comes the last line
   ... 9, 0
   ... """
   >>> np.genfromtxt(StringIO(data), comments="#", delimiter=",")
   array([[1., 2.],
          [3., 4.],
          [5., 6.],
          [7., 8.],
          [9., 0.]])

.. versionadded:: 1.7.0

    When ``comments`` is set to ``None``, no lines are treated as comments.

.. note::

   There is one notable exception to this behavior: if the optional argument
   ``names=True``, the first commented line will be examined for names.
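
The comment marker does not have to be ``#``.  As a small sketch, the made-up
data below uses ``%`` both for a full-line comment and for a trailing remark::

   import numpy as np
   from io import StringIO

   data = "% units: metres\n1, 2\n3, 4 % trailing remark"

   # Everything from '%' to the end of the line is ignored.
   arr = np.genfromtxt(StringIO(data), comments="%", delimiter=",")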


Skipping lines and choosing columns
===================================

The ``skip_header`` and ``skip_footer`` arguments
-------------------------------------------------

The presence of a header in the file can hinder data processing.  In that
case, we need to use the ``skip_header`` optional argument.  The
value of this argument must be an integer which corresponds to the number
of lines to skip at the beginning of the file, before any other action is
performed.  Similarly, we can skip the last ``n`` lines of the file by
using the ``skip_footer`` argument and giving it a value of ``n``::

   >>> data = u"\n".join(str(i) for i in range(10))
   >>> np.genfromtxt(StringIO(data))
   array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.])
   >>> np.genfromtxt(StringIO(data),
   ...               skip_header=3, skip_footer=5)
   array([ 3.,  4.])

By default, ``skip_header=0`` and ``skip_footer=0``, meaning that no lines
are skipped.
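
A slightly more realistic sketch, assuming a made-up file with two lines of
free-form header text and one summary line at the end::

   import numpy as np
   from io import StringIO

   data = "Station log\nvalues in metres\n1 2\n3 4\ntotal 6"

   # Drop the two header lines and the final summary line.
   arr = np.genfromtxt(StringIO(data), skip_header=2, skip_footer=1)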


The ``usecols`` argument
------------------------

In some cases, we are not interested in all the columns of the data but
only a few of them.  We can select which columns to import with the
``usecols`` argument.  This argument accepts a single integer or a
sequence of integers corresponding to the indices of the columns to import.
Remember that by convention, the first column has an index of 0.  Negative
integers behave the same as regular Python negative indexes.

For example, if we want to import only the first and the last columns, we
can use ``usecols=(0, -1)``::

   >>> data = u"1 2 3\n4 5 6"
   >>> np.genfromtxt(StringIO(data), usecols=(0, -1))
   array([[ 1.,  3.],
          [ 4.,  6.]])

If the columns have names, we can also select which columns to import by
giving their name to the ``usecols`` argument, either as a sequence
of strings or a comma-separated string::

   >>> data = u"1 2 3\n4 5 6"
   >>> np.genfromtxt(StringIO(data),
   ...               names="a, b, c", usecols=("a", "c"))
   array([(1.0, 3.0), (4.0, 6.0)],
         dtype=[('a', '<f8'), ('c', '<f8')])
   >>> np.genfromtxt(StringIO(data),
   ...               names="a, b, c", usecols=("a, c"))
   array([(1.0, 3.0), (4.0, 6.0)],
         dtype=[('a', '<f8'), ('c', '<f8')])
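
When ``usecols`` is a single integer, only that column is read.  A short
sketch on the same kind of data::

   import numpy as np
   from io import StringIO

   data = "1 2 3\n4 5 6"

   # A single index selects one column; the result is one-dimensional.
   second = np.genfromtxt(StringIO(data), usecols=1)

   # A negative index counts from the last column.
   last = np.genfromtxt(StringIO(data), usecols=-1)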




Choosing the data type
======================

The main way to control how the sequences of strings we have read from the
file are converted to other types is to set the ``dtype`` argument.
Acceptable values for this argument are:

* a single type, such as ``dtype=float``.
  The output will be 2D with the given dtype, unless a name has been
  associated with each column with the use of the ``names`` argument
  (see below).  Note that ``dtype=float`` is the default for
  :func:`~numpy.genfromtxt`.
* a sequence of types, such as ``dtype=(int, float, float)``.
* a comma-separated string, such as ``dtype="i4,f8,|U3"``.
* a dictionary with two keys ``'names'`` and ``'formats'``.
* a sequence of tuples ``(name, type)``, such as
  ``dtype=[('A', int), ('B', float)]``.
* an existing :class:`numpy.dtype` object.
* the special value ``None``.
  In that case, the type of the columns will be determined from the data
  itself (see below).

In all the cases but the first one, the output will be a 1D array with a
structured dtype.  This dtype has as many fields as items in the sequence.
The field names are defined with the ``names`` keyword.


When ``dtype=None``, the type of each column is determined iteratively from
its data.  We start by checking whether a string can be converted to a
boolean (that is, if the string matches ``true`` or ``false`` in
lowercase); then whether it can be converted to an integer, then to a float,
then to a complex and eventually to a string.  This behavior may be changed
by modifying the default mapper of the
:class:`~numpy.lib._iotools.StringConverter` class.

The option ``dtype=None`` is provided for convenience.  However, it is
significantly slower than setting the dtype explicitly.
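
The sketch below tries three of these forms on the same made-up data.  Passing
``encoding=None`` in the last call is an assumption on our part: it asks for
text columns to come back as ``str`` rather than ``bytes`` on NumPy versions
where that is not already the default::

   import numpy as np
   from io import StringIO

   data = "1 2.5 abc\n2 3.5 def"

   # A comma-separated string of types; fields get default names.
   a = np.genfromtxt(StringIO(data), dtype="i4,f8,U3")

   # A dictionary with 'names' and 'formats' keys.
   b = np.genfromtxt(
       StringIO(data),
       dtype={"names": ("n", "x", "s"), "formats": ("i4", "f8", "U3")},
   )

   # dtype=None: the type of each column is inferred from its content.
   c = np.genfromtxt(StringIO(data), dtype=None, encoding=None)

   print(a.dtype.names, b.dtype.names, c.dtype)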



Setting the names
=================

The ``names`` argument
----------------------

A natural approach when dealing with tabular data is to allocate a name to
each column.  A first possibility is to use an explicit structured dtype,
as mentioned previously::

   >>> data = StringIO("1 2 3\n 4 5 6")
   >>> np.genfromtxt(data, dtype=[(_, int) for _ in "abc"])
   array([(1, 2, 3), (4, 5, 6)],
         dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<i8')])

Another, simpler possibility is to use the ``names`` keyword with a
sequence of strings or a comma-separated string::

   >>> data = StringIO("1 2 3\n 4 5 6")
   >>> np.genfromtxt(data, names="A, B, C")
   array([(1.0, 2.0, 3.0), (4.0, 6.0, 6.0)],
         dtype=[('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

In the example above, we used the fact that by default, ``dtype=float``.
By giving a sequence of names, we are forcing the output to a structured
dtype.
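
Once the output has a structured dtype, each column can be retrieved by its
name.  A brief sketch::

   import numpy as np
   from io import StringIO

   arr = np.genfromtxt(StringIO("1 2 3\n 4 5 6"), names="A, B, C")

   # The result is 1D: one record per line of input.
   print(arr.shape)

   # Indexing with a field name returns that column as a plain array.
   column_b = arr["B"]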

We may sometimes need to define the column names from the data itself.  In
that case, we must use the ``names`` keyword with a value of
``True``.  The names will then be read from the first line (after the
``skip_header`` ones), even if the line is commented out::

   >>> data = StringIO("So it goes\n#a b c\n1 2 3\n 4 5 6")
   >>> np.genfromtxt(data, skip_header=1, names=True)
   array([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)],
         dtype=[('a', '<f8'), ('b', '<f8'), ('c', '<f8')])

The default value of ``names`` is ``None``.  If we give any other
value to the keyword, the new names will overwrite the field names we may
have defined with the dtype::

   >>> data = StringIO("1 2 3\n 4 5 6")
   >>> ndtype=[('a',int), ('b', float), ('c', int)]
   >>> names = ["A", "B", "C"]
   >>> np.genfromtxt(data, names=names, dtype=ndtype)
   array([(1, 2.0, 3), (4, 5.0, 6)],
         dtype=[('A', '<i8'), ('B', '<f8'), ('C', '<i8')])


The ``defaultfmt`` argument
---------------------------

If ``names=None`` but a structured dtype is expected, names are defined
with the standard NumPy default of ``"f%i"``, yielding names like ``f0``,
``f1`` and so forth::

   >>> data = StringIO("1 2 3\n 4 5 6")
   >>> np.genfromtxt(data, dtype=(int, float, int))
   array([(1, 2.0, 3), (4, 5.0, 6)],
         dtype=[('f0', '<i8'), ('f1', '<f8'), ('f2', '<i8')])

In the same way, if we don't give enough names to match the length of the
dtype, the missing names will be defined with this default template::

   >>> data = StringIO("1 2 3\n 4 5 6")
   >>> np.genfromtxt(data, dtype=(int, float, int), names="a")
   array([(1, 2.0, 3), (4, 5.0, 6)],
         dtype=[('a', '<i8'), ('f0', '<f8'), ('f1', '<i8')])

We can override this default with the ``defaultfmt`` argument, which
takes any format string::

   >>> data = StringIO("1 2 3\n 4 5 6")
   >>> np.genfromtxt(data, dtype=(int, float, int), defaultfmt="var_%02i")
   array([(1, 2.0, 3), (4, 5.0, 6)],
         dtype=[('var_00', '<i8'), ('var_01', '<f8'), ('var_02', '<i8')])

.. note::

   We need to keep in mind that ``defaultfmt`` is used only if some names
   are expected but not defined.


Validating names
----------------

NumPy arrays with a structured dtype can also be viewed as
:class:`~numpy.recarray`, where a field can be accessed as if it were an
attribute.  For that reason, we may need to make sure that the field name
doesn't contain any space or invalid character, or that it does not
correspond to the name of a standard attribute (like ``size`` or
``shape``), which would confuse the interpreter.  :func:`~numpy.genfromtxt`
accepts three optional arguments that provide finer control over the names:

   ``deletechars``
      Gives a string combining all the characters that must be deleted from
      the name. By default, invalid characters are
      ``~!@#$%^&*()-=+~\|]}[{';: /?.>,<``.
   ``excludelist``
      Gives a list of the names to exclude, such as ``return``, ``file``,
      ``print``...  If one of the input names is part of this list, an
      underscore character (``'_'``) will be appended to it.
   ``case_sensitive``
      Whether the names should be case-sensitive (``case_sensitive=True``),
      converted to upper case (``case_sensitive=False`` or
      ``case_sensitive='upper'``) or to lower case
      (``case_sensitive='lower'``).
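
A short sketch of these rules at work, using deliberately awkward names.  We
assume the default settings here, including the replacement of inner spaces by
underscores, which is governed by a separate ``replace_space`` argument::

   import numpy as np
   from io import StringIO

   data = "1 2 3\n4 5 6"

   # 'return' is in the default exclusion list, 'my col' contains a space
   # and 'a-b' contains a character from the default ``deletechars``.
   cleaned = np.genfromtxt(StringIO(data), names="return, my col, a-b")
   print(cleaned.dtype.names)

   # Force every name to upper case.
   upper = np.genfromtxt(StringIO(data), names="a, b, c",
                         case_sensitive="upper")
   print(upper.dtype.names)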



Tweaking the conversion
=======================

The ``converters`` argument
---------------------------

Usually, defining a dtype is sufficient to define how the sequence of
strings must be converted.  However, some additional control may sometimes
be required.  For example, we may want to make sure that a date in a format
``YYYY/MM/DD`` is converted to a :class:`~datetime.datetime` object, or that
a string like ``xx%`` is properly converted to a float between 0 and 1.  In
such cases, we should define conversion functions with the ``converters``
argument.

The value of this argument is typically a dictionary with column indices or
column names as keys and conversion functions as values.  These
conversion functions can either be actual functions or lambda functions. In
any case, they should accept only a string as input and output only a
single element of the wanted type.

In the following example, the second column is converted from a string
representing a percentage to a float between 0 and 1::

   >>> convertfunc = lambda x: float(x.strip(b"%"))/100.
   >>> data = u"1, 2.3%, 45.\n6, 78.9%, 0"
   >>> names = ("i", "p", "n")
   >>> # General case .....
   >>> np.genfromtxt(StringIO(data), delimiter=",", names=names)
   array([(1., nan, 45.), (6., nan, 0.)],
         dtype=[('i', '<f8'), ('p', '<f8'), ('n', '<f8')])

We need to keep in mind that by default, ``dtype=float``.  A float is
therefore expected for the second column.  However, the strings ``' 2.3%'``
and ``' 78.9%'`` cannot be converted to float and we end up having
``np.nan`` instead.  Let's now use a converter::

   >>> # Converted case ...
   >>> np.genfromtxt(StringIO(data), delimiter=",", names=names,
   ...               converters={1: convertfunc})
   array([(1.0, 0.023, 45.0), (6.0, 0.78900000000000003, 0.0)],
         dtype=[('i', '<f8'), ('p', '<f8'), ('n', '<f8')])

The same results can be obtained by using the name of the second column
(``"p"``) as key instead of its index (1)::

   >>> # Using a name for the converter ...
   >>> np.genfromtxt(StringIO(data), delimiter=",", names=names,
   ...               converters={"p": convertfunc})
   array([(1.0, 0.023, 45.0), (6.0, 0.78900000000000003, 0.0)],
         dtype=[('i', '<f8'), ('p', '<f8'), ('n', '<f8')])
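
Depending on the NumPy version and on the ``encoding`` argument, a converter
may be handed either ``bytes`` or ``str``.  The sketch below is one possible
way to write the percentage converter so that it copes with both; the helper
name ``percent_to_fraction`` is our own and not part of NumPy::

   import numpy as np
   from io import StringIO

   def percent_to_fraction(value):
       """Turn a string such as ' 2.3%' into the float 0.023."""
       if isinstance(value, bytes):
           value = value.decode()
       return float(value.strip().rstrip("%")) / 100.0

   data = "1, 2.3%, 45.\n6, 78.9%, 0"
   arr = np.genfromtxt(StringIO(data), delimiter=",",
                       names=("i", "p", "n"),
                       converters={"p": percent_to_fraction})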


Converters can also be used to provide a default for missing entries.  In
the following example, the converter ``convert`` transforms a stripped
string into the corresponding float or into -999 if the string is empty.
We need to explicitly strip white space from the string, as this is not done
by default::

   >>> data = u"1, , 3\n 4, 5, 6"
   >>> convert = lambda x: float(x.strip() or -999)
   >>> np.genfromtxt(StringIO(data), delimiter=",",
   ...               converters={1: convert})
   array([[   1., -999.,    3.],
          [   4.,    5.,    6.]])




Using missing and filling values
--------------------------------

Some entries may be missing in the dataset we are trying to import.  In a
previous example, we used a converter to transform an empty string into a
float.  However, user-defined converters may rapidly become cumbersome to
manage.

The :func:`~numpy.genfromtxt` function provides two other complementary
mechanisms: the ``missing_values`` argument is used to recognize
missing data and a second argument, ``filling_values``, is used to
process these missing data.

``missing_values``
^^^^^^^^^^^^^^^^^^

By default, any empty string is marked as missing.  We can also consider
more complex strings, such as ``"N/A"`` or ``"???"`` to represent missing
or invalid data.  The ``missing_values`` argument accepts three kinds
of values:

   a string or a comma-separated string
      This string will be used as the marker for missing data for all the
      columns
   a sequence of strings
      In that case, each item is associated with a column, in order.
   a dictionary
      Values of the dictionary are strings or sequences of strings.  The
      corresponding keys can be column indices (integers) or column names
      (strings). In addition, the special key ``None`` can be used to
      define a default applicable to all columns.


``filling_values``
^^^^^^^^^^^^^^^^^^

We know how to recognize missing data, but we still need to provide a value
for these missing entries.  By default, this value is determined from the
expected dtype according to this table:

=============  ==============
Expected type  Default
=============  ==============
``bool``       ``False``
``int``        ``-1``
``float``      ``np.nan``
``complex``    ``np.nan+0j``
``string``     ``'???'``
=============  ==============

We can get finer control over the conversion of missing values with the
``filling_values`` optional argument.  Like
``missing_values``, this argument accepts different kinds of values:

   a single value
      This will be the default for all columns
   a sequence of values
      Each entry will be the default for the corresponding column
   a dictionary
      Each key can be a column index or a column name, and the
      corresponding value should be a single object.  We can use the
      special key ``None`` to define a default for all columns.

In the following example, we suppose that the missing values are flagged
with ``"N/A"`` in the first column and by ``"???"`` in the third column.
We wish to transform these missing values to 0 if they occur in the first
and second column, and to -999 if they occur in the last column::

    >>> data = u"N/A, 2, 3\n4, ,???"
    >>> kwargs = dict(delimiter=",",
    ...               dtype=int,
    ...               names="a,b,c",
    ...               missing_values={0:"N/A", 'b':" ", 2:"???"},
    ...               filling_values={0:0, 'b':0, 2:-999})
    >>> np.genfromtxt(StringIO(data), **kwargs)
    array([(0, 2, 3), (4, 0, -999)],
          dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<i8')])
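
For the simplest case, where one marker and one replacement value apply to
every column, a single string and a single value are enough.  A minimal
sketch with made-up data::

   import numpy as np
   from io import StringIO

   data = "1,N/A,3\n4,5,N/A"

   # "N/A" flags a missing entry in any column; each one is replaced by 0.
   arr = np.genfromtxt(StringIO(data), delimiter=",",
                       missing_values="N/A", filling_values=0)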


``usemask``
^^^^^^^^^^^

We may also want to keep track of the occurrence of missing data by
constructing a boolean mask, with ``True`` entries where data was missing
and ``False`` otherwise.  To do that, we just have to set the optional
argument ``usemask`` to ``True`` (the default is ``False``).  The
output array will then be a :class:`~numpy.ma.MaskedArray`.
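
A small sketch of ``usemask`` on made-up data with two empty entries; the fill
value ``-1`` used at the end is an arbitrary choice for illustration::

   import numpy as np
   from io import StringIO

   data = "1,,3\n4,5,"

   masked = np.genfromtxt(StringIO(data), delimiter=",", usemask=True)

   # The mask is True wherever an entry was missing in the input.
   print(masked.mask)

   # Masked entries can later be replaced by a value of our choice.
   filled = masked.filled(-1)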


.. unpack=None, loose=True, invalid_raise=True)


Shortcut functions
==================

In addition to :func:`~numpy.genfromtxt`, the :mod:`numpy.lib.npyio` module
provides several convenience functions derived from
:func:`~numpy.genfromtxt`.  These functions work the same way as the
original, but they have different default values.

:func:`~numpy.lib.npyio.recfromtxt`
   Returns a standard :class:`numpy.recarray` (if ``usemask=False``) or a
   :class:`~numpy.ma.mrecords.MaskedRecords` array (if ``usemask=True``).  The
   default dtype is ``dtype=None``, meaning that the types of each column
   will be automatically determined.
:func:`~numpy.lib.npyio.recfromcsv`
   Like :func:`~numpy.lib.npyio.recfromtxt`, but with a default ``delimiter=","``.
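
A comparable result can be sketched with :func:`~numpy.genfromtxt` alone, by
setting the same kind of defaults by hand and viewing the output as a
:class:`numpy.recarray`.  This is an approximation for illustration, not a
description of how the shortcut functions are implemented; ``encoding=None``
is our own addition::

   import numpy as np
   from io import StringIO

   data = "a,b\n1,2.5\n3,4.5"

   structured = np.genfromtxt(StringIO(data), delimiter=",", names=True,
                              dtype=None, encoding=None)

   # Viewing the structured array as a recarray enables attribute access.
   records = structured.view(np.recarray)
   print(records.a, records.b)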