.. sectionauthor:: Pierre Gerard-Marchant <pierregmcode@gmail.com>

*********************************************
Importing data with :func:`~numpy.genfromtxt`
*********************************************

NumPy provides several functions to create arrays from tabular data.
We focus here on the :func:`~numpy.genfromtxt` function.

In a nutshell, :func:`~numpy.genfromtxt` runs two main loops.  The first
loop converts each line of the file into a sequence of strings.  The second
loop converts each string to the appropriate data type.  This mechanism is
slower than a single loop, but gives more flexibility.  In particular,
:func:`~numpy.genfromtxt` is able to take missing data into account, when
other faster and simpler functions like :func:`~numpy.loadtxt` cannot.

.. note::

   When giving examples, we will use the following conventions::

      >>> import numpy as np
      >>> from io import StringIO



Defining the input
==================

The only mandatory argument of :func:`~numpy.genfromtxt` is the source of
the data.  It can be a string, a list of strings, a generator or an open
file-like object with a ``read`` method, for example, a file or
:class:`io.StringIO` object.  If a single string is provided, it is assumed
to be the name of a local or remote file.  If a list of strings or a generator
returning strings is provided, each string is treated as one line in a file.
When the URL of a remote file is passed, the file is automatically downloaded
to the current directory and opened.

Recognized file types are text files and archives.  Currently, the function
recognizes ``gzip`` and ``bz2`` (``bzip2``) archives.  The type of
the archive is determined from the extension of the file: if the filename
ends with ``'.gz'``, a ``gzip`` archive is expected; if it ends with
``'.bz2'``, a ``bzip2`` archive is assumed.
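
The input forms described above can be compared side by side.  The following
is a minimal sketch rather than part of the original guide: the sample values
and the temporary file name ``sample.txt.gz`` are made up for illustration.
It reads the same two rows from a file-like object, a list of strings, a
generator and a gzip-compressed file.

```python
import gzip
import os
import tempfile
from io import StringIO

import numpy as np

text = "1 2 3\n4 5 6"
lines = text.split("\n")

# An open file-like object.
from_file_like = np.genfromtxt(StringIO(text))

# A list of strings: each string is treated as one line of a file.
from_list = np.genfromtxt(lines)

# A generator yielding one string per line.
from_generator = np.genfromtxt(line for line in lines)

# A single string is taken as a file name.  The '.gz' extension tells
# genfromtxt to expect a gzip archive.  The file name is hypothetical.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample.txt.gz")
    with gzip.open(path, "wt") as handle:
        handle.write(text)
    from_gzip = np.genfromtxt(path)

print(from_list)
```

All four calls should return the same 2D float array, since the default
``dtype`` is ``float`` and the default delimiter is white space.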



Splitting the lines into columns
================================

The ``delimiter`` argument
--------------------------

Once the file is defined and open for reading, :func:`~numpy.genfromtxt`
splits each non-empty line into a sequence of strings.  Empty or commented
lines are just skipped.  The ``delimiter`` keyword is used to define
how the splitting should take place.

Quite often, a single character marks the separation between columns.  For
example, comma-separated files (CSV) use a comma (``,``) or a semicolon
(``;``) as delimiter::

    >>> data = u"1, 2, 3\n4, 5, 6"
    >>> np.genfromtxt(StringIO(data), delimiter=",")
    array([[ 1.,  2.,  3.],
           [ 4.,  5.,  6.]])

Another common separator is ``"\t"``, the tabulation character.  However,
we are not limited to a single character: any string will do.  By default,
:func:`~numpy.genfromtxt` assumes ``delimiter=None``, meaning that the line
is split along white spaces (including tabs) and that consecutive white
spaces are considered as a single white space.

Alternatively, we may be dealing with a fixed-width file, where columns are
defined as a given number of characters.  In that case, we need to set
``delimiter`` to a single integer (if all the columns have the same
size) or to a sequence of integers (if columns can have different sizes)::

    >>> data = u"  1  2  3\n  4  5 67\n890123  4"
    >>> np.genfromtxt(StringIO(data), delimiter=3)
    array([[   1.,    2.,    3.],
           [   4.,    5.,   67.],
           [ 890.,  123.,    4.]])
    >>> data = u"123456789\n   4  7 9\n   4567 9"
    >>> np.genfromtxt(StringIO(data), delimiter=(4, 3, 2))
    array([[ 1234.,   567.,    89.],
           [    4.,     7.,     9.],
           [    4.,   567.,     9.]])


The ``autostrip`` argument
--------------------------

By default, when a line is decomposed into a series of strings, the
individual entries are not stripped of leading or trailing white spaces.
This behavior can be overridden by setting the optional argument
``autostrip`` to a value of ``True``::

    >>> data = u"1, abc , 2\n 3, xxx, 4"
    >>> # Without autostrip
    >>> np.genfromtxt(StringIO(data), delimiter=",", dtype="|U5")
    array([['1', ' abc ', ' 2'],
           ['3', ' xxx', ' 4']], dtype='<U5')
    >>> # With autostrip
    >>> np.genfromtxt(StringIO(data), delimiter=",", dtype="|U5", autostrip=True)
    array([['1', 'abc', '2'],
           ['3', 'xxx', '4']], dtype='<U5')


The ``comments`` argument
-------------------------

The optional argument ``comments`` is used to define a character
string that marks the beginning of a comment.  By default,
:func:`~numpy.genfromtxt` assumes ``comments='#'``.  The comment marker may
occur anywhere on the line.  Any character present after the comment
marker(s) is simply ignored::

    >>> data = u"""#
    ... # Skip me !
    ... # Skip me too !
    ... 1, 2
    ... 3, 4
    ... 5, 6 #This is the third line of the data
    ... 7, 8
    ... # And here comes the last line
    ... 9, 0
    ... """
    >>> np.genfromtxt(StringIO(data), comments="#", delimiter=",")
    array([[1., 2.],
           [3., 4.],
           [5., 6.],
           [7., 8.],
           [9., 0.]])

.. versionadded:: 1.7.0

    When ``comments`` is set to ``None``, no lines are treated as comments.

.. note::

   There is one notable exception to this behavior: if the optional argument
   ``names=True``, the first commented line will be examined for names.


Skipping lines and choosing columns
===================================

The ``skip_header`` and ``skip_footer`` arguments
-------------------------------------------------

The presence of a header in the file can hinder data processing.  In that
case, we need to use the ``skip_header`` optional argument.
The value of this argument must be an integer that corresponds to the number
of lines to skip at the beginning of the file, before any other action is
performed.  Similarly, we can skip the last ``n`` lines of the file by
using the ``skip_footer`` argument and giving it a value of ``n``::

    >>> data = u"\n".join(str(i) for i in range(10))
    >>> np.genfromtxt(StringIO(data))
    array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.])
    >>> np.genfromtxt(StringIO(data),
    ...               skip_header=3, skip_footer=5)
    array([ 3.,  4.])

By default, ``skip_header=0`` and ``skip_footer=0``, meaning that no lines
are skipped.


The ``usecols`` argument
------------------------

In some cases, we are not interested in all the columns of the data but
only a few of them.  We can select which columns to import with the
``usecols`` argument.  This argument accepts a single integer or a
sequence of integers corresponding to the indices of the columns to import.
Remember that by convention, the first column has an index of 0.  Negative
integers behave the same as regular Python negative indexes.

For example, if we want to import only the first and the last columns, we
can use ``usecols=(0, -1)``::

    >>> data = u"1 2 3\n4 5 6"
    >>> np.genfromtxt(StringIO(data), usecols=(0, -1))
    array([[ 1.,  3.],
           [ 4.,  6.]])

If the columns have names, we can also select which columns to import by
giving their name to the ``usecols`` argument, either as a sequence
of strings or a comma-separated string::

    >>> data = u"1 2 3\n4 5 6"
    >>> np.genfromtxt(StringIO(data),
    ...               names="a, b, c", usecols=("a", "c"))
    array([(1.0, 3.0), (4.0, 6.0)],
          dtype=[('a', '<f8'), ('c', '<f8')])
    >>> np.genfromtxt(StringIO(data),
    ...               names="a, b, c", usecols=("a, c"))
    array([(1.0, 3.0), (4.0, 6.0)],
          dtype=[('a', '<f8'), ('c', '<f8')])




Choosing the data type
======================

The main way to control how the sequences of strings we have read from the
file are converted to other types is to set the ``dtype`` argument.
Acceptable values for this argument are:

* a single type, such as ``dtype=float``.
  The output will be 2D with the given dtype, unless a name has been
  associated with each column with the use of the ``names`` argument
  (see below).  Note that ``dtype=float`` is the default for
  :func:`~numpy.genfromtxt`.
* a sequence of types, such as ``dtype=(int, float, float)``.
* a comma-separated string, such as ``dtype="i4,f8,|U3"``.
* a dictionary with two keys ``'names'`` and ``'formats'``.
* a sequence of tuples ``(name, type)``, such as
  ``dtype=[('A', int), ('B', float)]``.
* an existing :class:`numpy.dtype` object.
* the special value ``None``.
  In that case, the type of the columns will be determined from the data
  itself (see below).

In all the cases but the first one, the output will be a 1D array with a
structured dtype.  This dtype has as many fields as items in the sequence.
The field names are defined with the ``names`` keyword.


When ``dtype=None``, the type of each column is determined iteratively from
its data.  We start by checking whether a string can be converted to a
boolean (that is, if the string matches ``true`` or ``false`` in lower
case); then whether it can be converted to an integer, then to a float,
then to a complex number and finally to a string.  This behavior may be
changed by modifying the default mapper of the
:class:`~numpy.lib._iotools.StringConverter` class.

The option ``dtype=None`` is provided for convenience.  However, it is
significantly slower than setting the dtype explicitly.
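
The ``dtype`` forms listed above can be tried on one small input.  This is a
minimal sketch with made-up sample values, not an excerpt from the library
documentation; it only uses numeric columns so that the result does not
depend on how strings are decoded.

```python
from io import StringIO

import numpy as np

text = "1 2.5\n3 4.5"

# A single type: the output is a plain 2D array.
as_float = np.genfromtxt(StringIO(text), dtype=float)

# A sequence of (name, type) tuples: the output is a 1D structured array.
as_named = np.genfromtxt(StringIO(text), dtype=[("A", int), ("B", float)])

# A comma-separated string: the fields get the default names f0 and f1.
as_string = np.genfromtxt(StringIO(text), dtype="i4,f8")

# dtype=None: the type of each column is guessed from the data itself
# (an integer column and a float column in this input).
as_guessed = np.genfromtxt(StringIO(text), dtype=None)

print(as_float.shape, as_named.dtype.names, as_guessed.dtype.names)
```

Only the first call returns a 2D array; the three others return 1D arrays
whose fields are accessed by name, for example ``as_named["A"]``.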



Setting the names
=================

The ``names`` argument
----------------------

A natural approach when dealing with tabular data is to allocate a name to
each column.  A first possibility is to use an explicit structured dtype,
as mentioned previously::

    >>> data = StringIO("1 2 3\n 4 5 6")
    >>> np.genfromtxt(data, dtype=[(_, int) for _ in "abc"])
    array([(1, 2, 3), (4, 5, 6)],
          dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<i8')])

Another simpler possibility is to use the ``names`` keyword with a
sequence of strings or a comma-separated string::

    >>> data = StringIO("1 2 3\n 4 5 6")
    >>> np.genfromtxt(data, names="A, B, C")
    array([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)],
          dtype=[('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

In the example above, we used the fact that by default, ``dtype=float``.
By giving a sequence of names, we are forcing the output to a structured
dtype.

We may sometimes need to define the column names from the data itself.  In
that case, we must use the ``names`` keyword with a value of
``True``.  The names will then be read from the first line (after the
``skip_header`` ones), even if the line is commented out::

    >>> data = StringIO("So it goes\n#a b c\n1 2 3\n 4 5 6")
    >>> np.genfromtxt(data, skip_header=1, names=True)
    array([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)],
          dtype=[('a', '<f8'), ('b', '<f8'), ('c', '<f8')])

The default value of ``names`` is ``None``.
If we give any other
value to the keyword, the new names will overwrite the field names we may
have defined with the dtype::

    >>> data = StringIO("1 2 3\n 4 5 6")
    >>> ndtype = [('a', int), ('b', float), ('c', int)]
    >>> names = ["A", "B", "C"]
    >>> np.genfromtxt(data, names=names, dtype=ndtype)
    array([(1, 2.0, 3), (4, 5.0, 6)],
          dtype=[('A', '<i8'), ('B', '<f8'), ('C', '<i8')])


The ``defaultfmt`` argument
---------------------------

If ``names=None`` but a structured dtype is expected, names are defined
with the standard NumPy default of ``"f%i"``, yielding names like ``f0``,
``f1`` and so forth::

    >>> data = StringIO("1 2 3\n 4 5 6")
    >>> np.genfromtxt(data, dtype=(int, float, int))
    array([(1, 2.0, 3), (4, 5.0, 6)],
          dtype=[('f0', '<i8'), ('f1', '<f8'), ('f2', '<i8')])

In the same way, if we don't give enough names to match the length of the
dtype, the missing names will be defined with this default template::

    >>> data = StringIO("1 2 3\n 4 5 6")
    >>> np.genfromtxt(data, dtype=(int, float, int), names="a")
    array([(1, 2.0, 3), (4, 5.0, 6)],
          dtype=[('a', '<i8'), ('f0', '<f8'), ('f1', '<i8')])

We can override this default with the ``defaultfmt`` argument, which
takes any format string::

    >>> data = StringIO("1 2 3\n 4 5 6")
    >>> np.genfromtxt(data, dtype=(int, float, int), defaultfmt="var_%02i")
    array([(1, 2.0, 3), (4, 5.0, 6)],
          dtype=[('var_00', '<i8'), ('var_01', '<f8'), ('var_02', '<i8')])

.. note::

   We need to keep in mind that ``defaultfmt`` is used only if some names
   are expected but not defined.


Validating names
----------------

NumPy arrays with a structured dtype can also be viewed as
:class:`~numpy.recarray`, where a field can be accessed as if it were an
attribute.
For that reason, we may need to make sure that the field name
doesn't contain any space or invalid character, or that it does not
correspond to the name of a standard attribute (like ``size`` or
``shape``), which would confuse the interpreter.  :func:`~numpy.genfromtxt`
accepts three optional arguments that provide finer control over the names:

   ``deletechars``
      Gives a string combining all the characters that must be deleted from
      the name.  By default, invalid characters are
      ``~!@#$%^&*()-=+~\|]}[{';: /?.>,<``.
   ``excludelist``
      Gives a list of the names to exclude, such as ``return``, ``file``,
      ``print``...  If one of the input names is part of this list, an
      underscore character (``'_'``) will be appended to it.
   ``case_sensitive``
      Whether the names should be case-sensitive (``case_sensitive=True``),
      converted to upper case (``case_sensitive=False`` or
      ``case_sensitive='upper'``) or to lower case
      (``case_sensitive='lower'``).



Tweaking the conversion
=======================

The ``converters`` argument
---------------------------

Usually, defining a dtype is sufficient to define how the sequence of
strings must be converted.  However, some additional control may sometimes
be required.  For example, we may want to make sure that a date in a format
``YYYY/MM/DD`` is converted to a :class:`~datetime.datetime` object, or that
a string like ``xx%`` is properly converted to a float between 0 and 1.  In
such cases, we should define conversion functions with the ``converters``
argument.

The value of this argument is typically a dictionary with column indices or
column names as keys and conversion functions as values.  These
conversion functions can either be actual functions or lambda functions.  In
any case, they should accept only a string as input and output only a
single element of the wanted type.
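
As an illustration of a named conversion function, here is a minimal sketch
for the ``YYYY/MM/DD`` case.  It rests on several assumptions: the helper
``date_to_ordinal`` and the sample data are hypothetical, and the date is
stored as a day ordinal (a plain number) instead of a
:class:`~datetime.datetime` object so that the column stays numeric.

```python
from datetime import datetime
from io import StringIO

import numpy as np


def date_to_ordinal(field):
    """Convert a 'YYYY/MM/DD' field to a proleptic Gregorian ordinal."""
    # Depending on the NumPy version and the ``encoding`` setting, the
    # converter may receive bytes instead of str, so both are handled.
    if isinstance(field, bytes):
        field = field.decode("ascii")
    return datetime.strptime(field.strip(), "%Y/%m/%d").toordinal()


text = "2020/01/15 1.5\n2020/02/01 2.5"
result = np.genfromtxt(StringIO(text), names="day, value",
                       converters={"day": date_to_ordinal})

# Recover calendar dates from the stored ordinals.
days = [datetime.fromordinal(int(d)).date() for d in result["day"]]
print(days, result["value"])
```

The converter is keyed by column name here; using the column index ``0``
as the key would select the same column.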

In the following example, the second column is converted from a string
representing a percentage to a float between 0 and 1::

    >>> convertfunc = lambda x: float(x.strip(b"%"))/100.
    >>> data = u"1, 2.3%, 45.\n6, 78.9%, 0"
    >>> names = ("i", "p", "n")
    >>> # General case .....
    >>> np.genfromtxt(StringIO(data), delimiter=",", names=names)
    array([(1., nan, 45.), (6., nan, 0.)],
          dtype=[('i', '<f8'), ('p', '<f8'), ('n', '<f8')])

We need to keep in mind that by default, ``dtype=float``.  A float is
therefore expected for the second column.  However, the strings ``' 2.3%'``
and ``' 78.9%'`` cannot be converted to float and we end up having
``np.nan`` instead.  Let's now use a converter::

    >>> # Converted case ...
    >>> np.genfromtxt(StringIO(data), delimiter=",", names=names,
    ...               converters={1: convertfunc})
    array([(1.0, 0.023, 45.0), (6.0, 0.78900000000000003, 0.0)],
          dtype=[('i', '<f8'), ('p', '<f8'), ('n', '<f8')])

The same results can be obtained by using the name of the second column
(``"p"``) as key instead of its index (1)::

    >>> # Using a name for the converter ...
    >>> np.genfromtxt(StringIO(data), delimiter=",", names=names,
    ...               converters={"p": convertfunc})
    array([(1.0, 0.023, 45.0), (6.0, 0.78900000000000003, 0.0)],
          dtype=[('i', '<f8'), ('p', '<f8'), ('n', '<f8')])


Converters can also be used to provide a default for missing entries.  In
the following example, the converter ``convert`` transforms a stripped
string into the corresponding float or into -999 if the string is empty.
We need to explicitly strip white space from the string, as this is not done
by default::

    >>> data = u"1, , 3\n 4, 5, 6"
    >>> convert = lambda x: float(x.strip() or -999)
    >>> np.genfromtxt(StringIO(data), delimiter=",",
    ...               converters={1: convert})
    array([[   1., -999.,    3.],
           [   4.,    5.,    6.]])




Using missing and filling values
--------------------------------

Some entries may be missing in the dataset we are trying to import.  In a
previous example, we used a converter to transform an empty string into a
float.  However, user-defined converters may rapidly become cumbersome to
manage.

The :func:`~numpy.genfromtxt` function provides two other complementary
mechanisms: the ``missing_values`` argument is used to recognize
missing data and a second argument, ``filling_values``, is used to
process these missing data.

``missing_values``
------------------

By default, any empty string is marked as missing.  We can also consider
more complex strings, such as ``"N/A"`` or ``"???"`` to represent missing
or invalid data.  The ``missing_values`` argument accepts three kinds
of values:

   a string or a comma-separated string
      This string will be used as the marker for missing data for all the
      columns.
   a sequence of strings
      In that case, each item is associated with a column, in order.
   a dictionary
      Values of the dictionary are strings or sequences of strings.  The
      corresponding keys can be column indices (integers) or column names
      (strings).  In addition, the special key ``None`` can be used to
      define a default applicable to all columns.


``filling_values``
------------------

We know how to recognize missing data, but we still need to provide a value
for these missing entries.
By default, this value is determined from the
expected dtype according to this table:

=============  ==============
Expected type  Default
=============  ==============
``bool``       ``False``
``int``        ``-1``
``float``      ``np.nan``
``complex``    ``np.nan+0j``
``string``     ``'???'``
=============  ==============

We can get finer control over the conversion of missing values with the
``filling_values`` optional argument.  Like
``missing_values``, this argument accepts different kinds of values:

   a single value
      This will be the default for all columns.
   a sequence of values
      Each entry will be the default for the corresponding column.
   a dictionary
      Each key can be a column index or a column name, and the
      corresponding value should be a single object.  We can use the
      special key ``None`` to define a default for all columns.

In the following example, we suppose that the missing values are flagged
with ``"N/A"`` in the first column and by ``"???"`` in the third column.
We wish to transform these missing values to 0 if they occur in the first
and second column, and to -999 if they occur in the last column::

    >>> data = u"N/A, 2, 3\n4, ,???"
    >>> kwargs = dict(delimiter=",",
    ...               dtype=int,
    ...               names="a,b,c",
    ...               missing_values={0:"N/A", 'b':" ", 2:"???"},
    ...               filling_values={0:0, 'b':0, 2:-999})
    >>> np.genfromtxt(StringIO(data), **kwargs)
    array([(0, 2, 3), (4, 0, -999)],
          dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<i8')])


``usemask``
-----------

We may also want to keep track of the occurrence of missing data by
constructing a boolean mask, with ``True`` entries where data was missing
and ``False`` otherwise.  To do that, we just have to set the optional
argument ``usemask`` to ``True`` (the default is ``False``).  The
output array will then be a :class:`~numpy.ma.MaskedArray`.


.. unpack=None, loose=True, invalid_raise=True)


Shortcut functions
==================

In addition to :func:`~numpy.genfromtxt`, the :mod:`numpy.lib.npyio` module
provides several convenience functions derived from
:func:`~numpy.genfromtxt`.  These functions work the same way as the
original, but they have different default values.

:func:`~numpy.npyio.recfromtxt`
    Returns a standard :class:`numpy.recarray` (if ``usemask=False``) or a
    :class:`~numpy.ma.mrecords.MaskedRecords` array (if ``usemask=True``).  The
    default dtype is ``dtype=None``, meaning that the types of each column
    will be automatically determined.
:func:`~numpy.npyio.recfromcsv`
    Like :func:`~numpy.npyio.recfromtxt`, but with a default ``delimiter=","``.
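
A comparable result can be obtained with :func:`~numpy.genfromtxt` alone,
which may be useful because the shortcut names are not available in every
NumPy release.  The following is a minimal sketch under stated assumptions:
the sample data is made up, ``names=True`` is chosen here for illustration,
and the call only approximates the defaults described above.

```python
from io import StringIO

import numpy as np

text = "a,b\n1,2.5\n3,4.5"

# Comma delimiter, automatic type detection and names read from the first
# line, followed by a view as a record array for attribute access.
records = np.genfromtxt(StringIO(text), delimiter=",", dtype=None,
                        names=True).view(np.recarray)

print(records.a, records.b)
```

Viewing the structured array as :class:`numpy.recarray` is what allows the
fields to be read as ``records.a`` and ``records.b``.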