.. py:currentmodule:: Orange.data.io

################################
Loading and saving data (``io``)
################################

:obj:`Orange.data.Table` supports loading from several file formats:

* Comma-separated values (\*.csv) file,
* Tab-separated values (\*.tab, \*.tsv) file,
* Excel spreadsheet (\*.xls, \*.xlsx),
* Basket file,
* Python pickle.

In addition, the text-based files (CSV, TSV) can be compressed with gzip,
bzip2 or xz (e.g. \*.csv.gz).


Header Format
=============

The data in CSV, TSV, and Excel files can be described in an extended
three-line header format, or a condensed single-line header format.


Three-line header format
------------------------

A three-line header consists of:

1. **Feature names** on the first line. Feature names can include any combination
   of characters.

2. **Feature types** on the second line. The type is determined automatically,
   or, if set, can be any of the following:

   * ``discrete`` (or ``d``): imported as :obj:`Orange.data.DiscreteVariable`,
   * a space-separated **list of discrete values**, like "``male female``",
     which results in an :obj:`Orange.data.DiscreteVariable` with those values
     in that order. If an individual value contains a space character, the
     space must be escaped by prefixing it with a backslash ('\\'),
   * ``continuous`` (or ``c``): imported as :obj:`Orange.data.ContinuousVariable`,
   * ``string`` (or ``s``, or ``text``): imported as :obj:`Orange.data.StringVariable`,
   * ``time`` (or ``t``): imported as :obj:`Orange.data.TimeVariable`, if the
     values parse as `ISO 8601 <https://en.wikipedia.org/wiki/ISO_8601>`_ date/time formats,
   * ``basket``: used for storing sparse data. Basket formats are described in a
     dedicated section.

3. **Flags** (optional) on the third header line. A feature's flag can be empty,
   or it can contain a consistent, space-separated combination of:

   * ``class`` (or ``c``): the feature is imported as a class variable.
     Most algorithms expect a single class variable,
   * ``meta`` (or ``m``): the feature is imported as a meta-attribute, which
     describes the data instance but is not used for learning,
   * ``weight`` (or ``w``): the feature marks the weight of examples (in
     algorithms that support weighted examples),
   * ``ignore`` (or ``i``): the feature is not imported,
   * ``<key>=<value>``: custom attributes.

Example of the iris dataset in Orange's three-line format
(:download:`iris.tab <../../../../Orange/datasets/iris.tab>`).

.. literalinclude:: ../../../../Orange/datasets/iris.tab
   :lines: 1-7


Single-line header format
-------------------------

A single-line header consists of feature names prefixed by an optional "``<flags>#``"
string, i.e. flags followed by a hash ('#') sign. The flags can be a consistent
combination of:

* ``c`` for class feature,
* ``i`` for feature to be ignored,
* ``m`` for meta attributes (not used in learning),
* ``C`` for features that are continuous,
* ``D`` for features that are discrete,
* ``T`` for features that represent date and/or time in one of the ISO 8601
  formats,
* ``S`` for string features.

If some or all names or flags are omitted, the names, types, and flags are
inferred automatically, which gives the correct result in most cases.


Baskets
=======

Baskets can be used for storing sparse data in tab-delimited files. They were
designed specifically for text mining. If text mining and sparse data are
not your business, you can skip this section.

Baskets are given as a list of space-separated ``<name>=<value>`` atoms. A
continuous meta attribute named ``<name>`` is created and added to the domain
as optional if it is not already there. A meta value for that variable is
added to the example. If the value is 1, you can omit the ``=<value>`` part.

Meta attributes of types other than continuous cannot be put in the
basket.

A tab-delimited file with a basket can look like this::

    K       Ca      b_foo     Ba  y
    c       c       basket    c   c
            meta              i   class
    0.06    8.75    a b a c   0   1
    0.48            b=2 d     0   1
    0.39    7.78              0   1
    0.57    8.22    c=13      0   1

These are the examples read from such a file::

    [0.06, 1], {"Ca":8.75, "a":2.000, "b":1.000, "c":1.000}
    [0.48, 1], {"Ca":?, "b":2.000, "d":1.000}
    [0.39, 1], {"Ca":7.78}
    [0.57, 1], {"Ca":8.22, "c":13.000}

It is recommended to have the basket as the last column, especially if it
contains a lot of data.

Note a few things. The basket column's name, ``b_foo``, is not used. In the first
example, the value of ``a`` is 2 since it appears twice. The ordinary meta
attribute, ``Ca``, appears in all examples, even in those where its value is
undefined. Meta attributes from the basket appear only where they are defined.
This is due to the different nature of these meta attributes: ``Ca`` is required
while the others are optional. ::

    >>> d.domain.metas()
    {-6: FloatVariable 'd', -22: FloatVariable 'Ca', -5: FloatVariable 'c', -4: FloatVariable 'b', -3: FloatVariable 'a'}

To fully understand all this, you should read the documentation on :ref:`meta
attributes <meta-attributes>` in Domain and on the :ref:`basket file format
<basket-format>` (a simple format that is limited to baskets only).

.. _basket-format:

Basket Format
-------------

Basket files (.basket) are suitable for representing sparse data. Each example
is represented by a line in the file. The line is written as a comma-separated
list of name-value pairs. Here's an example of such a file. ::

    nobody, expects, the, Spanish, Inquisition=5
    our, chief, weapon, is, surprise=3, surprise=2, and, fear,fear, and, surprise
    our, two, weapons, are, fear, and, surprise, and, ruthless, efficiency
    to, the, Pope, and, nice, red, uniforms, oh damn

The file contains four examples. The first example has five attributes
defined, "nobody", "expects", "the", "Spanish" and "Inquisition"; the first
four have (the default) value of 1.0 and the last has a value of 5.0.

Unlike in other formats supported by Orange, the attributes that appear in the
domain are not defined in any header or separate file.

If an attribute appears more than once, its values are added. For instance, the
value of attribute "surprise" in the second example is 6.0 and the value of
"fear" is 2.0; the former appears three times with values of 3.0, 2.0 and 1.0,
and the latter appears twice with a value of 1.0.

All attributes are loaded as optional meta-attributes, so zero values don't
take any memory (unless they are given, but initialized to zero). See also
the section on :ref:`meta attributes <meta-attributes>` in the reference for domain
descriptors.

Note that, at the time of writing this reference, only association rules can
directly use examples presented in the basket format.
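
The rule that repeated attributes are summed, and that an attribute without
an ``=<value>`` part counts as 1.0, can be illustrated with a small
stand-alone sketch. It is not Orange's reader: ``parse_basket_line`` is a
hypothetical helper written only for this illustration, and it assumes that
atoms are separated by commas and that surrounding whitespace is ignored.

```python
from collections import defaultdict


def parse_basket_line(line):
    """Parse one line of a basket file into a {name: value} dict.

    Hypothetical helper for illustration; not part of Orange.
    """
    values = defaultdict(float)
    for atom in line.split(","):
        atom = atom.strip()
        if not atom:
            continue  # skip empty atoms, e.g. after a trailing comma
        name, sep, value = atom.partition("=")
        # An atom without "=<value>" contributes the default value 1.0;
        # repeated names accumulate by addition.
        values[name.strip()] += float(value) if sep else 1.0
    return dict(values)


line = ("our, chief, weapon, is, surprise=3, surprise=2, "
        "and, fear,fear, and, surprise")
parsed = parse_basket_line(line)
print(parsed["surprise"])  # 6.0 (3.0 + 2.0 + 1.0)
print(parsed["fear"])      # 2.0 (1.0 + 1.0)
```

Names that never occur on a line are simply absent from the resulting
dictionary, which mirrors the way optional meta-attributes take no space
when they are undefined.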