.. py:currentmodule:: Orange.data.io

################################
Loading and saving data (``io``)
################################

:obj:`Orange.data.Table` supports loading from several file formats:

* Comma-separated values (\*.csv) file,
* Tab-separated values (\*.tab, \*.tsv) file,
* Excel spreadsheet (\*.xls, \*.xlsx),
* Basket file,
* Python pickle.

In addition, the text-based files (CSV, TSV) can be compressed with gzip,
bzip2 or xz (e.g. \*.csv.gz).
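
As a minimal sketch of what such a compressed file looks like, the snippet
below writes a small CSV file through Python's standard ``gzip`` module and
reads it back. The file name ``example.csv.gz`` and its contents are made up
for illustration. The snippet uses only the standard library and does not show
how Orange itself opens the file.

.. code-block:: python

    import csv
    import gzip
    import os
    import tempfile

    rows = [
        ["sepal length", "iris"],
        ["5.1", "Iris-setosa"],
        ["7.0", "Iris-versicolor"],
    ]

    with tempfile.TemporaryDirectory() as folder:
        # Hypothetical file name; the .gz suffix marks gzip compression.
        path = os.path.join(folder, "example.csv.gz")

        # "wt" opens the compressed stream in text mode so csv can write to it.
        with gzip.open(path, "wt", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerows(rows)

        # The first two bytes of a gzip file are always 0x1f 0x8b.
        with open(path, "rb") as handle:
            magic = handle.read(2)

        # Reading in text mode decompresses transparently.
        with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
            restored = list(csv.reader(handle))

The same pattern should apply to ``bz2.open`` and ``lzma.open`` for bzip2 and
xz files.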


Header Format
=============

The data in CSV, TSV, and Excel files can be described in an extended
three-line header format, or a condensed single-line header format.


Three-line header format
------------------------

A three-line header consists of:

1. **Feature names** on the first line. Feature names can include any combination
   of characters.

2. **Feature types** on the second line. The type is determined automatically,
   or, if set, can be any of the following:

   * ``discrete`` (or ``d``) — imported as :obj:`Orange.data.DiscreteVariable`,
   * a space-separated **list of discrete values**, like "``male female``",
     which will result in :obj:`Orange.data.DiscreteVariable` with those values
     and in that order. If a value contains a space character, the space
     needs to be escaped with a backslash ('\\') character.
   * ``continuous`` (or ``c``) — imported as :obj:`Orange.data.ContinuousVariable`,
   * ``string`` (or ``s``, or ``text``) — imported as :obj:`Orange.data.StringVariable`,
   * ``time`` (or ``t``) — imported as :obj:`Orange.data.TimeVariable`, if the
     values parse as `ISO 8601 <https://en.wikipedia.org/wiki/ISO_8601>`_ date/time formats,
   * ``basket`` — used for storing sparse data. Basket formats are described
     in the Baskets section below.

3. **Flags** (optional) on the third header line. A feature's flag field can be
   empty, or it can contain a consistent, space-separated combination of:

   * ``class`` (or ``c``) — the feature will be imported as a class variable.
     Most algorithms expect a single class variable.
   * ``meta`` (or ``m``) — the feature will be imported as a meta-attribute,
     which describes the data instance but is not used for learning,
   * ``weight`` (or ``w``) — the feature marks the weight of examples (in
     algorithms that support weighted examples),
   * ``ignore`` (or ``i``) — the feature will not be imported,
   * ``<key>=<value>`` custom attributes.

An example of the iris dataset in Orange's three-line format
(:download:`iris.tab <../../../../Orange/datasets/iris.tab>`).

.. literalinclude:: ../../../../Orange/datasets/iris.tab
   :lines: 1-7
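
The three header lines described above can be read with a few lines of plain
Python. The following is only an illustrative sketch and not Orange's own
reader: the function name ``parse_three_line_header``, the ``"auto"`` marker
for an empty type cell, and the sample header (including the ``note`` column)
are assumptions made for this example.

.. code-block:: python

    import re

    # Type keywords and their abbreviations, as listed above.
    TYPE_KEYWORDS = {
        "d": "discrete", "discrete": "discrete",
        "c": "continuous", "continuous": "continuous",
        "s": "string", "string": "string", "text": "string",
        "t": "time", "time": "time",
        "basket": "basket",
    }
    # Flag abbreviations; key=value attributes are passed through unchanged.
    FLAG_KEYWORDS = {"c": "class", "m": "meta", "w": "weight", "i": "ignore"}


    def parse_three_line_header(lines, delimiter="\t"):
        """Return one dict per column describing its name, type and flags."""
        names, types, flags = (
            line.rstrip("\n").split(delimiter) for line in lines[:3]
        )
        columns = []
        for index, name in enumerate(names):
            type_cell = types[index].strip() if index < len(types) else ""
            flag_cell = flags[index] if index < len(flags) else ""
            values = []
            if not type_cell:
                kind = "auto"  # empty cell: the type is determined automatically
            elif type_cell in TYPE_KEYWORDS:
                kind = TYPE_KEYWORDS[type_cell]
            else:
                # Anything else is read as a list of discrete values; split on
                # spaces that are not escaped with a backslash.
                kind = "discrete"
                values = [v.replace("\\ ", " ")
                          for v in re.split(r"(?<!\\) +", type_cell)]
            columns.append({
                "name": name,
                "type": kind,
                "values": values,
                "flags": [FLAG_KEYWORDS.get(f, f) for f in flag_cell.split()],
            })
        return columns


    # Hypothetical header, loosely modelled on the iris example above.
    header = [
        "sepal length\tsepal width\tiris\tnote",
        "c\tc\tIris-setosa Iris-versicolor Iris-virginica\ts",
        "\t\tclass\tmeta",
    ]
    columns = parse_three_line_header(header)

Here ``columns[2]`` describes ``iris`` as a discrete class variable with three
values, and ``columns[3]`` describes ``note`` as a string meta attribute.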


Single-line header format
-------------------------

A single-line header consists of feature names prefixed by an optional "``<flags>#``"
string, i.e. flags followed by a hash ('#') sign. The flags can be a consistent
combination of:

* ``c`` for class feature,
* ``i`` for feature to be ignored,
* ``m`` for meta attributes (not used in learning),
* ``C`` for features that are continuous,
* ``D`` for features that are discrete,
* ``T`` for features that represent date and/or time in one of the ISO 8601
  formats,
* ``S`` for string features.

If some or all names or flags are omitted, the names, types, and flags are
inferred automatically; the inference is correct in most cases.
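
The prefix convention above can be illustrated with a short sketch. It is not
Orange's parser: the function name ``parse_column_title`` and the returned
tuple layout are assumptions chosen for this example, and a title without a
valid flag prefix is simply returned as a plain name.

.. code-block:: python

    import re

    # Meaning of each flag letter, as listed above.
    FLAG_MEANING = {
        "c": ("role", "class"),
        "i": ("role", "ignore"),
        "m": ("role", "meta"),
        "C": ("type", "continuous"),
        "D": ("type", "discrete"),
        "T": ("type", "time"),
        "S": ("type", "string"),
    }
    # One or more flag letters, a hash sign, then the feature name.
    PREFIX = re.compile(r"^([cimCDTS]+)#(.*)$")


    def parse_column_title(title):
        """Split '<flags>#<name>' into (name, type, role).

        A type of None means the type is left for automatic detection;
        a role of None means an ordinary feature.
        """
        match = PREFIX.match(title)
        if match is None:
            return title, None, None  # no flag prefix present
        flags, name = match.groups()
        info = {"type": None, "role": None}
        for flag in flags:
            key, value = FLAG_MEANING[flag]
            info[key] = value
        return name, info["type"], info["role"]

For instance, ``parse_column_title("cD#iris")`` returns
``("iris", "discrete", "class")``, while a title without a prefix, such as
``"sepal length"``, comes back unchanged with both type and role set to
``None``.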


Baskets
=======

Baskets can be used for storing sparse data in tab-delimited files. They were
specifically designed for text mining needs. If text mining and sparse data are
not relevant to your work, you can skip this section.

Baskets are given as a list of space-separated ``<name>=<value>`` atoms. A
continuous meta attribute named ``<name>`` will be created and added to the domain
as optional if it is not already there. A meta value for that variable will be
added to the example. If the value is 1, you can omit the ``=<value>`` part.

It is not possible to put meta attributes of types other than continuous in the
basket.
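
The atom syntax described above can be sketched in a few lines of Python. This
is an illustration only, not Orange's reader; the function name
``parse_basket_cell`` is made up for the example. Values of repeated names are
summed, which matches the example later in this section.

.. code-block:: python

    def parse_basket_cell(cell):
        """Turn space-separated '<name>=<value>' atoms into a dict of floats."""
        values = {}
        for atom in cell.split():
            name, separator, raw = atom.partition("=")
            # An atom without '=<value>' stands for the value 1.
            amount = float(raw) if separator else 1.0
            # Repeated names accumulate.
            values[name] = values.get(name, 0.0) + amount
        return values

With this sketch, ``parse_basket_cell("a b a c")`` gives
``{"a": 2.0, "b": 1.0, "c": 1.0}``, and an empty cell gives an empty dict.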

A tab-delimited file with a basket can look like this::

    K       Ca      b_foo     Ba  y
    c       c       basket    c   c
            meta              i   class
    0.06    8.75    a b a c   0   1
    0.48            b=2 d     0   1
    0.39    7.78              0   1
    0.57    8.22    c=13      0   1

These are the examples read from such a file::

    [0.06, 1], {"Ca":8.75, "a":2.000, "b":1.000, "c":1.000}
    [0.48, 1], {"Ca":?, "b":2.000, "d":1.000}
    [0.39, 1], {"Ca":7.78}
    [0.57, 1], {"Ca":8.22, "c":13.000}

It is recommended to have the basket as the last column, especially if it
contains a lot of data.

Note a few things. The basket column's name, ``b_foo``, is not used. In the first
example, the value of ``a`` is 2 since it appears twice. The ordinary meta
attribute, ``Ca``, appears in all examples, even in those where its value is
undefined. Meta attributes from the basket appear only where they are defined.
This is due to the different nature of these meta attributes: ``Ca`` is required
while the others are optional.  ::

    >>> d.domain.metas()
    {-6: FloatVariable 'd', -22: FloatVariable 'Ca', -5: FloatVariable 'c', -4: FloatVariable 'b', -3: FloatVariable 'a'}

To fully understand all this, you should read the documentation on :ref:`meta
attributes <meta-attributes>` in Domain and on the :ref:`basket file format
<basket-format>` (a simple format that is limited to baskets only).

.. _basket-format:

Basket Format
-------------

Basket files (.basket) are suitable for representing sparse data. Each example
is represented by a line in the file. The line is written as a comma-separated
list of name-value pairs. Here is an example of such a file::

    nobody, expects, the, Spanish, Inquisition=5
    our, chief, weapon, is, surprise=3, surprise=2, and, fear,fear, and, surprise
    our, two, weapons, are, fear, and, surprise, and, ruthless, efficiency
    to, the, Pope, and, nice, red, uniforms, oh damn

The file contains four examples. The first example has five attributes
defined, "nobody", "expects", "the", "Spanish" and "Inquisition"; the first
four have (the default) value of 1.0 and the last has a value of 5.0.

Unlike in other formats supported by Orange, the attributes that appear in the
domain are not defined in a header or in a separate file.

If an attribute appears more than once, its values are added. For instance, the
value of attribute "surprise" in the second example is 6.0 and the value of
"fear" is 2.0; the former appears three times with values of 3.0, 2.0 and 1.0,
and the latter appears twice with a value of 1.0.
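
The rules above (comma-separated items, a default value of 1.0, summing of
repeated attributes, and a domain that is discovered from the data rather than
declared) can be sketched as follows. This is a hedged illustration, not
Orange's basket reader; the function name ``read_basket`` and its return shape
are assumptions made for this example.

.. code-block:: python

    def read_basket(text):
        """Parse basket-format text into (examples, attribute_names)."""
        examples = []
        names = set()
        for line in text.splitlines():
            if not line.strip():
                continue  # skip empty lines
            values = {}
            for item in line.split(","):
                name, separator, raw = item.strip().partition("=")
                name = name.strip()
                if not name:
                    continue
                # A missing '=<value>' part means the default value of 1.0.
                amount = float(raw) if separator else 1.0
                # Values of a repeated attribute are added together.
                values[name] = values.get(name, 0.0) + amount
            examples.append(values)
            # The set of attributes is collected from the data itself.
            names.update(values)
        return examples, sorted(names)


    sample = (
        "nobody, expects, the, Spanish, Inquisition=5\n"
        "our, chief, weapon, is, surprise=3, surprise=2, and, fear,fear, and, surprise\n"
    )
    examples, attribute_names = read_basket(sample)

For the second line, ``examples[1]["surprise"]`` is ``6.0`` and
``examples[1]["fear"]`` is ``2.0``, in line with the description above.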

All attributes are loaded as optional meta-attributes, so zero values don't
take any memory (unless they are given explicitly with a value of zero). See
also the section on :ref:`meta attributes <meta-attributes>` in the reference
for domain descriptors.

Note that, at the time of writing, only association rules can directly use
examples presented in the basket format.