1The Data
2========
3
.. index:: data
5
This section describes how to load data into Orange. We also show how to explore the data, compute some basic statistics, and sample the data.
7
8Data Input
9----------
10
11..  index::
12    single: data; input
13
Orange can read files in its native tab-delimited format, or load data from any of the major standard spreadsheet file types, such as CSV and Excel. The native format starts with a header row with feature (column) names. The second header row gives the attribute type, which can be continuous, discrete, time, or string. The third header row contains meta information to identify dependent features (class), irrelevant features (ignore) or meta features (meta).
A more detailed specification is available in :doc:`../reference/data.io`.
Here are the first few lines from the dataset :download:`lenses.tab <code/lenses.tab>`::
17
18    age       prescription  astigmatic    tear_rate     lenses
19    discrete  discrete      discrete      discrete      discrete
20                                                        class
21    young     myope         no            reduced       none
22    young     myope         no            normal        soft
23    young     myope         yes           reduced       none
24    young     myope         yes           normal        hard
25    young     hypermetrope  no            reduced       none
26
27
Values are tab-delimited. This dataset has four attributes (age of the patient, spectacle prescription, presence of astigmatism, and tear production rate) and an associated three-valued dependent variable encoding the lens prescription for the patient (hard contact lenses, soft contact lenses, no lenses). Feature descriptions can also use a single letter, so the header of this dataset could also read::
29
30    age       prescription  astigmatic    tear_rate     lenses
31    d         d             d             d             d
32                                                        c
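
To make the three-row header concrete, the following sketch parses it with Python's standard ``csv`` module. It only illustrates the layout described above and is not Orange's own reader; the ``SAMPLE`` string and the ``read_native_header`` helper are hypothetical names introduced for this example.

.. code-block:: python

    import csv
    import io

    # Hypothetical in-memory copy of the first lines of lenses.tab.
    SAMPLE = (
        "age\tprescription\tastigmatic\ttear_rate\tlenses\n"
        "discrete\tdiscrete\tdiscrete\tdiscrete\tdiscrete\n"
        "\t\t\t\tclass\n"
        "young\tmyope\tno\treduced\tnone\n"
        "young\tmyope\tno\tnormal\tsoft\n"
    )


    def read_native_header(stream):
        """Split a tab-delimited stream into names, types, roles and rows."""
        reader = csv.reader(stream, delimiter="\t")
        names = next(reader)   # first header row: feature names
        types = next(reader)   # second header row: feature types
        roles = next(reader)   # third header row: class / ignore / meta
        # Trailing empty cells may be absent; pad so roles align with names.
        roles = roles + [""] * (len(names) - len(roles))
        rows = [row for row in reader if row]
        return names, types, roles, rows


    names, types, roles, rows = read_native_header(io.StringIO(SAMPLE))
    class_columns = [n for n, r in zip(names, roles) if r == "class"]
    print(class_columns)  # ['lenses']

In practice ``Orange.data.Table`` does this work, along with type handling that the sketch leaves out.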
33
The rest of the table gives the data. Note that there are 5 instances in our table above. For the full dataset, download :download:`lenses.tab <code/lenses.tab>` to a target directory. You can also skip this step, as Orange comes preloaded with several demo datasets, lenses being one of them. Now open a Python shell, import Orange and load the data:
35
36    >>> import Orange
37    >>> data = Orange.data.Table("lenses")
38    >>>
39
Note that the file name needs no suffix, as Orange checks whether any files in the current directory are of a readable type. The call to ``Orange.data.Table`` creates an object called ``data`` that holds your dataset and information about the lenses domain:
41
42    >>> data.domain.attributes
43    (DiscreteVariable('age', values=('pre-presbyopic', 'presbyopic', 'young')),
44     DiscreteVariable('prescription', values=('hypermetrope', 'myope')),
45     DiscreteVariable('astigmatic', values=('no', 'yes')),
46     DiscreteVariable('tear_rate', values=('normal', 'reduced')))
47    >>> data.domain.class_var
48    DiscreteVariable('lenses', values=('hard', 'none', 'soft'))
    >>> for d in data[:3]:
    ...     print(d)
    ...
52    [young, myope, no, reduced | none]
53    [young, myope, no, normal | soft]
54    [young, myope, yes, reduced | none]
55    >>>
56
The following script wraps up everything we have done so far and lists the first five data instances with ``soft`` prescription:
58
59.. literalinclude:: code/data-lenses.py
60
Note that ``data`` is an object that holds both the data and information on the domain. We showed above how to access attribute and class names, but there is much more information there, including feature types, sets of values for categorical features, and more.
62
63Saving the Data
64---------------
65
66Data objects can be saved to a file:
67
68    >>> data.save("new_data.tab")
69    >>>
70
This time, we have to provide the file extension to specify the output format. The extension for Orange's native data format is ``.tab``. The following code saves only the data items with myope prescription:
72
73.. literalinclude:: code/data-save.py
74
75We have created a new data table by passing the information on the structure of the data (``data.domain``) and a subset of data instances.
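
To show what ends up in a ``.tab`` file, the same filtering and saving can be sketched without Orange: keep the matching rows, then write the three header rows followed by the data. The sketch uses only the standard library; ``write_native`` is a hypothetical helper, and the rows are a hand-copied excerpt of the lenses data.

.. code-block:: python

    import csv
    import io

    names = ["age", "prescription", "astigmatic", "tear_rate", "lenses"]
    types = ["discrete"] * 5
    roles = ["", "", "", "", "class"]
    rows = [
        ["young", "myope", "no", "reduced", "none"],
        ["young", "myope", "no", "normal", "soft"],
        ["young", "hypermetrope", "no", "reduced", "none"],
    ]


    def write_native(stream, names, types, roles, rows):
        """Write three header rows and the data rows as tab-delimited text."""
        writer = csv.writer(stream, delimiter="\t", lineterminator="\n")
        writer.writerows([names, types, roles])
        writer.writerows(rows)


    # Keep only the instances with a myope prescription.
    column = names.index("prescription")
    myope_rows = [row for row in rows if row[column] == "myope"]

    buffer = io.StringIO()
    write_native(buffer, names, types, roles, myope_rows)
    print(buffer.getvalue())

With Orange, ``data.save`` handles the header and formatting, so none of this needs to be written by hand.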
76
77Exploration of the Data Domain
78------------------------------
79
80..  index::
81    single: data; attributes
82..  index::
83    single: data; domain
84..  index::
85    single: data; class
86
A data table stores information on data instances as well as on the data domain. The domain holds the names of attributes and optional classes, their types and, if categorical, their value names. The following code:
88
89..  literalinclude:: code/data-domain1.py
90
91outputs::
92
93    25 attributes: 14 continuous, 11 discrete
94    First three attributes: symboling, normalized-losses, make
95    Class: price
96
97Orange's objects often behave like Python lists and dictionaries, and can be indexed or accessed through feature names:
98
99..  literalinclude:: code/data-domain2.py
100    :lines: 5-
101
102The output of the above code is::
103
104    First attribute: symboling
105    Values of attribute 'fuel-type': diesel, gas
106
107Data Instances
108--------------
109
110..  index::
111    single: data; instances
112..  index::
113    single: data; examples
114
A data table stores data instances (or examples). These can be indexed or traversed like any Python list. Data instances can be considered as vectors, accessed through element index or through feature name.
116
117..  literalinclude:: code/data-instances1.py
118
119The script above displays the following output::
120
121    First three data instances:
122    [5.100, 3.500, 1.400, 0.200 | Iris-setosa]
123    [4.900, 3.000, 1.400, 0.200 | Iris-setosa]
124    [4.700, 3.200, 1.300, 0.200 | Iris-setosa]
125    25-th data instance:
126    [4.800, 3.400, 1.900, 0.200 | Iris-setosa]
127    Value of 'sepal width' for the first instance: 3.500
128    The 3rd value of the 25th data instance: 1.900
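
Access by position or by feature name can be pictured with a small stand-in class. This is a conceptual sketch only: ``Row`` is a hypothetical class written for this example and says nothing about how Orange implements its data instances.

.. code-block:: python

    class Row:
        """A list of values that can also be indexed by feature name."""

        def __init__(self, names, values):
            self.names = list(names)
            self.values = list(values)

        def __getitem__(self, key):
            # Translate a feature name into its column position.
            if isinstance(key, str):
                key = self.names.index(key)
            return self.values[key]


    names = ["sepal length", "sepal width", "petal length", "petal width"]
    first = Row(names, [5.1, 3.5, 1.4, 0.2])

    print(first["sepal width"])  # 3.5
    print(first[2])              # 1.4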
129
130The Iris dataset we have used above has four continuous attributes. Here's a script that computes their mean:
131
132..  literalinclude:: code/data-instances2.py
133    :lines: 3-
134
135The above script also illustrates indexing of data instances with objects that store features; in ``d[x]`` variable ``x`` is an Orange object. Here's the output::
136
137    Feature         Mean
138    sepal length    5.84
139    sepal width     3.05
140    petal length    3.76
141    petal width     1.20
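
The averaging itself is a column-wise operation. As a library-independent sketch, here are the means of the first three iris instances printed earlier. They do not cover the full dataset, so the numbers differ from the table above. The names ``rows`` and ``means`` are chosen for this example.

.. code-block:: python

    from statistics import mean

    names = ["sepal length", "sepal width", "petal length", "petal width"]
    rows = [
        [5.1, 3.5, 1.4, 0.2],
        [4.9, 3.0, 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
    ]

    # zip(*rows) transposes the table, yielding one tuple per feature.
    means = [mean(column) for column in zip(*rows)]

    print("{:<15}{}".format("Feature", "Mean"))
    for name, value in zip(names, means):
        print("{:<15}{:.2f}".format(name, value))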
142
143
The following, slightly more complicated but also more interesting, code computes per-class averages:
145
146..  literalinclude:: code/data-instances3.py
147    :lines: 3-
148
149Of the four features, petal width and length look quite discriminative for the type of iris::
150
151    Feature             Iris-setosa Iris-versicolor  Iris-virginica
152    sepal length               5.01            5.94            6.59
153    sepal width                3.42            2.77            2.97
154    petal length               1.46            4.26            5.55
155    petal width                0.24            1.33            2.03
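
Per-class averages amount to grouping instances by class value before averaging. The sketch below does this in plain Python on five iris instances that appear elsewhere in this section. The ``by_class`` and ``class_means`` names are introduced here, and with so few rows the numbers are only illustrative.

.. code-block:: python

    from collections import defaultdict
    from statistics import mean

    # (sepal length, sepal width, petal length, petal width), class
    instances = [
        ([5.1, 3.5, 1.4, 0.2], "Iris-setosa"),
        ([4.9, 3.0, 1.4, 0.2], "Iris-setosa"),
        ([4.8, 3.1, 1.6, 0.2], "Iris-setosa"),
        ([6.0, 2.2, 4.0, 1.0], "Iris-versicolor"),
        ([6.3, 3.4, 5.6, 2.4], "Iris-virginica"),
    ]

    # Group the feature vectors by their class label.
    by_class = defaultdict(list)
    for values, label in instances:
        by_class[label].append(values)

    # For every class, average each feature column separately.
    class_means = {
        label: [mean(column) for column in zip(*rows)]
        for label, rows in by_class.items()
    }

    for label, means in class_means.items():
        print(label, ["{:.2f}".format(m) for m in means])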
156
Finally, here is a short script that computes the class distribution for another dataset:
158
159..  literalinclude:: code/data-instances4.py
160
161Orange Datasets and NumPy
162-------------------------
163Orange datasets are actually wrapped `NumPy <http://www.numpy.org>`_ arrays. Wrapping is performed to retain the information about the feature names and values, and NumPy arrays are used for speed and compatibility with different machine learning toolboxes, like `scikit-learn <http://scikit-learn.org>`_, on which Orange relies. Let us display the values of these arrays for the first three data instances of the iris dataset::
164
165    >>> data = Orange.data.Table("iris")
166    >>> data.X[:3]
167    array([[ 5.1,  3.5,  1.4,  0.2],
168           [ 4.9,  3. ,  1.4,  0.2],
169           [ 4.7,  3.2,  1.3,  0.2]])
170    >>> data.Y[:3]
171    array([ 0.,  0.,  0.])
172
173Notice that we access the arrays for attributes and class separately, using ``data.X`` and ``data.Y``. Average values of attributes can then be computed efficiently by::
174
    >>> import numpy as np
176    >>> np.mean(data.X, axis=0)
177    array([ 5.84333333,  3.054     ,  3.75866667,  1.19866667])
178
We can also construct a (classless) dataset from a NumPy array::
180
181    >>> X = np.array([[1,2], [4,5]])
182    >>> data = Orange.data.Table(X)
183    >>> data.domain
184    [Feature 1, Feature 2]
185
If we want to provide meaningful names to attributes, we need to construct an appropriate data domain::
187
    >>> domain = Orange.data.Domain([Orange.data.ContinuousVariable("length"),
    ...                              Orange.data.ContinuousVariable("width")])
    >>> data = Orange.data.Table(domain, X)
    >>> data.domain
    [length, width]
193
194Here is another example, this time with the construction of a dataset that includes a numerical class and different types of attributes:
195
196..  literalinclude:: code/data-domain-numpy.py
197    :lines: 4-
198
Running this script yields::
200
201    [[big, 3.400, circle | 42.000],
202     [small, 2.700, oval | 52.200],
203     [big, 1.400, square | 13.400]
204
205Meta Attributes
206---------------
207
208Often, we wish to include descriptive fields in the data that will not be used in any computation (distance estimation, modeling), but will serve for identification or additional information. These are called meta attributes, and are marked with ``meta`` in the third header row:
209
210..  literalinclude:: code/zoo.tab
211
Values of meta attributes and all other (non-meta) attributes are treated similarly in Orange, but stored in separate NumPy arrays:
213
214    >>> data = Orange.data.Table("zoo")
215    >>> data[0]["name"]
216    >>> data[0]["type"]
217    >>> for d in data:
    ...     print("{}/{}: {}".format(d["name"], d["type"], d["legs"]))
    ...
220    aardvark/mammal: 4
221    antelope/mammal: 4
222    bass/fish: 0
223    bear/mammal: 4
224    >>> data.X
225    array([[ 1.,  0.,  1.,  1.,  2.],
226           [ 1.,  0.,  1.,  1.,  2.],
227           [ 0.,  1.,  0.,  1.,  0.],
           [ 1.,  0.,  1.,  1.,  2.]])
229    >>> data.metas
230    array([['aardvark'],
231           ['antelope'],
232           ['bass'],
           ['bear']], dtype=object)
234
235Meta attributes may be passed to ``Orange.data.Table`` after providing arrays for attribute and class values:
236
237..   literalinclude:: code/data-metas.py
238
239The script outputs::
240
241    [[2.200, 1625.000 | no] {houston, 10},
242     [0.300, 163.000 | yes] {ljubljana, -1}
243
244To construct a classless domain we could pass ``None`` for the class values.
245
246Missing Values
247--------------
248
249..  index::
250    single: data; missing values
251
Consider the following exploration of the dataset on US congressional votes::
253
254    >>> import numpy as np
255    >>> data = Orange.data.Table("voting.tab")
256    >>> data[2]
257    [?, y, y, ?, y, ... | democrat]
258    >>> np.isnan(data[2][0])
259    True
260    >>> np.isnan(data[2][1])
261    False
262
This particular data instance has missing values (represented with '?') for the first and the fourth attribute. In the original dataset file, missing values are, by default, represented with a blank space. We can now examine each attribute and report the proportion of data instances for which the feature is undefined:
264
265..  literalinclude:: code/data-missing.py
266    :lines: 4-
267
The first three lines of the output of this script are::
269
270     2.8% handicapped-infants
271    11.0% water-project-cost-sharing
272     2.5% adoption-of-the-budget-resolution
273
A one-liner that reports the number of data instances with at least one missing value is::
275
276    >>> sum(any(np.isnan(d[x]) for x in data.domain.attributes) for d in data)
277    203
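
Both computations, the per-feature proportion of missing values and the count of instances with at least one missing value, can be sketched on a tiny hand-made table. The sketch assumes missing entries are encoded as NaN, as the ``np.isnan`` checks above suggest. The feature names and the ``rows`` data are made up for illustration.

.. code-block:: python

    import math

    nan = float("nan")
    names = ["vote-a", "vote-b", "vote-c"]  # hypothetical feature names
    rows = [
        [nan, 1.0, 1.0],
        [0.0, 1.0, nan],
        [1.0, 0.0, 1.0],
        [nan, 0.0, 0.0],
    ]

    # Share of instances in which each feature is undefined.
    proportions = {
        name: sum(math.isnan(v) for v in column) / len(rows)
        for name, column in zip(names, zip(*rows))
    }
    for name, share in proportions.items():
        print("{:4.1f}% {}".format(100 * share, name))

    # Number of instances with at least one missing value.
    incomplete = sum(any(math.isnan(v) for v in row) for row in rows)
    print(incomplete)  # 3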
278
279.. sum([np.any(np.isnan(x)) for x in data.X])
280
281Data Selection and Sampling
282---------------------------
283
284..  index::
285    single: data; sampling
286
Besides the name of the data file, ``Orange.data.Table`` can accept the data domain and a list of data items and return a new dataset. This is useful for any data subsetting:
288
289..  literalinclude:: code/data-subsetting.py
290    :lines: 3-
291
292The code outputs::
293
294    Dataset instances: 150
295    Subset size: 99
296
The subset inherits the data description (domain) from the original dataset. Changing the domain requires setting up a new domain descriptor. This is useful for any kind of feature selection:
298
299..  literalinclude:: code/data-feature-selection.py
300    :lines: 3-
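
Row subsetting and feature selection are two independent projections of the same table: one keeps some instances, the other keeps some columns. A plain-Python sketch, with a hypothetical threshold on petal length and variable names made up for this example, might look like this:

.. code-block:: python

    names = ["sepal length", "sepal width", "petal length", "petal width"]
    rows = [
        [5.1, 3.5, 1.4, 0.2],
        [6.0, 2.2, 4.0, 1.0],
        [6.3, 3.4, 5.6, 2.4],
    ]

    # Row subsetting: keep instances whose petal length exceeds a threshold.
    petal_length = names.index("petal length")
    subset = [row for row in rows if row[petal_length] > 3.0]

    # Feature selection: keep only the chosen columns, for every instance.
    selected = ["petal length", "petal width"]
    columns = [names.index(name) for name in selected]
    reduced = [[row[i] for i in columns] for row in rows]

    print(len(rows), len(subset))  # 3 2
    print(reduced[0])              # [1.4, 0.2]

In Orange the column projection is expressed through a new domain rather than through index lists, but the effect on the data is the same.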
301
302..  index::
303    single: feature; selection
304
305We could also construct a random sample of the dataset::
306
    >>> import random
    >>> sample = Orange.data.Table(data.domain, random.sample(data, 3))
308    >>> sample
309    [[6.000, 2.200, 4.000, 1.000 | Iris-versicolor],
310     [4.800, 3.100, 1.600, 0.200 | Iris-setosa],
311     [6.300, 3.400, 5.600, 2.400 | Iris-virginica]
312    ]
313
314or randomly sample the attributes:
315
316    >>> atts = random.sample(data.domain.attributes, 2)
317    >>> domain = Orange.data.Domain(atts, data.domain.class_var)
318    >>> new_data = Orange.data.Table(domain, data)
319    >>> new_data[0]
320    [5.100, 1.400 | Iris-setosa]
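
Both kinds of sampling rely on ``random.sample``, which draws without replacement. A self-contained sketch on plain lists follows. The seed is fixed only to make repeated runs identical, and the data are a few iris rows used for illustration.

.. code-block:: python

    import random

    names = ["sepal length", "sepal width", "petal length", "petal width"]
    rows = [
        [5.1, 3.5, 1.4, 0.2],
        [4.9, 3.0, 1.4, 0.2],
        [6.0, 2.2, 4.0, 1.0],
        [6.3, 3.4, 5.6, 2.4],
    ]

    rng = random.Random(42)  # private generator; the seed value is arbitrary

    # Sample instances: pick two distinct rows.
    sampled_rows = rng.sample(rows, 2)

    # Sample attributes: pick two distinct columns, then project every row.
    sampled_columns = rng.sample(range(len(names)), 2)
    projected = [[row[i] for i in sampled_columns] for row in rows]

    print(sampled_rows)
    print([names[i] for i in sampled_columns])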
321