The Data
========

.. index:: data

This section describes how to load data in Orange. We also show how to explore the data, compute some basic statistics, and sample the data.

Data Input
----------

.. index::
   single: data; input

Orange can read files in its native tab-delimited format, or can load data from any of the major standard spreadsheet file types, like CSV and Excel. The native format starts with a header row with feature (column) names. The second header row gives the attribute type, which can be continuous, discrete, time, or string. The third header row contains meta information to identify dependent features (class), irrelevant features (ignore) or meta features (meta).
A more detailed specification is available in :doc:`../reference/data.io`.
Here are the first few lines from the dataset :download:`lenses.tab <code/lenses.tab>`::

    age       prescription  astigmatic    tear_rate     lenses
    discrete  discrete      discrete      discrete      discrete
                                                        class
    young     myope         no            reduced       none
    young     myope         no            normal        soft
    young     myope         yes           reduced       none
    young     myope         yes           normal        hard
    young     hypermetrope  no            reduced       none


Values are tab-delimited. This dataset has four attributes (age of the patient, spectacle prescription, presence of astigmatism, and information on tear production rate) and an associated three-valued dependent variable encoding the lens prescription for the patient (hard contact lenses, soft contact lenses, no lenses). Type and flag descriptions can be abbreviated to a single letter, so the header of this dataset could also read::

    age       prescription  astigmatic    tear_rate     lenses
    d         d             d             d             d
                                                        c

The rest of the table gives the data. The excerpt above shows five instances. For the full dataset, download :download:`lenses.tab <code/lenses.tab>` to a target directory. You can also skip this step, as Orange comes preloaded with several demo datasets, lenses being one of them.
Now, open a Python shell, import Orange and load the data:

    >>> import Orange
    >>> data = Orange.data.Table("lenses")

Note that no suffix is needed for the file name, as Orange checks whether any files in the current directory are of a readable type. The call to ``Orange.data.Table`` creates an object called ``data`` that holds your dataset and information about the lenses domain:

    >>> data.domain.attributes
    (DiscreteVariable('age', values=('pre-presbyopic', 'presbyopic', 'young')),
     DiscreteVariable('prescription', values=('hypermetrope', 'myope')),
     DiscreteVariable('astigmatic', values=('no', 'yes')),
     DiscreteVariable('tear_rate', values=('normal', 'reduced')))
    >>> data.domain.class_var
    DiscreteVariable('lenses', values=('hard', 'none', 'soft'))
    >>> for d in data[:3]:
    ...     print(d)
    ...
    [young, myope, no, reduced | none]
    [young, myope, no, normal | soft]
    [young, myope, yes, reduced | none]

The following script wraps up everything we have done so far and lists the first five data instances with a ``soft`` prescription:

.. literalinclude:: code/data-lenses.py

Note that ``data`` is an object that holds both the data and information on the domain. We show above how to access attribute and class names, but there is much more information there, including the feature types and the sets of values for categorical features.

Saving the Data
---------------

Data objects can be saved to a file:

    >>> data.save("new_data.tab")

This time, we have to provide the file extension to specify the output format. The extension for Orange's native data format is ``.tab``. The following code saves only the data items with a myope prescription:

.. literalinclude:: code/data-save.py

We have created a new data table by passing the information on the structure of the data (``data.domain``) and a subset of data instances.
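
The three header rows of the native format described under Data Input can also be written by hand. The following is a minimal sketch that uses only Python's standard ``csv`` module rather than Orange itself; the in-memory buffer and the variable names are illustrative assumptions, and the two data rows are copied from the lenses excerpt.

```python
import csv
import io

# The three header rows of the native format: names, types, flags.
names = ["age", "prescription", "astigmatic", "tear_rate", "lenses"]
types = ["discrete"] * 5
flags = ["", "", "", "", "class"]  # only the last column is marked as class

# Two data rows copied from the lenses excerpt.
rows = [
    ["young", "myope", "no", "reduced", "none"],
    ["young", "myope", "no", "normal", "soft"],
]

# Write everything tab-delimited into an in-memory buffer
# (a real file opened with newline="" would work the same way).
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows([names, types, flags] + rows)

text = buffer.getvalue()
print(text)
```

Writing the same text to a file with a ``.tab`` extension should give a file in the layout shown for ``lenses.tab``, although this sketch does not verify that against Orange's reader.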

Exploration of the Data Domain
------------------------------

.. index::
   single: data; attributes
.. index::
   single: data; domain
.. index::
   single: data; class

A data table stores information on data instances as well as on the data domain. The domain holds the names of attributes, optional classes, their types and, if categorical, their value names. The following code:

.. literalinclude:: code/data-domain1.py

outputs::

    25 attributes: 14 continuous, 11 discrete
    First three attributes: symboling, normalized-losses, make
    Class: price

Orange's objects often behave like Python lists and dictionaries, and can be indexed or accessed through feature names:

.. literalinclude:: code/data-domain2.py
   :lines: 5-

The output of the above code is::

    First attribute: symboling
    Values of attribute 'fuel-type': diesel, gas

Data Instances
--------------

.. index::
   single: data; instances
.. index::
   single: data; examples

A data table stores data instances (or examples). These can be indexed or traversed like any Python list. Data instances can be considered as vectors, accessed through an element index or through a feature name.

.. literalinclude:: code/data-instances1.py

The script above displays the following output::

    First three data instances:
    [5.100, 3.500, 1.400, 0.200 | Iris-setosa]
    [4.900, 3.000, 1.400, 0.200 | Iris-setosa]
    [4.700, 3.200, 1.300, 0.200 | Iris-setosa]
    25-th data instance:
    [4.800, 3.400, 1.900, 0.200 | Iris-setosa]
    Value of 'sepal width' for the first instance: 3.500
    The 3rd value of the 25th data instance: 1.900

The Iris dataset we have used above has four continuous attributes. Here's a script that computes their mean:

.. literalinclude:: code/data-instances2.py
   :lines: 3-

The above script also illustrates indexing of data instances with objects that store features; in ``d[x]``, the variable ``x`` is an Orange object. Here's the output::

    Feature             Mean
    sepal length        5.84
    sepal width         3.05
    petal length        3.76
    petal width         1.20


A slightly more complicated, but also more interesting, script computes per-class averages:

.. literalinclude:: code/data-instances3.py
   :lines: 3-

Of the four features, petal width and length look quite discriminative for the type of iris::

    Feature             Iris-setosa Iris-versicolor  Iris-virginica
    sepal length               5.01            5.94            6.59
    sepal width                3.42            2.77            2.97
    petal length               1.46            4.26            5.55
    petal width                0.24            1.33            2.03

Finally, here is a short script that computes the class distribution for another dataset:

.. literalinclude:: code/data-instances4.py

Orange Datasets and NumPy
-------------------------

Orange datasets are actually wrapped `NumPy <http://www.numpy.org>`_ arrays. Wrapping is performed to retain the information about the feature names and values, and NumPy arrays are used for speed and compatibility with different machine learning toolboxes, like `scikit-learn <http://scikit-learn.org>`_, on which Orange relies. Let us display the values of these arrays for the first three data instances of the iris dataset::

    >>> data = Orange.data.Table("iris")
    >>> data.X[:3]
    array([[ 5.1,  3.5,  1.4,  0.2],
           [ 4.9,  3. ,  1.4,  0.2],
           [ 4.7,  3.2,  1.3,  0.2]])
    >>> data.Y[:3]
    array([ 0.,  0.,  0.])

Notice that we access the arrays for attributes and class separately, using ``data.X`` and ``data.Y``.
Average values of attributes can then be computed efficiently by::

    >>> import numpy as np
    >>> np.mean(data.X, axis=0)
    array([ 5.84333333,  3.054     ,  3.75866667,  1.19866667])

We can also construct a (classless) dataset from a NumPy array::

    >>> X = np.array([[1,2], [4,5]])
    >>> data = Orange.data.Table(X)
    >>> data.domain
    [Feature 1, Feature 2]

If we want to provide meaningful names for the attributes, we need to construct an appropriate data domain::

    >>> domain = Orange.data.Domain([Orange.data.ContinuousVariable("length"),
    ...                              Orange.data.ContinuousVariable("width")])
    >>> data = Orange.data.Table(domain, X)
    >>> data.domain
    [length, width]

Here is another example, this time with the construction of a dataset that includes a numerical class and different types of attributes:

.. literalinclude:: code/data-domain-numpy.py
   :lines: 4-

Running this script yields::

    [[big, 3.400, circle | 42.000],
     [small, 2.700, oval | 52.200],
     [big, 1.400, square | 13.400]

Meta Attributes
---------------

Often, we wish to include descriptive fields in the data that will not be used in any computation (distance estimation, modeling), but will serve for identification or additional information. These are called meta attributes, and are marked with ``meta`` in the third header row:

.. literalinclude:: code/zoo.tab

Values of meta attributes and all other (non-meta) attributes are treated similarly in Orange, but are stored in separate NumPy arrays:

    >>> data = Orange.data.Table("zoo")
    >>> data[0]["name"]
    >>> data[0]["type"]
    >>> for d in data:
    ...     print("{}/{}: {}".format(d["name"], d["type"], d["legs"]))
    ...
    aardvark/mammal: 4
    antelope/mammal: 4
    bass/fish: 0
    bear/mammal: 4
    >>> data.X
    array([[ 1.,  0.,  1.,  1.,  2.],
           [ 1.,  0.,  1.,  1.,  2.],
           [ 0.,  1.,  0.,  1.,  0.],
           [ 1.,  0.,  1.,  1.,  2.]])
    >>> data.metas
    array([['aardvark'],
           ['antelope'],
           ['bass'],
           ['bear']], dtype=object)

Meta attributes may be passed to ``Orange.data.Table`` after providing arrays for attribute and class values:

.. literalinclude:: code/data-metas.py

The script outputs::

    [[2.200, 1625.000 | no] {houston, 10},
     [0.300, 163.000 | yes] {ljubljana, -1}

To construct a classless domain we could pass ``None`` for the class values.

Missing Values
--------------

.. index::
   single: data; missing values

Consider the following exploration of the dataset on votes in the US House of Representatives::

    >>> import numpy as np
    >>> data = Orange.data.Table("voting.tab")
    >>> data[2]
    [?, y, y, ?, y, ... | democrat]
    >>> np.isnan(data[2][0])
    True
    >>> np.isnan(data[2][1])
    False

This particular data instance has missing data (represented with '?') for the first and the fourth attribute. In the original dataset file, the missing values are, by default, represented with a blank space. We can now examine each attribute and report the proportion of data instances for which this feature was undefined:

.. literalinclude:: code/data-missing.py
   :lines: 4-

The first three lines of the output of this script are::

     2.8% handicapped-infants
    11.0% water-project-cost-sharing
     2.5% adoption-of-the-budget-resolution

A one-liner that reports the number of data instances with at least one missing value is::

    >>> sum(any(np.isnan(d[x]) for x in data.domain.attributes) for d in data)
    203

.. sum([np.any(np.isnan(x)) for x in data.X])

Data Selection and Sampling
---------------------------

.. index::
   single: data; sampling

Besides the name of the data file, ``Orange.data.Table`` can accept the data domain and a list of data items, and returns a new dataset. This is useful for any data subsetting:

.. literalinclude:: code/data-subsetting.py
   :lines: 3-

The code outputs::

    Dataset instances: 150
    Subset size: 99

The subset inherits the data description (domain) from the original dataset. Changing the domain requires setting up a new domain descriptor. This feature is useful for any kind of feature selection:

.. literalinclude:: code/data-feature-selection.py
   :lines: 3-

.. index::
   single: feature; selection

We could also construct a random sample of the dataset::

    >>> import random
    >>> sample = Orange.data.Table(data.domain, random.sample(data, 3))
    >>> sample
    [[6.000, 2.200, 4.000, 1.000 | Iris-versicolor],
     [4.800, 3.100, 1.600, 0.200 | Iris-setosa],
     [6.300, 3.400, 5.600, 2.400 | Iris-virginica]
    ]

or randomly sample the attributes:

    >>> atts = random.sample(data.domain.attributes, 2)
    >>> domain = Orange.data.Domain(atts, data.domain.class_var)
    >>> new_data = Orange.data.Table(domain, data)
    >>> new_data[0]
    [5.100, 1.400 | Iris-setosa]
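
The sampling calls above draw from Python's global random generator, so each run yields a different sample. Below is a minimal sketch of making such a sample reproducible with a private, seeded generator. It uses a plain list of rows as a stand-in for a data table, with values taken from the iris outputs shown in this section; the helper name ``sample_rows`` and the seed value are hypothetical and not part of Orange.

```python
import random

# Stand-in for a data table: a plain list of (features, class) rows.
rows = [
    ([5.1, 3.5, 1.4, 0.2], "Iris-setosa"),
    ([4.9, 3.0, 1.4, 0.2], "Iris-setosa"),
    ([4.7, 3.2, 1.3, 0.2], "Iris-setosa"),
    ([4.8, 3.4, 1.9, 0.2], "Iris-setosa"),
    ([6.0, 2.2, 4.0, 1.0], "Iris-versicolor"),
    ([6.3, 3.4, 5.6, 2.4], "Iris-virginica"),
]


def sample_rows(rows, k, seed):
    """Return k distinct rows drawn with a private, seeded generator."""
    rng = random.Random(seed)  # leaves the global random state untouched
    return rng.sample(rows, k)


first = sample_rows(rows, 3, seed=42)
second = sample_rows(rows, 3, seed=42)

# The same seed gives the same sample on every call.
print(first == second)  # True
for features, label in first:
    print(features, label)
```

A seeded generator is convenient when a sampled subset has to be the same across runs, for example when comparing results between sessions.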