# Preprocessing
Sometimes, one or more preprocessing steps may need to be taken to transform the dataset before handing it off to a Learner. Some examples of preprocessing include feature extraction, standardization, normalization, imputation, and dimensionality reduction.

## Transformers
[Transformers](transformers/api.md) are objects that perform various preprocessing steps on the samples in a dataset. [Stateful](transformers/api.md#stateful) transformers are a type of transformer that must be *fitted* to a dataset. Fitting a transformer to a dataset is much like training a learner, but in the context of preprocessing rather than inference. After fitting, a stateful transformer expects the features to be present in the same order when transforming subsequent datasets. A few transformers are *supervised*, meaning they must be fitted with a [Labeled](datasets/labeled.md) dataset. [Elastic](transformers/api.md#elastic) transformers can have their fittings updated with new data after an initial fitting.

### Transform a Dataset
An example of a transformation is one that converts the categorical features of a dataset to continuous ones using a [*one hot*](https://en.wikipedia.org/wiki/One-hot) encoding. To accomplish this with the library, pass a [One Hot Encoder](transformers/one-hot-encoder.md) instance as an argument to the [Dataset](datasets/api.md) object's `apply()` method. Note that the `apply()` method also handles fitting a Stateful transformer automatically.

```php
use Rubix\ML\Transformers\OneHotEncoder;

$dataset->apply(new OneHotEncoder());
```

Transformations can be chained by calling the `apply()` method fluently.

```php
use Rubix\ML\Transformers\RandomHotDeckImputer;
use Rubix\ML\Transformers\OneHotEncoder;
use Rubix\ML\Transformers\MinMaxNormalizer;

$dataset->apply(new RandomHotDeckImputer(5))
    ->apply(new OneHotEncoder())
    ->apply(new MinMaxNormalizer());
```

!!! note
    Transformers do not alter the labels in a dataset. Instead, you can use the `transformLabels()` method on a [Labeled](datasets/labeled.md#transform-labels) dataset instance.

### Manually Fitting
If you need to fit a [Stateful](transformers/api.md#stateful) transformer to a dataset other than the one it was meant to transform, you can fit the transformer manually by calling the `fit()` method before applying the transformation.

```php
use Rubix\ML\Transformers\WordCountVectorizer;

$transformer = new WordCountVectorizer(5000);

$transformer->fit($dataset1);

$dataset2->apply($transformer);
```

### Update Fitting
To update the fitting of an [Elastic](transformers/api.md#elastic) transformer, call the `update()` method with a new dataset.

```php
$transformer->update($dataset);
```

## Transform a Single Column
Sometimes, we just want to transform a single column of the dataset. In the example below, we use the `transformColumn()` method on the dataset object to apply a log transformation to the column at a given offset by passing it the `log1p()` callback function, which is applied to each value in the column.

```php
$dataset->transformColumn(6, 'log1p');
```

In the next example, we'll convert the `null` values of another column to a special placeholder category `?`.

```php
$dataset->transformColumn(9, function ($value) {
    return $value === null ? '?' : $value;
});
```

## Standardization and Normalization
Oftentimes, the continuous features of a dataset will be on different scales because they were measured by different methods. For example, age (0 - 100) and income (0 - 9,999,999) are on two widely different scales. Standardization is the process of transforming a dataset such that the features are all on one common scale. Normalization is the special case where the transformed features have a range between 0 and 1.
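To illustrate the arithmetic behind normalization, here is a minimal plain-PHP sketch of min-max scaling. It is illustrative only - for real work you would apply the [Min Max Normalizer](transformers/min-max-normalizer.md) transformer to the dataset instead.

```php
// Plain-PHP illustration of min-max normalization:
// each value is rescaled to the range [0, 1] via (x - min) / (max - min).
$ages = [18, 35, 52, 100];

$min = min($ages);
$max = max($ages);

$normalized = array_map(
    fn ($x) => ($x - $min) / ($max - $min),
    $ages
);

// $normalized[0] is 0.0 and $normalized[3] is 1.0;
// the values in between keep their relative spacing.
```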
Depending on the transformer, it may operate on the columns or the rows of the dataset.

| Transformer | Operates | Output Range | [Stateful](transformers/api.md#stateful) | [Elastic](transformers/api.md#elastic) |
|---|---|---|---|---|
| [L1 Normalizer](transformers/l1-normalizer.md) | Row-wise | [0, 1] | | |
| [L2 Normalizer](transformers/l2-normalizer.md) | Row-wise | [0, 1] | | |
| [Max Absolute Scaler](transformers/max-absolute-scaler.md) | Column-wise | [-1, 1] | ● | ● |
| [Min Max Normalizer](transformers/min-max-normalizer.md) | Column-wise | [min, max] | ● | ● |
| [Robust Standardizer](transformers/robust-standardizer.md) | Column-wise | [-∞, ∞] | ● | |
| [Z Scale Standardizer](transformers/z-scale-standardizer.md) | Column-wise | [-∞, ∞] | ● | ● |

## Feature Conversion
Feature converters are transformers that convert feature columns from one data type to another by changing their representation.

| Transformer | From | To | [Stateful](transformers/api.md#stateful) | [Elastic](transformers/api.md#elastic) |
|---|---|---|---|---|
| [Interval Discretizer](transformers/interval-discretizer.md) | Continuous | Categorical | ● | |
| [One Hot Encoder](transformers/one-hot-encoder.md) | Categorical | Continuous | ● | |
| [Numeric String Converter](transformers/numeric-string-converter.md) | Categorical | Continuous | | |
| [Boolean Converter](transformers/boolean-converter.md) | Other | Categorical or Continuous | | |

## Dimensionality Reduction
Dimensionality reduction is a preprocessing technique for projecting a dataset onto a lower-dimensional vector space. It allows a learner to train and infer more quickly by producing a training set with fewer but more informative features. Dimensionality reducers can also be used to visualize datasets by outputting low-dimensional (1 - 3) embeddings for use in plotting software.
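Like other transformers, a dimensionality reducer is fitted and applied through the dataset's `apply()` method. Below is a sketch using Principal Component Analysis, assuming its constructor takes the target number of dimensions.

```php
use Rubix\ML\Transformers\PrincipalComponentAnalysis;

// Project the samples onto their 10 most informative principal components.
$dataset->apply(new PrincipalComponentAnalysis(10));
```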
| Transformer | Supervised | [Stateful](transformers/api.md#stateful) | [Elastic](transformers/api.md#elastic) |
|---|---|---|---|
| [Gaussian Random Projector](transformers/gaussian-random-projector.md) | | ● | |
| [Linear Discriminant Analysis](transformers/linear-discriminant-analysis.md) | ● | ● | |
| [Principal Component Analysis](transformers/principal-component-analysis.md) | | ● | |
| [Sparse Random Projector](transformers/sparse-random-projector.md) | | ● | |
| [Truncated SVD](transformers/truncated-svd.md) | | ● | |
| [t-SNE](embedders/t-sne.md) | | | |

## Feature Selection
Similarly to dimensionality reduction, feature selection aims to reduce the number of features in a dataset. However, feature selection seeks to keep the best features as-is and drop the less informative ones entirely. Adding feature selection can help speed up training and inference by producing a more parsimonious model. It can also improve the performance of the model by removing *noise* features and features that are uncorrelated with the outcome.

| Transformer | Supervised | [Stateful](transformers/api.md#stateful) | [Elastic](transformers/api.md#elastic) |
|---|---|---|---|
| [K Best Feature Selector](transformers/k-best-feature-selector.md) | ● | ● | |
| [Recursive Feature Eliminator](transformers/recursive-feature-eliminator.md) | ● | ● | |

## Imputation
*Imputation* is a preprocessing technique for handling missing values in your dataset by replacing them with a reasonable estimate.
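As with other Stateful transformers, an imputer is fitted and applied in one step with `apply()`. Here is a sketch using the KNN Imputer, assuming its first constructor argument is the number of nearest neighbors to consider.

```php
use Rubix\ML\Transformers\KNNImputer;

// Replace missing values with estimates derived from the 5 nearest neighbors.
$dataset->apply(new KNNImputer(5));
```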
| Transformer | Data Compatibility | [Stateful](transformers/api.md#stateful) | [Elastic](transformers/api.md#elastic) |
|---|---|---|---|
| [KNN Imputer](transformers/knn-imputer.md) | Depends on distance kernel | ● | |
| [Missing Data Imputer](transformers/missing-data-imputer.md) | Categorical, Continuous | ● | |
| [Random Hot Deck Imputer](transformers/random-hot-deck-imputer.md) | Depends on distance kernel | ● | |

## Text Transformers
The library provides a number of transformers for natural language processing (NLP) and information retrieval (IR) tasks such as text cleaning, normalization, and feature extraction from raw text blobs.

| Transformer | [Stateful](transformers/api.md#stateful) | [Elastic](transformers/api.md#elastic) |
|---|---|---|
| [HTML Stripper](transformers/html-stripper.md) | | |
| [Regex Filter](transformers/regex-filter.md) | | |
| [Text Normalizer](transformers/text-normalizer.md) | | |
| [Multibyte Text Normalizer](transformers/multibyte-text-normalizer.md) | | |
| [Stop Word Filter](transformers/stop-word-filter.md) | | |
| [TF-IDF Transformer](transformers/tf-idf-transformer.md) | ● | ● |
| [Whitespace Trimmer](transformers/whitespace-trimmer.md) | | |
| [Word Count Vectorizer](transformers/word-count-vectorizer.md) | ● | |

## Image Transformers
These transformers operate on the high-level image data type.

| Transformer | [Stateful](transformers/api.md#stateful) | [Elastic](transformers/api.md#elastic) |
|---|---|---|
| [Image Resizer](transformers/image-resizer.md) | | |
| [Image Vectorizer](transformers/image-vectorizer.md) | ● | |

### Persisting Transformers
The persistence subsystem can be used to save and load any Stateful transformer that implements the [Persistable](persistable.md) interface.
In the example below, we'll fit a transformer to a dataset and then save it to the [Filesystem](persisters/filesystem.md) so we can load it in another process.

```php
use Rubix\ML\Persisters\Filesystem;

$transformer->fit($dataset);

$persister = new Filesystem('example.transformer');

$persister->save($transformer);
```

Then, to load the transformer in another process, call the `load()` method on the [Persister](persisters/api.md) instance.

```php
$persister = new Filesystem('example.transformer');

$transformer = $persister->load();

$dataset->apply($transformer);
```

## Transformer Pipelines
The [Pipeline](pipeline.md) meta-estimator automates a series of transformations applied to the datasets given to an estimator. With a Pipeline, any dataset object passed to it will automatically be fitted and/or transformed before it arrives in the estimator's context. In addition, transformer fittings can be saved alongside the model data when the Pipeline is persisted.

```php
use Rubix\ML\Pipeline;
use Rubix\ML\Transformers\RandomHotDeckImputer;
use Rubix\ML\Transformers\OneHotEncoder;
use Rubix\ML\Transformers\ZScaleStandardizer;
use Rubix\ML\Clusterers\KMeans;

$estimator = new Pipeline([
    new RandomHotDeckImputer(5),
    new OneHotEncoder(),
    new ZScaleStandardizer(),
], new KMeans(10, 256));
```

Calling `train()` or `partial()` will result in the transformers being fitted or updated before the dataset is passed to the underlying K Means clusterer.

```php
$estimator->train($dataset); // Transformers fitted and applied

$estimator->partial($dataset); // Transformers updated and applied
```

Any time a dataset is passed to the Pipeline it will automatically be transformed before being handed to the underlying estimator.
```php
$predictions = $estimator->predict($dataset); // Dataset transformed automatically
```

You can save the transformer fittings alongside the model data by persisting the entire Pipeline object or by wrapping it in a [Persistent Model](persistent-model.md) meta-estimator.

```php
use Rubix\ML\Persisters\Filesystem;

$persister = new Filesystem('example.model');

$persister->save($estimator);
```

## Advanced Preprocessing
In some cases, certain features of a dataset may require a different set of preprocessing steps than the others. In such a case, we can extract only certain features, preprocess them, and then join them to another set of features. In the example below, we'll extract just the text reviews and their sentiment labels into one dataset object, and put each sample's category, number of clicks, and rating into another using two [Column Pickers](extractors/column-picker.md). Then, we can apply a separate set of transformations to each set of features and use the `join()` method to combine them into a single dataset. We can even apply another set of transformations to the combined dataset after that.
```php
use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Datasets\Unlabeled;
use Rubix\ML\Extractors\ColumnPicker;
use Rubix\ML\Extractors\NDJSON;
use Rubix\ML\Transformers\TextNormalizer;
use Rubix\ML\Transformers\WordCountVectorizer;
use Rubix\ML\Transformers\TfIdfTransformer;
use Rubix\ML\Transformers\OneHotEncoder;
use Rubix\ML\Transformers\ZScaleStandardizer;

$extractor1 = new ColumnPicker(new NDJSON('dataset.ndjson'), [
    'review', 'sentiment',
]);

$extractor2 = new ColumnPicker(new NDJSON('dataset.ndjson'), [
    'category', 'clicks', 'rating',
]);

$dataset1 = Labeled::fromIterator($extractor1)
    ->apply(new TextNormalizer())
    ->apply(new WordCountVectorizer(5000))
    ->apply(new TfIdfTransformer());

$dataset2 = Unlabeled::fromIterator($extractor2)
    ->apply(new OneHotEncoder());

$dataset = $dataset1->join($dataset2)
    ->apply(new ZScaleStandardizer());
```

## Filtering Records
In some cases, you may want to remove entire rows from the dataset. For example, you may want to remove records that contain features with abnormally low or high values, since these samples can be interpreted as noise. The `filterByColumn()` method on the dataset object uses a callback function to determine whether or not a row is included in the new dataset based on the value of the feature at a given column offset.

```php
$tallPeople = $dataset->filterByColumn(3, function ($value) {
    return $value > 178.5;
});
```

## De-duplication
When it is undesirable for a dataset to contain duplicate records, you can remove all duplicates by calling the `deduplicate()` method on the dataset object.

```php
$dataset->deduplicate();
```

!!! note
    The O(N^2) time complexity of de-duplication may be prohibitive for large datasets.
## Saving a Dataset
If you ever want to preprocess a dataset and then save it for later, you can do so by calling one of the conversion methods (`toCSV()`, `toNDJSON()`) on the [Dataset](datasets/api.md#encode-the-dataset) object. Then, call the `write()` method on the returned encoding object to save the data to a file at a given path, like in the example below.

```php
use Rubix\ML\Transformers\MissingDataImputer;

$dataset->apply(new MissingDataImputer())->toCSV()->write('dataset.csv');
```
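In a later session, the saved file can be read back in with an extractor. Below is a sketch assuming the CSV extractor and an unlabeled dataset.

```php
use Rubix\ML\Datasets\Unlabeled;
use Rubix\ML\Extractors\CSV;

// Reload the preprocessed samples saved in the previous example.
$dataset = Unlabeled::fromIterator(new CSV('dataset.csv'));
```

Keep in mind that values read from CSV are untyped strings, so a [Numeric String Converter](transformers/numeric-string-converter.md) may need to be applied before handing the dataset to a learner.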