% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/gafs.R
\name{gafs.default}
\alias{gafs.default}
\alias{gafs}
\alias{gafs.recipe}
\title{Genetic algorithm feature selection}
\usage{
\method{gafs}{default}(
  x,
  y,
  iters = 10,
  popSize = 50,
  pcrossover = 0.8,
  pmutation = 0.1,
  elite = 0,
  suggestions = NULL,
  differences = TRUE,
  gafsControl = gafsControl(),
  ...
)

\method{gafs}{recipe}(
  x,
  data,
  iters = 10,
  popSize = 50,
  pcrossover = 0.8,
  pmutation = 0.1,
  elite = 0,
  suggestions = NULL,
  differences = TRUE,
  gafsControl = gafsControl(),
  ...
)
}
\arguments{
\item{x}{An object where samples are in rows and features are in columns.
This could be a simple matrix, data frame or other type (e.g. sparse
matrix). For the recipes method, \code{x} is a recipe object. See Details
below.}

\item{y}{a numeric or factor vector containing the outcome for each sample}

\item{iters}{number of search iterations}

\item{popSize}{number of subsets evaluated at each iteration}

\item{pcrossover}{the crossover probability}

\item{pmutation}{the mutation probability}

\item{elite}{the number of best subsets to survive at each generation}

\item{suggestions}{a binary matrix of candidate subsets to be included in
the initial population. If provided, the number of columns must match the
number of columns in \code{x}}

\item{differences}{a logical: should the difference in fitness values with
and without each predictor be calculated?}

\item{gafsControl}{a list of values that define how this function acts. See
\code{\link{gafsControl}}.}

\item{...}{additional arguments to be passed to other methods}

\item{data}{Data frame from which variables specified in the \code{recipe}
are to be taken.}
}
\value{
an object of class \code{gafs}
}
\description{
Supervised feature selection using genetic algorithms
}
\details{
\code{\link{gafs}} conducts a supervised binary search of the predictor
space using a genetic algorithm. See Mitchell (1996) and Scrucca (2013) for
more details on genetic algorithms.

This function conducts the search of the feature space repeatedly within
resampling iterations. First, the training data are split by whatever
resampling method was specified in the control function. For example, if
10-fold cross-validation is selected, the entire genetic algorithm is
conducted 10 separate times. For the first fold, nine tenths of the data
are used in the search while the remaining tenth is used to estimate the
external performance since these data points were not used in the search.

During the genetic algorithm, a measure of fitness is needed to guide the
search. This is the internal measure of performance. During the search, the
data that are available are the instances selected by the top-level
resampling (e.g. the nine tenths mentioned above). A common approach is to
conduct another resampling procedure. Another option is to use a holdout
set of samples to determine the internal estimate of performance (see the
holdout argument of the control function). While this is faster, it is more
likely to cause overfitting of the features and should only be used when a
large amount of training data are available. Yet another idea is to use a
penalized metric (such as the AIC statistic) but this may not exist for
some metrics (e.g. the area under the ROC curve).
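
As a sketch, a holdout-based internal estimate can be requested through the
\code{holdout} argument of the control function (the value below is
illustrative):

\preformatted{
## Sketch: reserve a quarter of the search data for the internal
## fitness estimate instead of using inner resampling
ctrl <- gafsControl(functions = rfGA,
                    method = "cv",
                    holdout = 0.25)
}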

The internal estimates of performance will eventually overfit the subsets
to the data. However, since the external estimate is not used by the
search, it is able to make better assessments of overfitting. After
resampling, this function determines the optimal number of generations for
the GA.

Finally, the entire data set is used in the last execution of the genetic
algorithm search and the final model is built on the predictor subset that
is associated with the optimal number of generations determined by
resampling (although the update function can be used to manually set the
number of generations).
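
For example, the resampling-optimal number of generations can be
overridden after the fact; the call below is a sketch and the argument
names are illustrative:

\preformatted{
## Sketch: rebuild the final model using the best subset from
## generation 5 rather than the resampling-optimal generation
new_search <- update(gafs_fit, iter = 5, x = predictors, y = outcome)
}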

This is an example of the output produced when \code{gafsControl(verbose =
TRUE)} is used:

\preformatted{
Fold2 1 0.715 (13)
Fold2 2 0.715->0.737 (13->17, 30.4\%) *
Fold2 3 0.737->0.732 (17->14, 24.0\%)
Fold2 4 0.737->0.769 (17->23, 25.0\%) *
}

For the second resample (e.g. fold 2), the best subset across all
individuals tested in the first generation contained 13 predictors and was
associated with a fitness value of 0.715. The second generation produced a
better subset containing 17 predictors with an associated fitness value of
0.737 (the improvement is symbolized by the \code{*}). The percentage
listed is the Jaccard similarity between the previous best individual (with
13 predictors) and the new best. The third generation did not produce a
better fitness value but the fourth generation did.
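
The Jaccard similarity is simply the size of the intersection of the two
subsets divided by the size of their union, e.g.:

\preformatted{
a <- c("x1", "x2", "x3")
b <- c("x2", "x3", "x4")
length(intersect(a, b)) / length(union(a, b))  # 0.5
}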

The search algorithm can be parallelized in several places: \enumerate{
\item each externally resampled GA can be run independently (controlled by
the \code{allowParallel} option of \code{\link{gafsControl}}) \item within
a GA, the fitness calculations at a particular generation can be run in
parallel over the current set of individuals (see the \code{genParallel}
option in \code{\link{gafsControl}}) \item if inner resampling is used,
these can be run in parallel (controls depend on the function used. See,
for example, \code{\link[caret]{trainControl}}) \item any parallelization
of the individual model fits. This is also specific to the modeling
function.  }

It is probably best to pick one of these areas for parallelization, and the
first is likely to produce the largest decrease in run-time since it is the
least likely to incur repeated restarts of the worker processes. Keep in
mind that if multiple levels of parallelization occur, the number of
workers and the amount of memory required can multiply quickly.
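
For example, the first level of parallelism can be enabled by registering a
foreach backend before calling \code{gafs}; the sketch below assumes the
doParallel package:

\preformatted{
library(doParallel)
cl <- makePSOCKcluster(2)
registerDoParallel(cl)

ctrl <- gafsControl(functions = rfGA,
                    method = "cv",
                    number = 3,
                    allowParallel = TRUE)

## ... call gafs() as usual ...

stopCluster(cl)
}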
}
\examples{

\dontrun{
set.seed(1)
train_data <- twoClassSim(100, noiseVars = 10)
test_data  <- twoClassSim(10,  noiseVars = 10)

## A short example
ctrl <- gafsControl(functions = rfGA,
                    method = "cv",
                    number = 3)

rf_search <- gafs(x = train_data[, -ncol(train_data)],
                  y = train_data$Class,
                  iters = 3,
                  gafsControl = ctrl)

rf_search
}

}
\references{
Kuhn M and Johnson K (2013), Applied Predictive Modeling,
Springer, Chapter 19 \url{http://appliedpredictivemodeling.com}

Scrucca L (2013). GA: A Package for Genetic Algorithms in R. Journal of
Statistical Software, 53(4), 1-37. \url{https://www.jstatsoft.org/article/view/v053i04}

Mitchell M (1996), An Introduction to Genetic Algorithms, MIT Press.

\url{https://en.wikipedia.org/wiki/Jaccard_index}
}
\seealso{
\code{\link{gafsControl}}, \code{\link{predict.gafs}},
\code{\link{caretGA}}, \code{\link{rfGA}}, \code{\link{treebagGA}}
}
\author{
Max Kuhn, Luca Scrucca (for GA internals)
}
\keyword{models}