% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/safs.R
\name{safs}
\alias{safs}
\alias{safs.default}
\alias{safs.recipe}
\title{Simulated annealing feature selection}
\usage{
safs(x, ...)

\method{safs}{default}(x, y, iters = 10, differences = TRUE, safsControl = safsControl(), ...)

\method{safs}{recipe}(x, data, iters = 10, differences = TRUE, safsControl = safsControl(), ...)
}
\arguments{
\item{x}{An object where samples are in rows and features are in columns.
This could be a simple matrix, data frame or other type (e.g. sparse
matrix). For the recipes method, \code{x} is a recipe object. See Details below.}

\item{\dots}{arguments passed to the classification or regression routine
specified in the function \code{safsControl$functions$fit}}

\item{y}{a numeric or factor vector containing the outcome for each sample.}

\item{iters}{number of search iterations}

\item{differences}{a logical: should the difference in fitness values with
and without each predictor be calculated?}

\item{safsControl}{a list of values that define how this function acts. See
\code{\link{safsControl}} and URL.}

\item{data}{a data frame containing the predictors and outcome used by the
recipe (recipes method only).}
}
\value{
an object of class \code{safs}
}
\description{
Supervised feature selection using simulated annealing

\code{\link{safs}} conducts a supervised binary search of the predictor
space using simulated annealing (SA). See Kirkpatrick et al. (1983) for more
information on this search algorithm.

This function conducts the search of the feature space repeatedly within
resampling iterations. First, the training data are split by whatever
resampling method was specified in the control function. For example, if
10-fold cross-validation is selected, the entire simulated annealing search
is conducted 10 separate times.
For the first fold, nine tenths of the data
are used in the search while the remaining tenth is used to estimate the
external performance since these data points were not used in the search.

During the search, a measure of fitness (i.e. SA energy value) is needed to
guide the search. This is the internal measure of performance. During the
search, the data that are available are the instances selected by the
top-level resampling (e.g. the nine tenths mentioned above). A common
approach is to conduct another resampling procedure. Another option is to
use a holdout set of samples to determine the internal estimate of
performance (see the holdout argument of the control function). While this
is faster, it is more likely to cause overfitting of the features and should
only be used when a large amount of training data is available. Yet another
idea is to use a penalized metric (such as the AIC statistic) but this may
not exist for some metrics (e.g. the area under the ROC curve).

The internal estimates of performance will eventually overfit the subsets to
the data. However, since the external estimate is not used by the search, it
is able to make better assessments of overfitting. After resampling, this
function determines the optimal number of iterations for the SA.

Finally, the entire data set is used in the last execution of the simulated
annealing algorithm search and the final model is built on the predictor
subset that is associated with the optimal number of iterations determined
by resampling (although the update function can be used to manually set the
number of iterations).

This is an example of the output produced when
\code{safsControl(verbose = TRUE)} is used:

\preformatted{
Fold03 1 0.401 (11)
Fold03 2 0.401->0.410 (11+1, 91.7\%) *
Fold03 3 0.410->0.396 (12+1, 92.3\%) 0.969 A
Fold03 4 0.410->0.370 (12+2, 85.7\%) 0.881
Fold03 5 0.410->0.399 (12+2, 85.7\%) 0.954 A
Fold03 6 0.410->0.399 (12+1, 78.6\%) 0.940 A
Fold03 7 0.410->0.428 (12+2, 73.3\%) *
}

The text "Fold03" indicates that this search is for the third
cross-validation fold. The initial subset of 11 predictors had a fitness
value of 0.401. The next iteration added a single feature to the existing
best subset of 11 (as indicated by "11+1") that increased the fitness value
to 0.410. This new solution, which has a Jaccard similarity value of 91.7\%
to the current best solution, is automatically accepted. The third iteration
adds another feature to the current set of 12 but does not improve the
fitness. The acceptance probability for this difference is shown to be
96.9\% and the "A" indicates that this new sub-optimal subset is accepted.
The fourth iteration does not show an increase and is not accepted. Note
that the Jaccard similarity value of 85.7\% is the similarity to the current
best solution (from iteration 2) and the "12+2" indicates that there are two
additional features added from the current best that contains 12 predictors.

The search algorithm can be parallelized in several places:
\enumerate{
\item each externally resampled SA can be run independently (controlled by
the \code{allowParallel} option of \code{\link{safsControl}})
\item if inner resampling is used, these can be run in parallel (controls
depend on the function used. See, for example,
\code{\link[caret]{trainControl}})
\item any parallelization of the individual model fits. This is also
specific to the modeling function.
}

It is probably best to pick one of these areas for parallelization and the
first is likely to produce the largest decrease in run-time since it is the
least likely to incur multiple re-starting of the worker processes. Keep in
mind that if multiple levels of parallelization occur, this can affect the
number of workers and the amount of memory required exponentially.
}
\examples{

\dontrun{

set.seed(1)
train_data <- twoClassSim(100, noiseVars = 10)
test_data  <- twoClassSim(10,  noiseVars = 10)

## A short example
ctrl <- safsControl(functions = rfSA,
                    method = "cv",
                    number = 3)

rf_search <- safs(x = train_data[, -ncol(train_data)],
                  y = train_data$Class,
                  iters = 3,
                  safsControl = ctrl)

rf_search
}

}
\references{
\url{http://topepo.github.io/caret/feature-selection-using-genetic-algorithms.html}

\url{http://topepo.github.io/caret/feature-selection-using-simulated-annealing.html}

Kuhn and Johnson (2013), Applied Predictive Modeling, Springer

Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. (1983). Optimization by
simulated annealing. Science, 220(4598), 671-680.
}
\seealso{
\code{\link{safsControl}}, \code{\link{predict.safs}}
}
\author{
Max Kuhn
}
\keyword{models}