\name{redun}
\alias{redun}
\alias{print.redun}
\title{Redundancy Analysis}
\description{
Uses flexible parametric additive models (see \code{\link{areg}} and its
use of regression splines) to
determine how well each variable can be predicted from the remaining
variables.  Variables are dropped in a stepwise fashion, removing the
most predictable variable at each step.  The remaining variables are used
to predict.  The process continues until no variable still in the list
of predictors can be predicted with an \eqn{R^2} or adjusted \eqn{R^2}
of at least \code{r2} or until dropping the variable with the highest
\eqn{R^2} (adjusted or ordinary) would cause a variable that was dropped
earlier to no longer be predicted at least at the \code{r2} level from
the now smaller list of predictors.
}
\usage{
redun(formula, data=NULL, subset=NULL, r2 = 0.9,
      type = c("ordinary", "adjusted"), nk = 3, tlinear = TRUE,
      allcat=FALSE, minfreq=0, iterms=FALSE, pc=FALSE, pr = FALSE, ...)
\method{print}{redun}(x, digits=3, long=TRUE, ...)
}
\arguments{
  \item{formula}{a formula.  Enclose a variable in \code{I()} to force
    linearity.}
  \item{data}{a data frame}
  \item{subset}{usual subsetting expression}
  \item{r2}{ordinary or adjusted \eqn{R^2} cutoff for redundancy}
  \item{type}{specify \code{"adjusted"} to use adjusted \eqn{R^2}}
  \item{nk}{number of knots to use for continuous variables.  Use
    \code{nk=0} to force linearity for all variables.}
  \item{tlinear}{set to \code{FALSE} to allow a variable to be automatically
    nonlinearly transformed (see \code{areg}) while being predicted.  By
    default, only continuous variables on the right hand side (i.e., while
    they are being predictors) are automatically transformed, using
    regression splines.  Estimating transformations for target (dependent)
    variables causes more overfitting than doing so for predictors.}
  \item{allcat}{set to \code{TRUE} to ensure that all categories of
    categorical variables having more than two categories are redundant
    (see details below)}
  \item{minfreq}{For a binary or categorical variable, there must be at
    least two categories with at least \code{minfreq} observations or
    the variable will be dropped and not checked for redundancy against
    other variables.  \code{minfreq} also specifies the minimum
    frequency of a category or its complement
    before that category is considered when \code{allcat=TRUE}.}
  \item{iterms}{set to \code{TRUE} to consider derived terms (dummy
    variables and nonlinear spline components) as separate variables.
    This will perform a redundancy analysis on pieces of the variables.}
  \item{pc}{if \code{iterms=TRUE} you can set \code{pc} to \code{TRUE}
    to replace the submatrix of terms corresponding to each variable
    with the orthogonal principal components before doing the redundancy
    analysis.  The components are based on the correlation matrix.}
  \item{pr}{set to \code{TRUE} to monitor progress of the stepwise algorithm}
  \item{\dots}{arguments to pass to \code{dataframeReduce} to remove
    "difficult" variables from \code{data} if \code{formula} is
    \code{~.} to use all variables in \code{data} (\code{data} must be
    specified when these arguments are used).  Ignored for \code{print}.}
  \item{x}{an object created by \code{redun}}
  \item{digits}{number of digits to which to round \eqn{R^2} values when
    printing}
  \item{long}{set to \code{FALSE} to prevent the \code{print} method
    from printing the \eqn{R^2} history and the original \eqn{R^2} with
    which each variable can be predicted from ALL other variables.}
}
\value{an object of class \code{"redun"}}
\details{
A categorical variable is deemed
redundant if a linear combination of dummy variables representing it can
be predicted from a linear combination of other variables.  For example,
if there were 4 cities in the data and each city's rainfall was also
present as a variable, with virtually the same rainfall reported for all
observations for a city, city would be redundant given rainfall (or
vice-versa; the one declared redundant would be the first one in the
formula).  If two cities had the same rainfall, \code{city} might be
declared redundant even though tied cities might be deemed non-redundant
in another setting.  To ensure that all categories may be predicted well
from other variables, use the \code{allcat} option.  To ignore
categories that are too infrequent or too frequent, set \code{minfreq}
to a nonzero integer.  When the number of observations in the category
is below this number or the number of observations not in the category
is below this number, no attempt is made to predict observations being
in that category individually for the purpose of redundancy detection.}
\author{
Frank Harrell
\cr
Department of Biostatistics
\cr
Vanderbilt University
\cr
\email{f.harrell@vanderbilt.edu}
}
\seealso{\code{\link{areg}}, \code{\link{dataframeReduce}},
  \code{\link{transcan}}, \code{\link{varclus}},
  \code{\link[subselect]{genetic}}}
\examples{
set.seed(1)
n <- 100
x1 <- runif(n)
x2 <- runif(n)
x3 <- x1 + x2 + runif(n)/10
x4 <- x1 + x2 + x3 + runif(n)/10
x5 <- factor(sample(c('a','b','c'),n,replace=TRUE))
x6 <- 1*(x5=='a' | x5=='c')
redun(~x1+x2+x3+x4+x5+x6, r2=.8)
redun(~x1+x2+x3+x4+x5+x6, r2=.8, minfreq=40)
redun(~x1+x2+x3+x4+x5+x6, r2=.8, allcat=TRUE)
# x5 is no longer redundant but x6 is
}
\keyword{smooth}
\keyword{regression}
\keyword{multivariate}
\keyword{methods}
\keyword{models}
\concept{data reduction}