\name{redun}
\alias{redun}
\alias{print.redun}
\title{Redundancy Analysis}
\description{
Uses flexible parametric additive models (see \code{\link{areg}} and its
use of regression splines) to
determine how well each variable can be predicted from the remaining
variables.  Variables are dropped in a stepwise fashion, removing the
most predictable variable at each step.  The variables that remain are
then used as the predictors in the next step.  The process continues
until no variable still in the list
of predictors can be predicted with an \eqn{R^2} or adjusted \eqn{R^2}
of at least \code{r2}, or until dropping the variable with the highest
\eqn{R^2} (adjusted or ordinary) would cause a variable that was dropped
earlier to no longer be predicted at least at the \code{r2} level from
the now smaller list of predictors.
}
\usage{
redun(formula, data=NULL, subset=NULL, r2 = 0.9,
      type = c("ordinary", "adjusted"), nk = 3, tlinear = TRUE,
      allcat=FALSE, minfreq=0, iterms=FALSE, pc=FALSE, pr = FALSE, ...)
\method{print}{redun}(x, digits=3, long=TRUE, ...)
}
24\arguments{
25  \item{formula}{a formula.  Enclose a variable in \code{I()} to force
26	linearity.}
27  \item{data}{a data frame}
28  \item{subset}{usual subsetting expression}
29  \item{r2}{ordinary or adjusted \eqn{R^2} cutoff for redundancy}
30  \item{type}{specify \code{"adjusted"} to use adjusted \eqn{R^2}}
31  \item{nk}{number of knots to use for continuous variables.  Use
32	\code{nk=0} to force linearity for all variables.}
33  \item{tlinear}{set to \code{FALSE} to allow a variable to be automatically
34	nonlinearly transformed (see \code{areg}) while being predicted.  By
35  default, only continuous variables on the right hand side (i.e., while
36  they are being predictors) are automatically transformed, using
37  regression splines.  Estimating transformations for target (dependent)
38  variables causes more overfitting than doing so for predictors.}
  \item{allcat}{set to \code{TRUE} to ensure that all categories of
    categorical variables having more than two categories are redundant
    (see details below)}
  \item{minfreq}{For a binary or categorical variable, there must be at
    least two categories with at least \code{minfreq} observations or
    the variable will be dropped and not checked for redundancy against
    other variables.  \code{minfreq} also specifies the minimum
    frequency of a category or its complement
    before that category is considered when \code{allcat=TRUE}.}
  \item{iterms}{set to \code{TRUE} to consider derived terms (dummy
    variables and nonlinear spline components) as separate variables.
    This will perform a redundancy analysis on pieces of the variables.}
  \item{pc}{if \code{iterms=TRUE} you can set \code{pc} to \code{TRUE}
    to replace the submatrix of terms corresponding to each variable
    with the orthogonal principal components before doing the redundancy
    analysis.  The components are based on the correlation matrix.}
  \item{pr}{set to \code{TRUE} to monitor progress of the stepwise algorithm}
  \item{\dots}{arguments to pass to \code{dataframeReduce} to remove
    "difficult" variables from \code{data} when \code{formula} is
    \code{~.}, i.e., when all variables in \code{data} are used
    (\code{data} must be specified when these arguments are used).
    Ignored for \code{print}.}
  \item{x}{an object created by \code{redun}}
  \item{digits}{number of digits to which to round \eqn{R^2} values when
    printing}
  \item{long}{set to \code{FALSE} to prevent the \code{print} method
    from printing the \eqn{R^2} history and the original \eqn{R^2} with
    which each variable can be predicted from all other variables.}
}
\value{an object of class \code{"redun"}}
\details{
A categorical variable is deemed
redundant if a linear combination of dummy variables representing it can
be predicted from a linear combination of other variables.  For example,
if there were four cities in the data and each city's rainfall was also
present as a variable, with virtually the same rainfall reported for all
observations for a city, city would be redundant given rainfall (or
vice versa; the one declared redundant would be the first one in the
formula).  If two cities had the same rainfall, \code{city} might be
declared redundant even though tied cities might be deemed non-redundant
in another setting.  To ensure that all categories may be predicted well
from other variables, use the \code{allcat} option.  To ignore
categories that are too infrequent or too frequent, set \code{minfreq}
to a nonzero integer.  When the number of observations in a category
is below this number, or the number of observations not in the category
is below this number, no attempt is made to individually predict
membership in that category for the purpose of redundancy detection.}
\author{
Frank Harrell
\cr
Department of Biostatistics
\cr
Vanderbilt University
\cr
\email{f.harrell@vanderbilt.edu}
}
\seealso{\code{\link{areg}}, \code{\link{dataframeReduce}},
  \code{\link{transcan}}, \code{\link{varclus}},
  \code{\link[subselect]{genetic}}}
\examples{
set.seed(1)
n <- 100
x1 <- runif(n)
x2 <- runif(n)
x3 <- x1 + x2 + runif(n)/10
x4 <- x1 + x2 + x3 + runif(n)/10
x5 <- factor(sample(c('a','b','c'),n,replace=TRUE))
x6 <- 1*(x5=='a' | x5=='c')
redun(~x1+x2+x3+x4+x5+x6, r2=.8)
redun(~x1+x2+x3+x4+x5+x6, r2=.8, minfreq=40)
redun(~x1+x2+x3+x4+x5+x6, r2=.8, allcat=TRUE)
# x5 is no longer redundant but x6 is
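
# Hedged sketch, NOT the algorithm used by redun: a linear-only analogue of
# a single redundancy step, with lm() standing in for the flexible additive
# models fitted by areg.  It shows how well each variable can be predicted
# from the remaining ones, which is the quantity compared against r2.
# The helper linearR2 and the data frame dd are hypothetical names that
# exist only for this illustration.
linearR2 <- function(d)
  sapply(names(d), function(v)
    summary(lm(reformulate(setdiff(names(d), v), response=v),
               data=d))$r.squared)
set.seed(2)
dd <- data.frame(a=runif(50), b=runif(50))
dd$s <- dd$a + dd$b + runif(50)/10   # nearly a linear function of a and b
r2lin <- linearR2(dd)
r2lin                     # ordinary R^2 for each variable given the others
names(which.max(r2lin))   # the variable this linear sketch would drop first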
}
\keyword{smooth}
\keyword{regression}
\keyword{multivariate}
\keyword{methods}
\keyword{models}
\concept{data reduction}