1\input texinfo
2@c %**start of header
3@setfilename R-intro.info
4@settitle An Introduction to R
5@setchapternewpage on
6@c %**end of header
7
8@c Authors: If you edit/add @example(s) ,  please keep
9@c  ./R-intro.R   up-to-date !
10@c  ~~~~~~~~~~~
11@syncodeindex fn vr
12
13
14@dircategory Programming
15@direntry
16* R Introduction: (R-intro).    An introduction to R.
17@end direntry
18
19@finalout
20
21@include R-defs.texi
22@include version.texi
23
24@copying
25This manual is for R, version @value{VERSION}.
26
27Copyright @copyright{} 1990 W.@: N.@: Venables@*
28Copyright @copyright{} 1992 W.@: N.@: Venables & D.@: M.@: Smith@*
29Copyright @copyright{} 1997 R.@: Gentleman & R.@: Ihaka@*
30Copyright @copyright{} 1997, 1998 M.@: Maechler@*
31@Rcopyright{1999}
32
33@quotation
34@permission{}
35@end quotation
36@end copying
37
38
39@c <FIXME>
40@c Apparently AUCTeX 11.06 has a problem with '@appendixsection' entries
41@c when updating nodes---the equivalent '@appendixsec' seems to work.
42@c Hence changed (temporarily?) ...
43@c </FIXME>
44
45@c <NOTE>
46@c Conversion to PDF fails if sectioning titles contain (user-defined)
47@c macros such as @R{}.  Hence in section titles we changed @R{} to R.
48@c Revert when this is fixed.
49@c </NOTE>
50
51@titlepage
52@title An Introduction to R
53@subtitle Notes on @R{}: A Programming Environment for Data Analysis and Graphics
54@subtitle Version @value{VERSION}
55@author W. N. Venables, D. M. Smith
56@author and the R Core Team
57@page
58@vskip 0pt plus 1filll
59@insertcopying
60@end titlepage
61
62@ifplaintext
63@insertcopying
64@end ifplaintext
65
66@contents
67
68@ifnottex
69@node Top, Preface, (dir), (dir)
70@top An Introduction to R
71
72This is an introduction to R (``GNU S''), a language and environment for
73statistical computing and graphics.  R is similar to the
74award-winning@footnote{ACM Software Systems award, 1998:
75@uref{https://awards.acm.org/award_winners/chambers_6640862.cfm}.} S
76system, which was developed at Bell Laboratories by John Chambers et al.
77It provides a wide variety of statistical and graphical techniques
78(linear and nonlinear modelling, statistical tests, time series
79analysis, classification, clustering, ...).
80
81This manual provides information on data types, programming elements,
82statistical modelling and graphics.
83
84@insertcopying
85
86@end ifnottex
87
88@menu
89* Preface::
90* Introduction and preliminaries::
91* Simple manipulations numbers and vectors::
92* Objects::
93* Factors::
94* Arrays and matrices::
95* Lists and data frames::
96* Reading data from files::
97* Probability distributions::
98* Loops and conditional execution::
99* Writing your own functions::
100* Statistical models in R::
101* Graphics::
102* Packages::
103* OS facilities::
104* A sample session::
105* Invoking R::
106* The command-line editor::
107* Function and variable index::
108* Concept index::
109* References::
110@end menu
111
112@node Preface, Introduction and preliminaries, Top, Top
113@unnumbered Preface
114
115This introduction to @R{} is derived from an original set of notes
116describing the @Sl{} and @SPLUS{} environments written in 1990--2 by
117Bill Venables and David M. Smith when at the University of Adelaide.  We
118have made a number of small changes to reflect differences between the
119@R{} and @Sl{} programs, and expanded some of the material.
120
121We would like to extend warm thanks to Bill Venables (and David Smith)
122for granting permission to distribute this modified version of the notes
123in this way, and for being a supporter of @R{} from way back.
124
125Comments and corrections are always welcome.  Please address email
126correspondence to @email{R-help@@R-project.org}.
127
128@subheading Suggestions to the reader
129
130Most @R{} novices will start with the introductory session in Appendix
131A.  This should give some familiarity with the style of @R{} sessions
132and more importantly some instant feedback on what actually happens.
133
134Many users will come to @R{} mainly for its graphical facilities.
135@xref{Graphics}, which can be read at almost any time and need not wait
136until all the preceding sections have been digested.
137
138@menu
139* Introduction and preliminaries::
140@end menu
141
142@node Introduction and preliminaries, Simple manipulations numbers and vectors, Preface, Top
143@chapter Introduction and preliminaries
144
145@menu
146* The R environment::
147* Related software and documentation::
148* R and statistics::
149* R and the window system::
150* Using R interactively::
151* Getting help::
152* R commands; case sensitivity etc::
153* Recall and correction of previous commands::
154* Executing commands from or diverting output to a file::
155* Data permanency and removing objects::
156@end menu
157
158@node The R environment, Related software and documentation, Introduction and preliminaries, Introduction and preliminaries
159@section The R environment
160
161@R{} is an integrated suite of software facilities for data
162manipulation, calculation and graphical display.  Among other things it
163has
164
165@itemize @bullet
166@item
167an effective data handling and storage facility,
168@item
169a suite of operators for calculations on arrays, in particular matrices,
170@item
171a large, coherent, integrated collection of intermediate tools for data
172analysis,
173@item
174graphical facilities for data analysis and display either directly at
175the computer or on hardcopy, and
176@item
177a well developed, simple and effective programming language (called `S')
178which includes conditionals, loops, user defined recursive functions and
179input and output facilities.  (Indeed most of the system supplied
180functions are themselves written in the @Sl{} language.)
181@end itemize
182
183The term ``environment'' is intended to characterize it as a fully
184planned and coherent system, rather than an incremental accretion of
185very specific and inflexible tools, as is frequently the case with other
186data analysis software.
187
188@R{} is very much a vehicle for newly developing methods of interactive
189data analysis.  It has developed rapidly, and has been extended by a
190large collection of @emph{packages}.  However, most programs written in
191@R{} are essentially ephemeral, written for a single piece of data
192analysis.
193
194@node Related software and documentation, R and statistics, The R environment, Introduction and preliminaries
195@section Related software and documentation
196
197@R{} can be regarded as an implementation of the @Sl{} language which
198was developed at Bell Laboratories by Rick Becker, John Chambers and
199Allan Wilks, and also forms the basis of the @SPLUS{} systems.
200
201The evolution of the @Sl{} language is characterized by four books by
202John Chambers and coauthors.  For @R{}, the basic reference is @emph{The
203New @Sl{} Language: A Programming Environment for Data Analysis and
204Graphics} by Richard A.@: Becker, John M.@: Chambers and Allan R.@:
205Wilks.  The new features of the 1991 release of @Sl{}
206@c (@Sl{} version 3) JMC says the 1988 version is S3.
207are covered in @emph{Statistical Models in @Sl{}} edited by John M.@:
208Chambers and Trevor J.@: Hastie.  The formal methods and classes of the
209@pkg{methods} package are based on those described in @emph{Programming
210with Data} by John M.@: Chambers.  @xref{References}, for precise
211references.
212
213There are now a number of books which describe how to use @R{} for data
214analysis and statistics, and documentation for @Sl{}/@SPLUS{} can
215typically be used with @R{}, keeping the differences between the @Sl{}
216implementations in mind.  @xref{What documentation exists for R?, , ,
217R-FAQ, The R statistical system FAQ}.
218
219@node R and statistics, R and the window system, Related software and documentation, Introduction and preliminaries
220@section R and statistics
221@cindex Packages
222
223Our introduction to the @R{} environment did not mention
224@emph{statistics}, yet many people use @R{} as a statistics system.  We
225prefer to think of it of an environment within which many classical and
226modern statistical techniques have been implemented.  A few of these are
227built into the base @R{} environment, but many are supplied as
228@emph{packages}.  There are about 25 packages supplied with @R{} (called
229``standard'' and ``recommended'' packages) and many more are available
230through the @acronym{CRAN} family of Internet sites (via
231@uref{https://CRAN.R-project.org}) and elsewhere.  More details on
232packages are given later (@pxref{Packages}).
233
234Most classical statistics and much of the latest methodology is
235available for use with @R{}, but users may need to be prepared to do a
236little work to find it.
237
238There is an important difference in philosophy between @Sl{} (and hence
239@R{}) and the other main statistical systems.  In @Sl{} a statistical
240analysis is normally done as a series of steps, with intermediate
241results being stored in objects.  Thus whereas SAS and SPSS will give
242copious output from a regression or discriminant analysis, @R{} will
243give minimal output and store the results in a fit object for subsequent
244interrogation by further @R{} functions.
245
246@node R and the window system, Using R interactively, R and statistics, Introduction and preliminaries
247@section R and the window system
248
249The most convenient way to use @R{} is at a graphics workstation running
250a windowing system.  This guide is aimed at users who have this
251facility.  In particular we will occasionally refer to the use of @R{}
252on an X window system although the vast bulk of what is said applies
253generally to any implementation of the @R{} environment.
254
255Most users will find it necessary to interact directly with the
256operating system on their computer from time to time.  In this guide, we
257mainly discuss interaction with the operating system on UNIX machines.
258If you are running @R{} under Windows or macOS you will need to make
259some small adjustments.
260
261Setting up a workstation to take full advantage of the customizable
262features of @R{} is a straightforward if somewhat tedious procedure, and
263will not be considered further here.  Users in difficulty should seek
264local expert help.
265
266@node Using R interactively, Getting help, R and the window system, Introduction and preliminaries
267@section Using R interactively
268
269When you use the @R{} program it issues a prompt when it expects input
270commands.  The default prompt is @samp{@code{>}}, which on UNIX might be
271the same as the shell prompt, and so it may appear that nothing is
272happening.  However, as we shall see, it is easy to change to a
273different @R{} prompt if you wish.  We will assume that the UNIX shell
274prompt is @samp{@code{$}}.
275
276In using @R{} under UNIX the suggested procedure for the first occasion
277is as follows:
278
279@enumerate
280@item
281Create a separate sub-directory, say @file{work}, to hold data files on
282which you will use @R{} for this problem.  This will be the working
283directory whenever you use @R{} for this particular problem.
284
285@example
286$ mkdir work
287$ cd work
288@end example
289
290@item
291Start the @R{} program with the command
292
293@example
294$ R
295@end example
296
297@item
298At this point @R{} commands may be issued (see later).
299
300@item
301To quit the @R{} program the command is
302
303@example
304> q()
305@end example
306
307At this point you will be asked whether you want to save the data from
308your @R{} session.  On some systems this will bring up a dialog box, and
309on others you will receive a text prompt to which you can respond
310@kbd{yes}, @kbd{no} or @kbd{cancel} (a single letter abbreviation will
311do) to save the data before quitting, quit without saving, or return to
312the @R{} session.  Data which is saved will be available in future @R{}
313sessions.
314
315@end enumerate
316
317Further @R{} sessions are simple.
318
319@enumerate
320
321@item
322Make @file{work} the working directory and start the program as before:
323
324@example
325$ cd work
326$ R
327@end example
328
329@item
330Use the @R{} program, terminating with the @code{q()} command at the end
331of the session.
332
333@end enumerate
334
335To use @R{} under Windows the procedure to
336follow is basically the same.  Create a folder as the working directory,
337and set that in the @file{Start In} field in your @R{} shortcut.
338Then launch @R{} by double clicking on the icon.
339
340@section An introductory session
341
342Readers wishing to get a feel for @R{} at a computer before proceeding
343are strongly advised to work through the introductory session
344given in @ref{A sample session}.
345
346@node Getting help, R commands; case sensitivity etc, Using R interactively, Introduction and preliminaries
347@section Getting help with functions and features
348@findex help
349
350@R{} has an inbuilt help facility similar to the @code{man} facility of
351UNIX.  To get more information on any specific named function, for
352example @code{solve}, the command is
353
354@example
355> help(solve)
356@end example
357@findex help
358
359An alternative is
360
361@example
362> ?solve
363@end example
364@findex ?
365
366For a feature specified by special characters, the argument must be
367enclosed in double or single quotes, making it a ``character string'':
368This is also necessary for a few words with syntactic meaning including
369@code{if}, @code{for} and @code{function}.
370
371@example
372> help("[[")
373@end example
374
375Either form of quote mark may be used to escape the other, as in the
376string @code{"It's important"}.  Our convention is to use
377double quote marks for preference.
378
379On most @R{} installations help is available in @HTML{} format by
380running
381
382@example
383> help.start()
384@end example
385@findex help.start
386
387@noindent
388which will launch a Web browser that allows the help pages to be browsed
389with hyperlinks.  On UNIX, subsequent help requests are sent to the
390@HTML{}-based help system.  The `Search Engine and Keywords' link in the
391page loaded by @code{help.start()} is particularly useful as it is
392contains a high-level concept list which searches though available
393functions.  It can be a great way to get your bearings quickly and to
394understand the breadth of what @R{} has to offer.
395
396@findex help.search
397The @code{help.search} command (alternatively @code{??})
398allows searching for help in various
399ways. For example,
400
401@example
402> ??solve
403@end example
404@findex ??
405
406Try @code{?help.search} for details and more examples.
407
408The examples on a help topic can normally be run by
409
410@example
411> example(@var{topic})
412@end example
413@findex example
414
415Windows versions of @R{} have other optional help systems: use
416
417@example
418> ?help
419@end example
420
421@noindent
422for further details.
423
424@node R commands; case sensitivity etc, Recall and correction of previous commands, Getting help, Introduction and preliminaries
425@section R commands, case sensitivity, etc.
426
427Technically @R{} is an @emph{expression language} with a very simple
428syntax.  It is @emph{case sensitive} as are most UNIX based packages, so
429@code{A} and @code{a} are different symbols and would refer to different
430variables.  The set of symbols which can be used in @R{} names depends
431on the operating system and country within which @R{} is being run
432(technically on the @emph{locale} in use).  Normally all alphanumeric
433symbols are allowed@footnote{For portable @R{} code (including that to
434be used in @R{} packages) only A--Za--z0--9 should be used.} (and in
435some countries this includes accented letters) plus @samp{@code{.}} and
436@samp{@code{_}}, with the restriction that a name must start with
437@samp{@code{.}} or a letter, and if it starts with @samp{@code{.}} the
438second character must not be a digit.  Names are effectively
439unlimited in length.
440
441Elementary commands consist of either @emph{expressions} or
442@emph{assignments}.  If an expression is given as a command, it is
443evaluated, printed (unless specifically made invisible), and the value
444is lost.  An assignment also evaluates an expression and passes the
445value to a variable but the result is not automatically printed.
446
447Commands are separated either by a semi-colon (@samp{@code{;}}), or by a
448newline.  Elementary commands can be grouped together into one compound
449expression by braces (@samp{@code{@{}} and @samp{@code{@}}}).
450@emph{Comments} can be put almost@footnote{@strong{not} inside strings,
451nor within the argument list of a function definition} anywhere,
452starting with a hashmark (@samp{@code{#}}), everything to the end of the
453line is a comment.
454
455If a command is not complete at the end of a line, @R{} will
456give a different prompt, by default
457
458@example
459+
460@end example
461
462@noindent
463on second and subsequent lines and continue to read input until the
464command is syntactically complete.  This prompt may be changed by the
465user.  We will generally omit the continuation prompt
466and indicate continuation by simple indenting.
467
468Command lines entered at the console are limited@footnote{some of the
469consoles will not allow you to enter more, and amongst those which do
470some will silently discard the excess and some will use it as the start
471of the next line.} to about 4095 bytes (not characters).
472
473@node Recall and correction of previous commands, Executing commands from or diverting output to a file, R commands; case sensitivity etc, Introduction and preliminaries
474@section Recall and correction of previous commands
475
476Under many versions of UNIX and on Windows, @R{} provides a mechanism
477for recalling and re-executing previous commands.  The vertical arrow
478keys on the keyboard can be used to scroll forward and backward through
479a @emph{command history}.  Once a command is located in this way, the
480cursor can be moved within the command using the horizontal arrow keys,
481and characters can be removed with the @key{DEL} key or added with the
482other keys.  More details are provided later: @pxref{The command-line
483editor}.
484
485The recall and editing capabilities under UNIX are highly customizable.
486You can find out how to do this by reading the manual entry for the
487@strong{readline} library.
488
489Alternatively, the Emacs text editor provides more general support
490mechanisms (via @acronym{ESS}, @emph{Emacs Speaks Statistics}) for
491working interactively with @R{}.  @xref{R and Emacs, , , R-FAQ, The R
492statistical system FAQ}.
493
494@node Executing commands from or diverting output to a file, Data permanency and removing objects, Recall and correction of previous commands, Introduction and preliminaries
495@section Executing commands from or diverting output to a file
496@cindex Diverting input and output
497
498If commands@footnote{of unlimited length.} are stored in an external
499file, say @file{commands.R} in the working directory @file{work}, they
500may be executed at any time in an @R{} session with the command
501
502@example
503> source("commands.R")
504@end example
505@findex source
506
507For Windows @strong{Source} is also available on the
508@strong{File} menu.  The function @code{sink},
509
510@example
511> sink("record.lis")
512@end example
513@findex sink
514
515@noindent
516will divert all subsequent output from the console to an external file,
517@file{record.lis}.  The command
518
519@example
520> sink()
521@end example
522
523@noindent
524restores it to the console once again.
525
526@node Data permanency and removing objects,  , Executing commands from or diverting output to a file, Introduction and preliminaries
527@section Data permanency and removing objects
528
529The entities that @R{} creates and manipulates are known as
530@emph{objects}.  These may be variables, arrays of numbers, character
531strings, functions, or more general structures built from such
532components.
533
534During an @R{} session, objects are created and stored by name (we
535discuss this process in the next section).  The @R{} command
536
537@example
538> objects()
539@end example
540
541@noindent
542(alternatively, @code{ls()}) can be used to display the names of (most
543of) the objects which are currently stored within @R{}.  The collection
544of objects currently stored is called the @emph{workspace}.
545@cindex Workspace
546
547To remove objects the function @code{rm} is available:
548
549@example
550> rm(x, y, z, ink, junk, temp, foo, bar)
551@end example
552@findex rm
553@cindex Removing objects
554
555All objects created during an @R{} session can be stored permanently in
556a file for use in future @R{} sessions.  At the end of each @R{} session
557you are given the opportunity to save all the currently available
558objects.  If you indicate that you want to do this, the objects are
559written to a file called @file{.RData}@footnote{The leading ``dot'' in
560this file name makes it @emph{invisible} in normal file listings in
561UNIX, and in default GUI file listings on macOS and Windows.} in the
562current directory, and the command lines used in the session are saved
563to a file called @file{.Rhistory}.
564
565When @R{} is started at later time from the same directory it reloads
566the workspace from this file.  At the same time the associated commands
567history is reloaded.
568
569It is recommended that you should use separate working directories for
570analyses conducted with @R{}.  It is quite common for objects with names
571@code{x} and @code{y} to be created during an analysis.  Names like this
572are often meaningful in the context of a single analysis, but it can be
573quite hard to decide what they might be when the several analyses have
574been conducted in the same directory.
575
576@node Simple manipulations numbers and vectors, Objects, Introduction and preliminaries, Top
577@chapter Simple manipulations; numbers and vectors
578@cindex Vectors
579
580@menu
581* Vectors and assignment::
582* Vector arithmetic::
583* Generating regular sequences::
584* Logical vectors::
585* Missing values::
586* Character vectors::
587* Index vectors::
588* Other types of objects::
589@end menu
590
591@node Vectors and assignment, Vector arithmetic, Simple manipulations numbers and vectors, Simple manipulations numbers and vectors
592@section Vectors and assignment
593
594@R{} operates on named @emph{data structures}.  The simplest such
595structure is the numeric @emph{vector}, which is a single entity
596consisting of an ordered collection of numbers.  To set up a vector
597named @code{x}, say, consisting of five numbers, namely 10.4, 5.6, 3.1,
5986.4 and 21.7, use the @R{} command
599
600@example
601> x <- c(10.4, 5.6, 3.1, 6.4, 21.7)
602@end example
603@findex c
604@findex vector
605
606This is an @emph{assignment} statement using the @emph{function}
607@code{c()} which in this context can take an arbitrary number of vector
608@emph{arguments} and whose value is a vector got by concatenating its
609arguments end to end.@footnote{With other than vector types of argument,
610such as @code{list} mode arguments, the action of @code{c()} is rather
611different.  See @ref{Concatenating lists}.}
612
613A number occurring by itself in an expression is taken as a vector of
614length one.
615
616Notice that the assignment operator (@samp{@code{<-}}), which consists
617of the two characters @samp{@code{<}} (``less than'') and
618@samp{@code{-}} (``minus'') occurring strictly side-by-side and it
619`points' to the object receiving the value of the expression.
620In most contexts the @samp{@code{=}} operator can be used as an alternative.
621@c In this text, the assignment operator is printed as @samp{<-}, rather
622@c than ``@code{<-}''.
623@cindex Assignment
624
625Assignment can also be made using the function @code{assign()}.  An
626equivalent way of making the same assignment as above is with:
627
628@example
629> assign("x", c(10.4, 5.6, 3.1, 6.4, 21.7))
630@end example
631
632@noindent
633The usual operator, @code{<-}, can be thought of as a syntactic
634short-cut to this.
635
636Assignments can also be made in the other direction, using the obvious
637change in the assignment operator.  So the same assignment could be made
638using
639
640@example
641> c(10.4, 5.6, 3.1, 6.4, 21.7) -> x
642@end example
643
644If an expression is used as a complete command, the value is printed
645@emph{and lost}@footnote{Actually, it is still available as
646@code{.Last.value} before any other statements are executed.}.  So now if we
647were to use the command
648
649@example
650> 1/x
651@end example
652
653@noindent
654the reciprocals of the five values would be printed at the terminal (and
655the value of @code{x}, of course, unchanged).
656
657The further assignment
658
659@example
660> y <- c(x, 0, x)
661@end example
662
663@noindent
664would create a vector @code{y} with 11 entries consisting of two copies
665of @code{x} with a zero in the middle place.
666
667@node Vector arithmetic, Generating regular sequences, Vectors and assignment, Simple manipulations numbers and vectors
668@section Vector arithmetic
669
670Vectors can be used in arithmetic expressions, in which case the
671operations are performed element by element.  Vectors occurring in the
672same expression need not all be of the same length.  If they are not,
673the value of the expression is a vector with the same length as the
674longest vector which occurs in the expression.  Shorter vectors in the
675expression are @emph{recycled} as often as need be (perhaps
676fractionally) until they match the length of the longest vector.  In
677particular a constant is simply repeated.  So with the above assignments
678the command
679@cindex Recycling rule
680
681@example
682> v <- 2*x + y + 1
683@end example
684
685@noindent
686generates a new vector @code{v} of length 11 constructed by adding
687together, element by element, @code{2*x} repeated 2.2 times, @code{y}
688repeated just once, and @code{1} repeated 11 times.
689
690@cindex Arithmetic functions and operators
691The elementary arithmetic operators are the usual @code{+}, @code{-},
692@code{*}, @code{/} and @code{^} for raising to a power.
693@findex +
694@findex -
695@findex *
696@findex /
697@findex ^
698In addition all of the common arithmetic functions are available.
699@code{log}, @code{exp}, @code{sin}, @code{cos}, @code{tan}, @code{sqrt},
700and so on, all have their usual meaning.
701@findex log
702@findex exp
703@findex sin
704@findex cos
705@findex tan
706@findex sqrt
707@code{max} and @code{min} select the largest and smallest elements of a
708vector respectively.
709@findex max
710@findex min
711@code{range} is a function whose value is a vector of length two, namely
712@code{c(min(x), max(x))}.
713@findex range
714@code{length(x)} is the number of elements in @code{x},
715@findex length
716@code{sum(x)} gives the total of the elements in @code{x},
717@findex sum
718and @code{prod(x)} their product.
719@findex prod
720
721Two statistical functions are @code{mean(x)} which calculates the sample
722mean, which is the same as @code{sum(x)/length(x)},
723@findex mean
724and @code{var(x)} which gives
725
726@example
727sum((x-mean(x))^2)/(length(x)-1)
728@end example
729@findex var
730
731@noindent
732or sample variance.  If the argument to @code{var()} is an
733@math{n}-by-@math{p} matrix the value is a @math{p}-by-@math{p} sample
734covariance matrix got by regarding the rows as independent
735@math{p}-variate sample vectors.
736
737@code{sort(x)} returns a vector of the same size as @code{x} with the
738elements arranged in increasing order; however there are other more
739flexible sorting facilities available (see @code{order()} or
740@code{sort.list()} which produce a permutation to do the sorting).
741@findex sort
742@findex order
743
744Note that @code{max} and @code{min} select the largest and smallest
745values in their arguments, even if they are given several vectors.  The
746@emph{parallel} maximum and minimum functions @code{pmax} and
747@code{pmin} return a vector (of length equal to their longest argument)
748that contains in each element the largest (smallest) element in that
749position in any of the input vectors.
750@findex pmax
751@findex pmin
752
753For most purposes the user will not be concerned if the ``numbers'' in a
754numeric vector are integers, reals or even complex.  Internally
755calculations are done as double precision real numbers, or double
756precision complex numbers if the input data are complex.
757
758To work with complex numbers, supply an explicit complex part.  Thus
759
760@example
761sqrt(-17)
762@end example
763
764@noindent
765will give @code{NaN} and a warning, but
766
767@example
768sqrt(-17+0i)
769@end example
770
771@noindent
772will do the computations as complex numbers.
773
774@menu
775* Generating regular sequences::
776@end menu
777
778@node Generating regular sequences, Logical vectors, Vector arithmetic, Simple manipulations numbers and vectors
779@section Generating regular sequences
780@cindex Regular sequences
781
782@R{} has a number of facilities for generating commonly used sequences
783of numbers.  For example @code{1:30} is the vector @code{c(1, 2,
784@dots{}, 29, 30)}.
785@c <NOTE>
786@c Info cannot handle ':' as an index entry.
787@ifnotinfo
788@findex :
789@end ifnotinfo
790@c </NOTE>
791The colon operator has high priority within an expression, so, for
792example @code{2*1:15} is the vector @code{c(2, 4, @dots{}, 28, 30)}.
793Put @code{n <- 10} and compare the sequences @code{1:n-1} and
794@code{1:(n-1)}.
795
796The construction @code{30:1} may be used to generate a sequence
797backwards.
798
799@findex seq
800The function @code{seq()} is a more general facility for generating
801sequences.  It has five arguments, only some of which may be specified
802in any one call.  The first two arguments, if given, specify the
803beginning and end of the sequence, and if these are the only two
804arguments given the result is the same as the colon operator.  That is
805@code{seq(2,10)} is the same vector as @code{2:10}.
806
807Arguments to @code{seq()}, and to many other @R{} functions, can also
808be given in named form, in which case the order in which they appear is
809irrelevant.  The first two arguments may be named
810@code{from=@var{value}} and @code{to=@var{value}}; thus
811@code{seq(1,30)}, @code{seq(from=1, to=30)} and @code{seq(to=30,
812from=1)} are all the same as @code{1:30}.  The next two arguments to
813@code{seq()} may be named @code{by=@var{value}} and
814@code{length=@var{value}}, which specify a step size and a length for
815the sequence respectively.  If neither of these is given, the default
816@code{by=1} is assumed.
817
818For example
819
820@example
821> seq(-5, 5, by=.2) -> s3
822@end example
823
824@noindent
825generates in @code{s3} the vector @code{c(-5.0, -4.8, -4.6, @dots{},
8264.6, 4.8, 5.0)}.  Similarly
827
828@example
829> s4 <- seq(length=51, from=-5, by=.2)
830@end example
831
832@noindent
833generates the same vector in @code{s4}.
834
835The fifth argument may be named @code{along=@var{vector}}, which is
836normally used as the only argument to create the sequence @code{1, 2,
837@dots{}, length(@var{vector})}, or the empty sequence if the vector is
838empty (as it can be).
839
840A related function is @code{rep()}
841@findex rep
842which can be used for replicating an object in various complicated ways.
843The simplest form is
844
845@example
846> s5 <- rep(x, times=5)
847@end example
848
849@noindent
850which will put five copies of @code{x} end-to-end in @code{s5}.  Another
851useful version is
852
853@example
854> s6 <- rep(x, each=5)
855@end example
856
857@noindent
858which repeats each element of @code{x} five times before moving on to
859the next.
860
861@node Logical vectors, Missing values, Generating regular sequences, Simple manipulations numbers and vectors
862@section Logical vectors
863
864As well as numerical vectors, @R{} allows manipulation of logical
865quantities.  The elements of a logical vector can have the values
866@code{TRUE}, @code{FALSE}, and @code{NA} (for ``not available'', see
867below).  The first two are often abbreviated as @code{T} and @code{F},
868respectively.  Note however that @code{T} and @code{F} are just
869variables which are set to @code{TRUE} and @code{FALSE} by default, but
870are not reserved words and hence can be overwritten by the user.  Hence,
871you should always use @code{TRUE} and @code{FALSE}.
872@findex FALSE
873@findex TRUE
874@findex F
875@findex T
876
877Logical vectors are generated by @emph{conditions}.  For example
878
879@example
880> temp <- x > 13
881@end example
882
883@noindent
884sets @code{temp} as a vector of the same length as @code{x} with values
885@code{FALSE} corresponding to elements of @code{x} where the condition
886is @emph{not} met and @code{TRUE} where it is.
887
888The logical operators are @code{<}, @code{<=}, @code{>}, @code{>=},
889@code{==} for exact equality and @code{!=} for inequality.
890@findex <
891@findex <=
892@findex >
893@findex >=
894@findex ==
895@findex !=
896In addition if @code{c1} and @code{c2} are logical expressions, then
897@w{@code{c1 & c2}} is their intersection (@emph{``and''}), @w{@code{c1 | c2}}
898is their union (@emph{``or''}), and @code{!c1} is the negation of
899@code{c1}.
900@findex !
901@findex |
902@findex &
903
904Logical vectors may be used in ordinary arithmetic, in which case they
905are @emph{coerced} into numeric vectors, @code{FALSE} becoming @code{0}
906and @code{TRUE} becoming @code{1}.  However there are situations where
907logical vectors and their coerced numeric counterparts are not
908equivalent, for example see the next subsection.
909
910@node Missing values, Character vectors, Logical vectors, Simple manipulations numbers and vectors
911@section Missing values
912@cindex Missing values
913
914In some cases the components of a vector may not be completely
915known.  When an element or value is ``not available'' or a ``missing
916value'' in the statistical sense, a place within a vector may be
917reserved for it by assigning it the special value @code{NA}.
918@findex NA
919In general any operation on an @code{NA} becomes an @code{NA}.  The
920motivation for this rule is simply that if the specification of an
921operation is incomplete, the result cannot be known and hence is not
922available.
923
924@findex is.na
925The function @code{is.na(x)} gives a logical vector of the same size as
926@code{x} with value @code{TRUE} if and only if the corresponding element
927in @code{x} is @code{NA}.
928
929@example
930> z <- c(1:3,NA);  ind <- is.na(z)
931@end example
932
933Notice that the logical expression @code{x == NA} is quite different
934from @code{is.na(x)} since @code{NA} is not really a value but a marker
935for a quantity that is not available.  Thus @code{x == NA} is a vector
936of the same length as @code{x} @emph{all} of whose values are @code{NA}
937as the logical expression itself is incomplete and hence undecidable.
938
939Note that there is a second kind of ``missing'' values which are
940produced by numerical computation, the so-called @emph{Not a Number},
941@code{NaN},
942@findex NaN
943values.  Examples are
944
945@example
946> 0/0
947@end example
948
949@noindent
950or
951
952@example
953> Inf - Inf
954@end example
955
956@noindent
957which both give @code{NaN} since the result cannot be defined sensibly.
958
959In summary, @code{is.na(xx)} is @code{TRUE} @emph{both} for @code{NA}
960and @code{NaN} values.  To differentiate these, @code{is.nan(xx)} is only
961@code{TRUE} for @code{NaN}s.
962@findex is.nan
963
964Missing values are sometimes printed as @code{<NA>} when character
965vectors are printed without quotes.
966
967@node Character vectors, Index vectors, Missing values, Simple manipulations numbers and vectors
968@section Character vectors
969@cindex Character vectors
970
971Character quantities and character vectors are used frequently in @R{},
972for example as plot labels.  Where needed they are denoted by a sequence
973of characters delimited by the double quote character, e.g.,
974@code{"x-values"}, @code{"New iteration results"}.
975
976Character strings are entered using either matching double (@code{"}) or
977single (@code{'}) quotes, but are printed using double quotes (or
978sometimes without quotes).  They use C-style escape sequences, using
979@code{\} as the escape character, so @code{\} is entered and printed as
980@code{\\}, and inside double quotes @code{"} is entered as @code{\"}.
981Other useful escape sequences are @code{\n}, newline, @code{\t}, tab and
982@code{\b}, backspace---see @command{?Quotes} for a full list.
983
984Character vectors may be concatenated into a vector by the @code{c()}
985function; examples of their use will emerge frequently.
986@findex c
987
988@findex paste
989The @code{paste()} function takes an arbitrary number of arguments and
990concatenates them one by one into character strings.  Any numbers given
991among the arguments are coerced into character strings in the evident
992way, that is, in the same way they would be if they were printed.  The
993arguments are by default separated in the result by a single blank
994character, but this can be changed by the named argument,
995@code{sep=@var{string}}, which changes it to @code{@var{string}},
996possibly empty.
997
998For example
999
1000@example
1001> labs <- paste(c("X","Y"), 1:10, sep="")
1002@end example
1003
1004@noindent
1005makes @code{labs} into the character vector
1006
1007@example
1008c("X1", "Y2", "X3", "Y4", "X5", "Y6", "X7", "Y8", "X9", "Y10")
1009@end example
1010
1011Note particularly that recycling of short lists takes place here too;
1012thus @code{c("X", "Y")} is repeated 5 times to match the sequence
1013@code{1:10}.
1014@footnote{@code{paste(..., collapse=@var{ss})} joins the
1015arguments into a single character string putting @var{ss} in between, e.g.,
1016@code{ss <- "|"}.  There are more tools for character manipulation, see the help
1017for @code{sub} and @code{substring}.}
1018
1019@node Index vectors, Other types of objects, Character vectors, Simple manipulations numbers and vectors
1020@section Index vectors; selecting and modifying subsets of a data set
1021@cindex Indexing vectors
1022
1023Subsets of the elements of a vector may be selected by appending to the
1024name of the vector an @emph{index vector} in square brackets.  More
1025generally any expression that evaluates to a vector may have subsets of
1026its elements similarly selected by appending an index vector in square
1027brackets immediately after the expression.
1028
1029@c FIXME: Add a forward reference to  subset()  here
1030@c FIXME  and  add a paragraph about subset() {which needs to come after
1031@c FIXME  data frames ...
1032
1033Such index vectors can be any of four distinct types.
1034
1035@enumerate
1036
1037@item
1038@strong{A logical vector}.  In this case the index vector is recycled to the
1039same length as the vector from which elements are to be selected.
1040Values corresponding to @code{TRUE} in the index vector are selected and
1041those corresponding to @code{FALSE} are omitted.  For example
1042
1043@example
1044> y <- x[!is.na(x)]
1045@end example
1046
1047@noindent
1048creates (or re-creates) an object @code{y} which will contain the
1049non-missing values of @code{x}, in the same order.  Note that if
1050@code{x} has missing values, @code{y} will be shorter than @code{x}.
1051Also
1052
1053@example
1054> (x+1)[(!is.na(x)) & x>0] -> z
1055@end example
1056
1057@noindent
1058creates an object @code{z} and places in it the values of the vector
1059@code{x+1} for which the corresponding value in @code{x} was both
1060non-missing and positive.
1061
1062@item
1063@strong{A vector of positive integral quantities}.  In this case the
1064values in the index vector must lie in the set @{1, 2, @dots{},
1065@code{length(x)}@}.  The corresponding elements of the vector are
1066selected and concatenated, @emph{in that order}, in the result.  The
1067index vector can be of any length and the result is of the same length
1068as the index vector.  For example @code{x[6]} is the sixth component of
1069@code{x} and
1070
1071@example
1072> x[1:10]
1073@end example
1074
1075@noindent
1076selects the first 10 elements of @code{x} (assuming @code{length(x)} is
1077not less than 10).  Also
1078
1079@example
1080> c("x","y")[rep(c(1,2,2,1), times=4)]
1081@end example
1082
1083@noindent
1084(an admittedly unlikely thing to do) produces a character vector of
1085length 16 consisting of @code{"x", "y", "y", "x"} repeated four times.
1086
1087@item
1088@strong{A vector of negative integral quantities}.  Such an index vector
1089specifies the values to be @emph{excluded} rather than included.  Thus
1090
1091@example
1092> y <- x[-(1:5)]
1093@end example
1094
1095@noindent
1096gives @code{y} all but the first five elements of @code{x}.
1097
1098@item
1099@strong{A vector of character strings}.  This possibility only applies
1100where an object has a @code{names} attribute to identify its components.
1101In this case a sub-vector of the names vector may be used in the same way
1102as the positive integral labels in item 2 further above.
1103
1104@example
1105> fruit <- c(5, 10, 1, 20)
1106> names(fruit) <- c("orange", "banana", "apple", "peach")
1107> lunch <- fruit[c("apple","orange")]
1108@end example
1109
1110The advantage is that alphanumeric @emph{names} are often easier to
1111remember than @emph{numeric indices}.  This option is particularly
1112useful in connection with data frames, as we shall see later.
1113
1114@end enumerate
1115
1116An indexed expression can also appear on the receiving end of an
1117assignment, in which case the assignment operation is performed
1118@emph{only on those elements of the vector}.  The expression must be of
1119the form @code{vector[@var{index_vector}]} as having an arbitrary
1120expression in place of the vector name does not make much sense here.
1121
1122For example
1123
1124@example
1125> x[is.na(x)] <- 0
1126@end example
1127
1128@noindent
1129replaces any missing values in @code{x} by zeros and
1130
1131@example
1132> y[y < 0] <- -y[y < 0]
1133@end example
1134
1135@noindent
1136has the same effect as
1137
1138@example
1139> y <- abs(y)
1140@end example
1141
1142@node Other types of objects,  , Index vectors, Simple manipulations numbers and vectors
1143@section Other types of objects
1144
1145Vectors are the most important type of object in @R{}, but there are
1146several others which we will meet more formally in later sections.
1147
1148@itemize @bullet
1149@item
1150@emph{matrices} or more generally @emph{arrays} are multi-dimensional
1151generalizations of vectors.  In fact, they @emph{are} vectors that can
1152be indexed by two or more indices and will be printed in special ways.
1153@xref{Arrays and matrices}.
1154
1155@item
1156@emph{factors} provide compact ways to handle categorical data.
1157@xref{Factors}.
1158
1159@item
1160@emph{lists} are a general form of vector in which the various elements
1161need not be of the same type, and are often themselves vectors or lists.
1162Lists provide a convenient way to return the results of a statistical
1163computation.  @xref{Lists}.
1164
1165@item
1166@emph{data frames} are matrix-like structures, in which the columns can
1167be of different types.  Think of data frames as `data matrices' with one
1168row per observational unit but with (possibly) both numerical and
1169categorical variables.  Many experiments are best described by data
1170frames: the treatments are categorical but the response is numeric.
1171@xref{Data frames}.
1172
1173@item
1174@emph{functions} are themselves objects in @R{} which can be stored in
1175the project's workspace.  This provides a simple and convenient way to
1176extend @R{}.  @xref{Writing your own functions}.
1177
1178@end itemize
1179
1180@node Objects, Factors, Simple manipulations numbers and vectors, Top
1181@chapter Objects, their modes and attributes
1182@cindex Objects
1183@cindex Attributes
1184
1185@c <FIXME>
1186@c This needs to be re-written for R.  We really have data types (as
1187@c returned by typeof()) and that functions mode() and storage.mode()
1188@c are for S compatibility mostly.  Hence in particular, there is no
1189@c intrinsic attribute `mode'.
1190
1191@menu
1192* The intrinsic attributes mode and length::
1193* Changing the length of an object::
1194* Getting and setting attributes::
1195* The class of an object::
1196@end menu
1197
1198@node The intrinsic attributes mode and length, Changing the length of an object, Objects, Objects
1199@section Intrinsic attributes: mode and length
1200
1201The entities @R{} operates on are technically known as @emph{objects}.
1202Examples are vectors of numeric (real) or complex values, vectors of
1203logical values and vectors of character strings.  These are known as
1204``atomic'' structures since their components are all of the same type,
1205or @emph{mode}, namely @emph{numeric}@footnote{@emph{numeric} mode is
1206actually an amalgam of two distinct modes, namely @emph{integer} and
1207@emph{double} precision, as explained in the manual.}, @emph{complex},
1208@emph{logical}, @emph{character} and @emph{raw}.
1209
1210Vectors must have their values @emph{all of the same mode}.  Thus any
1211given vector must be unambiguously either @emph{logical},
1212@emph{numeric}, @emph{complex}, @emph{character} or @emph{raw}.  (The
1213only apparent exception to this rule is the special ``value'' listed as
1214@code{NA} for quantities not available, but in fact there are several
1215types of @code{NA}).  Note that a vector can be empty and still have a
1216mode.  For example the empty character string vector is listed as
1217@code{character(0)} and the empty numeric vector as @code{numeric(0)}.
1218
1219@R{} also operates on objects called @emph{lists}, which are of mode
1220@emph{list}.  These are ordered sequences of objects which individually
1221can be of any mode.  @emph{lists} are known as ``recursive'' rather than
1222atomic structures since their components can themselves be lists in
1223their own right.
1224
1225The other recursive structures are those of mode @emph{function} and
1226@emph{expression}.  Functions are the objects that form part of the @R{}
1227system along with similar user written functions, which we discuss in
1228some detail later.  Expressions as objects form an
1229advanced part of @R{} which will not be discussed in this guide, except
1230indirectly when we discuss @emph{formulae} used with modeling in @R{}.
1231
1232By the @emph{mode} of an object we mean the basic type of its
1233fundamental constituents.  This is a special case of a ``property''
1234of an object.  Another property of every object is its @emph{length}.  The
1235functions @code{mode(@var{object})} and @code{length(@var{object})} can be
1236used to find out the mode and length of any defined structure
1237@footnote{Note however that @code{length(@var{object})} does not always
1238contain intrinsic useful information, e.g., when @code{@var{object}} is a
1239function.}.
1240
1241Further properties of an object are usually provided by
1242@code{attributes(@var{object})}, see @ref{Getting and setting attributes}.
1243Because of this, @emph{mode} and @emph{length} are also called ``intrinsic
1244attributes'' of an object.
1245@findex mode
1246@findex length
1247
1248For example, if @code{z} is a complex vector of length 100, then in an
1249expression @code{mode(z)} is the character string @code{"complex"} and
1250@code{length(z)} is @code{100}.
1251
1252@R{} caters for changes of mode almost anywhere it could be considered
1253sensible to do so, (and a few where it might not be).  For example with
1254
1255@example
1256> z <- 0:9
1257@end example
1258
1259@noindent
1260we could put
1261
1262@example
1263> digits <- as.character(z)
1264@end example
1265
1266@noindent
1267after which @code{digits} is the character vector @code{c("0", "1", "2",
1268@dots{}, "9")}.  A further @emph{coercion}, or change of mode,
1269reconstructs the numerical vector again:
1270
1271@example
1272> d <- as.integer(digits)
1273@end example
1274
1275@noindent
1276Now @code{d} and @code{z} are the same.@footnote{In general, coercion
1277from numeric to character and back again will not be exactly reversible,
1278because of roundoff errors in the character representation.}  There is a
1279large collection of functions of the form @code{as.@var{something}()}
1280for either coercion from one mode to another, or for investing an object
1281with some other attribute it may not already possess.  The reader should
1282consult the different help files to become familiar with them.
1283
1284@c </FIXME>
1285
1286@node Changing the length of an object, Getting and setting attributes, The intrinsic attributes mode and length, Objects
1287@section Changing the length of an object
1288
1289An ``empty'' object may still have a mode.  For example
1290
1291@example
1292> e <- numeric()
1293@end example
1294
1295@noindent
1296makes @code{e} an empty vector structure of mode numeric.  Similarly
1297@code{character()} is a empty character vector, and so on.  Once an
1298object of any size has been created, new components may be added to it
1299simply by giving it an index value outside its previous range.  Thus
1300
1301@example
1302> e[3] <- 17
1303@end example
1304
1305@noindent
1306now makes @code{e} a vector of length 3, (the first two components of
1307which are at this point both @code{NA}).  This applies to any structure
1308at all, provided the mode of the additional component(s) agrees with the
1309mode of the object in the first place.
1310
1311This automatic adjustment of lengths of an object is used often, for
1312example in the @code{scan()} function for input.  (@pxref{The scan()
1313function}.)
1314
1315Conversely to truncate the size of an object requires only an assignment
1316to do so.  Hence if @code{alpha} is an object of length 10, then
1317
1318@example
1319> alpha <- alpha[2 * 1:5]
1320@end example
1321
1322@noindent
1323makes it an object of length 5 consisting of just the former components
1324with even index.  (The old indices are not retained, of course.)  We can
1325then retain just the first three values by
1326
1327@example
1328> length(alpha) <- 3
1329@end example
1330
1331@noindent
1332and vectors can be extended (by missing values) in the same way.
1333
1334@node Getting and setting attributes, The class of an object, Changing the length of an object, Objects
1335@section Getting and setting attributes
1336@findex attr
1337@findex attributes
1338
1339The function @code{attributes(@var{object})}
1340@findex attributes
1341returns a list of all the non-intrinsic attributes currently defined for
1342that object.  The function @code{attr(@var{object}, @var{name})}
1343@findex attr
1344can be used to select a specific attribute.  These functions are rarely
1345used, except in rather special circumstances when some new attribute is
1346being created for some particular purpose, for example to associate a
1347creation date or an operator with an @R{} object.  The concept, however,
1348is very important.
1349
1350Some care should be exercised when assigning or deleting attributes
1351since they are an integral part of the object system used in @R{}.
1352
1353When it is used on the left hand side of an assignment it can be used
1354either to associate a new attribute with @code{@var{object}} or to
1355change an existing one.  For example
1356
1357@example
1358> attr(z, "dim") <- c(10,10)
1359@end example
1360
1361@noindent
1362allows @R{} to treat @code{z} as if it were a 10-by-10 matrix.
1363
1364@node The class of an object,  , Getting and setting attributes, Objects
1365@section The class of an object
1366@cindex Classes
1367
1368All objects in @R{} have a @emph{class}, reported by the function
1369@code{class}.  For simple vectors this is just the mode, for example
1370@code{"numeric"}, @code{"logical"}, @code{"character"} or @code{"list"},
1371but @code{"matrix"}, @code{"array"}, @code{"factor"} and
1372@code{"data.frame"} are other possible values.
1373
1374A special attribute known as the @emph{class} of the object is used to
1375allow for an object-oriented style@footnote{A different style using
1376`formal' or `S4' classes is provided in package @code{methods}.} of
1377programming in @R{}.  For example if an object has class
1378@code{"data.frame"}, it will be printed in a certain way, the
1379@code{plot()} function will display it graphically in a certain way, and
1380other so-called generic functions such as @code{summary()} will react to
1381it as an argument in a way sensitive to its class.
1382
1383To remove temporarily the effects of class, use the function
1384@code{unclass()}.
1385@findex unclass
1386For example if @code{winter} has the class @code{"data.frame"} then
1387
1388@example
1389> winter
1390@end example
1391
1392@noindent
1393will print it in data frame form, which is rather like a matrix, whereas
1394
1395@example
1396> unclass(winter)
1397@end example
1398
1399@noindent
1400will print it as an ordinary list.  Only in rather special situations do
1401you need to use this facility, but one is when you are learning to come
1402to terms with the idea of class and generic functions.
1403
1404Generic functions and classes will be discussed further in @ref{Object
1405orientation}, but only briefly.
1406
1407@node Factors, Arrays and matrices, Objects, Top
1408@chapter Ordered and unordered factors
1409@cindex Factors
1410@cindex Ordered factors
1411
1412A @emph{factor} is a vector object used to specify a discrete
1413classification (grouping) of the components of other vectors of the same length.
1414@R{} provides both @emph{ordered} and @emph{unordered} factors.
1415While the ``real'' application of factors is with model formulae
1416(@pxref{Contrasts}), we here look at a specific example.
1417
1418@section A specific example
1419
1420Suppose, for example, we have a sample of 30 tax accountants from all
1421the states and territories of Australia@footnote{Readers should note
1422that there are eight states and territories in Australia, namely the
1423Australian Capital Territory, New South Wales, the Northern Territory,
1424Queensland, South Australia, Tasmania, Victoria and Western Australia.}
1425and their individual state of origin is specified by a character vector
1426of state mnemonics as
1427
1428@example
1429> state <- c("tas", "sa",  "qld", "nsw", "nsw", "nt",  "wa",  "wa",
1430             "qld", "vic", "nsw", "vic", "qld", "qld", "sa",  "tas",
1431             "sa",  "nt",  "wa",  "vic", "qld", "nsw", "nsw", "wa",
1432             "sa",  "act", "nsw", "vic", "vic", "act")
1433@end example
1434
1435Notice that in the case of a character vector, ``sorted'' means sorted
1436in alphabetical order.
1437
1438A @emph{factor} is similarly created using the @code{factor()} function:
1439@findex factor
1440
1441@example
1442> statef <- factor(state)
1443@end example
1444
1445The @code{print()} function handles factors slightly differently from
1446other objects:
1447
1448@example
1449> statef
1450 [1] tas sa  qld nsw nsw nt  wa  wa  qld vic nsw vic qld qld sa
1451[16] tas sa  nt  wa  vic qld nsw nsw wa  sa  act nsw vic vic act
1452Levels:  act nsw nt qld sa tas vic wa
1453@end example
1454
1455To find out the levels of a factor the function @code{levels()} can be
1456used.
1457@findex levels
1458
1459@example
1460> levels(statef)
1461[1] "act" "nsw" "nt"  "qld" "sa"  "tas" "vic" "wa"
1462@end example
1463
1464@menu
1465* The function tapply() and ragged arrays::
1466* Ordered factors::
1467@end menu
1468
1469@node The function tapply() and ragged arrays, Ordered factors, Factors, Factors
1470@section The function @code{tapply()} and ragged arrays
1471@findex tapply
1472
1473To continue the previous example, suppose we have the incomes of the
1474same tax accountants in another vector (in suitably large units of
1475money)
1476
1477@example
1478> incomes <- c(60, 49, 40, 61, 64, 60, 59, 54, 62, 69, 70, 42, 56,
1479               61, 61, 61, 58, 51, 48, 65, 49, 49, 41, 48, 52, 46,
1480               59, 46, 58, 43)
1481@end example
1482
1483To calculate the sample mean income for each state we can now use the
1484special function @code{tapply()}:
1485
1486@example
1487> incmeans <- tapply(incomes, statef, mean)
1488@end example
1489
1490@noindent
1491giving a means vector with the components labelled by the levels
1492
1493@example
1494   act    nsw     nt    qld     sa    tas    vic     wa
149544.500 57.333 55.500 53.600 55.000 60.500 56.000 52.250
1496@end example
1497
1498The function @code{tapply()} is used to apply a function, here
1499@code{mean()}, to each group of components of the first argument, here
1500@code{incomes}, defined by the levels of the second component, here
1501@code{statef}@footnote{Note that @code{tapply()} also works in this case
1502when its second argument is not a factor, e.g.,
1503@samp{@code{tapply(incomes, state)}}, and this is true for quite a few
1504other functions, since arguments are @emph{coerced} to factors when
1505necessary (using @code{as.factor()}).}, as if they were separate vector
1506structures.  The result is a structure of the same length as the levels
1507attribute of the factor containing the results.  The reader should
1508consult the help document for more details.
1509
1510Suppose further we needed to calculate the standard errors of the state
1511income means.  To do this we need to write an @R{} function to calculate
1512the standard error for any given vector.  Since there is an builtin
1513function @code{var()} to calculate the sample variance, such a function
1514is a very simple one liner, specified by the assignment:
1515
1516@example
1517> stdError <- function(x) sqrt(var(x)/length(x))
1518@end example
1519
1520@noindent
1521(Writing functions will be considered later in @ref{Writing your own
1522functions}.  Note that @R{}'s a builtin function @code{sd()} is something different.)
1523@findex sd
1524@findex var
1525After this assignment, the standard errors are calculated by
1526
1527@example
1528> incster <- tapply(incomes, statef, stdError)
1529@end example
1530
1531@noindent
1532and the values calculated are then
1533
1534@example
1535> incster
1536act    nsw  nt    qld     sa tas   vic     wa
15371.5 4.3102 4.5 4.1061 2.7386 0.5 5.244 2.6575
1538@end example
1539
1540As an exercise you may care to find the usual 95% confidence limits for
1541the state mean incomes.  To do this you could use @code{tapply()} once
1542more with the @code{length()} function to find the sample sizes, and the
1543@code{qt()} function to find the percentage points of the appropriate
1544@math{t}-distributions.  (You could also investigate @R{}'s facilities
1545for @math{t}-tests.)
1546
1547The function @code{tapply()} can also be used to handle more complicated
1548indexing of a vector by multiple categories.  For example, we might wish
1549to split the tax accountants by both state and sex.  However in this
1550simple instance (just one factor) what happens can be thought of as
1551follows.  The values in the vector are collected into groups
1552corresponding to the distinct entries in the factor.  The function is
1553then applied to each of these groups individually.  The value is a
1554vector of function results, labelled by the @code{levels} attribute of
1555the factor.
1556
1557The combination of a vector and a labelling factor is an example of what
1558is sometimes called a @emph{ragged array}, since the subclass sizes are
1559possibly irregular.  When the subclass sizes are all the same the
1560indexing may be done implicitly and much more efficiently, as we see in
1561the next section.
1562
1563
1564@node Ordered factors,  , The function tapply() and ragged arrays, Factors
1565@section Ordered factors
1566@findex ordered
1567
1568The levels of factors are stored in alphabetical order, or in the order
1569they were specified to @code{factor} if they were specified explicitly.
1570
1571Sometimes the levels will have a natural ordering that we want to record
1572and want our statistical analysis to make use of.  The @code{ordered()}
1573@findex ordered
1574function creates such ordered factors but is otherwise identical to
1575@code{factor}.  For most purposes the only difference between ordered
1576and unordered factors is that the former are printed showing the
1577ordering of the levels, but the contrasts generated for them in fitting
1578linear models are different.
1579
1580
1581@node Arrays and matrices, Lists and data frames, Factors, Top
1582@chapter Arrays and matrices
1583
1584@menu
1585* Arrays::
1586* Array indexing::
1587* Index matrices::
1588* The array() function::
1589* The outer product of two arrays::
1590* Generalized transpose of an array::
1591* Matrix facilities::
1592* Forming partitioned matrices::
1593* The concatenation function c() with arrays::
1594* Frequency tables from factors::
1595@end menu
1596
1597@node Arrays, Array indexing, Arrays and matrices, Arrays and matrices
1598@section Arrays
1599@cindex Arrays
1600@cindex Matrices
1601
1602An array can be considered as a multiply subscripted collection of data
1603entries, for example numeric.  @R{} allows simple facilities for
1604creating and handling arrays, and in particular the special case of
1605matrices.
1606
1607A dimension vector is a vector of non-negative integers.  If its length is
1608@math{k} then the array is @math{k}-dimensional, e.g.@: a matrix is a
1609@math{2}-dimensional array.  The dimensions are indexed from one up to
1610the values given in the dimension vector.
1611
1612A vector can be used by @R{} as an array only if it has a dimension
1613vector as its @emph{dim} attribute.  Suppose, for example, @code{z} is a
1614vector of 1500 elements.  The assignment
1615
1616@example
1617> dim(z) <- c(3,5,100)
1618@end example
1619@findex dim
1620
1621@noindent
1622gives it the @emph{dim} attribute that allows it to be treated as a
1623@math{3} by @math{5} by @math{100} array.
1624
1625Other functions such as @code{matrix()} and @code{array()} are available
1626for simpler and more natural looking assignments, as we shall see in
1627@ref{The array() function}.
1628
1629The values in the data vector give the values in the array in the same
1630order as they would occur in FORTRAN, that is ``column major order,''
1631with the first subscript moving fastest and the last subscript slowest.
1632
1633For example if the dimension vector for an array, say @code{a}, is
1634@code{c(3,4,2)} then there are @eqn{3 \times 4 \times 2 = 24, 3 * 4 * 2
1635= 24} entries in @code{a} and the data vector holds them in the order
1636@code{a[1,1,1], a[2,1,1], @dots{}, a[2,4,2], a[3,4,2]}.
1637
1638Arrays can be one-dimensional: such arrays are usually treated in the
1639same way as vectors (including when printing), but the exceptions can
1640cause confusion.
1641
1642@node Array indexing, Index matrices, Arrays, Arrays and matrices
1643@section Array indexing.  Subsections of an array
1644@cindex Indexing of and by arrays
1645
1646Individual elements of an array may be referenced by giving the name of
1647the array followed by the subscripts in square brackets, separated by
1648commas.
1649
1650More generally, subsections of an array may be specified by giving a
1651sequence of @emph{index vectors} in place of subscripts; however
1652@emph{if any index position is given an empty index vector, then the
1653full range of that subscript is taken}.
1654
1655Continuing the previous example, @code{a[2,,]} is a @eqn{4 \times 2, 4 *
16562} array with dimension vector @code{c(4,2)} and data vector containing
1657the values
1658
1659@example
1660c(a[2,1,1], a[2,2,1], a[2,3,1], a[2,4,1],
1661  a[2,1,2], a[2,2,2], a[2,3,2], a[2,4,2])
1662@end example
1663
1664@noindent
1665in that order.  @code{a[,,]} stands for the entire array, which is the
1666same as omitting the subscripts entirely and using @code{a} alone.
1667
1668For any array, say @code{Z}, the dimension vector may be referenced
1669explicitly as @code{dim(Z)} (on either side of an assignment).
1670
1671Also, if an array name is given with just @emph{one subscript or index
1672vector}, then the corresponding values of the data vector only are used;
1673in this case the dimension vector is ignored.  This is not the case,
1674however, if the single index is not a vector but itself an array, as we
1675next discuss.
1676
1677@menu
1678* Index matrices::
1679* The array() function::
1680@end menu
1681
1682@node Index matrices, The array() function, Array indexing, Arrays and matrices
1683@section Index matrices
1684
1685As well as an index vector in any subscript position, a matrix may be
1686used with a single @emph{index matrix} in order either to assign a vector
1687of quantities to an irregular collection of elements in the array, or to
1688extract an irregular collection as a vector.
1689
1690A matrix example makes the process clear.  In the case of a doubly
1691indexed array, an index matrix may be given consisting of two columns
1692and as many rows as desired.  The entries in the index matrix are the
1693row and column indices for the doubly indexed array.  Suppose for
1694example we have a @math{4} by @math{5} array @code{X} and we wish to do
1695the following:
1696
1697@itemize @bullet
1698@item
1699Extract elements @code{X[1,3]}, @code{X[2,2]} and @code{X[3,1]} as a
1700vector structure, and
1701@item
1702Replace these entries in the array @code{X} by zeroes.
1703@end itemize
1704In this case we need a @math{3} by @math{2} subscript array, as in the
1705following example.
1706
1707@example
1708> x <- array(1:20, dim=c(4,5))   # @r{Generate a 4 by 5 array.}
1709> x
1710     [,1] [,2] [,3] [,4] [,5]
1711[1,]    1    5    9   13   17
1712[2,]    2    6   10   14   18
1713[3,]    3    7   11   15   19
1714[4,]    4    8   12   16   20
1715> i <- array(c(1:3,3:1), dim=c(3,2))
1716> i                             # @r{@code{i} is a 3 by 2 index array.}
1717     [,1] [,2]
1718[1,]    1    3
1719[2,]    2    2
1720[3,]    3    1
1721> x[i]                          # @r{Extract those elements}
1722[1] 9 6 3
1723> x[i] <- 0                     # @r{Replace those elements by zeros.}
1724> x
1725     [,1] [,2] [,3] [,4] [,5]
1726[1,]    1    5    0   13   17
1727[2,]    2    0   10   14   18
1728[3,]    0    7   11   15   19
1729[4,]    4    8   12   16   20
1730>
1731@end example
1732@noindent
1733Negative indices are not allowed in index matrices.  @code{NA} and zero
1734values are allowed: rows in the index matrix containing a zero are
1735ignored, and rows containing an @code{NA} produce an @code{NA} in the
1736result.
1737
1738
1739As a less trivial example, suppose we wish to generate an (unreduced)
1740design matrix for a block design defined by factors @code{blocks}
1741(@code{b} levels) and @code{varieties} (@code{v} levels).  Further
1742suppose there are @code{n} plots in the experiment.  We could proceed as
1743follows:
1744
1745@example
1746> Xb <- matrix(0, n, b)
1747> Xv <- matrix(0, n, v)
1748> ib <- cbind(1:n, blocks)
1749> iv <- cbind(1:n, varieties)
1750> Xb[ib] <- 1
1751> Xv[iv] <- 1
1752> X <- cbind(Xb, Xv)
1753@end example
1754
1755To construct the incidence matrix, @code{N} say, we could use
1756
1757@example
1758> N <- crossprod(Xb, Xv)
1759@end example
1760@findex crossprod
1761
1762However a simpler direct way of producing this matrix is to use
1763@code{table()}:
1764@findex table
1765
1766@example
1767> N <- table(blocks, varieties)
1768@end example
1769
1770Index matrices must be numerical: any other form of matrix (e.g.@: a
1771logical or character matrix) supplied as a matrix is treated as an
1772indexing vector.
1773
1774@node The array() function, The outer product of two arrays, Index matrices, Arrays and matrices
1775@section The @code{array()} function
1776@findex array
1777
1778As well as giving a vector structure a @code{dim} attribute, arrays can
1779be constructed from vectors by the @code{array} function, which has the
1780form
1781
1782@example
1783> Z <- array(@var{data_vector}, @var{dim_vector})
1784@end example
1785
1786For example, if the vector @code{h} contains 24 or fewer, numbers then
1787the command
1788
1789@example
1790> Z <- array(h, dim=c(3,4,2))
1791@end example
1792
1793@noindent
1794would use @code{h} to set up @math{3} by @math{4} by @math{2} array in
1795@code{Z}.  If the size of @code{h} is exactly 24 the result is the same as
1796
1797@example
1798> Z <- h ; dim(Z) <- c(3,4,2)
1799@end example
1800
1801However if @code{h} is shorter than 24, its values are recycled from the
1802beginning again to make it up to size 24 (@pxref{The recycling rule})
1803but @code{dim(h) <- c(3,4,2)} would signal an error about mismatching
1804length.
1805As an extreme but common example
1806
1807@example
1808> Z <- array(0, c(3,4,2))
1809@end example
1810
1811@noindent
1812makes @code{Z} an array of all zeros.
1813
1814At this point @code{dim(Z)} stands for the dimension vector
1815@code{c(3,4,2)}, and @code{Z[1:24]} stands for the data vector as it was
1816in @code{h}, and @code{Z[]} with an empty subscript or @code{Z} with no
1817subscript stands for the entire array as an array.
1818
1819Arrays may be used in arithmetic expressions and the result is an array
1820formed by element-by-element operations on the data vector.  The
1821@code{dim} attributes of operands generally need to be the same, and
1822this becomes the dimension vector of the result.  So if @code{A},
1823@code{B} and @code{C} are all similar arrays, then
1824
1825@example
1826> D <- 2*A*B + C + 1
1827@end example
1828
1829@noindent
1830makes @code{D} a similar array with its data vector being the result of
1831the given element-by-element operations.  However the precise rule
1832concerning mixed array and vector calculations has to be considered a
1833little more carefully.
1834
1835@menu
1836* The recycling rule::
1837@end menu
1838
1839@node The recycling rule,  , The array() function, The array() function
1840@subsection Mixed vector and array arithmetic.  The recycling rule
1841@cindex Recycling rule
1842
1843The precise rule affecting element by element mixed calculations with
1844vectors and arrays is somewhat quirky and hard to find in the
1845references.  From experience we have found the following to be a reliable
1846guide.
1847
1848@itemize @bullet
1849@item
1850The expression is scanned from left to right.
1851@item
1852Any short vector operands are extended by recycling their values until
1853they match the size of any other operands.
1854@item
1855As long as short vectors and arrays @emph{only} are encountered, the
1856arrays must all have the same @code{dim} attribute or an error results.
1857@item
1858Any vector operand longer than a matrix or array operand generates an error.
1859@item
1860If array structures are present and no error or coercion to vector has
1861been precipitated, the result is an array structure with the common
1862@code{dim} attribute of its array operands.
1863@end itemize
1864
1865@node The outer product of two arrays, Generalized transpose of an array, The array() function, Arrays and matrices
1866@section The outer product of two arrays
1867@cindex Outer products of arrays
1868
1869An important operation on arrays is the @emph{outer product}.  If
1870@code{a} and @code{b} are two numeric arrays, their outer product is an
1871array whose dimension vector is obtained by concatenating their two
1872dimension vectors (order is important), and whose data vector is got by
1873forming all possible products of elements of the data vector of @code{a}
1874with those of @code{b}.  The outer product is formed by the special
1875operator @code{%o%}:
1876@findex %o%
1877
1878@example
1879> ab <- a %o% b
1880@end example
1881
1882An alternative is
1883
1884@example
1885> ab <- outer(a, b, "*")
1886@end example
1887@findex outer
1888
1889The multiplication function can be replaced by an arbitrary function of
1890two variables.  For example if we wished to evaluate the function
1891@eqn{f(x; y) = \cos(y)/(1 + x^2), f(x; y) = cos(y)/(1 + x^2)}
1892over a regular grid of values with @math{x}- and @math{y}-coordinates
1893defined by the @R{} vectors @code{x} and @code{y} respectively, we could
1894proceed as follows:
1895
1896@example
1897> f <- function(x, y) cos(y)/(1 + x^2)
1898> z <- outer(x, y, f)
1899@end example
1900
1901In particular the outer product of two ordinary vectors is a doubly
1902subscripted array (that is a matrix, of rank at most 1).  Notice that
1903the outer product operator is of course non-commutative.  Defining your
1904own @R{} functions will be considered further in @ref{Writing your own
1905functions}.
1906
1907@subsubheading An example: Determinants of 2 by 2 single-digit matrices
1908
1909As an artificial but cute example, consider the determinants of @math{2}
1910by @math{2} matrices @math{[a, b; c, d]} where each entry is a
1911non-negative integer in the range @math{0, 1, @dots{}, 9}, that is a
1912digit.
1913
1914The problem is to find the determinants, @math{ad - bc}, of all possible
1915matrices of this form and represent the frequency with which each value
1916occurs as a @emph{high density} plot.  This amounts to finding the
1917probability distribution of the determinant if each digit is chosen
1918independently and uniformly at random.
1919
1920A neat way of doing this uses the @code{outer()} function twice:
1921
1922@example
1923> d <- outer(0:9, 0:9)
1924> fr <- table(outer(d, d, "-"))
1925> plot(fr, xlab="Determinant", ylab="Frequency")
1926@end example
1927
1928Notice that @code{plot()} here uses a histogram like plot method, because
1929it ``sees'' that @code{fr} is of class @code{"table"}.
1930The ``obvious'' way of doing this problem with @code{for} loops, to be
1931discussed in @ref{Loops and conditional execution}, is so inefficient as
1932to be impractical.
1933
1934It is also perhaps surprising that about 1 in 20 such matrices is
1935singular.
1936
1937@node Generalized transpose of an array, Matrix facilities, The outer product of two arrays, Arrays and matrices
1938@section Generalized transpose of an array
1939@cindex Generalized transpose of an array
1940
1941The function @code{aperm(a, perm)}
1942@findex aperm
1943may be used to permute an array, @code{a}.  The argument @code{perm}
1944must be a permutation of the integers @math{@{1, @dots{}, k@}}, where
1945@math{k} is the number of subscripts in @code{a}.  The result of the
1946function is an array of the same size as @code{a} but with old dimension
1947given by @code{perm[j]} becoming the new @code{j}-th dimension.  The
1948easiest way to think of this operation is as a generalization of
1949transposition for matrices.  Indeed if @code{A} is a matrix, (that is, a
1950doubly subscripted array) then @code{B} given by
1951
1952@example
1953> B <- aperm(A, c(2,1))
1954@end example
1955
1956@noindent
1957is just the transpose of @code{A}.  For this special case a simpler
1958function @code{t()}
1959@findex t
1960is available, so we could have used @code{B <- t(A)}.
1961
1962@node Matrix facilities, Forming partitioned matrices, Generalized transpose of an array, Arrays and matrices
1963@section Matrix facilities
1964
1965@iftex
1966@macro xTx{}
1967@tex
1968$@strong{x}^T @strong{x}$%
1969@end tex
1970@end macro
1971@macro xxT{}
1972@tex
1973$@strong{x}@strong{x}^T$%
1974@end tex
1975@end macro
1976@end iftex
1977
1978@ifnottex
1979@macro xTx{}
1980x'x
1981@end macro
1982@macro xxT{}
1983x x'
1984@end macro
1985@end ifnottex
1986
1987As noted above, a matrix is just an array with two subscripts.  However
1988it is such an important special case it needs a separate discussion.
1989@R{} contains many operators and functions that are available only for
1990matrices.  For example @code{t(X)} is the matrix transpose function, as
1991noted above.  The functions @code{nrow(A)} and @code{ncol(A)} give the
1992number of rows and columns in the matrix @code{A} respectively.
1993@findex nrow
1994@findex ncol
1995
1996@menu
1997* Multiplication::
1998* Linear equations and inversion::
1999* Eigenvalues and eigenvectors::
2000* Singular value decomposition and determinants::
2001* Least squares fitting and the QR decomposition::
2002@end menu
2003
2004@node  Multiplication, Linear equations and inversion, Matrix facilities, Matrix facilities
2005@subsection Matrix multiplication
2006
2007@cindex Matrix multiplication
2008The operator @code{%*%} is used for matrix multiplication.
2009@findex %*%
2010An @math{n} by @math{1} or @math{1} by @math{n} matrix may of course be
2011used as an @math{n}-vector if in the context such is appropriate.
2012Conversely, vectors which occur in matrix multiplication expressions are
2013automatically promoted either to row or column vectors, whichever is
2014multiplicatively coherent, if possible, (although this is not always
2015unambiguously possible, as we see later).
2016
2017If, for example, @code{A} and @code{B} are square matrices of the same
2018size, then
2019
2020@example
2021> A * B
2022@end example
2023
2024@noindent
2025is the matrix of element by element products and
2026
2027@example
2028> A %*% B
2029@end example
2030
2031@noindent
2032is the matrix product.  If @code{x} is a vector, then
2033
2034@example
2035> x %*% A %*% x
2036@end example
2037
2038@noindent
2039is a quadratic form.@footnote{Note that @code{x %*% x} is ambiguous, as
2040it could mean either @xTx{} or @xxT{}, where @eqn{@strong{x},x} is the
2041column form.  In such cases the smaller matrix seems implicitly to be
2042the interpretation adopted, so the scalar @xTx{} is in this case the
2043result.  The matrix @xxT{} may be calculated either by @code{cbind(x)
2044%*% x} or @code{x %*% rbind(x)} since the result of @code{rbind()} or
2045@code{cbind()} is always a matrix.  However, the best way to compute
2046@xTx{} or @xxT{} is @code{crossprod(x)} or @code{x %o% x} respectively.}
2047
2048@findex crossprod
2049The function @code{crossprod()} forms ``crossproducts'', meaning that
2050@code{crossprod(X, y)} is the same as @code{t(X) %*% y} but the
2051operation is more efficient.  If the second argument to
2052@code{crossprod()} is omitted it is taken to be the same as the first.
2053
2054@findex diag
2055The meaning of @code{diag()} depends on its argument.  @code{diag(v)},
2056where @code{v} is a vector, gives a diagonal matrix with elements of the
2057vector as the diagonal entries.  On the other hand @code{diag(M)}, where
2058@code{M} is a matrix, gives the vector of main diagonal entries of
2059@code{M}.  This is the same convention as that used for @code{diag()} in
2060@sc{Matlab}.  Also, somewhat confusingly, if @code{k} is a single
2061numeric value then @code{diag(k)} is the @code{k} by @code{k} identity
2062matrix!
2063
2064@node Linear equations and inversion, Eigenvalues and eigenvectors, Multiplication, Matrix facilities
2065@subsection Linear equations and inversion
2066
2067@cindex Linear equations
2068@findex solve
2069Solving linear equations is the inverse of matrix multiplication.
2070When after
2071
2072@example
2073> b <- A %*% x
2074@end example
2075
2076@noindent
2077only @code{A} and @code{b} are given, the vector @code{x} is the
2078solution of that linear equation system.  In @R{},
2079
2080@example
2081> solve(A,b)
2082@end example
2083
2084@noindent
2085solves the system, returning @code{x} (up to some accuracy loss).
2086Note that in linear algebra, formally
2087@eqn{@strong{x} = @strong{A}^{-1} @strong{b}, @code{x = A^@{-1@} %*% b}}
2088where
2089@eqn{@strong{A}^{-1}, @code{A^@{-1@}}} denotes the @emph{inverse} of
2090@eqn{@strong{A},@code{A}}, which can be computed by
2091
2092@example
2093solve(A)
2094@end example
2095
2096@noindent
2097but rarely is needed.  Numerically, it is both inefficient and
2098potentially unstable to compute @code{x <- solve(A) %*% b} instead of
2099@code{solve(A,b)}.
2100
2101The quadratic form @eqn{@strong{x^T A^{-1} x},@ @code{x %*% A^@{-1@} %*%
2102x} @ } which is used in multivariate computations, should be computed by
2103something like@footnote{Even better would be to form a matrix square
2104root @eqn{B, B} with @eqn{A = BB^T, A = BB'} and find the squared length
2105of the solution of @eqn{By = x, By = x} , perhaps using the Cholesky or
2106eigen decomposition of @eqn{A, A}.  } @code{x %*% solve(A,x)}, rather
2107than computing the inverse of @code{A}.
2108
2109@node Eigenvalues and eigenvectors, Singular value decomposition and determinants, Linear equations and inversion, Matrix facilities
2110@subsection Eigenvalues and eigenvectors
2111@cindex Eigenvalues and eigenvectors
2112
2113@findex eigen
2114The function @code{eigen(Sm)} calculates the eigenvalues and
2115eigenvectors of a symmetric matrix @code{Sm}.  The result of this
2116function is a list of two components named @code{values} and
2117@code{vectors}.  The assignment
2118
2119@example
2120> ev <- eigen(Sm)
2121@end example
2122
2123@noindent
2124will assign this list to @code{ev}.  Then @code{ev$val} is the vector of
2125eigenvalues of @code{Sm} and @code{ev$vec} is the matrix of
2126corresponding eigenvectors.  Had we only needed the eigenvalues we could
2127have used the assignment:
2128
2129@example
2130> evals <- eigen(Sm)$values
2131@end example
2132
2133@noindent
2134@code{evals} now holds the vector of eigenvalues and the second
2135component is discarded.  If the expression
2136
2137@example
2138> eigen(Sm)
2139@end example
2140
2141@noindent
2142is used by itself as a command the two components are printed, with
2143their names.  For large matrices it is better to avoid computing the
2144eigenvectors if they are not needed by using the expression
2145
2146@example
2147> evals <- eigen(Sm, only.values = TRUE)$values
2148@end example
2149
2150
2151@node Singular value decomposition and determinants, Least squares fitting and the QR decomposition, Eigenvalues and eigenvectors, Matrix facilities
2152@subsection Singular value decomposition and determinants
2153@cindex Singular value decomposition
2154
2155@findex svd
2156The function @code{svd(M)} takes an arbitrary matrix argument, @code{M},
2157and calculates the singular value decomposition of @code{M}.  This
2158consists of a matrix of orthonormal columns @code{U} with the same
2159column space as @code{M}, a second matrix of orthonormal columns
2160@code{V} whose column space is the row space of @code{M} and a diagonal
2161matrix of positive entries @code{D} such that @code{M = U %*% D %*%
2162t(V)}.  @code{D} is actually returned as a vector of the diagonal
2163elements.  The result of @code{svd(M)} is actually a list of three
2164components named @code{d}, @code{u} and @code{v}, with evident meanings.
2165
2166If @code{M} is in fact square, then, it is not hard to see that
2167
2168@example
2169> absdetM <- prod(svd(M)$d)
2170@end example
2171
2172@noindent
2173calculates the absolute value of the determinant of @code{M}.  If this
2174calculation were needed often with a variety of matrices it could be
2175defined as an @R{} function
2176
2177@example
2178> absdet <- function(M) prod(svd(M)$d)
2179@end example
2180
2181@cindex Determinants
2182@noindent
2183after which we could use @code{absdet()} as just another @R{} function.
2184As a further trivial but potentially useful example, you might like to
2185consider writing a function, say @code{tr()}, to calculate the trace of
2186a square matrix.  [Hint: You will not need to use an explicit loop.
2187Look again at the @code{diag()} function.]
2188
2189@findex det
2190@findex determinant
2191@R{} has a builtin function @code{det} to calculate a determinant,
2192including the sign, and another, @code{determinant}, to give the sign
2193and modulus (optionally on log scale),
2194
2195@c Functions will be discussed formally later in these notes.
2196
2197@node Least squares fitting and the QR decomposition,  , Singular value decomposition and determinants, Matrix facilities
2198@subsection Least squares fitting and the QR decomposition
2199@cindex Least squares fitting
2200@cindex QR decomposition
2201
2202The function @code{lsfit()} returns a list giving results of a least
2203squares fitting procedure.  An assignment such as
2204
2205@example
2206> ans <- lsfit(X, y)
2207@end example
2208@findex lsfit
2209
2210@noindent
2211gives the results of a least squares fit where @code{y} is the vector of
2212observations and @code{X} is the design matrix.  See the help facility
2213for more details, and also for the follow-up function @code{ls.diag()}
2214for, among other things, regression diagnostics.  Note that a grand mean
2215term is automatically included and need not be included explicitly as a
2216column of @code{X}.  Further note that you almost always will prefer
2217using @code{lm(.)} (@pxref{Linear models}) to @code{lsfit()} for
2218regression modelling.
2219
2220@findex qr
2221Another closely related function is @code{qr()} and its allies.
2222Consider the following assignments
2223
2224@example
2225> Xplus <- qr(X)
2226> b <- qr.coef(Xplus, y)
2227> fit <- qr.fitted(Xplus, y)
2228> res <- qr.resid(Xplus, y)
2229@end example
2230
2231@noindent
2232These compute the orthogonal projection of @code{y} onto the range of
2233@code{X} in @code{fit}, the projection onto the orthogonal complement in
2234@code{res} and the coefficient vector for the projection in @code{b},
2235that is, @code{b} is essentially the result of the @sc{Matlab}
2236`backslash' operator.
2237
2238It is not assumed that @code{X} has full column rank.  Redundancies will
2239be discovered and removed as they are found.
2240
2241This alternative is the older, low-level way to perform least squares
2242calculations.  Although still useful in some contexts, it would now
2243generally be replaced by the statistical models features, as will be
2244discussed in @ref{Statistical models in R}.
2245
2246
2247@node Forming partitioned matrices, The concatenation function c() with arrays, Matrix facilities, Arrays and matrices
2248@section Forming partitioned matrices, @code{cbind()} and @code{rbind()}
2249@findex cbind
2250@findex rbind
2251
2252As we have already seen informally, matrices can be built up from other
2253vectors and matrices by the functions @code{cbind()} and @code{rbind()}.
2254Roughly @code{cbind()} forms matrices by binding together matrices
2255horizontally, or column-wise, and @code{rbind()} vertically, or
2256row-wise.
2257
2258In the assignment
2259
2260@example
2261> X <- cbind(@var{arg_1}, @var{arg_2}, @var{arg_3}, @dots{})
2262@end example
2263
2264@noindent
2265the arguments to @code{cbind()} must be either vectors of any length, or
2266matrices with the same column size, that is the same number of rows.
2267The result is a matrix with the concatenated arguments @var{arg_1},
2268@var{arg_2}, @dots{} forming the columns.
2269
2270If some of the arguments to @code{cbind()} are vectors they may be
2271shorter than the column size of any matrices present, in which case they
2272are cyclically extended to match the matrix column size (or the length
2273of the longest vector if no matrices are given).
2274
2275The function @code{rbind()} does the corresponding operation for rows.
2276In this case any vector argument, possibly cyclically extended, are of
2277course taken as row vectors.
2278
2279Suppose @code{X1} and @code{X2} have the same number of rows.  To
2280combine these by columns into a matrix @code{X}, together with an
2281initial column of @code{1}s we can use
2282
2283@example
2284> X <- cbind(1, X1, X2)
2285@end example
2286
2287The result of @code{rbind()} or @code{cbind()} always has matrix status.
2288Hence @code{cbind(x)} and @code{rbind(x)} are possibly the simplest ways
2289explicitly to allow the vector @code{x} to be treated as a column or row
2290matrix respectively.
2291
2292@node The concatenation function c() with arrays, Frequency tables from factors, Forming partitioned matrices, Arrays and matrices
2293@section The concatenation function, @code{c()}, with arrays
2294
2295It should be noted that whereas @code{cbind()} and @code{rbind()} are
2296concatenation functions that respect @code{dim} attributes, the basic
2297@code{c()} function does not, but rather clears numeric objects of all
2298@code{dim} and @code{dimnames} attributes.  This is occasionally useful
2299in its own right.
2300
2301The official way to coerce an array back to a simple vector object is to
2302use @code{as.vector()}
2303
2304@example
2305> vec <- as.vector(X)
2306@end example
2307@findex as.vector
2308
2309However a similar result can be achieved by using @code{c()} with just
2310one argument, simply for this side-effect:
2311
2312@example
2313> vec <- c(X)
2314@end example
2315@findex c
2316
2317There are slight differences between the two, but ultimately the choice
2318between them is largely a matter of style (with the former being
2319preferable).
2320
2321@node Frequency tables from factors,  , The concatenation function c() with arrays, Arrays and matrices
2322@section Frequency tables from factors
2323@cindex Tabulation
2324
2325Recall that a factor defines a partition into groups.  Similarly a pair
2326of factors defines a two way cross classification, and so on.
2327@findex table
2328The function @code{table()} allows frequency tables to be calculated
2329from equal length factors.  If there are @math{k} factor arguments,
2330the result is a @math{k}-way array of frequencies.
2331
2332Suppose, for example, that @code{statef} is a factor giving the state
2333code for each entry in a data vector.  The assignment
2334
2335@example
2336> statefr <- table(statef)
2337@end example
2338
2339@noindent
2340gives in @code{statefr} a table of frequencies of each state in the
2341sample.  The frequencies are ordered and labelled by the @code{levels}
2342attribute of the factor.  This simple case is equivalent to, but more
2343convenient than,
2344
2345@example
2346> statefr <- tapply(statef, statef, length)
2347@end example
2348
2349Further suppose that @code{incomef} is a factor giving a suitably
2350defined ``income class'' for each entry in the data vector, for example
2351with the @code{cut()} function:
2352
2353@example
2354> factor(cut(incomes, breaks = 35+10*(0:7))) -> incomef
2355@end example
2356@findex cut
2357
2358Then to calculate a two-way table of frequencies:
2359
2360@example
2361> table(incomef,statef)
2362         statef
2363incomef   act nsw nt qld sa tas vic wa
2364  (35,45]   1   1  0   1  0   0   1  0
2365  (45,55]   1   1  1   1  2   0   1  3
2366  (55,65]   0   3  1   3  2   2   2  1
2367  (65,75]   0   1  0   0  0   0   1  0
2368@end example
2369
2370Extension to higher-way frequency tables is immediate.
2371
2372@node Lists and data frames, Reading data from files, Arrays and matrices, Top
2373@chapter Lists and data frames
2374
2375@menu
2376* Lists::
2377* Constructing and modifying lists::
2378* Data frames::
2379@end menu
2380
2381@node Lists, Constructing and modifying lists, Lists and data frames, Lists and data frames
2382@section Lists
2383@cindex Lists
2384
2385An @R{} @emph{list} is an object consisting of an ordered collection of
2386objects known as its @emph{components}.
2387
2388There is no particular need for the components to be of the same mode or
2389type, and, for example, a list could consist of a numeric vector, a
2390logical value, a matrix, a complex vector, a character array, a
2391function, and so on.  Here is a simple example of how to make a list:
2392
2393@example
2394> Lst <- list(name="Fred", wife="Mary", no.children=3,
2395              child.ages=c(4,7,9))
2396@end example
2397@findex list
2398
2399Components are always @emph{numbered} and may always be referred to as
2400such.  Thus if @code{Lst} is the name of a list with four components,
2401these may be individually referred to as @code{Lst[[1]]},
2402@code{Lst[[2]]}, @code{Lst[[3]]} and @code{Lst[[4]]}.  If, further,
2403@code{Lst[[4]]} is a vector subscripted array then @code{Lst[[4]][1]} is
2404its first entry.
2405
2406If @code{Lst} is a list, then the function @code{length(Lst)} gives the
2407number of (top level) components it has.
2408
2409Components of lists may also be @emph{named}, and in this case the
2410component may be referred to either by giving the component name as a
2411character string in place of the number in double square brackets, or,
2412more conveniently, by giving an expression of the form
2413
2414@example
2415> @var{name}$@var{component_name}
2416@end example
2417
2418@noindent
2419for the same thing.
2420
2421This is a very useful convention as it makes it easier to get the right
2422component if you forget the number.
2423
2424So in the simple example given above:
2425
2426@code{Lst$name} is the same as @code{Lst[[1]]} and is the string
2427@code{"Fred"},
2428
2429@code{Lst$wife} is the same as @code{Lst[[2]]} and is the string
2430@code{"Mary"},
2431
2432@code{Lst$child.ages[1]} is the same as @code{Lst[[4]][1]} and is the
2433number @code{4}.
2434
2435Additionally, one can also use the names of the list components in
2436double square brackets, i.e., @code{Lst[["name"]]} is the same as
2437@code{Lst$name}.  This is especially useful, when the name of the
2438component to be extracted is stored in another variable as in
2439
2440@example
2441> x <- "name"; Lst[[x]]
2442@end example
2443
2444It is very important to distinguish @code{Lst[[1]]} from @code{Lst[1]}.
2445@samp{@code{[[@var{@dots{}}]]}} is the operator used to select a single
2446element, whereas @samp{@code{[@var{@dots{}}]}} is a general subscripting
2447operator.  Thus the former is the @emph{first object in the list}
2448@code{Lst}, and if it is a named list the name is @emph{not} included.
2449The latter is a @emph{sublist of the list @code{Lst} consisting of the
2450first entry only.  If it is a named list, the names are transferred to
2451the sublist.}
2452
2453The names of components may be abbreviated down to the minimum number of
2454letters needed to identify them uniquely.  Thus @code{Lst$coefficients}
2455may be minimally specified as @code{Lst$coe} and @code{Lst$covariance}
2456as @code{Lst$cov}.
2457
2458The vector of names is in fact simply an attribute of the list like any
2459other and may be handled as such.  Other structures besides lists may,
2460of course, similarly be given a @emph{names} attribute also.
2461
2462@node Constructing and modifying lists, Data frames, Lists, Lists and data frames
2463@section Constructing and modifying lists
2464
2465New lists may be formed from existing objects by the function
2466@code{list()}.  An assignment of the form
2467
2468@example
2469> Lst <- list(@var{name_1}=@var{object_1}, @var{@dots{}}, @var{name_m}=@var{object_m})
2470@end example
2471
2472@noindent
2473sets up a list @code{Lst} of @math{m} components using @var{object_1},
2474@dots{}, @var{object_m} for the components and giving them names as
2475specified by the argument names, (which can be freely chosen).  If these
2476names are omitted, the components are numbered only.  The components
2477used to form the list are @emph{copied} when forming the new list and
2478the originals are not affected.
2479
2480Lists, like any subscripted object, can be extended by specifying
2481additional components.  For example
2482
2483@example
2484> Lst[5] <- list(matrix=Mat)
2485@end example
2486
2487@menu
2488* Concatenating lists::
2489@end menu
2490
2491@node Concatenating lists,  , Constructing and modifying lists, Constructing and modifying lists
2492@subsection Concatenating lists
2493@cindex Concatenating lists
2494
2495@findex c
2496When the concatenation function @code{c()} is given list arguments, the
2497result is an object of mode list also, whose components are those of the
2498argument lists joined together in sequence.
2499
2500@example
2501> list.ABC <- c(list.A, list.B, list.C)
2502@end example
2503
2504Recall that with vector objects as arguments the concatenation function
2505similarly joined together all arguments into a single vector structure.
2506In this case all other attributes, such as @code{dim} attributes, are
2507discarded.
2508
2509
2510@node Data frames,  , Constructing and modifying lists, Lists and data frames
2511@section Data frames
2512@cindex Data frames
2513
2514A @emph{data frame} is a list with class @code{"data.frame"}.  There are
2515restrictions on lists that may be made into data frames, namely
2516
2517@itemize @bullet
2518@item
2519The components must be vectors (numeric, character, or logical),
2520factors, numeric matrices, lists, or other data frames.
2521@item
2522Matrices, lists, and data frames provide as many variables to the new
2523data frame as they have columns, elements, or variables, respectively.
2524@item
2525Numeric vectors, logicals and factors are included as is, and by
2526default@footnote{Conversion of character columns to factors is
2527overridden using the @code{stringsAsFactors} argument to the
2528@code{data.frame()} function.} character vectors are coerced to be
2529factors, whose levels are the unique values appearing in the vector.
2530@item
2531Vector structures appearing as variables of the data frame must all have
2532the @emph{same length}, and matrix structures must all have the same
2533@emph{row size}.
2534@end itemize
2535
2536A data frame may for many purposes be regarded as a matrix with columns
2537possibly of differing modes and attributes.  It may be displayed in
2538matrix form, and its rows and columns extracted using matrix indexing
2539conventions.
2540
2541@menu
2542* Making data frames::
2543* attach() and detach()::
2544* Working with data frames::
2545* Attaching arbitrary lists::
2546* Managing the search path::
2547@end menu
2548
2549@node Making data frames, attach() and detach(), Data frames, Data frames
2550@subsection Making data frames
2551
2552Objects satisfying the restrictions placed on the columns (components)
2553of a data frame may be used to form one using the function
2554@code{data.frame}:
2555@findex data.frame
2556
2557@example
2558> accountants <- data.frame(home=statef, loot=incomes, shot=incomef)
2559@end example
2560
2561A list whose components conform to the restrictions of a data frame may
2562be @emph{coerced} into a data frame using the function
2563@code{as.data.frame()}
2564@findex as.data.frame
2565
2566The simplest way to construct a data frame from scratch is to use the
2567@code{read.table()} function to read an entire data frame from an
2568external file.  This is discussed further in @ref{Reading data from
2569files}.
2570
2571@node attach() and detach(), Working with data frames, Making data frames, Data frames
2572@subsection @code{attach()} and @code{detach()}
2573@findex attach
2574@findex detach
2575
2576The @code{$} notation, such as @code{accountants$home}, for list
2577components is not always very convenient.  A useful facility would be
2578somehow to make the components of a list or data frame temporarily
2579visible as variables under their component name, without the need to
2580quote the list name explicitly each time.
2581
2582The @code{attach()} function takes a `database' such as a list or data
2583frame as its argument.  Thus suppose @code{lentils} is a
2584data frame with three variables @code{lentils$u}, @code{lentils$v},
2585@code{lentils$w}.  The attach
2586
2587@example
2588> attach(lentils)
2589@end example
2590
2591@noindent
2592places the data frame in the search path at @w{position 2}, and provided
2593there are no variables @code{u}, @code{v} or @code{w} in @w{position 1},
2594@code{u}, @code{v} and @code{w} are available as variables from the data
2595frame in their own right.  At this point an assignment such as
2596
2597@example
2598> u <- v+w
2599@end example
2600
2601@noindent
2602does not replace the component @code{u} of the data frame, but rather
2603masks it with another variable @code{u} in the working directory at
2604@w{position 1} on the search path.  To make a permanent change to the
2605data frame itself, the simplest way is to resort once again to the
2606@code{$} notation:
2607
2608@example
2609> lentils$u <- v+w
2610@end example
2611
2612However the new value of component @code{u} is not visible until the
2613data frame is detached and attached again.
2614
2615To detach a data frame, use the function
2616
2617@example
2618> detach()
2619@end example
2620
2621More precisely, this statement detaches from the search path the entity
2622currently at @w{position 2}.  Thus in the present context the variables
2623@code{u}, @code{v} and @code{w} would be no longer visible, except under
2624the list notation as @code{lentils$u} and so on.  Entities at positions
2625greater than 2 on the search path can be detached by giving their number
2626to @code{detach}, but it is much safer to always use a name, for example
2627by @code{detach(lentils)} or @code{detach("lentils")}
2628
2629@quotation Note
2630In @R{} lists and data frames can only be attached at position 2 or
2631above, and what is attached is a @emph{copy} of the original object.
2632You can alter the attached values @emph{via} @code{assign}, but the
2633original list or data frame is unchanged.
2634@end quotation
2635
2636@node Working with data frames, Attaching arbitrary lists, attach() and detach(), Data frames
2637@subsection Working with data frames
2638
2639A useful convention that allows you to work with many different problems
2640comfortably together in the same working directory is
2641
2642@itemize @bullet
2643@item
2644gather together all variables for any well defined and separate problem
2645in a data frame under a suitably informative name;
2646@item
2647when working with a problem attach the appropriate data frame at
2648@w{position 2}, and use the working directory at @w{level 1} for
2649operational quantities and temporary variables;
2650@item
2651before leaving a problem, add any variables you wish to keep for future
2652reference to the data frame using the @code{$} form of assignment, and
2653then @code{detach()};
2654@item
2655finally remove all unwanted variables from the working directory and
2656keep it as clean of left-over temporary variables as possible.
2657@end itemize
2658
2659In this way it is quite simple to work with many problems in the same
2660directory, all of which have variables named @code{x}, @code{y} and
2661@code{z}, for example.
2662
2663@node Attaching arbitrary lists, Managing the search path, Working with data frames, Data frames
2664@subsection Attaching arbitrary lists
2665
2666@code{attach()} is a generic function that allows not only directories
2667and data frames to be attached to the search path, but other classes of
2668object as well.  In particular any object of mode @code{"list"} may be
2669attached in the same way:
2670
2671@example
2672> attach(any.old.list)
2673@end example
2674
2675Anything that has been attached can be detached by @code{detach}, by
2676position number or, preferably, by name.
2677
2678@node Managing the search path,  , Attaching arbitrary lists, Data frames
2679@subsection Managing the search path
2680@findex search
2681@cindex Search path
2682
2683The function @code{search} shows the current search path and so is
2684a very useful way to keep track of which data frames and lists (and
2685packages) have been attached and detached.  Initially it gives
2686
2687@example
2688> search()
2689[1] ".GlobalEnv"   "Autoloads"    "package:base"
2690@end example
2691@noindent
2692where @code{.GlobalEnv} is the workspace.@footnote{See the on-line help
2693for @code{autoload} for the meaning of the second term.}
2694
2695After @code{lentils} is attached we have
2696
2697@example
2698> search()
2699[1] ".GlobalEnv"   "lentils"      "Autoloads"    "package:base"
2700> ls(2)
2701[1] "u" "v" "w"
2702@end example
2703
2704@noindent
2705and as we see @code{ls} (or @code{objects}) can be used to examine the
2706contents of any position on the search path.
2707
2708Finally, we detach the data frame and confirm it has been removed from
2709the search path.
2710
2711@example
2712> detach("lentils")
2713> search()
2714[1] ".GlobalEnv"   "Autoloads"    "package:base"
2715@end example
2716
2717@node Reading data from files, Probability distributions, Lists and data frames, Top
2718@chapter Reading data from files
2719@cindex Reading data from files
2720
2721Large data objects will usually be read as values from external files
2722rather than entered during an @R{} session at the keyboard.  @R{} input
2723facilities are simple and their requirements are fairly strict and even
2724rather inflexible.  There is a clear presumption by the designers of
2725@R{} that you will be able to modify your input files using other tools,
2726such as file editors or Perl@footnote{Under UNIX, the utilities
2727@command{sed} or@command{awk} can be used.} to fit in with the
2728requirements of @R{}.  Generally this is very simple.
2729
2730If variables are to be held mainly in data frames, as we strongly
2731suggest they should be, an entire data frame can be read directly with
2732the @code{read.table()} function.  There is also a more primitive input
2733function, @code{scan()}, that can be called directly.
2734
2735For more details on importing data into @R{} and also exporting data,
2736see the @emph{R Data Import/Export} manual.
2737
2738@menu
2739* The read.table() function::
2740* The scan() function::
2741* Accessing builtin datasets::
2742* Editing data::
2743@end menu
2744
2745@node The read.table() function, The scan() function, Reading data from files, Reading data from files
2746@section The @code{read.table()} function
2747@findex read.table
2748
2749To read an entire data frame directly, the external file will normally
2750have a special form.
2751
2752@itemize @bullet
2753@item
2754The first line of the file should have a @emph{name} for each variable
2755in the data frame.
2756
2757@item
2758Each additional line of the file has as its first item a @emph{row label}
2759and the values for each variable.
2760@end itemize
2761
2762If the file has one fewer item in its first line than in its second, this
2763arrangement is presumed to be in force.  So the first few lines of a file
2764to be read as a data frame might look as follows.
2765
2766@quotation
2767@cartouche
2768@example
2769@r{Input file form with names and row labels:}
2770
2771     Price    Floor     Area   Rooms     Age  Cent.heat
277201   52.00    111.0      830     5       6.2      no
277302   54.75    128.0      710     5       7.5      no
277403   57.50    101.0     1000     5       4.2      no
277504   57.50    131.0      690     6       8.8      no
277605   59.75     93.0      900     5       1.9     yes
2777...
2778@end example
2779@end cartouche
2780@end quotation
2781
2782By default numeric items (except row labels) are read as numeric
2783variables and non-numeric variables, such as @code{Cent.heat} in the
2784example, as factors.  This can be changed if necessary.
2785
2786The function @code{read.table()} can then be used to read the data frame
2787directly
2788
2789@example
2790> HousePrice <- read.table("houses.data")
2791@end example
2792
2793Often you will want to omit including the row labels directly and use the
2794default labels.  In this case the file may omit the row label column as in
2795the following.
2796
2797@quotation
2798@cartouche
2799@example
2800@r{Input file form without row labels:}
2801
2802Price    Floor     Area   Rooms     Age  Cent.heat
280352.00    111.0      830     5       6.2      no
280454.75    128.0      710     5       7.5      no
280557.50    101.0     1000     5       4.2      no
280657.50    131.0      690     6       8.8      no
280759.75     93.0      900     5       1.9     yes
2808...
2809@end example
2810@end cartouche
2811@end quotation
2812
2813The data frame may then be read as
2814
2815@example
2816> HousePrice <- read.table("houses.data", header=TRUE)
2817@end example
2818
2819@noindent
2820where the @code{header=TRUE} option specifies that the first line is a
2821line of headings, and hence, by implication from the form of the file,
2822that no explicit row labels are given.
2823
2824@menu
2825* The scan() function::
2826@end menu
2827
2828@node The scan() function, Accessing builtin datasets, The read.table() function, Reading data from files
2829@section The @code{scan()} function
2830@findex scan
2831
2832Suppose the data vectors are of equal length and are to be read in
2833parallel.  Further suppose that there are three vectors, the first of
2834mode character and the remaining two of mode numeric, and the file is
2835@file{input.dat}.  The first step is to use @code{scan()} to read in the
2836three vectors as a list, as follows
2837
2838@example
2839> inp <- scan("input.dat", list("",0,0))
2840@end example
2841
2842The second argument is a dummy list structure that establishes the mode
2843of the three vectors to be read.  The result, held in @code{inp}, is a
2844list whose components are the three vectors read in.  To separate the
2845data items into three separate vectors, use assignments like
2846
2847@example
2848> label <- inp[[1]]; x <- inp[[2]]; y <- inp[[3]]
2849@end example
2850
2851More conveniently, the dummy list can have named components, in which
2852case the names can be used to access the vectors read in.  For example
2853
2854@example
2855> inp <- scan("input.dat", list(id="", x=0, y=0))
2856@end example
2857
2858If you wish to access the variables separately they may either be
2859re-assigned to variables in the working frame:
2860
2861@example
2862> label <- inp$id; x <- inp$x; y <- inp$y
2863@end example
2864
2865@noindent
2866or the list may be attached at @w{position 2} of the search path
2867(@pxref{Attaching arbitrary lists}).
2868
2869If the second argument is a single value and not a list, a single vector
2870is read in, all components of which must be of the same mode as the
2871dummy value.
2872
2873@example
2874> X <- matrix(scan("light.dat", 0), ncol=5, byrow=TRUE)
2875@end example
2876
2877There are more elaborate input facilities available and these are
2878detailed in the manuals.
2879
2880@node Accessing builtin datasets, Editing data, The scan() function, Reading data from files
2881@section Accessing builtin datasets
2882@cindex Accessing builtin datasets
2883@findex data
2884
2885Around 100 datasets are supplied with @R{} (in package @pkg{datasets}),
2886and others are available in packages (including the recommended packages
2887supplied with @R{}).  To see the list of datasets currently available
2888use
2889
2890@example
2891data()
2892@end example
2893
2894@noindent
2895All the datasets supplied with @R{} are available directly by name.
2896However, many packages still use the obsolete convention in which
2897@code{data} was also used to load datasets into @R{}, for example
2898
2899@example
2900data(infert)
2901@end example
2902
2903@noindent
2904and this can still be used with the standard packages (as in this
2905example).  In most cases this will load an @R{} object of the same name.
2906However, in a few cases it loads several objects, so see the on-line
2907help for the object to see what to expect.
2908
2909@subsection Loading data from other R packages
2910
2911To access data from a particular package, use the @code{package}
2912argument, for example
2913
2914@example
2915data(package="rpart")
2916data(Puromycin, package="datasets")
2917@end example
2918
2919If a package has been attached by @code{library}, its datasets are
2920automatically included in the search.
2921
2922User-contributed packages can be a rich source of datasets.
2923
2924@node Editing data,  , Accessing builtin datasets, Reading data from files
2925@section Editing data
2926
2927@findex edit
2928When invoked on a data frame or matrix, @code{edit} brings up a separate
2929spreadsheet-like environment for editing.  This is useful for making
2930small changes once a data set has been read.  The command
2931
2932@example
2933> xnew <- edit(xold)
2934@end example
2935
2936@noindent
2937will allow you to edit your data set @code{xold}, and on completion the
2938changed object is assigned to @code{xnew}.  If you want to alter the
2939original dataset @code{xold}, the simplest way is to use
2940@code{fix(xold)}, which is equivalent to @code{xold <- edit(xold)}.
2941
2942Use
2943
2944@example
2945> xnew <- edit(data.frame())
2946@end example
2947
2948@noindent
2949to enter new data via the spreadsheet interface.
2950
2951
2952@node Probability distributions, Loops and conditional execution, Reading data from files, Top
2953@chapter Probability distributions
2954@cindex Probability distributions
2955
2956@menu
2957* R as a set of statistical tables::
2958* Examining the distribution of a set of data::
2959* One- and two-sample tests::
2960@end menu
2961
2962@node R as a set of statistical tables, Examining the distribution of a set of data, Probability distributions, Probability distributions
2963@section R as a set of statistical tables
2964
2965One convenient use of @R{} is to provide a comprehensive set of
2966statistical tables.  Functions are provided to evaluate the cumulative
2967distribution function @eqn{P(X \le x), P(X <= x)},
2968the probability density function and the quantile function (given
2969@math{q}, the smallest @math{x} such that @eqn{P(X \le x) > q, P(X <= x) > q}),
2970and to simulate from the distribution.
2971
2972@quotation
2973@multitable{Distribution namessss}{names, names}{arguments, arguments}
2974@headitem Distribution @tab @R{} name @tab additional arguments
2975@item beta @tab @code{beta} @tab @code{shape1, shape2, ncp}
2976@item binomial @tab  @code{binom} @tab @code{size, prob}
2977@item Cauchy @tab @code{cauchy} @tab @code{location, scale}
2978@item chi-squared @tab @code{chisq} @tab @code{df, ncp}
2979@item exponential @tab @code{exp} @tab @code{rate}
2980@item F @tab @code{f} @tab @code{df1, df2, ncp}
2981@item gamma @tab @code{gamma} @tab @code{shape, scale}
2982@item geometric @tab @code{geom} @tab @code{prob}
2983@item hypergeometric @tab @code{hyper} @tab @code{m, n, k}
2984@item log-normal @tab @code{lnorm} @tab @code{meanlog, sdlog}
2985@item logistic @tab @code{logis} @tab @code{location, scale}
2986@item negative binomial @tab @code{nbinom} @tab @code{size, prob}
2987@item normal @tab @code{norm} @tab @code{mean, sd}
2988@item Poisson @tab @code{pois} @tab @code{lambda}
2989@item signed rank @tab @code{signrank} @tab @code{n}
2990@item Student's t @tab @code{t} @tab @code{df, ncp}
2991@item uniform @tab @code{unif} @tab @code{min, max}
2992@item Weibull @tab @code{weibull} @tab @code{shape, scale}
2993@item Wilcoxon @tab @code{wilcox} @tab @code{m, n}
2994@end multitable
2995@end quotation
2996
2997@noindent
2998Prefix the name given here by @samp{d} for the density, @samp{p} for the
2999CDF, @samp{q} for the quantile function and @samp{r} for simulation
3000(@emph{r}andom deviates).  The first argument is @code{x} for
3001@code{d@var{xxx}}, @code{q} for @code{p@var{xxx}}, @code{p} for
3002@code{q@var{xxx}} and @code{n} for @code{r@var{xxx}} (except for
3003@code{rhyper}, @code{rsignrank} and @code{rwilcox}, for which it is
3004@code{nn}).  In not quite all cases is the non-centrality parameter
3005@code{ncp} currently available: see the on-line help for details.
3006
3007The @code{p@var{xxx}} and @code{q@var{xxx}} functions all have logical
3008arguments @code{lower.tail} and @code{log.p} and the @code{d@var{xxx}}
3009ones have @code{log}.  This allows, e.g., getting the cumulative (or
3010``integrated'') @emph{hazard} function, @eqn{H(t) = - \log(1 - F(t)),
3011H(t) = - log(1 - F(t))}, by
3012
3013@example
3014 - p@var{xxx}(t, ..., lower.tail = FALSE, log.p = TRUE)
3015@end example
3016
3017@noindent
3018or more accurate log-likelihoods (by @code{d@var{xxx}(..., log =
3019TRUE)}), directly.
3020
3021In addition there are functions @code{ptukey} and @code{qtukey} for the
3022distribution of the studentized range of samples from a normal
3023distribution, and @code{dmultinom} and @code{rmultinom} for the
3024multinomial distribution. Further distributions are available in
3025contributed packages, notably @CRANpkg{SuppDists}.
3026
3027Here are some examples
3028
3029@example
3030> ## @r{2-tailed p-value for t distribution}
3031> 2*pt(-2.43, df = 13)
3032> ## @r{upper 1% point for an F(2, 7) distribution}
3033> qf(0.01, 2, 7, lower.tail = FALSE)
3034@end example
3035
3036See the on-line help on @code{RNG} for how random-number generation is
3037done in @R{}.
3038
3039@node  Examining the distribution of a set of data, One- and two-sample tests, R as a set of statistical tables, Probability distributions
3040@section Examining the distribution of a set of data
3041
3042Given a (univariate) set of data we can examine its distribution in a
3043large number of ways.  The simplest is to examine the numbers.  Two
3044slightly different summaries are given by @code{summary} and
3045@code{fivenum}
3046@findex summary
3047@findex fivenum
3048and a display of the numbers by @code{stem} (a ``stem and leaf'' plot).
3049@findex stem
3050
3051@example
3052> attach(faithful)
3053> summary(eruptions)
3054   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
3055  1.600   2.163   4.000   3.488   4.454   5.100
3056> fivenum(eruptions)
3057[1] 1.6000 2.1585 4.0000 4.4585 5.1000
3058> stem(eruptions)
3059
3060  The decimal point is 1 digit(s) to the left of the |
3061
3062  16 | 070355555588
3063  18 | 000022233333335577777777888822335777888
3064  20 | 00002223378800035778
3065  22 | 0002335578023578
3066  24 | 00228
3067  26 | 23
3068  28 | 080
3069  30 | 7
3070  32 | 2337
3071  34 | 250077
3072  36 | 0000823577
3073  38 | 2333335582225577
3074  40 | 0000003357788888002233555577778
3075  42 | 03335555778800233333555577778
3076  44 | 02222335557780000000023333357778888
3077  46 | 0000233357700000023578
3078  48 | 00000022335800333
3079  50 | 0370
3080@end example
3081
3082A stem-and-leaf plot is like a histogram, and @R{} has a function
3083@code{hist} to plot histograms.
3084@findex hist
3085
3086@example
3087> hist(eruptions)
3088## @r{make the bins smaller, make a plot of density}
3089> hist(eruptions, seq(1.6, 5.2, 0.2), prob=TRUE)
3090> lines(density(eruptions, bw=0.1))
3091> rug(eruptions) # @r{show the actual data points}
3092@end example
3093
3094@findex density
3095@cindex Density estimation
3096More elegant density plots can be made by @code{density}, and we added a
3097line produced by @code{density} in this example.  The bandwidth
3098@code{bw} was chosen by trial-and-error as the default gives too much
3099smoothing (it usually does for ``interesting'' densities).  (Better
3100automated methods of bandwidth choice are available, and in this example
3101@code{bw = "SJ"} gives a good result.)
3102
3103@ifnotinfo
3104@image{images/hist,9cm}
3105@end ifnotinfo
3106
3107We can plot the empirical cumulative distribution function by using the
3108function @code{ecdf}.
3109@findex ecdf
3110@cindex Empirical CDFs
3111
3112@example
3113> plot(ecdf(eruptions), do.points=FALSE, verticals=TRUE)
3114@end example
3115
3116This distribution is obviously far from any standard distribution.
3117How about the right-hand mode, say eruptions of longer than 3 minutes?
3118Let us fit a normal distribution and overlay the fitted CDF.
3119
3120@example
3121> long <- eruptions[eruptions > 3]
3122> plot(ecdf(long), do.points=FALSE, verticals=TRUE)
3123> x <- seq(3, 5.4, 0.01)
3124> lines(x, pnorm(x, mean=mean(long), sd=sqrt(var(long))), lty=3)
3125@end example
3126
3127@ifnotinfo
3128@image{images/ecdf,9cm}
3129@end ifnotinfo
3130
3131Quantile-quantile (Q-Q) plots can help us examine this more carefully.
3132@cindex Quantile-quantile plots
3133@findex qqnorm
3134@findex qqline
3135
3136@example
3137par(pty="s")       # arrange for a square figure region
3138qqnorm(long); qqline(long)
3139@end example
3140
3141@noindent
3142which shows a reasonable fit but a shorter right tail than one would
3143expect from a normal distribution.  Let us compare this with some
3144simulated data from a @math{t} distribution
3145
3146@ifnotinfo
3147@image{images/QQ,7cm}
3148@end ifnotinfo
3149
3150@example
3151x <- rt(250, df = 5)
3152qqnorm(x); qqline(x)
3153@end example
3154
3155@noindent
3156which will usually (if it is a random sample) show longer tails than
3157expected for a normal.  We can make a Q-Q plot against the generating
3158distribution by
3159
3160@example
3161qqplot(qt(ppoints(250), df = 5), x, xlab = "Q-Q plot for t dsn")
3162qqline(x)
3163@end example
3164
3165Finally, we might want a more formal test of agreement with normality
3166(or not).  @R{} provides the Shapiro-Wilk test
3167@cindex Shapiro-Wilk test
3168@findex shapiro.test
3169
3170@example
3171> shapiro.test(long)
3172
3173         Shapiro-Wilk normality test
3174
3175data:  long
3176W = 0.9793, p-value = 0.01052
3177@end example
3178
3179@noindent
3180and the Kolmogorov-Smirnov test
3181@cindex Kolmogorov-Smirnov test
3182@findex ks.test
3183
3184@example
3185> ks.test(long, "pnorm", mean = mean(long), sd = sqrt(var(long)))
3186
3187         One-sample Kolmogorov-Smirnov test
3188
3189data:  long
3190D = 0.0661, p-value = 0.4284
3191alternative hypothesis: two.sided
3192@end example
3193
3194@noindent
3195(Note that the distribution theory is not valid here as we
3196have estimated the parameters of the normal distribution from the same
3197sample.)
3198
3199@node One- and two-sample tests,  , Examining the distribution of a set of data, Probability distributions
3200@section One- and two-sample tests
3201@cindex One- and two-sample tests
3202
3203So far we have compared a single sample to a normal distribution.  A
3204much more common operation is to compare aspects of two samples.  Note
3205that in @R{}, all ``classical'' tests including the ones used below are
3206in package @pkg{stats} which is normally loaded.
3207
3208Consider the following sets of data on the latent heat of the fusion of
3209ice (@emph{cal/gm}) from Rice (1995, p.490)
3210
3211@example
3212Method A: 79.98 80.04 80.02 80.04 80.03 80.03 80.04 79.97
3213          80.05 80.03 80.02 80.00 80.02
3214Method B: 80.02 79.94 79.98 79.97 79.97 80.03 79.95 79.97
3215@end example
3216
3217@noindent
3218Boxplots provide a simple graphical comparison of the two samples.
3219
3220@c NOTE scan() from stdin is not parse()able, hence not source()able
3221@c Hence ./R-intro.R uses c(..)
3222@example
3223A <- scan()
322479.98 80.04 80.02 80.04 80.03 80.03 80.04 79.97
322580.05 80.03 80.02 80.00 80.02
3226
3227B <- scan()
322880.02 79.94 79.98 79.97 79.97 80.03 79.95 79.97
3229
3230boxplot(A, B)
3231@end example
3232@findex boxplot
3233@cindex Box plots
3234
3235@noindent
3236which indicates that the first group tends to give higher results than
3237the second.
3238
3239@ifnotinfo
3240@image{images/ice,7cm}
3241@end ifnotinfo
3242
3243To test for the equality of the means of the two examples, we can use
3244an @emph{unpaired} @math{t}-test by
3245@cindex Student's @math{t} test
3246@findex t.test
3247
3248@example
3249> t.test(A, B)
3250
3251         Welch Two Sample t-test
3252
3253data:  A and B
3254t = 3.2499, df = 12.027, p-value = 0.00694
3255alternative hypothesis: true difference in means is not equal to 0
325695 percent confidence interval:
3257 0.01385526 0.07018320
3258sample estimates:
3259mean of x mean of y
3260 80.02077  79.97875
3261@end example
3262
3263@noindent
3264which does indicate a significant difference, assuming normality.  By
3265default the @R{} function does not assume equality of variances in the
3266two samples (in contrast to the similar @SPLUS{} @code{t.test}
3267function).  We can use the F test to test for equality in the variances,
3268provided that the two samples are from normal populations.
3269
3270@example
3271> var.test(A, B)
3272
3273         F test to compare two variances
3274
3275data:  A and B
3276F = 0.5837, num df = 12, denom df =  7, p-value = 0.3938
3277alternative hypothesis: true ratio of variances is not equal to 1
327895 percent confidence interval:
3279 0.1251097 2.1052687
3280sample estimates:
3281ratio of variances
3282         0.5837405
3283@end example
3284@findex var.test
3285
3286@noindent
3287which shows no evidence of a significant difference, and so we can use
3288the classical @math{t}-test that assumes equality of the variances.
3289
3290@example
3291> t.test(A, B, var.equal=TRUE)
3292
3293         Two Sample t-test
3294
3295data:  A and B
3296t = 3.4722, df = 19, p-value = 0.002551
3297alternative hypothesis: true difference in means is not equal to 0
329895 percent confidence interval:
3299 0.01669058 0.06734788
3300sample estimates:
3301mean of x mean of y
3302 80.02077  79.97875
3303@end example
3304
3305All these tests assume normality of the two samples.  The two-sample
3306Wilcoxon (or Mann-Whitney) test only assumes a common continuous
3307distribution under the null hypothesis.
3308
3309@cindex Wilcoxon test
3310@findex wilcox.test
3311@example
3312> wilcox.test(A, B)
3313
3314         Wilcoxon rank sum test with continuity correction
3315
3316data:  A and B
3317W = 89, p-value = 0.007497
3318alternative hypothesis: true location shift is not equal to 0
3319
3320Warning message:
3321Cannot compute exact p-value with ties in: wilcox.test(A, B)
3322@end example
3323
3324@noindent
3325Note the warning: there are several ties in each sample, which suggests
3326strongly that these data are from a discrete distribution (probably due
3327to rounding).
3328
3329There are several ways to compare graphically the two samples.  We have
3330already seen a pair of boxplots.  The following
3331
3332@example
3333> plot(ecdf(A), do.points=FALSE, verticals=TRUE, xlim=range(A, B))
3334> plot(ecdf(B), do.points=FALSE, verticals=TRUE, add=TRUE)
3335@end example
3336
3337@noindent
3338will show the two empirical CDFs, and @code{qqplot} will perform a Q-Q
3339plot of the two samples.  The Kolmogorov-Smirnov test is of the maximal
3340vertical distance between the two ecdf's, assuming a common continuous
3341distribution:
3342
3343@example
3344> ks.test(A, B)
3345
3346         Two-sample Kolmogorov-Smirnov test
3347
3348data:  A and B
3349D = 0.5962, p-value = 0.05919
3350alternative hypothesis: two-sided
3351
3352Warning message:
3353cannot compute correct p-values with ties in: ks.test(A, B)
3354@end example
3355
3356@node Loops and conditional execution, Writing your own functions, Probability distributions, Top
3357@chapter Grouping, loops and conditional execution
3358@cindex Loops and conditional execution
3359
3360@menu
3361* Grouped expressions::
3362* Control statements::
3363@end menu
3364
3365@node Grouped expressions, Control statements, Loops and conditional execution, Loops and conditional execution
3366@section Grouped expressions
3367@cindex Grouped expressions
3368
3369@R{} is an expression language in the sense that its only command type
3370is a function or expression which returns a result.  Even an assignment
3371is an expression whose result is the value assigned, and it may be used
3372wherever any expression may be used; in particular multiple assignments
3373are possible.
3374
3375Commands may be grouped together in braces, @code{@{@var{expr_1};
3376@var{@dots{}}; @var{expr_m}@}}, in which case the value of the group
3377is the result of the last expression in the group evaluated.  Since such
3378a group is also an expression it may, for example, be itself included in
3379parentheses and used as part of an even larger expression, and so on.
3380
3381@node Control statements,  , Grouped expressions, Loops and conditional execution
3382@section Control statements
3383@cindex Control statements
3384
3385@menu
3386* Conditional execution::
3387* Repetitive execution::
3388@end menu
3389
3390@node Conditional execution, Repetitive execution, Control statements, Control statements
3391@subsection Conditional execution: @code{if} statements
3392@findex if
3393
3394The language has available a conditional construction of the form
3395
3396@example
3397> if (@var{expr_1}) @var{expr_2} else @var{expr_3}
3398@end example
3399@findex if
3400@findex else
3401
3402@noindent
3403where @var{expr_1} must evaluate to a single logical value and the
3404result of the entire expression is then evident.
3405
3406@findex &&
3407@findex ||
3408The ``short-circuit'' operators @code{&&} and @code{||} are often used
3409as part of the condition in an @code{if} statement.  Whereas @code{&}
3410and @code{|} apply element-wise to vectors, @code{&&} and @code{||}
3411apply to vectors of length one, and only evaluate their second argument
3412if necessary.
3413
3414@findex ifelse
3415There is a vectorized version of the @code{if}/@code{else} construct,
3416the @code{ifelse} function.  This has the form @code{ifelse(condition, a,
3417b)} and returns a vector of the same length as @code{condition}, with
3418elements @code{a[i]} if @code{condition[i]} is true, otherwise
3419@code{b[i]} (where @code{a} and @code{b} are recycled as necessary).
3420
3421
3422@node Repetitive execution,  , Conditional execution, Control statements
3423@subsection Repetitive execution: @code{for} loops, @code{repeat} and @code{while}
3424@findex for
3425
3426There is also a @code{for} loop construction which has the form
3427
3428@example
3429> for (@code{@var{name}} in @var{expr_1}) @var{expr_2}
3430@end example
3431
3432@noindent
3433where @code{@var{name}} is the loop variable.  @var{expr_1} is a
3434vector expression, (often a sequence like @code{1:20}), and
3435@var{expr_2} is often a grouped expression with its sub-expressions
3436written in terms of the dummy @emph{name}.  @var{expr_2} is repeatedly
3437evaluated as @var{name} ranges through the values in the vector result
3438of @var{expr_1}.
3439
3440As an example, suppose @code{ind} is a vector of class indicators and we
3441wish to produce separate plots of @code{y} versus @code{x} within
3442classes.  One possibility here is to use @code{coplot()},@footnote{to be
3443discussed later, or use @code{xyplot} from package @CRANpkg{lattice}.}
3444which will produce an array of plots corresponding to each level of the
3445factor.  Another way to do this, now putting all plots on the one
3446display, is as follows:
3447
3448@example
3449> xc <- split(x, ind)
3450> yc <- split(y, ind)
3451> for (i in 1:length(yc)) @{
3452    plot(xc[[i]], yc[[i]])
3453    abline(lsfit(xc[[i]], yc[[i]]))
3454  @}
3455@end example
3456
3457@findex split
3458
3459(Note the function @code{split()} which produces a list of vectors
3460obtained by splitting a larger vector according to the classes specified
3461by a factor.  This is a useful function, mostly used in connection
3462with boxplots.  See the @code{help} facility for further details.)
3463
3464@quotation
3465@strong{Warning}: @code{for()} loops are used in @R{} code much less
3466often than in compiled languages.  Code that takes a `whole object' view
3467is likely to be both clearer and faster in @R{}.
3468@end quotation
3469
3470Other looping facilities include the
3471
3472@example
3473> repeat @var{expr}
3474@end example
3475@findex repeat
3476
3477@noindent
3478statement and the
3479
3480@example
3481> while (@var{condition}) @var{expr}
3482@end example
3483@findex while
3484
3485@noindent
3486statement.
3487
3488The @code{break} statement can be used to terminate any loop, possibly
3489abnormally.  This is the only way to terminate @code{repeat} loops.
3490@findex break
3491
3492The @code{next} statement can be used to discontinue one particular
3493cycle and skip to the ``next''.
3494@findex next
3495
3496Control statements are most often used in connection with
3497@emph{functions} which are discussed in @ref{Writing your own
3498functions}, and where more examples will emerge.
3499
3500
3501@node Writing your own functions, Statistical models in R, Loops and conditional execution, Top
3502@chapter Writing your own functions
3503@cindex Writing functions
3504
3505As we have seen informally along the way, the @R{} language allows the
3506user to create objects of mode @emph{function}.  These are true @R{}
3507functions that are stored in a special internal form and may be used in
3508further expressions and so on.  In the process, the language gains
3509enormously in power, convenience and elegance, and learning to write
3510useful functions is one of the main ways to make your use of @R{}
3511comfortable and productive.
3512
3513It should be emphasized that most of the functions supplied as part of
3514the @R{} system, such as @code{mean()}, @code{var()},
3515@code{postscript()} and so on, are themselves written in @R{} and thus
3516do not differ materially from user written functions.
3517
3518A function is defined by an assignment of the form
3519
3520@example
3521> @var{name} <- function(@var{arg_1}, @var{arg_2}, @dots{}) @var{expression}
3522@end example
3523@findex function
3524
3525@noindent
3526The @var{expression} is an @R{} expression, (usually a grouped
3527expression), that uses the arguments, @var{arg_i}, to calculate a value.
3528The value of the expression is the value returned for the function.
3529
3530A call to the function then usually takes the form
3531@code{@var{name}(@var{expr_1}, @var{expr_2}, @dots{})} and may occur
3532anywhere a function call is legitimate.
3533
3534@menu
3535* Simple examples::
3536* Defining new binary operators::
3537* Named arguments and defaults::
3538* The three dots argument::
3539* Assignment within functions::
3540* More advanced examples::
3541* Scope::
3542* Customizing the environment::
3543* Object orientation::
3544@end menu
3545
3546@node Simple examples, Defining new binary operators, Writing your own functions, Writing your own functions
3547@section Simple examples
3548
3549As a first example, consider a function to calculate the two sample
3550@math{t}-statistic, showing ``all the steps''.  This is an artificial
3551example, of course, since there are other, simpler ways of achieving the
3552same end.
3553
3554The function is defined as follows:
3555
3556@example
3557> twosam <- function(y1, y2) @{
3558    n1  <- length(y1); n2  <- length(y2)
3559    yb1 <- mean(y1);   yb2 <- mean(y2)
3560    s1  <- var(y1);    s2  <- var(y2)
3561    s <- ((n1-1)*s1 + (n2-1)*s2)/(n1+n2-2)
3562    tst <- (yb1 - yb2)/sqrt(s*(1/n1 + 1/n2))
3563    tst
3564  @}
3565@end example
3566
3567With this function defined, you could perform two sample @math{t}-tests
3568using a call such as
3569
3570@example
3571> tstat <- twosam(data$male, data$female); tstat
3572@end example
3573
3574As a second example, consider a function to emulate directly the
3575@sc{Matlab} backslash command, which returns the coefficients of the
3576orthogonal projection of the vector @math{y} onto the column space of
3577the matrix, @math{X}.  (This is ordinarily called the least squares
3578estimate of the regression coefficients.)  This would ordinarily be
3579done with the @code{qr()} function; however this is sometimes a bit
3580tricky to use directly and it pays to have a simple function such as the
3581following to use it safely.
3582
3583Thus given a @math{n} by @math{1} vector @math{y} and an @math{n} by
3584@math{p} matrix @math{X} then @math{X \ y} is defined as
3585@ifnottex
3586(X'X)^@{-@}X'y, where (X'X)^@{-@}
3587@end ifnottex
3588@tex
3589$(X^T X)^{-}X^T y$, where $(X^T X)^{-}$
3590@end tex
3591is a generalized inverse of @math{X'X}.
3592
3593@example
3594> bslash <- function(X, y) @{
3595  X <- qr(X)
3596  qr.coef(X, y)
3597@}
3598@end example
3599
3600After this object is created it may be used in statements such as
3601
3602@example
3603> regcoeff <- bslash(Xmat, yvar)
3604@end example
3605
3606@noindent
3607and so on.
3608
3609The classical @R{} function @code{lsfit()} does this job quite well, and
3610more@footnote{See also the methods described in @ref{Statistical models
3611in R}}.  It in turn uses the functions @code{qr()} and @code{qr.coef()}
3612in the slightly counterintuitive way above to do this part of the
3613calculation.  Hence there is probably some value in having just this
3614part isolated in a simple to use function if it is going to be in
3615frequent use.  If so, we may wish to make it a matrix binary operator
3616for even more convenient use.
3617
3618@node Defining new binary operators, Named arguments and defaults, Simple examples, Writing your own functions
3619@section Defining new binary operators
3620@cindex Binary operators
3621
3622Had we given the @code{bslash()} function a different name, namely one of
3623the form
3624
3625@example
3626%@var{anything}%
3627@end example
3628
3629@noindent
3630it could have been used as a @emph{binary operator} in expressions
3631rather than in function form.  Suppose, for example, we choose @code{!}
3632for the internal character.  The function definition would then start as
3633
3634@example
3635> "%!%" <- function(X, y) @{ @dots{} @}
3636@end example
3637
3638@noindent
3639(Note the use of quote marks.)  The function could then be used as
3640@code{X %!% y}.  (The backslash symbol itself is not a convenient choice
3641as it presents special problems in this context.)
3642
3643The matrix multiplication operator, @code{%*%}, and the outer product
3644matrix operator @code{%o%} are other examples of binary operators
3645defined in this way.
3646
3647@node Named arguments and defaults, The three dots argument, Defining new binary operators, Writing your own functions
3648@section Named arguments and defaults
3649@cindex Named arguments
3650@cindex Default values
3651
3652As first noted in @ref{Generating regular sequences}, if arguments to
3653called functions are given in the ``@code{@var{name}=@var{object}}''
3654form, they may be given in any order.  Furthermore the argument sequence
3655may begin in the unnamed, positional form, and specify named arguments
3656after the positional arguments.
3657
3658Thus if there is a function @code{fun1} defined by
3659
3660@example
3661> fun1 <- function(data, data.frame, graph, limit) @{
3662    @r{[function body omitted]}
3663  @}
3664@end example
3665
3666@noindent
3667then the function may be invoked in several ways, for example
3668
3669@example
3670> ans <- fun1(d, df, TRUE, 20)
3671> ans <- fun1(d, df, graph=TRUE, limit=20)
3672> ans <- fun1(data=d, limit=20, graph=TRUE, data.frame=df)
3673@end example
3674
3675@noindent
3676are all equivalent.
3677
3678In many cases arguments can be given commonly appropriate default
3679values, in which case they may be omitted altogether from the call when
3680the defaults are appropriate.  For example, if @code{fun1} were defined
3681as
3682
3683@example
3684> fun1 <- function(data, data.frame, graph=TRUE, limit=20) @{ @dots{} @}
3685@end example
3686
3687@noindent
3688it could be called as
3689
3690@example
3691> ans <- fun1(d, df)
3692@end example
3693
3694@noindent
3695which is now equivalent to the three cases above, or as
3696
3697@example
3698> ans <- fun1(d, df, limit=10)
3699@end example
3700
3701@noindent
3702which changes one of the defaults.
3703
3704It is important to note that defaults may be arbitrary expressions, even
3705involving other arguments to the same function; they are not restricted
3706to be constants as in our simple example here.
3707
3708@node The three dots argument, Assignment within functions, Named arguments and defaults, Writing your own functions
3709@section The @samp{@dots{}} argument
3710
3711@c The ?Reserved topic links here, so please update it
3712@c if changing the node name.
3713
3714Another frequent requirement is to allow one function to pass on
3715argument settings to another.  For example many graphics functions use
3716the function @code{par()} and functions like @code{plot()} allow the
3717user to pass on graphical parameters to @code{par()} to control the
3718graphical output.  (@xref{The par() function}, for more details on the
3719@code{par()} function.)  This can be done by including an extra
3720argument, literally @samp{@dots{}}, of the function, which may then be
3721passed on.  An outline example is given below.
3722
3723@example
3724fun1 <- function(data, data.frame, graph=TRUE, limit=20, ...) @{
3725  @r{[omitted statements]}
3726  if (graph)
3727    par(pch="*", ...)
3728  @r{[more omissions]}
3729@}
3730@end example
3731
3732Less frequently, a function will need to refer to components of
3733@samp{@dots{}}.  The expression @code{list(...)} evaluates all such
3734arguments and returns them in a named list, while @code{..1},
3735@code{..2}, etc. evaluate them one at a time, with @samp{..n}
3736returning the n'th unmatched argument.
3737
3738@node Assignment within functions, More advanced examples, The three dots argument, Writing your own functions
3739@section Assignments within functions
3740
3741Note that @emph{any ordinary assignments done within the function are
3742local and temporary and are lost after exit from the function}.  Thus
3743the assignment @code{X <- qr(X)} does not affect the value of the
3744argument in the calling program.
3745
3746To understand completely the rules governing the scope of @R{} assignments
3747the reader needs to be familiar with the notion of an evaluation
3748@emph{frame}.  This is a somewhat advanced, though hardly difficult,
3749topic and is not covered further here.
3750
3751If global and permanent assignments are intended within a function, then
3752either the ``superassignment'' operator, @code{<<-} or the function
3753@code{assign()} can be used.  See the @code{help} document for details.
3754@SPLUS{} users should be aware that @code{<<-} has different semantics
3755in @R{}.  These are discussed further in @ref{Scope}.
3756
3757@node More advanced examples, Scope, Assignment within functions, Writing your own functions
3758@section More advanced examples
3759
3760@menu
3761* Efficiency factors in block designs::
3762* Dropping all names in a printed array::
3763* Recursive numerical integration::
3764@end menu
3765
3766@node Efficiency factors in block designs, Dropping all names in a printed array, More advanced examples, More advanced examples
3767@subsection Efficiency factors in block designs
3768
3769As a more complete, if a little pedestrian, example of a function,
3770consider finding the efficiency factors for a block design.  (Some
3771aspects of this problem have already been discussed in @ref{Index
3772matrices}.)
3773
3774A block design is defined by two factors, say @code{blocks} (@code{b}
3775levels) and @code{varieties} (@code{v} levels).  If @math{R} and
3776@math{K} are the @math{v} by @math{v} and @math{b} by @math{b}
3777@emph{replications} and @emph{block size} matrices, respectively, and
3778@math{N} is the @math{b} by @math{v} incidence matrix, then the
3779efficiency factors are defined as the eigenvalues of the matrix
3780@ifnottex
3781E = I_v - R^@{-1/2@}N'K^@{-1@}NR^@{-1/2@} = I_v - A'A, where
3782A = K^@{-1/2@}NR^@{-1/2@}.
3783@end ifnottex
3784@tex
3785$$E = I_v - R^{-1/2}N^T K^{-1}NR^{-1/2} = I_v - A^T A,$$
3786where $A = K^{-1/2}NR^{-1/2}$.
3787@end tex
3788One way to write the function is given below.
3789
3790@example
3791> bdeff <- function(blocks, varieties) @{
3792    blocks <- as.factor(blocks)             # @r{minor safety move}
3793    b <- length(levels(blocks))
3794    varieties <- as.factor(varieties)       # @r{minor safety move}
3795    v <- length(levels(varieties))
3796    K <- as.vector(table(blocks))           # @r{remove dim attr}
3797    R <- as.vector(table(varieties))        # @r{remove dim attr}
3798    N <- table(blocks, varieties)
3799    A <- 1/sqrt(K) * N * rep(1/sqrt(R), rep(b, v))
3800    sv <- svd(A)
3801    list(eff=1 - sv$d^2, blockcv=sv$u, varietycv=sv$v)
3802@}
3803@end example
3804
3805It is numerically slightly better to work with the singular value
3806decomposition on this occasion rather than the eigenvalue routines.
3807
3808The result of the function is a list giving not only the efficiency
3809factors as the first component, but also the block and variety canonical
3810contrasts, since sometimes these give additional useful qualitative
3811information.
3812
3813@node Dropping all names in a printed array, Recursive numerical integration, Efficiency factors in block designs, More advanced examples
3814@subsection Dropping all names in a printed array
3815
3816For printing purposes with large matrices or arrays, it is often useful
3817to print them in close block form without the array names or numbers.
3818Removing the @code{dimnames} attribute will not achieve this effect, but
3819rather the array must be given a @code{dimnames} attribute consisting of
3820empty strings.  For example to print a matrix, @code{X}
3821
3822@example
3823> temp <- X
3824> dimnames(temp) <- list(rep("", nrow(X)), rep("", ncol(X)))
3825> temp; rm(temp)
3826@end example
3827
3828This can be much more conveniently done using a function,
3829@code{no.dimnames()}, shown below, as a ``wrap around'' to achieve the
3830same result.  It also illustrates how some effective and useful user
3831functions can be quite short.
3832
3833@example
3834no.dimnames <- function(a) @{
3835  ## @r{Remove all dimension names from an array for compact printing.}
3836  d <- list()
3837  l <- 0
3838  for(i in dim(a)) @{
3839    d[[l <- l + 1]] <- rep("", i)
3840  @}
3841  dimnames(a) <- d
3842  a
3843@}
3844@end example
3845
3846With this function defined, an array may be printed in close format
3847using
3848
3849@example
3850> no.dimnames(X)
3851@end example
3852
3853This is particularly useful for large integer arrays, where patterns are
3854the real interest rather than the values.
3855
3856@node Recursive numerical integration,  , Dropping all names in a printed array, More advanced examples
3857@subsection Recursive numerical integration
3858
3859Functions may be recursive, and may themselves define functions within
3860themselves.  Note, however, that such functions, or indeed variables,
3861are not inherited by called functions in higher evaluation frames as
3862they would be if they were on the search path.
3863
3864The example below shows a naive way of performing one-dimensional
3865numerical integration.  The integrand is evaluated at the end points of
3866the range and in the middle.  If the one-panel trapezium rule answer is
3867close enough to the two panel, then the latter is returned as the value.
3868Otherwise the same process is recursively applied to each panel.  The
3869result is an adaptive integration process that concentrates function
3870evaluations in regions where the integrand is farthest from linear.
3871There is, however, a heavy overhead, and the function is only
3872competitive with other algorithms when the integrand is both smooth and
3873very difficult to evaluate.
3874
3875The example is also given partly as a little puzzle in @R{} programming.
3876
3877@example
3878area <- function(f, a, b, eps = 1.0e-06, lim = 10) @{
3879  fun1 <- function(f, a, b, fa, fb, a0, eps, lim, fun) @{
3880    ## @r{function `fun1' is only visible inside `area'}
3881    d <- (a + b)/2
3882    h <- (b - a)/4
3883    fd <- f(d)
3884    a1 <- h * (fa + fd)
3885    a2 <- h * (fd + fb)
3886    if(abs(a0 - a1 - a2) < eps || lim == 0)
3887      return(a1 + a2)
3888    else @{
3889      return(fun(f, a, d, fa, fd, a1, eps, lim - 1, fun) +
3890             fun(f, d, b, fd, fb, a2, eps, lim - 1, fun))
3891    @}
3892  @}
3893  fa <- f(a)
3894  fb <- f(b)
3895  a0 <- ((fa + fb) * (b - a))/2
3896  fun1(f, a, b, fa, fb, a0, eps, lim, fun1)
3897@}
3898@end example
3899
3900@menu
3901* Scope::
3902* Object orientation::
3903@end menu
3904
3905@node Scope, Customizing the environment, More advanced examples, Writing your own functions
3906@section Scope
3907@cindex Scope
3908
3909The discussion in this section is somewhat more technical than in other
3910parts of this document.  However, it details one of the major differences
3911between @SPLUS{} and @R{}.
3912
3913The symbols which occur in the body of a function can be divided into
3914three classes; formal parameters, local variables and free variables.
3915The formal parameters of a function are those occurring in the argument
3916list of the function.  Their values are determined by the process of
3917@emph{binding} the actual function arguments to the formal parameters.
3918Local variables are those whose values are determined by the evaluation
3919of expressions in the body of the functions.  Variables which are not
3920formal parameters or local variables are called free variables.  Free
3921variables become local variables if they are assigned to.  Consider the
3922following function definition.
3923
3924@example
3925f <- function(x) @{
3926  y <- 2*x
3927  print(x)
3928  print(y)
3929  print(z)
3930@}
3931@end example
3932
3933In this function, @code{x} is a formal parameter, @code{y} is a local
3934variable and @code{z} is a free variable.
3935
3936In @R{} the free variable bindings are resolved by first looking in the
3937environment in which the function was created.  This is called
3938@emph{lexical scope}.  First we define a function called @code{cube}.
3939
3940@example
3941cube <- function(n) @{
3942  sq <- function() n*n
3943  n*sq()
3944@}
3945@end example
3946
3947The variable @code{n} in the function @code{sq} is not an argument to that
3948function.  Therefore it is a free variable and the scoping rules must be
3949used to ascertain the value that is to be associated with it.  Under static
3950scope (@SPLUS{}) the value is that associated with a global variable named
3951@code{n}.  Under lexical scope (@R{}) it is the parameter to the function
3952@code{cube} since that is the active binding for the variable @code{n} at
3953the time the function @code{sq} was defined.  The difference between
3954evaluation in @R{} and evaluation in @SPLUS{} is that @SPLUS{} looks for a
3955global variable called @code{n} while @R{} first looks for a variable
3956called @code{n} in the environment created when @code{cube} was invoked.
3957
3958@example
3959## @r{first evaluation in S}
3960S> cube(2)
3961Error in sq(): Object "n" not found
3962Dumped
3963S> n <- 3
3964S> cube(2)
3965[1] 18
3966## @r{then the same function evaluated in R}
3967R> cube(2)
3968[1] 8
3969@end example
3970
3971Lexical scope can also be used to give functions @emph{mutable state}.
3972In the following example we show how @R{} can be used to mimic a bank
3973account.  A functioning bank account needs to have a balance or total, a
3974function for making withdrawals, a function for making deposits and a
3975function for stating the current balance.  We achieve this by creating
3976the three functions within @code{account} and then returning a list
3977containing them.  When @code{account} is invoked it takes a numerical
3978argument @code{total} and returns a list containing the three functions.
3979Because these functions are defined in an environment which contains
3980@code{total}, they will have access to its value.
3981
3982The special assignment operator, @code{<<-},
3983@findex <<-
3984is used to change the value associated with @code{total}.  This operator
3985looks back in enclosing environments for an environment that contains
3986the symbol @code{total} and when it finds such an environment it
3987replaces the value, in that environment, with the value of right hand
3988side.  If the global or top-level environment is reached without finding
3989the symbol @code{total} then that variable is created and assigned to
3990there.  For most users @code{<<-} creates a global variable and assigns
3991the value of the right hand side to it@footnote{In some sense this
3992mimics the behavior in @SPLUS{} since in @SPLUS{} this operator always
3993creates or assigns to a global variable.}.  Only when @code{<<-} has
3994been used in a function that was returned as the value of another
3995function will the special behavior described here occur.
3996
3997@example
3998open.account <- function(total) @{
3999  list(
4000    deposit = function(amount) @{
4001      if(amount <= 0)
4002        stop("Deposits must be positive!\n")
4003      total <<- total + amount
4004      cat(amount, "deposited.  Your balance is", total, "\n\n")
4005    @},
4006    withdraw = function(amount) @{
4007      if(amount > total)
4008        stop("You don't have that much money!\n")
4009      total <<- total - amount
4010      cat(amount, "withdrawn.  Your balance is", total, "\n\n")
4011    @},
4012    balance = function() @{
4013      cat("Your balance is", total, "\n\n")
4014    @}
4015  )
4016@}
4017
4018ross <- open.account(100)
4019robert <- open.account(200)
4020
4021ross$withdraw(30)
4022ross$balance()
4023robert$balance()
4024
4025ross$deposit(50)
4026ross$balance()
4027ross$withdraw(500)
4028@end example
4029
4030@node Customizing the environment, Object orientation, Scope, Writing your own functions
4031@section Customizing the environment
4032@cindex Customizing the environment
4033
4034Users can customize their environment in several different ways.  There
4035is a site initialization file and every directory can have its own
4036special initialization file.  Finally, the special functions
4037@code{.First} and @code{.Last} can be used.
4038
4039The location of the site initialization file is taken from the value of
4040the @env{R_PROFILE} environment variable.  If that variable is unset,
4041the file @file{Rprofile.site} in the @R{} home subdirectory @file{etc} is
4042used.  This file should contain the commands that you want to execute
4043every time @R{} is started under your system.  A second, personal,
4044profile file named @file{.Rprofile}@footnote{So it is hidden under
4045UNIX.} can be placed in any directory.  If @R{} is invoked in that
4046directory then that file will be sourced.  This file gives individual
4047users control over their workspace and allows for different startup
4048procedures in different working directories.  If no @file{.Rprofile}
4049file is found in the startup directory, then @R{} looks for a
4050@file{.Rprofile} file in the user's home directory and uses that (if it
4051exists).  If the environment variable @env{R_PROFILE_USER} is set, the
4052file it points to is used instead of the @file{.Rprofile} files.
4053
4054Any function named @code{.First()} in either of the two profile files or
4055in the @file{.RData} image has a special status.  It is automatically
4056performed at the beginning of an @R{} session and may be used to
4057initialize the environment.  For example, the definition in the example
4058below alters the prompt to @code{$} and sets up various other useful
4059things that can then be taken for granted in the rest of the session.
4060
4061Thus, the sequence in which files are executed is, @file{Rprofile.site},
4062the user profile, @file{.RData} and then @code{.First()}.  A definition
4063in later files will mask definitions in earlier files.
4064
4065@example
4066> .First <- function() @{
4067  options(prompt="$ ", continue="+\t")  # @r{@code{$} is the prompt}
4068  options(digits=5, length=999)         # @r{custom numbers and printout}
4069  x11()                                 # @r{for graphics}
4070  par(pch = "+")                        # @r{plotting character}
4071  source(file.path(Sys.getenv("HOME"), "R", "mystuff.R"))
4072                                        # @r{my personal functions}
4073  library(MASS)                         # @r{attach a package}
4074@}
4075@end example
4076@findex .First
4077
4078Similarly a function @code{.Last()}, if defined, is (normally) executed
4079at the very end of the session.  An example is given below.
4080
4081@example
4082> .Last <- function() @{
4083  graphics.off()                        # @r{a small safety measure.}
4084  cat(paste(date(),"\nAdios\n"))        # @r{Is it time for lunch?}
4085@}
4086@end example
4087@findex .Last
4088
4089@node Object orientation,  , Customizing the environment, Writing your own functions
4090@section Classes, generic functions and object orientation
4091@cindex Classes
4092@cindex Generic functions
4093@cindex Object orientation
4094
4095The class of an object determines how it will be treated by what are
4096known as @emph{generic} functions.  Put the other way round, a generic
4097function performs a task or action on its arguments @emph{specific to
4098the class of the argument itself}.  If the argument lacks any @code{class}
4099attribute, or has a class not catered for specifically by the generic
4100function in question, there is always a @emph{default action} provided.
4101
4102An example makes things clearer.  The class mechanism offers the user
4103the facility of designing and writing generic functions for special
4104purposes.  Among the other generic functions are @code{plot()} for
4105displaying objects graphically, @code{summary()} for summarizing
4106analyses of various types, and @code{anova()} for comparing statistical
4107models.
4108
4109The number of generic functions that can treat a class in a specific way
4110can be quite large.  For example, the functions that can accommodate in
4111some fashion objects of class @code{"data.frame"} include
4112
4113@example
4114[     [[<-    any    as.matrix
4115[<-   mean    plot   summary
4116@end example
4117
4118@findex methods
4119A currently complete list can be got by using the @code{methods()}
4120function:
4121
4122@example
4123> methods(class="data.frame")
4124@end example
4125
4126Conversely the number of classes a generic function can handle can also
4127be quite large.  For example the @code{plot()} function has a default
4128method and variants for objects of classes @code{"data.frame"},
4129@code{"density"}, @code{"factor"}, and more.  A complete list can be got
4130again by using the @code{methods()} function:
4131
4132@example
4133> methods(plot)
4134@end example
4135
4136For many generic functions the function body is quite short, for example
4137
4138@example
4139> coef
4140function (object, ...)
4141UseMethod("coef")
4142@end example
4143
4144@noindent
4145The presence of @code{UseMethod} indicates this is a generic function.
4146To see what methods are available we can use @code{methods()}
4147
4148@example
4149> methods(coef)
4150[1] coef.aov*         coef.Arima*       coef.default*     coef.listof*
4151[5] coef.nls*         coef.summary.nls*
4152
4153   Non-visible functions are asterisked
4154@end example
4155
4156@noindent
4157In this example there are six methods, none of which can be seen by
4158typing its name.  We can read these by either of
4159
4160@findex getAnywhere
4161@findex getS3method
4162@example
4163> getAnywhere("coef.aov")
4164A single object matching 'coef.aov' was found
4165It was found in the following places
4166  registered S3 method for coef from namespace stats
4167  namespace:stats
4168with value
4169
4170function (object, ...)
4171@{
4172    z <- object$coef
4173    z[!is.na(z)]
4174@}
4175
4176> getS3method("coef", "aov")
4177function (object, ...)
4178@{
4179    z <- object$coef
4180    z[!is.na(z)]
4181@}
4182@end example
4183
4184A function named @code{@var{gen}.@var{cl}} will be invoked by the
4185generic @code{@var{gen}} for class @code{@var{cl}}, so do not name
4186functions in this style unless they are intended to be methods.
4187
4188The reader is referred to the @emph{R Language Definition} for a more
4189complete discussion of this mechanism.
4190
4191
4192@node Statistical models in R, Graphics, Writing your own functions, Top
4193@chapter Statistical models in R
4194@cindex Statistical models
4195
4196This section presumes the reader has some familiarity with statistical
4197methodology, in particular with regression analysis and the analysis of
4198variance.  Later we make some rather more ambitious presumptions, namely
4199that something is known about generalized linear models and nonlinear
4200regression.
4201
4202The requirements for fitting statistical models are sufficiently well
4203defined to make it possible to construct general tools that apply in a
4204broad spectrum of problems.
4205
4206@R{} provides an interlocking suite of facilities that make fitting
4207statistical models very simple.  As we mention in the introduction, the
4208basic output is minimal, and one needs to ask for the details by calling
4209extractor functions.
4210
4211@menu
4212* Formulae for statistical models::
4213* Linear models::
4214* Generic functions for extracting model information::
4215* Analysis of variance and model comparison::
4216* Updating fitted models::
4217* Generalized linear models::
4218* Nonlinear least squares and maximum likelihood models::
4219* Some non-standard models::
4220@end menu
4221
4222@node Formulae for statistical models, Linear models, Statistical models in R, Statistical models in R
4223@section Defining statistical models; formulae
4224@cindex Formulae
4225
4226The template for a statistical model is a linear regression model with
4227independent, homoscedastic errors
4228
4229@ifnottex
4230@display
4231y_i = sum_@{j=0@}^p beta_j x_@{ij@} + e_i, @ @ @ @ i = 1, @dots{}, n,
4232@end display
4233@noindent
4234where the e_i are NID(0, sigma^2).
4235@end ifnottex
4236@tex
4237$$ y_i = \sum_{j=0}^p \beta_j x_{ij} + e_i,
4238   \qquad e_i \sim {\rm NID}(0,\sigma^2),
4239   \qquad i = 1, @dots{}, n
4240$$
4241@end tex
4242In matrix terms this would be written
4243
4244@ifnottex
4245@display
4246y = X @ beta + e
4247@end display
4248@end ifnottex
4249@tex
4250$$ y = X \beta + e $$
4251@end tex
4252
4253@noindent
4254where the @math{y} is the response vector, @math{X} is the @emph{model
4255matrix} or @emph{design matrix} and has columns
4256@math{x_0, x_1, @dots{}, x_p},
4257the determining variables.  Very often @math{x_0}
4258will be a column of ones defining an @emph{intercept} term.
4259
4260@subsubheading Examples
4261
4262Before giving a formal specification, a few examples may usefully set
4263the picture.
4264
4265Suppose @code{y}, @code{x}, @code{x0}, @code{x1}, @code{x2}, @dots{} are
4266numeric variables, @code{X} is a matrix and @code{A}, @code{B},
4267@code{C}, @dots{} are factors.  The following formulae on the left
4268side below specify statistical models as described on the right.
4269
4270@table @code
4271@item y ~ x
4272@itemx y ~ 1 + x
4273Both imply the same simple linear regression model of @math{y} on
4274@math{x}.  The first has an implicit intercept term, and the second an
4275explicit one.
4276
4277@item y ~ 0 + x
4278@itemx y ~ -1 + x
4279@itemx y ~ x - 1
4280Simple linear regression of @math{y} on @math{x} through the origin
4281(that is, without an intercept term).
4282
4283@item log(y) ~ x1 + x2
4284Multiple regression of the transformed variable,
4285@ifnottex
4286log(y),
4287@end ifnottex
4288@tex
4289$\log(y)$,
4290@end tex
4291on @math{x1} and @math{x2} (with an implicit intercept term).
4292
4293@item y ~ poly(x,2)
4294@itemx y ~ 1 + x + I(x^2)
4295Polynomial regression of @math{y} on @math{x} of degree 2.  The first
4296form uses orthogonal polynomials, and the second uses explicit powers,
4297as basis.
4298
4299@item y ~ X + poly(x,2)
4300Multiple regression @math{y} with model matrix consisting of the matrix
4301@math{X} as well as polynomial terms in @math{x} to degree 2.
4302
4303@item y ~ A
4304Single classification analysis of variance model of @math{y}, with
4305classes determined by @math{A}.
4306
4307@item y ~ A + x
4308Single classification analysis of covariance model of @math{y}, with
4309classes determined by @math{A}, and with covariate @math{x}.
4310
4311@item y ~ A*B
4312@itemx y ~ A + B + A:B
4313@itemx y ~ B %in% A
4314@itemx y ~ A/B
4315Two factor non-additive model of @math{y} on @math{A} and @math{B}.  The
4316first two specify the same crossed classification and the second two
4317specify the same nested classification.  In abstract terms all four
4318specify the same model subspace.
4319
4320@item y ~ (A + B + C)^2
4321@itemx y ~ A*B*C - A:B:C
4322Three factor experiment but with a model containing main effects and two
4323factor interactions only.  Both formulae specify the same model.
4324
4325@item y ~ A * x
4326@itemx y ~ A/x
4327@itemx y ~ A/(1 + x) - 1
4328Separate simple linear regression models of @math{y} on @math{x} within
4329the levels of @math{A}, with different codings.  The last form produces
4330explicit estimates of as many different intercepts and slopes as there
4331are levels in @math{A}.
4332
4333@item y ~ A*B + Error(C)
4334An experiment with two treatment factors, @math{A} and @math{B}, and
4335error strata determined by factor @math{C}.  For example a split plot
4336experiment, with whole plots (and hence also subplots), determined by
4337factor @math{C}.
4338@end table
4339
4340@findex ~
4341The operator @code{~} is used to define a @emph{model formula} in @R{}.
4342The form, for an ordinary linear model, is
4343
4344@example
4345@var{response} ~ @var{op_1} @var{term_1} @var{op_2} @var{term_2} @var{op_3} @var{term_3} @var{@dots{}}
4346@end example
4347
4348@noindent
4349where
4350
4351@table @var
4352@item response
4353is a vector or matrix, (or expression evaluating to a vector or matrix)
4354defining the response variable(s).
4355@item op_i
4356is an operator, either @code{+} or @code{-}, implying the inclusion or
4357exclusion of a term in the model, (the first is optional).
4358@item term_i
4359is either
4360@itemize @bullet
4361@item
4362a vector or matrix expression, or @code{1},
4363@item
4364a factor, or
4365@item
4366a @emph{formula expression} consisting of factors, vectors or matrices
4367connected by @emph{formula operators}.
4368@end itemize
4369In all cases each term defines a collection of columns either to be
4370added to or removed from the model matrix.  A @code{1} stands for an
4371intercept column and is by default included in the model matrix unless
4372explicitly removed.
4373
4374@end table
4375
4376The @emph{formula operators} are similar in effect to the Wilkinson and
4377Rogers notation used by such programs as Glim and Genstat.  One
4378inevitable change is that the operator @samp{@code{.}} becomes
4379@samp{@code{:}} since the period is a valid name character in @R{}.
4380
4381The notation is summarized below (based on Chambers & Hastie, 1992,
4382p.29):
4383
4384@table @code
4385@item @var{Y} ~ @var{M}
4386@var{Y} is modeled as @var{M}.
4387
4388@item @var{M_1} + @var{M_2}
4389Include @var{M_1} and @var{M_2}.
4390
4391@item @var{M_1} - @var{M_2}
4392Include @var{M_1} leaving out terms of @var{M_2}.
4393
4394@item @var{M_1} : @var{M_2}
4395The tensor product of @var{M_1} and @var{M_2}.  If both terms are
4396factors, then the ``subclasses'' factor.
4397
4398@item @var{M_1} %in% @var{M_2}
4399Similar to @code{@var{M_1}:@var{M_2}}, but with a different coding.
4400
4401@item @var{M_1} * @var{M_2}
4402@code{@var{M_1} + @var{M_2} + @var{M_1}:@var{M_2}}.
4403
4404@item @var{M_1} / @var{M_2}
4405@code{@var{M_1} + @var{M_2} %in% @var{M_1}}.
4406
4407@item @var{M}^@var{n}
4408All terms in @var{M} together with ``interactions'' up to order @var{n}
4409
4410@item I(@var{M})
4411Insulate @var{M}.  Inside @var{M} all operators have their normal
4412arithmetic meaning, and that term appears in the model matrix.
4413@end table
4414
4415Note that inside the parentheses that usually enclose function arguments
4416all operators have their normal arithmetic meaning.  The function
4417@code{I()} is an identity function used to allow terms in model formulae
4418to be defined using arithmetic operators.
4419
4420Note particularly that the model formulae specify the @emph{columns
4421of the model matrix}, the specification of the parameters being
4422implicit.  This is not the case in other contexts, for example in
4423specifying nonlinear models.
4424
4425@menu
4426* Contrasts::
4427@end menu
4428
4429@node Contrasts,  , Formulae for statistical models, Formulae for statistical models
4430@subsection Contrasts
4431@cindex Contrasts
4432
4433We need at least some idea how the model formulae specify the columns of
4434the model matrix.  This is easy if we have continuous variables, as each
4435provides one column of the model matrix (and the intercept will provide
4436a column of ones if included in the model).
4437
4438@cindex Factors
4439@cindex Ordered factors
4440What about a @math{k}-level factor @code{A}?  The answer differs for
4441unordered and ordered factors.  For @emph{unordered} factors @math{k -
44421} columns are generated for the indicators of the second, @dots{},
4443@math{k}th levels of the factor. (Thus the implicit parameterization is
4444to contrast the response at each level with that at the first.)  For
4445@emph{ordered} factors the @math{k - 1} columns are the orthogonal
4446polynomials on @math{1, @dots{}, k}, omitting the constant term.
4447
4448Although the answer is already complicated, it is not the whole story.
4449First, if the intercept is omitted in a model that contains a factor
4450term, the first such term is encoded into @math{k} columns giving the
4451indicators for all the levels.  Second, the whole behavior can be
4452changed by the @code{options} setting for @code{contrasts}.  The default
4453setting in @R{} is
4454
4455@example
4456options(contrasts = c("contr.treatment", "contr.poly"))
4457@end example
4458
4459@noindent
4460The main reason for mentioning this is that @R{} and @Sl{} have
4461different defaults for unordered factors, @Sl{} using Helmert
4462contrasts.  So if you need to compare your results to those of a textbook
4463or paper which used @SPLUS{}, you will need to set
4464
4465@example
4466options(contrasts = c("contr.helmert", "contr.poly"))
4467@end example
4468
4469@noindent
4470This is a deliberate difference, as treatment contrasts (@R{}'s default)
4471are thought easier for newcomers to interpret.
4472
4473We have still not finished, as the contrast scheme to be used can be set
4474for each term in the model using the functions @code{contrasts} and
4475@code{C}.
4476@findex contrasts
4477@findex C
4478
4479We have not yet considered interaction terms: these generate the
4480products of the columns introduced for their component terms.
4481
4482Although the details are complicated, model formulae in @R{} will
4483normally generate the models that an expert statistician would expect,
4484provided that marginality is preserved.  Fitting, for example, a model
4485with an interaction but not the corresponding main effects will in
4486general lead to surprising results, and is for experts only.
4487
4488
4489@node Linear models, Generic functions for extracting model information, Formulae for statistical models, Statistical models in R
4490@section Linear models
4491@cindex Linear models
4492
4493The basic function for fitting ordinary multiple models is @code{lm()},
4494and a streamlined version of the call is as follows:
4495@findex lm
4496
4497@example
4498> @var{fitted.model} <- lm(@var{formula}, data = @var{data.frame})
4499@end example
4500
4501For example
4502
4503@example
4504> fm2 <- lm(y ~ x1 + x2, data = production)
4505@end example
4506
4507@noindent
4508would fit a multiple regression model of @math{y} on @math{x1} and
4509@math{x2} (with implicit intercept term).
4510
4511The important (but technically optional) parameter @code{data =
4512production} specifies that any variables needed to construct the model
4513should come first from the @code{production} @emph{data frame}.
4514@emph{This is the case regardless of whether data frame
4515@code{production} has been attached on the search path or not}.
4516
4517@node Generic functions for extracting model information, Analysis of variance and model comparison, Linear models, Statistical models in R
4518@section Generic functions for extracting model information
4519
4520The value of @code{lm()} is a fitted model object; technically a list of
4521results of class @code{"lm"}.  Information about the fitted model can
4522then be displayed, extracted, plotted and so on by using generic
4523functions that orient themselves to objects of class @code{"lm"}.  These
4524include
4525
4526@example
4527add1    deviance   formula      predict  step
4528alias   drop1      kappa        print    summary
4529anova   effects    labels       proj     vcov
4530coef    family     plot         residuals
4531@end example
4532
4533A brief description of the most commonly used ones is given below.
4534
4535@table @code
4536@findex anova
4537@item anova(@var{object_1}, @var{object_2})
4538Compare a submodel with an outer model and produce an analysis of
4539variance table.
4540
4541@findex coefficients
4542@findex coef
4543@item coef(@var{object})
4544Extract the regression coefficient (matrix).
4545
4546Long form: @code{coefficients(@var{object})}.
4547
4548@findex deviance
4549@item deviance(@var{object})
4550Residual sum of squares, weighted if appropriate.
4551
4552@findex formula
4553@item formula(@var{object})
4554Extract the model formula.
4555
4556@findex plot
4557@item plot(@var{object})
4558Produce four plots, showing residuals, fitted values and some
4559diagnostics.
4560
4561@findex predict
4562@item predict(@var{object}, newdata=@var{data.frame})
4563The data frame supplied must have variables specified with the same
4564labels as the original.  The value is a vector or matrix of predicted
4565values corresponding to the determining variable values in
4566@var{data.frame}.
4567
4568@c  @item @code{predict.gam(@var{object},}
4569@c  @item @w{@ @ @ @code{newdata=@var{data.frame})}}
4570@c  @tab @code{predict.gam()} is a safe alternative to @code{predict()} that
4571@c  can be used for @code{lm}, @code{glm} and @code{gam} fitted objects.  It
4572@c  must be used, for example, in cases where orthogonal polynomials are
4573@c  used as the original basis functions, and the addition of new data
4574@c  implies different basis functions to the original.
4575
4576@findex print
4577@item print(@var{object})
4578Print a concise version of the object.  Most often used implicitly.
4579
4580@findex residuals
4581@findex resid
4582@item residuals(@var{object})
4583Extract the (matrix of) residuals, weighted as appropriate.
4584
4585Short form: @code{resid(@var{object})}.
4586
4587@findex step
4588@item step(@var{object})
4589Select a suitable model by adding or dropping terms and preserving
4590hierarchies.  The model with the smallest value of AIC (Akaike's An
4591Information Criterion) discovered in the stepwise search is returned.
4592
4593@findex summary
4594@item summary(@var{object})
4595Print a comprehensive summary of the results of the regression analysis.
4596
4597@findex vcov
4598@item vcov(@var{object})
4599Returns the variance-covariance matrix of the main parameters of a
4600fitted model object.
4601@end table
4602
4603@node Analysis of variance and model comparison, Updating fitted models, Generic functions for extracting model information, Statistical models in R
4604@section Analysis of variance and model comparison
4605@cindex Analysis of variance
4606
4607The model fitting function @code{aov(@var{formula},
4608data=@var{data.frame})}
4609@findex aov
4610operates at the simplest level in a very similar way to the function
4611@code{lm()}, and most of the generic functions listed in the table in
4612@ref{Generic functions for extracting model information} apply.
4613
4614It should be noted that in addition @code{aov()} allows an analysis of
4615models with multiple error strata such as split plot experiments, or
4616balanced incomplete block designs with recovery of inter-block
4617information.  The model formula
4618
4619@example
4620@var{response} ~ @var{mean.formula} + Error(@var{strata.formula})
4621@end example
4622@findex Error
4623
4624@noindent
4625specifies a multi-stratum experiment with error strata defined by the
4626@var{strata.formula}.  In the simplest case, @var{strata.formula} is
4627simply a factor, when it defines a two strata experiment, namely between
4628and within the levels of the factor.
4629
4630For example, with all determining variables factors, a model formula such
4631as that in:
4632
4633@example
4634> fm <- aov(yield ~ v + n*p*k + Error(farms/blocks), data=farm.data)
4635@end example
4636
4637@noindent
4638would typically be used to describe an experiment with mean model
4639@code{v + n*p*k} and three error strata, namely ``between farms'',
4640``within farms, between blocks'' and ``within blocks''.
4641
4642@menu
4643* ANOVA tables::
4644@end menu
4645
4646@node ANOVA tables,  , Analysis of variance and model comparison, Analysis of variance and model comparison
4647@subsection ANOVA tables
4648
4649Note also that the analysis of variance table (or tables) are for a
4650sequence of fitted models.  The sums of squares shown are the decrease
4651in the residual sums of squares resulting from an inclusion of
4652@emph{that term} in the model at @emph{that place} in the sequence.
4653Hence only for orthogonal experiments will the order of inclusion be
4654inconsequential.
4655
4656For multistratum experiments the procedure is first to project the
4657response onto the error strata, again in sequence, and to fit the mean
4658model to each projection.  For further details, see Chambers & Hastie
4659(1992).
4660
4661A more flexible alternative to the default full ANOVA table is to
4662compare two or more models directly using the @code{anova()} function.
4663@findex anova
4664
4665@example
4666> anova(@var{fitted.model.1}, @var{fitted.model.2}, @dots{})
4667@end example
4668
4669The display is then an ANOVA table showing the differences between the
4670fitted models when fitted in sequence.  The fitted models being compared
4671would usually be an hierarchical sequence, of course.  This does not
4672give different information to the default, but rather makes it easier to
4673comprehend and control.
4674
4675@node Updating fitted models, Generalized linear models, Analysis of variance and model comparison, Statistical models in R
4676@section Updating fitted models
4677@cindex Updating fitted models
4678
4679The @code{update()} function is largely a convenience function that
4680allows a model to be fitted that differs from one previously fitted
4681usually by just a few additional or removed terms.  Its form is
4682@findex update
4683
4684@example
4685> @var{new.model} <- update(@var{old.model}, @var{new.formula})
4686@end example
4687
4688In the @var{new.formula} the special name consisting of a period,
4689@samp{@code{.}},
4690@findex .
4691only, can be used to stand for ``the corresponding part of the old model
4692formula''.  For example,
4693
4694@example
4695> fm05 <- lm(y ~ x1 + x2 + x3 + x4 + x5, data = production)
4696> fm6  <- update(fm05, . ~ . + x6)
4697> smf6 <- update(fm6, sqrt(.) ~ .)
4698@end example
4699
4700@noindent
4701would fit a five variate multiple regression with variables (presumably)
4702from the data frame @code{production}, fit an additional model including
4703a sixth regressor variable, and fit a variant on the model where the
4704response had a square root transform applied.
4705
4706Note especially that if the @code{data=} argument is specified on the
4707original call to the model fitting function, this information is passed on
4708through the fitted model object to @code{update()} and its allies.
4709
4710The name @samp{.} can also be used in other contexts, but with slightly
4711different meaning.  For example
4712
4713@example
4714> fmfull <- lm(y ~ . , data = production)
4715@end example
4716
4717@noindent
4718would fit a model with response @code{y} and regressor variables
4719@emph{all other variables in the data frame @code{production}}.
4720
4721Other functions for exploring incremental sequences of models are
4722@code{add1()}, @code{drop1()} and @code{step()}.
4723@findex add1
4724@findex drop1
4725@findex step
4726The names of these give a good clue to their purpose, but for full
4727details see the on-line help.
4728
4729@node Generalized linear models, Nonlinear least squares and maximum likelihood models, Updating fitted models, Statistical models in R
4730@section Generalized linear models
4731@cindex Generalized linear models
4732
4733Generalized linear modeling is a development of linear models to
4734accommodate both non-normal response distributions and transformations
4735to linearity in a clean and straightforward way.  A generalized linear
4736model may be described in terms of the following sequence of
4737assumptions:
4738
4739@itemize @bullet
4740@item
4741There is a response, @math{y}, of interest and stimulus variables
4742@ifnottex
4743x_1, x_2, @dots{},
4744@end ifnottex
4745@tex
4746$x_1$, $x_2$, @dots{},
4747@end tex
4748whose values influence the distribution of the response.
4749
4750@item
4751The stimulus variables influence the distribution of @math{y} through
4752@emph{a single linear function, only}.  This linear function is called
4753the @emph{linear predictor}, and is usually written
4754@ifnottex
4755@display
4756eta = beta_1 x_1 + beta_2 x_2 + @dots{} + beta_p x_p,
4757@end display
4758hence x_i has no influence on the distribution of @math{y} if and only if
4759beta_i is zero.
4760@end ifnottex
4761@tex
4762$$ \eta = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p, $$
4763hence $x_i$ has no influence on the distribution of @math{y} if and only
4764if $\beta_i=0$.
4765@end tex
4766
4767@item
4768The distribution of @math{y} is of the form
4769@ifnottex
4770@display
4771f_Y(y; mu, phi)
4772  = exp((A/phi) * (y lambda(mu) - gamma(lambda(mu))) + tau(y, phi))
4773@end display
4774where phi is a @emph{scale parameter} (possibly known), and is constant
4775for all observations, @math{A} represents a prior weight, assumed known
4776but possibly varying with the observations, and $\mu$ is the mean of
4777@math{y}.
4778@end ifnottex
4779@tex
4780$$
4781f_Y(y;\mu,\varphi)
4782= \exp\left[{A \over \varphi}\left\{y\lambda(\mu) -
4783\gamma\left(\lambda(\mu)\right)\right\} + \tau(y,\varphi)\right]
4784$$
4785where $\varphi$ is a @emph{scale parameter} (possibly known), and is
4786constant for all observations, $A$ represents a prior weight, assumed
4787known but possibly varying with the observations, and $\mu$ is the mean
4788of $y$.
4789@end tex
4790So it is assumed that the distribution of @math{y} is determined by its
4791mean and possibly a scale parameter as well.
4792
4793@item
4794@ifnottex
4795The mean, mu, is a smooth invertible function of the linear predictor:
4796@display
4797mu = m(eta),    eta = m^@{-1@}(mu) = ell(mu)
4798@end display
4799and this inverse function, ell(), is called the @emph{link function}.
4800@end ifnottex
4801@tex
4802The mean, $\mu$, is a smooth invertible function of the linear predictor:
4803$$ \mu = m(\eta),\qquad \eta = m^{-1}(\mu) = \ell(\mu) $$
4804and this inverse function, $\ell()$, is called the @emph{link function}.
4805@end tex
4806
4807@end itemize
4808
4809These assumptions are loose enough to encompass a wide class of models
4810useful in statistical practice, but tight enough to allow the
4811development of a unified methodology of estimation and inference, at
4812least approximately.  The reader is referred to any of the current
4813reference works on the subject for full details, such as McCullagh &
4814Nelder (1989) or Dobson (1990).
4815
4816@menu
4817* Families::
4818* The glm() function::
4819@end menu
4820
4821@node Families, The glm() function, Generalized linear models, Generalized linear models
4822@subsection Families
4823@cindex Families
4824
4825The class of generalized linear models handled by facilities supplied in
4826@R{} includes @emph{gaussian}, @emph{binomial}, @emph{poisson},
4827@emph{inverse gaussian} and @emph{gamma} response distributions and also
4828@emph{quasi-likelihood} models where the response distribution is not
4829explicitly specified.  In the latter case the @emph{variance function}
4830must be specified as a function of the mean, but in other cases this
4831function is implied by the response distribution.
4832
4833Each response distribution admits a variety of link functions to connect
4834the mean with the linear predictor.  Those automatically available are
4835shown in the following table:
4836
4837@quotation
4838@multitable @columnfractions 0.25 0.55
4839@headitem Family name @tab Link functions
4840@item @code{binomial} @tab @code{logit}, @code{probit}, @code{log}, @code{cloglog}
4841@item @code{gaussian} @tab @code{identity}, @code{log}, @code{inverse}
4842@item @code{Gamma} @tab @code{identity}, @code{inverse}, @code{log}
4843@item @code{inverse.gaussian} @tab @code{1/mu^2}, @code{identity}, @code{inverse}, @code{log}
4844@item @code{poisson} @tab @code{identity}, @code{log}, @code{sqrt}
4845@item @code{quasi} @tab @code{logit}, @code{probit}, @code{cloglog},
4846@code{identity}, @code{inverse}, @code{log}, @code{1/mu^2}, @code{sqrt}
4847@end multitable
4848@end quotation
4849
4850The combination of a response distribution, a link function and various
4851other pieces of information that are needed to carry out the modeling
4852exercise is called the @emph{family} of the generalized linear model.
4853
4854@node The glm() function,  , Families, Generalized linear models
4855@subsection The @code{glm()} function
4856@findex glm
4857
4858Since the distribution of the response depends on the stimulus variables
4859through a single linear function @emph{only}, the same mechanism as was
4860used for linear models can still be used to specify the linear part of a
4861generalized model.  The family has to be specified in a different way.
4862
4863The @R{} function to fit a generalized linear model is @code{glm()}
4864which uses the form
4865
4866@example
4867> @var{fitted.model} <- glm(@var{formula}, family=@var{family.generator}, data=@var{data.frame})
4868@end example
4869
4870The only new feature is the @var{family.generator}, which is the
4871instrument by which the family is described.  It is the name of a
4872function that generates a list of functions and expressions that
4873together define and control the model and estimation process.  Although
4874this may seem a little complicated at first sight, its use is quite
4875simple.
4876
4877The names of the standard, supplied family generators are given under
4878``Family Name'' in the table in @ref{Families}.  Where there is a choice
4879of links, the name of the link may also be supplied with the family
4880name, in parentheses as a parameter.  In the case of the @code{quasi}
4881family, the variance function may also be specified in this way.
4882
4883Some examples make the process clear.
4884
4885@subsubheading The @code{gaussian} family
4886
4887A call such as
4888
4889@example
4890> fm <- glm(y ~ x1 + x2, family = gaussian, data = sales)
4891@end example
4892
4893@noindent
4894achieves the same result as
4895
4896@example
4897> fm <- lm(y ~ x1+x2, data=sales)
4898@end example
4899
4900@noindent
4901but much less efficiently.  Note how the gaussian family is not
4902automatically provided with a choice of links, so no parameter is
4903allowed.  If a problem requires a gaussian family with a nonstandard
4904link, this can usually be achieved through the @code{quasi} family, as
4905we shall see later.
4906
4907@subsubheading The @code{binomial} family
4908
4909Consider a small, artificial example, from Silvey (1970).
4910
4911On the Aegean island of Kalythos the male inhabitants suffer from a
4912congenital eye disease, the effects of which become more marked with
4913increasing age.  Samples of islander males of various ages were tested
4914for blindness and the results recorded.  The data is shown below:
4915
4916@iftex
4917@quotation
4918@multitable {No.@: tested::} {50} {50} {50} {50} {50}
4919@item Age:          @tab  20 @tab  35 @tab  45 @tab  55 @tab  70
4920@item No.@: tested: @tab  50 @tab  50 @tab  50 @tab  50 @tab  50
4921@item No.@: blind:  @tab  @w{ 6} @tab  17 @tab  26 @tab  37 @tab  44
4922@end multitable
4923@end quotation
4924@end iftex
4925@ifnottex
4926@multitable {No.@: tested::} {50} {50} {50} {50} {50}
4927@item Age:          @tab  20 @tab  35 @tab  45 @tab  55 @tab  70
4928@item No.@: tested: @tab  50 @tab  50 @tab  50 @tab  50 @tab  50
4929@item No.@: blind:  @tab  @w{ 6} @tab  17 @tab  26 @tab  37 @tab  44
4930@end multitable
4931@end ifnottex
4932
4933The problem we consider is to fit both logistic and probit models to
4934this data, and to estimate for each model the LD50, that is the age at
4935which the chance of blindness for a male inhabitant is 50%.
4936
4937If @math{y} is the number of blind at age @math{x} and @math{n} the
4938number tested, both models have the form
4939@ifnottex
4940y ~ B(n, F(beta_0 + beta_1 x))
4941@end ifnottex
4942@tex
4943$$ y \sim {\rm B}(n, F(\beta_0 + \beta_1 x)) $$
4944@end tex
4945where for the probit case,
4946@eqn{F(z) = \Phi(z), F(z) = Phi(z)}
4947is the standard normal distribution function, and in the logit case
4948(the default),
4949@eqn{F(z) = e^z/(1+e^z),F(z) = e^z/(1+e^z)}.
4950In both cases the LD50 is
4951@ifnottex
4952LD50 = - beta_0/beta_1
4953@end ifnottex
4954@tex
4955$$ \hbox{LD50} = -\beta_0/\beta_1 $$
4956@end tex
4957that is, the point at which the argument of the distribution function is
4958zero.
4959
4960The first step is to set the data up as a data frame
4961
4962@example
4963> kalythos <- data.frame(x = c(20,35,45,55,70), n = rep(50,5),
4964                         y = c(6,17,26,37,44))
4965@end example
4966
4967To fit a binomial model using @code{glm()} there are three possibilities
4968for the response:
4969
4970@itemize @bullet
4971@item
4972If the response is a @emph{vector} it is assumed to hold @emph{binary}
4973data, and so must be a @math{0/1} vector.
4974
4975@item
4976If the response is a @emph{two-column matrix} it is assumed that the
4977first column holds the number of successes for the trial and the second
4978holds the number of failures.
4979
4980@item
4981If the response is a @emph{factor}, its first level is taken as failure
4982(0) and all other levels as `success' (1).
4983@end itemize
4984
4985Here we need the second of these conventions, so we add a matrix to our
4986data frame:
4987
4988@example
4989> kalythos$Ymat <- cbind(kalythos$y, kalythos$n - kalythos$y)
4990@end example
4991
4992To fit the models we use
4993
4994@example
4995> fmp <- glm(Ymat ~ x, family = binomial(link=probit), data = kalythos)
4996> fml <- glm(Ymat ~ x, family = binomial, data = kalythos)
4997@end example
4998
4999Since the logit link is the default the parameter may be omitted on the
5000second call.  To see the results of each fit we could use
5001
5002@example
5003> summary(fmp)
5004> summary(fml)
5005@end example
5006
5007Both models fit (all too) well.  To find the LD50 estimate we can use a
5008simple function:
5009
5010@example
5011> ld50 <- function(b) -b[1]/b[2]
5012> ldp <- ld50(coef(fmp)); ldl <- ld50(coef(fml)); c(ldp, ldl)
5013@end example
5014
5015The actual estimates from this data are 43.663 years and 43.601 years
5016respectively.
5017
5018@subsubheading Poisson models
5019
5020With the Poisson family the default link is the @code{log}, and in
5021practice the major use of this family is to fit surrogate Poisson
5022log-linear models to frequency data, whose actual distribution is often
5023multinomial.  This is a large and important subject we will not discuss
5024further here.  It even forms a major part of the use of non-gaussian
5025generalized models overall.
5026
5027Occasionally genuinely Poisson data arises in practice and in the past
5028it was often analyzed as gaussian data after either a log or a
5029square-root transformation.  As a graceful alternative to the latter, a
5030Poisson generalized linear model may be fitted as in the following
5031example:
5032
5033@example
5034> fmod <- glm(y ~ A + B + x, family = poisson(link=sqrt),
5035              data = worm.counts)
5036@end example
5037
5038@subsubheading Quasi-likelihood models
5039
5040For all families the variance of the response will depend on the mean
5041and will have the scale parameter as a multiplier.  The form of
5042dependence of the variance on the mean is a characteristic of the
5043response distribution; for example for the poisson distribution
5044@eqn{\hbox{Var}[y] = \mu,Var(y) = mu}.
5045
5046For quasi-likelihood estimation and inference the precise response
5047distribution is not specified, but rather only a link function and the
5048form of the variance function as it depends on the mean.  Since
5049quasi-likelihood estimation uses formally identical techniques to those
5050for the gaussian distribution, this family provides a way of fitting
5051gaussian models with non-standard link functions or variance functions,
5052incidentally.
5053
5054For example, consider fitting the non-linear regression
5055@ifnottex
5056y = theta_1 z_1 / (z_2 - theta_2) + e
5057@end ifnottex
5058@tex
5059$$ y = {\theta_1z_1 \over z_2 - \theta_2} + e $$
5060@end tex
5061which may be written alternatively as
5062@ifnottex
5063y = 1 / (beta_1 x_1 + beta_2 x_2) + e
5064@end ifnottex
5065@tex
5066$$ y = {1 \over \beta_1x_1 + \beta_2x_2} + e $$
5067@end tex
5068where
5069@ifnottex
5070x_1 = z_2/z_1, x_2 = -1/z_1, beta_1 = 1/theta_1, and beta_2 =
5071theta_2/theta_1.
5072@end ifnottex
5073@tex
5074$x_1 = z_2/z_1$, $x_2=-1/z_1$, $\beta_1=1/\theta_1$ and
5075$\beta_2=\theta_2/\theta_1$.
5076@end tex
5077Supposing a suitable data frame to be set up we could fit this
5078non-linear regression as
5079
5080@example
5081> nlfit <- glm(y ~ x1 + x2 - 1,
5082               family = quasi(link=inverse, variance=constant),
5083               data = biochem)
5084@end example
5085
5086The reader is referred to the manual and the help document for further
5087information, as needed.
5088
5089@node Nonlinear least squares and maximum likelihood models, Some non-standard models, Generalized linear models, Statistical models in R
5090@section Nonlinear least squares and maximum likelihood models
5091@cindex Nonlinear least squares
5092
5093Certain forms of nonlinear model can be fitted by Generalized Linear
5094Models (@code{glm()}).  But in the majority of cases we have to approach
5095the nonlinear curve fitting problem as one of nonlinear optimization.
5096@R{}'s nonlinear optimization routines are @code{optim()}, @code{nlm()}
5097and @code{nlminb()},
5098@findex nlm
5099@findex optim
5100@findex nlminb
5101which provide the functionality (and more) of @SPLUS{}'s @code{ms()} and
5102@code{nlminb()}.  We seek the parameter values that minimize some index
5103of lack-of-fit, and they do this by trying out various parameter values
5104iteratively.  Unlike linear regression for example, there is no
5105guarantee that the procedure will converge on satisfactory estimates.
5106All the methods require initial guesses about what parameter values to
5107try, and convergence may depend critically upon the quality of the
5108starting values.
5109
5110@menu
5111* Least squares::
5112* Maximum likelihood::
5113@end menu
5114
5115@node Least squares, Maximum likelihood, Nonlinear least squares and maximum likelihood models, Nonlinear least squares and maximum likelihood models
5116@subsection Least squares
5117
5118One way to fit a nonlinear model is by minimizing the sum of the squared
5119errors (SSE) or residuals.  This method makes sense if the observed
5120errors could have plausibly arisen from a normal distribution.
5121
5122Here is an example from Bates & Watts (1988), page 51.  The data are:
5123
5124@example
5125> x <- c(0.02, 0.02, 0.06, 0.06, 0.11, 0.11, 0.22, 0.22, 0.56, 0.56,
5126         1.10, 1.10)
5127> y <- c(76, 47, 97, 107, 123, 139, 159, 152, 191, 201, 207, 200)
5128@end example
5129
5130The fit criterion to be minimized is:
5131
5132@example
5133> fn <- function(p) sum((y - (p[1] * x)/(p[2] + x))^2)
5134@end example
5135
5136In order to do the fit we need initial estimates of the parameters.  One
5137way to find sensible starting values is to plot the data, guess some
5138parameter values, and superimpose the model curve using those values.
5139
5140@example
5141> plot(x, y)
5142> xfit <- seq(.02, 1.1, .05)
5143> yfit <- 200 * xfit/(0.1 + xfit)
5144> lines(spline(xfit, yfit))
5145@end example
5146
5147We could do better, but these starting values of 200 and 0.1 seem
5148adequate.  Now do the fit:
5149
5150@example
5151> out <- nlm(fn, p = c(200, 0.1), hessian = TRUE)
5152@end example
5153@findex nlm
5154
5155After the fitting, @code{out$minimum} is the SSE, and
5156@code{out$estimate} are the least squares estimates of the parameters.
5157To obtain the approximate standard errors (SE) of the estimates we do:
5158
5159@example
5160> sqrt(diag(2*out$minimum/(length(y) - 2) * solve(out$hessian)))
5161@end example
5162
5163The @code{2} which is subtracted in the line above represents the number
5164of parameters.  A 95% confidence interval would be the parameter
5165estimate @eqn{\pm, +/-} 1.96 SE.  We can superimpose the least squares
5166fit on a new plot:
5167
5168@example
5169> plot(x, y)
5170> xfit <- seq(.02, 1.1, .05)
5171> yfit <- 212.68384222 * xfit/(0.06412146 + xfit)
5172> lines(spline(xfit, yfit))
5173@end example
5174
5175The standard package @pkg{stats} provides much more extensive facilities
5176for fitting non-linear models by least squares.  The model we have just
5177fitted is the Michaelis-Menten model, so we can use
5178
5179@example
5180> df <- data.frame(x=x, y=y)
5181> fit <- nls(y ~ SSmicmen(x, Vm, K), df)
5182> fit
5183Nonlinear regression model
5184  model:  y ~ SSmicmen(x, Vm, K)
5185   data:  df
5186          Vm            K
5187212.68370711   0.06412123
5188 residual sum-of-squares:  1195.449
5189> summary(fit)
5190
5191Formula: y ~ SSmicmen(x, Vm, K)
5192
5193Parameters:
5194    Estimate Std. Error t value Pr(>|t|)
5195Vm 2.127e+02  6.947e+00  30.615 3.24e-11
5196K  6.412e-02  8.281e-03   7.743 1.57e-05
5197
5198Residual standard error: 10.93 on 10 degrees of freedom
5199
5200Correlation of Parameter Estimates:
5201      Vm
5202K 0.7651
5203@end example
5204
5205@node Maximum likelihood,  , Least squares, Nonlinear least squares and maximum likelihood models
5206@subsection Maximum likelihood
5207@cindex Maximum likelihood
5208
5209Maximum likelihood is a method of nonlinear model fitting that applies
5210even if the errors are not normal.  The method finds the parameter values
5211which maximize the log likelihood, or equivalently which minimize the
5212negative log-likelihood.  Here is an example from Dobson (1990), pp.@:
5213108--111.  This example fits a logistic model to dose-response data,
5214which clearly could also be fit by @code{glm()}.  The data are:
5215
5216@example
5217> x <- c(1.6907, 1.7242, 1.7552, 1.7842, 1.8113,
5218         1.8369, 1.8610, 1.8839)
5219> y <- c( 6, 13, 18, 28, 52, 53, 61, 60)
5220> n <- c(59, 60, 62, 56, 63, 59, 62, 60)
5221@end example
5222
5223The negative log-likelihood to minimize is:
5224
5225@example
5226> fn <- function(p)
5227   sum( - (y*(p[1]+p[2]*x) - n*log(1+exp(p[1]+p[2]*x))
5228           + log(choose(n, y)) ))
5229@end example
5230
5231@noindent
5232We pick sensible starting values and do the fit:
5233
5234@example
5235> out <- nlm(fn, p = c(-50,20), hessian = TRUE)
5236@end example
5237@findex nlm
5238
5239@noindent
5240After the fitting, @code{out$minimum} is the negative log-likelihood,
5241and @code{out$estimate} are the maximum likelihood estimates of the
5242parameters.  To obtain the approximate SEs of the estimates we do:
5243
5244@example
5245> sqrt(diag(solve(out$hessian)))
5246@end example
5247
5248A 95% confidence interval would be the parameter estimate @eqn{\pm, +/-}
52491.96 SE.
5250
5251@node Some non-standard models,  , Nonlinear least squares and maximum likelihood models, Statistical models in R
5252@section Some non-standard models
5253
5254We conclude this chapter with just a brief mention of some of the other
5255facilities available in @R{} for special regression and data analysis
5256problems.
5257
5258@itemize @bullet
5259@item
5260@cindex Mixed models
5261@strong{Mixed models.}  The recommended @CRANpkg{nlme} package provides
5262functions @code{lme()} and @code{nlme()}
5263@findex lme
5264@findex nlme
5265for linear and non-linear mixed-effects models, that is linear and
5266non-linear regressions in which some of the coefficients correspond to
5267random effects.  These functions make heavy use of formulae to specify
5268the models.
5269
5270@item
5271@cindex Local approximating regressions
5272@strong{Local approximating regressions.}  The @code{loess()}
5273@findex loess
5274function fits a nonparametric regression by using a locally weighted
5275regression.  Such regressions are useful for highlighting a trend in
5276messy data or for data reduction to give some insight into a large data
5277set.
5278
5279Function @code{loess} is in the standard package @pkg{stats}, together
5280with code for projection pursuit regression.
5281@findex loess
5282
5283@item
5284@cindex Robust regression
5285@strong{Robust regression.} There are several functions available for
5286fitting regression models in a way resistant to the influence of extreme
5287outliers in the data.  Function @code{lqs}
5288@findex lqs
5289in the recommended package @CRANpkg{MASS} provides state-of-art algorithms
5290for highly-resistant fits.  Less resistant but statistically more
5291efficient methods are available in packages, for example function
5292@code{rlm}
5293@findex rlm
5294in package @CRANpkg{MASS}.
5295
5296@item
5297@cindex Additive models
5298@strong{Additive models.} This technique aims to construct a regression
5299function from smooth additive functions of the determining variables,
5300usually one for each determining variable.  Functions @code{avas} and
5301@code{ace}
5302@findex avas
5303@findex ace
5304in package @CRANpkg{acepack} and functions @code{bruto} and @code{mars}
5305@findex bruto
5306@findex mars
5307in package @CRANpkg{mda} provide some examples of these techniques in
5308user-contributed packages to @R{}.  An extension is @strong{Generalized
5309Additive Models}, implemented in user-contributed packages @CRANpkg{gam} and
5310@CRANpkg{mgcv}.
5311
5312@item
5313@cindex Tree-based models
5314@strong{Tree-based models.} Rather than seek an explicit global linear
5315model for prediction or interpretation, tree-based models seek to
5316bifurcate the data, recursively, at critical points of the determining
5317variables in order to partition the data ultimately into groups that are
5318as homogeneous as possible within, and as heterogeneous as possible
5319between.  The results often lead to insights that other data analysis
5320methods tend not to yield.
5321
5322Models are again specified in the ordinary linear model form.  The model
5323fitting function is @code{tree()},
5324@findex tree
5325but many other generic functions such as @code{plot()} and @code{text()}
5326are well adapted to displaying the results of a tree-based model fit in
5327a graphical way.
5328
5329Tree models are available in @R{} @emph{via} the user-contributed
5330packages @CRANpkg{rpart} and @CRANpkg{tree}.
5331
5332@end itemize
5333
5334@node Graphics, Packages, Statistical models in R, Top
5335@chapter Graphical procedures
5336
5337Graphical facilities are an important and extremely versatile component
5338of the @R{} environment.  It is possible to use the facilities to
5339display a wide variety of statistical graphs and also to build entirely
5340new types of graph.
5341
5342The graphics facilities can be used in both interactive and batch modes,
5343but in most cases, interactive use is more productive.  Interactive use
5344is also easy because at startup time @R{} initiates a graphics
5345@emph{device driver} which opens a special @emph{graphics window} for
5346the display of interactive graphics.  Although this is done
5347automatically, it may useful to know that the command used is
5348@code{X11()} under UNIX, @code{windows()} under Windows and
5349@code{quartz()} under macOS.  A new device can always be opened by
5350@code{dev.new()}.
5351
5352Once the device driver is running, @R{} plotting commands can be used to
5353produce a variety of graphical displays and to create entirely new kinds
5354of display.
5355
5356Plotting commands are divided into three basic groups:
5357
5358@itemize @bullet
5359@item
5360@strong{High-level} plotting functions create a new plot on the graphics
5361device, possibly with axes, labels, titles and so on.
5362@item
5363@strong{Low-level} plotting functions add more information to an
5364existing plot, such as extra points, lines and labels.
5365@item
5366@strong{Interactive} graphics functions allow you interactively add
5367information to, or extract information from, an existing plot, using a
5368pointing device such as a mouse.
5369@end itemize
5370
5371In addition, @R{} maintains a list of @emph{graphical parameters} which
5372can be manipulated to customize your plots.
5373
5374This manual only describes what are known as `base' graphics.  A
5375separate graphics sub-system in package @pkg{grid} coexists with base --
5376it is more powerful but harder to use.  There is a recommended package
5377@CRANpkg{lattice} which builds on @pkg{grid} and provides ways to produce
5378multi-panel plots akin to those in the @emph{Trellis} system in @Sl{}.
5379
5380@menu
5381* High-level plotting commands::
5382* Low-level plotting commands::
5383* Interacting with graphics::
5384* Using graphics parameters::
5385* Graphics parameters::
5386* Device drivers::
5387* Dynamic graphics::
5388@end menu
5389
5390@node High-level plotting commands, Low-level plotting commands, Graphics, Graphics
5391@section High-level plotting commands
5392
5393High-level plotting functions are designed to generate a complete plot
5394of the data passed as arguments to the function.  Where appropriate,
5395axes, labels and titles are automatically generated (unless you request
5396otherwise.) High-level plotting commands always start a new plot,
5397erasing the current plot if necessary.
5398
5399@menu
5400* The plot() function::
5401* Displaying multivariate data::
5402* Display graphics::
5403* Arguments to high-level plotting functions::
5404@end menu
5405
5406@node The plot() function, Displaying multivariate data, High-level plotting commands, High-level plotting commands
5407@subsection The @code{plot()} function
5408@findex plot
5409
5410One of the most frequently used plotting functions in @R{} is the
5411@code{plot()} function.  This is a @emph{generic} function: the type of
5412plot produced is dependent on the type or @emph{class} of the first
5413argument.
5414
5415@table @code
5416
5417@item plot(@var{x}, @var{y})
5418@itemx plot(@var{xy})
5419If @var{x} and @var{y} are vectors, @code{plot(@var{x}, @var{y})}
5420produces a scatterplot of @var{y} against @var{x}.  The same effect can
5421be produced by supplying one argument (second form) as either a list
5422containing two elements @var{x} and @var{y} or a two-column matrix.
5423
5424@item plot(@var{x})
5425If @var{x} is a time series, this produces a time-series plot. If
5426@var{x} is a numeric vector, it produces a plot of the values in the
5427vector against their index in the vector.  If @var{x} is a complex
5428vector, it produces a plot of imaginary versus real parts of the vector
5429elements.
5430
5431@item plot(@var{f})
5432@itemx plot(@var{f}, @var{y})
5433@var{f} is a factor object, @var{y} is a numeric vector.  The first form
5434generates a bar plot of @var{f}; the second form produces boxplots of
5435@var{y} for each level of @var{f}.
5436
5437@item plot(@var{df})
5438@itemx plot(~ @var{expr})
5439@itemx plot(@var{y} ~ @var{expr})
5440@var{df} is a data frame, @var{y} is any object, @var{expr} is a list
5441of object names separated by `@code{+}' (e.g., @code{a + b + c}).  The
5442first two forms produce distributional plots of the variables in a data
5443frame (first form) or of a number of named objects (second form).  The
5444third form plots @var{y} against every object named in @var{expr}.
5445@end table
5446
5447@node Displaying multivariate data, Display graphics, The plot() function, High-level plotting commands
5448@subsection Displaying multivariate data
5449
5450@R{} provides two very useful functions for representing multivariate
5451data.  If @code{X} is a numeric matrix or data frame, the command
5452
5453@example
5454> pairs(X)
5455@end example
5456@findex pairs
5457
5458@noindent
5459produces a pairwise scatterplot matrix of the variables defined by the
5460columns of @code{X}, that is, every column of @code{X} is plotted
5461against every other column of @code{X} and the resulting @math{n(n-1)}
5462plots are arranged in a matrix with plot scales constant over the rows
5463and columns of the matrix.
5464
5465When three or four variables are involved a @emph{coplot} may be more
5466enlightening.  If @code{a} and @code{b} are numeric vectors and @code{c}
5467is a numeric vector or factor object (all of the same length), then
5468the command
5469
5470@example
5471> coplot(a ~ b | c)
5472@end example
5473@findex coplot
5474
5475@noindent
5476produces a number of scatterplots of @code{a} against @code{b} for given
5477values of @code{c}.  If @code{c} is a factor, this simply means that
5478@code{a} is plotted against @code{b} for every level of @code{c}.  When
5479@code{c} is numeric, it is divided into a number of @emph{conditioning
5480intervals} and for each interval @code{a} is plotted against @code{b}
5481for values of @code{c} within the interval.  The number and position of
5482intervals can be controlled with @code{given.values=} argument to
5483@code{coplot()}---the function @code{co.intervals()} is useful for
5484selecting intervals.  You can also use two @emph{given} variables with a
5485command like
5486
5487@example
5488> coplot(a ~ b | c + d)
5489@end example
5490
5491@noindent
5492which produces scatterplots of @code{a} against @code{b} for every joint
5493conditioning interval of @code{c} and @code{d}.
5494
5495The @code{coplot()} and @code{pairs()} function both take an argument
5496@code{panel=} which can be used to customize the type of plot which
5497appears in each panel.  The default is @code{points()} to produce a
5498scatterplot but by supplying some other low-level graphics function of
5499two vectors @code{x} and @code{y} as the value of @code{panel=} you can
5500produce any type of plot you wish.  An example panel function useful for
5501coplots is @code{panel.smooth()}.
5502
5503@node Display graphics, Arguments to high-level plotting functions, Displaying multivariate data, High-level plotting commands
5504@subsection Display graphics
5505
5506Other high-level graphics functions produce different types of plots.
5507Some examples are:
5508
5509@table @code
5510@c  @item tsplot(x_1, x_2, @dots{})
5511@c  @findex tsplot
5512@c  Plots any number of time series on the same scale.  This automatic
5513@c  simultaneous scaling feature is also useful when the @code{x@var{_i}}'s
5514@c  are ordinary numeric vectors, in which case they are plotted against the
5515@c  numbers @math{1, 2, 3, @dots{}}.
5516
5517@item qqnorm(x)
5518@itemx qqline(x)
5519@itemx qqplot(x, y)
5520@findex qqnorm
5521@findex qqline
5522@findex qqplot
5523Distribution-comparison plots.  The first form plots the numeric vector
5524@code{x} against the expected Normal order scores (a normal scores plot)
5525and the second adds a straight line to such a plot by drawing a line
5526through the distribution and data quartiles.  The third form plots the
5527quantiles of @code{x} against those of @code{y} to compare their
5528respective distributions.
5529
5530@item hist(x)
5531@itemx hist(x, nclass=@var{n})
5532@itemx hist(x, breaks=@var{b}, @dots{})
5533@findex hist
5534Produces a histogram of the numeric vector @code{x}.  A sensible number
5535of classes is usually chosen, but a recommendation can be given with the
5536@code{nclass=} argument.  Alternatively, the breakpoints can be
5537specified exactly with the @code{breaks=} argument.  If the
5538@code{probability=TRUE} argument is given, the bars represent relative
5539frequencies divided by bin width instead of counts.
5540
5541@item dotchart(x, @dots{})
5542@findex dotchart
5543Constructs a dotchart of the data in @code{x}.  In a dotchart the
5544@math{y}-axis gives a labelling of the data in @code{x} and the
5545@math{x}-axis gives its value.  For example it allows easy visual
5546selection of all data entries with values lying in specified ranges.
5547
5548@item image(x, y, z, @dots{})
5549@itemx contour(x, y, z, @dots{})
5550@itemx persp(x, y, z, @dots{})
5551@findex image
5552@findex contour
5553@findex persp
5554Plots of three variables.  The @code{image} plot draws a grid of rectangles
5555using different colours to represent the value of @code{z}, the @code{contour}
5556plot draws contour lines to represent the value of @code{z}, and the
5557@code{persp} plot draws a 3D surface.
5558@end table
5559
5560@node Arguments to high-level plotting functions,  , Display graphics, High-level plotting commands
5561@subsection Arguments to high-level plotting functions
5562
5563There are a number of arguments which may be passed to high-level
5564graphics functions, as follows:
5565
5566@table @code
5567@item add=TRUE
5568Forces the function to act as a low-level graphics function,
5569superimposing the plot on the current plot (some functions only).
5570
5571@item axes=FALSE
5572Suppresses generation of axes---useful for adding your own custom axes
5573with the @code{axis()} function.  The default, @code{axes=TRUE}, means
5574include axes.
5575
5576@item log="x"
5577@itemx log="y"
5578@itemx log="xy"
5579Causes the @math{x}, @math{y} or both axes to be logarithmic.  This will
5580work for many, but not all, types of plot.
5581
5582@item type=
5583The @code{type=} argument controls the type of plot produced, as
5584follows:
5585
5586@table @code
5587@item type="p"
5588Plot individual points (the default)
5589@item type="l"
5590Plot lines
5591@item type="b"
5592Plot points connected by lines (@emph{both})
5593@item type="o"
5594Plot points overlaid by lines
5595@item type="h"
5596Plot vertical lines from points to the zero axis (@emph{high-density})
5597@item type="s"
5598@itemx type="S"
5599Step-function plots.  In the first form, the top of the vertical defines
5600the point; in the second, the bottom.
5601@item type="n"
5602No plotting at all.  However axes are still drawn (by default) and the
5603coordinate system is set up according to the data.  Ideal for creating
5604plots with subsequent low-level graphics functions.
5605@end table
5606
5607@item xlab=@var{string}
5608@itemx ylab=@var{string}
5609Axis labels for the @math{x} and @math{y} axes.  Use these arguments to
5610change the default labels, usually the names of the objects used in the
5611call to the high-level plotting function.
5612
5613@item main=@var{string}
5614Figure title, placed at the top of the plot in a large font.
5615
5616@item sub=@var{string}
5617Sub-title, placed just below the @math{x}-axis in a smaller font.
5618@end table
5619
5620@node Low-level plotting commands, Interacting with graphics, High-level plotting commands, Graphics
5621@section Low-level plotting commands
5622
5623Sometimes the high-level plotting functions don't produce exactly the
5624kind of plot you desire.  In this case, low-level plotting commands can
5625be used to add extra information (such as points, lines or text) to the
5626current plot.
5627
5628Some of the more useful low-level plotting functions are:
5629
5630@table @code
5631@item points(x, y)
5632@itemx lines(x, y)
5633@findex points
5634@findex lines
5635Adds points or connected lines to the current plot.  @code{plot()}'s
5636@code{type=} argument can also be passed to these functions (and
5637defaults to @code{"p"} for @code{points()} and @code{"l"} for
5638@code{lines()}.)
5639
5640@item text(x, y, labels, @dots{})
5641@findex text
5642Add text to a plot at points given by @code{x, y}.  Normally
5643@code{labels} is an integer or character vector in which case
5644@code{labels[i]} is plotted at point @code{(x[i], y[i])}.  The default
5645is @code{1:length(x)}.
5646
5647@strong{Note}: This function is often used in the sequence
5648
5649@example
5650> plot(x, y, type="n"); text(x, y, names)
5651@end example
5652
5653@noindent
5654The graphics parameter @code{type="n"} suppresses the points but sets up
5655the axes, and the @code{text()} function supplies special characters, as
5656specified by the character vector @code{names} for the points.
5657
5658@item abline(a, b)
5659@itemx abline(h=@var{y})
5660@itemx abline(v=@var{x})
5661@itemx abline(@var{lm.obj})
5662@findex abline
5663Adds a line of slope @code{b} and intercept @code{a} to the current
5664plot.  @code{h=@var{y}} may be used to specify @math{y}-coordinates for
5665the heights of horizontal lines to go across a plot, and
5666@code{v=@var{x}} similarly for the @math{x}-coordinates for vertical
5667lines.  Also @var{lm.obj} may be list with a @code{coefficients}
5668component of length 2 (such as the result of model-fitting functions,)
5669which are taken as an intercept and slope, in that order.
5670
5671@item polygon(x, y, @dots{})
5672@findex polygon
5673Draws a polygon defined by the ordered vertices in (@code{x}, @code{y})
5674and (optionally) shade it in with hatch lines, or fill it if the
5675graphics device allows the filling of figures.
5676
5677@item legend(x, y, legend, @dots{})
5678@findex legend
5679Adds a legend to the current plot at the specified position.  Plotting
5680characters, line styles, colors etc., are identified with the labels in
5681the character vector @code{legend}.  At least one other argument @var{v}
5682(a vector the same length as @code{legend}) with the corresponding
5683values of the plotting unit must also be given, as follows:
5684
5685@table @code
5686@item legend( , fill=@var{v})
5687Colors for filled boxes
5688@item legend( , col=@var{v})
5689Colors in which points or lines will be drawn
5690@item legend( , lty=@var{v})
5691Line styles
5692@item legend( , lwd=@var{v})
5693Line widths
5694@item legend( , pch=@var{v})
5695Plotting characters (character vector)
5696@end table
5697
5698@item title(main, sub)
5699@findex title
5700Adds a title @code{main} to the top of the current plot in a large font
5701and (optionally) a sub-title @code{sub} at the bottom in a smaller font.
5702
5703@item axis(side, @dots{})
5704@findex axis
5705Adds an axis to the current plot on the side given by the first argument
5706(1 to 4, counting clockwise from the bottom.)  Other arguments control
5707the positioning of the axis within or beside the plot, and tick
5708positions and labels.  Useful for adding custom axes after calling
5709@code{plot()} with the @code{axes=FALSE} argument.
5710@end table
5711
5712Low-level plotting functions usually require some positioning
5713information (e.g., @math{x} and @math{y} coordinates) to determine where
5714to place the new plot elements.  Coordinates are given in terms of
5715@emph{user coordinates} which are defined by the previous high-level
5716graphics command and are chosen based on the supplied data.
5717
5718Where @code{x} and @code{y} arguments are required, it is also
5719sufficient to supply a single argument being a list with elements named
5720@code{x} and @code{y}.  Similarly a matrix with two columns is also
5721valid input.  In this way functions such as @code{locator()} (see below)
5722may be used to specify positions on a plot interactively.
5723
5724@menu
5725* Mathematical annotation::
5726* Hershey vector fonts::
5727@end menu
5728
5729@node Mathematical annotation, Hershey vector fonts, Low-level plotting commands, Low-level plotting commands
5730@subsection Mathematical annotation
5731
5732In some cases, it is useful to add mathematical symbols and formulae to a
5733plot.  This can be achieved in @R{} by specifying an @emph{expression} rather
5734than a character string in any one of @code{text}, @code{mtext}, @code{axis},
5735or @code{title}.   For example, the following code draws the formula for
5736the Binomial probability function:
5737
5738@example
5739> text(x, y, expression(paste(bgroup("(", atop(n, x), ")"), p^x, q^@{n-x@})))
5740@end example
5741
5742More information, including a full listing of the features available can
5743obtained from within @R{} using the commands:
5744
5745@example
5746> help(plotmath)
5747> example(plotmath)
5748> demo(plotmath)
5749@end example
5750
5751@node Hershey vector fonts,  , Mathematical annotation, Low-level plotting commands
5752@subsection Hershey vector fonts
5753
5754It is possible to specify Hershey vector fonts for rendering text when using
5755the @code{text} and @code{contour} functions.  There are three reasons for
5756using the Hershey fonts:
5757@itemize @bullet
5758@item
5759Hershey fonts can produce better
5760output, especially on a computer screen, for rotated and/or small text.
5761@item
5762Hershey fonts
5763provide certain symbols that may not be available
5764in the standard fonts.  In particular, there are zodiac signs, cartographic
5765symbols and astronomical symbols.
5766@item
5767Hershey fonts provide cyrillic and japanese (Kana and Kanji) characters.
5768@end itemize
5769
5770More information, including tables of Hershey characters can be obtained from
5771within @R{} using the commands:
5772
5773@example
5774> help(Hershey)
5775> demo(Hershey)
5776> help(Japanese)
5777> demo(Japanese)
5778@end example
5779
5780@node Interacting with graphics, Using graphics parameters, Low-level plotting commands, Graphics
5781@section Interacting with graphics
5782
5783@R{} also provides functions which allow users to extract or add
5784information to a plot using a mouse.  The simplest of these is the
5785@code{locator()} function:
5786
5787@table @code
5788@item locator(n, type)
5789@findex locator
5790Waits for the user to select locations on the current plot using the
5791left mouse button.  This continues until @code{n} (default 512) points
5792have been selected, or another mouse button is pressed.  The
5793@code{type} argument allows for plotting at the selected points and has
5794the same effect as for high-level graphics commands; the default is no
5795plotting.  @code{locator()} returns the locations of the points selected
5796as a list with two components @code{x} and @code{y}.
5797@end table
5798
5799@code{locator()} is usually called with no arguments.  It is
5800particularly useful for interactively selecting positions for graphic
5801elements such as legends or labels when it is difficult to calculate in
5802advance where the graphic should be placed.  For example, to place some
5803informative text near an outlying point, the command
5804
5805@example
5806> text(locator(1), "Outlier", adj=0)
5807@end example
5808
5809@noindent
5810may be useful.  (@code{locator()} will be ignored if the current device,
5811such as @code{postscript} does not support interactive pointing.)
5812
5813@table @code
5814@item identify(x, y, labels)
5815@findex identify
5816Allow the user to highlight any of the points defined by @code{x} and
5817@code{y} (using the left mouse button) by plotting the corresponding
5818component of @code{labels} nearby (or the index number of the point if
5819@code{labels} is absent).  Returns the indices of the selected points
5820when another button is pressed.
5821@end table
5822
5823Sometimes we want to identify particular @emph{points} on a plot, rather
5824than their positions.  For example, we may wish the user to select some
5825observation of interest from a graphical display and then manipulate
5826that observation in some way.  Given a number of @math{(x, y)}
5827coordinates in two numeric vectors @code{x} and @code{y}, we could use
5828the @code{identify()} function as follows:
5829
5830@example
5831> plot(x, y)
5832> identify(x, y)
5833@end example
5834
5835The @code{identify()} functions performs no plotting itself, but simply
5836allows the user to move the mouse pointer and click the left mouse
5837button near a point.  If there is a point near the mouse pointer it will
5838be marked with its index number (that is, its position in the
5839@code{x}/@code{y} vectors) plotted nearby.  Alternatively, you could use
5840some informative string (such as a case name) as a highlight by using
5841the @code{labels} argument to @code{identify()}, or disable marking
5842altogether with the @code{plot = FALSE} argument.  When the process is
5843terminated (see above), @code{identify()} returns the indices of the
5844selected points; you can use these indices to extract the selected
5845points from the original vectors @code{x} and @code{y}.
5846
5847@node Using graphics parameters, Graphics parameters, Interacting with graphics, Graphics
5848@section Using graphics parameters
5849
5850When creating graphics, particularly for presentation or publication
5851purposes, @R{}'s defaults do not always produce exactly that which is
5852required.  You can, however, customize almost every aspect of the
5853display using @emph{graphics parameters}.  @R{} maintains a list of a
5854large number of graphics parameters which control things such as line
5855style, colors, figure arrangement and text justification among many
5856others.  Every graphics parameter has a name (such as `@code{col}',
5857which controls colors,) and a value (a color number, for example.)
5858
5859A separate list of graphics parameters is maintained for each active
5860device, and each device has a default set of parameters when
5861initialized.  Graphics parameters can be set in two ways: either
5862permanently, affecting all graphics functions which access the current
5863device; or temporarily, affecting only a single graphics function call.
5864
5865@menu
5866* The par() function::
5867* Arguments to graphics functions::
5868@end menu
5869
5870@node The par() function, Arguments to graphics functions, Using graphics parameters, Using graphics parameters
5871@subsection Permanent changes: The @code{par()} function
5872@findex par
5873@cindex Graphics parameters
5874
5875The @code{par()} function is used to access and modify the list of
5876graphics parameters for the current graphics device.
5877
5878@table @code
5879@item par()
5880Without arguments, returns a list of all graphics parameters and their
5881values for the current device.
5882@item par(c("col", "lty"))
5883With a character vector argument, returns only the named graphics
5884parameters (again, as a list.)
5885@item par(col=4, lty=2)
5886With named arguments (or a single list argument), sets the values of
5887the named graphics parameters, and returns the original values of the
5888parameters as a list.
5889@end table
5890
5891Setting graphics parameters with the @code{par()} function changes the
5892value of the parameters @emph{permanently}, in the sense that all future
5893calls to graphics functions (on the current device) will be affected by
5894the new value.  You can think of setting graphics parameters in this way
5895as setting ``default'' values for the parameters, which will be used by
5896all graphics functions unless an alternative value is given.
5897
5898Note that calls to @code{par()} @emph{always} affect the global values
5899of graphics parameters, even when @code{par()} is called from within a
5900function.  This is often undesirable behavior---usually we want to set
5901some graphics parameters, do some plotting, and then restore the
5902original values so as not to affect the user's @R{} session.  You can
5903restore the initial values by saving the result of @code{par()} when
5904making changes, and restoring the initial values when plotting is
5905complete.
5906
5907@example
5908> oldpar <- par(col=4, lty=2)
5909  @r{@dots{} plotting commands @dots{}}
5910> par(oldpar)
5911@end example
5912
5913@noindent
5914To save and restore @emph{all} settable@footnote{Some graphics
5915parameters such as the size of the current device are for information
5916only.} graphical parameters use
5917
5918@example
5919> oldpar <- par(no.readonly=TRUE)
5920  @r{@dots{} plotting commands @dots{}}
5921> par(oldpar)
5922@end example
5923
5924
5925@node Arguments to graphics functions,  , The par() function, Using graphics parameters
5926@subsection Temporary changes: Arguments to graphics functions
5927
5928Graphics parameters may also be passed to (almost) any graphics function
5929as named arguments.  This has the same effect as passing the arguments
5930to the @code{par()} function, except that the changes only last for the
5931duration of the function call.  For example:
5932
5933@example
5934> plot(x, y, pch="+")
5935@end example
5936
5937@noindent
5938produces a scatterplot using a plus sign as the plotting character,
5939without changing the default plotting character for future plots.
5940
5941Unfortunately, this is not implemented entirely consistently and it is
5942sometimes necessary to set and reset graphics parameters using
5943@code{par()}.
5944
5945
5946@node Graphics parameters, Device drivers, Using graphics parameters, Graphics
5947@section Graphics parameters list
5948
5949The following sections detail many of the commonly-used graphical
5950parameters.  The @R{} help documentation for the @code{par()} function
5951provides a more concise summary; this is provided as a somewhat more
5952detailed alternative.
5953
5954Graphics parameters will be presented in the following form:
5955
5956@table @code
5957@item @var{name}=@var{value}
5958A description of the parameter's effect.  @var{name} is the name of the
5959parameter, that is, the argument name to use in calls to @code{par()} or
5960a graphics function.  @var{value} is a typical value you might use when
5961setting the parameter.
5962@end table
5963
5964Note that @code{axes} is @strong{not} a graphics parameter but an
5965argument to a few @code{plot} methods: see @code{xaxt} and @code{yaxt}.
5966
5967@menu
5968* Graphical elements::
5969* Axes and tick marks::
5970* Figure margins::
5971* Multiple figure environment::
5972@end menu
5973
5974@node Graphical elements, Axes and tick marks, Graphics parameters, Graphics parameters
5975@subsection Graphical elements
5976
5977@R{} plots are made up of points, lines, text and polygons (filled
5978regions.) Graphical parameters exist which control how these
5979@emph{graphical elements} are drawn, as follows:
5980
5981@table @code
5982@item pch="+"
5983Character to be used for plotting points.  The default varies with
5984graphics drivers, but it is usually
5985@ifnottex
5986a circle.
5987@end ifnottex
5988@tex
5989`$\circ$'.
5990@end tex
5991Plotted points tend to appear slightly above or below the appropriate
5992position unless you use @code{"."} as the plotting character, which
5993produces centered points.
5994
5995@item pch=4
5996When @code{pch} is given as an integer between 0 and 25 inclusive, a
5997specialized plotting symbol is produced.  To see what the symbols are,
5998use the command
5999
6000@example
6001> legend(locator(1), as.character(0:25), pch = 0:25)
6002@end example
6003
6004@noindent
6005Those from 21 to 25 may appear to duplicate earlier symbols, but can be
6006coloured in different ways: see the help on @code{points} and its
6007examples.
6008
6009In addition, @code{pch} can be a character or a number in the range
6010@code{32:255} representing a character in the current font.
6011
6012@item lty=2
6013Line types.  Alternative line styles are not supported on all graphics
6014devices (and vary on those that do) but line type 1 is always a solid
6015line, line type 0 is always invisible, and line types 2 and onwards are
6016dotted or dashed lines, or some combination of both.
6017
6018@item lwd=2
6019Line widths.  Desired width of lines, in multiples of the ``standard''
6020line width.  Affects axis lines as well as lines drawn with
6021@code{lines()}, etc.  Not all devices support this, and some have
6022restrictions on the widths that can be used.
6023
6024@item col=2
6025Colors to be used for points, lines, text, filled regions and images.
6026A number from the current palette (see @code{?palette}) or a named colour.
6027
6028@item col.axis
6029@itemx col.lab
6030@itemx col.main
6031@itemx col.sub
6032The color to be used for axis annotation, @math{x} and @math{y} labels,
6033main and sub-titles, respectively.
6034
6035@item font=2
6036An integer which specifies which font to use for text.  If possible,
6037device drivers arrange so that @code{1} corresponds to plain text,
6038@code{2} to bold face, @code{3} to italic, @code{4} to bold italic
6039and @code{5} to a symbol font (which include Greek letters).
6040
6041@item font.axis
6042@itemx font.lab
6043@itemx font.main
6044@itemx font.sub
6045The font to be used for axis annotation, @math{x} and @math{y} labels,
6046main and sub-titles, respectively.
6047
6048@item adj=-0.1
6049Justification of text relative to the plotting position.  @code{0} means
6050left justify, @code{1} means right justify and @code{0.5} means to
6051center horizontally about the plotting position.  The actual value is
6052the proportion of text that appears to the left of the plotting
6053position, so a value of @code{-0.1} leaves a gap of 10% of the text width
6054between the text and the plotting position.
6055
6056@item cex=1.5
6057Character expansion.  The value is the desired size of text characters
6058(including plotting characters) relative to the default text size.
6059
6060@item cex.axis
6061@itemx cex.lab
6062@itemx cex.main
6063@itemx cex.sub
6064The character expansion to be used for axis annotation, @math{x} and
6065@math{y} labels, main and sub-titles, respectively.
6066@end table
6067
6068@node Axes and tick marks, Figure margins, Graphical elements, Graphics parameters
6069@subsection Axes and tick marks
6070
6071Many of @R{}'s high-level plots have axes, and you can construct axes
6072yourself with the low-level @code{axis()} graphics function.  Axes have
6073three main components: the @emph{axis line} (line style controlled by the
6074@code{lty} graphics parameter), the @emph{tick marks} (which mark off unit
6075divisions along the axis line) and the @emph{tick labels} (which mark the
6076units.) These components can be customized with the following graphics
6077parameters.
6078
6079@table @code
6080@item lab=c(5, 7, 12)
6081The first two numbers are the desired number of tick intervals on the
6082@math{x} and @math{y} axes respectively.  The third number is the
6083desired length of axis labels, in characters (including the decimal
6084point.)  Choosing a too-small value for this parameter may result in all
6085tick labels being rounded to the same number!
6086
6087@item las=1
6088Orientation of axis labels.  @code{0} means always parallel to axis,
6089@code{1} means always horizontal, and @code{2} means always
6090perpendicular to the axis.
6091
6092@item mgp=c(3, 1, 0)
6093Positions of axis components.  The first component is the distance from
6094the axis label to the axis position, in text lines.  The second
6095component is the distance to the tick labels, and the final component is
6096the distance from the axis position to the axis line (usually zero).
6097Positive numbers measure outside the plot region, negative numbers
6098inside.
6099
6100@item tck=0.01
6101Length of tick marks, as a fraction of the size of the plotting region.
6102When @code{tck} is small (less than 0.5) the tick marks on the @math{x}
6103and @math{y} axes are forced to be the same size.  A value of 1 gives
6104grid lines.  Negative values give tick marks outside the plotting
6105region.  Use @code{tck=0.01} and @code{mgp=c(1,-1.5,0)} for internal
6106tick marks.
6107
6108@item xaxs="r"
6109@itemx yaxs="i"
6110Axis styles for the @math{x} and @math{y} axes, respectively.   With
6111styles @code{"i"} (internal) and @code{"r"} (the default) tick marks
6112always fall within the range of the data, however style @code{"r"}
6113leaves a small amount of space at the edges.  (@Sl{} has other styles
6114not implemented in @R{}.)
6115
6116@c Setting this parameter to @code{"d"} (direct axis) @emph{locks in} the
6117@c current axis and uses it for all future plots (or until the parameter is
6118@c set to one of the other values above, at least.) Useful for generating
6119@c series of fixed-scale plots.
6120@end table
6121
6122@node Figure margins, Multiple figure environment, Axes and tick marks, Graphics parameters
6123@subsection Figure margins
6124
6125
6126A single plot in @R{} is known as a @code{figure} and comprises a
6127@emph{plot region} surrounded by margins (possibly containing axis
6128labels, titles, etc.) and (usually) bounded by the axes themselves.
6129
6130@ifnotinfo
6131A typical figure is
6132
6133@image{images/fig11,7cm}
6134@end ifnotinfo
6135
6136Graphics parameters controlling figure layout include:
6137
6138@table @code
6139@item mai=c(1, 0.5, 0.5, 0)
6140Widths of the bottom, left, top and right margins, respectively,
6141measured in inches.
6142
6143@item mar=c(4, 2, 2, 1)
6144Similar to @code{mai}, except the measurement unit is text lines.
6145@end table
6146
6147@code{mar} and @code{mai} are equivalent in the sense that setting one
6148changes the value of the other.  The default values chosen for this
6149parameter are often too large; the right-hand margin is rarely needed,
6150and neither is the top margin if no title is being used.  The bottom and
6151left margins must be large enough to accommodate the axis and tick
6152labels.  Furthermore, the default is chosen without regard to the size
6153of the device surface: for example, using the @code{postscript()} driver
6154with the @code{height=4} argument will result in a plot which is about
615550% margin unless @code{mar} or @code{mai} are set explicitly.  When
6156multiple figures are in use (see below) the margins are reduced, however
6157this may not be enough when many figures share the same page.
6158
6159@node Multiple figure environment,  , Figure margins, Graphics parameters
6160@subsection Multiple figure environment
6161
6162@R{} allows you to create an @math{n} by @math{m} array of figures on a
6163single page.  Each figure has its own margins, and the array of figures
6164is optionally surrounded by an @emph{outer margin}, as shown in the
6165following figure.
6166
6167@ifnotinfo
6168@image{images/fig12,6cm}
6169@end ifnotinfo
6170
6171The graphical parameters relating to multiple figures are as follows:
6172
6173@table @code
6174@item mfcol=c(3, 2)
6175@itemx mfrow=c(2, 4)
6176Set the size of a multiple figure array.  The first value is the number of
6177rows; the second is the number of columns.  The only difference between
6178these two parameters is that setting @code{mfcol} causes figures to be
6179filled by column; @code{mfrow} fills by rows.
6180
6181The layout in the Figure could have been created by setting
6182@code{mfrow=c(3,2)}; the figure shows the page after four plots have
6183been drawn.
6184
6185Setting either of these can reduce the base size of symbols and text
6186(controlled by @code{par("cex")} and the pointsize of the device).  In a
6187layout with exactly two rows and columns the base size is reduced by a
6188factor of 0.83: if there are three or more of either rows or columns,
6189the reduction factor is 0.66.
6190
6191@item mfg=c(2, 2, 3, 2)
6192Position of the current figure in a multiple figure environment.  The first
6193two numbers are the row and column of the current figure; the last two
6194are the number of rows and columns in the multiple figure array.  Set
6195this parameter to jump between figures in the array.  You can even use
6196different values for the last two numbers than the @emph{true} values
6197for unequally-sized figures on the same page.
6198
6199@item fig=c(4, 9, 1, 4)/10
6200Position of the current figure on the page.  Values are the positions of
6201the left, right, bottom and top edges respectively, as a percentage of
6202the page measured from the bottom left corner.  The example value would
6203be for a figure in the bottom right of the page.  Set this parameter for
6204arbitrary positioning of figures within a page.  If you want to add a
6205figure to a current page, use @code{new=TRUE} as well (unlike S).
6206
6207@item oma=c(2, 0, 3, 0)
6208@itemx omi=c(0, 0, 0.8, 0)
6209Size of outer margins.  Like @code{mar} and @code{mai}, the first
6210measures in text lines and the second in inches, starting with the
6211bottom margin and working clockwise.
6212
6213@end table
6214
6215Outer margins are particularly useful for page-wise titles, etc.  Text
6216can be added to the outer margins with the @code{mtext()} function with
6217argument @code{outer=TRUE}.  There are no outer margins by default,
6218however, so you must create them explicitly using @code{oma} or
6219@code{omi}.
6220
6221More complicated arrangements of multiple figures can be produced by the
6222@code{split.screen()} and @code{layout()} functions, as well as by the
6223@pkg{grid} and @CRANpkg{lattice} packages.
6224
6225@node Device drivers, Dynamic graphics, Graphics parameters, Graphics
6226@section Device drivers
6227@cindex Graphics device drivers
6228
6229@R{} can generate graphics (of varying levels of quality) on almost any
6230type of display or printing device.  Before this can begin, however,
6231@R{} needs to be informed what type of device it is dealing with.  This
6232is done by starting a @emph{device driver}.  The purpose of a device
6233driver is to convert graphical instructions from @R{} (``draw a line,''
6234for example) into a form that the particular device can understand.
6235
6236Device drivers are started by calling a device driver function.  There
6237is one such function for every device driver: type @code{help(Devices)}
6238for a list of them all.  For example, issuing the command
6239
6240@example
6241> postscript()
6242@end example
6243
6244@noindent
6245causes all future graphics output to be sent to the printer in
6246PostScript format.  Some commonly-used device drivers are:
6247
6248@table @code
6249@item X11()
6250@findex X11
6251For use with the X11 window system on Unix-alikes
6252@item windows()
6253@findex windows
6254For use on Windows
6255@item quartz()
6256@findex quartz
6257For use on macOS
6258@item postscript()
6259@findex postscript
6260For printing on PostScript printers, or creating PostScript graphics
6261files.
6262@item pdf()
6263@findex pdf
6264Produces a PDF file, which can also be included into PDF files.
6265@item png()
6266@findex png
6267Produces a bitmap PNG file. (Not always available: see its help page.)
6268@item jpeg()
6269@findex jpeg
6270Produces a bitmap JPEG file, best used for @code{image} plots.
6271(Not always available: see its help page.)
6272@end table
6273
6274When you have finished with a device, be sure to terminate the device
6275driver by issuing the command
6276
6277@example
6278> dev.off()
6279@end example
6280
6281This ensures that the device finishes cleanly; for example in the case
6282of hardcopy devices this ensures that every page is completed and has
6283been sent to the printer.  (This will happen automatically at the normal
6284end of a session.)
6285
6286@menu
6287* PostScript diagrams for typeset documents::
6288* Multiple graphics devices::
6289@end menu
6290
6291@node PostScript diagrams for typeset documents, Multiple graphics devices, Device drivers, Device drivers
6292@subsection PostScript diagrams for typeset documents
6293
6294By passing the @code{file} argument to the @code{postscript()} device
6295driver function, you may store the graphics in PostScript format in a
6296file of your choice.  The plot will be in landscape orientation unless
6297the @code{horizontal=FALSE} argument is given, and you can control the
6298size of the graphic with the @code{width} and @code{height} arguments
6299(the plot will be scaled as appropriate to fit these dimensions.) For
6300example, the command
6301
6302@example
6303> postscript("file.ps", horizontal=FALSE, height=5, pointsize=10)
6304@end example
6305
6306@noindent
6307will produce a file containing PostScript code for a figure five inches
6308high, perhaps for inclusion in a document.  It is important to note that
6309if the file named in the command already exists, it will be overwritten.
6310This is the case even if the file was only created earlier in the same
6311@R{} session.
6312
6313Many usages of PostScript output will be to incorporate the figure in
6314another document.  This works best when @emph{encapsulated} PostScript
6315is produced: @R{} always produces conformant output, but only marks the
6316output as such when the @code{onefile=FALSE} argument is supplied.  This
6317unusual notation stems from @Sl{}-compatibility: it really means that
6318the output will be a single page (which is part of the EPSF
6319specification).  Thus to produce a plot for inclusion use something like
6320
6321@example
6322> postscript("plot1.eps", horizontal=FALSE, onefile=FALSE,
6323             height=8, width=6, pointsize=10)
6324@end example
6325
6326
6327@node Multiple graphics devices,  , PostScript diagrams for typeset documents, Device drivers
6328@subsection Multiple graphics devices
6329
6330In advanced use of @R{} it is often useful to have several graphics
6331devices in use at the same time.  Of course only one graphics device can
6332accept graphics commands at any one time, and this is known as the
6333@emph{current device}.  When multiple devices are open, they form a
6334numbered sequence with names giving the kind of device at any position.
6335
6336The main commands used for operating with multiple devices, and their
6337meanings are as follows:
6338
6339@table @code
6340@item X11()
6341[UNIX]
6342@item windows()
6343@itemx win.printer()
6344@itemx win.metafile()
6345[Windows]
6346@item quartz()
6347[macOS]
6348@item postscript()
6349@itemx pdf()
6350@item png()
6351@item jpeg()
6352@item tiff()
6353@item bitmap()
6354@itemx @dots{}
6355Each new call to a device driver function opens a new graphics device,
6356thus extending by one the device list.  This device becomes the current
6357device, to which graphics output will be sent.
6358
6359@item dev.list()
6360@findex dev.list
6361Returns the number and name of all active devices.  The device at
6362position 1 on the list is always the @emph{null device} which does not
6363accept graphics commands at all.
6364
6365@item dev.next()
6366@itemx dev.prev()
6367@findex dev.next
6368@findex dev.prev
6369Returns the number and name of the graphics device next to, or previous
6370to the current device, respectively.
6371
6372@item dev.set(which=@var{k})
6373@findex dev.set
6374Can be used to change the current graphics device to the one at position
6375@var{k} of the device list.  Returns the number and label of the device.
6376
6377@item dev.off(@var{k})
6378@findex dev.off
6379Terminate the graphics device at point @var{k} of the device list.  For
6380some devices, such as @code{postscript} devices, this will either print
6381the file immediately or correctly complete the file for later printing,
6382depending on how the device was initiated.
6383
6384@item dev.copy(device, @dots{}, which=@var{k})
6385@itemx dev.print(device, @dots{}, which=@var{k})
6386Make a copy of the device @var{k}.  Here @code{device} is a device
6387function, such as @code{postscript}, with extra arguments, if needed,
6388specified by @samp{@dots{}}.  @code{dev.print} is similar, but the
6389copied device is immediately closed, so that end actions, such as
6390printing hardcopies, are immediately performed.
6391
6392@item graphics.off()
6393Terminate all graphics devices on the list, except the null device.
6394@end table
6395
6396@node Dynamic graphics,  , Device drivers, Graphics
6397@section Dynamic graphics
6398@cindex Dynamic graphics
6399
6400@R{} does not have builtin capabilities for dynamic or
6401interactive graphics, e.g.@: rotating point clouds or to ``brushing''
6402(interactively highlighting) points. However, extensive dynamic graphics
6403facilities are available in the system GGobi by Swayne, Cook and Buja
6404available from
6405
6406@quotation
6407@uref{http://ggobi.org/}
6408@end quotation
6409
6410@noindent
6411and these can be accessed from @R{} via the package @CRANpkg{rggobi}, described at
6412@uref{http://ggobi.org/rggobi.html}.
6413
6414Also, package @CRANpkg{rgl} provides ways to interact with 3D plots, for example
6415of surfaces.
6416
6417@node Packages, OS facilities, Graphics, Top
6418@chapter Packages
6419@cindex Packages
6420
6421All @R{} functions and datasets are stored in @emph{packages}.  Only
6422when a package is loaded are its contents available.  This is done both
6423for efficiency (the full list would take more memory and would take
6424longer to search than a subset), and to aid package developers, who are
6425protected from name clashes with other code.  The process of developing
6426packages is described in @ref{Creating R packages, , Creating R
6427packages, R-exts, Writing R Extensions}.  Here, we will describe them
6428from a user's point of view.
6429
6430To see which packages are installed at your site, issue the command
6431
6432@example
6433> library()
6434@end example
6435
6436@noindent
6437with no arguments.  To load a particular package (e.g., the @CRANpkg{boot}
6438package containing functions from Davison & Hinkley (1997)), use a
6439command like
6440
6441@example
6442> library(boot)
6443@end example
6444
6445Users connected to the Internet can use the @code{install.packages()}
6446and @code{update.packages()} functions (available through the
6447@code{Packages} menu in the Windows and macOS GUIs, @pxref{Installing
6448packages, , , R-admin, R Installation and Administration}) to install
6449and update packages.
6450
6451To see which packages are currently loaded, use
6452
6453@example
6454> search()
6455@end example
6456
6457@noindent
6458to display the search list.  Some packages may be loaded but not
6459available on the search list (@pxref{Namespaces}): these will be
6460included in the list given by
6461
6462@example
6463> loadedNamespaces()
6464@end example
6465
6466
6467To see a list of all available help topics in an installed package,
6468use
6469
6470@example
6471> help.start()
6472@end example
6473
6474@noindent
6475to start the @HTML{} help system, and then navigate to the package
6476listing in the @code{Reference} section.
6477
6478@menu
6479* Standard packages::
6480* Contributed packages and CRAN::
6481* Namespaces::
6482@end menu
6483
6484@node Standard packages, Contributed packages and CRAN, Packages, Packages
6485@section Standard packages
6486
6487The standard (or @emph{base}) packages are considered part of the @R{}
6488source code.  They contain the basic functions that allow @R{} to work,
6489and the datasets and standard statistical and graphical functions that
6490are described in this manual.  They should be automatically available in
6491any @R{} installation.  @xref{Which add-on packages exist for R?, , R
6492packages, R-FAQ, R FAQ}, for a complete list.
6493
6494@node Contributed packages and CRAN, Namespaces, Standard packages, Packages
6495@section Contributed packages and @acronym{CRAN}
6496@cindex CRAN
6497
6498There are thousands of contributed packages for @R{}, written by many
6499different authors.  Some of these packages implement specialized
6500statistical methods, others give access to data or hardware, and others
6501are designed to complement textbooks.  Some (the @emph{recommended}
6502packages) are distributed with every binary distribution of @R{}.  Most
6503are available for download from @acronym{CRAN}
6504(@uref{https://CRAN.R-project.org/} and its mirrors) and other
6505repositories such as Bioconductor (@uref{https://www.bioconductor.org/}).
6506@c and Omegahat (@uref{http://www.omegahat.net/}).
6507The @emph{R FAQ}
6508contains a list of CRAN packages current at the time of release, but the
6509collection of available packages changes very frequently.
6510
6511@node Namespaces,  , Contributed packages and CRAN, Packages
6512@section Namespaces
6513@cindex Namespace
6514@findex ::
6515@findex :::
6516
6517Packages have @emph{namespaces}, which do three things: they allow the
6518package writer to hide functions and data that are meant only for
6519internal use, they prevent functions from breaking when a user (or other
6520package writer) picks a name that clashes with one in the package, and
6521they provide a way to refer to an object within a particular package.
6522
6523For example, @code{t()} is the transpose function in @R{}, but users
6524might define their own function named @code{t}.  Namespaces prevent
6525the user's definition from taking precedence, and breaking every
6526function that tries to transpose a matrix.
6527
6528There are two operators that work with namespaces.  The double-colon
6529operator @code{::} selects definitions from a particular namespace.
6530In the example above, the transpose function will always be available
6531as @code{base::t}, because it is defined in the @code{base} package.
6532Only functions that are exported from the package can be retrieved in
6533this way.
6534
6535The triple-colon operator @code{:::} may be seen in a few places in R
6536code: it acts like the double-colon operator but also allows access to
6537hidden objects.  Users are more likely to use the @code{getAnywhere()}
6538function, which searches multiple packages.
6539
6540Packages are often inter-dependent, and loading one may cause others to
6541be automatically loaded.  The colon operators described above will also
6542cause automatic loading of the associated package.  When packages with
6543namespaces are loaded automatically they are not added to the search
6544list.
6545
6546@node OS facilities, A sample session, Packages, Top
6547@chapter OS facilities
6548
6549@R{} has quite extensive facilities to access the OS under which it is
6550running: this allows it to be used as a scripting language and that
6551ability is much used by @R{} itself, for example to install packages.
6552
6553Because @R{}'s own scripts need to work across all platforms,
6554considerable effort has gone into make the scripting facilities as
6555platform-independent as is feasible.
6556
6557@menu
6558* Files and directories::
6559* Filepaths::
6560* System commands::
6561* Compression and Archives::
6562@end menu
6563
6564@node Files and directories, Filepaths, OS facilities, OS facilities
6565@section Files and directories
6566
6567There are many functions to manipulate files and directories. Here are
6568pointers to some of the more commonly used ones.
6569
6570To create an (empty) file or directory, use @code{file.create} or
6571@code{dir.create}.  (These are the analogues of the POSIX utilities
6572@command{touch} and @command{mkdir}.)  For temporary files and
6573directories in the @R{} session directory see @code{tempfile}.
6574
6575Files can be removed by either @code{file.remove} or @code{unlink}: the
6576latter can remove directory trees.
6577
6578For directory listings use @code{list.files} (also available as
6579@code{dir}) or @code{list.dirs}. These can select files using a regular
6580expression: to select by wildcards use @code{Sys.glob}.
6581
6582Many types of information on a filepath (including for example if it is
6583a file or directory) can be found by @code{file.info}.
6584
6585There are several ways to find out if a file `exists' (a file can
6586exist on the filesystem and not be visible to the current user).
6587There are functions @code{file.exists}, @code{file.access} and
6588@code{file_test} with various versions of this test: @code{file_test} is
6589a version of the POSIX @command{test} command for those familiar with
6590shell scripting.
6591
6592Function @code{file.copy} is the @R{} analogue of the POSIX command
6593@command{cp}.
6594
6595Choosing files can be done interactively by @code{file.choose}: the
6596Windows port has the more versatile functions @code{choose.files} and
6597@code{choose.dir} and there are similar functions in the @pkg{tcltk}
6598package: @code{tk_choose.files} and @code{tk_choose.dir}.
6599
6600Functions @code{file.show} and @code{file.edit} will display and edit
6601one or more files in a way appropriate to the @R{} port, using the
6602facilities of a console (such as RGui on Windows or R.app on macOS) if
6603one is in use.
6604
6605There is some support for @emph{links} in the filesystem: see functions
6606@code{file.link} and @code{Sys.readlink}.
6607
6608
6609@node Filepaths, System commands, Files and directories, OS facilities
6610@section Filepaths
6611
6612With a few exceptions, @R{} relies on the underlying OS functions to
6613manipulate filepaths.  Some aspects of this are allowed to depend on the
6614OS, and do, even down to the version of the OS.  There are POSIX
6615standards for how OSes should interpret filepaths and many @R{} users
6616assume POSIX compliance: but Windows does not claim to be compliant and
6617other OSes may be less than completely compliant.
6618
6619The following are some issues which have been encountered with filepaths.
6620
6621@itemize @bullet
6622@item
6623POSIX filesystems are case-sensitive, so @file{foo.png} and
6624@file{Foo.PNG} are different files.  However, the defaults on Windows
6625and macOS are to be case-insensitive, and FAT filesystems (commonly used
6626on removable storage) are not normally case-sensitive (and all filepaths
6627may be mapped to lower case).
6628
6629@item
6630Almost all the Windows' OS services support the use of slash or
6631backslash as the filepath separator, and @R{} converts the known
6632exceptions to the form required by Windows.
6633
6634@item
6635The behaviour of filepaths with a trailing slash is OS-dependent.  Such
6636paths are not valid on Windows and should not be expected to work.
6637POSIX-2008 requires such paths to match only directories, but earlier
6638versions allowed them to also match files.  So they are best avoided.
6639
6640@item
6641Multiple slashes in filepaths such as @file{/abc//def} are valid on
6642POSIX filesystems and treated as if there was only one slash.  They are
6643@emph{usually} accepted by Windows' OS functions.  However, leading
6644double slashes may have a different meaning.
6645
6646@item
6647Windows' UNC filepaths (such as @file{\\server\dir1\dir2\file} and
6648@file{\\?\UNC\server\dir1\dir2\file}) are not supported, but they may
6649work in some @R{} functions.  POSIX filesystems are allowed to treat a
6650leading double slash specially.
6651
6652@item
6653Windows allows filepaths containing drives and relative to the current
6654directory on a drive, e.g.@: @file{d:foo/bar} refers to
6655@file{d:/a/b/c/foo/bar} if the current directory @emph{on drive
6656@file{d:}} is @file{/a/b/c}.  It is intended that these work, but the
6657use of absolute paths is safer.
6658@end itemize
6659
6660Functions @code{basename} and @code{dirname} select parts of a file
6661path: the recommended way to assemble a file path from components is
6662@code{file.path}.  Function @code{pathexpand} does `tilde expansion',
6663substituting values for home directories (the current user's, and
6664perhaps those of other users).
6665
6666On filesystems with links, a single file can be referred to by many
6667filepaths.  Function @code{normalizePath} will find a canonical
6668filepath.
6669
6670Windows has the concepts of short (`8.3') and long file names:
6671@code{normalizePath} will return an absolute path using long file names
6672and @code{shortPathName} will return a version using short names.  The
6673latter does not contain spaces and uses backslash as the separator, so
6674is sometimes useful for exporting names from @R{}.
6675
6676File @emph{permissions} are a related topic.  @R{} has support for the
6677POSIX concepts of read/write/execute permission for owner/group/all but
6678this may be only partially supported on the filesystem, so for example
6679on Windows only read-only files (for the account running the @R{}
6680session) are recognized.  Access Control Lists (ACLs) are employed on
6681several filesystems, but do not have an agreed standard and @R{} has no
6682facilities to control them.  Use @code{Sys.chmod} to change permissions.
6683
6684@node System commands, Compression and Archives, Filepaths, OS facilities
6685@section System commands
6686
6687Functions @code{system} and @code{system2} are used to invoke a system
6688command and optionally collect its output.  @code{system2} is a little
6689more general but its main advantage is that it is easier to write
6690cross-platform code using it.
6691
6692@code{system} behaves differently on Windows from other OSes (because
6693the API C call of that name does).  Elsewhere it invokes a shell to run
6694the command: the Windows port of @R{} has a function @code{shell} to do
6695that.
6696
6697To find out if the OS includes a command, use @code{Sys.which}, which
6698attempts to do this in a cross-platform way (unfortunately it is not a
6699standard OS service).
6700
6701Function @code{shQuote} will quote filepaths as needed for commands in
6702the current OS.
6703
6704@node Compression and Archives,  , System commands, OS facilities
6705@section Compression and Archives
6706
6707Recent versions of @R{} have extensive facilities to read and write
6708compressed files, often transparently.  Reading of files in @R{} is to a
6709vey large extent done by @emph{connections}, and the @code{file}
6710function which is used to open a connection to a file (or a URL) and is
6711able to identify the compression used from the `magic' header of the
6712file.
6713
6714The type of compression which has been supported for longest is
6715@command{gzip} compression, and that remains a good general compromise.
6716Files compressed by the earlier Unix @command{compress} utility can also
6717be read, but these are becoming rare.  Two other forms of compression,
6718those of the @command{bzip2} and @command{xz} utilities are also
6719available.  These generally achieve higher rates of compression
6720(depending on the file, much higher) at the expense of slower
6721decompression and much slower compression.
6722
6723There is some confusion between @command{xz} and @command{lzma}
6724compression (see @uref{https://en.wikipedia.org/wiki/Xz} and
6725@uref{https://en.wikipedia.org/wiki/LZMA}): @R{} can read files
6726compressed by most versions of either.
6727
6728File archives are single files which contain a collection of files, the
6729most common ones being `tarballs' and zip files as used to distribute
6730@R{} packages.  @R{} can list and unpack both (see functions @code{untar}
6731and @code{unzip}) and create both (for @command{zip} with the help of an
6732external program).
6733
6734@node A sample session, Invoking R, OS facilities, Top
6735@appendix A sample session
6736
6737The following session is intended to introduce to you some features of
6738the @R{} environment by using them.  Many features of the system will be
6739unfamiliar and puzzling at first, but this puzzlement will soon
6740disappear.
6741@c This is written for the UNIX user.  Those using Windows will
6742@c need to adapt the discussion slightly.
6743
6744@table @code
6745@item Start @R{} appropriately for your platform (@pxref{Invoking R}).
6746
6747The @R{} program begins, with a banner.
6748
6749(Within @R{} code, the prompt on the left hand side will not be shown to
6750avoid confusion.)
6751
6752@item help.start()
6753Start the @HTML{} interface to on-line help (using a web browser
6754available at your machine).  You should briefly explore the features of
6755this facility with the mouse.
6756
6757Iconify the help window and move on to the next part.
6758
6759@item x <- rnorm(50)
6760@itemx y <- rnorm(x)
6761Generate two pseudo-random normal vectors of @math{x}- and
6762@math{y}-coordinates.
6763
6764@item plot(x, y)
6765Plot the points in the plane.  A graphics window will appear automatically.
6766
6767@item ls()
6768See which @R{} objects are now in the @R{} workspace.
6769
6770@item rm(x, y)
6771Remove objects no longer needed. (Clean up).
6772
6773@item x <- 1:20
6774Make @math{x = (1, 2, @dots{}, 20)}.
6775
6776@item w <- 1 + sqrt(x)/2
6777A `weight' vector of standard deviations.
6778
6779@item dummy <- data.frame(x=x, y= x + rnorm(x)*w)
6780@itemx dummy
6781Make a @emph{data frame} of two columns, @math{x} and @math{y}, and look
6782at it.
6783
6784@item fm <- lm(y ~ x, data=dummy)
6785@itemx summary(fm)
6786Fit a simple linear regression and look at the
6787analysis.  With @code{y} to the left of the tilde,
6788we are modelling @math{y} dependent on @math{x}.
6789
6790@item fm1 <- lm(y ~ x, data=dummy, weight=1/w^2)
6791@itemx summary(fm1)
6792Since we know the standard deviations, we can do a weighted regression.
6793
6794@item attach(dummy)
6795Make the columns in the data frame visible as variables.
6796
6797@item lrf <- lowess(x, y)
6798Make a nonparametric local regression function.
6799
6800@item plot(x, y)
6801Standard point plot.
6802
6803@item lines(x, lrf$y)
6804Add in the local regression.
6805
6806@item abline(0, 1, lty=3)
6807The true regression line: (intercept 0, slope 1).
6808
6809@item abline(coef(fm))
6810Unweighted regression line.
6811
6812@item abline(coef(fm1), col = "red")
6813Weighted regression line.
6814
6815@item detach()
6816Remove data frame from the search path.
6817
6818@item plot(fitted(fm), resid(fm),
6819@itemx @w{@ @ @ @ @ xlab="Fitted values"},
6820@itemx @w{@ @ @ @ @ ylab="Residuals"},
6821@itemx @w{@ @ @ @ @ main="Residuals vs Fitted")}
6822A standard regression diagnostic plot to check for heteroscedasticity.
6823Can you see it?
6824
6825@item qqnorm(resid(fm), main="Residuals Rankit Plot")
6826A normal scores plot to check for skewness, kurtosis and outliers.  (Not
6827very useful here.)
6828
6829@item rm(fm, fm1, lrf, x, dummy)
6830Clean up again.
6831@end table
6832
6833The next section will look at data from the classical experiment of
6834Michelson to measure the speed of light.  This dataset is available in
6835the @code{morley} object, but we will read it to illustrate the
6836@code{read.table} function.
6837
6838@table @code
6839@item filepath <- system.file("data", "morley.tab" , package="datasets")
6840@itemx filepath
6841Get the path to the data file.
6842
6843@item file.show(filepath)
6844Optional.  Look at the file.
6845
6846@item mm <- read.table(filepath)
6847@itemx mm
6848Read in the Michelson data as a data frame, and look at it.
6849There are five experiments (column @code{Expt}) and each has 20 runs
6850(column @code{Run}) and @code{sl} is the recorded speed of light,
6851suitably coded.
6852
6853@item mm$Expt <- factor(mm$Expt)
6854@itemx mm$Run <- factor(mm$Run)
6855Change @code{Expt} and @code{Run} into factors.
6856
6857@item attach(mm)
6858Make the data frame visible at position 2 (the default).
6859
6860@item plot(Expt, Speed, main="Speed of Light Data", xlab="Experiment No.")
6861Compare the five experiments with simple boxplots.
6862
6863@item fm <- aov(Speed ~ Run + Expt, data=mm)
6864@itemx summary(fm)
6865Analyze as a randomized block, with `runs' and `experiments' as factors.
6866
6867@item fm0 <- update(fm, . ~ . - Run)
6868@itemx anova(fm0, fm)
6869Fit the sub-model omitting `runs', and compare using a formal analysis
6870of variance.
6871
6872@item detach()
6873@itemx rm(fm, fm0)
6874Clean up before moving on.
6875
6876@end table
6877
6878We now look at some more graphical features: contour and image plots.
6879
6880@table @code
6881@item x <- seq(-pi, pi, len=50)
6882@itemx y <- x
6883@math{x} is a vector of 50 equally spaced values in
6884@ifnottex
6885the interval [-pi\, pi].
6886@end ifnottex
6887@iftex
6888@tex
6889$-\pi\leq x \leq \pi$.
6890@end tex
6891@end iftex
6892@math{y} is the same.
6893
6894@item f <- outer(x, y, function(x, y) cos(y)/(1 + x^2))
6895@math{f} is a square matrix, with rows and columns indexed by @math{x}
6896and @math{y} respectively, of values of the function
6897@eqn{\cos(y)/(1 + x^2),cos(y)/(1 + x^2)}.
6898
6899@item oldpar <- par(no.readonly = TRUE)
6900@itemx par(pty="s")
6901Save the plotting parameters and set the plotting region to ``square''.
6902
6903@item contour(x, y, f)
6904@itemx contour(x, y, f, nlevels=15, add=TRUE)
6905Make a contour map of @math{f}; add in more lines for more detail.
6906
6907@item fa <- (f-t(f))/2
6908@code{fa} is the ``asymmetric part'' of @math{f}.  (@code{t()} is
6909transpose).
6910
6911@item contour(x, y, fa, nlevels=15)
6912Make a contour plot, @dots{}
6913
6914@item par(oldpar)
6915@dots{} and restore the old graphics parameters.
6916
6917@item image(x, y, f)
6918@itemx image(x, y, fa)
6919Make some high density image plots, (of which you can get
6920hardcopies if you wish), @dots{}
6921
6922@item objects(); rm(x, y, f, fa)
6923@dots{} and clean up before moving on.
6924@end table
6925
6926@R{} can do complex arithmetic, also.
6927
6928@table @code
6929@item th <- seq(-pi, pi, len=100)
6930@itemx z <- exp(1i*th)
6931@code{1i} is used for the complex number @math{i}.
6932
6933@item par(pty="s")
6934@itemx plot(z, type="l")
6935Plotting complex arguments means plot imaginary versus real parts.  This
6936should be a circle.
6937
6938@item w <- rnorm(100) + rnorm(100)*1i
6939Suppose we want to sample points within the unit circle.  One method
6940would be to take complex numbers with standard normal real and imaginary
6941parts @dots{}
6942
6943@item w <- ifelse(Mod(w) > 1, 1/w, w)
6944@dots{} and to map any outside the circle onto their reciprocal.
6945
6946@item plot(w, xlim=c(-1,1), ylim=c(-1,1), pch="+",xlab="x", ylab="y")
6947@itemx lines(z)
6948All points are inside the unit circle, but the distribution is not
6949uniform.
6950
6951@item w <- sqrt(runif(100))*exp(2*pi*runif(100)*1i)
6952@itemx plot(w, xlim=c(-1,1), ylim=c(-1,1), pch="+", xlab="x", ylab="y")
6953@itemx lines(z)
6954The second method uses the uniform distribution.  The points should now
6955look more evenly spaced over the disc.
6956
6957@item rm(th, w, z)
6958Clean up again.
6959
6960@item q()
6961Quit the @R{} program.  You will be asked if you want to save the @R{}
6962workspace, and for an exploratory session like this, you probably do not
6963want to save it.
6964@end table
6965
6966@node Invoking R, The command-line editor, A sample session, Top
6967@appendix Invoking R
6968
6969Users of @R{} on Windows or macOS should read the OS-specific section
6970first, but command-line use is also supported.
6971
6972@menu
6973* Invoking R from the command line::
6974* Invoking R under Windows::
6975* Invoking R under macOS::
6976* Scripting with R::
6977@end menu
6978
6979@node Invoking R from the command line, Invoking R under Windows, Invoking R, Invoking R
6980@appendixsec Invoking R from the command line
6981
6982When working at a command line on UNIX or Windows, the command @samp{R}
6983can be used both for starting the main @R{} program in the form
6984
6985@display
6986@code{R} [@var{options}] [@code{<}@var{infile}] [@code{>}@var{outfile}],
6987@end display
6988
6989@noindent
6990or, via the @code{R CMD} interface, as a wrapper to various @R{} tools
6991(e.g., for processing files in @R{} documentation format or manipulating
6992add-on packages) which are not intended to be called ``directly''.
6993
6994At the Windows command-line, @command{Rterm.exe} is preferred to
6995@command{R}.
6996
6997You need to ensure that either the environment variable @env{TMPDIR} is
6998unset or it points to a valid place to create temporary files and
6999directories.
7000
7001Most options control what happens at the beginning and at the end of an
7002@R{} session.  The startup mechanism is as follows (see also the on-line
7003help for topic @samp{Startup} for more information, and the section below
7004for some Windows-specific details).
7005
7006@itemize @bullet
7007@item
7008Unless @option{--no-environ} was given, @R{} searches for user and site
7009files to process for setting environment variables.  The name of the
7010site file is the one pointed to by the environment variable
7011@env{R_ENVIRON}; if this is unset, @file{@var{R_HOME}/etc/Renviron.site}
7012is used (if it exists).  The user file is the one pointed to by the
7013environment variable @env{R_ENVIRON_USER} if this is set; otherwise,
7014files @file{.Renviron} in the current or in the user's home directory
7015(in that order) are searched for.  These files should contain lines of
7016the form @samp{@var{name}=@var{value}}.  (See @code{help("Startup")} for
7017a precise description.)  Variables you might want to set include
7018@env{R_PAPERSIZE} (the default paper size), @env{R_PRINTCMD} (the
7019default print command) and @env{R_LIBS} (specifies the list of @R{}
7020library trees searched for add-on packages).
7021
7022@item
7023Then @R{} searches for the site-wide startup profile unless the command
7024line option @option{--no-site-file} was given.  The name of this file is
7025taken from the value of the @env{R_PROFILE} environment variable.  If
7026that variable is unset, the default
7027@file{@var{R_HOME}/etc/Rprofile.site} is used if this exists.
7028
7029@item
7030Then, unless @option{--no-init-file} was given, @R{} searches for a user
7031profile and sources it.  The name of this file is taken from the
7032environment variable @env{R_PROFILE_USER}; if unset, a file called
7033@file{.Rprofile} in the current directory or in the user's home
7034directory (in that order) is searched for.
7035
7036@item
7037It also loads a saved workspace from file @file{.RData} in the current
7038directory if there is one (unless @option{--no-restore} or
7039@option{--no-restore-data} was specified).
7040
7041@item
7042Finally, if a function @code{.First()} exists, it is executed.  This
7043function (as well as @code{.Last()} which is executed at the end of the
7044@R{} session) can be defined in the appropriate startup profiles, or
7045reside in @file{.RData}.
7046@end itemize
7047
7048In addition, there are options for controlling the memory available to
7049the @R{} process (see the on-line help for topic @samp{Memory} for more
7050information).  Users will not normally need to use these unless they
7051are trying to limit the amount of memory used by @R{}.
7052
7053@R{} accepts the following command-line options.
7054
7055@table @option
7056@item --help
7057@itemx -h
7058Print short help message to standard output and exit successfully.
7059
7060@item --version
7061Print version information to standard output and exit successfully.
7062
7063@item --encoding=@var{enc}
7064Specify the encoding to be assumed for input from the console or
7065@code{stdin}.  This needs to be an encoding known to @code{iconv}: see
7066its help page.  (@code{--encoding @var{enc}} is also accepted.)  The
7067input is re-encoded to the locale @R{} is running in and needs to be
7068representable in the latter's encoding (so e.g.@: you cannot re-encode
7069Greek text in a French locale unless that locale uses the UTF-8
7070encoding).
7071
7072@item RHOME
7073Print the path to the @R{} ``home directory'' to standard output and
7074exit successfully.  Apart from the front-end shell script and the man
7075page, @R{} installation puts everything (executables, packages, etc.)
7076into this directory.
7077
7078@item --save
7079@itemx --no-save
7080Control whether data sets should be saved or not at the end of the @R{}
7081session.  If neither is given in an interactive session, the user is
7082asked for the desired behavior when ending the session with @kbd{q()};
7083in non-interactive use one of these must be specified or implied by some
7084other option (see below).
7085
7086@item --no-environ
7087Do not read any user file to set environment variables.
7088
7089@item --no-site-file
7090Do not read the site-wide profile at startup.
7091
7092@item --no-init-file
7093Do not read the user's profile at startup.
7094
7095@item --restore
7096@itemx --no-restore
7097@itemx --no-restore-data
7098Control whether saved images (file @file{.RData} in the directory where
7099@R{} was started) should be restored at startup or not.  The default is
7100to restore. (@option{--no-restore} implies all the specific
7101@option{--no-restore-*} options.)
7102
7103@item --no-restore-history
7104Control whether the history file (normally file @file{.Rhistory} in the
7105directory where @R{} was started, but can be set by the environment
7106variable @env{R_HISTFILE}) should be restored at startup or not.  The
7107default is to restore.
7108
7109@item --no-Rconsole
7110(Windows only) Prevent loading the @file{Rconsole} file at startup.
7111
7112@item --vanilla
7113Combine @option{--no-save}, @option{--no-environ},
7114@option{--no-site-file}, @option{--no-init-file} and
7115@option{--no-restore}.  Under Windows, this also includes
7116@option{--no-Rconsole}.
7117
7118@item -f @var{file}
7119@itemx --file=@var{file}
7120(not @command{Rgui.exe}) Take input from @var{file}: @samp{-} means
7121@code{stdin}.  Implies @option{--no-save} unless @option{--save} has
7122been set.  On a Unix-alike, shell metacharacters should be avoided in
7123@var{file} (but spaces are allowed).
7124
7125@item -e @var{expression}
7126(not @command{Rgui.exe}) Use @var{expression} as an input line.  One or
7127more @option{-e} options can be used, but not together with @option{-f}
7128or @option{--file}.  Implies @option{--no-save} unless @option{--save}
7129has been set.  (There is a limit of 10,000 bytes on the total length of
7130expressions used in this way.  Expressions containing spaces or shell
7131metacharacters will need to be quoted.)
7132
7133@item --no-readline
7134(UNIX only) Turn off command-line editing via @strong{readline}.  This
7135is useful when running @R{} from within Emacs using the @acronym{ESS}
7136(``Emacs Speaks Statistics'') package.  @xref{The command-line editor},
7137for more information.  Command-line editing is enabled for default
7138interactive use (see @option{--interactive}).  This option also affects
7139tilde-expansion: see the help for @code{path.expand}.
7140
7141@item --min-vsize=@var{N}
7142@itemx --min-nsize=@var{N}
7143For expert use only: set the initial trigger sizes for garbage
7144collection of vector heap (in bytes) and @emph{cons cells} (number)
7145respectively.  Suffix @samp{M} specifies megabytes or millions of cells
7146respectively.  The defaults are 6Mb and 350k respectively and can also
7147be set by environment variables @env{R_NSIZE} and @env{R_VSIZE}.
7148
7149@item --max-ppsize=@var{N}
7150Specify the maximum size of the pointer protection stack as @var{N}
7151locations.  This defaults to 10000, but can be increased to allow
7152large and complicated calculations to be done.  Currently the maximum
7153value accepted is 100000.
7154
7155@item --max-mem-size=@var{N}
7156(Windows only) Specify a limit for the amount of memory to be used both
7157for @R{} objects and working areas.  This is set by default to the
7158smaller of the amount of physical RAM in the machine and for 32-bit
7159@R{}, 1.5Gb@footnote{2.5Gb on versions of Windows that support 3Gb per
7160process and have the support enabled: see the @file{rw-FAQ} Q2.9; 3.5Gb
7161on most 64-bit versions of Windows.}, and must be between 32Mb and the
7162maximum allowed on that version of Windows.
7163
7164@item --quiet
7165@itemx --silent
7166@itemx -q
7167Do not print out the initial copyright and welcome messages.
7168
7169@item --no-echo
7170Make @R{} run as quietly as possible.  This option is intended to
7171support programs which use @R{} to compute results for them.  It implies
7172@option{--quiet} and @option{--no-save}.
7173
7174@item --interactive
7175(UNIX only) Assert that @R{} really is being run interactively even if
7176input has been redirected: use if input is from a FIFO or pipe and fed
7177from an interactive program.  (The default is to deduce that @R{} is
7178being run interactively if and only if @file{stdin} is connected to a
7179terminal or @code{pty}.)  Using @option{-e}, @option{-f} or
7180@option{--file} asserts non-interactive use even if
7181@option{--interactive} is given.
7182
7183Note that this does not turn on command-line editing.
7184
7185@item --ess
7186(Windows only) Set @code{Rterm} up for use by @code{R-inferior-mode} in
7187@acronym{ESS}, including asserting interactive use (without the
7188command-line editor) and no buffering of @file{stdout}.
7189
7190@item --verbose
7191Print more information about progress, and in particular set @R{}'s
7192option @code{verbose} to @code{TRUE}.  @R{} code uses this option to
7193control the printing of diagnostic messages.
7194
7195@item --debugger=@var{name}
7196@itemx -d @var{name}
7197(UNIX only) Run @R{} through debugger @var{name}.  For most debuggers
7198(the exceptions are @command{valgrind} and recent versions of
7199@command{gdb}), further command line options are disregarded, and should
7200instead be given when starting the @R{} executable from inside the
7201debugger.
7202
7203@item --gui=@var{type}
7204@itemx -g @var{type}
7205(UNIX only) Use @var{type} as graphical user interface (note that this
7206also includes interactive graphics).  Currently, possible values for
7207@var{type} are @samp{X11} (the default) and, provided that @samp{Tcl/Tk}
7208support is available, @samp{Tk}. (For back-compatibility, @samp{x11} and
7209@samp{tk} are accepted.)
7210
7211@item --arch=@var{name}
7212(UNIX only) Run the specified sub-architecture.
7213
7214@item --args
7215This flag does nothing except cause the rest of the command line to be
7216skipped: this can be useful to retrieve values from it with
7217@code{commandArgs(TRUE)}.
7218@end table
7219
7220Note that input and output can be redirected in the usual way (using
7221@samp{<} and @samp{>}), but the line length limit of 4095 bytes still
7222applies.  Warning and error messages are sent to the error channel
7223(@code{stderr}).
7224
7225The command @code{R CMD} allows the invocation of various tools which
7226are useful in conjunction with @R{}, but not intended to be called
7227``directly''.  The general form is
7228
7229@example
7230R CMD @var{command} @var{args}
7231@end example
7232
7233@noindent
7234where @var{command} is the name of the tool and @var{args} the arguments
7235passed on to it.
7236
7237Currently, the following tools are available.
7238
7239@table @code
7240@item BATCH
7241Run @R{} in batch mode.  Runs @command{R --restore --save} with possibly
7242further options (see @code{?BATCH}).
7243@item COMPILE
7244(UNIX only) Compile C, C++, Fortran @dots{} files for use with @R{}.
7245@item SHLIB
7246Build shared library for dynamic loading.
7247@item INSTALL
7248Install add-on packages.
7249@item REMOVE
7250Remove add-on packages.
7251@item build
7252Build (that is, package) add-on packages.
7253@item check
7254Check add-on packages.
7255@item LINK
7256(UNIX only) Front-end for creating executable programs.
7257@item Rprof
7258Post-process @R{} profiling files.
7259@item Rdconv
7260@itemx Rd2txt
7261Convert Rd format to various other formats, including @HTML{}, @LaTeX{},
7262plain text, and extracting the examples.  @code{Rd2txt} can be used as
7263shorthand for @code{Rd2conv -t txt}.
7264@item Rd2pdf
7265Convert Rd format to PDF.
7266@item Stangle
7267Extract S/R code from Sweave or other vignette documentation
7268@item Sweave
7269Process Sweave or other vignette documentation
7270@item Rdiff
7271Diff @R{} output ignoring headers etc
7272@item config
7273Obtain configuration information
7274@item javareconf
7275(Unix only) Update the Java configuration variables
7276@item rtags
7277(Unix only) Create Emacs-style tag files from C, R, and Rd files
7278@item open
7279(Windows only) Open a file via Windows' file associations
7280@item texify
7281(Windows only) Process (La)TeX files with R's style files
7282@end table
7283
7284Use
7285
7286@example
7287R CMD @var{command} --help
7288@end example
7289
7290@noindent
7291to obtain usage information for each of the tools accessible via the
7292@code{R CMD} interface.
7293
7294In addition, you can use options @option{--arch=},
7295@option{--no-environ}, @option{--no-init-file}, @option{--no-site-file}
7296and @option{--vanilla} between @command{R} and @command{CMD}: these
7297affect any @R{} processes run by the tools.  (Here @option{--vanilla} is
7298equivalent to @option{--no-environ --no-site-file --no-init-file}.)
7299However, note that @command{R CMD} does not of itself use any @R{}
7300startup files (in particular, neither user nor site @file{Renviron}
7301files), and all of the @R{} processes run by these tools (except
7302@command{BATCH}) use @option{--no-restore}.  Most use @option{--vanilla}
7303and so invoke no @R{} startup files: the current exceptions are
7304@command{INSTALL}, @command{REMOVE}, @command{Sweave} and
7305@command{SHLIB} (which uses @option{--no-site-file --no-init-file}).
7306
7307@example
7308R CMD @var{cmd} @var{args}
7309@end example
7310
7311@noindent
7312for any other executable @command{@var{cmd}} on the path or given by an
7313absolute filepath: this is useful to have the same environment as @R{}
7314or the specific commands run under, for example to run @command{ldd} or
7315@command{pdflatex}.  Under Windows @var{cmd} can be an executable or a
7316batch file, or if it has extension @code{.sh} or @code{.pl} the
7317appropriate interpreter (if available) is called to run it.
7318
7319
7320@node Invoking R under Windows, Invoking R under macOS, Invoking R from the command line, Invoking R
7321@appendixsec Invoking R under Windows
7322
7323There are two ways to run @R{} under Windows.  Within a terminal window
7324(e.g.@: @code{cmd.exe} or a more capable shell), the methods described in
7325the previous section may be used, invoking by @code{R.exe} or more
7326directly by @code{Rterm.exe}.  For interactive use, there is a
7327console-based GUI (@code{Rgui.exe}).
7328
7329The startup procedure under Windows is very similar to that under
7330UNIX, but references to the `home directory' need to be clarified, as
7331this is not always defined on Windows.  If the environment variable
7332@env{R_USER} is defined, that gives the home directory.  Next, if the
7333environment variable @env{HOME} is defined, that gives the home
7334directory.  After those two user-controllable settings, @R{} tries to
7335find system defined home directories.  It first tries to use the
7336Windows "personal" directory (typically @code{My Documents} in
7337recent versions of Windows).  If that fails, and
7338environment variables @env{HOMEDRIVE} and @env{HOMEPATH} are defined
7339(and they normally are) these define the home directory.  Failing all
7340those, the home directory is taken to be the starting directory.
7341
7342You need to ensure that either the environment variables @env{TMPDIR},
7343@env{TMP} and @env{TEMP} are either unset or one of them points to a
7344valid place to create temporary files and directories.
7345
7346Environment variables can be supplied as @samp{@var{name}=@var{value}}
7347pairs on the command line.
7348
7349If there is an argument ending @file{.RData} (in any case) it is
7350interpreted as the path to the workspace to be restored: it implies
7351@option{--restore} and sets the working directory to the parent of the
7352named file.  (This mechanism is used for drag-and-drop and file
7353association with @code{RGui.exe}, but also works for @code{Rterm.exe}.
7354If the named file does not exist it sets the working directory
7355if the parent directory exists.)
7356
7357The following additional command-line options are available when
7358invoking @code{RGui.exe}.
7359
7360@table @option
7361@item --mdi
7362@itemx --sdi
7363@itemx --no-mdi
7364Control whether @code{Rgui} will operate as an MDI program
7365(with multiple child windows within one main window) or an SDI application
7366(with multiple top-level windows for the console, graphics and pager).  The
7367command-line setting overrides the setting in the user's @file{Rconsole} file.
7368
7369@item --debug
7370Enable the ``Break to debugger'' menu item in @code{Rgui}, and trigger
7371a break to the debugger during command line processing.
7372@end table
7373
7374Under Windows with @code{R CMD} you may also specify your own
7375@file{.bat}, @file{.exe}, @file{.sh} or @file{.pl} file.  It will be run
7376under the appropriate interpreter (Perl for @file{.pl}) with several
7377environment variables set appropriately, including @env{R_HOME},
7378@env{R_OSTYPE}, @env{PATH}, @env{BSTINPUTS} and @env{TEXINPUTS}.  For
7379example, if you already have @file{latex.exe} on your path, then
7380
7381@example
7382R CMD latex.exe mydoc
7383@end example
7384@noindent
7385will run @LaTeX{} on @file{mydoc.tex}, with the path to @R{}'s
7386@file{share/texmf} macros appended to @env{TEXINPUTS}.  (Unfortunately,
7387this does not help with the MiKTeX build of @LaTeX{}, but
7388@command{R CMD texify mydoc} will work in that case.)
7389
7390@node Invoking R under macOS, Scripting with R, Invoking R under Windows, Invoking R
7391@appendixsec Invoking R under macOS
7392
7393There are two ways to run @R{} under macOS.  Within a @code{Terminal.app}
7394window by invoking @code{R}, the methods described in the first
7395subsection apply.  There is also console-based GUI (@code{R.app}) that by
7396default is installed in the @code{Applications} folder on your
7397system.  It is a standard double-clickable macOS application.
7398
7399The startup procedure under macOS is very similar to that under UNIX, but
7400@code{R.app} does not make use of command-line arguments.  The `home
7401directory' is the one inside the R.framework, but the startup and
7402current working directory are set as the user's home directory unless a
7403different startup directory is given in the Preferences window
7404accessible from within the GUI.
7405
7406@node Scripting with R,  , Invoking R under macOS, Invoking R
7407@appendixsec Scripting with R
7408
7409If you just want to run a file @file{foo.R} of @R{} commands, the
7410recommended way is to use @command{R CMD BATCH foo.R}.  If you want to
7411run this in the background or as a batch job use OS-specific facilities
7412to do so: for example in most shells on Unix-alike OSes @command{R CMD
7413BATCH foo.R &} runs a background job.
7414
7415You can pass parameters to scripts via additional arguments on the
7416command line: for example (where the exact quoting needed will depend on
7417the shell in use)
7418
7419@example
7420R CMD BATCH "--args arg1 arg2" foo.R &
7421@end example
7422
7423@noindent
7424will pass arguments to a script which can be retrieved as a character
7425vector by
7426
7427@example
7428args <- commandArgs(TRUE)
7429@end example
7430
7431This is made simpler by the alternative front-end @command{Rscript},
7432which can be invoked by
7433
7434@example
7435Rscript foo.R arg1 arg2
7436@end example
7437
7438@noindent
7439and this can also be used to write executable script files like (at
7440least on Unix-alikes, and in some Windows shells)
7441
7442@example
7443#! /path/to/Rscript
7444args <- commandArgs(TRUE)
7445...
7446q(status=<exit status code>)
7447@end example
7448
7449@noindent
7450If this is entered into a text file @file{runfoo} and this is made
7451executable (by @command{chmod 755 runfoo}), it can be invoked for
7452different arguments by
7453
7454@example
7455runfoo arg1 arg2
7456@end example
7457
7458@noindent
7459For further options see @command{help("Rscript")}.  This writes @R{}
7460output to @file{stdout} and @file{stderr}, and this can be redirected in
7461the usual way for the shell running the command.
7462
7463If you do not wish to hardcode the path to @command{Rscript} but have it
7464in your path (which is normally the case for an installed @R{} except on
7465Windows, but e.g.@: macOS users may need to add @file{/usr/local/bin}
7466to their path), use
7467
7468@example
7469#! /usr/bin/env Rscript
7470...
7471@end example
7472
7473@noindent
7474At least in Bourne and bash shells, the @code{#!} mechanism does
7475@strong{not} allow extra arguments like
7476@code{#! /usr/bin/env Rscript --vanilla}.
7477
7478One thing to consider is what @code{stdin()} refers to.  It is
7479commonplace to write @R{} scripts with segments like
7480
7481@example
7482chem <- scan(n=24)
74832.90 3.10 3.40 3.40 3.70 3.70 2.80 2.50 2.40 2.40 2.70 2.20
74845.28 3.37 3.03 3.03 28.95 3.77 3.40 2.20 3.50 3.60 3.70 3.70
7485@end example
7486
7487@noindent
7488and @code{stdin()} refers to the script file to allow such traditional
7489usage.  If you want to refer to the process's @file{stdin}, use
7490@code{"stdin"} as a @code{file} connection, e.g.@: @code{scan("stdin", ...)}.
7491
7492Another way to write executable script files (suggested by Fran@,{c}ois
7493Pinard) is to use a @emph{here document} like
7494
7495@example
7496#!/bin/sh
7497[environment variables can be set here]
7498R --no-echo [other options] <<EOF
7499
7500   R program goes here...
7501
7502EOF
7503@end example
7504
7505@noindent
7506but here @code{stdin()} refers to the program source and
7507@code{"stdin"} will not be usable.
7508
7509Short scripts can be passed to @command{Rscript} on the command-line
7510@emph{via} the @option{-e} flag.  (Empty scripts are not accepted.)
7511
7512Note that on a Unix-alike the input filename (such as @file{foo.R})
7513should not contain spaces nor shell metacharacters.
7514
7515
7516@node The command-line editor, Function and variable index, Invoking R, Top
7517@appendix The command-line editor
7518
7519@appendixsection Preliminaries
7520
7521When the @acronym{GNU} @strong{readline} library is available at the
7522time @R{} is configured for compilation under UNIX, an inbuilt command
7523line editor allowing recall, editing and re-submission of prior commands
7524is used.  Note that other versions of @strong{readline} exist and may be
7525used by the inbuilt command line editor: this used to happen on macOS.
7526
7527It can be disabled (useful for usage with @acronym{ESS} @footnote{The
7528`Emacs Speaks Statistics' package; see the @acronym{URL}
7529@uref{https://ESS.R-project.org/}}) using the startup option
7530@option{--no-readline}.
7531
7532Windows versions of @R{} have somewhat simpler command-line editing: see
7533@samp{Console} under the @samp{Help} menu of the @acronym{GUI}, and the
7534file @file{README.Rterm} for command-line editing under
7535@code{Rterm.exe}.
7536
7537When using @R{} with GNU@footnote{It is possible to build @R{} using an
7538emulation of GNU @strong{readline}, such as one based on NetBSD's
7539@strong{editline} (also known as @strong{libedit}), in which case only a
7540subset of the capabilities may be provided.} @strong{readline}
7541capabilities, the functions described below are available, as well as
7542others (probably) documented in @command{man readline} or @command{info
7543readline} on your system.
7544
7545Many of these use either Control or Meta characters.  Control
7546characters, such as @kbd{Control-m}, are obtained by holding the
7547@key{CTRL} down while you press the @key{m} key, and are written as
7548@kbd{C-m} below.  Meta characters, such as @kbd{Meta-b}, are typed by
7549holding down @key{META}@footnote{On a PC keyboard this is usually the
7550Alt key, occasionally the `Windows' key.  On a Mac keyboard normally no
7551meta key is available.} and pressing @key{b}, and written as @kbd{M-b}
7552in the following.  If your terminal does not have a @key{META} key
7553enabled, you can still type Meta characters using two-character
7554sequences starting with @kbd{ESC}.  Thus, to enter @kbd{M-b}, you could
7555type @key{ESC}@key{b}.  The @kbd{ESC} character sequences are also
7556allowed on terminals with real Meta keys.  Note that case is significant
7557for Meta characters.
7558
7559Some but not all versions@footnote{In particular, not versions 6.3 or
7560later: this is worked around as from @R{} 3.4.0.} of @strong{readline}
7561will recognize resizing of the terminal window so this is best avoided.
7562
7563@appendixsection Editing actions
7564
7565The @R{} program keeps a history of the command lines you type,
7566including the erroneous lines, and commands in your history may be
7567recalled, changed if necessary, and re-submitted as new commands.  In
7568Emacs-style command-line editing any straight typing you do while in
7569this editing phase causes the characters to be inserted in the command
7570you are editing, displacing any characters to the right of the cursor.
7571In @emph{vi} mode character insertion mode is started by @kbd{M-i} or
7572@kbd{M-a}, characters are typed and insertion mode is finished by typing
7573a further @key{ESC}.  (The default is Emacs-style, and only that is
7574described here: for @emph{vi} mode see the @strong{readline}
7575documentation.)
7576
7577Pressing the @key{RET} command at any time causes the command to be
7578re-submitted.
7579
7580Other editing actions are summarized in the following table.
7581
7582@appendixsection Command-line editor summary
7583
7584@subheading Command recall and vertical motion
7585
7586@table @kbd
7587@item C-p
7588Go to the previous command (backwards in the history).
7589@item C-n
7590Go to the next command (forwards in the history).
7591@item C-r @var{text}
7592Find the last command with the @var{text} string in it.  This can be
7593cancelled by @code{C-g} (and on some versions of @R{} by @code{C-c}).
7594@end table
7595
7596On most terminals, you can also use the up and down arrow keys instead
7597of @kbd{C-p} and @kbd{C-n}, respectively.
7598
7599@subheading Horizontal motion of the cursor
7600
7601@table @kbd
7602@item C-a
7603Go to the beginning of the command.
7604@item C-e
7605Go to the end of the line.
7606@item M-b
7607Go back one word.
7608@item M-f
7609Go forward one word.
7610@item C-b
7611Go back one character.
7612@item C-f
7613Go forward one character.
7614@end table
7615
7616On most terminals, you can also use the left and right arrow keys
7617instead of @kbd{C-b} and @kbd{C-f}, respectively.
7618
7619@subheading Editing and re-submission
7620
7621@table @kbd
7622@item @var{text}
7623Insert @var{text} at the cursor.
7624@item C-f @var{text}
7625Append @var{text} after the cursor.
7626@item @key{DEL}
7627Delete the previous character (left of the cursor).
7628@item C-d
7629Delete the character under the cursor.
7630@item M-d
7631Delete the rest of the word under the cursor, and ``save'' it.
7632@item C-k
7633Delete from cursor to end of command, and ``save'' it.
7634@item C-y
7635Insert (yank) the last ``saved'' text here.
7636@item C-t
7637Transpose the character under the cursor with the next.
7638@item M-l
7639Change the rest of the word to lower case.
7640@item M-c
7641Change the rest of the word to upper case.
7642@item @key{RET}
7643Re-submit the command to @R{}.
7644@end table
7645
7646The final @key{RET} terminates the command line editing sequence.
7647
7648The @strong{readline} key bindings can be customized in the usual way
7649@emph{via} a @file{~/.inputrc} file.  These customizations can be
7650conditioned on application @code{R}, that is by including a section like
7651
7652@example
7653$if R
7654  "\C-xd": "q('no')\n"
7655$endif
7656@end example
7657
7658@node Function and variable index, Concept index, The command-line editor, Top
7659@appendix Function and variable index
7660
7661@printindex vr
7662
7663@node Concept index, References, Function and variable index, Top
7664@appendix Concept index
7665
7666@printindex cp
7667
7668@node References,  , Concept index, Top
7669@appendix References
7670
7671D.@: M.@: Bates and  D.@: G.@: Watts (1988), @emph{Nonlinear Regression
7672Analysis and Its Applications.} John Wiley & Sons, New York.
7673
7674@noindent
7675Richard A.@: Becker, John M.@: Chambers and Allan R.@: Wilks (1988),
7676@emph{The New S Language.} Chapman & Hall, New York.
7677This book is often called the ``@emph{Blue Book}''.
7678
7679@noindent
7680John M.@: Chambers and Trevor J.@: Hastie eds. (1992),
7681@emph{Statistical Models in S.} Chapman & Hall, New York.
7682This is also called the ``@emph{White Book}''.
7683
7684@noindent
7685John M.@: Chambers (1998)
7686@emph{Programming with Data}. Springer, New York.
7687This is also called the ``@emph{Green Book}''.
7688
7689@noindent
7690A.@: C.@: Davison and D.@: V.@: Hinkley (1997), @emph{Bootstrap Methods
7691and Their Applications}, Cambridge University Press.
7692
7693@noindent
7694Annette J.@: Dobson (1990), @emph{An Introduction to Generalized Linear
7695Models}, Chapman and Hall, London.
7696
7697@noindent
7698Peter McCullagh and John A.@: Nelder (1989), @emph{Generalized Linear
7699Models.} Second edition, Chapman and Hall, London.
7700
7701@noindent
7702John A.@ Rice (1995), @emph{Mathematical Statistics and Data Analysis.}
7703Second edition.  Duxbury Press, Belmont, CA.
7704
7705@noindent
7706S.@: D.@: Silvey (1970), @emph{Statistical Inference.} Penguin, London.
7707
7708@bye
7709