lisp/PSD.doc/ch7.n

         Copyright (c) 1980 The Regents of the University of California.
 All rights reserved.

 %sccs.include.redist.roff%

 @(#)ch7.n 6.3 (Berkeley) 04/17/91

." $Header: ch7.n,v 1.3 83/07/01 11:22:58 layer Exp $
.Lc The Lisp Reader 7
.sh 2 Introduction \n(ch 1
.pp
The
.i read
function is responsible for converting
a stream of
characters into a Lisp expression.
.i Read
is table driven and the table it uses is called a
.i readtable.
The
.i print
function does the
inverse of
.i read ;
it converts a Lisp expression into a stream of
characters.
Typically the conversion is done in such
a way that if that stream of characters were read by
.i read ,
the
result would be an expression equal to the one
.i print
was given.
.i Print
must also refer to the readtable in order to determine
how to format its output.
The
.i explode
function, which returns a list of characters rather than
printing them, must also refer to the readtable.
.pp
A readtable is created
with the
.i makereadtable
function, modified with the
.i setsyntax
function and interrogated with the
.i getsyntax
function.
The structure of a readtable is hidden from the user - a
readtable should
only be manipulated with the three functions mentioned above.
.pp
There is one distinguished readtable called the
.i current
.i readtable
whose value determines what
.i read ,
.i print
and
.i explode
do.
The current readtable is the value of the symbol
.i readtable .
Thus it is possible to rapidly change
the current syntax by lambda binding
a different readtable to the symbol
.i readtable.
When the binding is undone, the syntax reverts to its old form.
.sh +0 Syntax Classes
.pp
The readtable describes how each of the 128 ascii characters should
be treated by the reader and printer.
Each character belongs to a
.i syntax
.i class
which has three properties:
.ip character class -
Tells what the reader should do when it sees this character.
There are a large number of character classes.
They are described below.
.ip separator -
Most types of tokens the reader constructs are one character
long.
Four token types have an arbitrary length: number (1234),
symbol print name (franz),
escaped symbol print name (|franz|), and string ("franz").
The reader can easily determine when it has
come to the
end of one of the last two types: it just looks for the
matching delimiter (| or ").
When the reader is reading a number or symbol print name, it
stops reading when it comes to a character with the
.i separator
property.
The separator character is pushed back into the input stream and will
be the first character read when the reader is called again.
.ip escape -
Tells the printer when to put escapes in front of, or around, a symbol
whose print name contains this character.
There are three possibilities: always escape a symbol with this character
in it, only escape a symbol if this is the only character in the symbol,
and only escape a symbol if this is the first character in the symbol.
[note: The printer will always escape a symbol which, if printed out, would
look like a valid number.]
.pp
When the Lisp system is built, Lisp code is added to a C-coded kernel
and the result becomes the standard lisp system.
The readtable present in the C-coded kernel, called the
.i raw
.i readtable ,
contains the bare necessities for reading in Lisp code.
During the
construction of the complete Lisp system,
a copy is made of the raw readtable and
then the copy is modified by adding macro characters.
The result is what is called the
.i standard
.i readtable .
When a new readtable is created with
.i makereadtable,
a copy is made of either the
raw readtable
or the current readtable (which is likely to be the standard readtable).
.sh +0 Reader Operations
.pp
The reader has a very simple algorithm.
It is either
.i scanning
for a token,
.i collecting
a token,
or
.i processing
a token.
Scanning involves reading characters and throwing
away those which don't start tokens (such as blanks and tabs).
Collecting means gathering the characters which make up a
token into a buffer.
Processing may involve creating symbols, strings, lists,
fixnums, bignums or flonums or calling a user written function called
a character macro.
.pp
The components of the syntax class determine when the reader
switches between the scanning, collecting and processing states.
The reader will continue scanning as long as the character class
of the characters it reads is
.i cseparator.
When it reads a character whose character class is not
.i cseparator
it stores that character in its buffer and begins the collecting phase.
.pp
If the character class of that first character is
.i ccharacter ,
.i cnumber ,
.i cperiod ,
or
.i csign .
then it will continue collecting until it runs into a character whose
syntax class has the
.i separator
property.
(That last character will be pushed back into the input buffer and will
be the first character read next time.)
Now the reader goes into the processing phase, checking to see if the
token it read is a number or symbol.
It is important to note that after
the first character is collected the component of the syntax class which
tells the reader to stop
collecting is the
.i separator
property, not the character class.
.pp
If the character class of the character which stopped the scanning is not
.i ccharacter ,
.i cnumber ,
.i cperiod ,
or
.i csign .
then the reader processes that character immediately.
The character classes
.i csingle-macro ,
.i csingle-splicing-macro ,
and
.i csingle-infix-macro
will act like
.i ccharacter
if the following token is not a
.i separator.
The processing which is done for a given character class
is described in detail in the next section.
.sh +0 Character Classes
.tl '\\$1''raw readtable:\\$2'
.tl '''standard readtable:\\$3'
..
.pc
.Cc ccharacter A-Z a-z ^H !#$%&*,/:;<=>?@^_`{}~ A-Z a-z ^H !$%&*/:;<=>?@^_{}~
.pc %
A normal character.
.Cc cnumber 0-9 0-9
This type is a digit.
The syntax for an integer (fixnum or bignum) is a string of
.i cnumber
characters optionally followed by a
.i cperiod.
If the digits are not followed by a
.i cperiod ,
then they are interpreted in base
.i ibase
which must be eight or ten.
The syntax for a floating point number is
either zero or more
.i cnumber 's
followed by a
.i cperiod
and then followed by one or more
.i cnumber 's.
A floating point number
may also be an integer or floating point number followed
by 'e' or 'd', an optional '+' or '-'
and then zero or more
.i cnumber 's.
.Cc csign +- +-
A leading sign for a number.
No other characters should be given this class.
.Cc cleft-paren ( (
A left parenthesis.
Tells the reader to begin forming a list.
.Cc cright-paren ) )
A right parenthesis.
Tells the reader that it has reached the end of a list.
.Cc cleft-bracket [ [
A left bracket.
Tells the reader that it should begin forming a list.
See the description of
.i cright-bracket
for the difference between cleft-bracket and cleft-paren.
.Cc cright-bracket ] ]
A right bracket.
A
.i cright-bracket
finishes the formation of the current
list and all enclosing lists until it finds one which
begins with a
.i cleft-bracket
or until it reaches the
top level list.
.Cc cperiod . .
The period is used to separate element of a cons cell
[e.g. (a . (b . nil)) is the same as (a b)].
.i cperiod
is also used in numbers as described above.
.Cc cseparator ^I-^M esc space ^I-^M esc space
Separates tokens. When the reader is scanning, these character
are passed over.
Note: there is a difference between the
.i cseparator
character class and the
.i separator
property of a syntax class.
.Cc csingle-quote \\' \\'
This causes
.i read
to be called recursively and the list
(quote <value read>) to be returned.
.Cc csymbol-delimiter | |
This causes the reader to begin collecting characters and to stop only
when another identical
.i csymbol-delimiter
is seen.
The only way to escape a
.i csymbol-delimiter
within a symbol name is with a
.i cescape
character.
The collected characters are converted into a string which becomes
the print name of a symbol.
If a symbol with an identical print name already exists, then the
allocation is not done, rather the existing symbol is used.
.Cc cescape \e \e
This causes the next character to read in to be treated as a
.b vcharacter .
A character whose syntax class is
.b vcharacter
has a character class
.i ccharacter
and does not have
the
.i separator
property so it will not separate symbols.
.Cc cstring-delimiter """" """"
This is the same as
.i csymbol-delimiter
except the result is returned as a string instead of a symbol.
.Cc csingle-character-symbol none none
This returns a symbol whose print name is the the single character
which has been collected.
.Cc cmacro none `,
The reader calls the macro function associated with this character and
the current readtable, passing it no arguments.
The result of the macro is added to the structure the reader is building,
just as if that form were directly read by the reader.
More details on macros are provided below.
.Cc csplicing-macro none #;
A
.i csplicing-macro
differs from a
.i cmacro
in the way the result is incorporated in the structure the reader is
building.
A
.i csplicing-macro
must return a list of forms (possibly empty).
The reader acts as
if it read each element of
the list itself without
the surrounding parenthesis.
.Cc csingle-macro none none
This causes to reader to check the next character.
If it is a
.i cseparator
then this acts like a
.i cmacro.
Otherwise, it acts like a
.i ccharacter.
.Cc csingle-splicing-macro none none
This is triggered like a
.i csingle-macro
however the result is spliced in like a
.i csplicing-macro.
.Cc cinfix-macro none none
This is differs from a
.i cmacro
in that the macro function is passed a form representing what the reader
has read so far.
The result of the macro replaces what the reader had read so far.
.Cc csingle-infix-macro none none
This differs from the
.i cinfix-macro
in that the macro will only be triggered if the character following the
.i csingle-infix-macro
character is a
.i cseparator .
.Cc cillegal ^@-^G^N-^Z^\e-^_rubout ^@-^G^N-^Z^\e-^_rubout
The characters cause the reader to signal an error if read.
.sh +0 Syntax Classes
.pp
The readtable maps each character into a syntax class.
The syntax class contains three pieces of information:
the character class, whether this is a separator, and the escape
properties.
The first two properties are used by the reader, the last by
the printer (and
.i explode ).
The initial lisp system has the following syntax classes defined.
The user may add syntax classes with
.i add-syntax-class .
For each syntax class, we list the properties of the class and
which characters have this syntax class by default.
More information about each syntax class can be found under the
description of the syntax class's character class.
.(b
.tl '\\$1''raw readtable:\\$2'
.tl '\\$4''standard readtable:\\$3'
.tl '\\$5'''
.)b
..
.pc
.Sy vcharacter A-Z a-z ^H !#$%&*,/:;<=>?@^_`{}~ A-Z a-z ^H !$%&*/:;<=>?@^_{}~ ccharacter
.pc %
.Sy vnumber 0-9 0-9 cnumber
.Sy vsign +- +- csign
.Sy vleft-paren ( ( cleft-paren escape-always separator
.Sy vright-paren ) ) cright-paren escape-always separator
.Sy vleft-bracket [ [ cleft-bracket escape-always separator
.Sy vright-bracket ] ] cright-bracket escape-always separator
.Sy vperiod . . cperiod escape-when-unique
.Sy vseparator ^I-^M esc space ^I-^M esc space cseparator escape-always separator
.Sy vsingle-quote \\' \\' csingle-quote escape-always separator
.Sy vsymbol-delimiter | | csingle-delimiter escape-always
.Sy vescape \e \e cescape escape-always
.Sy vstring-delimiter """" """" cstring-delimiter escape-always
.Sy vsingle-character-symbol none none csingle-character-symbol separator
.Sy vmacro none `, cmacro escape-always separator
.Sy vsplicing-macro none #; csplicing-macro escape-always separator
.Sy vsingle-macro none none csingle-macro escape-when-unique
.Sy vsingle-splicing-macro none none csingle-splicing-macro escape-when-unique
.Sy vinfix-macro none none cinfix-macro escape-always separator
.Sy vsingle-infix-macro none none csingle-infix-macro escape-when-unique
.Sy villegal ^@-^G^N-^Z^\e-^_rubout ^@-^G^N-^Z^\e-^_rubout cillegal escape-always separator
.sh +0 Character Macros
.pp
Character macros are
user written functions which are executed during the reading process.
The value returned by a character macro may or may not be used by
the reader, depending on the type of macro and the value returned.
Character macros are always attached to a single character with
the
.i setsyntax
function.
.sh +1 Types
There are three types of character macros: normal, splicing and infix.
These types differ in the arguments they are given or in what is done
with the result they return.
.sh +1 Normal
.pp
A normal macro
is passed no arguments.
The value returned by a normal macro is simply used by
the reader as if it had read the value itself.
Here is an example of a macro which returns the abbreviation
for a given state.
.Eb
->(de\kAfun stateabbrev nil
 \h'|\nAu'(cdr (assq (read) '((california . ca) (pennsylvania . pa)))))
stateabbrev
-> (setsyntax '\e! 'vmacro 'stateabbrev)
t
-> '( ! california ! wyoming ! pennsylvania)
(ca nil pa)
.Ee
Notice what happened to
 ! wyoming.
Since it wasn't in the table, the associated function
returned nil.
The creator of the macro may have wanted to leave the
list alone, in such a case, but couldn't with this
type of reader macro.
The splicing macro, described next, allows a character macro function
to return a value that is ignored.
.sh +0 Splicing
.pp
The value returned from a splicing macro must be a list or nil.
If the value is nil, then the value is ignored, otherwise the reader
acts as if it read each object in the list.
Usually the list only contains one element.
If the reader is reading at the top level (i.e. not collecting elements
of list),
then it is illegal for a splicing macro to return more then one
element in the list.
The major advantage of a splicing macro over a normal macro is the
ability of the splicing macro to return nothing.
The comment character (usually ;) is a splicing macro bound to a
function which reads to the end of the line and always returns nil.
Here is the previous example written as a splicing macro
.Eb
-> (de\kAfun stateabbrev nil
\h'|\nAu'(\kC(lam\kBbda (value)
 \h'|\nBu'(cond \kA(value (list value))
 \h'|\nAu'(t nil)))
 \h'|\nCu'(cdr (assq (read) '((california . ca) (pennsylvania . pa))))))
-> (setsyntax '! 'vsplicing-macro 'stateabbrev)
-> '(!pennsylvania ! foo !california)
(pa ca)
-> '!foo !bar !pennsylvania
pa
->
.Ee
.sh +0 Infix
.pp
Infix macros are passed a
.i conc
structure representing what has been read so far.
Briefly, a
tconc
structure is a single list cell whose car points to
a list and whose cdr points to the last list cell in that list.
The interpretation by the reader of the value
returned by an infix macro depends on
whether the macro is called while the reader is constructing a
list or whether it is called at the top level of the reader.
If the macro is called while a list is
being constructed, then the value returned should be a tconc
structure.
The car of that structure replaces the list of elements that the
reader has been collecting.
If the macro is called at top level, then it will be passed the
value nil, and the value it returns should either be nil
or a tconc structure.
If the macro returns nil, then the value is ignored and the reader
continues to read.
If the macro returns a tconc structure of one element (i.e. whose car
is a list of one element), then that single element is returned
as the value of
.i read.
If the macro returns a tconc structure of more than one element,
then that list of elements is returned as the value of read.
.Eb
-> (de\kAfun plusop (x)
 \h'|\nAu'(cond \kB((null x) (tconc nil '\e+))
 \h'|\nBu'(t (lconc nil (list 'plus (caar x) (read))))))

plusop
-> (setsyntax '\e+ 'vinfix-macro 'plusop)
t
-> '(a + b)
(plus a b)
-> '+
|+|
->
.Ee
.sh -1 Invocations
.pp
There are three different circumstances in which you would like
a macro function to be triggered.
.ip Always -
Whenever the macro character is seen, the macro should be invoked.
This is accomplished by using the character classes
.i cmacro ,
.i csplicing-macro ,
or
.i cinfix-macro ,
and by using the
.i separator
property.
The syntax classes
.b vmacro ,
.b vsplicing-macro ,
and
.b vsingle-macro
are defined this way.
.ip When first -
The macro should only be triggered when the macro character is the first
character found after the scanning process.
A syntax class for a
.i when
.i first
macro would
be defined
using
.i cmacro ,
.i csplicing-macro ,
or
.i cinfix-macro
and not including the
.i separator
property.
.ip When unique -
The macro should only be triggered when the macro character is the only
character collected in the token collection
phase of the reader,
i.e the macro character is preceeded by zero or more
.i cseparator s
and followed by a
.i separator.
A syntax class for a
.i when
.i unique
macro would
be defined using
.i csingle-macro ,
.i csingle-splicing-macro ,
or
.i csingle-infix-macro
and not including the
.i separator
property.
The syntax classes so defined are
.b vsingle-macro ,
.b vsingle-splicing-macro ,
and
.b vsingle-infix-macro .
.sh -1 Functions
.Lf setsyntax 's_symbol 's_synclass ['ls_func]
.Wh
ls_func is the name of a function or a lambda body.
.Re
t
.Se
S_symbol should be a symbol whose print name is only one character.
The syntax class for
that character is
set to s_synclass in the current readtable.
If s_synclass is a class that requires a character macro, then
ls_func must be supplied.
.No
The symbolic syntax codes are new to Opus 38.
For compatibility, s_synclass can be one of the fixnum syntax codes
which appeared in older versions of the
.Fr
Manual.
This compatibility is only temporary: existing code which uses the
fixnum syntax codes should be converted.
.Lf getsyntax 's_symbol
.Re
the syntax class of the first character
of s_symbol's print name.
s_symbol's print name must be exactly one character long.
.No
This function is new to Opus 38.
It supercedes (status syntax) which no longer exists.
.Lf add-syntax-class 's_synclass 'l_properties
.Re
s_synclass
.Se
Defines the syntax class s_synclass to have properties l_properties.
The list l_properties should contain a character classes mentioned
above.
l_properties may contain one of the escape properties:
.i escape-always ,
.i escape-when-unique ,
or
.i escape-when-first .
l_properties may contain the
.i separator
property.
After a syntax class has been defined with
.i add-syntax-class ,
the
.i setsyntax
function can be used to give characters that syntax class.
.Eb
; Define a non-separating macro character.
; This type of macro character is used in UCI-Lisp, and
; it corresponds to a FIRST MACRO in Interlisp
-> (add-syntax-class 'vuci-macro '(cmacro escape-when-first))
vuci-macro
->
.Ee