
@section Introduction

This chapter is under construction!


This chapter describes some of the internals of @command{vasm}
and tries to explain
what has to be done to write a cpu module, a syntax module
or an output module for @command{vasm}.
However, if you want to write one, I suggest contacting me first,
so that it can be integrated into the source tree.

Note that this documentation may mention explicit values when introducing
symbolic constants. This is due to copying and pasting from the source
code. These values may not be up to date and in some cases can be overridden.
Therefore never use the absolute values; always use the symbolic
representations instead.


@section Building vasm

This section deals with the steps necessary to build the typical
@command{vasm} executable from the sources.

@subsection Directory Structure

    The vasm-directory contains the following important files and
    directories:
@table @file
@item    vasm/
The main directory containing the assembler sources.

@item    vasm/Makefile
The Makefile used to build @command{vasm}.

@item    vasm/syntax/<syntax-module>/
Directories for the syntax modules.

@item    vasm/cpus/<cpu-module>/
Directories for the cpu modules.

@item vasm/obj/
Directory the object modules will be stored in.

@end table

    All compiling is done from the main directory and
    the executables will be placed there as well.
    The main assembler for a combination of @code{<cpu>} and
    @code{<syntax>} will
    be called @command{vasm<cpu>_<syntax>}. All output modules are
    usually integrated in every executable and can be selected at
    runtime.

@subsection Adapting the Makefile

    Before building anything you have to insert correct values for
    your compiler and operating system in the @file{Makefile}.

@table @code
    @item TARGET
       Here you may define an extension which is appended to the executable's
       name. Useful if you build various targets in the same directory.

    @item TARGETEXTENSION
       Defines the file name extension for executable files. Not needed for
       most operating systems. For Windows it would be @file{.exe}.

    @item CC
       Here you have to insert a command that invokes the ANSI C
       compiler you want to use to build vasm. It must support
       the @option{-I} option in the same way as, for example,
       @command{vc} or @command{gcc}.

    @item COPTS
       Here you will usually define an option like @option{-c} to instruct
       the compiler to generate an object file.
       Additional options, like the optimization level, should be
       inserted here as well. When the host operating system is not
       a Unix (MacOSX and MiNT count as Unix), you have to define one of the
       following preprocessor macros:
       @table @code
          @item -DAMIGA
          AmigaOS (M68k or PPC), MorphOS, AROS.
          @item -DATARI
          Atari TOS.
          @item -DMSDOS
          CP/M, MS-DOS, Windows.
       @end table

    @item CCOUT
       Here you define the option which is used to specify the name of
       an output file, which is usually @option{-o}.

    @item LD
       Here you insert a command which starts the linker. This may be
       the same as under @code{CC}.

    @item LDFLAGS
       Here you have to add options which are necessary for linking.
       E.g. some compilers need special libraries for floating-point.

    @item LDOUT
       Here you define the option which is used by the linker to specify
       the output file name.

    @item RM
      Specify a command to delete a file, e.g. @code{rm -f}.
@end table

    An example for the Amiga using @command{vbcc} would be:
@example
      TARGET = _os3
      TARGETEXTENSION =
      CC = vc +aos68k
      CCOUT = -o
      COPTS = -c -c99 -cpu=68020 -DAMIGA -O1
      LD = $(CC)
      LDOUT = $(CCOUT)
      LDFLAGS = -lmieee
      RM = delete force quiet
@end example

    An example for a typical Unix-installation would be:
@example
      TARGET =
      TARGETEXTENSION =
      CC = gcc
      CCOUT = -o
      COPTS = -c -O2
      LD = $(CC)
      LDOUT = $(CCOUT)
      LDFLAGS = -lm
      RM = rm -f
@end example
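
    As a further illustration, a Windows host using a gcc-based toolchain
    might be configured as sketched below. This is an assumption, not a
    tested configuration: the @code{TARGET} suffix is arbitrary, and
    @code{rm -f} presumes a Unix-like shell environment is installed.
    It only shows where @code{TARGETEXTENSION} and @option{-DMSDOS} go.
@example
      TARGET = _win32
      TARGETEXTENSION = .exe
      CC = gcc
      CCOUT = -o
      COPTS = -c -O2 -DMSDOS
      LD = $(CC)
      LDOUT = $(CCOUT)
      LDFLAGS = -lm
      RM = rm -f
@end example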

Open/Net/Free/Any BSD i386 systems will probably require
an additional @option{-D_ANSI_SOURCE} in @code{COPTS}.


@subsection Building vasm

Note to users of Open/Free/Any BSD i386 systems: You will probably have to use
GNU make instead of BSD make, i.e. in the following examples replace "make"
with "gmake".

    Type:
@example
      make CPU=<cpu> SYNTAX=<syntax>
@end example
    For example:
@example
      make CPU=ppc SYNTAX=std
@end example

The following CPU modules can be selected:
@itemize
@item @code{CPU=6502}
@item @code{CPU=6800}
@item @code{CPU=arm}
@item @code{CPU=c16x}
@item @code{CPU=jagrisc}
@item @code{CPU=m68k}
@item @code{CPU=ppc}
@item @code{CPU=test}
@item @code{CPU=tr3200}
@item @code{CPU=vidcore}
@item @code{CPU=x86}
@item @code{CPU=z80}
@end itemize

The following syntax modules can be selected:
@itemize
@item @code{SYNTAX=std}
@item @code{SYNTAX=mot}
@item @code{SYNTAX=madmac}
@item @code{SYNTAX=oldstyle}
@item @code{SYNTAX=test}
@end itemize

For Windows and various Amiga targets there are already Makefiles included,
which you may either copy on top of the default @file{Makefile}, or call
explicitly with @command{make}'s @option{-f} option:
@example
    make -f Makefile.OS4 CPU=ppc SYNTAX=std
@end example


@section General data structures

This section describes the fundamental data structures used in vasm
which are usually necessary to understand for writing any kind of
module (cpu, syntax or output). More detailed information is given in
the respective sections on writing specific modules where necessary.

@subsection Source

A source structure represents a source text module, which can be
either the main source text, an included file or a macro. There is
always a link to the parent source from where the current source context
was included or called.

@table @code
@item struct source *parent;
        Pointer to the parent source context. Assembly continues there
        when the current source context ends.

@item int parent_line;
        Line number in the parent source context, from where we were called.
        This information is needed, because line numbers are only reliable
        during parsing and later from the atoms. But an include directive
        doesn't create an atom.

@item char *name;
        File name of the main source or include file, or macro name.

@item char *text;
        Pointer to the source text start.

@item size_t size;
        Size of the source text to assemble in bytes.

@item macro *macro;
        Pointer to macro structure, when currently inside a macro
        (see also @code{num_params}).

@item unsigned long repeat;
        Number of repetitions of this source text. Usually this is 1, but
        for text blocks between a @code{rept} and @code{endr} directive
        it allows any number of repetitions, which is decremented every time
        the end of this source text block is reached.

@item char *irpname;
        Name of the iterator symbol in special repeat loops which use a
        sequence of arbitrary values, being assigned to this symbol within
        the loop. Example: @code{irp} directive in std-syntax.

@item struct macarg *irpvals;
        A list of arbitrary values to iterate over in a loop. With each
        iteration the frontmost value is removed from the list until it is
        empty.

@item int cond_level;
        Current level of conditional nesting while entering this source
        text. It is automatically restored to the previous level when
        leaving the source prematurely through @code{end_source()}.

@item struct macarg *argnames;
        The current list of named macro arguments.

@item int num_params;
        Number of macro parameters passed at the invocation point from
        the parent source. For normal source files this entry will be -1.
        For macros it is 0 (no parameters) or higher.

@item char *param[MAXMACPARAMS];
        Pointer to the macro parameters.

@item int param_len[MAXMACPARAMS];
        Number of characters per macro parameter.

@item int num_quals;
        (If @code{MAX_QUALIFIERS!=0}.) Number of qualifiers for a macro.
        When not passed on invocation, these are the default qualifiers.

@item char *qual[MAX_QUALIFIERS];
        (If @code{MAX_QUALIFIERS!=0}.) Pointer to macro qualifiers.

@item int qual_len[MAX_QUALIFIERS];
        (If @code{MAX_QUALIFIERS!=0}.) Number of characters per macro qualifier.

@item unsigned long id;
        Every source has its unique id. Useful for macros supporting
        the special @code{\@@} parameter.

@item char *srcptr;
        The current source text pointer, pointing to the beginning of
        the next line to assemble.

@item int line;
        Line number in the current source context. After parsing the
        line number of the current atom is stored here.

@item size_t bufsize;
        Current size of the line buffer (@code{linebuf}). The size of the
        line buffer is extended automatically, when an overflow happens.

@item char *linebuf;
        A buffer for the current line being assembled
        in this source text. A child-source, like a macro, can refer to
        arguments from this buffer, so every source has got its own.
        When returning to the parent source, the linebuf is deallocated
        to save memory.

@item expr *cargexp;
        (If @code{CARGSYM} was defined.) Pointer to the current expression
        assigned to the CARG-symbol (used to select a macro argument) in
        this source instance. So it can be restored when reentering this
        instance.

@item long reptn;
        (If @code{REPTNSYM} was defined.) Current value of the repetition
        counter symbol in this source instance. So it can be restored when
        reentering this instance.
@end table
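
The @code{parent} and @code{parent_line} members are enough to
reconstruct how the current source context was reached. The following is
a minimal sketch, not code from the vasm source tree: it uses a reduced
stand-in structure holding only a few of the members listed above, and
@code{print_source_chain()} is a hypothetical helper.

@example
#include <stdio.h>

/* Reduced stand-in for the source structure (assumption: only the
   members needed here, in arbitrary order). */
struct source @{
  struct source *parent;
  int parent_line;
  char *name;
  int num_params;
  int line;
@};

/* Hypothetical helper: print the current context and all its parents. */
static void print_source_chain(const struct source *src)
@{
  int line = src->line;

  while (src != NULL) @{
    /* num_params is -1 for normal source files, >= 0 for macros */
    printf("%s %s, line %d\n",
           src->num_params >= 0 ? "macro" : "file", src->name, line);
    line = src->parent_line;  /* where the parent included/called us */
    src = src->parent;
  @}
@}

int main(void)
@{
  struct source mainsrc = @{ NULL, 0, "main.s", -1, 12 @};
  struct source inc = @{ &mainsrc, 12, "defs.i", -1, 40 @};
  struct source mac = @{ &inc, 40, "PUSHALL", 2, 3 @};

  print_source_chain(&mac);
  return 0;
@}
@end example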

@subsection Sections

One of the top level structures is a linked list of sections describing
continuous blocks of memory. A section is specified by an object of
type @code{section} with the following members that can be accessed by
the modules:

@table @code
@item  struct section *next;
        A pointer to the next section in the list.

@item  char *name;
        The name of the section.

@item  char *attr;
        A string describing the section flags in ELF notation (see,
        for example, the documentation of the @code{.section} directive of
        the standard syntax module).

@item  atom *first;
@itemx atom *last;
        Pointers to the first and last atom of the section. See following
        sections for information on atoms.

@item  taddr align;
        Alignment of the section in bytes.

@item  uint32_t flags;
        Flags of the section. Currently available flags are:
@table @code
@item HAS_SYMBOLS
        At least one symbol is defined in this section.
@item RESOLVE_WARN
        The current atom changed its size multiple times, so atom_size()
        is now called with this flag set in its section to make the
        backend (e.g. @code{instruction_size()}) aware of it and do less
        aggressive optimizations.
@item UNALLOCATED
        Section is unallocated, which means it doesn't use any memory space
        in the output file. Such a section will be removed before creating
        the output file and all its labels converted into absolute expression
        symbols. Used for "offset" sections. Refer to
        @code{switch_offset_section()}.
@item LABELS_ARE_LOCAL
        As long as this flag is set new labels in a section are defined
        as local labels, with the section name as global parent label.
@item ABSOLUTE
        Section is loaded at an absolute address in memory.
@item PREVABS
        Remembers state of the @code{ABSOLUTE} flag before entering
        relocated-org mode (@code{IN_RORG}). So it can be restored later.
@item IN_RORG
        Section has entered relocated-org mode, which also sets the
        @code{ABSOLUTE} flag. In this mode code is written into the current
        section, but relocated to an absolute address. No relocation
        information is generated.
@item NEAR_ADDRESSING
        Section is marked as suitable for cpu-specific "near" addressing
        modes. For example, base-register relative. The cpu backend can use
        this information as an optimization hint when referencing symbols
        from this section.
@end table

@item  taddr org;
        Start address of a section. Usually zero.

@item  taddr pc;
        Current address in this section. Can be used
        while traversing through the section. Has to be updated by a
        module using it. Is set to @code{org} at the beginning.

@item   unsigned long idx;
        A member usable by the output module for private purposes.

@end table
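
As an illustration of how an output module might walk the section list,
the sketch below numbers all allocated sections through @code{idx} and
resets @code{pc} to @code{org}. It is only a sketch under assumptions:
the structure is a reduced stand-in, @code{taddr} is replaced by
@code{long}, the value given for @code{UNALLOCATED} is a placeholder, and
@code{number_sections()} is a hypothetical function.

@example
#include <stdio.h>
#include <stdint.h>

typedef long taddr;             /* stand-in for the real taddr type */
#define UNALLOCATED (1<<2)      /* placeholder value, use the symbol */

/* Reduced stand-in for the section structure. */
typedef struct section @{
  struct section *next;
  char *name;
  uint32_t flags;
  taddr org;
  taddr pc;
  unsigned long idx;
@} section;

/* Hypothetical: assign an index to each section which occupies space
   in the output file and return the number of such sections. */
static unsigned long number_sections(section *first)
@{
  unsigned long n = 0;
  section *sec;

  for (sec = first; sec != NULL; sec = sec->next) @{
    if (sec->flags & UNALLOCATED)
      continue;                 /* offset sections are not written */
    sec->idx = n++;             /* private use by the output module */
    sec->pc = sec->org;         /* start traversing at the beginning */
  @}
  return n;
@}

int main(void)
@{
  section bss = @{ NULL, "bss", 0, 0, 0, 0 @};
  section offs = @{ &bss, "offsets", UNALLOCATED, 0, 0, 0 @};
  section code = @{ &offs, "code", 0, 0x1000, 0, 0 @};

  printf("%lu sections\n", number_sections(&code));
  return 0;
@}
@end example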

@subsection Symbols

Symbols are represented by a linked list of type @code{symbol} with the
following members that can be accessed by the modules:

@table @code

@item  int type;
        Type of the symbol. Available are:
@table @code
@item #define LABSYM 1
        The symbol is a label defined at a specific location.

@item #define IMPORT 2
        The symbol is imported from another file.

@item #define EXPRESSION 3
        The symbol is defined using an expression.
@end table

@item  uint32_t flags;
        Flags of this symbol. Available are:
@table @code
@item #define TYPE_UNKNOWN  0
        The symbol has no type information.

@item #define TYPE_OBJECT   1
        The symbol defines an object.

@item #define TYPE_FUNCTION 2
        The symbol defines a function.

@item #define TYPE_SECTION  3
        The symbol defines a section.

@item #define TYPE_FILE     4
      The symbol defines a file.

@item #define EXPORT (1<<3)
        The symbol is exported to other files.

@item #define INEVAL (1<<4)
        Used internally.

@item #define COMMON (1<<5)
        The symbol is a common symbol.

@item #define WEAK (1<<6)
        The symbol is weak, which means the linker may overwrite it with
        any global definition of the same name. Weak symbols may also stay
        undefined, in which case the linker would assign them a value of
        zero.

@item #define LOCAL (1<<7)
        Only informational. A symbol can be explicitly declared as local
        by a syntax-module directive.

@item #define VASMINTERN (1<<8)
        Vasm-internal symbol, which is usually not exported into an output
        file.

@item #define PROTECTED (1<<9)
        Used internally to protect the current-PC symbol from deletion.

@item #define REFERENCED (1<<10)
        Symbol was referenced in the source and a relocation entry has
        been created.

@item #define ABSLABEL (1<<11)
        Label was defined inside an absolute section, or during
        relocated-org mode. So it has an absolute address and will not
        generate a relocation entry when being referenced.

@item #define EQUATE (1<<12)
        Symbols flagged as @code{EQUATE} are constant and their values must
        not be changed.

@item #define REGLIST (1<<13)
        Symbol is a register list definition.

@item #define USED (1<<14)
        Symbol appeared in an expression. Symbols which were only defined
        (as label or equate) and never used throughout the whole source
        don't get this flag set.

@item #define NEAR (1<<15)
        Symbol may be referenced by "near" addressing mode. For example,
        base register relative. Used as an optimization hint in the cpu
        backend.

@item #define RSRVD_S (1L<<24)
        The range from bit 24 to 27 (counted from the LSB) is reserved for
        use by the syntax module.

@item #define RSRVD_O (1L<<28)
        The range from bit 28 to 31 (counted from the LSB) is reserved for
        use by the output module.
@end table

The type-flags can be extracted using the @code{TYPE()} macro which
expects a pointer to a symbol as argument.

@item  char *name;
        The name of the symbol.

@item   expr *expr;
        The expression in case of @code{EXPRESSION} symbols.

@item   expr *size;
        The size of the symbol, if specified.

@item  section *sec;
        The section a @code{LABSYM} symbol is defined in.

@item  taddr pc;
        The address of a @code{LABSYM} symbol.

@item  taddr align;
        The alignment of the symbol in bytes.

@item  unsigned long idx;
        A member usable by the output module for private purposes.

@end table
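
The following sketch shows how the symbol type, the type-flags and the
remaining flags might be tested together. It is an illustration under
assumptions: the structure is a reduced stand-in, the constants are
copied from the lists above only to make the example self-contained
(real code must use the symbolic names from the headers), and the
definition given for @code{TYPE()} is a guess based on the type-flags
occupying the lowest three bits.

@example
#include <stdio.h>
#include <stdint.h>

#define LABSYM        1
#define EXPRESSION    3
#define TYPE_OBJECT   1
#define TYPE_FUNCTION 2
#define EXPORT        (1<<3)

/* Reduced stand-in for the symbol structure. */
typedef struct symbol @{
  int type;
  uint32_t flags;
  char *name;
@} symbol;

/* Assumed definition; the real macro is provided by vasm. */
#define TYPE(sym) ((sym)->flags & 7)

/* Hypothetical helper: is this an exported function label? */
static int is_exported_function(const symbol *sym)
@{
  return sym->type == LABSYM
         && TYPE(sym) == TYPE_FUNCTION
         && (sym->flags & EXPORT) != 0;
@}

int main(void)
@{
  symbol f = @{ LABSYM, TYPE_FUNCTION | EXPORT, "main" @};
  symbol v = @{ EXPRESSION, TYPE_OBJECT, "size" @};

  printf("%s: %d\n", f.name, is_exported_function(&f));
  printf("%s: %d\n", v.name, is_exported_function(&v));
  return 0;
@}
@end example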

@subsection Register symbols

Optional register symbols are available when the backend defines
@code{HAVE_REGSYMS} in @file{cpu.h} together with the hash table size.
Example:
@example
#define HAVE_REGSYMS
#define REGSYMHTSIZE 256
@end example

A register symbol is defined by an object of type @code{regsym}
with the following members that can be accessed by the modules:

@table @code
@item char *reg_name;
      Symbol name.
@item int reg_type;
      Optional type of register.
@item unsigned int reg_flags;
      Optional register symbol flags.
@item unsigned int reg_num;
      Register number or value.
@end table

Refer to @file{symbol.h} for functions to create and find register
symbols.

@subsection Atoms

The contents of each section are a linked list built out of non-separable
atoms. The general structure of an atom is:

@example
typedef struct atom @{
  struct atom *next;
  int type;
  taddr align;
  taddr lastsize;
  unsigned changes;
  source *src;
  int line;
  listing *list;
  union @{
    instruction *inst;
    dblock *db;
    symbol *label;
    sblock *sb;
    defblock *defb;
    void *opts;
    int srcline;
    char *ptext;
    printexpr *pexpr;
    expr *roffs;
    taddr *rorg;
    assertion *assert;
    aoutnlist *nlist;
  @} content;
@} atom;
@end example

The members have the following meaning:

@table @code
@item  struct atom *next;
Pointer to the following atom (0 if last).

@item  int type;
The type of the atom. Can be one of
@table @code
@item #define LABEL 1
A label is defined here.

@item #define DATA  2
Some data bytes of fixed length and constant data are put here.

@item #define INSTRUCTION 3
Generally refers to a machine instruction or pseudo/opcode. These atoms
can change length during optimization passes and will be translated to
@code{DATA}-atoms later.

@item #define SPACE 4
Defines a block of data filled with one value (byte). BSS sections usually
contain only such atoms, but they are also sometimes useful as shorter
versions of @code{DATA}-atoms in other sections.

@item #define DATADEF 5
Defines data of fixed size which can contain cpu specific operands and
expressions. Will be translated to @code{DATA}-atoms later.

@item #define LINE 6
A source text line number (usually from a high level language) is bound
to the atom's address. Useful for source level debugging in certain ABIs.

@item #define OPTS 7
A means to change assembler options at a specific source text line.
For example optimization settings, or the cpu type to generate code for.
The cpu module has to define @code{HAVE_CPU_OPTS} and export the required
functions if it wants to use this type of atom.

@item #define PRINTTEXT 8
A string is printed to stdout during the final assembler pass. A newline
is automatically appended.

@item #define PRINTEXPR 9
Prints the value of an expression during the final assembler pass to stdout.

@item #define ROFFS 10
Set the program counter to an address relative to the section's start
address. These atoms will be translated into @code{SPACE} atoms in the
final pass.

@item #define RORG 11
Assemble this block under the given base address, while the code is still
written into the original memory region.

@item #define RORGEND 12
Ends a RORG block and returns to the original addressing.

@item #define ASSERT 13
The assertion expression is checked in the final pass and an error message
is generated (using the expression string and an optional message out of
this atom) when it evaluates to 0.

@item #define NLIST 14
Defines a stab-entry for the a.out object file format. nlist-style stabs
can also occur embedded in other object file formats, like ELF.
@end table

@item taddr align;
The alignment of this atom. Address must be divisible by @code{align}.

@item taddr lastsize;
The size of this atom in the last resolver pass. When the size has
changed in the current pass, the assembler will request another resolver
run through the section.

@item unsigned changes;
Number of changes in the size of this atom since pass number
@code{FASTOPTPHASE}. An increasing number usually indicates a problem in
the cpu backend's optimizer and will be flagged by setting
@code{RESOLVE_WARN} in the section flags, as soon as @code{changes} exceeds
@code{MAXSIZECHANGES}. So the backend can choose not to optimize this atom
as aggressively as before.

@item source *src;
Pointer to the source text object to which this atom belongs.

@item  int line;
The source line number that created this atom.

@item listing *list;
Pointer to the listing object to which this atom belongs.

@item    instruction *inst;
(In union @code{content}.) Pointer to an instruction structure in the case
of an @code{INSTRUCTION}-atom. Contains the following elements:
@table @code
@item  int code;
The cpu specific code of this instruction.

@item  char *qualifiers[MAX_QUALIFIERS];
(If @code{MAX_QUALIFIERS!=0}.) Pointer to the qualifiers of this instruction.

@item  operand *op[MAX_OPERANDS];
(If @code{MAX_OPERANDS!=0}.) The cpu-specific operands of this instruction.

@item  instruction_ext ext;
(If the cpu module defines @code{HAVE_INSTRUCTION_EXTENSION}.)
A cpu-module-specific structure. Typically used to store appropriate
opcodes, allowed addressing modes, supported cpu derivatives etc.
@end table

@item    dblock *db;
(In union @code{content}.) Pointer to a dblock structure in the case
of a @code{DATA}-atom. Contains the following elements:
@table @code
@item  taddr size;
The number of bytes stored in this atom.

@item  char *data;
A pointer to the data.

@item  rlist *relocs;
A pointer to relocation information for the data.
@end table

@item    symbol *label;
(In union @code{content}.) Pointer to a symbol structure in the case
of a @code{LABEL}-atom.

@item    sblock *sb;
(In union @code{content}.) Pointer to a sblock structure in the case
of a @code{SPACE}-atom. Contains the following elements:
@table @code
@item  taddr space;
The size of the empty/filled space in bytes.

@item expr *space_exp;
The above size as an expression, which will be evaluated during assembly
and copied to @code{space} in the final pass.

@item  int size;
The size of each space-element and of the fill-pattern in bytes.

@item  unsigned char fill[MAXBYTES];
The fill pattern, up to @code{MAXBYTES} bytes.

@item expr *fill_exp;
Optional. Evaluated and copied to @code{fill} in the final pass, when not null.

@item rlist *relocs;
A pointer to relocation information for the space.

@item taddr maxalignbytes;
An optional number of maximum padding bytes to fulfil the atom's alignment
requirement. Zero means there is no restriction.
@end table

@item    defblock *defb;
(In union @code{content}.) Pointer to a defblock structure in the case
of a @code{DATADEF}-atom. Contains the following elements:
@table @code
@item  taddr bitsize;
The size of the definition in bits.

@item  operand *op;
Pointer to a cpu-specific operand structure.

@end table

@item    void *opts;
(In union @code{content}.) Points to a cpu module specific options object
in the case of an @code{OPTS}-atom.

@item    int srcline;
(In union @code{content}.) Line number for source level debugging in the
case of a @code{LINE}-atom.

@item    char *ptext;
(In union @code{content}.) A string to print to stdout in case of a
@code{PRINTTEXT}-atom.

@item    printexpr *pexpr;
(In union @code{content}.) Pointer to a printexpr structure in the case of
a @code{PRINTEXPR}-atom. Contains the following elements:
@table @code
@item expr *print_exp;
Pointer to an expression to evaluate and print.

@item short type;
Format type of the printed value. We can print as hexadecimal
(@code{PEXP_HEX}), signed decimal (@code{PEXP_SDEC}),
unsigned decimal (@code{PEXP_UDEC}), binary (@code{PEXP_BIN}) or
ASCII (@code{PEXP_ASC}).

@item short size;
Size (precision) of the printed value in bits. Excessive bits will be
masked out, and sign-extended when requested.
@end table

@item    expr *roffs;
(In union @code{content}.) The expression holds the relative section offset
to align to in case of a @code{ROFFS}-atom.

@item    taddr *rorg;
(In union @code{content}.) Assemble the code under the base address in
@code{rorg} in case of a @code{RORG}-atom.

@item    assertion *assert;
(In union @code{content}.) Pointer to an assertion structure in the case of
an @code{ASSERT}-atom. Contains the following elements:
@table @code
@item expr *assert_exp;
Pointer to an expression which should evaluate to non-zero.

@item char *exprstr;
Pointer to the expression as text (to be used in the output).

@item char *msgstr;
Pointer to the message, which would be printed when @code{assert_exp} evaluates
to zero.
@end table

@item    aoutnlist *nlist;
(In union @code{content}.) Pointer to an nlist structure, describing an
aout stab entry, in case of an @code{NLIST}-atom. Contains the following
elements:
@table @code
@item char *name;
Name of the stab symbol.
@item int type;
Symbol type. Refer to @code{stabs.h} for definitions.
@item int other;
Defines the nature of the symbol (function, object, etc.).
@item int desc;
Debugger information.
@item expr *value;
Symbol's value.
@end table

@end table
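
A module typically walks the atom list of a section and dispatches on
@code{type} to pick the valid member of the @code{content} union. The
sketch below illustrates this for @code{LABEL} and @code{DATA} atoms
only. It is an assumption-laden example: the structures are reduced
stand-ins, @code{taddr} is replaced by @code{long}, the two type values
are copied from the list above for self-containment, and
@code{count_data_bytes()} is a hypothetical function.

@example
#include <stdio.h>

typedef long taddr;               /* stand-in for the real taddr type */
#define LABEL 1                   /* copied from above; use the */
#define DATA  2                   /* symbolic names in real code */

typedef struct symbol @{ char *name; @} symbol;
typedef struct dblock @{ taddr size; char *data; @} dblock;

/* Reduced stand-in for the atom structure. */
typedef struct atom @{
  struct atom *next;
  int type;
  union @{
    dblock *db;
    symbol *label;
  @} content;
@} atom;

/* Hypothetical: print all labels, return the sum of DATA atom sizes. */
static taddr count_data_bytes(const atom *a)
@{
  taddr total = 0;

  for (; a != NULL; a = a->next) @{
    switch (a->type) @{
      case LABEL:
        printf("label %s\n", a->content.label->name);
        break;
      case DATA:
        total += a->content.db->size;
        break;
      default:                    /* other atom types are ignored here */
        break;
    @}
  @}
  return total;
@}

int main(void)
@{
  symbol s = @{ "start" @};
  dblock d = @{ 4, "\x01\x02\x03\x04" @};
  atom a2 = @{ NULL, DATA, @{ NULL @} @};
  atom a1 = @{ &a2, LABEL, @{ NULL @} @};

  a1.content.label = &s;
  a2.content.db = &d;
  printf("%ld data bytes\n", (long)count_data_bytes(&a1));
  return 0;
@}
@end example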

@subsection Relocations

@code{DATA} and @code{SPACE} atoms can have a relocation list attached
that describes how this data must be modified when linking/relocating.
They always refer to the data in this atom only.

There are a number of predefined standard relocations and it is possible
to add other cpu-specific relocations. Note however, that it is always
preferable to use standard relocations, if possible. Chances that an
output module supports a certain relocation are much higher if it is a
standard relocation.

A relocation list uses this structure:

@example
typedef struct rlist @{
  struct rlist *next;
  void *reloc;
  int type;
@} rlist;
@end example

The member @code{type} identifies the relocation type. All the standard
relocations have type numbers between @code{FIRST_STANDARD_RELOC} and
@code{LAST_STANDARD_RELOC}. Consult @file{reloc.h} to see which
standard relocations are available.

The detailed information can be accessed
via the pointer @code{reloc}. It will point to a structure that depends
on the relocation type, so a module must only use it if it knows the
relocation type.

All standard relocations point to a type @code{nreloc} with the following
members:
@table @code
@item  size_t byteoffset;
Offset in bytes, from the start of the current @code{DATA} atom, to the
beginning of the relocation field. This may also be the address which is
used as a basis for PC-relative relocations. Or a common basis for several
separated relocation fields, which will be translated into a single
relocation type by the output module.

@item  size_t bitoffset;
Offset in bits to the beginning of the relocation field, adds to
@code{byteoffset*bitsperbyte}. Bits are counted in a bit-stream from lower
to higher address bytes. But note, that inside a little-endian byte they
are counted from the LSB to the MSB, while they are counted from the MSB to
the LSB for big-endian targets.

@item  int size;
The size of the relocation field in bits.

@item  taddr mask;
The mask defines which portion of the relocated value is set by this
relocation field.

@item taddr addend;
Value to be added to the symbol value.

@item  symbol *sym;
The symbol referred to by this relocation.

@end table

To describe the meaning of these entries, we will define the steps that
shall be executed when performing a relocation:

@enumerate 1
@item Extract the @code{size} bits from the data atom, starting with bit
        number @code{byteoffset*bitsperbyte+bitoffset}. We start counting
        bits from the lowest to the highest numbered byte in memory.
        Inside a big-endian byte we count from the MSB to the LSB. Inside
        a little-endian byte we count from the LSB to the MSB.

@item Determine the relocation value of the symbol. For a simple absolute
        relocation, this will be the value of the symbol @code{sym} plus
        the @code{addend}. For other relocation types, more complex
        calculations will be needed.
        For example, in a program-counter relative relocation,
        the value will be obtained by subtracting the address of the data
        atom plus @code{byteoffset} from the value
        of @code{sym} plus @code{addend}.

@item Calculate the bit-wise "and" of the value obtained in the step above
        and the @code{mask} value.

@item Normalize, i.e. shift the value above right as many bit positions as
        there are low order zero bits in @code{mask}.

@item Add this value to the value extracted in step 1.

@item Insert the low order @code{size} bits of this value into the data atom
        starting with bit @code{byteoffset*bitsperbyte+bitoffset}.
@end enumerate
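
The six steps can be sketched in C for the simple case of an absolute
relocation on a big-endian target with 8 bits per byte. This is an
illustration, not vasm code: all function names are hypothetical, the
symbol value is passed in directly instead of being taken from
@code{sym}, and 64-bit integers stand in for @code{taddr}. The example
patches the upper 16 bits of an address into the last two bytes of a
four-byte instruction word.

@example
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Read size bits from bit position pos, counting from the MSB of the
   lowest byte (big-endian bit-stream). */
static uint64_t get_bits(const unsigned char *d, size_t pos, int size)
@{
  uint64_t v = 0;
  int i;

  for (i = 0; i < size; i++, pos++)
    v = (v << 1) | ((d[pos / 8] >> (7 - pos % 8)) & 1);
  return v;
@}

/* Write the low order size bits of v at bit position pos. */
static void put_bits(unsigned char *d, size_t pos, int size, uint64_t v)
@{
  int i;

  for (i = size - 1; i >= 0; i--, pos++) @{
    unsigned char bit = (unsigned char)(0x80 >> (pos % 8));

    if ((v >> i) & 1)
      d[pos / 8] |= bit;
    else
      d[pos / 8] &= (unsigned char)~bit;
  @}
@}

static void apply_abs_reloc(unsigned char *data, size_t byteoffset,
                            size_t bitoffset, int size, uint64_t mask,
                            uint64_t symval, int64_t addend)
@{
  size_t pos = byteoffset * 8 + bitoffset;
  uint64_t field = get_bits(data, pos, size);   /* step 1 */
  uint64_t val = symval + (uint64_t)addend;     /* step 2 */

  val &= mask;                                  /* step 3 */
  while (mask != 0 && !(mask & 1)) @{            /* step 4 */
    mask >>= 1;
    val >>= 1;
  @}
  field += val;                                 /* step 5 */
  put_bits(data, pos, size, field);             /* step 6 */
@}

int main(void)
@{
  unsigned char data[4] = @{ 0x3c, 0x60, 0x00, 0x00 @};

  /* 16-bit field at byte 2, receiving bits 16..31 of the address */
  apply_abs_reloc(data, 2, 0, 16, 0xffff0000, 0x12345678, 0);
  printf("%02x %02x %02x %02x\n", data[0], data[1], data[2], data[3]);
  return 0;
@}
@end example

With the values above the program prints @code{3c 60 12 34}: the masked
value @code{0x12340000} is shifted right by 16 bits and added to the
empty field.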


@subsection Errors

Each module can provide a list of possible error messages contained
e.g. in @file{syntax_errors.h} or @file{cpu_errors.h}. They are a
comma-separated list of a printf-format string and error flags. Allowed
flags are @code{WARNING}, @code{ERROR}, @code{FATAL}, @code{MESSAGE} and
@code{NOLINE}.
They can be combined using or (@code{|}). @code{NOLINE} has to be set for
error messages during initialization or while writing the output, when
no source text is available. Errors cause the assembler to return false.
@code{FATAL} causes the assembler to terminate
immediately.

The errors can be emitted using the function @code{syntax_error(int n,...)},
@code{cpu_error(int n,...)} or @code{output_error(int n,...)}. The first
argument is the number of the error message (starting from zero). Additional
arguments must be passed according to the format string of the
corresponding error message.
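
As an illustration, an error list and the matching calls could look like
the sketch below. The message texts, their numbering and the variables
@code{c} and @code{name} are hypothetical; the trailing commas assume
that the file is included into an array initializer.

@example
/* syntax_errors.h (hypothetical entries) */
  "unexpected character '%c'",ERROR,                         /* 0 */
  "directive %s ignored",WARNING,                            /* 1 */
  "could not initialize syntax module",FATAL|ERROR|NOLINE,   /* 2 */

/* syntax.c: the arguments follow the respective format string */
  syntax_error(0,c);       /* c is the offending character */
  syntax_error(1,name);    /* name is the directive's name */
@end example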

@section Syntax modules

A new syntax module must have its own subdirectory under @file{vasm/syntax}.
At least the files @file{syntax.h}, @file{syntax.c} and @file{syntax_errors.h}
must be written.

@subsection The file @file{syntax.h}

@table @code

@item #define ISIDSTART(x)/ISIDCHAR(x)
These macros should return non-zero if and only if the argument is a
valid character to start an identifier or a valid character inside an
identifier, respectively.
@code{ISIDCHAR} must be a superset of @code{ISIDSTART}.

@item #define ISBADID(p,l)
Even with @code{ISIDSTART} and @code{ISIDCHAR} checked, there may be
combinations of characters which do not form a valid identifier (for
example, a single character). This macro returns non-zero when this is
the case. The first argument is a pointer to the new identifier and the
second is its length.

@item #define ISEOL(x)
This macro returns true when the string pointed to by @code{x} starts with
either a comment character or end-of-line.

@item #define CHKIDEND(s,e) chkidend((s),(e))
Defines an optional function to be called at the end of the identifier
recognition process. It allows you to adjust the length of the identifier
by returning a modified @code{e}. Default is to return @code{e}. The
function is defined as @code{char *chkidend(char *startpos,char *endpos)}.

@item #define BOOLEAN(x) -(x)
Defines the result of boolean operations. Usually this is @code{(x)}, as
in C, or @code{-(x)} to return -1 for True.

@item #define NARGSYM "NARG"
Defines the name of an optional symbol which contains the number of
arguments in a macro.

@item #define CARGSYM "CARG"
Defines the name of an optional symbol which can be used to select a
specific macro argument with @code{\.}, @code{\+} and @code{\-}.

@item #define REPTNSYM "REPTN"
Defines the name of an optional symbol containing the counter of the
current repeat iteration.

@item #define EXPSKIP() s=exp_skip(s)
Defines an optional replacement for @code{skip()} to be used in @file{expr.c}, to skip
blanks in an expression. Useful to forbid blanks in an expression and to
ignore the rest of the line (e.g. to treat the rest as comment). The
function is defined as @code{char *exp_skip(char *stream)}.

@item #define IGNORE_FIRST_EXTRA_OP 1
Should be defined when the syntax module wants to ignore the operand field
on instructions without an operand. Useful, when everything following
an operand should be regarded as comment, without a comment character.

@item #define MAXMACPARAMS 35
Optionally defines the maximum number of macro arguments, if you need more than
the default number of 9.

@item #define SKIP_MACRO_ARGNAME(p) skip_identifier(p)
An optional function to skip a named macro argument in the macro
definition.
Argument is the current source stream pointer.
The default is to skip an identifier.

@item #define MACRO_ARG_OPTS(m,n,a,p) NULL
An optional function to parse and skip options, default values and
qualifiers for each macro argument. Returns @code{NULL} when no argument
options have been found.
Arguments are:
  @table @code
    @item struct macro *m;
      Pointer to the macro structure being currently defined.
    @item int n;
      Argument index, starting with zero.
    @item char *a;
      Name of this argument.
    @item char *p;
      Current source stream pointer. An updated pointer will be returned.
  @end table
Defaults to unused.

@item #define MACRO_ARG_SEP(p) (*p==',' ? skip(p+1) : NULL)
An optional function to skip a separator between the macro argument
names in the macro definition. Returns NULL when no valid separator is
found.
Argument is the current source stream pointer.
Defaults to using comma as the only valid separator.

@item #define MACRO_PARAM_SEP(p) (*p==',' ? skip(p+1) : NULL)
An optional function to skip a separator between the macro parameters
in a macro call. Returns NULL when no valid separator is found.
Argument is the current source stream pointer.
Defaults to using comma as the only valid separator.

@item #define EXEC_MACRO(s)
An optional function to be called just before a macro starts execution.
Parameters and qualifiers are already parsed.
Argument is the @code{source} pointer of the new macro.
Defaults to unused.

@end table
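
As an illustration only, the identifier-related macros of a @file{syntax.h}
for a hypothetical syntax with @code{;} comments might look like the sketch
below. The set of valid characters, the rule in @code{ISBADID} and the local
stand-in for @code{commentchar} are assumptions made for this example and are
not taken from an existing syntax module.

```c
#include <ctype.h>

/* Stand-in for the comment character normally defined in syntax.c. */
static const char commentchar = ';';

/* Identifiers start with a letter, '_' or '.' (assumed rule). */
#define ISIDSTART(x) ((x)=='.' || (x)=='_' || isalpha((unsigned char)(x)))

/* Digits may follow; defined on top of ISIDSTART so it is a superset. */
#define ISIDCHAR(x)  (ISIDSTART(x) || isdigit((unsigned char)(x)))

/* Reject a lone '.' or '_' as an identifier (hypothetical rule). */
#define ISBADID(p,l) ((l)==1 && (*(p)=='.' || *(p)=='_'))

/* True at the end of the line or at the start of a comment. */
#define ISEOL(x)     (*(x)=='\0' || *(x)=='\n' || *(x)==commentchar)

/* Boolean operations yield -1 for true. */
#define BOOLEAN(x)   -(x)
```

Building @code{ISIDCHAR} on top of @code{ISIDSTART} is one simple way to
guarantee the superset requirement mentioned above.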

@subsection The file @file{syntax.c}

A syntax module has to provide the following elements (all other functions
should be @code{static} to prevent name clashes):

@table @code

@item char *syntax_copyright;
A string that will be emitted as part of the copyright message.

@item hashtable *dirhash;
A pointer to the hash table with all directives.

@item char commentchar;
A character used to introduce a comment until the end of the line.

@item char *defsectname;
Name of a default section which vasm creates when a label or code occurs
in the source, but the programmer forgot to specify a section. Assigning
NULL means that there is no default and vasm will show an error in this
case.

@item char *defsecttype;
Type of the default section (see above). May be NULL.

@item int init_syntax();
Will be called during startup, after argument parsing. Must return zero if
initializations failed, non-zero otherwise.

@item int syntax_args(char *);
This function will be called with the command line arguments (unless they
were already recognized by other modules). If an argument was recognized,
return non-zero.

@item char *skip(char *);
A function to skip whitespace etc.

@item char *skip_operand(char *);
A function to skip an instruction's operand. Will terminate at end of line
or the next comma, returning a pointer to the rest of the line behind
the comma.

@item void eol(char *);
This function should check that the argument points to the end of a line
(only comments or whitespace following). If not, an error or warning
message should be emitted.

@item char *const_prefix(char *,int *);
Check if the first argument points to the start of a constant. If yes
return a pointer to the real start of the number (i.e. skip a prefix
that may indicate the base) and write the base of the number through the
pointer passed as second argument. Return zero if it does not point to a
number.

@item char *const_suffix(char *,char *);
First argument points to the start of the constant (including prefix) and
the second argument to first character after the constant (excluding suffix).
Checks for a constant-suffix and skips it. Return pointer to the first
character after that constant. Example: constants with a 'h' suffix to
indicate a hexadecimal base.

@item void parse(void);
This is the main parsing function. It has to read lines via
the @code{read_next_line()} function, parse them and create sections,
atoms and symbols. Pseudo directives are usually handled by the syntax
module. Instructions can be parsed by the cpu module using
@code{parse_instruction()}.

@item char *parse_macro_arg(struct macro *,char *,struct namelen *,struct namelen *);
Called to parse a macro parameter by using the source stream pointer in
the second argument. The start pointer and length of a single passed
parameter is written to the first @code{struct namelen}, while the optionally
selected named macro argument is passed in the second @code{struct namelen}.
When the @code{len} field of the second @code{namelen} is zero, then the
argument is selected by position instead of by name. Returns the updated
source stream pointer after successful parsing.

@item int expand_macro(source *,char **,char *,int);
Expand parameters and special commands inside a macro source. The second
argument is a pointer to the current source stream pointer, which is
updated on any successful expansion. The function will return the
number of characters written to the destination buffer (third argument)
in this case. Returning @code{-1} means: no expansion took place.
The last argument defines the space in characters which is left in the
destination buffer.

@item char *get_local_label(char **);
Gets a pointer to the current source pointer. Has to check if a valid
local label is found at this point. If yes return a pointer to the
vasm-internal symbol name representing the local label and update
the current source pointer to point behind the label.

Have a look at the support functions provided by the frontend to help.

@end table
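
To make the contracts of @code{skip()} and @code{const_prefix()} more
concrete, here is a minimal sketch. The prefixes it accepts (@code{$} and
@code{0x} for hexadecimal, @code{%} for binary) are an assumption chosen for
the example; a real syntax module decides for itself which notations it
supports.

```c
#include <ctype.h>
#include <stddef.h>

/* Skip blanks and tabs. */
char *skip(char *s)
{
    while (*s == ' ' || *s == '\t')
        s++;
    return s;
}

/* Accepts "$1f" and "0x1f" (hex), "%101" (binary) and plain decimal
   numbers. Returns a pointer to the first digit and stores the base,
   or returns NULL when s does not point to a constant. */
char *const_prefix(char *s, int *base)
{
    if (s[0] == '$' && isxdigit((unsigned char)s[1])) {
        *base = 16;
        return s + 1;
    }
    if (s[0] == '%' && (s[1] == '0' || s[1] == '1')) {
        *base = 2;
        return s + 1;
    }
    if (s[0] == '0' && (s[1] == 'x' || s[1] == 'X')
        && isxdigit((unsigned char)s[2])) {
        *base = 16;
        return s + 2;
    }
    if (isdigit((unsigned char)s[0])) {
        *base = 10;
        return s;
    }
    return NULL;
}
```

Checking the character after the prefix keeps a lone @code{$} or @code{%}
from being mistaken for the start of a number.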

@section CPU modules

A new cpu module must have its own subdirectory under @file{vasm/cpus}.
At least the files @file{cpu.h}, @file{cpu.c} and @file{cpu_errors.h}
must be written.

@subsection The file @file{cpu.h}

A cpu module has to provide the following elements (all other functions
should be @code{static} to prevent name clashes) in @code{cpu.h}:

@table @code
@item #define MAX_OPERANDS 3
Maximum number of operands of one instruction.

@item #define MAX_QUALIFIERS 0
Maximum number of mnemonic-qualifiers per mnemonic.

@item #define NO_MACRO_QUALIFIERS
Define this, when qualifiers shouldn't be allowed for macros. For some
architectures, like ARM, macro qualifiers make no sense.

@item typedef int32_t taddr;
Data type to represent a target-address. Preferably use the ones from
@file{stdint.h}.

@item typedef uint32_t utaddr;
Unsigned data type to represent a target-address.

@item #define LITTLEENDIAN 1
@itemx #define BIGENDIAN 0
Define these according to the target endianness. For CPUs which support both
big-endian and little-endian modes, you may assign a global variable here. So be aware of
it, and never use @code{#if BIGENDIAN}, but always @code{if(BIGENDIAN)} in
your code.

@item #define VASM_CPU_<cpu> 1
Insert the cpu specifier.

@item #define INST_ALIGN 2
Minimum instruction alignment.

@item #define DATA_ALIGN(n) ...
Default alignment for @code{n}-bit data. Can also be a function.

@item #define DATA_OPERAND(n) ...
Operand class for n-bit data definitions. Can also be a function.
Negative values denote a floating point data definition of -n bits.

@item typedef ... operand;
Structure to store an operand.

@item typedef ... mnemonic_extension;
Mnemonic extension.
@end table
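
Put together, the required part of a @file{cpu.h} could look roughly like the
following sketch for a hypothetical 32-bit little-endian target named
@code{DEMO}. All concrete values, the @code{OP_DATA} operand class and the
layout of @code{operand} and @code{mnemonic_extension} are assumptions for
illustration; existing backends define these quite differently.

```c
#include <stdint.h>

#define MAX_OPERANDS 2
#define MAX_QUALIFIERS 0

typedef int32_t taddr;
typedef uint32_t utaddr;

#define LITTLEENDIAN 1
#define BIGENDIAN 0
#define VASM_CPU_DEMO 1

#define INST_ALIGN 2

/* Align n-bit data to its size in bytes, but to 4 bytes at most. */
#define DATA_ALIGN(n) ((n) <= 8 ? 1 : ((n) <= 16 ? 2 : 4))

/* One operand class for all data definitions (hypothetical value). */
#define OP_DATA 1
#define DATA_OPERAND(n) OP_DATA

/* An operand: a type code plus an opaque pointer which stands in
   for the expression attached to the operand. */
typedef struct {
    int type;
    void *value;
} operand;

/* Per-mnemonic data: here just a base opcode. */
typedef struct {
    uint16_t opcode;
} mnemonic_extension;
```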

Optional features, which can be enabled by defining the following macros:

@table @code
@item #define HAVE_INSTRUCTION_EXTENSION 1
If cpu-specific data should be added to all instruction atoms.

@item typedef ... instruction_ext;
Type for the above extension.

@item #define NEED_CLEARED_OPERANDS 1
Backend requires a zeroed operand structure when calling @code{parse_operand()}
for the first time. Defaults to undefined.

@item START_PARENTH(x)
Valid opening parenthesis for instruction operands. Defaults to @code{'('}.

@item END_PARENTH(x)
Valid closing parenthesis for instruction operands. Defaults to @code{')'}.

@item #define MNEMONIC_VALID(i)
An optional function with the arguments @code{(int idx)}. Returns true
when the mnemonic with index @code{idx} is valid for the current state of
the backend (e.g. it is available for the selected cpu architecture).

@item #define MNEMOHTABSIZE 0x4000
You can optionally overwrite the default hash table size defined in
@file{vasm.h}. May be necessary for larger mnemonic tables.

@item #define OPERAND_OPTIONAL(p,t)
When defined, this is a function with the arguments
@code{(operand *op,int type)}, which returns true when the given operand
type (@code{type}) is optional. The function is only called for missing
operands and should also initialize @code{op} with default values (e.g. 0).
@end table
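
As a sketch of how @code{OPERAND_OPTIONAL} might be used, the following
example lets a missing shift-count operand default to zero. The
@code{operand} structure, the operand type codes and the helper name
@code{operand_optional} are hypothetical and exist only to keep the example
self-contained.

```c
/* Stand-in operand structure and operand types for this sketch. */
typedef struct {
    int type;
    long value;
} operand;

enum { OP_REG = 1, OP_IMM, OP_SHIFT };

/* A missing shift-count operand is treated as a shift by zero.
   All other operand types remain mandatory. */
static int operand_optional(operand *op, int type)
{
    if (type == OP_SHIFT) {
        op->type = OP_SHIFT;
        op->value = 0;
        return 1;
    }
    return 0;
}

#define OPERAND_OPTIONAL(p,t) operand_optional((p),(t))
```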

Implementing additional target-specific unary operations is done by defining
the following optional macros:

@table @code
@item #define EXT_UNARY_NAME(s)
Should return True when the string in @code{s} points to an operation name
we want to handle.

@item #define EXT_UNARY_TYPE(s)
Returns the operation type code for the string in @code{s}. Note that the
last valid standard operation is defined as @code{LAST_EXP_TYPE}, so the
target-specific types will start with @code{LAST_EXP_TYPE+1}.

@item #define EXT_UNARY_EVAL(t,v,r,c)
Defines a function with the arguments @code{(int t, taddr v, taddr *r, int c)}
to handle the operation type @code{t} returning an @code{int} to indicate
whether this type has been handled or not. Your operation will be applied to
the value @code{v} and the result is stored in @code{*r}. The flag @code{c}
is passed as 1 when the value is constant (no relocatable addresses involved).

@item #define EXT_FIND_BASE(b,e,s,p)
Defines a function with the arguments
@code{(symbol **b, expr *e, section *s, taddr p)}
to save a pointer to the base symbol of expression @code{e} into the
symbol pointer, pointed to by @code{b}. The type of this base is given
by an @code{int} return code. Further on, @code{e->type} has to be checked
to be one of the operations to handle.
The section pointer @code{s} and the current pc @code{p} are needed to call
the standard @code{find_base()} function.
@end table
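
The first three macros could be combined as in the sketch below, which adds
two hypothetical unary operations, @code{lo8(} and @code{hi8(}, selecting the
low and high byte of a 16-bit value. The operation names, the helper
@code{ext_unary_eval} and the placeholder value of @code{LAST_EXP_TYPE} are
assumptions; in a real backend @code{LAST_EXP_TYPE} and @code{taddr} come
from the vasm headers.

```c
#include <stdint.h>
#include <string.h>

typedef int32_t taddr;

/* Placeholder; the real value is provided by vasm. */
#define LAST_EXP_TYPE 100

/* Target-specific operation types start after the standard ones. */
enum { LOBYTE = LAST_EXP_TYPE + 1, HIBYTE };

#define EXT_UNARY_NAME(s) \
    (!strncmp((s), "lo8(", 4) || !strncmp((s), "hi8(", 4))
#define EXT_UNARY_TYPE(s) ((s)[0] == 'l' ? LOBYTE : HIBYTE)
#define EXT_UNARY_EVAL(t,v,r,c) ext_unary_eval((t),(v),(r),(c))

/* Returns non-zero when type t was handled and *r is valid. */
static int ext_unary_eval(int t, taddr v, taddr *r, int c)
{
    (void)c;  /* this sketch treats constants and addresses alike */
    switch (t) {
    case LOBYTE:
        *r = v & 0xff;
        return 1;
    case HIBYTE:
        *r = (v >> 8) & 0xff;
        return 1;
    default:
        return 0;
    }
}
```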

@subsection The file @file{cpu.c}

A cpu module has to provide the following elements (all other functions
and data should be @code{static} to prevent name clashes) in @code{cpu.c}:

@table @code
@item int bitsperbyte;
The number of bits per byte of the target cpu.

@item int bytespertaddr;
The number of bytes per @code{taddr}.

@item mnemonic mnemonics[];
The mnemonic table keeps a list of mnemonic names and operand types the
assembler will match against using @code{parse_operand()}. It may also
include a target specific @code{mnemonic_extension}.

@item char *cpu_copyright;
A string that will be emitted as part of the copyright message.

@item char *cpuname;
A string describing the target cpu.

@item int init_cpu();
Will be called during startup, after argument parsing. Must return zero if
initializations failed, non-zero otherwise.

@item int cpu_args(char *);
This function will be called with the command line arguments (unless they
were already recognized by other modules). If an argument was recognized,
return non-zero.

@item char *parse_cpu_special(char *);
This function will be called with a source line as argument and allows
the cpu module to handle cpu-specific directives etc. Functions like
@code{eol()} and @code{skip()} from the syntax module should be used to
keep the syntax consistent.

@item operand *new_operand();
Allocate and initialize a new operand structure.

@item int parse_operand(char *text,int len,operand *out,int requires);
Parses the source at @code{text} with length @code{len} to fill the target
specific operand structure pointed to by @code{out}. Returns @code{PO_MATCH}
when the operand matches the operand-type passed in @code{requires} and
@code{PO_NOMATCH} otherwise. When the source is definitely identified as
garbage, the function may return @code{PO_CORRUPT} to tell the assembler
that it is useless to try matching against any other operand types.
Another special case is @code{PO_SKIP}, which is also a match, but skips
the next operand from the mnemonic table (because it was already handled
together with the current operand).

@item taddr instruction_size(instruction *ip, section *sec, taddr pc);
Returns the size of the instruction @code{ip} in bytes, which must be
identical to the number of bytes written by @code{eval_instruction()}
(see below).

@item dblock *eval_instruction(instruction *ip, section *sec, taddr pc);
Converts the instruction @code{ip} into a DATA atom, including relocations,
if necessary.

@item dblock *eval_data(operand *op, taddr bitsize, section *sec, taddr pc);
Converts a data operand into a DATA atom, including relocations.

@item void init_instruction_ext(instruction_ext *);
(If @code{HAVE_INSTRUCTION_EXTENSION} is set.)
Initialize an instruction extension.

@item char *parse_instruction(char *,int *,char **,int *,int *);
(If @code{MAX_QUALIFIERS} is greater than 0.)
Parses instruction and saves extension locations.

@item int set_default_qualifiers(char **,int *);
(If @code{MAX_QUALIFIERS} is greater than 0.)
Saves pointers and lengths of default qualifiers for the selected CPU and
returns the number of default qualifiers. Example: for an M680x0 CPU this
would be a single qualifier, called "w". Used by @code{execute_macro()}.

@item cpu_opts_init(section *);
(If @code{HAVE_CPU_OPTS} is set.)
Gives the cpu module the chance to write out @code{OPTS} atoms with
initial settings before the first atom is generated.

@item cpu_opts(void *);
(If @code{HAVE_CPU_OPTS} is set.)
Apply option modifications from an @code{OPTS} atom. For example:
change cpu type or optimization flags.

@item print_cpu_opts(FILE *,void *);
(If @code{HAVE_CPU_OPTS} is set.)
Called from @code{print_atom()} to print an @code{OPTS} atom's contents.

@end table
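
The return codes of @code{parse_operand()} are perhaps easiest to understand
from a small example. The sketch below handles two hypothetical operand
types, registers @code{r0} to @code{r7} and decimal immediates introduced by
@code{#}. The @code{PO_} values, the operand type codes and the
@code{operand} layout are placeholders defined locally so that the example
is self-contained; a real backend takes the @code{PO_} codes from vasm and
would normally leave expression parsing to the frontend instead of reading
digits by hand.

```c
#include <ctype.h>

/* Stand-ins for definitions that vasm itself provides. */
enum { PO_NOMATCH, PO_MATCH, PO_CORRUPT };
enum { OP_REG = 1, OP_IMM };

typedef struct {
    int type;
    long value;
} operand;

int parse_operand(char *text, int len, operand *out, int requires)
{
    if (len <= 0)
        return PO_CORRUPT;

    /* Register operand: "r0" to "r7". */
    if (len == 2 && text[0] == 'r' && text[1] >= '0' && text[1] <= '7') {
        if (requires != OP_REG)
            return PO_NOMATCH;
        out->type = OP_REG;
        out->value = text[1] - '0';
        return PO_MATCH;
    }

    /* Immediate operand: '#' followed by decimal digits only. */
    if (text[0] == '#') {
        long v = 0;
        int i;

        if (len < 2)
            return PO_CORRUPT;
        for (i = 1; i < len; i++) {
            if (!isdigit((unsigned char)text[i]))
                return PO_CORRUPT;  /* garbage, stop trying */
            v = v * 10 + (text[i] - '0');
        }
        if (requires != OP_IMM)
            return PO_NOMATCH;
        out->type = OP_IMM;
        out->value = v;
        return PO_MATCH;
    }

    return PO_NOMATCH;
}
```

Note that a well-formed operand of the wrong type yields @code{PO_NOMATCH},
so the assembler can try the next entry in the mnemonic table, while
malformed text yields @code{PO_CORRUPT}.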


@section Output modules

Output modules can be chosen at runtime rather than compile time. Therefore,
several output modules are linked into one vasm executable and their
structure differs somewhat from syntax and cpu modules.

Usually, an output module for some object format @code{fmt} should be contained
in a file @file{output_<fmt>.c} (it may use/include other files if necessary).
To automatically include this format in the build process, the @file{make.rules}
has to be extended. The module should be added to the @code{OBJS} variable
at the start of @file{make.rules}. Also, a dependency line should be added
(see the existing output modules).

An output module must only export a single function which will return
pointers to necessary data/functions. This function should have the
following prototype:
@example
int init_output_<fmt>(
      char **copyright,
      void (**write_object)(FILE *,section *,symbol *),
      int (**output_args)(char *)
    );
@end example

In case of an error, zero must be returned.
Otherwise, it should perform all necessary initializations, return non-zero
and pass back the following output parameters via the pointers passed as arguments:

@table @code
@item copyright
A pointer to the copyright string.

@item write_object
A pointer to a function emitting the output. It will be called after the
assembler has completed and will receive pointers to the output file,
to the first section of the section list and to the first symbol
in the symbol list. See the section on general data structures for further
details.


@item output_args
A pointer to a function checking arguments. It will be called with all
command line arguments (unless already handled by other modules). If the
output module recognizes an appropriate option, it has to handle it
and return non-zero. If it is not an option relevant to this output module,
zero must be returned.

@end table
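
A skeleton for such an init function might look like the following sketch
for a hypothetical format called @code{demo}. The option name
@code{-demoverbose}, the static helper names and the opaque @code{section}
and @code{symbol} declarations are assumptions made to keep the example
self-contained; a real module gets these types from the vasm headers.

```c
#include <stdio.h>
#include <string.h>

/* Opaque stand-ins for vasm's section and symbol types. */
typedef struct section section;
typedef struct symbol symbol;

static char *copyright = "vasm demo output module (sketch)";
static int verbose;

/* Emits the output file; this sketch only writes a marker line. */
static void write_output(FILE *f, section *sec, symbol *sym)
{
    (void)sec;
    (void)sym;
    fputs(verbose ? "demo (verbose)\n" : "demo\n", f);
}

/* Returns non-zero only for options this module understands. */
static int output_args(char *arg)
{
    if (!strcmp(arg, "-demoverbose")) {
        verbose = 1;
        return 1;
    }
    return 0;
}

int init_output_demo(char **cp,
                     void (**wo)(FILE *, section *, symbol *),
                     int (**oa)(char *))
{
    *cp = copyright;
    *wo = write_output;
    *oa = output_args;
    return 1;  /* zero would signal an initialization error */
}
```

Keeping everything except the init function @code{static} avoids name
clashes with the other output modules linked into the same executable.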

Finally, a call to @code{init_output_<fmt>()} has to be added in the
@code{init_output()} function in @file{vasm.c} (should be self-explanatory).

Some remarks:
@itemize @minus

@item
Some output modules cannot handle all supported CPUs. Nevertheless,
they have to be written in a way that they can be compiled. If code
references CPU specifics, it has to be enclosed in
@code{#ifdef VASM_CPU_MYCPU} ... @code{#endif} or similar.

Also, if the selected CPU is not supported, the init function should fail.

@item
Error/warning messages can be emitted with the @code{output_error} function.
As all output modules are linked together, they have a common list of error
messages in the file @file{output_errors.h}. If a new message is needed, this
file has to be extended (see the section on general data structures for
details).

@item
@command{vasm} has a mechanism to specify rather complex relocations in a
standard way (see the section on general data structures). They can be
extended with CPU specific relocations, but usually CPU modules will
try to create standard relocations (sometimes several standard relocations
can be used to implement a CPU specific relocation). An output
module should try to find appropriate relocations supported by the
object format. The goal is to avoid special CPU specific
relocations as much as possible.

@end itemize

Volker Barthelmann                                      vb@@compilers.de

@bye