dictd-1.13.1/doc/dicf.ms

 Created: Sun Mar 8 07:46:40 1998
 Revised: Sun Mar 8 10:34:44 1998 by faith@acm.org
 Distribution of this memo is unlimited.

 $Id: dicf.ms,v 1.1 1998/07/05 22:10:06 faith Exp $

.po 0
.lt 7.2i
.nr LL 7.2i
.nr LT 7.2i
.tl 'DICT Development Group''\*(LH'
.tl '''\*(DA'

\\$1

..


\\$3

.XS
\\*(SN\t\\$3
.XE
..


\\$2

.XS
\\*(SN\t\\$2
.XE
..


\\$2

..
.while \\n[.$]>=2 \{\
. as an-result \\$1\\$2\\*[an-empty]
. shift 2
.\}
\\*[an-result]
..


DICT Interchange Format


.HU "Status of this Memo"

This document is a DICT Development Group Technical Report.

.HU "Abstract"

The DICT Interchange Format (DICF) is a human-readable format for the
interchange of dictionary databases for the use with DICT protocol
client/server software.

.bp
.HU "Table of Contents"

.so  dicf.toc


.HR 1 0 "Introduction"

.HN 2 "Requirements"

In this document, we adopt the convention discussed in Section 1.3.2
of [RFC1122] of using the capitalized words MUST, REQUIRED, SHOULD,
RECOMMENDED, MAY, and OPTIONAL to define the significance of each
particular requirement specified in this document.

In brief: "MUST" (or "REQUIRED") means that the item is an absolute
requirement of the specification; "SHOULD" (or "RECOMMENDED") means
there may exist valid reasons for ignoring this item, but the full
implications should be understood before doing so; and "MAY" (or
"OPTIONAL") means that his item is optional, and may be omitted
without careful consideration.

.HN 1 "Design Considerations"

.HN 2 "Introduction"

The goal of DICF is to provide a format suitable for the interchange of
dictionary databases. New databases can be converted into DICF, and then
the DICF can be analyzed and indexed by DICT server software.

If machine translation were the only use of DICF, then SGML might be the
best choice for DICF syntax. However, we expect humans to create new
dictionary databases by hand, and we have found that the use of SGML
without specialized editors is difficult for humans to read and edit. We
have found a minimalistic syntax, similar to
 nroff or TeX, easier to edit using a wide variety of text-based editors.

Further, we would like to be able to combine a DICF formatting engine into
other pieces of software, such as DICT protocol clients and servers.
Therefore, we would like the parsing and formatting requirements of DICF to
be lightweight and easy to implement. We also would like underlying engine
to be powerful enough to support unforeseen extensions that might be needed
to support complex databases. From this viewpoint, engines bases on SGML
or TeX would be too large to be easily embedded in other applications. The
 nroff language [NROFF86] is well documented, extensible, and relatively small.
We have decided to base the DICF langauge syntax and capabilities on an
extended subset of
 nroff .
.HN 2 "Requirements"

DICF must be translatable to the following formats:

 \(bu
ASCII text [ASCII]
 \(bu
UTF-8 encoded text [ISO10646,RFC2044]
 \(bu
HTML [XXX]


Further, DICF must be sufficiently powerful to support the indexing
requirements of current and future versions of DICT client/server software;
and DICF must be editable without a specialized editor.

.HN 2 "Limitations"

DICF provide simpler capabilities that a full
 nroff implementation:

 \(bu
Page control commands are not needed, since a single definition will always
appear on a single logical page. This includes the commands:
 .pl ", " .bp ", " .pn ", " .po ", " .ne ", " .mk ", and " .rt .  \(bu
Because definitions and formatting instructions will be included in a
single file, the ability to access other files and programs should not be
supported. This includes the commands:
 .so ", " .nx ", and " .pi . Note that the elimination of these command also eliminates security
considerations from the formatting language.


Because the notion of pages have changed, commands dealing with traps
(i.e.,
 .wh ", " .ch ", " .dt ", " .em ", and " .it ) and
titles
(i.e.,
 .tl ", " .pc, ", and " .lt ) may have slightly different meanings than in standard
 nroff .
.HN 2 "Extensions"

Because of the necessity of dealing with UTF-8 encoded characters and the
requirement that DICF file can be easily edited by standard editors, the
syntax for special characters is extended.

.HN 1 "Language Syntax"

.HN 1 "Commands"

.HN 2 "Entries"

An "entry" will contain a definitiona and will usually be marked for
indexing by at least one headword (if no headwords are marked in an entry,
then the entry cannot be searched for).

An entry starts with:

 .e

or with:

 .e word

which is equivalent to:

 .e
 .h word

as described in the next section.

An entry ends when:

 \(bu
the next entry begins,
 \(bu
the end of the file is reached,
 \(bu
or

 ..

is seen on the input.


Blank lines at the end of an entry MUST BE elided.

.HN 2 "Headwords"

For an entry, several words might be marked as "headwords". These
headwords are placed in an index such that a search on the index for the
headword will return the entry containing the headword. By default, if a
headword is indexed, then all standard search methods should find that
headword and return the definition.

However, some headwords may best identify an entry only when an "exact"
search is performed, and should not be returned for various inexact
searches. For example, uncommon spellings or misspellings of a word may
reasonably identify an entry when an exact search is performed, but would
confuse a user if these spellings were also returned when an inexact search
was performed. Another example arises in gazetteer-like dictionaries: an
exact search for "city, state" should return the appropriate informatin,
but inexact searches should only return a list of cities without state
names \(em otherwise too many matches are returned for inexact searches,
making these types of searches useless (because so many cities have the
same name).

Other headwords may be marked so that they are only returned for specific
types of searches. For example, an entry may mark several words as
"mentioned". These words should not identify the entry for the usual exact
or inexact searches, because doing to would return too many unrelated
definitions. However, a special "mentioned" search may return entries
which mention or provide usage examples for words which are peripheral to
the main word being defined by the entry.

The best types of searches provided depend on the specific database. DICF
defines the three most common builtin marks for headwords:


 .h word

marks "word" as a headword for common exact and inexact searches.


 .he word

marks "word" as a headword only for exact searches. If the DICT server
does not support multiple types of index entries, then this headword will
not be indexed.


 .hm word

marks "word" as a headword only for special "mentioned" searches. If the
DICT server does not support multiple types of index entries, then this
headword will not be indexed.

By default, a word marked as a headword will be inserted in place in the
text of the definition. If this insertion is not desired, the following
alternative forms of these commands are provided:
 .hn ", " .hen ", and " .hmn .
.HN 2 "Cross References"

For an entry, several words might be marked as "cross references"
so that, during definition display, selection of one of these words will
search for another definition. By default, these words are inserted in
place in the text of the definition. As with the headword commands, an
alternative form is provided that does not have this behavior.


 .x word

marks "word" as a cross reference to another entry, inserting the word in
place in the definition text.


 .xn word

marks "word" as a cross reference to another entry, but does not insert the
word in the definition text. If the DICT server does not support cross
references to words which do not appear in the text, then "word" will be
ignored. This command is provided for orthogonality with the headword
commands and for support of future DICT and DICF capabilities.

.HN 2 "Word Marks"

Some words (including compound words, or phrases) may be marked as having
special features which may imply status as a headword and/or cross
reference, or suggest the use of font changes to set the word appart. For
example:

 .title Book/paper/record/movie title
 .person Name of a person
 .syn Synonym
 .ant Antonym
 .cf Compare with
 .see See also
 .genus Name of a genus
 .species Name of a species
 .subspecies Name of a subspecies
 .order Name of an order
 .phylum Name of a phylum
 .class Name of a class
 .family Name of a family
 .chem Chemical notation
 .math Mathematical notation [XXX shouldn't be here]


.HN 2 "Informational Marks"

In many databases, the text of the entry will follow the headword in a free
format. For other databases, marking some parts of the entry may be
helpful for later machine processing or formatting:

 .note Note
 .usage Usage note
 .q Quote, usually as an example
 .qa Author of previous quote
 .ex Example of usage
 .au Author of entry
 .s Source of entry
 .pron Pronounciation
 .syl Syllabification
 .pos Part of speech
 .var Variant
 .altsp Alternative spelling
 .pl Spelling of plural form
 .sing Spelling of singular form


.HN 1 "Security Considerations"

Because DICF commands cannot cause the execution of arbitrary programs,
DICF raises no security issues.

.HN 1 "References"

.XP
[ASCII] US-ASCII. Coded Character Set - 7-Bit American Standard Code
for Information Interchange. Standard ANSI X3.4-1986, ANSI, 1986.

.XP
[ISO10646] ISO/IEC 10646-1:1993. International Standard -- Information
technology -- Universal Multiple-Octet Coded Character Set (UCS) --
Part 1: Architecture and Basic Multilingual Plane. UTF-8 is described
in Annex R, adopted but not yet published. UTF-16 is described in
Annex Q, adopted but not yet published.

.XP
[NROFF86] Ossanna, Joseph F. Nroff/Troff User's Manual, updated for 4.3BSD
by Mark Seiden (USD-24). Published in UNIX User's Supplementary Documents
(USD): 4.3 Berkeley Software Distribution, Virtual VAX-11 Version, April
1986 (Computer Systems Research Group, Computer Science Division,
Department of Electrical Engineering and COmputer Science, University of
California, Berkeley).

.XP
[RFC2044] Yergeau, F., "UTF-8, a transformation format of Unicode and
ISO 10646", RFC-2044, Alis Technologies, October 1996.

.HN 1 "Acknowledgements"

.HN 1 "Author's Addresses"


Rickard E. Faith
EMail: faith@cs.unc.edu (or faith@acm.org)


Bret Martin
EMail: bamartin@miranda.org


 Local Variables:
 mode: nroff
 mode: font-lock
 fill-column: 70
 End: