=== charset.[ch] ===

This defines the 8- and 16- bit character types and their encodings.

 int init_charset(void);

This function must be called to initialise the library (but is called
by init_parser()).  Returns -1 on error.

 void deinit_charser(void);

May be called to free memory when the library is no longer required.
It is called by deinit_parser().

The 8-bit type is char8, which is a typedef for char.  We would have
liked to use unsigned char, but this tends to produce innumerable
warnings from compilers.  The 16-bit type is char16, which is a typedef
for unsigned short.  We didn't use C wide character mechanism for various
reasons; we can't remember what they all were but one was that they are
typically 32 bits and we didn't want to double the size of everything.

The type Char is used for all character data returned by the parser.
It is a typedef for either char8 or char16, depending on how the
system was compiled.

The type CharacterEncoding is an enumeration (expect it to become a
pointer to a structure in some future release).  Currently supported
values include CE_UTF_8, CE_ISO_8859_x for 1<=x<=9, and CE_UTF_16[BL]
where B or L indicates big- or little-endian.

If the system is compiled in 16-bit mode, the internal encoding (the
encoding used for the type Char) is CE_UTF_16B or CE_UTF_16L - UTF-16
in native byte order.  If the system is compiled in 8-bit mode, the
internal encoding is CE_unspecified_ascii_superset - an unspecified
superset of ASCII in which all codes >= 0xa0 are treated as valid
name characters, and no character set translation is done on input
or output.  Do not attempt to output 16-bit characters when compiled
in 8-bit mode; the results are wrong.

 extern CharacterEncoding InternalCharacterEncoding

This variable reflects the internal encoding and should not be
assigned to.

 extern const char8 *CharacterEncodingName[CE_enum_count];
 extern const char8 *CharacterEncodingNameAndByteOrder[CE_enum_count];

These arrays map CharacterEncodings to their names, with and without
suffixes indicating the byte order.

 CharacterEncoding FindEncoding(char8 *name);

This function looks up an encoding by name.  It understands various
aliases (ISO-Latin-1 for ISO-8859-1 for example).  It returns
CE_unknown if the name is not recognised.

=== ctype.[ch] ===

This provides macros related to character types.

 int init_ctype16(void);

This function must be called to initialise the library (but is called
by init_parser()).  Returns -1 on error.

 void deinit_ctype16(void);

May be called to free memory when the library is no longer required.
It is called by deinit_parser().

The following macros may evaluate their argument more than once, so
don't do is_xml_namestart(*c++).

 #define is_xml_legal(c) ...

True if c is a legal XML character.

 #define is_xml_namechar(c) ...
 #define is_xml_namestart(c) ...

True if c is an XML name character or name start character
respectively.

 #define is_xml_whitespace(c) ...

True if c is an XML white space character.

=== string16.[ch] ===

This provides functions corresponding to the usual C library string
functions.

 char16 *strchr16(const char16 *, int);
 char8 *strchr8(const char8 *, int);
 Char *strchr16(const Char *, int);

These are versions of strchr() for char8, char16, and Char
respectively.  There are similar functions corresponding to strdup(),
strlen(), strcmp(), strncmp(), strcpy(), strncpy(), strcat(),
strcasecmp(), strncasecmp(), and strstr().

 void translate_latin1_utf16(const char8 *from, char16 *to);
 void translate_utf16_latin1(const char16 *from, char8 *to);
 char16 *translate_latin1_utf16_m(const char8 *from, char16 *to);
 char8 *translate_utf16_latin1_m(const char16 *from, char8 *to);

These functions convert between 8- and 16-bit characters.  The
conversion is trivial.  For 8-to-16, the value is unchanged, so it is
right for Latin-1.  For 16-to-8, the value is unchanged if it is <=
255, and is replaced by 'X' for other values.  They are useful for
converting a URL read from an XML document (and therefore represented
as a string of Char) to a string usable with url_open(), and for
converting a command-line argument.

The _m versions realloc() the destination buffer, so you can pass a
null argument or an exisiting malloc()ed string which will be expanded
if necessary; the (possibly new) destination buffer is returned.

The functions char8tochar16 and char16tochar8, and the macros
char8toChar and Chartochar8 have been removed because they were not
thread safe.  Use the functions described above instead.

=== stdio16.[ch] ===

This provides a partial implementation of the standard i/o library that
handles 16-bit characters.  So far much more is implemented for output
than input.

 int init_stdio16(void);

This function must be called to initialise the library (but is called
by init_parser()).  Returns -1 on error.

 void deinit_stdio16(void);

May be called to free memory when the library is no longer required.
It is called by deinit_parser().

The central datatype is the FILE16 which corresponds to the usual FILE
structure.  Each FILE16 has an associated encoding; characters are
translated to this encoding on output (and will be translated from it
on input when this is implemented).

There are three predefined FILE16s: Stdin, Stdout and Stderr.  By
default their encoding is ISO-Latin-1.  

 int Fprintf(FILE16 *file, const char *format, ...);
 int Vfprintf(FILE16 *file, const char *format, va_list args);
 int Printf(const char *format, ...);
 int Vprintf(const char *format, va_list args);
 int Sprintf(void *buf, CharacterEncoding enc, const char *format, ...);
 int Vsprintf(void *buf, CharacterEncoding enc, const char *format, 
              va_list args);

These correspond to the usual stdio functions.  There are two additional
format specifiers: %ls and %S.  %ls expects a string of char16.  %S
expects a string of Char - that is, it expects 8- or 16- bit characters
depending on which the system is compiled for.

 int Fclose(FILE16 *file);
 int Fflush(FILE16 *file);
 int Fseek(FILE16 *file, long offset, int ptrname);

Again, these correspond to the usual stdio functions.

 CharacterEncoding GetFileEncoding(FILE16 *file);
 void SetFileEncoding(FILE16 *file, CharacterEncoding encoding);

These get and set the character encoding associated with a file.

 void SetCloseUnderlying(FILE16 *file, int cu);

FILE16s typically have some underlying mechanism that does the i/o.
For example, it may use an ordinary FILE, or it may write to a string.
This function controls whether a close operation is performed on
the underlying structure when the FILE16 is closed.  For a FILE this
would be calling fclose(), for a string it would be free().

 int Readu(FILE16 *file, unsigned char *buf, int max_count);
 int Writeu(FILE16 *file, unsigned char *buf, int count);

These perform low-level read and write on the FILE16.  No character
translation is done.

 FILE16 *MakeFILE16FromFILE(FILE *f, const char *type);
 FILE16 *MakeFILE16FromString(void *buf, long size, const char *type);
 FILE16 *MakeFILE16FromGzip(gzFile file, const char *type);
 FILE16 *MakeFILE16FromWinsock(int sock, const char *type);

These functions create FILE16s.  MakeFILE16FromGzip uses a LIBZ stream
to read or write compressed files.  MakeFILE16FromWinsock is only used
under MS Windows, where sockets seem to work differently from oridinary
file descriptors.

On systems where it makes a difference (not Unix), FILEs used in
FILE16s are set to binary mode when the FILE16 is first read or
written, so that the standard i/o library doesn't translate bytes that
happen to look like linefeeds in cr-lf, and vice versa.  Note that
using Stdin/out/err will therefore put stdin/out/err into binary mode.

=== url.[ch] ===

This defines functions for accessing URLs.

 int init_url(void);

This function must be called to initialise the library (but is called
by init_parser()).  Returns -1 on error.

 void deinit_url(void);

May be called to free memory when the library is no longer required.
It is called by deinit_parser().

 char8 *url_merge(const char8 *url, const char8 *base,
                        char8 **scheme, char8 **host, int *port, char8 **path);

This merges a URL with a base URL.  The merged URL is returned.  If
base, scheme, host, port and path are non-null, the parts of the
merged URL are returned in them.  The caller should free the returned
strings when they are no longer required.

 char8 *default_base_url(void);

This returns a default base URL that can be used when no better choice
is available.  It returns a file: URL referring to the current
directory (file:`pwd`/).  The caller should free the returned string
when it is no longer required.

extern FILE16 *url_open(const char8 *url, const char8 *base, 
                        const char8 *type, char8 **merged_url);

This returns a FILE16 connected to the specified URL.  The URL is
first merged with the specified base URL, or with the default base URL
if it is null.  If you want relative URLs to fail, give a base URL
of "".  The type should be "r" for reading, "w" for writing.

=== input.[ch] ===

This defines structures and functions related to reading from
entities.  Some of the functionality of this file - relating to
character encoding translation - should (and probably will) be moved
to stdio16.c.

An InputSource is an entity that is open for reading.  To parse an
entity, it is opened and the resulting source is pushed onto
the parser's input stack.

 InputSource EntityOpen(Entity e);

This takes an entity and returns an source.

 InputSource SourceFromFILE16(const char8 *description, FILE16 *file16);
 InputSource SourceFromStream(const char8 *description, FILE *file);

These are ways of getting a source when what you have is not an entity
but an existing open stream (such as stdin).  A fake entity is created
with the description as its system ID.  If the description contains
a slash character, it will be used as the entity's base URL, so if you
know where the stream came from you can pass in its URL as the
dscription; otherwise use something like "stdin" so that the user
gets reasonable error messages.

 void SourceClose(InputSource source);

This closes and frees a source.  Usually the parser will call this when
it comes to the end of the source.

 InputSource NewInputSource(Entity e, FILE16 *f16);

This creates an input source referring to a given entity and stream.
It is only intended for direct use by the user if the parser's
entity opener has been set (for example to implement a public ID
catalogue).

 int SourceTell(InputSource s);
 int SourceSeek(InputSource s, int offset);

These correspond to the standard fseek() and ftell() functions.  They
should be used with extreme care since arbitrary seeking will
typically result in parse errors.  Note that the offset is in bytes,
not characters.

=== dtd.[ch] ===

This defines structures and functions related to a document's DTD.
Much of it is private to the implementation, and most of the
structures it defines are created and destroyed by the parser rather
than the user.

A DTD is represented by a Dtd structure.  This contains the name given
in the DOCTYPE declaration and the entities, element types, attribute
definitions and notations defined.  Even if a document does not have a
DOCTYPE declaration, it has a Dtd; this contains dummy declarations
for the elements and attributes mentioned in the document.

 void FreeDtd(Dtd dtd)

This frees a Dtd.  Even though the Dtd is created automatically,
the user should free it; see FreeParser().

Entities are represented by Entity structures.  All entities have a
name (except for top-level entities and the dummy entity created to
represent the internal DTD part).  An entity is either internal or
external.  External entities have a system ID (a URL) which is used to
open them and optionally a public ID.  Internal entities contain their
text as Char string in the internal encoding.

 Entity NewExternalEntity(const Char *name,
                          const char8 *publicid, const char8 *systemid,
                          NotationDefinition notation,
                          Entity parent);

This creates a new external entity.  It is called directly by the user
only to create a top-level entity for parsing, in which case the
notation and parent should be null, and the name and public ID may be
null.  The name and IDs are copied.

 void FreeEntity(Entity e);

This frees an entity.

 const char8 *EntityURL(Entity e);

This returns the URL of an entity, obtained by merging its system ID
with the URL of any parent entity.

 const char8 *EntityBaseURL(Entity e);
 void EntitySetBaseURL(Entity e, const char8 *url);

These get and set the base URL for an entity (that is, the base URL
used when interpreting URLs that appear in the entity).

Element types are represented by ElementDefinition structures.  These
contain the name of the element ("name" field), and its declared
content and attributes.  The "prefix" field contains the prefix if the
name contains a colon (otherwise null), and the "local" field contains
the part of the name after the colon (or the whole name if there is
no colon).

Attribute definitions are represented by AttributeDefinition structures.
These contain the name of the attribute, its declared type, allowed values
and default.  The "name", "prefix" and "local" fields are the same as
for ElementDefinition.

Notation definitions are represented by NotationDefinition structures.
These contain the name of the notation, and its system and public IDs.

=== namespaces.[ch] ===

This defines structures analogous to those in dtd.[ch], but for elements
and attributes within a namespace rather than a DTD.

 void init_namespaces(void);

This function must be called to initialise the library (but is called
by init_parser()).  Returns -1 on error.

 void deinit_namespaces(void);

May be called to free memory when the library is no longer required.
It is called by deinit_parser().

A namespace is represented by a Namespace structure.  This contains
the URI of the the namespace ("uri" field), and lists of the element
types and global attributes in the namespace.  Each element type has a
list of per-element-type (ie unqualified) attributes.

It is natural that namespaces are shared between documents.  If two
documents refer to an element type with the same name and namespace,
the structures representing them should be equal, and likewise for
attributes.  This poses a problem for storage allocation: if a process
(say a server of some kind) repeatedly reads documents, it will
accumulate namespaces.  If it is treating the documents independently,
this is undesirable.  To accommodate this, namespaces are grouped into
"namespace universes" of type NamespaceUniverse.  By default, all
instances of the parser use a common namespace universe, which can be
specified by passing a null argument to functions that take a
NamespaceUniverse.  For server applications that do not want to
accumulate namespaces, it is possible to set the namespace universe of
each parser instance to a new universe, and free it after freeing the
parser (XXX how to do this is not yet described).  Alternatively the
common namespace universe can be cleared by calling
reinit_namespaces().

Element types in a namespace are represented by NSElementDefinition
structures.  These contain the (unqualified) name of the element
("name" field) and the namespace itself ("namespace" field).

Attribute definitions in a namespace are represented by
NSAttributeDefinition structures.  These contain the (unqualified)
name of the attribute ("name" field) and the namespace itself
("namespace" field).  Per-element-type attributes also contain
the NSElementDefinition they are associated with ("element" field);
this field is null for global attributes.

Unfortunately "namespace" turns out to be a reserved word in C++.
If __cplusplus is defined, the include files use "name_space"
instead.  You should of course compile the RXP library as C code,
even if your program is in C++.

=== xmlparser.[ch] ===

This defines structures and functions for parsing an XML document.

 int init_parser(void)

This function must be called to initialise the library, and it calls
the other init_* functions.  It is called by NewParser(), but if you
call any other functions before NewParser() you should call
init_parser() yourself first.  Returns -1 on error.

 void deinit_parser(void);

May be called to free memory when the library is no longer required.
It calls the other deinit_* functions.

An instance of the parser is represented by a Parser structure.  It
contains the current state of the parse.

 Parser NewParser(void);

The creates a new parser instance.

 void FreeParser(Parser p);

This frees a parser.  It does not free the Dtd structure, because this
could conceivably be shared between parsers (though this documentation
does not explain how to do that).  You should normally free the Dtd
when you free the Parser by doing FreeDtd(p->dtd).

 void ParserSetFlag(Parser p,  ParserFlag flag, int value);
 #define ParserGetFlag(p, flag) ...

There are numerous flags that can be applied to a parser.  ParserSetFlag
sets the specified flag to a value which should be non-zero to set it,
zero to clear it.  ParserGetFlag returns zero or non-zero (not necessarily
one!) according to whether the flag is clear or set.

The (documented) flags are

    ExpandCharacterEntities
    ExpandGeneralEntities

If these are set, entity references are expanded.  If not, the
references are treated as text, in which case any text returned that
starts with an ampersand must be an entity reference (and provided
MergePCData is off, all entity references will be returned as separate
pcdata XBits).  On by default.

    NormaliseAttributeValues (also NormalizeAttributeValues)

If this is set, attributes are normalised according to the standard.
You might want to not normalise if you are writing something like an
editor.  On by default.

    ErrorOnBadCharacterEntities

If this is set, character entities which expand to illegal values are
an error, otherwise they are ignored with a warning.  Off by default
(should probably be on).

    ErrorOnUndefinedEntities

If this is set, undefined general entity references are an error,
otherwise a warning is given and a fake entity constructed whose value
looks the same as the entity reference.  Off by default (should probably
be on).

    ReturnComments

If this is set, comments are returned as XBits, otherwise they are ignored.
Off by default.

    ErrorOnUndefinedElements
    ErrorOnUndefinedAttributes

If these are set and there is a DTD, references to undeclared elements
and attributes are an error.  Off by default.

    WarnOnRedefinitions

If this is on, a warning is given for redeclared elements, attributes,
entities and notations.  On by default.

    TrustSDD
    ProcessDTD

If TrustSDD is set and a DOCTYPE declaration is present, the internal
part is processed and if the document was not declared standalone or
if Validate is set the external part is processed.  Otherwise, whether
the DOCTYPE is automatically processed depends on ProcessDTD; if
ProcessDTD is not set the user must call ParseDtd() if desired.

    ReturnDefaultedAttributes

If this is set, the returned attributes will include ones defaulted as
a result of ATTLIST declarations, otherwise missing attributes will not
be returned.  Off by default.

    MergePCData

If this is set, text data will be merged across comments and entity
references.  Off by default.

    XMLStrictWFErrors

If this is set, various well-formedness errors will be reported as errors
rather than warnings.  Off by default.

    Validate

If this is on, the parser will validate the document.  Off by default.

    NoNoDTDWarning

Usually, if Validate is set, the parser will produce a warning if the
document has no DTD.  This flag suppresses the warning (useful if you
want to validate if possible, but not complain if not).  Off by default.

    ErrorOnValidityErrors

If this is on, validity errors will be reported as errors rather than
warnings.  This is useful if your program wants to rely on the
validity of its input.  Off by default.

    XMLSpace

If this is on, the parser will keep track of xml:space attributes
(see below).

    XMLNamespaces

If this is on, the parser processes namespace declarations (see
below).  Namespace declarations are *not* returned as part of the list
of attributes on an element.

 void ParserSetWarningCallback(Parser p, CallbackProc cb);

Usually warnings are printed (on the standard error stream).  This
function allows you to set a function to be called instead.  The function
should be declared like this:

 void my_warning_proc(XBit bit, void *arg)

The bit argument will contain a warning bit.  The arg argument will
be null unless it is set with

 void ParserSetCallbackArg(Parser p, void *arg);

 void ParserSetDtdCallback(Parser p, CallbackProc cb);

Usually comments and processing instructions inside the DOCTYPE
declaration are ignored.  This function allows you to set a callback
be called instead.  The function should be declared in the same way
as the warning callback.

 void ParserSetEntityOpener(Parser p, EntityOpenerProc opener);

Usually entities are opened by calling EntityOpen() on them.  This
function allows you to intercept entity opening with a callback, for
example to implement a catalogue.  The callback should declared like
this:

 InputSource my_entity_opener(Entity e, void *arg);

If your entity opener decides not to handle the entity, it should
return the result of calling EntityOpen(e).

 void ParserPerror(Parser p, XBit bit);

This function prints an error message according to the bit argument.
You should probably call it when the parser returns an error XBit, and
it may be useful to call it from a warning callback function.

 int ParserPush(Parser p, InputSource source);

This pushes an input source onto the parser's input stack.  The usual
sequence for opening a document is to do:

    p = NewParser();
    ent = NewExternalEntity(0, 0, filename-or-url, 0, 0);
    source = EntityOpen(ent);
    ParserPush(p, source);


The parser returns data as XBit structures.  You can read either
single "bits" - start and end tags, text data and so on - or entire
trees.  In the latter case the XBit structure returned contains
pointers to child XBits.  Each XBit has a "type" field whose value is
an XBitType enumeration which is one of the following:

    XBIT_start
    XBIT_empty

Returned for start and empty tags.  The XBit's "element_definition"
field points to the definition of the element.  The attributes field
contains a linked list of Attribute structures, each of which has a
"definition" field pointing to the attribute definition, a "value"
field (string of Char) containing the value, and a "next" field
pointing to the next attribute (or null).  

If the XMLSpace flag is set, the "wsm" field indicates the white-space
processing mode for the element, determined from the value of the
xml:space attribute if there is one or inherited if not.  Its value is
a WhiteSpaceMode enumeration which is one of WSM_unspecified,
WSM_default, or WSM_preserve.

If the XMLNamespaces flag is set, the "ns_element_definition" field of
the bit will contain the namespace version of the definition if the
element name is qualified or a default namespace is in effect,
otherwise null.  The ns_definition field of each attribute will
similarly contain the namespace version of the attribute definition if
the attribute name is qualified or belongs to a qulified element.  Two
element or attributes with the same local name and namespace URI will
have the same ns_[element_]definition even if they were read from
different documents (provided that the two parser instances are using
the same namespace universe).  The ns_dict field of the bit points to
a linked list of currently active namespace bindings (not yet
documented); for start bits these not freed until the corresponding
*end* bit is freed.

If the XMLNamespaces flag is not set, the ns_* fields do not contain
useful values.

    XBIT_end

Returned for end tags.  The "element_definition" field points to the
definition of the element.

    XBIT_pcdata

Returned for text.  The "pcdata_chars" field points to the text as a
string of Char.	

    XBIT_comment

Returned for comments.  The "comment_chars" field points to the
comment text as a string of Char.

    XBIT_cdsect

Returned for CDATA sections.  The "cdsect_chars" field points to the
comment text as a string of Char.

    XBIT_pi

Returned for processing instructions.  The "pi_name" field points to
the target and the "pi_chars" field to the comment text, as strings of
Char.

    XBIT_dtd

Returned for DOCTYPE declarations.  Two entities are created for the
internal and external parts.  These are stored in the "internal_part"
and "external_part" fields of the Dtd structure associated with the
parser.  Whether the declaration is processed (rther than just read)
is determined by the TrustSDD flag.

    XBIT_eof

Returned at the end of the document.
 
    XBIT_error

Returned when an error is detected.  The bit should normally be passed
to ParserPerror().

    XBIT_warning

This is never returned, but bits with this type are passed to warning
callbacks.

 XBit ReadXBit(Parser p);

This reads the next bit from a document.  Note that the parser may
(and does) re-use the XBit structure itself next time ReadXBit is
called.

 XBit PeekXBit(Parser p);

This reads the next bit wothout consuming it, so that ReadXBit() will
return it again.

 void FreeXBit(XBit xbit);

This frees the memory associated with an XBit (but not the XBit
structure itself).  It should be called after processing a bit, If you
need to keep any of the data, you can set the relevant field in the
bit to null before calling FreeXBit; it will then be your
responsibility to free that data yourself.

 XBit ReadXTree(Parser p);

This reads a whole tree.  That is, if the next bit is a start bit,
further bits are read until the end bit is encountered.  The
"nchildren" field of the returned bit contains the number of children
of the node, and they are stored in the children field as
bit->children[0] ... bit->children[bit->nchildren-1], and so on
recursively.

 void FreeXTree(XBit tree);

This frees a tree of XBits.

 XBit ParseDtd(Parser p, Entity e);

This processes entities representing the DOCTYPE declaration, created
when an XBIT_dtd but is returned.  You will typically use code something
like this:

    if(bit->type == XBit_dtd)
    {
	XBit b;
	b = ParseDtd(sf->pstate, p->dtd->internal_part);
	if(b->type == XBIT_error)
            ...
	b = ParseDtd(sf->pstate, p->dtd->external_part);
	if(b->type == XBIT_error)
            ...
    }