PyLTXML -- The LT-XML Python Interface

Release 1.3, August 2002

Richard Tobin, Henry S. Thompson and Chris Brew

Introduction
------------

This package interfaces our high-performance validating C API for XML
to Python.  It is known to work with Python 1.6 and later, but the
binary version of this release is specialised to Python 2.2.  It
requires LT XML version 1.2.

Please report any difficulties or bugs which you encounter and we will
do our best to deal with them.

There is no documentation beyond this file: please refer to the LT XML
documentation for details of the C API and structures which are being
made available in Python by this package.  Many PyLTXML functions have
the same name as LT XML functions: check their documentation for
details.

This distribution is governed by the GNU General Public License: see
the accompanying Copyright and COPYING files for details.  In the case
of a binary distribution on its own, this means you may use PyLTXML
yourself for any purpose, but may not redistribute it in any form,
until and unless you obtain the source distribution as well, and
comply with the GPL with respect to redistribution.

See 00INSTALL for source installation instructions -- if you opened
one of the binary distributions far enough to be reading this, your
installation is almost certainly complete already.

The module PyLTXML defines several types, functions, constants and one
error.  The types are:

  FileType
  DoctypeType
  ElementTypeType
  ContentParticleType
  AttrDefnType
  BitType
  ItemType
  OOBType
  ERefType
  QueryType

There is no type corresponding to the C NSL_Data; the data field of an
Item is just a list of strings or unicode strings and Items.

The Python internal type objects for all the above types are exposed
as PyLTXML.xxxType, e.g. PyLTXML.FileType.

Values defined as enumerated types in the C version are represented as
strings in Python, and are listed below with the slots and functions
they apply to.

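As a rough illustration of the point that an Item's data is just a
list of strings and Items, the sketch below walks such a list and
gathers the text.  It does not import PyLTXML: FakeItem is a
hypothetical stand-in with only the label and data slots, and
collect_text and string_types are names invented for this example.
Treating a data value of None as "no content" is an assumption.

```python
# Sketch only: works on any object exposing a `data` slot as described above.
try:
    string_types = (basestring,)   # Python 2: covers str and unicode
except NameError:
    string_types = (str,)          # Python 3


def collect_text(item):
    """Return the concatenated text found in item.data, depth first."""
    parts = []
    for child in (item.data or ()):
        if isinstance(child, string_types):
            parts.append(child)
        else:
            # Anything that is not a string is assumed to be a nested Item.
            parts.append(collect_text(child))
    return "".join(parts)


class FakeItem(object):
    """Hypothetical stand-in for a PyLTXML Item, used only for this demo."""
    def __init__(self, label, data):
        self.label = label
        self.data = data


doc = FakeItem("p", ["Hello, ", FakeItem("b", ["world"]), "!"])
print(collect_text(doc))
```

The same function should apply unchanged to a real Item, since it
relies only on the data slot, but that has not been checked against
the library here.
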
The slots (or "attributes" as Python calls them) for each type are
tabulated below.  The value type is given in parentheses -- 'unicode'
stands for unicode string.

File:
  doctype           - (Doctype) the document type
  where             - current input entity location, a four-tuple of
                      (entityName (string), lineNum (int),
                      charPos (int), url (entity url))
  seenValidityError - (integer) 0/1 depending on whether a validity
                      error has not/has been seen so far, or 0 if not
                      validating

Doctype:
  [ ddb             - (string) name of DDB file, obsolete for XML ]
  encoding          - (string) name of (input) encoding
  xencoding         - (string) name of output encoding
  sdd               - (string) standalone declaration ("yes", "no",
                      "unspecified")
  elementTypes      - (dictionary->ElementType) element type
                      declarations from DTD
  entities          - (dictionary->unicode) internal or external
                      general entity declarations; internal value is
                      definition, external is SYSTEM id
  parameterEntities - (dictionary->unicode) similarly for parameter
                      entities
  name              - (unicode) DOCTYPE name, if any
  doctypeStatement  - (unicode) the entire DOCTYPE string

Bit:
  type    - one of "bad", "start", "end", "empty", "eof", "text",
            "pi", "doctype", "comment"
  item    - (Item) available if type is "start" or "empty"
  body    - (unicode) available if type is "text", "pi", "doctype"
            or "comment"
  label   - (unicode) available if type is "start", "empty", or "end"
  llabel  - (unicode) ditto, local part of tag
  prefix  - (unicode) ditto, prefix part of tag
  nsuri   - (string) ditto, namespace part of tag or None if
            unprefixed
  isCData - (boolean) available if type is "text"
  isERef  - (boolean) available if type is "text"

Item:
  type   - one of "inchoate", "non_empty", "empty", "free"
  label  - (unicode) element's tag
  llabel - (unicode) local part of element's tag
  prefix - (unicode) prefix part of tag
  nsuri  - (string) namespace URI part of element's tag, or None if
           not qualified
  nsdict - (dict -> string) namespace declarations in force
  data   - (tuple or list of unicode and Items)
  parent - (Item) the Item in whose data this item is, or None

ElementType:
  name      - (unicode)
  type      - one of "MIXED", "ANY", "EMPTY", "ELEMENT"
  particle  - (ContentParticle)
  attrDefns - (dictionary -> AttrDefn)

ContentParticle:
  type       - one of "#PCDATA", "NAME", "SEQUENCE", "CHOICE"
  name       - (unicode) only if type=="NAME"
  repetition - one of "?", "+", "*" or None
  children   - (list of ContentParticle)

AttrDefn:
  name          - (unicode)
  type          - one of "CDATA", "NMTOKEN", "ENTITY", "IDREF",
                  "NMTOKENS", "ENTITIES", "IDREFS", "ID", "NOTATION",
                  "ENUMERATION"
  defType       - one of "#REQUIRED", "#IMPLIED", "NONE", "#FIXED"
  defValue      - (unicode)
  allowedValues - (list of unicode)

OOB:
  type - one of "comment", "pi", "cdata"
  data - (unicode)

ERef:
  name - (unicode)

Query:
  (none)

All these slots are read-only, except Item.data.

Bits and Items are freed when there is no reference to them, and there
is no way to free them explicitly.  It should therefore be impossible
to get an item of type "free".  This automatic freeing can be disabled
via AutoFreeNSLObjects (see below).

The functions are listed below.  Optional arguments are in square
brackets; omitting them is equivalent to passing a value of None,
which has the same effect as passing NULL to the C function.  Only
non-obvious argument types are described.

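Before turning to the functions, here is a small sketch of how the Bit
slot table above might be consulted: which slots are read depends on
the value of the type slot.  FakeBit is a hypothetical stand-in for a
real Bit and summarise_bit is a name invented for this example;
neither is part of PyLTXML.

```python
def summarise_bit(bit):
    """Return a short tuple describing a Bit, reading only the slots
    that the slot table lists as available for its type."""
    kind = bit.type
    if kind in ("start", "empty"):
        # label and item are both available for these two types
        return (kind, bit.label, bit.item)
    if kind == "end":
        # only the label slots are available for an end tag
        return (kind, bit.label)
    if kind in ("text", "pi", "doctype", "comment"):
        return (kind, bit.body)
    # "bad" or "eof": no further slots are documented
    return (kind,)


class FakeBit(object):
    """Hypothetical stand-in for a PyLTXML Bit, for demonstration only."""
    def __init__(self, **slots):
        self.__dict__.update(slots)


end_summary = summarise_bit(FakeBit(type="end", label="para"))
text_summary = summarise_bit(FakeBit(type="text", body="some text",
                                     isCData=0, isERef=0))
print(end_summary)
print(text_summary)
```
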
Open(filename, [doctype,] type) -> File
    type is one of
      NSL_read or'ed with zero or more of
        NSL_read_all_bits, NSL_read_no_consume_prolog,
        NSL_read_no_normalise_attributes,
        NSL_read_declaration_warnings, NSL_read_strict,
        NSL_read_no_expand, NSL_read_validate, NSL_read_namespaces,
        NSL_read_defaulted_attributes, NSL_read_relaxed_any,
        NSL_read_allow_undeclared_nsattributes
    or
      NSL_write or'ed with zero or more of
        NSL_write_no_doctype, NSL_write_no_expand, NSL_write_plain,
        NSL_write_fancy, NSL_write_canonical, NSL_write_default,
        NSL_write_style

OpenStream(filename, [doctype,] type, encoding)
OpenURL(url, [doctype,] type, encoding)
    type as above
    encoding is the numerical value of an encoding taken from the
    dictionary CharacterEncodingNames

FOpen(pfile, [doctype,] type) -> File
    pfile is a Python file, e.g. sys.stdin
    type is as above

OpenString(string, [doctype,] readType) -> File
    readType is a read type as above

DoctypeFromDdb(filename) -> Doctype  [obsolete]

ItemParse(file, item) -> Item

GetNextBit(file) -> Bit or None
    Returns None at EOF so you should never see a bit of type "eof".

Close(file) -> None

Print(file, value) -> None
    value is a Bit, an Item or a string

ForceNewline(file) -> None

PrintStartTag(file, label) -> None
PrintTextLiteral(file, string) -> None
PrintEndTag(file, label) -> None
    Print, respectively, a start tag and attributes, text with entity
    references as required, and an end tag.
    label is a string.

GetAttrStringVal(item, name) -> string
GetAttrVal(item, name) -> string
    [no difference from GetAttrStringVal]

PutAttrVal(item, name, value) -> integer (boolean)
NewAttrVal(item, name, value) -> None
    name and value are unicode.  They are automatically "uniquified".

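The reading functions above can be combined into a simple loop, as in
the following sketch, which counts bits by type.  It uses only Open,
GetNextBit and Close with the signatures given in this file, and
assumes that the NSL_read constant is exposed as a module attribute.
The module is passed in as a parameter so that the sketch can be
exercised without PyLTXML installed: FakeLTXML is a hypothetical
stand-in written in modern Python, and count_bits is a name invented
for this example.

```python
def count_bits(ltxml, filename):
    """Read every bit from filename and return a dict of type -> count.

    `ltxml` is expected to offer Open, GetNextBit, Close and NSL_read
    as described in this file."""
    f = ltxml.Open(filename, ltxml.NSL_read)
    counts = {}
    try:
        bit = ltxml.GetNextBit(f)
        while bit is not None:          # None signals end of file
            counts[bit.type] = counts.get(bit.type, 0) + 1
            bit = ltxml.GetNextBit(f)
    finally:
        ltxml.Close(f)                  # always release the file
    return counts


class FakeLTXML(object):
    """Hypothetical stand-in mimicking the three calls used above."""
    NSL_read = 1   # placeholder value, not the real constant

    class _Bit(object):
        def __init__(self, type):
            self.type = type

    def __init__(self):
        self.closed = False

    def Open(self, filename, type):
        # Pretend every file contains <x>text</x>.
        return iter([self._Bit(t) for t in ("start", "text", "end")])

    def GetNextBit(self, f):
        return next(f, None)

    def Close(self, f):
        self.closed = True


fake = FakeLTXML()
counts = count_bits(fake, "small.xml")
print(counts)
```

With the real module the call would presumably be
count_bits(PyLTXML, "small.xml") after "import PyLTXML"; that form has
not been tested here.
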
ItemActualAttributes(item) -> list of attribute name,value pairs
ItemActualAttributesNS(item) -> list of name,value,nsuri,localName
                                tuples

LookupPrefix(item, prefix) -> nsuri
    prefix is unicode

ParseQuery(doctype, querystring) -> Query
ParseQueryR(doctype, querystring) -> Query
    querystring is a string

GetNextQueryItem(infile, query, [outfile]) -> Item or None

RetrieveQueryItem(item, query, [fromitem]) -> Item or None
    fromitem should be None (or omitted) on the first call

RetrieveQueryData(item, query) -> None  -- not implemented yet

Item(doctype, label, data)
    Creates a new Item.  data should be a list, or None to create an
    empty Item; note that this is different from passing an empty
    list, which creates a non-empty Item with no content.

AutoFreeNSLObjects(boolean) -> None
    Turn garbage collection of NSLItems etc. on or off (default is on)

Various error conditions signal an error; the error object is
XMLinter.error, available as the value of PyLTXML.error.

A list of all recognised character encodings is available as
PyLTXML.CharacterEncodingNames.

EXAMPLE USAGE

This distribution includes simple.py, a minimal example program, with
an associated data file, in the example sub-directory.

Command:
  python simple.py < small.xml

Output:
  ('unknown', 'UTF-8')
  [(u'ID', u'P12830'), (u'GSC', u'123.676139019108'), (u'K1', u'71'),
   (u'K2', u'9352'), (u'K3', u'10887'), (u'K4', u'7782277'),
   (u'TYPE', u'UNCODED')]

CONTACTS

See http://www.ltg.ed.ac.uk/software/xml/ to get LT XML, and for
documentation pointers for it.

This software was downloaded from
ftp://www.ltg.ed.ac.uk/pub/LTXML/PyLTXML-1.3....

Send comments or questions to HThompson@ed.ac.uk