.\" obligatory man page for utf library
.\" $Header: /home/agc/src/ure-2.6/RCS/utf.3,v 1.2 1997/01/15 13:19:13 agc Exp agc $
.TH UTF 3
.SH NAME
runetochar, chartorune, runelen, fullrune, utflen, utfrune, utfrrune, utfutf \- \fBUnicode Text Format\fR functionality
.SH SYNOPSIS
.nf
\fB#include <utf.h>\fR
.sp
int \fBrunetochar\fR(\fIchar *cp, Rune *rp\fR);
.sp
int \fBchartorune\fR(\fIRune *rp, char *cp\fR);
.sp
int \fBrunelen\fR(\fIlong r\fR);
.sp
int \fBfullrune\fR(\fIchar *cp, int n\fR);
.sp
int \fButflen\fR(\fIchar *s\fR);
.sp
int \fButfbytes\fR(\fIchar *s\fR);
.sp
char *\fButfrune\fR(\fIchar *cp, long r\fR);
.sp
char *\fButfrrune\fR(\fIchar *cp, long r\fR);
.sp
char *\fButfutf\fR(\fIchar *big, char *little\fR);
.sp
int \fButf_snprintf\fR(\fIchar *buf, size_t size, char *format, ...\fR);
.sp
int \fButfcmp\fR(\fIchar *s1, char *s2\fR);
.sp
int \fButfncmp\fR(\fIchar *s1, char *s2, int rc\fR);
.sp
char *\fButfcpy\fR(\fIchar *dst, char *src\fR);
.sp
char *\fButfncpy\fR(\fIchar *dst, char *src\fR, \fIint nbytes\fR);
.sp
char *\fButfcat\fR(\fIchar *src, char *append\fR);
.sp
char *\fButfncat\fR(\fIchar *src, char *append\fR, \fIint nbytes\fR);
.sp
.fi
.SH DESCRIPTION
.PP
The \fBUTF\fR routines pack Unicode text into a standard character stream.
To keep that stream compatible with existing text, the 128 ASCII characters (values 0 to 127) are encoded in UTF-8 as the same single bytes,
so these characters are interchangeable between the two character sets.
A \fIRune\fR is a Unicode character, defined in the header file \fIutf.h\fR.
.PP
\fBrunetochar\fR translates a single \fIRune\fR to a \fBUTF\fR sequence and returns the number of bytes produced.
\fBchartorune\fR is the inverse of this function: it translates a \fBUTF\fR sequence to a single \fIRune\fR and returns the number of bytes consumed.
\fBrunelen\fR returns the number of bytes in the encoding of a \fIRune\fR.
\fBfullrune\fR checks that the first \fIn\fR bytes of the \fBUTF\fR string \fIcp\fR contain a complete \fBUTF\fR encoding.
.PP
\fButflen\fR returns the number of runes in a \fBUTF\fR string.
\fButfbytes\fR returns the number of bytes in a \fBUTF\fR string.
\fButfrune\fR returns a pointer to the first occurrence of a rune in a \fBUTF\fR string.
\fButfrrune\fR returns a pointer to the last.
\fButfutf\fR searches for the first occurrence of a \fBUTF\fR string in another \fBUTF\fR string.
.PP
\fButf_snprintf\fR is a particularly dumb implementation of snprintf for \fBUTF\fR strings:
it interprets only the %%, %s and %d sequences in the format string, and does no field width calculation on those.
.PP
\fButfcmp\fR compares two strings lexicographically, Rune by Rune, and returns a value greater than zero, equal to zero, or less than zero depending on whether the first \fBUTF\fR string is greater than, the same as, or less than the second string.
\fButfncmp\fR does the same comparison as \fButfcmp\fR, but compares at most \fIrc\fR Runes.
.PP
\fButfcpy\fR copies from source to destination, Rune by Rune, and returns the \fIdst\fR argument.
No bounds checking is done on the number of Runes copied, or their individual sizes.
\fButfncpy\fR copies at most \fInbytes\fR bytes from source to destination, stopping when a null Rune is found in the source.
If the number of bytes copied is less than \fInbytes\fR, the destination string is padded with null (0) bytes.
If it is equal to or greater than \fInbytes\fR, no null byte is added, so the destination is not null-terminated.
The \fIdst\fR argument is returned.
\fButfcat\fR appends the \fBUTF\fR string \fIappend\fR onto the \fBUTF\fR string \fIsrc\fR.
\fButfncat\fR appends the \fBUTF\fR string \fIappend\fR onto the \fBUTF\fR string \fIsrc\fR, bearing in mind that the buffer \fIsrc\fR is only \fInbytes\fR long.
.SH IMPLEMENTATION
This implementation of \fBUTF\fR, nominally \fBUTF-8\fR, can encode a null Unicode character using a one-byte or a two-byte encoding.
Typically, Plan 9 uses a one-byte encoding, whilst Java uses a two-byte encoding.
The Plan 9 style of encoding makes backwards compatibility much easier, and loses nothing:
all the Java functionality is there;
there are no embedded null bytes in a \fBUTF\fR string, because of the way the second and third bytes of a multi-byte sequence are encoded;
and ordinary C strings are recognised as well, which is not the case in Java.
By default, the one-byte encoding of the null character is used.
.PP
\fBUTF-8\fR is defined in X/Open Company Ltd., "File System Safe UCS Transformation Format (FSS_UTF)", X/Open Preliminary Specification, Document Number: P316, which also appears in ISO/IEC 10646, Annex P.
.SH "BUGS"
Undoubtedly, these are many, and legion.
.SH AUTHOR
.PP
Written by Alistair Crooks (agc@amdahl.com, or agc@westley.demon.co.uk), from a draft document written by Rob Pike and Ken Thompson, detailing the implementation of \fBUTF\fR in the Plan 9 operating system.
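.SH EXAMPLES
.PP
The following is a minimal sketch, not the library source, of the kind of encoding and decoding that \fBrunetochar\fR and \fBchartorune\fR are described as performing, including the one-byte and two-byte encodings of the null character discussed under IMPLEMENTATION.
The names \fIsketch_runetochar\fR, \fIsketch_chartorune\fR and \fISketchRune\fR, the \fIjava_null\fR flag, the 16-bit rune width and the 0xFFFD error value are all assumptions made for illustration; none of them comes from \fIutf.h\fR, and the library may behave differently.
.PP
.nf
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for Rune; the real type in utf.h may differ. */
typedef unsigned short SketchRune;

/*
 * Encode one rune as UTF-8 into cp, which must have room for 3 bytes.
 * Returns the number of bytes produced.  If java_null is non-zero,
 * rune 0 is written as the two bytes 0xC0 0x80 instead of one zero byte.
 */
int
sketch_runetochar(char *cp, const SketchRune *rp, int java_null)
{
    unsigned long r;
    unsigned char *p;

    assert(cp != NULL && rp != NULL);
    r = *rp;
    p = (unsigned char *)cp;
    if (r < 0x80 && !(r == 0 && java_null)) {
        p[0] = (unsigned char)r;            /* ASCII: one identical byte */
        return 1;
    }
    if (r < 0x800) {
        p[0] = (unsigned char)(0xC0 | (r >> 6));
        p[1] = (unsigned char)(0x80 | (r & 0x3F));
        return 2;
    }
    p[0] = (unsigned char)(0xE0 | (r >> 12));
    p[1] = (unsigned char)(0x80 | ((r >> 6) & 0x3F));
    p[2] = (unsigned char)(0x80 | (r & 0x3F));
    return 3;
}

/*
 * Decode one rune from the null-terminated string cp into *rp.
 * Returns the number of bytes consumed.  Both encodings of rune 0
 * are accepted.  Malformed input yields 0xFFFD and consumes one byte;
 * that error convention is an assumption of this sketch.
 */
int
sketch_chartorune(SketchRune *rp, const char *cp)
{
    const unsigned char *p;

    assert(rp != NULL && cp != NULL);
    p = (const unsigned char *)cp;
    if (p[0] < 0x80) {
        *rp = p[0];
        return 1;
    }
    if ((p[0] & 0xE0) == 0xC0 && (p[1] & 0xC0) == 0x80) {
        *rp = (SketchRune)(((p[0] & 0x1F) << 6) | (p[1] & 0x3F));
        return 2;
    }
    if ((p[0] & 0xF0) == 0xE0 && (p[1] & 0xC0) == 0x80 &&
        (p[2] & 0xC0) == 0x80) {
        *rp = (SketchRune)(((p[0] & 0x0F) << 12) |
            ((p[1] & 0x3F) << 6) | (p[2] & 0x3F));
        return 3;
    }
    *rp = 0xFFFD;
    return 1;
}
```
.fi
.PP
Under these assumptions, the rune 0x20AC encodes to the three bytes 0xE2 0x82 0xAC, and a null rune encodes either to a single zero byte or, with \fIjava_null\fR set, to the two bytes 0xC0 0x80.
The decoder sketch accepts both forms of the null rune.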