.\" obligatory man page for utf library
.\" $Header: /home/agc/src/ure-2.6/RCS/utf.3,v 1.2 1997/01/15 13:19:13 agc Exp agc $
.TH UTF 3
.SH NAME
runetochar, chartorune, runelen, fullrune, utflen, utfrune, utfrrune, utfutf \- \fBUCS Transformation Format\fR (UTF) functionality
.SH SYNOPSIS
.nf
\fB#include <utf.h>\fR
.sp
int \fBrunetochar\fR(\fIchar *cp, Rune *rp\fR);
.sp
int \fBchartorune\fR(\fIRune *rp, char *cp\fR);
.sp
int \fBrunelen\fR(\fIlong r\fR);
.sp
int \fBfullrune\fR(\fIchar *cp, int n\fR);
.sp
int \fButflen\fR(\fIchar *s\fR);
.sp
int \fButfbytes\fR(\fIchar *s\fR);
.sp
char *\fButfrune\fR(\fIchar *cp, long r\fR);
.sp
char *\fButfrrune\fR(\fIchar *cp, long r\fR);
.sp
char *\fButfutf\fR(\fIchar *big, char *little\fR);
.sp
int \fButf_snprintf\fR(\fIchar *buf, size_t size, char *format, ...\fR);
.sp
int \fButfcmp\fR(\fIchar *s1, char *s2\fR);
.sp
int \fButfncmp\fR(\fIchar *s1, char *s2, int rc\fR);
.sp
char *\fButfcpy\fR(\fIchar *dst, char *src\fR);
.sp
char *\fButfncpy\fR(\fIchar *dst, char *src\fR, \fIint nbytes\fR);
.sp
char *\fButfcat\fR(\fIchar *src, char *append\fR);
.sp
char *\fButfncat\fR(\fIchar *src, char *append\fR, \fIint nbytes\fR);
.sp
.fi
.SH DESCRIPTION
.PP
The \fBUTF\fR routines pack Unicode text into an ordinary
byte-oriented character stream.
To do that effectively, the 128 ASCII characters (values 0 to 127) form
the lowest characters of UTF-8 and are encoded as single bytes with the
same values, so they are interchangeable between the two character sets.
A \fIRune\fR is a single Unicode character; the type is defined in the
header file \fIutf.h\fR.
.PP
\fBrunetochar\fR translates a single Rune to a UTF sequence
and returns the number of bytes produced. \fBchartorune\fR
is the inverse of this function, returning the number of
bytes consumed.
\fBrunelen\fR returns the number of bytes in the encoding
of a \fIRune\fR.
\fBfullrune\fR checks that the first \fIn\fR bytes of the
\fBUTF\fR string \fIcp\fR contain the complete \fBUTF\fR encoding
of a \fIRune\fR.
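.PP
The byte layout behind these conversions can be sketched as follows.
This is an illustration only and not the code of this library: the names
\fIsketch_runelen\fR, \fIsketch_runetochar\fR, \fIsketch_chartorune\fR and
the type \fISketchRune\fR are hypothetical, a 16-bit rune is assumed, and
the value stored for malformed input (0xFFFD) is likewise an assumption.
.nf
```c
#include <stddef.h>

/* Hypothetical stand-in for the Rune type; 16 bits are assumed here. */
typedef unsigned short SketchRune;

/* Number of bytes needed to encode r (one, two or three). */
int sketch_runelen(unsigned long r)
{
    if (r < 0x80)
        return 1;
    if (r < 0x800)
        return 2;
    return 3;
}

/* Encode *rp into cp; return the number of bytes produced. */
int sketch_runetochar(char *cp, const SketchRune *rp)
{
    unsigned long r = *rp;

    if (r < 0x80) {                 /* 0xxxxxxx */
        cp[0] = (char)r;
        return 1;
    }
    if (r < 0x800) {                /* 110xxxxx 10xxxxxx */
        cp[0] = (char)(0xC0 | (r >> 6));
        cp[1] = (char)(0x80 | (r & 0x3F));
        return 2;
    }
    /* 1110xxxx 10xxxxxx 10xxxxxx */
    cp[0] = (char)(0xE0 | (r >> 12));
    cp[1] = (char)(0x80 | ((r >> 6) & 0x3F));
    cp[2] = (char)(0x80 | (r & 0x3F));
    return 3;
}

/* Decode one sequence from cp into *rp; return the bytes consumed.
 * Overlong forms are not rejected in this sketch. */
int sketch_chartorune(SketchRune *rp, const char *cp)
{
    const unsigned char *p = (const unsigned char *)cp;

    if (p[0] < 0x80) {
        *rp = p[0];
        return 1;
    }
    if ((p[0] & 0xE0) == 0xC0 && (p[1] & 0xC0) == 0x80) {
        *rp = (SketchRune)(((p[0] & 0x1F) << 6) | (p[1] & 0x3F));
        return 2;
    }
    if ((p[0] & 0xF0) == 0xE0 && (p[1] & 0xC0) == 0x80
        && (p[2] & 0xC0) == 0x80) {
        *rp = (SketchRune)(((p[0] & 0x0F) << 12)
            | ((p[1] & 0x3F) << 6) | (p[2] & 0x3F));
        return 3;
    }
    *rp = 0xFFFD;                   /* assumed error value */
    return 1;
}
```
.fi
.PP
In this sketch, encoding the rune 0x20AC yields the three bytes
0xE2 0x82 0xAC, and decoding those bytes returns 3 and stores 0x20AC.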
.PP
\fButflen\fR returns the number of runes in a \fBUTF\fR string.
\fButfbytes\fR returns the number of bytes in a \fBUTF\fR string.
\fButfrune\fR returns a pointer to the first occurrence of
a rune in a \fBUTF\fR string.
\fButfrrune\fR returns a pointer to the last occurrence.
\fButfutf\fR searches for the first occurrence of a \fBUTF\fR string
in another \fBUTF\fR string.
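.PP
The difference between counting runes and counting bytes can be sketched
as follows.
This is an illustration under stated assumptions, not the code of this
library: the name \fIsketch_utflen\fR is hypothetical, and the input is
assumed to be well-formed, so that every byte which is not a continuation
byte (bit pattern 10xxxxxx) starts a new rune.
.nf
```c
#include <stddef.h>

/* Count the runes in the null-terminated UTF string s by counting
 * the bytes that are not continuation bytes. Assumes valid input. */
int sketch_utflen(const char *s)
{
    int n = 0;

    for (; *s != 0; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```
.fi
.PP
For a string holding the bytes 0x61 0xC3 0xA9 0x62 this sketch counts
three runes, while the byte count of the same string is four.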
.PP
\fButf_snprintf\fR is a particularly dumb implementation of snprintf
for \fBUTF\fR strings: it interprets only the %%, %s and %d sequences
in the format string, and does no field width calculation on those.
.PP
\fButfcmp\fR compares two strings lexicographically, Rune by Rune,
and returns a value greater than zero, equal to zero, or less than zero
depending on whether the first \fBUTF\fR string is greater than, the
same as, or less than the second string.
\fButfncmp\fR does the same comparison as \fButfcmp\fR, but compares
at most \fIrc\fR Runes.
.PP
\fButfcpy\fR copies from source to destination, Rune by Rune.
No bounds checking is done on the number of Runes copied, or on
their individual sizes.
The \fIdst\fR argument is returned.
\fButfncpy\fR copies at most \fInbytes\fR bytes from source to destination,
stopping when a null Rune is found in the source. If fewer than
\fInbytes\fR bytes are copied, the remainder of the destination string is
padded with null (0) bytes. If \fInbytes\fR or more bytes are available
in the source, no null byte is added.
The \fIdst\fR argument is returned.
\fButfcat\fR appends the UTF string \fIappend\fR onto the UTF string \fIsrc\fR.
\fButfncat\fR appends the UTF string \fIappend\fR onto the UTF string \fIsrc\fR,
bearing in mind that the buffer \fIsrc\fR is only \fInbytes\fR long.
.SH IMPLEMENTATION
This implementation of \fBUTF\fR, nominally \fBUTF-8\fR, can encode a null Unicode
character using either a one-byte or a two-byte encoding.
Typically, Plan 9 uses a one-byte encoding, whilst Java uses a two-byte
encoding.
The Plan 9 style of encoding makes backwards compatibility much easier, and
loses nothing: all the Java functionality is there, there are no embedded
null bytes in a UTF string, because the second and third bytes of a
multi-byte sequence are never zero,
and ordinary C strings are recognised as well, which is not the case in Java.
By default, the one-byte encoding of the null character is used.
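.PP
The two encodings of the null character can be sketched as follows.
This is an illustration only: the function \fIsketch_encode_null\fR and
its \fItwo_byte\fR flag are hypothetical and are not part of this library,
and the two-byte form shown is the pair 0xC0 0x80 used by the modified
UTF-8 of Java.
.nf
```c
#include <stddef.h>

/* Write an encoding of the null rune into cp and return its length.
 * two_byte == 0: Plan 9 style, a single zero byte.
 * two_byte != 0: Java style, the overlong pair 0xC0 0x80, which
 * contains no zero byte. The flag is a hypothetical parameter. */
int sketch_encode_null(char *cp, int two_byte)
{
    if (!two_byte) {
        cp[0] = 0;
        return 1;
    }
    cp[0] = (char)0xC0;
    cp[1] = (char)0x80;
    return 2;
}
```
.fi
.PP
With the one-byte form, an encoded null character is indistinguishable from
the terminator of an ordinary C string; with the two-byte form it can
appear inside a null-terminated buffer.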
.PP
\fBUTF-8\fR is defined
in X/Open Company Ltd., "File System Safe UCS Transformation Format (FSS_UTF)",
X/Open Preliminary Specification, Document Number: P316, which also appears
in ISO/IEC 10646, Annex P.
.SH "BUGS"
Undoubtedly, these are many and legion.
.SH AUTHOR
.PP
Written by Alistair Crooks (agc@amdahl.com, or agc@westley.demon.co.uk),
from a draft document written by Rob Pike and Ken Thompson, detailing
the implementation of \fBUTF\fR in the Plan 9 operating system.