SYNOPSIS

       unidesc ([option flags]) (<file name>)

       If  no  input  file  name  is supplied, unidesc reads from the standard
       input.


DESCRIPTION

       unidesc describes the content of a Unicode text file by  reporting  the
       character  ranges  to which different portions of the text belong.  The
       ranges reported include both official Unicode ranges and the constructed
       language ranges within the Private Use Areas registered with the
       ConScript Unicode Registry (http://www.evertype.com/standards/csur/).
       For each range of characters, unidesc prints the character or byte
       offset of the beginning of the range, the character or byte offset of
       the end of the range, and the name of the range.  Offsets
       start from 0.
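
       The reporting described above can be sketched in Python. This is
       only an illustration, not the program's actual code: the RANGES
       table is a hypothetical, heavily abbreviated stand-in for the full
       range list, the names range_name and describe are invented here,
       and the end offset is assumed to be inclusive. Neutral characters
       are ignored in this sketch.

```python
from itertools import groupby

# Hypothetical, abbreviated range table: (first, last, name).
# A real table would cover every Unicode block and the CSUR ranges.
RANGES = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0400, 0x04FF, "Cyrillic"),
]


def range_name(ch):
    """Return the name of the range containing the character ch."""
    cp = ord(ch)
    for first, last, name in RANGES:
        if first <= cp <= last:
            return name
    return "Unknown"


def describe(text):
    """Return (start, end, name) triples using 0-based character offsets.

    Assumption: the end offset is inclusive.
    """
    runs = []
    offset = 0
    # groupby collects consecutive characters that share a range name.
    for name, group in groupby(text, key=range_name):
        length = len(list(group))
        runs.append((offset, offset + length - 1, name))
        offset += length
    return runs
```

       Under these assumptions, describe("abcαβγ") yields one run for
       offsets 0 to 2 (Basic Latin) and one for offsets 3 to 5 (Greek and
       Coptic).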

       Since the ASCII digits, punctuation, and whitespace characters are fre-
       quently  used by other writing systems, by default these characters are
       treated as neutral, that is, as not belonging exclusively to  any  par-
       ticular  character range.  These characters are treated as belonging to
       the range of whatever characters precede them.

       If the input begins  with  neutral  characters,  they  are  treated  as
       belonging  to the range of whatever characters follow them. If the file
       consists entirely of neutral characters, the  range  is  identified  as
       Neutral followed by Basic Latin in square brackets.
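
       The neutral character rules can be sketched as follows. This is a
       minimal illustration under stated assumptions, not the program's
       implementation: the function names are invented, toy_range_name is
       a hypothetical two-range lookup, and the exact spelling of the
       all-neutral label is a guess based on the description above.

```python
import string

# ASCII digits, punctuation, and whitespace are neutral by default.
NEUTRAL = set(string.digits + string.punctuation + string.whitespace)


def toy_range_name(ch):
    # Hypothetical stand-in for a real range lookup.
    return "Greek and Coptic" if 0x0370 <= ord(ch) <= 0x03FF else "Basic Latin"


def assign_ranges(text, range_name=toy_range_name):
    """Label every character with a range name, resolving neutrals."""
    labels = [None if ch in NEUTRAL else range_name(ch) for ch in text]

    # Neutral characters inherit the range of whatever precedes them.
    previous = None
    for i, label in enumerate(labels):
        if label is None:
            labels[i] = previous
        else:
            previous = label

    # Any labels still unset are leading neutrals: they inherit the range
    # of the first non-neutral character, if there is one.
    first = next((label for label in labels if label is not None), None)
    if first is None:
        # Assumed spelling of the label for entirely neutral input.
        first = "Neutral [Basic Latin]"
    return [first if label is None else label for label in labels]
```

       With this sketch, every character of "12 αβ, 3" is assigned to
       Greek and Coptic: the leading digits and space take the range that
       follows them, and the trailing punctuation, space, and digit take
       the range that precedes them.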

       A magic number identifying the Unicode encoding is not part of the Uni-
       code standard, so pure Unicode files do not  contain  a  magic  number.
       However,  informal  conventions  have  arisen for this purpose.  If the
       command line flag -m is given, unidesc will  attempt  to  identify  the
       Unicode  subtype  by examining the first few bytes of the input. If the
       input is identified as one of the two acceptable types, UTF-8 or native
       order  UTF-32,  it  will  then  proceed to describe the contents of the
       input. Otherwise, it will report what it has  learned  and  exit.  Note
       that if the file does contain a magic number, you must use the -m flag.
       Without this flag unidesc assumes that the input consists of pure
       Unicode with the character data beginning immediately, so it will
       misinterpret the magic number as character data.
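
       The check performed with -m might look like the following sketch.
       It assumes that the magic number in question is the byte order
       mark (U+FEFF) as serialized in each encoding, which this page does
       not state explicitly. The names sniff and is_acceptable are
       invented for illustration.

```python
import codecs
import sys

# Candidate marks. The UTF-32 marks are tested before the UTF-16 marks
# because the UTF-32 little-endian mark begins with the same two bytes
# (FF FE) as the UTF-16 little-endian mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32 little-endian"),
    (codecs.BOM_UTF32_BE, "UTF-32 big-endian"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16 little-endian"),
    (codecs.BOM_UTF16_BE, "UTF-16 big-endian"),
]


def sniff(data):
    """Return (name, mark_length), or (None, 0) if no known mark is found."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def is_acceptable(name):
    """True for the two acceptable types: UTF-8 and native order UTF-32."""
    if sys.byteorder == "little":
        native = "UTF-32 little-endian"
    else:
        native = "UTF-32 big-endian"
    return name in ("UTF-8", native)
```

       A caller would skip mark_length bytes before describing the text
       when the type is acceptable, and otherwise report the detected
       name and exit.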

       By default, input is expected to be UTF-8. Native order UTF-32 is  also
       acceptable.   UTF-32  may be specified via the command line flag -u or,
       if the command line flag -m is given, via the magic number.
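
       "Native order" means that each 32-bit code point is stored in the
       byte order of the machine running the program. A rough Python
       equivalent of reading such input, with the hypothetical helper
       name decode_native_utf32, is:

```python
import sys


def decode_native_utf32(data):
    """Decode UTF-32 bytes stored in this machine's byte order.

    The explicit -le/-be codecs do not consume a leading byte order
    mark; they decode it as the ordinary character U+FEFF.
    """
    codec = "utf-32-le" if sys.byteorder == "little" else "utf-32-be"
    return data.decode(codec)
```

       Input written on a machine with the opposite byte order would not
       decode correctly this way, which is why only native order is
       described as acceptable.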



COMMAND LINE FLAGS

       -b     Give file offsets in bytes rather than characters.

       -d     Treat the ASCII digits as belonging  exclusively  to  the  Basic
              Latin range.

       -m     Examine the first few bytes of the input for a magic number
              identifying the Unicode encoding.

       -u     Input is native order UTF-32.

       -v     Print version information.

       -w     Treat ASCII whitespace as belonging  exclusively  to  the  Basic
              Latin range.
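
       Character and byte offsets differ for UTF-8 input whenever a
       character occupies more than one byte. The sketch below shows the
       relationship for a single range. It assumes inclusive offsets and
       assumes that the end byte offset points at the last byte of the
       last character; neither detail is stated on this page, and
       char_to_byte_offsets is a name invented for illustration.

```python
def char_to_byte_offsets(text, start, end):
    """Map an inclusive character range to an inclusive UTF-8 byte range."""
    # Bytes occupied by everything before the first character of the range.
    byte_start = len(text[:start].encode("utf-8"))
    # Bytes occupied up to and including the last character of the range.
    byte_end = len(text[:end + 1].encode("utf-8")) - 1
    return byte_start, byte_end
```

       For the text "abαβ", characters 2 to 3 correspond to bytes 2 to 5
       under these assumptions, since each Greek letter takes two bytes
       in UTF-8.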



SEE ALSO

       uniname


REFERENCES

       Unicode Standard, version 5.0


AUTHOR

       Bill Poser
       billposer@alum.mit.edu


LICENSE

       GNU General Public License



                                  June, 2007                        unidesc(1)
