/*
Copyright (C) 1999-2002 Ricardo Ueda Karpischek
This is free software; you can redistribute it and/or modify
it under the terms of the version 2 of the GNU General Public
License as published by the Free Software Foundation.
This software is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this software; if not, write to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307,
USA.
*/
/*
book.c: Documentation only
*/
/*
This module does not contain code, but only documentation blocks
that are'nt (currently) attached to specific pieces of code on
the other modules.
*/
/* (tutorial)
NAME
----
clara - a cooperative OCR
SYNOPSIS
--------
clara [options]
DESCRIPTION
-----------
Welcome. Clara OCR is a free OCR, written for systems supporting
the C library and the X Windows System. Clara OCR is intended for the
cooperative OCR of books. There are some screenshots available at
CLARA_HOME.
This documentation is extracted automatically from the comments
of the Clara OCR source code. It is known as "The Clara OCR
Tutorial". There is also an advanced manual known as "The Clara
OCR Advanced User's Manual" (man page clara-adv(1), also
available in HTML format). Developers must read "The Clara OCR
Developer's Guide" (man page clara-dev(1), also available in HTML
format).
CONTENTS
--------
Making OCR
Starting Clara
Some few command-line switches
Training symbols
Saving the session
OCR steps
Classification
Note about how Clara OCR classification works
Building the output
Handling broken symbols
Handling accents
Browsing the book font
Useful hints
Fun codes
AVAILABILITY
CREDITS
*/
/* (book)
NAME
----
clara - a cooperative OCR
SYNOPSIS
--------
clara [options]
DESCRIPTION
-----------
Welcome. Clara OCR is a free OCR, written for systems supporting
the C library and the X Windows System. Clara OCR is intended for
the cooperative OCR of books. There are some screenshots
available at CLARA_HOME.
This documentation is extracted automatically from the comments
of the Clara OCR source code. It is known as "The Clara OCR
Advanced User's Manual". It's currently unfinished. First-time
users are invited to read "The Clara OCR Tutorial". Developers
must read "The Clara OCR Developer's Guide".
CONTENTS
--------
Welcome to Clara OCR
Early historical notes
Design notes
Supported Alphabets
Clara vs the others
The requirements
How to download and compile Clara
Compilation and startup pitfalls
A first OCR project
Scanning and thresholding
Manual and histogram-based (global)
Classification-based (local)
Classification-based (global)
Avoiding or correcting skew
The work directory
Building the book font
Skeleton tuning
Classification tentatives
Alignment tuning
Complex procedures
Using two directories
Adding a page
Multiple books
Adding a book
Removing a page
Dealing with classification errors
Rebuilding session files
Importing revision data
How to use the web interface
Revision acts maintenance
Analysing the statistics
Upgrading Clara OCR
Reference of the Clara GUI
The application window
Tabs and windows
The Application Buttons
The Alphabet Map
Reference of the menus
File menu
Edit menu
View menu
Alphabets menu
Options menu
PAGE options menu
PAGE_FATBITS options menu
OCR steps menu
Reference of command-line switches
AVAILABILITY
CREDITS
*/
/* (devel)
NAME
----
clara - a cooperative OCR
SYNOPSIS
--------
clara [options]
DESCRIPTION
-----------
Welcome. Clara OCR is a free OCR, written for systems supporting
the C library and the X Windows System. Clara OCR is intended for the
cooperative OCR of books. There are some screenshots available at
CLARA_HOME.
This documentation is extracted automatically from the comments
of the Clara OCR source code. It is known as "The Clara OCR
Developer's Guide". It's currently unfinished. First-time users
are invited to read "The Clara OCR Tutorial". There is also an
advanced manual known as "The Clara OCR Advanced User's Manual".
CONTENTS
--------
Introducing the source code
Language and environment
Modularization
The memory allocator
Security notes
Runtime index checking
Background operation
Global variables
Path variables
Bitmaps
Execution model
Return codes
Internal representation of pages
Closures
Symbols
The sdesc structure and the mc array
The preferred symbols
Font size
Symbol alignment
Words and lines
Acts and transliterations
Symbol transliterations
Transliteration preference
Transliteration class computing
The zones
Heuristics
Skeleton pixels
Symbol pairing
The build step
Resetting
Synchronization
The function list_cl
The GUI
Main characteristics
Geometry of the application window
Geometry of windows
Scrollbars
Displaying bitmaps
HTML windows overview
Graphic elements
XML support
Auto-submission of forms
The Clara API
Redraw flags
OCR statuses
The function setview
The function redraw
The function show_hint
The function start_ocr
How to change the source code (examples)
How to add a bitmap comparison method
How to write a bitmap comparison function
How to add an application button
Bugs and TODO list
AVAILABILITY
CREDITS
*/
/* (book)
Early historical notes
----------------------
For some years now we have tested and used OCR softwares, mainly
for old books. Popular OCR softwares (those bundled with
scanners) are useful tools. However, OCR is not a simple
task. The results obtained using those programs vary largely
depending on the the printed document, and, for most texts we're
interested on, the results are really poor or even unusable. In
fact, it's not a surprise that many digitalization projects
prefer not to use OCR, but typists only.
For a programmer, it is somewhat intuitive that OCR could achieve
good results even from low quality texts, when an add-hoc
approach is used, focusing one specific book (for
instance). Within this approach, OCR becomes a matter of finding
one software adequate for the texts you're trying to OCR, or
perhaps develop a new one. So a free and easy to customize OCR
(on the source code level) would be a valuable resource for text
digitalization projects.
Dealing with graphics is not among our main occupations, but
after analysing many scanned materials, we began to write some
simple and specialized recognition tools. More recently (in the
third quarter of 1999) a simple X interface linked to a naive
bitmap comparison heuristic was written. From that prototype,
Clara OCR evolved. Since then, many new ideas from various
persons helped to make it better.
Design notes
------------
It's not a bad idea to enumerate some principles that have driven
Clara OCR development. They'll make easier to understand the
features and limitations of the software (these principles may
change along time).
1. Clara is an OCR for printed texts, not for handwritten
texts.
2. Clara was not designed to be used to OCR one or two single
pages, but to OCR a large number of documents with the same
graphic characteristics (font, size, etc). So it can take
advantage of a fine (and perhaps expensive) training. This will
be tipically the case when OCRing an entire book.
3. We chose not support directly multiple graphic formats, but
only Jeff Poskanzer's raw PBM and PGM. Non-PBM/PGM files will be
read through filters.
4. Clara OCR wants to be a tool that makes viable the sum and
reuse of human revision effort. Because of this, on the OCR model
implemented by Clara, training and revision are one same
thing. The revision is a sum of punctual and independent acts and
alternates with reprocessing steps along a refinement process.
5. The Clara GUI was implemented and behaves like a minimalistic
HTML viewer. This is just an easy and standard way to implement a
forms interface.
6. We have tried to make the source code portable across
platforms that support the C library and the Xlib. Clara has no
special provision to be ported to environments that do not
support the Xlib. We avoided to use a higher level graphic
environment like Motif, GTK or Qt, but we do not discourage
initiatives to add code to Clara OCR adapt or adapt better to
these or other graphic environments.
7. We generally try to make the code efficient in terms of RAM
usage. CPU and disk usage (for session files) are less prioritary.
Clara vs the others
-------------------
Clara differs from other OCR softwares in various aspects:
1. Most known OCRs are non-free and Clara is free. Clara focus
the X Windows System. Clara offers batch processing, a web
interface and supports cooperative revision effort.
2. Most OCR softwares focus omnifont technology disregarding
training. Clara does not implement omnifont techniques and
concentrate on building specialized fonts (some day in the
future, however, maybe we'll try classification techniques that
do not require training).
3. Most OCR softwares make the revision of the recognized text a
process totally separated from the recognition. Clara
pragmatically joins the two processes, and makes training and
revision one same thing. In fact, the OCR model implemented by
Clara is an interactive effort where the usage of the heuristics
alternates with revision and visual fine-tuning of the OCR,
guided by the user experience and feeling.
4. Clara allows to enter the transliteration of each pattern
using an interface that displays a graphic cursor directly over
the image of the scanned page, and builds and maintains a mapping
between graphic symbols and their transliterations on the OCR
output. This is a potentially useful mechanism for documentation
systems, and a valuable tool for typists and reviewers. In fact,
Clara OCR may be seen as a productivity tool for typists, instead
of a typical OCR.
5. Most OCR softwares are integrated to scanning tools offerring
to the user an unified interface to execute all steps from
scanning to recognition. Clara does not offer one such integrated
interface, so you need a separate software (e.g. SANE) to
perform scanning.
6. Most OCR softwares expect the input to be a graphic file
encoded in tiff or other formats. Clara supports only raw
PBM/PGM.
*/
/* (book)
Scanning and thresholding
-------------------------
Clara OCR cannot scan paper documents by itself. Scanning must be
performed by another program. The Clara OCR development effort is
using SANE (http://www.mostang.com/sane) to produce 600 or 300
dpi images. The Clara OCR heuristics are tuned to 600 dpi.
Scanners offer three scanning modes: black-and-white (also known
as "bitmap" or "lineart", however the meaning of these words may
vary depending on the context), "grayscale" and "color". Clara
OCR requires black-and-white or grayscale input. Both
black-and-white and grayscale images may be saved in a variety of
formats by scanning programs. However, only PBM (for
black-and-white) and PGM (for grayscale) formats are
recognized. Generally grayscale 600 or 300 dpi will be the best
choice, but black-and-white 600 dpi may be good for new, high
quality printed materials. If your scanning program do not
support the PBM or PGM formats, try to save the images in TIFF
format and convert to PBM or PGM using the command tifftopnm. If
for some reason the TIFF format cannot be used, choose any other
format that preserves all data (don't use "compressing" formats
like JPEG), and for which a conversion tool is available, to
convert it to PBM or PGM.
Obs. Programs that scan or handle (e.g. rotate) images may
sometimes perform unexpected tasks, as applying dithering or
reducing algorithms by themselves. An image transformed to become
nice or small may be useless for OCR purposes.
Obs. The PBM and PGM formats do not carry the original resolution
(dots-per-inch) at which the image was scanned. As some
heuristics require that information, Clara OCR expects to be
informed about it through the command-line switch -y (so take
note of the resolution used).
Grayscale means that each pixel assumes one gray "level",
typically from 0 (black) to 255 (white). This is a good choice
for scanning old or low-quality printed materials, because it's
possible to use specialized programs to analyse the image and
choose a "threshold", in such a way that all pixels above that
threshold will be considered "white", and all others will be
considered black (when scanning in black-and-white mode, the
threshold is chosen by the scanning program or by the user). The
threshold may be global (fixed for the entire page) or local
(vary along the page).
In most cases grayscale will achieve better results. However, as
grayscale images are much larger than black-and-white images, 300
dpi (instead of 600 dpi) may be mandatory when using grayscale
due to disk consumption requirements.
Obs. Try to limit yourself to the optical resolution oferred by
the scanner. Most old scanners are 300 dpi, but the scanning
software obtains higher resolutions through interpolation. Newer
scanners may be optical 600 dpi or 1200 dpi or more.
Obs. the page 143 of Manuel Bernardes Branco Dictionary that
we're using along these tests was scanned using the SANE
scanimage command:
scanimage -d microtek2:/dev/sga --mode gray -x 150 -y 210
--resolution 300 > 143.pgm
Thresholding is not the only method for converting grayscale
images to black-and-white (such conversion is also called
"binarization"), but it's the current method used by Clara OCR.
In practice, a too low threshold will brake many symbols on their
thin parts, and a too high threshold will link symbols together
(in the figure, an "a-i" link and a broken "u").
XX
XX
XXXXX XXX XXX XXX
X XX XX XX XX
XX XX XX XX
XXXXXXX XX XX XX
X XX XX XX XX
X XX XX XX XX
XXXXX XXXXXXX XX XXXX
It's a hard task to detect broken and linked symbols. The Clara OCR
heuristics that handle these cases are incipient, so thresholding must
must be carefully performed, in order to not compromise the OCR
results. If the printing intensity, the noise level or the paper
quality vary from page to page, thresholding must be performed on a
per-page basis.
Obs. Now you can try avoid links in segmentation step.
Just set "Try avoid links" parameter in Tune tab. (Normal values <=1)
The four thresholding methods currently avaliable are: manual
(global), histogram-based (global), classification-based (local),
classification-based (global).
Manual and histogram-based (global)
-----------------------------------
Histogram-based thresholding is the default method. It computes
automatically a thresholding value based on the distribution of
grayshades. To use it, just enter the TUNE tab and select (it's
selected by default) the "use histogram-based global
thresholder". To make a try, load a PGM image and press OCR or
ask the Segmentation OCR step.
Obs. You can correct the automatic-detected threshold with
"Threshold factor" in Tune tab.
A global thresholding value can be manually specified. This
corresponds to the "use manual global thresholder" entry. The
choice of the thresholding value is performed through a visul
interface called "instant thresholding". To use it, load one PGM
image and select the "Instant thresholding" entry (Edit
menu). Then use '<', '>', '+' and '-' to change the thresholding
value. When ok, press ESC. Note that the selected value will be
applied only when the segmentation step runs.
Classification-based (local)
----------------------------
Global thresholding does not address those cases where the
printing intensity (or paper properties) vary along one same
page. Local thresholding methods are required on such
cases. Clara OCR implements a classification-based local
(per-symbol) thresholder. Saying that it's classification-based
means that the OCR engine is used to choose the threshold. In
other words, the threshold chosen is that for which the
classifier successfully recognized the symbol (in fact, this is a
brute-force approach).
The local binarizer can be manually applied at any symbol. To do
so, load one PGM page and click any symbol directly on the PAGE
tab. Two thresholding values will be chosen. The pixels found to
be "black" for each one are painted "black" (smaller value) and
"gray" (larger value). At this moment, it's possible to add the
thresholded symbol as a pattern (just press the key corresponding
to its transliteration). Remember that this thresholder relies on
the classifier, so if the OCR is not trained, you'll get no
benefit.
Two versions of the local binarizer were developed, a "weak" one
and a "strong" one. The "weak" one just tries to change the
threshold on those symbols not successfully classified using the
default threshold. The "strong" one (unfinished) also tries to
criticize locally the segmentation results. By default, the weak
version is used. To try the strong one, check the corresponding
checkbox at the TUNE tab.
Obs. As an alternative, use the "Balance" feature + global thresholding.
Classification-based (global)
-----------------------------
Clara OCR includes a simple threshold selection script to compute
global best thresholds based on classification results. Let's try
it on our 2-page book. Just create a directory, cd to it and run
the selthresh.pl script informing the resolution and the names of
the images:
$ cd /home/clara/books/BC
$ mkdir pbm
$ cd pbm
$ selthresh.pl -y 300 -l 0.45 0.55 ../pgm/*pgm
selthresh.pl: scaling 2 times
Best thresholds:
143-l.pgm 0.49
143-r.pgm 0.51
In this case, selthresh.pl will require around 4 minutes to
complete on a 500MHz CPU. For larger collections of pages,
selthresh.pl may take much longer to complete (hours or days). If
needed, the execution can be safely interrupted using Control-C
(it's ok to shutdown the machine while selthresh.pl is
running). The execution can be safely restarted from the point
where it was interrupted typing again the same command:
$ cd /home/clara/books/MBB/pbm
$ selthresh.pl -y 300 -l 0.40 0.55 ../pgm/*pgm
The option -l is used to inform an interval of thresholds to
try. By now, selthresh.pl is unable to choose by itself a "good"
interval. The user must manually check the results for some
thresholds in order to make a choice. For instance, to examine
the results for threshold 0.4 on page 143-l.pgm, try:
$ pgmtopbm -threshold -value 0.4 ../pgm/143-l.pgm >143-l.pbm
$ display 143-l.pbm
Change the threshold, repeat and, once found a threshold value
that produces a "nice" visual result, specify to -l the interval
centered at that threshold, and total width 0.1 or 0.2. The same
interval may be used for all pages because selthresh.pl will warn
about a bad interval choice. Example:
$ selthresh.pl -y 300 -l 0.30 0.35 ../pgm/143-l.pgm
selthresh.pl: scaling 2 times
Best thresholds:
143-l.pgm 0.32 (bad interval, try -l 0.30 0.4)
If a "bad interval" warning appears on the final output for some
pages, it's ok to restart selthresh.pl informing a new, wider
interval, as suggested by selthresh.pl. Only the suspicious pages
will be re-examined. In fact, selecting a narrow initial interval
(and making it larger as required) may be a good strategy to
reduce the total running time.
Once the best thresholds are known, use pgmtopbm to produce the
black-and-white images. It's also a good idea to approach the
resolution to 600 dpi using pnmenlarge. Yet pnmenlarge does not
add information to the image, the classification heuristics will
behave better. In our case, the command should be
$ cd /home/clara/books/BC/pbm
$ pnmenlarge 2 ../pgm/143-l.pgm | \
pgmtopbm -threshold -value 0.49 >143-l.pbm
$ pnmenlarge 2 ../pgm/143-r.pgm | \
pgmtopbm -threshold -value 0.51 >143-r.pbm
Obs. it's not a bad idea to visualize the PBM files, or at least
some of them. Yet selthresh.pl produced good results for us, your
mileage may vary.
In order to capture the output of selthresh.pl (to extract the
per-page best thresholds), it's ok to re-generate it as many
times as needed (just repeat the same selthresh.pl command,
because once all computations become performed, the script will
just read the results from selthresh.out and output the results).
A final warning: selthresh.pl may be fooled by too dark images. So
if the right limit is much larger than it should be, selthresh.pl
may produce bad results. So be careful concerning the right limit
of the interval. As a practical advice, keep in mind that the
best threshold for most images is less then 0.6. In the near
future we'll use statistical measurements to choose the interval
to analyse, in order to prevent such problems and to make
unnecessary a manual choice.
obs. the tarball also includes an alternative selthresh.pl, named
slethresh_fidian.pl. It contains instructions on how to use it.
Avoiding or correcting skew
---------------------------
Sometimes the printing is skewed relatively to the paper
margins. Skew is a problem to the OCR heuristics. As the Clara
OCR engine just detects components by pixel contiguity and builds
classes of symbols, in practice the effect of skew will be a
larger number of patterns, and therefore a larger revision cost.
In some cases, a careful manual scanning can solve the
problem. When acceptable, a set-square solves the problem: just
align one text line at one set-square rule and the edge of the
scanner glass at the other rule (we're supposing that the
bookbinding was disassembled).
The bundled preprocessor now includes a method to compute and
correct skew, but it's not on by default. To activate it, enter
the TUNE tab and select the "Use deskewer" checkbox. Now
deskewing will be applied when the OCR button is pressed (or when
the "Preprocessing" OCR step is requested). Note that
preprocessing is called only once per page, so if the page was
already preprocessed, it won't be deskewed.
Skeleton tuning
---------------
Currently, symbol classification can be performed by three
different classifiers: skeleton fitting, border mapping or pixel
distance. The choice is done on the TUNE tab. Border mapping is
currently experimental. Pixel distance has been used as an
auxiliar classifier. Skeleton fitting is a more mature code and
is highly customizable. It's the default classification method by
now.
When using skeleton fitting, two symbols are considered similar
when each one contains the skeleton of the other. So the
classification result depends strongly on how skeletons are
computed. As an example, the figure presents one symbol
("e"). The symbol black pixels are the dots ('.'). The skeleton
black pixels are stars ('*').
.......
..******..
.*. ..*..
..*. ...*.
.*.. ...*..
..*.........*..
..***********..
..*. ....
..*.
..*..
..*... ...
..*..........
..********..
.........
Clara OCR offers seven different methods for computing
skeletons. Each method has tunable parameters. The choice of the
method and the parameters can be done through a visual inteface
on the TUNE (SKEL) tab. To try it, first save the session (menu
"File"), then enter that tab. At least one pattern must
exist. Vary the parameters and observe the results. Press the
left and right arrows to navigate through the patterns, and use
the "zoom" button to choose a comfortable image size. The last
selection will be used for all skeleton computations. To discard
it, exit Clara OCR without saving the session.
Instead of trying the TUNE (SKEL) tab, it's possible to specify
skeleton computation parameters through the -k command-line
switch. Note however that if a selection was performed through
the TUNE (SKEL) tab, that selection will override the parameters
informed to -k, so be careful.
Clara OCR has an auto-tune feature to choose the "best" skeleton
computation parameters. To use it, check the "Auto-tune skeleton
parameters" entry on the TUNE tab. This feature is currently left
off by default because manual tuning can achieve better
results. Examples:
1. Quality printing without thin details
use -k 2,1.4,1.57,10,3.8,10,4,4
or -k 0,1.4,1.57,10,3.8,10,4,4
2. Quality printing with thin details
use -k 2,1.4,1.57,10,3.8,10,1,1
or -k 4,,,,,,3,
3. Poor printing without thin details
use -k 2,1.4,1.57,10,3.8,10,1,1
4. Poor printing with thin details
use -k 2,1.4,1.57,10,3.8,10,1,1
Yet the pattern computation parameters may change along the way,
it's wise to choose adequate skeleton computation parameters
before OCRing, and keep them fixed along the project. Every time
Clara OCR is started, inform the same parameters chosen. In our
case, we can use the default parameters. To do so, just enter
Clara OCR as before:
$ cd /home/clara/books/BC/pbm
$ clara &
Classification tentatives
-------------------------
To classify the book symbols (i.e. to discover the
transliteration of unknown symbols using the patterns), enter
Clara OCR, select "Work on all pages" ("Options" menu) and press
the OCR button using the mouse button 1, or press the mouse
button 3 and select "Classification". The classification may be
performed many times. Each time, different parameters may be
tried to refine the results already achieved.
When the classification finishes, observe the pages 5.pbm and
6.pbm. Much probably, some symbols will be greyed. In other
words, the classifier was unable to classify all symbols. The
statistics presented on the PAGE (LIST) tab may be useful now. To
reduce the number of unknown symbols there are three choices: add
more patterns, change the skeleton computation parameters, or try
another classifier.
To add more patterns, just train some greyed symbols and
reclassify all pages again. The reclassification will be faster
than the first classification because most symbols, already
classified, won't be touched.
To change the skeleton computation parameters, exit Clara OCR,
restart it informing the new parameters through -k, select
"Re-scan all patterns" ("Edit" menu), select "Work on all pages"
("Options" menu) and reclassify. May be easier to choose and set
the new parameters using the TUNE (SKEL) tab, as explained
earlier. However, remember that the parameters chosen through the
TUNE (SKEL) tab override the parameters informed through -k.
To try another classifier, first select the "Re-scan all
patterns" entry on the "Edit" menu. Then enter the TUNE tab and
select the classifier to use from the available choices
(skeleton-base, border mapping and pixel distance). The pixel
distance may be a good choice. Then reclassify all pages.
The "Re-scan all patterns" is required because for each symbol
Clara OCR remembers the patterns already tried to classify it,
and do not try those patterns again. However, when the skeleton
computation parameters change, or when the classifier changes,
those same patterns must be tried again. Maybe in the future
Clara OCR will decide by itself about re-scanning all patterns.
Symbol properties
-----------------
The bottom five buttons (alphabet, pattern type, "bold", "italic"
and "bad") carry the properties of the current symbol. If the
"PAGE" window is on the plate, the current symbol is the one
identified by the graphic cursor. If the window "PATTERN" is on
the plate, the current symbol is the pattern being exhibited. In
all other cases, the current symbol is undefined.
Let's comment in detail the symbol properties carried by those
five buttons:
a. The possible values for the alphabet are: latin, greek,
cyrillic, hebrew, arabic, kana, number, ideogram or "other". In
order to limit the available alphabets, the button circulates
only the values selected on the "Alphabet" menu.
b. The "pattern types" are the fonts and font sizes used by the
book. Example: 12pt roman and 12pt arial for the text, and 8pt
roman for the footnotes. In this case we have three "types"
identified as "1", "2" and "3".
c. Each one of the bold, italic and "bad" flags may be on or
off. The "bad" flag identifies a symbol not to be used as
pattern.
The user can inform Clara OCR about any of these properties for
the current symbol, just selecting the desired value on the
corresponding button (click it one or more times). The pattern
type, however, is read-only by default. To allow changing its
value, use the "pattern types are read-only" entry on the
"Options" menu.
In most cases, Clara OCR will compute automatically the
properties of each symbol, so it's not required to set them
manually. But just like the transliterations, Clara OCR will need
some initial information, so the user must identify some symbols
as being bold or italicized.
Merge tuning
------------
merge internal fragments
merge pieces on one same box
merge close fragments
recognition merging
learned merging
Complex procedures
------------------
To OCR an entire book is a long process. Perhaps along it a
problem is detected. Bad choice of skeleton computation
parameters, or a bad page contaminating the bookfont, some files
loss due to a crash, etc. How to solve them?
Clara OCR does not offer currently a complete set of tools to
solve all these problems. In some cases, a simple solution is
available. In others, a solution is expected to become available
in future versions. This session will depict some practical
cases, and explain what can be done and what cannot be done for
each one.
Fixing transliterations
-----------------------
Fixing pattern transliterations
Fixing symbol transliterations
Removing patterns and synchronizing pages
-----------------------------------------
Removing references to that pattern
on the loaded page
on other pages
on the patter types
Removing a page
---------------
From the stats presented by the PAGE (LIST) tab it's possible to
detect problems on specific pages. A low factorization may be a
simptom of a bad choice of brightness for that page. In such a
case, it's probably a good idea to remove completely that page.
To remove a page is a delicate operation. Clara OCR currently
does not offer a "remove page" feature. Basically, it should
remove all patterns from that page, remove the revision data
acquired from that page, and remove the page image and its
session file.
Dealing with classification errors
----------------------------------
What to do when the OCR classifies incorrectly a large quantity of
symbols? (to be written)
Importing revision data
-----------------------
When OCRing a large book, a good approach is to divide its pages
into a number of smaller sections and OCR each one. So for a book
with, say, 1000 pages, we could OCR pages 1-200, then 201-400,
etc.
After finishing the first section, of course we desire reuse on
the second section the training and revision effort already
spent. This is not the same as adding the pages 201-400 to the
first section, because we do not want handle the pages 1-200
anymore.
Basically we need to import the patterns of the first section
when starting to process the second. Well, Clara OCR is currently
unable to make this operation.
How to use the web interface
----------------------------
The Clara OCR web interface allows remote training of symbols. To use
it, a web server able to run perl CGIs (e.g. Apache) is
required. Let's present the steps to activate the web interface for a
simple case, with only one book (named "book1"). Basically, one needs
to create a subtree anywhere on the server disk (say,
"/home/clara/www/"), owned by the user that will manage the project
(say, "clara"), with subdirectories, "bin", "book1" and
"book1/doubts":
$ id
uid=511(clara) gid=511(clara) groups=511(clara)
$ cd /home/clara/
$ mkdir www
$ cd www
$ mkdir bin book1
$ mkdir book1/doubts
Then copy to the directory "bin" the files clara.pl and sclara.c from
the Clara OCR distribution (say, /usr/local/src/clara), edit clara.c
to change the hardcoded definition of the root directory to
"/home/clara/www", compile it and make it setuid:
$ cd bin
$ cp /usr/local/src/clara/clara.pl .
$ cp /usr/local/src/clara/sclara.c .
$ emacs sclara.c
$ grep '^char *root' sclara.c
char *root = "/home/clara/www";
$ cc -o sclara -static sclara.c
$ rm sclara.c
$ chmod a+s sclara
Edit the script clara.pl. Example for the clara.pl configuration
section (the script clara.pl contains default definitions for some of
these variables, please comment out those definitions):
$CROOT = "/home/clara/www";
$U = "/cgi-bin/clara";
$book[0] = 'Author, Test 1, City, year';
$subdir[0] = "book1";
$LANG = 'en';
$opt = '-W -R 10 -b -k 2,1.4,1.57,10,3.8,10,4,1';
Now copy the PBM files to the directory "book1", create low-quality
jpeg previews, gzip the PBM files, and select some patterns:
$ cd /home/clara/www/book1
$ cp /usr/local/src/clara/imre.pbm .
$ pbmreduce 8 imre.pbm | convert -quality 25 - imre.jpg
$ gzip -9 imre.pbm
$ clara -k 2,1.4,1.57,10,3.8,10,4,1
(load one PBM file, train some symbols, save the session and quit the
program).
Now we need to process the PBM files in order to create some
"doubts". The script clara.pl also requires a symlink to the clara
binary (change the path /usr/local/bin/clara as required):
$ cd /home/clara/www/bin
$ ln -s /usr/local/bin/clara clara
$ ./clara.pl -s book1
$ rm ../book1/*html
$ ./clara.pl -p
Now your server must be instructed to exec /home/clara/www/bin/clara.pl
when a visitor requests "/cgi-bin/clara" (if you prefer another URL,
change the clara.pl customization too). An easy way to accomplish that
is creating a symlink on the default directory for CGIs. The default
directory of CGIs is platform-dependent (e.g. /home/httpd/cgi-bin,
/usr/local/httpd/cgi-bin, /var/lib/apache/cgi-bin, etc). Example:
# cd /home/httpd/cgi-bin
# ln -sf /home/clara/www/bin/clara.pl clara
Try to access the URL "/cgi-bin/clara" on your web server. The correct
behaviour is successfully loading a page entitled "Prototype of the
Cooperative Revision". If you have problems, be aware about some
common problems:
1. Apache expects to be explicitly allowed to follow symlinks. The
file access.conf should contain, in our case, a section similar to the
following:
AllowOverride None
Options ExecCGI FollowSymLinks
2. The directory /home/clara must be world readable:
# ls -ld /home/clara
drwxr-xr-x 4 clara clara 1024 Sep 17 09:56 /home/clara
If you succeeded, congratulations! Note that from time to time it'll
be necessary to reprocess the pages, adding to the session files the
data collected from the web, just like done before:
$ cd /home/clara/www/bin
$ ./clara.pl -p
$ ./clara.pl -s book1
Revision acts maintenance
-------------------------
Types of revision acts (to be written).
Discarding deduced data (to be written).
*/
/* (devel)
Bugs and TODO list
------------------
(Some) Major tasks
1. Vertical segmentation (partially done).
2. Heuristics to merge fragments.
3. Spelling-generated transliterations
4. Geometric detection of lines and words
5. Finish the documentation
6. Simplify the revision acts subsystem
Minor tasks
1. Change sprintf to snprintf.
2. Fix assymetric behaviour of the function "joined".
3. Optimize bitmap copies to copy words, not bits, where possible
(partially done).
4. Support Multiple OCR zones (partially done).
5. Make sure that the access to the data structures is blocked
during OCR (all functions that change the data structures must
check the value of the flag "ocring").
6. Use 64-bit integers for bitmap comparisons and support
big-endian CPUs (partially done).
7. Clear memory buffers before freeing.
8. Allow the transliterations to refer multiple acts (partially
done).
9. Rewrite composition of patterns for classification of linked
symbols.
10. The flea stops but do not disappear when the window lost and
regain focus.
11. Substitute various magic numbers by per-density and
per-minimum-fontsize values.
12. The local binarizer is slower on deskewed pages (why?).
*/
/* (book)
Welcome to Clara OCR
--------------------
Clara is an optical character recognition (OCR) software, a
program that tries to identify the graphic images of the
characters from a scanned document, converting their digital
images to ASC, ISO or other codes.
The name Clara stands for "Cooperative Lightweight chAracter
Recognizer".
Clara offers two revision interfaces: a standalone GUI and and a
web interface, able to be used by various different reviewers
simultaneously. Because of this feature Clara is a "cooperative"
OCR (it's also "cooperative" in the sense of its free/open status
and development model).
*/
/* (book)
The requirements
----------------
Clara OCR will run on a PC (386, 486 or Pentium) with GNU/Linux
and Xwindows. Clara OCR will hopefully compile and run on a PC
with any unix-like operating system and Xwindows. Currently Clara
OCR won't run on big-endian CPUs (e.g. Sparc) nor on systems
lacking X windows support (e.g. MS-Windows). Higher-level
libraries like Motif, GTK or Qt are not required.
A relatively fast CPU is recommended (300MHz or more). Memory
usage depends on the documents, and may range from some few
megabytes to various tenths os megabytes The normal operation
will create session files on your hard disk, so some megabytes of
free disk space are required (a large project may require plents
of gigabytes). Clara OCR can read and write gzipped files (see
the -z command-line switch).
If you need to build the executable and/or the documentation,
then an ANSI C compiler (with some GNU extensions) and a (version
5) perl interpreter are required.
How to download and compile Clara
---------------------------------
For those who need to download and compile the source code
(hopefully this will be unnecessary for most users as soon as
Clara binary distributions become available), it may be
downloaded from CLARA_HOME. It's a
compressed tar archive with a name like clara-x.y.tar.gz (x.y is
the version number).
The compilation will generally require no more than issue the
following commands on the shell prompt:
$ gunzip clara-x.y.tar.gz
$ tar xvf clara-x.y.tar
$ cd clara-x.y
$ make
$ make doc
Now you can copy the executable (the file "clara") to some
directory of binaries (like /usr/local/bin), and the man page
(file "clara.1") to some directory of man pages (like
/usr/local/man/man1). By now there is no "make install" to
perform these copies automatically.
If some of these steps fail, please try to obtain assistance from
your local experts. They will solve most simple problems
concerning wrong paths or compiler options. You can also read the
subsection "Compilation and startup pitfalls".
Compilation and startup pitfalls
--------------------------------
This subsection is intended to help people that are experiencing
fatal errors when building the executable or when starting
it. After each error message we'll point out some hints.
Bear in mind that most hints given below are very elementary
concerning Unix-like systems. If you have problems, try to read
all hints because details explained once are not repeated. If you
cannot understand them, please try to ask your local experts, or
try to read an introductory book on Unix things. Please don't
email questions like these to the Clara developers, except when
the hint suggests it.
1. Path-related pitfalls
$ make
bash: make: command not found
The shell could not find the "make" utility. Maybe there is no
such utility installed on your system, or maybe the path to it is
unknown to the shell. You can try to find the "make" utility with
a command like
$ find /usr -name make -print
The following command will display the current path:
$ echo $PATH
Remember that on Unix-like systems the environment is
per-process. So if you change the PATH variable on the shell
prompt within an xterm, this won't affect the other running
shells (on the other xterms). Remember that the Unix shells
expect to be explicitly informed about which variables must be
exported to subprocesses (use "export" in Bourne-like shells and
"setenv" on C-like shells).
$ make
gcc -I/usr/X11R6/include -g -c gui.c -o gui.o
make: gcc: Command not found
make: *** [gui.o] Error 127
The make utility could not find the gcc compiler. Check if gcc is
installed. If not, check if some other C compiler is installed
(for instance, "cc"), and edit the makefile to chage the value of
the CC variable.
If you don't know what I'm speaking about, take a look on the
directory where the Clara source codes are, and you'll see there
a file named "makefile". This file contains the names of the
tools to be used and rules to build the Clara executable. It
contains also important paths, like those where the system
headers (files .h) and libraries can be found. If the names or
the paths don't reflect those on your system, you need to edit
the makefile accordingly.
$ make
gcc -I/usr/X11R6/include -g -c gui.c -o gui.o
In file included from gui.c:16:
gui.h:12: X11/Xlib.h: No such file or directory
make: *** [gui.o] Error 1
The compiler could not find the header Xlib.h. Maybe your system
does not include such header, or maybe it is on another directory
not explicited on the makefile through the INCLUDE variable.
$ make
gcc -o clara clara.o skel.o gui.o mc.o ...
/usr/bin/ld: cannot open -lX11: No such file or directory
make: *** [clara] Error 1
The linker could not find the X11 library. Maybe your system does
not include such library, or maybe it is on another directory not
explicited on the makefile through the LIBPATH variable.
2. Compilation pitfalls
$ make
gcc -I/usr/X11R6/include -g -c clara.c -o clara.o
clara.c:70: parse error before `int'
make: *** [clara.o] Error 1
A syntax error on the line 70 of the file clara.c. Double check
if the sources were not changed. Try to obtain the sources
again. If you're a programmer, try to fix the problem. In any
case, report it to claraocr@claraocr.org.
$ make
clara.c: In function `process_cl':
clara.c:2293: `ZPS' undeclared (first use in this function)
clara.c:2293: (Each undeclared identifier is reported only once
clara.c:2293: for each function it appears in.)
make: *** [clara.o] Error 1
A reference to an undeclared variable. Double check if the
sources were not changed. Try to obtain the sources again. If
you're a programmer, try to fix the problem. In any case, report
it to claraocr@claraocr.org.
3. Runtime pitfalls
$ clara &
[1] 1924
bash: clara: command not found
The Clara executable does not exist or is not on the path. Most
Unix systems don't include the current directory ("./") on the
path, so if you're trying to start Clara from the directory where
it was compiled, specify the current directory ("./clara").
$ ./clara &
[1] 1922
_X11TransSocketUNIXConnect: Can't connect: errno = 111
cannot connect to X server
Clara could not connect the X server. The X Windows System is a
client-server system. The applications (xterm, xclock, etc)
connect to a display server (the X server). If the server is not
running, clients cannot connect to it. In some cases, it's
required to inform explicitly the client about the server it must
connect, using the environment variable DISPLAY.
$ ./clara
Segmentation fault (core dumped)
If you can reproduce the problem, report it
to claraocr@claraocr.org. If you're a programmer and Clara was
compiled with the -g option, try a debugger to locate the point
of the source code where the segmentation fault happened. Using
gdb, it's quite easy:
$ gdb clara
(gdb) run
Now try to reproduce the steps that led to the segmentation
fault.
*/
/* (tutorial)
Making OCR
----------
This section is a tutorial on the basic OCR features offerred by
Clara OCR. Clara OCR is not simple to use. A basic knowledge
about how it works is required for using it. Most complex
features are not covered by this tutorial. If you need to compile
Clara from the source code, read the INSTALL file and check (if
necessary) the compilation hints on the Clara OCR Advanced User's
Manual.
Starting Clara
--------------
So let's try it. Of course we need a scanned page to do so. Clara
OCR requires graphic format PBM or PGM (TIFF, PBM, and others
must be converted, the netpbm package contains various conversion
tools). The Clara distribution package contains one small PBM
file that you can use for a first test. The name of this file is
imre.pbm. If you cannot locate it, download it or other files
from CLARA_HOME. Alternatively, you can produce your own 600-dpi
PBM or PGM files scanning any printed document (hints for
scanning pages and converting them to PBM are given on the
section "Scanning books" of the Clara OCR Advanced User's
Manual).
Once you have a PBM or PGM file to try, cd to the directory where
the file resides and fire up Clara. Example:
$ cd /tmp/clara
$ clara &
In order to make OCR tests, Clara will need to write files on
that directory, so write permission is required, just like some
free space.
Obs. As to version CLARA_VERSION, Clara OCR heuristics are tuned
to handle 600 dpi bitmaps. When using a different resolution,
inform it using the -y switch:
$ clara -y 300 &
Then a window with menus and buttons will appear on your X
display:
+-----------------------------------------------+
| File Edit OCR ... |
+-----------------------------------------------+
| +--------+ +----+ +--------+ +-------+ |
| | zoom | |page| |patterns| | tune | |
| +--------+ +-+ +-+ +-+ +-+ |
| +--------+ | +-------------------------+ | |
| | zone | | | | | |
| +--------+ | | | | |
| +--------+ | | | | |
| | OCR | | | WELCOME TO | | |
| +--------+ | | | | |
| +--------+ | | C L A R A O C R | | |
| | stop | | | | | |
| +--------+ | | | | |
| . | | | | |
| . | | | | |
| | | | | |
| | | | | |
| | +-------------------------+ | |
| +-----------------------------+ |
| |
| (status line) |
+-----------------------------------------------+
Welcome aboard! The rectangle with the welcome message is called
"the plate". As you already guessed, the small rectangles with
the labels "zoom", "OCR", "stop", etc, are "the buttons". The
"tabs" are those flaps labelled "page", "patterns"
and "tune". On the menu bar you'll find the File menu, the Edit
menu, and so on. Popup the "Options" menu, and change the current
font size for better visualization, if required.
Press "L" to read the GPL, or select the "page" tab, and
subsequently, select on the plate the imre.pbm page (or any other
PBM or PGM file, if any). The OCR will load that file showing the
progress of this operation on the status line on the bottom of
the window.
note: the "page" tab is the flap labelled "page". This is
unrelated to the "tab" key.
When the load operation completes, Clara will display the
page. Press the OCR button and wait a bit. The letters will
become grayed and the plate will split into three windows. Move
the pointer along the plate and you'll see the tab label follow
the current window: "page", "page (output)" or "page
(symbol)". Move the pointer along the entire application window,
and, for most components, you'll see a short context help message
on the status line when the pointer reaches it (the buttons, for
instance). Dialogs (user confirmations) also use the status line
(like Emacs), instead of dialog boxes.
You can resize both the Clara application window or each of the
three windows currently on the plate ("page", "page (output)" and
"page (symbol)"). To resize the windows, select any point between
two of them and drag the mouse. The scrollbars can become hidden
(use the "hide scrollbars" on the View menu).
When the tab label is "page", press the "zoom" button using the
mouse button 1 and the scanned image will zoom out. If you use
the mouse button 3, the image will zomm in (the behaviour of the
"zoom" button depends on the current window).
Now try selecting the "page" tab many times, and you will
circulate the various display modes shared by this tab. These
modes are and will be referred as "PAGE", "PAGE (fatbits)" and
"PAGE (list)". Each display mode may have one or more windows
We've chosen this uncommon approach because an excess of tabs
transforms them in a useless decoration. The other tabs also
offer various modes, some will be presented later by this
tutorial.
Some few command-line switches
------------------------------
Besides the -y option used in the last subsection, Clara accepts
many others, documented on the Clara OCR Advanced User's
Manual. By now, from the various different ways to start Clara,
we'll limit ourselves to some few examples:
clara
clara -h
In the first case, Clara is just started. On the second, it will
display a short help and exit.
clara -f path
clara -f path -w workdir
The option -f informs the relative or absolute path of a scanned
page or a directory with scanned pages (PBM or PGM files). The
option -w informs the relative or absolute path of a work
directory (where Clara will create the output and data files).
clara -i -f path -w workdir
clara -b -f path -w workdir
The option -i activates dead keys emulation for composition of
accents and characters. The -b switch is for batch
processing. Clara will automatically perform one OCR run on the
file informed through -f (or on all files found, if it is the
path of a directory) and exit without displaying its window.
clara -Z 1 -F 7x13
Clara will start with the smallest possible window size.
A full reference of command-line switches is given on the section
"Reference of command-line switches" of the Clara OCR Advanced
User's Manual.
Training symbols
----------------
Yes, Clara OCR must be trained. Training is a tedious procedure,
but it's a must for those who need a customizable OCR, apt to
adapt to a perhaps uncommon printing style.
Before training, a process called segmentation must be
performed. Press the right button of the mouse over the OCR
button, select "Segmentation" on the menu that will pop out and
wait the operation complete.
Now, on the "page" tab, observe the image of the document
presented on the top window. You'll see the symbols greyed,
because the OCR currently does not know their
transliterations. Try to select one symbol using the mouse (click
the mouse button 1 over it). A black elliptic cursor will appear
around that symbol. This cursor is called the "graphic
cursor". You can move the graphic cursor around the document
using the arrow keys.
Now observe the bottom window on the "page" tab. That window
presents some detailed information on the current symbol (that
one identified by the graphic cursor). When the "show web clip"
option on the "View" menu is selected, a clip of the document
around the current symbol, is displayed too. In some cases, this
clip is useful for better visualization. The name "web clip" is
because this same image is exported to the Clara OCR web
interface when cooperative training and revision through the
Internet is being performed.
To inform the OCR about the transliteration of one symbol, just
type the corresponding key. For instance, if the current symbol
is a letter "a", just type the "a" key. Observe that the trained
symbol becomes black. Each symbol trained will be learned by the
OCR, its bitmap will be called a "pattern", and it will be used
as such when trying to deduce the transliteration of unknown
symbols.
Obs. in our test, the user chose the symbol to be trained. However,
Clara OCR can choose by itself the symbols to be trained. This feature
is called "build the bookfont automatically" (found on the "tune"
tab). To use it, select the corresponding checkbos and classify the
symbols as explained later.
Finally, when the transliteration cannot be informed through one
single keystroke or composition (for instance when you wish to
inform a TeX macro as being the transliteration of the current
symbol), write down the transliteration using the text input
field on the bottom window (select it using the mouse before).
Symbol properties
-----------------
Obs: most features described in this paragraph are still
experimental.
The bottommost three buttons (in this order: alphabet, pattern
type, and "bad") show properties of the current symbol.
If a symbol is defective, it's generally useful not use it as a
pattern. In such a case, when informing the symbol
transliteration, press the ESC key once before training that
symbol (or press the BAD button). The OCR will mark that symbol
as "bad".
The behaviour of the "alphabet" button is as follows: by default
it is in the state "other". If the current symbol is trained as a
latin letter ('a', 'b', 'c', etc), this property is automatically
set to "latin". If the current symbol is trained as a indo-arabic
digit ('0', '1', etc), this property is automatically set to
"number". If the button state is manually set to "greek" and a
letter is input from a latin keyboard, it will be automatically
mapped to the corresponding greek letter ("a" to "alpha", "b" to
"beta", etc). Note that the alphabet button circulates only those
alphabets selected on the "Alphabets" menu. By now, Clara OCR
does not include mappings for other alphabets.
The "pattern types" button presents the classification of the
symbol regarding the font types (Clarendom, Times, etc) and sizes
(9pt, 10pt, etc) found on the book. It's not mandatory to
classify the patterns, and there is some preliminar code to
perform this classification automatically. However, it's
currently expected to be performed manually, if desired. For
instance: first train some symbols, all of same type and
size. All just created patterns are put on type 0. Then use the
"set pattern type" on Edit menu to change their types from 0 to
some other at your choice.
Saving the session
------------------
Before going further, it's important to know how to save your
work. The file menu contains one item labelled "save
session". When selected, it will create or overwrite three files
on the working directory: "patterns", "acts" and "page.session",
where "page" is the name of the file currently loaded, without
the "pbm" or "pgm" tag (in out example, "imre"). So, to remove
all data produced by OCR sessions, remove manually the files
"*.session", "patterns" and "acts".
Note that the files "patterns" and "acts" are shared by all PBM
or PGM pages, so a symbol trained from one page is reused on the
other pages. The ".session" files however are per-page. Pages
with the same graphic characteristics, and only them, must be put
on one same directory, in order to share the same patterns.
When the "quit" option of the "File" menu is selected, the OCR
prompts the user for saving the session (answer pressing the key
"y" or "n"), unless there are no unsaved changes.
OCR steps
---------
The OCR process is divided into various steps, for instance
"classification", "build", etc. These steps are acessible clicking
the mouse button 3 over the OCR button. Each one can be started
independently and/or repeated at any moment. In fact, the more
you know about these steps, the better you'll use them.
Clicking the "OCR" button with the mouse button 1, all steps will
be started in sequence. The "OCR" button remains on the
"selected" state while some step is running.
Yet we won't cover this stuff in the tutorial, a basic knowledge
on what each step perform is required for fine-tuning Clara OCR.
The tuning is an interactive effort where the usage of the
heuristics alternates with training and revision, guided by the
user experience and feeling.
Classification
--------------
After training some symbols, we're ready to apply the just
acquired knowledge to deduce the transliteration of non-trained
symbols. For that, Clara OCR will compare the non-trained symbols
with those trained ("patterns"). Clara OCR offers nice visual
modes to present the comparison of each symbol with each
pattern. To activate the visual modes, enter the View menu and
select (for instance) the "show comparisons" option.
Now start the "classification" step (click the mouse button 3
over the OCR button and select the "classification" item) and
observe what happens. Depending on your hardware and on the size
of the document, this operation may take long to complete
(e.g. 5 minutes). Hopefully it'll be much faster (say, 30
seconds).
When the classification finishes, observe that some nontrained
symbols became black. Each such symbol was found similar to some
pattern. Select one black symbol, and Clara will draw a gray
ellipse around each class member (except the selected symbol,
identified by the black graphic cursor). You can switch off this
feature unselecting the "Show current class" item on the "View"
menu.
In some cases, Clara will classify incorrectly some symbols. For
instance, a defective "e" may be classified as "c". If that
happens, you can inform Clara about the correct transliteration
of that symbol training it as explained before (in this example,
select the symbol and press "e"). This action will remove that
symbol from its current class, and will define a new class,
currently unitary and containing just that symbol.
Note about how Clara OCR classification works
---------------------------------------------
The usual meaning of "classification" for OCRs is to deduce for
each symbol if it is a letter "a" or the letter "b", or a digit
"1", etc. As the total number of different symbols is small (some
tenths), there will be a small quantity of classes.
However, instead of classifying each symbol as being the letter
"a", or the digit "1", or whatever, Clara OCR builds classes of
symbols with similar shapes, not necessarily assigning a
transliteration for each symbol. So as sometimes the bitmap
comparison heuristics consider two true letters "a" dissimilar
(due to printing differences or defects), the Clara OCR
classifier will brake the set of all letters "a" in various
untransliterated subclasses.
Therefore, the classification result may be a much larger number
of classes (thousands or more), not only because of those small
differences or defects, but also because the classification
heuristics are currently unable to scale symbols or to "boldfy"
or "italicize" a symbol.
Note that each untransliterated subclass of letters "a" depends
on a punctual human revision effort to become transliterated
(trained). This is not an absurd strategy, because the revision
of each subset corresponds to part of the unavoidable human
revision effort required by any real-life digitalization
project. This is one of the principles that make possible to see
Clara OCR not as a traditional OCR, but as a productivity tool
able to reduce costs. Anyway, we expect to the future more
improvements on the Clara OCR classifier, in order to lessen the
number of subclasses created.
Building the output
-------------------
Now we're ready to build the OCR output. Just start the
"build" step. The action performed will be basically
to detect text words and lines, and output the transliterations,
trained or deduced, of all symbols. The output will be presented
on the "PAGE (output)" window.
Each character on the "PAGE (output)" window behaves like a
HTML hyperlink. Click it to select the current symbol both
on the "PAGE" window and on the "PAGE (symbol)" window. Note
that the transliteration of unknow symbols is substituted by
their internal IDs (for instance "[133]").
The result of the word detection heuristic can be visualized
checking the "show words" item on the "View" menu.
Handling broken symbols
-----------------------
Obs. As to version CLARA_VERSION the merging heristics are only
partially implemented, and in most cases they won't produce any effect.
The build heuristics also try to merge the pieces of broken
symbols, just like the "u", the "h" and the "E" on the figure
(observe the absent pixels). Some letters have thin parts, and
depending on the paper and printing quality, these parts will
brake more or less frequently.
XXX XXXXXXXXXXX
XX XXX X
XX XXX
XX XXX
XXX XXX XX XXX XXX X
XX XX XXX X XXX XXXX
XX XX XX XX XXX X
XX XX XX XX XXX
XX XX XX XX XXX
XX XX XX XX XXX X
XX XXXX XXXX XXX XXXXXXXXXXX
Clara OCR offers three symbol merging heuristics:
geometric-based, recognition-based and learned. Each one may be
activated or deactivated using the "tune" tab.
Geometric merging applies to fragments on the interior of the
symbol bounding box, like the "E" on the figure, and to some other
cases too.
The recognition merging searches unrecognized
symbols and, for each one, tries to merge it with some
neighbour(s), and checks if the result becomes similar to some
pattern.
Finally, learned merging will try to reproduce the
cases trained by the user. To train merging, just select the
symbol using the mouse button 1
(say, the left part of the "u" on the figure), click the mouse
button 3 on the fragment (the right part of the "u"), and select
the "merge with current symbol" entry. On the other hand, the
"disassemble" entry may be used to break a symbol into its
components.
Obs. do not merge the "i" dot with the "i" stem. See the
subsection "handling accents" for details.
Handling accents
----------------
Now let's talk about accents.
As a general rule, Clara OCR does not consider accents as parts
of letters, so merging does not apply to accents. Accents are
considered individual symbols, and must be trained
separately. The "i" dot is handled as an accent. Clara OCR will
compose accents with the corresponding letters when generating
the output. The exception is when the accent is graphically
joined to the letter:
XXX
XX XXX
XX XX
XX
XXXX XXXX
XX XX XX XX
XX XX XX XX
XXXXXXXXXX XXXXXXXXXX
XX XX
XX XX
XX XX XX XX
XXXX XXXX
In the figure we have two samples of "e" letter with acute
accent. In the first one, the accent is graphically separated
from the letter. So the accent transliteration will be trained or
deduced as being "'", the letter transliteration
will be trained or deduced as beig "e". When generating the output,
Clara OCR will compose them as the macro "\'e" (or as the ISO
character 233, as soon as we provide this alternative behaviour).
On the second case the accent isn't graphically separable from
the letter, so we'll need to train the accented character as the
corresponding ISO character (code 233) or as the macro "\'e". As
the generation of accented characters depend on the local X
settings, the "Emulate deadkeys" item on the "Options" menu may
be useful in this case. It will enable the composition of accents
and letters performed directly by Clara OCR (like Emacs
iso-accents-mode feature).
Browsing the book font
----------------------
As explained earlier, trained symbols become patterns (unless you
mark it "bad"). The collection of all patterns is called "book
font" (the term "book" is to distinguish it from the GUI
font). Clara OCR stores all pattern in the "patterns" file on the
work directory, when the "save session" entry on the "File" menu
is selected.
Clara OCR itself can choose the patterns and populate the book
font. To do so, just select the "Build the font automatically"
item on the "tune" tab, and classify the symbols.
To browse the patterns, click the "pattern" tab one or more times
to enter the "Pattern (list)" window. The "PATTERN (list)" mode
displays the bitmap and the properties
of each pattern in a (perhaps very long) form.
Click the "zoom" button to
adjust the size of the pattern bitmaps. Use the scroolbar or
the Next (Page Down) or Previous (Page Up) keys to navigate. Use
the sort options on the "Edit" menu to change the presentation order.
Now press the "pattern" tab again to reach the "Pattern" window. It
presents the "current" pattern with detailed properties. try
activating the "show web clip" option on the "View" menu to
visualize the pattern context. The left and
right arrows will move to the previous and to the next patterns. To
train the current pattern (being exhibited on the "Pattern" window),
just press the key corresponding to its transliteration (Clara will
automatically move to the next pattern) or fill the
input field. There is no need to press ENTER to submit the input
field contents.
Useful hints
------------
If the GUI becomes trashed or blank, press C-l to redraw it.
By now, the GUI do not support cut-and-paste. To save to a file
the contents of the "PAGE (list)" window, use the "Write report"
item on the "File" menu.
The "OCR" button will enter "pressed" stated in some unexpected
situations, like during dialogs. This behaviour will be fixed
soon.
The "STOP" button do not stop immediately the OCR operation in
course (e.g. classification). Clara OCR only stops the operation
in course in "secure" points, where all data structures are
consistent.
The OCR output is automatically saved to the file page.html (or
page.txt if the option -o was used), where "page" is the name of
the currently loaded page, without the "pbm" or "pgm" tag. This
file is created by the "generate output" step on the menu that
appears when the mouse button 3 is pressed over the OCR button.
Some OCR steps are currently unfinished and perform no
action at all.
Fun codes
---------
Clara OCR "fun codes" are similar to videogame "codes" (for those
who have never heard about that, videogame "codes" are special
sequences of mouse or key clicks that make your player
invulnerable, or obtain maximum energy, or perform an unexpected
action, etc).
The difference is that Clara OCR "fun codes" are not secret
(videogame "codes" are normally secret and very hard to discover
by chance). Clara OCR contains no secret feature. Fun codes are
intended to be used along public presentations. By now there is
only one fun code: just click one or more times the banner on the
welcome window to make it scroll.
*/
/* (book)
Supported Alphabets
-------------------
Clara OCR focuses the Latin Alphabet ("a", "b", "c", ...),
used by most European languages, and the indo-arabic digits
("0", "1", "2", ...), but we're trying to support as many
alphabets as possible.
To say that Clara OCR supports a given alphabet means that
Clara OCR
(a) is able to be trained from the keyboard for the symbols of
that alphabet, eventually applying some transliteration from that
alphabet to latin. For instance, when OCRing a greek text, if the
user presses the latin "a" key (assuming that the keyboard has
latin labels), Clara is expected to train the current symbol as
"alpha".
(b) knows the vertical alignment of each letter of that alphabet,
for instance, knows that the bottom of an "e" is aligned at the
baseline;
(c) knows which letters accept or require which signs (accents
and others, like the dot found on "i" and "j");
(d) contains code to help avoiding common mistakes, like
recognizing "e" as "c", "l" as "1", etc.
To say that Clara OCR supports a given alphabet does not
necessarily mean that Clara OCR
(a) knows some particular encoding (ISO-8859-X, Unicode, etc)
for that alphabet;
(b) contains or is able to use fonts for that alphabet to
display the OCR output on the PAGE (OUTPUT) window.
Even ignoring the standard encondings for one given
alphabet (e.g. ISO-LATIN-7 for Greek), Clara eventually
will be able to produce output using TeX macros, like
{\Alpha}.
*/
/* (devel)
Introducing the source code
---------------------------
This Guide is a collection of entry points to the Clara OCR
source code. Some notes explain punctual details about how this
or that feature was implemented. Others are higher-level
descriptions about how one entire subsystem works.
Language and environment
------------------------
Clara OCR is written in ANSI C (with some GNU extensions) and
requires the services of the C library and the Xlib. The
development is using 32-bit Intel GNU/Linux (various different
distributions), GCC, Gnu Make, Bash, XFree86 and Perl 5 (required
for producing the documentation).
Modularization
--------------
Clara source code started, of course, as being one only file
named clara.c. At some point we divided it into smaller
pieces. Currently there are 18 files:
book.c .. Documentation only
build.c .. The function build
clara.c .. Startup and OCR run control
cml.c .. ClaraML dumper and recover
common.h .. Common declarations
consist.c .. Consistency tests
event.c .. GUI initialization and event handler
gui.h .. Declarations that depend on X11
html.c .. HTML generation and parse
pattern.c .. Book font stuff
pbm2cl.c .. Import PBM
pgmblock.c .. grayscale loading and blockfinding
preproc.c .. internal preprocessor
redraw.c .. The function redraw
revision.c .. Revision procedures
skel.c .. Skeleton computation
symbol.c .. Symbol stuff
welcome.c .. Welcome stuff
Along this document we'll not refer these files, but the
identifiers (names of functions and variables).
Note that there are only two headers: common.h and gui.h. It's
complex to maintain one header for each module. Most functions
are not prototyped, but we intend to prototype all them in the
near future.
Security notes
--------------
Concerning security, the following criteria is being used:
1. string operations are generally performed using services that
accept a size parameter, like snprint or strncpy, except when the code
itself is simple and guarantees that a overflow won't occur.
2. The CGI clara.pl invokes write privileges through sclara, a program
specially written to perform only a small set of simple operations
required for the operation of the Clara OCR web interface.
The following should be done:
1. Memory blocks should be cleared before calling free().
Runtime index checking
----------------------
A naive support for runtime index checking is provided through the
macro checkidx. This checking is performed only if the code is
compiled with the macro MEMCHECK defined and the command-line switch
'-X 1' is used.
In fact, only those points on the source code where the macro checkidx
is explicitly used will perform index checking. We've added calls to
checkidx on some critical functions due to its complexity, or because
segfaults were already were detected there.
Background operation
--------------------
Clara OCR decides at runtime if the GUI will be used or not. So
even when using Clara OCR in batch mode (-b command-line switch),
linking with the X libraries is required.
When the -b command-line switch is used, Clara OCR just won't
make calls to X services. The source code tests the flag
"batch_mode" before calling X services. So it won't create the
application window on the X display, and automatically starts a
full OCR operation on all pages found (as if the "OCR" button was
pressed with the "work on all pages" option selected). Upon
completion, Clara OCR will exit.
Synchronization
---------------
Execution model
---------------
In order to allow the GUI to refresh the application window while
one OCR run is in course, Clara does not use multiple
threads. The main function alternates calls to xevents() to
receive input and to continue_ocr() to perform OCR. As the OCR
operations may take long to complete, a very simple model was
implemented to allow the OCR services to execute only partially.
Such services (for instance load_page()) accept a "reset" parameter
to allow resetting all static data, and they're expected to
return 0 when finished, or nonzero otherwise. Therefore, a call to
such services must loop until completion. The continue_ocr() calls
the OCR steps using this model, and some OCR steps call other
services (like load_page()) that implement this model too.
Resetting
---------
XML support
-----------
We decided to use XML because of the facilities of using
non-binary encodings to store, analyse, change and transmit
information, and also because XML is a standard. Currently we do
not have DTDs, and until now we didn't try to load, using the
Clara parser, XML code not produced by Clara itself.
The GUI
-------
Main characteristics
--------------------
1. Clara OCR GUI uses only 5 colors: white, gray, darkgray,
verydarkgray and black. The RGB value for each one is
customizable at startup (-c command-line option). On truecolor
displays, graymaps are displayed using more graylevels than the 5
listed above.
2. The X I/O is not buffered. Buffered X I/O is implemented but
it's not being used.
3. Only one X font is used for all needs (button lables, menu
entries, HTML renderization, and messages).
4. Asynchronous refresh. The OCR operations just set the redraw
flags (redraw_button, redraw_wnd, redraw_int, etc) and let the
redraw() function make its work.
5. No toolkit is used. The graphic code is very specific to
Clara, and it was not written to be reusable. So it's very
small. The disadvantage of this approach is that Clara look and
behaviour will be slightly different from the typical ones found
on popular environments like GNOME or KDE.
The Clara API
-------------
*/
/* (book)
Building the book font
----------------------
Patterns are selected symbols from the book. They're obtained
from manual training, or from automatic selection. The patterns
are used to deduce the transliteration of the unknown symbols by
the bitmap comparison heuristics. In other words, the OCR
discovers that one symbol is the letter "a" or the digit "1"
comparing it with the patterns.
The book font is the collection of all patterns. The term "book
font" was chosen to make sure that we're not talking about the X
font used by the GUI. The book font is stored on a separate file
("patterns", on the work directory). Clara OCR classifies the
patterns into "types", one type for each printing font. By now,
most of this work must be done manually. Someday in the future,
the auto-tuning features and the pre-build customizations will
hopefully make this process less painful.
So, before OCRing one book, it's convenient to observe the
different fonts used. In our case, we have three fonts (the
quotations refer the page 5.pbm):
Unknown Latin 9pt ("Todos sao iguais...")
Unknown Latin 9pt bold ("Art. 5")
Unknown Latin 8pt italic (footings)
It's not mandatory to exactly identify each font by its "correct"
name or style or size (Roman, Arial, Courier, etc). In our case,
we've chosen the labels above ("Unknown Latin 9pt" and the
others). These labels can be manually entered using the PATTERN
(TYPES) tab, one "type" for each "font". So we'll have 3 "types",
and, for each one, various parameters can be manually
informed. At least the alphabet must be informed. In fact, the
PATTERN (TYPES) tab allows structuring very carefully all fonts
used along the book. Even some intrincated details, like the
classification techniques that can be used for each symbol, can
be set.
Now we can select some patterns from the pages 143-l.pbm and
143-r.pbm. Try:
$ cd /home/clara/books/MBB/pbm
$ clara &
Load the page 143-l.pbm. Observe the symbols, select a nice one
using the mouse button 1 or the arrows (say, a letter "a", small)
and train it pressing the corresponding key (the "a" key). Repeat
this process for various symbols, all from one same type (so do
not mix bold with non-bold, etc). The entered patterns belong by
default to "type 0". The "Set pattern type" entry of the Edit
menu can be used to move all "type 0" patterns to some other type
(1, 2 or 3 in our case). To display the letters and digits for
which few or no samples are trained, click the mouse right button
over the PAGE tab and select "Show pattern type". This way, one
can complete all fonts used along the book.
At this point, the "Auto-classify" feature (Edit menu) may be
quite useful. When on, Clara OCR will apply the just trained
pattern to solve all unknown symbols, so after training an "a",
only those "a" letters dissimilar to that trained will remain
unknown (grayed).
Now save the session (menu "File"), exit Clara OCR (menu "File"),
and enter Clara OCR again using the same commands above. Try to
load one file and/or to observe the patterns on the tabs PATTERN,
PATTERN (list), TUNE (SKEL), etc. This is a good way to
experience that Clara OCR is started and exited many times along
the duration of one OCR project.
The last remark in this subsection: instead of the just described
manual pattern selection, Clara OCR is able to select by itself
the patterns to use from the pages. In order to use this feature,
after selecting the checkbox "Build the bookfont automatically"
(TUNE tab), classify the symbols (just press the OCR button using
the mouse button 1, or press the mouse button 3 over it and
select the "classify" item). However, the current recommendation
is to prefer the manual selection of patterns, at least as a
first step.
*/
/* (book)
Reference of the Clara GUI
--------------------------
In this section, the Clara application window will be described
in detail, both to document all its features and to define the
terminology.
The application window
----------------------
The application window is divided into three major areas: the
buttons ("zoom", "OCR", "stop", etc) the "plate" (right),
including the tabs ("page", "symbol" and "font"), and one or more
"document windows" inside the plate.
We say "document window" because each window is exhibiting one
"document". This "document" may be the scanned page (PAGE
window), the current OCR output for this page (PAGE OUTPUT
window), the symbol form (PAGE SYMBOL window), the GPL (GPL
window) and so on. However, we'll refer the document windows
merely as "windows".
Around each window there are two scrollbars. On the botton of the
application window there is a status line. On the top there is
a menu bar (fully documented on the section "Reference of the
menus").
+-----------------------------------------------+
| File Edit OCR ... |
+-----------------------------------------------+
| +--------+ +----+ +--------+ +-------+ |
| | zoom | |page| |patterns| | tune | |
| +--------+ +-+ +-+ +-+ +-+ |
| +--------+ | +-------------------------+ | |
| | zone | | | | | |
| +--------+ | | | | |
| +--------+ | | | | |
| | OCR | | | WELCOME TO | | |
| +--------+ | | | | |
| +--------+ | | C L A R A O C R | | |
| | stop | | | | | |
| +--------+ | | | | |
| . | | | | |
| . | | | | |
| | | | | |
| | | | | |
| | +-------------------------+ | |
| +-----------------------------+ |
| |
| (status line) |
+-----------------------------------------------+
Tabs and windows
----------------
Three tabs are oferred, and each one may operate in one or more
"modes". For instance, pressing the PATTERN tab many times will
circulate two modes: one presenting the windows "pattern" and
"pattern (props)" and another with the window "pattern
(list)".
On each tab, Clara OCR displays on the plate one or more
windows. Each such window is called a "document window" to
distinguish them from the application window. Each such window
is supposed to be displaying a portion of a larger document, for
instance
The scanned page (graphic)
The OCR output (text)
The list of pages (text)
The list of patterns (text)
The symbol description (text)
Unless the user hides them, two scrollbars are displayed for each
document window, one horizontal and one vertical. On each one, a
cursor is drawn to show the relative portion of the full document
currently visible ont the display.
All available tabs and the modes for each one are listed
below. The numbers (1, 2, etc) are only to make easier to
distinguish one mode from the others. There is no effective
association between the modes and the numbers.
tab mode windows
-------------------------------
1 WELCOME
2 GPL
3 PATTERN_ACTION
page 4 PAGE_LIST
5 PAGE
PAGE_OUTPUT
PAGE_SYMBOL
6 PAGE_FATBITS
PAGE_MATCHES
pattern 7 PATTERN
8 PATTERN_LIST
9 PATTERN_TYPES
tune 10 TUNE
11 TUNE_PATTERN
TUNE_SKEL
11 TUNE_ACTS
Note that the windows WELCOME and GPL have no corresponding
tab. When these windows are displayed, there is no active
tab. Except in these cases, the name of the current window is
always presented as the label of the active tab.
The Alphabet Map
----------------
When the "Show alphabet map" option of the "View" menu is selected,
the GUI will include an alphabet map between the buttons and the
plate. This map presents all symbols from the current alphabet. The
current alphabet is selected using the alphabet button. The alphabet
button circulates all alphabets selected on the "Alphabets" menu.
Clara OCR offers an initial support for multiple alphabets. To become
useful, it needs more work. The alphabet map currently does not offer
any functionality. For some alphabets (Cyrillic and Arabic) the
alphabet map is disabled on the source code due to the large alphabet
size. Currently Clara OCR does not contain bitmaps for displaying
Katakana.
Reference of the menus
----------------------
Most menus are acessible from their labels menu bar (on the top of the
application window). The labels are "File", "Edit", etc. Other menus
are presented when the user clicks the mouse button 3 on some special
places (for instance the button "OCR"). Let's describe all menus and
their entries.
*/
/* (devel)
Geometry of windows
-------------------
The current window is informed through the CDW global variable
(set by the setview function). The variable CDW is an index for
the dw array of dwdesc structs. Some macros are used to refer the
fields of the structure dw[CDW]. The list of all them can be
found on the headers under the title "Parameters of the current
window".
The portion of the document being displayed is defined by the
macros X0, Y0, HR and VR, where (X0,Y0) is the top left and HR
and VR are the width and heigth, measured in pixels (graphic
documents) or characters (text documents):
X0 X0+HR-1
| |
+----+-----+--+
| |
| |
| +-----+ +- Y0
| | | |
| | | |
| | | |
| +-----+ +- Y0+VR-1
| |
| |
| |
| |
| |
| |
+-------------+
The document
Regarding the application window, the document window is a
portion of the plate, defined by DM, DT, DW and DH, where (DM,DT)
is the top left and DW and DH are the width and heigth measured
in display pixels.
DM DM+DW-1
| |
+-----+-----------------+----+
| |
| |
| |
| +-----------------+ +- DT
| | | | |
| | | X |
| | | X |
| | Document | X |
| | window | | |
| | | | |
| | | | |
| | | | |
| | | | |
| +-----------------+ +- DT+DH-1
| -----XXXXXXXXXXX- |
| |
| |
+----------------------------+
Application window
The rectangle (X0,Y0,HR,VR) from the document is exhibited into
the display rectangle (DM,DT,DW,DH). When displaying the scanned
page, the reduction factor RF applies. Each square RFxRF of
pixels from the document will be mapped to one display pixel.
When displaying the scanned page in fat bit mode, each document
pixel will be mapped to a square ZPSxZPS of display pixels, and a
grid will be displayed too.
Scrollbars
----------
The scrollbars inform the relative portion of the document being
exhibited. The viewable region of the document (in the sense just
defined) is defined by X0, Y0, HR and VR:
Y0 Y0+HR-1
+----+-------+-------+ - 0
| |
X0 + +-------+ |
| | | |
| | | |
| | | |
| | | |
X0+VR-1 + +-------+ |
| |
| |
| |
| |
+--------------------+ - GRY-1
| |
0 GRX-1
The variables GRX and GRY contain the total width and height of
the full document, measured in pixels. The interpretation of the
contents of the variables X0, Y0, HR and VR is not simple. In some
cases, they will contain values measured in pixels. In other cases,
in characters. The variables HR and VR define the size of the
window. However, in some cases this size is the size
from the viewpoint of the document and, in others, of the display
(the difference is a reduction factor).
+------------+ -
| | |
| | |
| | X
| | X
| | X
| | |
| | |
+------------+ -
|---XXXX-----|
Note that the parameters X0, Y0, HR, VR, GRX and GRY are macros
that refer the corresponding fields of the structure dw[CDW],
that stores the parameters of the current DW.
Displaying bitmaps
------------------
The Bitmaps on HTML windows and on the PAGE window are exhibited
in "reduced" fashion (a square RFxRF of pixels from the bitmap is
mapped to one display pixel). If RF=1, then each bitmap pixel
will map to one display pixel.
The windows PATTERN, PAGE_FATBITS, and PAGE_MATCHES exhibit
bitmaps in "zoomed" mode (one bitmap pixel maps to a ZPSxZPS
square of display pixels). In this case a grid is displayed to
make easier to distinguish each pixel. The variables GW and GS
contain the grid width and the "grid separation" (GS=ZPS+GW).
ZPS GS GW
|<---->|<----->| --->||<---
++------++------++------++----
++------++------++------++----
|| || || ||
|| || || ||
|| || || ||
++------++------++------++----
++------++------++------++----
|| || || ||
|| || || ||
|| || || ||
Note that the parameters RF, GS and GW are macros that refer the
corresponding fields of the structure dw[CDW], that stores the
parameters of the current DW.
Auto-submission of forms
------------------------
The Clara OCR GUI tries to apply immediately all actions taken by
the user. So the HTML forms (e.g. the PATTERN window) do not
contain SUBMIT buttons, because they're not required (some forms
contain a SUBMIT button disguised as a CONSIST facility, but this
is just for the user's convenience).
The editable input fields make auto-submission mechanisms a bit
harder, because we cannot apply consistency tests and process the
form before the user finishes filling the field, so
auto-submission must be triggered on selected events. The
triggers must be a bit smart, because some events must be
attended before submission (for instance toggle a CHECKBOX),
while others must be attended after submission (for instance
changing the current tab). So auto-submission must be carefully
studied. The current strategy follows:
a. When the window PAGE (symbol) or the window PATTERN is
visible, auto-submit just after attending the buttons that change
the current symbol/pattern data (buttons BOLD, ITALIC, ALPHABET
or PTYPE).
b. When the window PAGE (symbol) or the window PATTERN is
visible, auto-submit just before attending the left or right
arrows.
c. When the user presses ENTER and an active input field exists,
auto-submit.
d. Auto-submit as the first action taken by the setview service,
in order to flush the current form before changing the current
tab or tab mode.
e. Auto-submit just after opening any menu, in order to flush
data before some critic action like quitting the program or
starting some OCR step.
f. Auto-submit just after attending CHECKBOX or RADIO buttons.
Auto-submission happens when the service auto_submit_form is
called, so it's easy to locate all triggering points (just search
the string auto_submit_form). This service takes no action when
the current form is unchanged.
The Clara API
-------------
This section describes the variables and functions exported by
Clara OCR for extensionability purpuses. Note that Clara OCR
currently does not have an interface for extensions. The first
such interface planned to be added will use the Guile
interpreter, available from the GNU Project.
*/
/* (all)
AVAILABILITY
------------
Clara OCR is free software. Its source code is distributed under
the terms of the GNU GPL (General Public License), and is
available at CLARA_HOME. If you don't know what is the GPL,
please read it and check the GPL FAQ at
http://www.gnu.org/copyleft/gpl-faq.html. You should have
received a copy of the GNU General Public License along with this
software; if not, write to the Free Software Foundation, Inc., 59
Temple Place - Suite 330, Boston, MA 02111-1307, USA. The Free
Software Foundation can be found at http://www.fsf.org.
CREDITS
-------
Clara OCR was written by Ricardo Ueda Karpischek. Giulio Lunati
wrote the internal preprocessor. Clara OCR includes bugfixes
produced by other developers. The Changelog
(http://www.claraocr.org/CHANGELOG) acknowledges all them (see
below). Imre Simon contributed high-volume tests, discussions
with experts, selection of bibliographic resources, propaganda
and many ideas on how to make the software more useful.
Ricardo authored various free materials, some included (at least)
in Conectiva, Debian, FreeBSD and SuSE (the verb conjugator
"conjugue", the ispell dictionary br.ispell and the proxy
axw3). He recently ported the EiC interpreter to the Psion 5
handheld and patched the Xt-based vncviewer to scale framebuffers
and compute image diffs. Ricardo works as an independent
developer and instructor. He received no financial aid to develop
Clara OCR. He's not an employee of any company or organization.
Imre Simon promotes the usage and development of free
technologies and information from his research, teaching and
administrative labour at the University.
Roberto Hirata Junior and Marcelo Marcilio Silva contributed
ideas on character isolation and recognition. Richard Stallman
suggested improvements on how to generate HTML output. Marius
Vollmer is helping to add Guile support. Jacques Le Marois helped
on the announce process. We acknowledge Mike O'Donnell and Junior
Barrera for their good criticism. We acknowledge Peter Lyman for
his remarks about the Berkeley Digital Library, and Wanderley
Antonio Cavassin, Janos Simon and Roberto Marcondes Cesar Junior
for some web and bibliographic pointers. Bruno Barbieri Gnecco
provided hints and explanations about GOCR (main author: Jorg
Schulenburg). Luis Jose Cearra Zabala (author of OCRE) is gently
supporting our tentatives of using portions of his code. Adriano
Nagelschmidt Rodrigues and Carlos Juiti Watanabe carefully tried
the tutorial before the first announce. Eduardo Marcel Macan
packaged Clara OCR for Debian and suggested some
improvements. Mandrakesoft is hosting claraocr.org. We
acknowledge Conectiva and SuSE for providing copies of their
outstanding distributions. Finally, we acknowledge the late Jose
Hugo de Oliveira Bussab for his interest in our work.
The fonts used by the "view alphabet map" feature came from
Roman Czyborra's "The ISO 8859 Alphabet Soup" page at
http://czyborra.com/charsets/iso8859.html.
The names cited by the CHANGELOG and not cited before follow
(small patches, bug reports, specfiles, suggestions,
explanations, etc).
Brian G.,
Bruce Momjian,
Charles Davant (server admin),
Daniel Merigoux,
De Clarke,
Emile Snider (preprocessor, to be released),
Erich Mueller,
groggy,
Harold van Oostrom,
Ho Chak Hung,
Jeroen Ruigrok,
Laurent-jan,
Nathalie Vielmas,
Romeu Mantovani Jr (packager),
Ron Young,
R P Herrold,
Sergei Andrievskii,
Stuart Yeates,
Terran Melconian,
Thomas Klausner (packager),
Tim McNerney,
Tyler Akins.
*/
/* (faq)
WELCOME
-------
These are the Clara OCR Frequently Asked Questions. They're
useful for a first contact with Clara OCR. If you're looking for
information on how to use Clara OCR, please try the Clara OCR
Tutorial instead. Clara OCR can be found at CLARA_HOME.
CONTENTS
--------
What is Clara OCR?
Why is Clara a "cooperative OCR"?
Is Clara OCR Free? Open Source?
Is Clara OCR a GNU program?
On which platforms does Clara OCR run?
Does Clara OCR have a command-line interface?
Does Clara OCR run on KDE? GNOME?
Which languages are supported by Clara OCR?
Does Clara OCR support Unicode?
Is Clara OCR omnifont?
How does Clara differ from other OCRs?
What is PBM/PGM/PPM/PNM?
How can I scan paper documents using Clara OCR?
I've tried Clara OCR, but the results disappointed me
How can I get support on Clara OCR?
Does Clara OCR induce to Copyright Law infringements?
How can I help the Clara OCR development effort?
What is Clara OCR?
------------------
Clara is an OCR program. OCR stands for "Optical Character
Recognition". An OCR program tries to recognize the characters
from the digital image of a paper document. The name Clara stands
for "Cooperative Lightweight chAracter Recognizer".
Why is Clara a "cooperative OCR"?
---------------------------------
Clara is a cooperative OCR because it offers an web interface for
training and revision, so these tasks can benefit from the
revision effort of many people across the Internet. However,
Clara OCR also offers a powerful X-based GUI for standalone
usage.
Is Clara OCR Free? Open Source?
-------------------------------
Clara OCR is distributed within the terms of the Gnu Public
License (GPL) version 2. Yes, Clara OCR is Free. Yes, Clara OCR
is Open Source. Clara OCR is not "Shareware", nor "Public
Domain".
Is Clara OCR a GNU program?
---------------------------
Clara OCR is unrelated to the GNU Project but its development is
strongly based on GNU programs (GCC, Emacs and others), as well
as on other free softwares, like the Linux kernel and XFree86.
Clara OCR is free software because we agree on the free software
ideal as stated by the GPL. To make this agreement explicit we
also adopted some suggestions from the Free Software
Foundation. These suggestions apply to the Clara OCR
documentation:
(a) GPL programs are referred as "free software", not "open
source".
(b) The term "GNU/Linux (operating system)" is used rather
than "Linux (operating system)".
(c) We do not recommend non-free softwares and do not refer
the user to non-free documentation for free softwares.
Furthermore, Clara OCR will support Guile as an extension
language in the near future.
Obs. We write "free software" instead of "open source"
just for coherence. We dislike antagonisms between the various
initiatives created along the years to freely produce, use,
change and distribute software.
On which platforms does Clara OCR run?
--------------------------------------
Clara OCR is being developed on 32-bit Intel running GNU/Linux.
Currently Clara OCR won't run on big-endian CPUs (e.g. Sparc) nor
on systems lacking X windows support (e.g. MS-Windows). A
relatively fast CPU (300MHz or more) is recommended. There is a
port initiative to MS-Windows being worked. See also the next
question.
Does Clara OCR have a command-line interface?
---------------------------------------------
Yes, but the X Windows headers and libraries are required anyway
to compile the source code, and the X Windows libraries are
required to run even the Clara OCR command-line interface. Unless
someone reworks the code, it's not possible to detach the GUI in
order to compile Clara OCR on systems that do not support X
Windows.
Does Clara OCR run on KDE? GNOME?
---------------------------------
Clara OCR will hopefully run on any graphic environment based on
Xwindows, including KDE, GNOME, CDE, WindowMaker and
others. Clara OCR depends only on the X library, and does not
require GTK, Qt or Motif to run. Clara OCR does not use the X
Toolkit (aka "Xt"). Clara OCR has been successfully tested on
X11R5 and X11R6 environments with twm, fvwm, mwm and others.
Which languages are supported by Clara OCR?
-------------------------------------------
As a generic recogniser, Clara OCR may be tried with any language
and any alphabet. However, there are some restrictions. Currently
Clara OCR expects the words to be written horizontally, and there
are some heuristics that suppose some geometric relationships
typical for the Latin Alphabet and the accents used by most
european languages. Support for language-specific spell checking
is expected to be added soon.
Does Clara OCR support Unicode?
-------------------------------
No, Clara OCR does not support Unicode, and the support to the
ISO-8859 charsets is partial.
Is Clara OCR omnifont?
----------------------
No, Clara OCR is not omnifont. Clara OCR implements an OCR model
based on training. This model makes training and revision one
same thing, making possible to reuse training and revision
information (see also the next question).
How does Clara differ from other OCRs?
--------------------------------------
This is a quote from the Clara Advanced User's Manual:
Clara differs from other OCR softwares in various aspects:
1. Most known OCRs are non-free and Clara is free. Clara focus
the X windows system. Clara offers batch processing, a web
interface and supports cooperative revision effort.
2. Most OCR softwares focus omnifont technology disregarding
training. Clara does not implement omnifont techniques and
concentrate on building specialized fonts (some day in the
future, however, maybe we'll try classification techniques that
do not require training).
3. Most OCR softwares make the revision of the recognized text a
process totally separated from the recognition. Clara
pragmatically joins the two processes, and makes training and
revision parts of one same thing. In fact, the OCR model
implemented by Clara is an interactive effort where the usage of
the heuristics alternates with revision and fine-tuning of the
OCR, guided by the user experience and feeling.
4. Clara allows to enter the transliteration of each pattern
using an interface that displays a graphic cursor directly over
the image of the scanned page, and builds and maintains a mapping
between graphic symbols and their transliterations on the OCR
output. This is a potentially useful mechanism for documentation
systems, and a valuable tool for typists and reviewers. In fact,
Clara OCR may be seen as a productivity tool for typists.
5. Most OCR softwares are integrated to scanning tools offerring
to the user an unified interface to execute all steps from
scanning to recognition. Clara does not offer one such integrated
interface, so you need a separate software (e.g. SANE) to
perform scanning.
6. Most OCR softwares expect the input to be a graphic file
encoded in tiff or other formats. Clara supports only raw PBM
and PGM.
What is PBM/PGM/PPM/PNM?
------------------------
PBM, PGM and PPM are graphic file formats defined by Jef
Poskanzer. PNM is not a graphic file format, but a generic
reference to those three formats. In other words, to say that a
program supports PNM means that it handles PBM, PGM and PPM.
PBM = Portable BitMap
PGM = Protable GrayMap
PPM = Portable PixMap
PNM = Portable aNyMap
PBM files are black-and-white images, 1 bit per pixel. PGM files
are grayscale images, 8 bits per pixel. PPM files are color
images, 24 bits per pixel. Currently Clara OCR likes raw PBM and
raw PGM files only. A scanned page stored in some format other
than PBM or PGM can be converted to PBM or PGM using the netpbm
tools, ImageMagick or others.
PNM files may be "raw" or "plain". The plain versions are rarely
used. Clara OCR does not support plain PBM nor plain PGM. To make
sure about the file format, try the "file" utility, for instance
file test.pbm
Remember that image conversion sometimes implies data loss. For
instance, to convert a color image to black-and-white, each pixel
must be mapped to either black or white, so the original color
(say, red, lightblue, seagreen, tomato, mistyrose, etc) is
dropped. Also, the conversion process should decide for each
pixel if it will be mapped to black or to white. Generally, the
program that performs the conversion offers a variety of
different mapping criteria. The OCR results depend strongly on
the criterion chosen.
How can I scan paper documents using Clara OCR?
-----------------------------------------------
You cannot. Clara OCR includes no support for scanners. To scan
paper documents, use another software, like the one bundled with
your scanner, or SANE (http://www.mostang.com/sane/). The
development tests are using SANE.
I've tried Clara OCR, but the results disappointed me
-----------------------------------------------------
All OCR programs will disappoint you depending on the texts
you're trying to recognize. If you're a developer, join the Clara
OCR development effort and try to make it behave better on your
texts. If your are not a developer, wait a new version and try
again.
How can I get support on Clara OCR?
-----------------------------------
If the documentation did not solve your problems, try the
discussion list.
Does Clara OCR induce to Copyright Law infringements?
-----------------------------------------------------
No. Clara OCR is just a tool for character recognition like many
others that can be purchased or are bundled with scanners. The
Clara OCR Project claims all users to be aware about the
Copyrigth Law and not infringe it. The Clara OCR Project
abominates any try to infringe the legitimate laws of any
country.
Nonetheless, the Clara OCR Project supports the free and public
availability of materials produced to be free, or of materials
out of copyright due to its age. The Clara OCR Project recognizes
the right of anyone to produce free or non-free materials.
How can I help the Clara OCR development effort?
------------------------------------------------
The best way is to use Clara OCR to recognize the texts you're
interested on, and try to make it adapt better to them. The
Developer's Guide should help in this case (C programming skills
are required). The Clara OCR Project acknowledges all efforts to
make Clara OCR more widely known and used.
*/
/* (glossary)
WELCOME
-------
This is the Clara OCR glossary. It's somewhat specific to Clara
OCR. The entries that do not refer an author were written by
Ricardo Ueda Karpischek. Send new entries or suggestions to
claraocr@claraocr.org. This glossary is part of the Clara OCR
documentation. Clara OCR is distributed under the terms of the
GNU GPL.
CONTENTS
--------
algorithm
binarization
bitmap
bitmap comparison
border
border mapping
clara
classification
density
depth
digital image
dpi
function
graphic format
graymap
heuristic
image size
mapping
OCR
page
pattern
pixel
pixel distance
pixmap
PBM
PGM
PNM
PPM
resolution
skeleton
skeleton fitting
symbol
thresholding
Xlib
*/
/* (glossary)
image size
----------
As a digital image uses to be a rectangular matrix of pixels, its
size in pixels can be conveniently described giving the rectangle
width and height, usually in the form WxH. For instance, a 200x100
image is a rectangle of pixels having width 200 and height 100.
depth
-----
the number of bits available to store the color of each pixel.
Black-and-white images have depth 1. Graymaps use to have depth
8 (256 graylevels). The larger the depth, the larger will be the
amount of disk or ram space required to store a digital image.
For instance, an image of size 100x100 and depth 8 requires
100*100*8 = 80000 bits = 8000 bytes to be stored.
graphic format
--------------
A standardised way to store the color of each pixel from a digital
image in a disk file. The graphic format may include other
information, like density and image annotations. Some graphic
formats include a provision to compress the data. In some cases,
this compression, if used, may change the color of some pixels
or regions to colors close to the original ones, but different.
So the usage of some graphic formats may imply in data loss.
Examples of graphic formats are TIFF, JPEG, GIF, BMP, PNM, etc.
clara
-----
Cooperative Lightweight Recognizer. "Clara" is also a personal
name: Clara (Latin, Portuguese, Spanish), "Chiara" (Italian),
Claire (English).
OCR
---
Optical Character Recognition. Some people feel hard to
understand conveniently what OCR is due to the lack of knowledge
on how computers store and process text and image data. Most
users think OCR as being a required step before editing and
spell-checking documents got from the scanner (it's not wrong,
though).
algorithm
---------
a well defined procedure. The term "algorithm" is usually
reserved for procedures whose properties can be assured,
generally through a rigorous mathematical proof. For instance,
the procedure learned by children to multiply two numbers from
their multi-digit decimal representations is an algorithm (see
heuristic).
binarization
------------
the conversion from color or grayscale (PGM) to
black-and-white. The Clara OCR classification heuristics
currently available require black-and-white input, so when the
input is grayscale (PGM), Clara OCR needs to convert it to
black-and-white before OCR. Note that to binarize an image, some
choice must be done on how to map colors or graylevels to either
black or white. Also and mainly, and the OCR results depends
strongly on that choice.
bitmap
------
The Clara OCR documentation tries to use the term "bitmap" to
mean only rectangular, black-and-white digital images. Grayscale
rectangular digital images are called "graymaps" (see also
pixel).
bitmap comparison
-----------------
any method intended to decide if two given bitmaps are
similar. Clara OCR implements three such methods: skeleton
fitting, border mapping and pixel distance.
border
------
the line formed by the bitmap black pixels that have white
neighbours. Note that the definition of "neighbour" may
vary. Clara OCR generally consider that the neighbours of one
pixel are all 8 pixels contiguous to it (top left, top, top
right, left, right, bottom left, bottom, bottom right).
border mapping
--------------
a bitmap comparison technique that builds a mapping from the
border pixels of one bitmap to the border pixels of another
bitmap. If this mapping is found to satisfy certain mathematical
properties, the bitmaps are considered similar.
classification
--------------
the process that recognizes a given bitmap as being the letter
"a" or the digit "5", etc. Instead of saying that the bitmap was
"recognized" as a letter "a", it's common to say that it was
"classified" as a letter "a". All Clara OCR classification
methods are currently based on bitmap comparison techniques.
density
-------
see dpi.
digital image
-------------
see pixel.
dpi
---
dots-per-inch. A measure of linear image density. Example:
scanning an A4 (210x297mm) page at 300 dpi results an image of
size 2481x3508 (remember that 1 inch equals 25.4 millimeters). In
most cases, all relevant visual details from printed characters
can be conveniently captured at 600dpi (in some cases, 300dpi
suffices). Some file formats, like TIFF or JPEG, include density
information. Others, like PBM, PGM or PPM, don't. So when
converting from TIFF to PGM, remember that the density
information is dropped. So if, for instance, you ask SANE to scan
a page creating a TIFF file, and subsequently convert it to PPM,
and from PPM to TIFF again, the last file will not be equal to
the first one. Density information uses to be irrelevant when
displaying images on the computer monitor, because in this case a
1-1 mapping between image pixels and display pixels is
assumed. However, density information is quite important when
printing an image on paper, or when performing OCR. Clara OCR
expects to be informed explicitly about the image density
(default 600 dpi).
function
--------
a rule that assigns, for each given element, another element, in
a unique fashion. For instance, the equation y = x+1 defines a
function that assigns to each number x the number x+1. A 2d
digital image may be seen as a function that assigns to each dot,
given by its horizontal and vertical coordinates, a color
("black", "white", "green", etc). Functions are also called
"mappings".
graymap
-------
see bitmap.
heuristic
---------
a procedure whose properties are not assured. Heuristics are
generally the expression of some more or less vague feeling, or a
naive, initial approch for a complex problem. If an heuristic can
be proven to satisfy some interesting property, then it can be
referred as an algorithm (in regard of that property). Some
experts say that OCR is an engeneering field, not a mathematical
field. Perhaps we can express this same idea saying that by its
own nature, OCR is a field where nothing else than heuristics can
be stated.
mapping
-------
see function.
page
----
a scanned document. The Clara OCR documentation tries to avoid
using terms like "document", "image" or "file" to signify a
scanned document. "Page" is used instead.
pattern
-------
in the Clara OCR context, it's a letter, digit or accent
instance, used to classify the page symbols through bitmap
comparison. Clara OCR builds a set of patterns based on manual
training or automatic selection, and uses it to classify all page
symbols.
pixel
-----
each one of the individual dots that compose a digital image
(quite frequently, the term "pixel" is used to refer only the
non-white dots of an image). A digital image uses to be a
rectangular matrix of dots. To each one it's possible to assign
one from many available colors, in order to form an image. If the
available colors are only "black" and "white", the image thus
formed is a "black-and-white image". As the representation of one
from two possible values may be done using a bit, and the
assignment of geometrically well positioned dots to colors may be
seen as a function or mapping, a black-and-white image is also
called a "bitmap". Similarly, if the colors available are only
gray levels, usually from 0 (black) to 255 (white), then the
image is a "grayscale image" or a graymap, and a generic
assignment of pixels to colors is called a "pixmap".
pixel distance
--------------
a bitmap comparison technique that builds a mapping from all
pixels of one bitmap to the pixels of another bitmap. If this
mapping is found to satisfy certain mathematical properties, the
bitmaps are considered similar.
pixmap
------
see pixel.
PBM
---
see PNM.
PGM
---
see PNM.
PNM
---
Portable aNyMap. PNM is a generic reference to the graphic file
formats PBM, PGM and PPM defined by Jef Poskanzer. In other
words, to say that a program supports PNM means that it handles
PBM, PGM and PPM. PBM (Portable BitMap) files are black-and-white
images, 1 bit per pixel. PGM (Protable GrayMap) files are
grayscale images, 8 bits per pixel. PPM (Portable PixMap) files
are color images, 24 bits per pixel. Currently Clara OCR likes
PBM and PGM files only. A scanned page stored in some format
other than PBM or PGM can be converted to PBM or PGM using the
netpbm tools, ImageMagick or others. PNM files may be "raw" or
"plain". The plain versions are rarely used. Clara OCR does not
support plain PBM nor plain PGM.
PPM
---
see PNM.
resolution
----------
this term is used along the Clara OCR documentation to refer
either the image size (for instance: 640x480 pixels) or the image
density (for instance: 300 pixels per inch).
skeleton
--------
ideally, it's a minimal structural bitmap. From an algorithmic
standpoint, the skeleton of a symbol is the bitmap obtained
clearing a number of its peripheric pixels, whose remotion does
not destroy the symbol shape.
skeleton fitting
----------------
a bitmap comparison technique that decides that two given bitmaps
are similar if and only if the skeleton of each one fits into the
other.
symbol
------
an instance of a letter or digit in a page. So if the word
"classical" occurs in a page, all its letters ("c", "l", "a",
"s", "s", "i", "c", "a", "l") are individual symbols. At the
source code level, things that are not letters not digits are
sometimes called symbols (for instance, pieces of broken symbols,
dots, accents, noise, etc).
thresholding
------------
a simple binarization method. It decides to map each pixel from a
graymap to either black or white just testing if its gray level
is smaller or larger than a given threshold. So, if the threshold
is, say, 171, then all gray levels from 0 to 170 are mapped to 0
(black) and all graylevels from 171 to 255 are mapped to 255
(white). The thresholding is said to be global if one fixed
(per-page) binarization threshold is used to decide the mapping
of all page pixels. The thresholding is said to be local if the
threshold is allowed to vary along the page, due to irregular
printing intensity.
Xlib
----
the low-level, standard, Xwindows library. It offers
basic graphic primitives, similar to others found on most graphic
environments, like "draw line", "draw pixel", "get next event",
etc, as well as services more specific to the Xwindows way of
doing things, like "connect to an X display", properties
(resources) handling, etc. The Xlib does not include facilities
to create menus, buttons, etc. Application programs usually take
these facilities from "toolkits" like Xt, GTK, Qt and
others. Clara OCR creates the few facilities it needs using
the Xlib primitives.
*/
/*
Alignment drafts
s_pair(a,b)
complete_align(a,b)
get_ap(a)
use hardcoded data
get_ap(b)
use hardcoded data
get_dd(a,x,b,d)
estimate from alignment data
geo_align(a)
geo_align(b)
1. geometrical line detection.
2. compute per-symbol geometrical alignment.
3. add per-symbol alignment data to the pattern types.
4. add alignment filtering rule to the classification service.
*/