This session began with three talks, all of which motivated the use of
structured representations such as SGML or XML. It quickly became
apparent that the focus of this working group would be on standards
rather than on evaluation.
Initial discussion focused on the extensive list of acronyms used in
the three papers, including XML, SGML, ISO, SDF, DTD, OCR, DAFS, DSA,
URL, HTML, OO, DCD, PDF, RTF, and SOX. During this discussion, the
comment was made that there seems to be no universal standard that is
flexible enough to make everyone happy while also being structured
enough to constrain the way data is represented, so that tools and
large data sets can be developed and widely used.
The discussion in the working group then focused primarily on the use
of XML as a representation that should be adopted. It was pointed out
that although XML can be used to represent a document, it is extremely
important to have an appropriate set of conventions that everyone
abides by. We all agreed that XML is powerful enough, but we need to
set the structure as well.
Discussion then turned to DAFS (Document Attribute Format
Specification), which was developed jointly by RAF Technologies and the
US Department of Defense for the representation of document images and
intermediate document analysis results. It was generally agreed that
DAFS is powerful (and extensible) enough to represent most of what we
need, and that perhaps a DAFS-XML version would do the trick. It was
also agreed that, for the most part, DAFS could be represented in XML,
although there was some concern about how to deal with the binary
data. Some features of DAFS and DAFS-lib are included below.
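
To make the "DAFS-XML" idea and the binary-data concern concrete, here is
a minimal sketch in Python. The element and attribute names (entity,
type, bbox, image, encoding) are hypothetical, invented for this
illustration, and are not taken from DAFS or from any agreed standard.
Base64 is only one common way to carry binary data inside XML; it adds
roughly one third to the size of the data.

```python
import base64
import xml.etree.ElementTree as ET


def build_document(image_bytes: bytes) -> ET.Element:
    """Build a tiny entity hierarchy with an embedded binary image.

    All element and attribute names here are hypothetical.
    """
    doc = ET.Element("entity", {"type": "document"})
    # Raw binary cannot appear in XML text, so encode it as base64.
    image = ET.SubElement(doc, "image", {"encoding": "base64"})
    image.text = base64.b64encode(image_bytes).decode("ascii")

    # Nested entities mirror a document > page > word hierarchy.
    page = ET.SubElement(doc, "entity", {"type": "page"})
    word = ET.SubElement(page, "entity", {"type": "word", "bbox": "10 20 90 40"})
    word.text = "DAFS"
    return doc


def extract_image(doc: ET.Element) -> bytes:
    """Recover the original bytes from the base64-encoded image element."""
    image = doc.find("image")
    return base64.b64decode(image.text)


if __name__ == "__main__":
    fake_image = bytes([0, 255, 16, 32])  # stand-in for real TIFF data
    xml_text = ET.tostring(build_document(fake_image), encoding="unicode")
    print(xml_text)
    # Parse the serialized text again and check the bytes survive.
    print(extract_image(ET.fromstring(xml_text)) == fake_image)
```

An alternative design would keep the image in a separate file and store
only a reference to it in the XML, trading the single-file property for
smaller, more readable markup.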
General Advantages highlighted for XML
- Wide support for basic properties and activities
- Conversion, database manipulation, search, sorting
General Disadvantages highlighted for XML
- Overhead? How to represent uncertainty?
- We do not currently have data
The final point of discussion was what we need in order to gain
credibility. The general answer is that we need some tools, some
basic data and, most of all, a set of extensible standards that can
adapt to new needs.
****** Summary ******
Is XML a "natural" for representation? PROBABLY
Can we produce Tools and Libraries? With some minimal support
Is there Data? Lots of willing participants; all we have to do is
convert it.
What needs to be done to make this an interchange format? Follow DAFS
****** DAFS ******
FEATURES
- Entity is the primary data type
- Hierarchical structure
- Entity types
- Ambiguity (OR entities)
- Text content
  - Unicode
  - ASCII
  - Character possibility set
  - Confidence values
- Bounding areas:
  - box
  - list of boxes
  - list of orthogons (polygons) for rendering
- Properties: arbitrary user-defined name-value pairs
- Image
  - TIFF Group 3 or Group 4 compression
  - Bi-tonal
- One-file compressed storage of image and entity
- Byte-swapping support
- Type properties
- Callbacks as an aid to writing user interfaces such as illum
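
The ambiguity, character possibility set and confidence value features
listed above bear on the question of how uncertainty might be
represented in XML. The following Python sketch shows one possible
encoding. The element names glyph and alt and the attribute conf are
hypothetical, chosen for this illustration only; they do not come from
DAFS.

```python
import xml.etree.ElementTree as ET


def make_glyph(candidates):
    """Encode one uncertain character as a set of alternatives.

    `candidates` is a list of (character, confidence) pairs, with each
    confidence in [0, 1]. Element and attribute names are hypothetical.
    """
    glyph = ET.Element("glyph")
    for char, conf in candidates:
        alt = ET.SubElement(glyph, "alt", {"conf": f"{conf:.2f}"})
        alt.text = char
    return glyph


def best_choice(glyph):
    """Return (character, confidence) for the most confident alternative."""
    best = max(glyph.findall("alt"), key=lambda a: float(a.get("conf")))
    return best.text, float(best.get("conf"))


if __name__ == "__main__":
    # An OCR engine unsure whether a mark is "l", "1" or "I".
    glyph = make_glyph([("l", 0.48), ("1", 0.41), ("I", 0.11)])
    print(ET.tostring(glyph, encoding="unicode"))
    print(best_choice(glyph))
```

Keeping every alternative in the file, rather than only the top choice,
lets a later tool or a human reviewer revisit the decision.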
Library Requirements
- ANSI C, portable, robust
- Read: DAFS, TIFF, PDA files
- Write: DAFS
- Create, delete, move hierarchies
- Get and set entity content
- Get and set properties
- Efficient run-length storage of images
- Convert to and from bitmap
- Scale efficiently
- Override file I/O
- Override memory allocator
- Borrow entities have been removed permanently. Instead, use
  i_CopyEntity.
- Auxiliary programs:
  - dafs2txt: print out text
  - pda2dafs: convert Calera PDA files
  - dumpdafs: dump out tags
  - prall: hierarchy text output
  - bunch: combine multiple DAFS files
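
As an illustration of run-length storage for bi-tonal images and of
conversion to and from a bitmap, here is a minimal Python sketch. It is
not the DAFS library's actual storage format or API; the function names
encode_row and decode_row and the convention that each row starts with
a white run are assumptions made for this example.

```python
def encode_row(pixels):
    """Run-length encode one row of a bi-tonal image.

    `pixels` is a sequence of 0 (white) and 1 (black) values. The result
    is a list of run lengths that alternates white, black, white, ...
    The first run is 0 when the row starts with a black pixel.
    """
    runs = []
    current = 0  # assumed convention: every row starts with a white run
    count = 0
    for p in pixels:
        if p == current:
            count += 1
        else:
            runs.append(count)
            current = p
            count = 1
    runs.append(count)
    return runs


def decode_row(runs):
    """Expand alternating white/black run lengths back into pixels."""
    pixels = []
    value = 0
    for length in runs:
        pixels.extend([value] * length)
        value = 1 - value
    return pixels


if __name__ == "__main__":
    row = [1, 1, 0, 0, 0, 1]
    runs = encode_row(row)
    print(runs)
    print(decode_row(runs) == row)
```

Rows of scanned text are mostly long white runs, which is why this kind
of storage tends to be compact for document images.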
Viewer Requirements
- 5 magnification levels for images
- Flagged characters and words
- Unicode support
- Keyboards and fonts: American, Arabic, British, French, German,
  Greek, Italian, MICR, Norwegian, Russian, Spanish, and Swedish
- Korean (fonts only)
- Some diacritical marks
- Reads DAFS, TIFFs, PDAs (image only)
- Save to DAFS, ASCII/UTF or Unicode
- Japanese input server
- Entity Viewer shows
  - Properties
  - Character choices
  - Bounding boxes
  - Image fragment for a selected entity
- Change type
- Change content
- 5 modes
  - Image
    - allows you to view image files with OCR information overlays
    - also allows viewing of TIFF and PDA files (image only)
    - variable control over which overlays are displayed
    - mouse and keyboard commands for resizing and moving boxes
    - mouse command for creating new entities
  - Out of Context Viewer
    - allows you to see all the a's on one page, b's on another
    - designed for rapid building of reference sets
  - Flag
    - like out of context, but showing only flagged words and
      characters and allowing rapid correction of the same
  - Text
    - a text editor that concatenates the text value of the
      elements of the file
    - configurable
    - separate fields and pages and improved facsimile appearance
    - text view flag popups have been disabled for this release
  - Hierarchy
    - display the entity hierarchy in a manner similar to a file browser
    - allows creating and moving entities