DLIA99 Discussion summary

VI: Document Standards and Evaluation

Discussion chair: David Doermann (University of Maryland)

 

This session began with three talks, all of which motivated the use of
structured representations such as SGML or XML.  It quickly became
apparent that the focus of this working group would be on standards,
rather than on evaluation.

Initial discussion focused on the extensive list of acronyms used in
the three papers, including XML, SGML, ISO, SDF, DTD, OCR, DAFS, DSA,
URL, HTML, OO, DCD, PDF, RTF, and SOX.  The comment was made that
there seems to be no universal standard that is flexible enough to
make everyone happy while still being structured enough to constrain
how data is represented, so that tools and large data sets can be
developed and widely used.

The discussion in the working group then focused primarily on the use
of XML as a representation that should be adopted.  It was pointed out
that although XML can be used to represent a document, it is extremely
important to have an appropriate set of conventions that everyone
abides by.  We all agreed that XML is powerful enough, but we need to
set the structure as well.

Discussion then turned to DAFS (Document Attribute Format
Specification), which was developed jointly by RAF Technologies and
the US Department of Defense for the representation of document images
and intermediate document analysis results.  It was generally agreed
that DAFS is powerful (and extensible) enough to represent most of
what we need, and that perhaps a DAFS-XML version would do the trick.
It was also agreed that, for the most part, DAFS could be represented
in XML, although there was some concern about how to deal with the
binary data.  Some features of DAFS and DAFS-lib are included below.
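
One possible way to handle the binary-data concern is to carry the
image as base64 text inside the XML.  The sketch below is only an
illustration under that assumption: the element and attribute names
(entity, image, box, text, encoding) are hypothetical and are not
taken from DAFS or from any agreed DAFS-XML convention, and the image
bytes are stand-in values rather than a real TIFF.

```python
import base64
import xml.etree.ElementTree as ET


def build_page(image_bytes: bytes) -> ET.Element:
    """Build a hypothetical DAFS-like page entity with an embedded image."""
    page = ET.Element("entity", {"type": "page"})

    # Raw binary cannot appear in XML, so encode it as base64 text.
    image = ET.SubElement(
        page, "image", {"format": "tiff-g4", "encoding": "base64"}
    )
    image.text = base64.b64encode(image_bytes).decode("ascii")

    # A child entity with a bounding box and text content.
    word = ET.SubElement(page, "entity", {"type": "word"})
    ET.SubElement(word, "box", {"x": "10", "y": "20", "w": "55", "h": "12"})
    ET.SubElement(word, "text").text = "DAFS"
    return page


def extract_image(page: ET.Element) -> bytes:
    """Recover the original bytes from the base64 payload."""
    image = page.find("image")
    return base64.b64decode(image.text)


# Stand-in bytes for illustration only; this is not a valid image file.
fake_tiff = bytes([0x49, 0x49, 0x2A, 0x00, 0xFF, 0x00, 0x80])
xml_text = ET.tostring(build_page(fake_tiff), encoding="unicode")
round_trip = extract_image(ET.fromstring(xml_text))
```

Base64 grows the payload by roughly one third, which bears on the
overhead question raised in the discussion.  Keeping the image in a
separate file and referring to it from the XML would be another
option.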

General Advantages highlighted for XML
 - Wide support for basic properties and activities:
   conversion, database manipulation, search, sorting

General Disadvantages highlighted for XML
 - Overhead?  How to represent uncertainty?
 - We do not currently have data in XML form
 

The final point of discussion was what we need in order to gain
credibility.  The general answer is that we need some tools, some
basic data and, most of all, a set of extensible standards that can
adapt to new needs.

         ******  Summary  ******

Is XML a "natural" for representation?  PROBABLY

Can we produce Tools and Libraries?  With some minimal support

Is there Data?  Lots of willing participants; all we have to do is
 convert it.

What needs to be done to make this an interchange format?  Follow DAFS
 

    ******  DAFS  ******

FEATURES

    Entity is the primary data type
        Hierarchical structure
        Entity types
        Ambiguity (OR entities)
        Text content
            Unicode
            ASCII
            character possibility set
            confidence values
        Bounding areas:
            box
            list of boxes
            list of orthogons (polygons) for rendering
        Properties: arbitrary user-defined name-value pairs
    Image
        TIFF Group 3 or Group 4 compression
        Bi-tonal
        One file compressed storage of image and entity
    Byte-swapping support
    Type properties
    Callbacks as an aid to writing user interfaces such as illum
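
The entity features listed above can be pictured with a small data
model.  This is a minimal sketch, not the DAFS-lib layout: the class
and field names are hypothetical, ambiguity is simplified to a list of
alternative readings per leaf entity (rather than DAFS OR entities),
and the 0.0 to 1.0 confidence scale is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Alternative:
    """One candidate reading with its confidence (assumed 0.0 to 1.0)."""
    text: str
    confidence: float


@dataclass
class Entity:
    """Hypothetical stand-in for a DAFS entity."""
    kind: str                                          # entity type
    box: Optional[Tuple[int, int, int, int]] = None    # x, y, width, height
    alternatives: List[Alternative] = field(default_factory=list)
    properties: Dict[str, str] = field(default_factory=dict)
    children: List["Entity"] = field(default_factory=list)

    def best_text(self) -> str:
        """Most confident reading of a leaf, or the children's joined text."""
        if self.alternatives:
            return max(self.alternatives, key=lambda a: a.confidence).text
        return "".join(child.best_text() for child in self.children)


# A word whose second character is ambiguous between "l" and "1".
word = Entity(
    kind="word",
    box=(10, 20, 30, 12),
    properties={"font": "serif"},
    children=[
        Entity("char", alternatives=[Alternative("c", 0.9),
                                     Alternative("e", 0.1)]),
        Entity("char", alternatives=[Alternative("l", 0.4),
                                     Alternative("1", 0.6)]),
    ],
)
```

Here word.best_text() picks the most confident alternative for each
character.  Keeping every alternative, instead of only the winner, is
what lets later processing or a human reviewer revisit the choice.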
 
 

Library Requirements

    ANSI C, portable, robust
    Read: DAFS, TIFF, PDA files
    Write: DAFS
    Create, delete, move hierarchies
    Get and set entity content
    Get and set properties
    Efficient run-length storage of images
        Convert to and from bitmap
        Scale efficiently
    Override file/io
    Override memory allocator
    Borrow entities have been removed permanently. Instead, use i_CopyEntity.
    Auxiliary programs:
        dafs2txt: print out text
        pda2dafs: convert Calera PDA files
        dumpdafs: dump out tags
        prall: hierarchy text output
        bunch: combine multiple DAFS files
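
The run-length storage requirement can be illustrated for a single row
of a bi-tonal image.  This is a sketch of the general technique only;
the function names are hypothetical, the code is Python rather than
the ANSI C the library calls for, and it does not reproduce the actual
DAFS-lib or TIFF Group 3/4 encodings.

```python
from typing import List


def rle_encode(row: List[int]) -> List[int]:
    """Encode a row of 0/1 pixels as alternating run lengths.

    By convention the first run counts white (0) pixels, so a row that
    starts with black (1) begins with a run of length 0.
    """
    runs: List[int] = []
    current = 0
    count = 0
    for pixel in row:
        if pixel == current:
            count += 1
        else:
            runs.append(count)
            current = pixel
            count = 1
    runs.append(count)
    return runs


def rle_decode(runs: List[int]) -> List[int]:
    """Expand alternating run lengths back into a row of 0/1 pixels."""
    row: List[int] = []
    value = 0
    for length in runs:
        row.extend([value] * length)
        value = 1 - value
    return row


row = [0, 0, 0, 1, 1, 0, 1, 1, 1, 1]
runs = rle_encode(row)          # [3, 2, 1, 4]
restored = rle_decode(runs)     # same pixels as row
```

Scanned text pages consist mostly of long white runs, which is why a
run-length form tends to be compact for this kind of image.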

Viewer Requirements

    5 magnification levels for images
    Flagged characters and words
    Unicode support
        Keyboards and fonts: American, Arabic, British, French,
                German, Greek, Italian, MICR, Norwegian, Russian,
                Spanish, and Swedish
        Korean (fonts only)
        Some diacritical marks
    Reads DAFS, TIFFs, PDAs (image only)
    Save to DAFS, ASCII/UTF or Unicode
    Japanese input server
    Entity Viewer shows
        Properties
        Character choices
        Bounding boxes
        Image fragment for a selected entity
        Change type
        Change content
    5 modes
        Image
            allows you to view image files with OCR information overlays
            also allows viewing of TIFF and PDA files (image only)
            variable control over which overlays are displayed
            mouse and keyboard commands for resizing and moving boxes
            mouse command for creating new entities
        Out of Context Viewer
            allows you to see all the a's on one page, b's on another
            designed for rapid building of reference sets
        Flag
            like out of context, but showing only flagged words and
                characters and allowing rapid correction of the same
        Text
            a text editor that concatenates the text value of the
                elements of the file
            configurable
            separate fields and pages
            improved facsimile appearance
            text view flag popups have been disabled for this release
        Hierarchy
            display the entity hierarchy in a manner similar to a file browser
            allows creating and moving entities
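
The grouping behind the Out of Context mode (all the a's together, all
the b's together) can be sketched as follows.  This is an illustration
only: the tuple layout, the function name and the sample data are
invented here and do not reflect how the viewer is implemented.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]      # x, y, width, height
CharEntity = Tuple[str, Box]         # (recognized label, bounding box)


def group_by_label(chars: List[CharEntity]) -> Dict[str, List[Box]]:
    """Collect the bounding boxes of all characters sharing a label."""
    groups: Dict[str, List[Box]] = defaultdict(list)
    for label, box in chars:
        groups[label].append(box)
    return dict(groups)


# Invented sample data: two a's and one b from a single text line.
chars = [
    ("a", (0, 0, 8, 10)),
    ("b", (9, 0, 8, 12)),
    ("a", (18, 0, 8, 10)),
]
pages = group_by_label(chars)
```

Each key in pages corresponds to one "page" of the view; showing the
image fragments for one label side by side makes a misrecognized
character easy to spot.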