DLIA’99
Report on Discussion Group V
Generic Document Structure Analysis
Apostolos Antonacopoulos
Department of Computer Science, University of Liverpool, United Kingdom.
http://www.csc.liv.ac.uk/~aa

Approaches to Layout Analysis (physical and logical) are maturing, yet new methods continue to be reported, with varying degrees of success, in different application domains. At the same time, the nature of documents (and of their structure) is changing. Documents with complex layouts, coloured backgrounds and coloured text are becoming widespread as the relevant enabling technologies reach the mass market.

It is natural, therefore, that the discussion focused on the fundamental issues of performance evaluation, document structure description and the elusive goal of devising a Unified Layout Analysis approach.

It was agreed that the provision of ground-truth data is essential, both for performance evaluation and for estimating the parameters of Layout Analysis methods. Performance evaluation is crucial not only for comparative benchmarking but, primarily, for discovering and resolving weaknesses in a particular method. However, existing data sets are not adequate for documents with complex layouts or for those composed in colour. Furthermore, ground-truth data is expensive to obtain. The suggested way forward is to start with a small, manually produced data set and to expand it with the output of the method once that output has been corrected.
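The incremental procedure suggested above can be sketched as a simple loop. This is only an illustration under stated assumptions: the names `expand_ground_truth`, `analyse` and `correct` are hypothetical, a page is represented by a plain identifier, and a layout by a list of region labels. The correction step stands in for whatever manual checking is actually used.

```python
from typing import Callable, Iterable, List, Tuple

Page = str          # hypothetical page identifier
Layout = List[str]  # hypothetical layout description (region labels)


def expand_ground_truth(
    seed: List[Tuple[Page, Layout]],
    new_pages: Iterable[Page],
    analyse: Callable[[Page], Layout],
    correct: Callable[[Page, Layout], Layout],
) -> List[Tuple[Page, Layout]]:
    """Grow a small, manually produced data set with corrected method output."""
    data = list(seed)  # leave the caller's seed set untouched
    for page in new_pages:
        proposed = analyse(page)                     # output of the method
        data.append((page, correct(page, proposed)))  # corrected before it is kept
    return data
```

Only corrected output is added, so errors made by the method are not fed back into the data set.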

Ground-truth data can also be used in algorithm development for determining the correct path to a solution. The paper by Liang, Phillips and Haralick showed that very good results can be obtained when layout analysis is based on probability tables whose values are calculated from ground-truth data (text lines). Again, a small sample can form the nucleus of a data set that is then expanded (and the probability tables updated) by correcting further results from a method.
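As a rough illustration of how a probability table might be estimated from ground-truth text lines, the sketch below counts how often a quantised feature value co-occurs with each label. It is not the method of the cited paper: the feature (vertical gap between consecutive text lines, in pixels), the labels ("same" and "break"), the bin width and the function names are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_table(samples, bin_width=5):
    """Estimate P(label | gap bin) from (gap_in_pixels, label) ground-truth pairs."""
    counts = defaultdict(Counter)
    for gap, label in samples:
        counts[gap // bin_width][label] += 1  # quantise the gap into a bin
    table = {}
    for gap_bin, label_counts in counts.items():
        total = sum(label_counts.values())
        table[gap_bin] = {lab: n / total for lab, n in label_counts.items()}
    return table


def most_likely(table, gap, bin_width=5, default="unknown"):
    """Return the most probable label for a gap, or `default` for an unseen bin."""
    probs = table.get(gap // bin_width)
    if not probs:
        return default
    return max(probs, key=probs.get)
```

Updating the table after further results have been corrected amounts to calling `build_table` again on the enlarged sample.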

The question that follows is how to achieve a balance between heuristics (if they are necessary) and appropriate statistics. In the discussion it was suggested that although heuristics do play an important role in identifying a solution, that solution should then be formalised with an appropriate statistical analysis.

On the subject of devising co-operative approaches to physical and logical layout analysis, it was agreed that single-pass approaches (considering only physical or only logical information) are outdated. Moreover, the interaction between higher-level and lower-level information is recognised as the way forward for achieving better results in both segmentation and logical labelling.

The fact that there are numerous approaches to layout analysis (of mainly textual documents), each applicable to a specific and often narrow domain, does not point to a constructive way of solving the "layout analysis problem". It can be argued that attempts towards a unified layout analysis approach can be beneficial in identifying good practice and achieving progress. An attempt was made at defining a generic document data structure, and Bob Haralick proposed: "A Document Data Structure (DDS) is a set of relations, each relation having tuples whose components themselves may be atoms or a document data structure". This is a good starting point for further discussion, which should continue beyond the limited time of a single discussion session.
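The recursive nature of the proposed definition can be made concrete with a small sketch. The definition itself does not prescribe any representation, so everything here is an assumption for illustration: relations are given names and stored in a dictionary, atoms are restricted to strings and numbers, and the `add` and `depth` helpers are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union

Atom = Union[str, int, float]    # assumed atom types
Component = Union[Atom, "DDS"]   # a component is an atom or a nested DDS


@dataclass
class DDS:
    """A set of named relations; each relation is a list of tuples."""
    relations: Dict[str, List[Tuple[Component, ...]]] = field(default_factory=dict)

    def add(self, relation: str, *components: Component) -> None:
        """Append one tuple to the named relation, creating it if needed."""
        self.relations.setdefault(relation, []).append(tuple(components))

    def depth(self) -> int:
        """Nesting depth: 1 for a DDS whose tuples contain only atoms."""
        nested = [
            c.depth()
            for tuples in self.relations.values()
            for t in tuples
            for c in t
            if isinstance(c, DDS)
        ]
        return 1 + max(nested, default=0)


# A page whose "contains" relation holds a nested structure for a title line.
line = DDS()
line.add("text", "Generic Document Structure Analysis")
page = DDS()
page.add("contains", "title", line)
```

Because a tuple component may itself be a `DDS`, hierarchical structures arise naturally, while other relations (reading order, for instance) can be stored alongside them in the same structure.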

Document structure representation is a key aspect of any successful approach. Various representations have been proposed, based mainly on either a hierarchical model or a sequential (reading order) one. However, documents containing colour information (especially background images or watermark text), as well as electronic documents with hyperlinks, cannot be adequately described by either model. It was proposed that a different representation scheme, combining a layered description with a hierarchical structure, may be appropriate for documents containing colour information. The discussion ended with agreement on the need for further consideration of these increasingly important classes of documents.
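One possible reading of a scheme that combines layers with a hierarchy is sketched below: each layer carries its own region tree, and the layers are ordered from back to front. No concrete scheme was specified in the discussion, so the class names `Region` and `Layer`, the `z_order` field and the helper `labels_by_layer` are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Region:
    """A node in the hierarchical description of one layer."""
    label: str
    children: List["Region"] = field(default_factory=list)

    def walk(self) -> Iterator["Region"]:
        """Yield this region and all its descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


@dataclass
class Layer:
    """One layer of the page, with its own region hierarchy."""
    name: str
    z_order: int  # assumed convention: lower values lie further back
    root: Region


def labels_by_layer(layers: List[Layer]) -> List[Tuple[str, List[str]]]:
    """List the region labels of each layer, from back to front."""
    ordered = sorted(layers, key=lambda layer: layer.z_order)
    return [(layer.name, [r.label for r in layer.root.walk()]) for layer in ordered]


background = Layer("background", 0, Region("page", [Region("image")]))
text = Layer("text", 1, Region("page", [
    Region("title"),
    Region("body", [Region("paragraph")]),
]))
summary = labels_by_layer([text, background])
```

Keeping a separate hierarchy per layer lets a background image or watermark be described without forcing it into the text hierarchy.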