Seminar: Visual Structure of Web Documents

Date and time: 2 March 2006, 10:30 - 12:00
Room: 336 RB

Visual Structure of Web Documents

Presenter: Radek Burget

Documents on the World Wide Web, most of which are written in HTML or XHTML, contain several types of information. Besides the text content, HTML allows authors to specify the content structure and formatting, which may play an important role for the reader. For many document processing tasks, such as indexing, classification, or information extraction, it is therefore desirable to use models that describe all the kinds of information available in the document. Traditionally, the Document Object Model (DOM) has been used for modeling the structure of HTML documents. However, with additional technologies such as Cascading Style Sheets, the resulting appearance of a document can differ significantly from the underlying DOM. Recently, various authors have proposed new models based on visual segmentation of documents, along with several algorithms for discovering visual areas. This talk will introduce the main approaches to this problem and propose some improvements to existing methods, together with their possible applications.