Word Extraction from Table Regions in Document Images
Vol. 12, No. 4, pp. 369-378, Aug. 2005
10.3745/KIPSTB.2005.12.4.369
Abstract
Document layout analysis segments a document image and classifies its regions as text, picture, or table. Words in table regions are significant for keyword spotting because they are more meaningful than the words in other regions. This paper proposes a method to extract words from table regions in document images. Because word extraction from a table region amounts in practice to extracting words from the cells that compose the table, the cells must be extracted correctly. In the cell extraction module, the table frame is extracted first by analyzing connected components, and the intersection points are then extracted from the table frame. False intersections are corrected using the correlation between neighboring intersections, and the cells are extracted using the intersection information. Text regions in the individual cells are located using the connected-component information obtained during cell extraction, and they are segmented into text lines using projection profiles. Finally, the segmented lines are divided into words using gap clustering and special symbol detection. An experiment on 100 table images extracted from Korean documents shows 99.16% word extraction accuracy.
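The last two stages named in the abstract, text-line segmentation by projection profiles and word division by gap clustering, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`runs`, `split_lines`, `split_words`), the binary list-of-rows image representation, and the use of a one-dimensional two-means clustering of gap widths are all assumptions made for illustration, and special symbol detection is not covered.

```python
def runs(profile):
    """Return (start, end) pairs of consecutive non-zero entries; end is exclusive."""
    out, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            out.append((start, i))
            start = None
    if start is not None:
        out.append((start, len(profile)))
    return out


def split_lines(img):
    """Segment a binary cell image (list of rows, 1 = ink) into text lines
    using the horizontal projection profile (ink count per row)."""
    return runs([sum(row) for row in img])


def split_words(img, top, bottom):
    """Split the text line img[top:bottom] into words.

    Gaps between ink runs of the vertical projection profile are clustered
    into two groups (assumed: small = within a word, large = between words)
    with a simple 1-D two-means iteration. This clustering rule is an
    assumption, not a detail taken from the paper.
    """
    width = len(img[0])
    profile = [sum(img[r][c] for r in range(top, bottom)) for c in range(width)]
    chars = runs(profile)
    if not chars:
        return []
    gaps = [chars[i + 1][0] - chars[i][1] for i in range(len(chars) - 1)]
    if len(set(gaps)) < 2:
        # Zero or identical gaps: no evidence of a word break (assumption).
        return [(chars[0][0], chars[-1][1])]
    lo, hi = min(gaps), max(gaps)
    for _ in range(20):  # fixed iteration count is enough for a sketch
        threshold = (lo + hi) / 2
        small = [g for g in gaps if g <= threshold]
        large = [g for g in gaps if g > threshold]
        lo, hi = sum(small) / len(small), sum(large) / len(large)
    threshold = (lo + hi) / 2
    words, start = [], chars[0][0]
    for i, gap in enumerate(gaps):
        if gap > threshold:
            words.append((start, chars[i][1]))
            start = chars[i + 1][0]
    words.append((start, chars[-1][1]))
    return words


# Toy cell image: two text lines, each containing two words.
rows = [
    "##.##....##.#",
    "##.##....##.#",
    ".............",
    "#.#.....###..",
]
img = [[1 if ch == "#" else 0 for ch in row] for row in rows]
lines = split_lines(img)
words = [split_words(img, top, bottom) for top, bottom in lines]
```

On this toy image the sketch finds two text lines (rows 0-1 and row 3) and two words in each line. A real cell image would first need binarization and removal of the table frame, which the abstract assigns to the cell extraction module.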
Cite this article
[IEEE Style]
C. B. Jeong and S. H. Kim, "Word Extraction from Table Regions in Document Images," The KIPS Transactions:PartB, vol. 12, no. 4, pp. 369-378, 2005. DOI: 10.3745/KIPSTB.2005.12.4.369.
[ACM Style]
Chang Bu Jeong and Soo Hyung Kim. 2005. Word Extraction from Table Regions in Document Images. The KIPS Transactions:PartB, 12, 4, (2005), 369-378. DOI: 10.3745/KIPSTB.2005.12.4.369.