Title |
Transformation of Text Contents of Engineering Documents into an XML Document by using a Technique of Document Structure Extraction |
Keywords |
engineering document;text information;enumerative text contents;document structure extraction;XML;bridge;엔지니어링 문서;텍스트 정보;개조식 구문;문서구조추출;교량 |
Abstract |
This paper proposes a method for transforming unstructured text contents of engineering documents, which have complex hierarchical structure of subtitles with various heading symbols, into a semi-structured XML document according to the hierarchical subtitle structure. In order to extract the hierarchical structure from plain text information, this study employed a method of document structure extraction which is an analysis technique of the document structure. In addition, a method for processing enumerative text contents was developed to increase overall accuracy during extraction of the subtitles and construction of a hierarchical subtitle structure. An application module was developed based on the proposed method, and the performance of the module was evaluated with 40 test documents containing structural calculation records of bridges. The first test group of 20 documents related to the superstructure of steel girder bridges as applied in a previous study and they were used to verify the enhanced performance of the proposed method. The test results show that the new module guarantees an increase in accuracy and reliability in comparison with the test results of the previous study. The remaining 20 test documents were used to evaluate the applicability of the method. The final mean value of accuracy exceeded 99%, and the standard deviation was 1.52. The final results demonstrate that the proposed method can be applied to diverse heading symbols in various types of engineering documents to represent the hierarchical subtitle structure in a semi-structured XML document. |