Information extraction from technical specification PDFs
Tools for extraction of numerical specifications from technical documentation in PDF format
This is a collection of prototype software tools that use several methods to extract technical specification information from PDF documents. The software was created to find the technical specification details needed for dynamic simulation models. It has been tested on technical documentation of marine diesel engines, specifically to find numerical specification data on engine fuel consumption, charge air flow rate, combustion air flow rate, exhaust gas flow rate, and heat balance at different engine loads. The tools are generic and can be applied in other contexts to detect relevant pages in large sets of PDF documents.
There are four Python programs, each implementing a different information extraction method. The programs have a graphical user interface, but the current prototype version still requires the user to be able to modify Python code.
Three of the programs are intended to be used in sequence. The first searches for pages that contain combinations of user-specified keywords. The second groups the found pages by similarity using unsupervised clustering. The third classifies pages by their similarity to model pages, which can be either taken from the unsupervised clustering results or specified by the user. The fourth program is not part of the sequence and can be used independently: it implements a binary relevancy classification method based on a neural network.
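The three-step sequence described above can be illustrated with a minimal Python sketch. It is not the code of the distributed programs: it assumes page text has already been extracted from the PDF files, uses plain bag-of-words cosine similarity, and substitutes a simple greedy threshold clustering for whatever clustering method the tools actually use. All function names, thresholds, and page texts below are hypothetical and made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a page's text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches_keywords(text, keyword_groups):
    """Step 1: True if the page has at least one keyword from every group."""
    words = set(tokenize(text))
    return all(any(k in words for k in group) for group in keyword_groups)


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_pages(vectors, threshold):
    """Step 2 (assumed method): greedy clustering against each cluster's first page."""
    clusters = []  # each cluster is a list of indices into `vectors`
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def classify(vec, model_pages, threshold):
    """Step 3: label of the most similar model page, or None if none is close enough."""
    best_label, best_score = None, threshold
    for label, model_vec in model_pages.items():
        score = cosine(vec, model_vec)
        if score >= best_score:
            best_label, best_score = label, score
    return best_label


# Made-up page texts standing in for text extracted from PDF pages.
pages = [
    "Fuel consumption at 100 % load: 180 g/kWh. Fuel consumption at 75 % load: 178 g/kWh.",
    "Exhaust gas flow at 100 % load: 12.5 kg/s. Exhaust gas flow at 75 % load: 9.8 kg/s.",
    "Fuel consumption at 50 % load: 185 g/kWh. Fuel consumption at 25 % load: 200 g/kWh.",
    "Maintenance schedule: inspect the fuel filter every 500 hours.",
]

# Step 1: a page must mention (fuel OR exhaust) AND load.
keyword_groups = [["fuel", "exhaust"], ["load"]]
hits = [i for i, page in enumerate(pages) if matches_keywords(page, keyword_groups)]

# Step 2: cluster the found pages by similarity.
vectors = [Counter(tokenize(pages[i])) for i in hits]
clusters = cluster_pages(vectors, threshold=0.6)

# Step 3: use one page per cluster as a model page and classify a new page.
model_pages = {"fuel": vectors[0], "exhaust": vectors[1]}
new_page = Counter(tokenize("Fuel consumption at 85 % load: 176 g/kWh."))
label = classify(new_page, model_pages, threshold=0.5)

print(hits, clusters, label)
```

In this sketch the model pages are picked by hand from the clustering output, which mirrors the two options mentioned above: model pages may come from the clusters or be supplied directly by the user.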
Usage and installation instructions are included in the distributed package. Ship engine documents are not included; the user must provide the PDF files to be searched. The neural network method is trained by supervised learning, so the user must also mark the relevant pages in the set of PDF files used as training data.
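As an illustration of what marking relevant pages for supervised training could look like, the following sketch turns a user-supplied list of relevant page numbers into binary labels for every page. The data layout, the function name build_training_set, and the file names are assumptions made for this example; the distributed package may expect a different format.

```python
def build_training_set(page_texts, relevant_pages):
    """Pair every page with a binary label: 1 if marked relevant, else 0.

    page_texts: dict mapping file name -> list of page texts (page 1 first)
    relevant_pages: dict mapping file name -> set of 1-based page numbers
    Both structures are hypothetical, chosen only for this sketch.
    """
    examples = []
    for filename, pages in page_texts.items():
        marked = relevant_pages.get(filename, set())
        for number, text in enumerate(pages, start=1):
            label = 1 if number in marked else 0
            examples.append((filename, number, text, label))
    return examples


# Made-up example: two documents, with page 2 of the first marked as relevant.
page_texts = {
    "engine_a.pdf": ["Introduction", "Fuel consumption at 100 % load", "Contact details"],
    "engine_b.pdf": ["Cover page", "Safety instructions"],
}
relevant_pages = {"engine_a.pdf": {2}}

training_set = build_training_set(page_texts, relevant_pages)
labels = [label for _, _, _, label in training_set]
print(labels)
```

Files that are absent from relevant_pages contribute only negative examples, which matters for a binary relevancy classifier because irrelevant pages usually far outnumber relevant ones.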
The VesselAI project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 957237.