Using Mozenda to Extract Text from PDF and Documents
September 08, 2015
Imagine that the data you need to complete your report is buried inside a PDF and there is no other place on the website to get the information. Without this information, the rest of the report will be of little worth. What now?
If you’re using Mozenda, you’re in luck. Mozenda can extract data from most common document formats. Some of the more popular formats Mozenda supports include:
- PDF (pdf)
- Spreadsheets (xls, xlsx, csv, tsv)
- Word processing documents (doc, docx, rtf, odt)
How Does It Work?
To begin extracting data from a document, click the link on the website that opens the document. Mozenda detects that the document being loaded is not a normal web page and converts it into a web page that can be scraped.
Once the document has been converted, the new web page will be loaded into the Agent Builder where you can begin extracting data from it just as you would any other web page. This includes capturing item lists, tables, images, etc.
The image below shows a Mozenda agent that has been designed to click into each bill (an attached PDF document) and gather the number of kilowatt-hours used and the number of days in each billing period. The captured values are then inserted in line with the bill date and the bill amount, as shown.
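Mozenda agents are built by pointing and clicking, so no code is needed for this. For readers who think in code, though, the shape of the result can be sketched in Python. This is only an illustration of the idea, not Mozenda's implementation: the function name `parse_bill_text`, the field names, and the sample values below are all made up for the example.

```python
import re

def parse_bill_text(text: str) -> dict:
    """Pull kWh used and billing days out of the text of a converted bill."""
    kwh = re.search(r"([\d,]+)\s*kWh", text)
    days = re.search(r"(\d+)\s*days", text)
    return {
        "kwh_used": int(kwh.group(1).replace(",", "")) if kwh else None,
        "billing_days": int(days.group(1)) if days else None,
    }

# Made-up values for illustration only: one row from the bill list,
# and the text of the PDF that the row links to.
list_row = {"bill_date": "2015-08-01", "bill_amount": "84.17"}
bill_text = "Usage this period: 1,204 kWh over 31 days"

# Fields from the document land in the same row as the list-level fields.
combined = {**list_row, **parse_bill_text(bill_text)}
```

Each bill in the list would produce one such combined row, with the values taken from inside the document sitting next to the bill date and amount.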
When Mozenda detects that the document being loaded is not a normal web page, it tries to determine the format of the document. Once the format is determined, Mozenda converts the document from its original format into an HTML document, or “new web page”. Depending on the size of the document, the conversion can take anywhere from a few seconds to a few minutes.
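Mozenda does not publish how its detection and conversion work, so the following Python sketch only illustrates the general detect-then-convert idea and should not be read as Mozenda's method. It assumes detection by a file's leading bytes, and it converts only the simplest case, delimited text such as csv or tsv, into an HTML table. The names `detect_format` and `delimited_to_html` are hypothetical.

```python
import csv
import html
import io

# Well-known leading bytes for some of the formats listed earlier.
MAGIC_PREFIXES = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip"),                       # docx, xlsx, odt are ZIP containers
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy doc, xls
    (b"{\\rtf", "rtf"),
]

def detect_format(data: bytes) -> str:
    """Guess a document format from its first bytes; fall back to plain text."""
    for prefix, name in MAGIC_PREFIXES:
        if data.startswith(prefix):
            return name
    return "text"

def delimited_to_html(text: str, delimiter: str = ",") -> str:
    """Convert csv/tsv text into a bare HTML table, escaping cell contents."""
    rows = csv.reader(io.StringIO(text), delimiter=delimiter)
    body = "".join(
        "<tr>" + "".join(f"<td>{html.escape(cell)}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table>{body}</table>"
```

Once a document is in HTML form like this, its rows and cells can be selected the same way as the elements of any other web page, which is the point of the conversion step.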
Mozenda stores information about the type of document being converted and other settings in the action properties of the Click Item action. You can modify these settings in the Action Properties panel of the action. The image above shows some of those properties for a Click Item action.
If you have any questions about extracting data from common file types, please contact our support department or view the following help topic about scraping data from a PDF.