Web Scraping Tips – Capture Table Action

April 21, 2015

In our latest release, we added the Capture Table action, which makes capturing data from tables easier and more accurate than ever, especially when capturing data from dynamic tables or tables with complicated layouts.

In this walkthrough, we’ll introduce this new action and share a few of the benefits of using it.

Overcoming the challenges of capturing data from tables

In the past, capturing data from tables has often required more time and custom configuration than capturing other types of data. It is also challenging because table configurations on the internet vary so widely. Columns frequently change their position within the table or appear only under certain circumstances. Sometimes, adjacent cells within the table are merged across multiple columns. Other table layouts include multiple header rows. The list of variations is long, and this inconsistency has always been a challenge for Agent Builders as they try to accurately capture the content in tables.

The Capture Table action responds to these challenges with a new time-saving table capturing process that improves precision and accuracy, and adjusts dynamically to accommodate changes in the table layout. All of this happens automatically, so you no longer have to manually maintain the Capture actions that collect the table’s data.

Automatic association with nearby text

Capture actions within a Capture Table action are automatically associated with a column or row by its header name rather than by its position in the table. This means that when a table header changes position, its data is still recognized and captured into the correct field.
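To illustrate the idea of keying data by header name instead of position, here is a minimal Python sketch. It is not Mozenda's implementation; the function name `rows_to_records` and the sample tables are hypothetical and exist only for this example.

```python
def rows_to_records(table):
    """Turn a table (list of rows, first row = headers) into dicts keyed by header name."""
    headers = [h.strip() for h in table[0]]
    return [dict(zip(headers, row)) for row in table[1:]]

# The same data, with the columns in a different order on two pages (made-up sample data).
page_a = [["Name", "Price"], ["Widget", "9.99"]]
page_b = [["Price", "Name"], ["9.99", "Widget"]]

records_a = rows_to_records(page_a)
records_b = rows_to_records(page_b)
```

Because each value is looked up by its header, both pages yield the same record even though the columns swapped places.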

Capture tables with multiple header rows

Some tables have two or more rows of headers, where a row of sub-headers exists beneath a row of general headers. The Capture Table action recognizes which sub-headers are associated with each general header and automatically assigns appropriate field names to the columns.
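As a rough sketch of how stacked headers can be turned into field names, the Python below joins each general header with the sub-headers beneath it. The function `flatten_headers`, the `" - "` separator, and the sample headers are assumptions made for illustration; the naming scheme the Capture Table action actually uses may differ.

```python
def flatten_headers(general, sub):
    """general: list of (label, colspan) pairs; sub: one sub-header label per column."""
    names = []
    col = 0
    for label, span in general:
        for sub_label in sub[col:col + span]:
            # Join the two levels; drop an empty level instead of emitting a stray separator.
            parts = [p for p in (label.strip(), sub_label.strip()) if p]
            names.append(" - ".join(parts))
        col += span
    return names

# Hypothetical two-row header: "Dimensions" and "Price" each span two sub-headers.
general = [("Product", 1), ("Dimensions", 2), ("Price", 2)]
sub = ["", "Width", "Height", "USD", "EUR"]
field_names = flatten_headers(general, sub)
```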

Intuitively collect image URLs

Many tables include columns that are populated with images rather than text data. These are recognized by the Capture Table action and the associated Capture actions adjust to capture the image URL.
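The following Python sketch shows one way to fall back to an image URL when a cell contains no text, using only the standard library's `html.parser`. The class name `CellParser` and the sample HTML are hypothetical, and this is an illustration of the concept rather than a description of how the Agent Builder works internally.

```python
from html.parser import HTMLParser

class CellParser(HTMLParser):
    """Collects one value per <td>: the cell text, or the image src when the cell has no text."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False
        self._text = []
        self._img = None

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            # Start a fresh cell.
            self._in_cell, self._text, self._img = True, [], None
        elif tag == "img" and self._in_cell:
            self._img = dict(attrs).get("src")

    def handle_data(self, data):
        if self._in_cell:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._in_cell:
            text = "".join(self._text).strip()
            # Prefer text; fall back to the image URL (None if the cell is empty).
            self.cells.append(text or self._img)
            self._in_cell = False

parser = CellParser()
parser.feed('<tr><td>Widget</td><td><img src="/img/widget.png"></td></tr>')
cells = parser.cells
```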

Dynamic field recognition

This is a big time-saver. Some tables have columns that do not always appear. Tables showing product specifications, for example, may show certain columns only on the product pages where they are applicable, resulting in a constantly changing set of columns.

In the past, you either needed to identify all possible table columns in advance and add Capture actions for them manually, or limit your Agent to capturing just those columns that always appear in the table.

The Capture Table action solves this problem by recognizing new columns and automatically creating new Capture actions while the Agent is running.
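Conceptually, this resembles building the list of fields as you go instead of fixing it up front. The Python sketch below assumes each page has already been reduced to a header-to-value mapping; `merge_records` and the sample product data are hypothetical names and values chosen for this example.

```python
def merge_records(pages):
    """pages: list of dicts (header -> value). Returns (fields, rows) with a stable field order."""
    fields = []  # grows whenever a previously unseen header appears
    for record in pages:
        for header in record:
            if header not in fields:
                fields.append(header)
    # Fill columns a page does not have with an empty string.
    rows = [[record.get(f, "") for f in fields] for record in pages]
    return fields, rows

# Two product pages whose specification tables show different columns (made-up data).
pages = [
    {"Model": "A1", "Weight": "2 kg"},
    {"Model": "B2", "Voltage": "12 V"},
]
fields, rows = merge_records(pages)
```

Here the second page introduces a "Voltage" column that the first page lacked, and it is simply added as a new field.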

When to use the Capture Table action

The Capture Table action most effectively captures data laid out in a grid where there is a label or header associated with each column or row. The slideshow above shows some examples of data organized into tables on different webpages. You’ll notice that sometimes the table headers are on the left and sometimes on the right. Sometimes there will be stacked header rows or multiple columns under a single table header. The Capture Table action is versatile enough to handle all of these scenarios.
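For tables whose headers run down the left side, the same header-based idea applies once the grid is transposed. This short Python sketch assumes the first cell of every row is its header; `records_from_left_headers` and the sample table are hypothetical.

```python
def records_from_left_headers(table):
    """table: list of rows where the first cell of each row is that row's header."""
    headers = [row[0] for row in table]
    # Transpose the remaining cells so each column of values becomes one record.
    columns = zip(*(row[1:] for row in table))
    return [dict(zip(headers, col)) for col in columns]

# Made-up specification table with headers in the left-hand column.
spec_table = [
    ["Model", "A1", "B2"],
    ["Weight", "2 kg", "3 kg"],
]
records = records_from_left_headers(spec_table)
```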

How to use the Capture Table action

Below, we will show you the steps to create a Capture Table action. We recommend following along by using the Agent Builder to capture data from a table that interests you.

1. To get started creating a Capture Table action, click on any text in a table:

2. When the What do you want to do? window appears, choose Capture a table:

Notice that instructions for creating the Capture Table action appear in the upper-left panel of the Agent Builder:

3. Click two distinct header cells, one after the other. This will help the Agent Builder identify all of the headers in the table.

After clicking two distinct header cells, the Agent Builder will automatically identify all of the headers in the table and will show you a list of the text contained within each header cell:

4. Next, click two non-header cells in the same column, one after another. It isn’t important which column contains the two non-header cells you click, just that the two cells you click are not headers and are in the same column. This helps Mozenda identify the table’s rows.

5. Next, click two distinct cells from the same non-header row, one after another. Just like the previous step where you clicked two non-header cells in the same column, click any two non-header cells from any row. This helps Mozenda determine the location of the cells within each row.

The Agent Builder now has the information it needs to discover all of the columns, rows, and cells of the table. The Agent Builder will show you a preview of the data it finds:

6. Finally, click the Save button. The Capture Table action will appear with its Capture actions on the page.