How to Create a High-Volume Web Scraping Workflow in Mozenda

March 22, 2017

If you’re starting a project that involves scraping large amounts of data from multiple websites, your workflow will look a bit different from that of a small-scale project.

Mozenda has been built from the ground up to accommodate large data sets extracted from multiple sources. Recent feature releases have improved this capability even further, allowing you to easily configure settings in bulk and streamlining the data acquisition and integration processes.

Note that this article will be focused on the overall architecture of a project. If you’re looking for assistance with building agents, our Help Center has you covered.

1. Be Prepared

If you’ve found our website and possibly even signed up for our free trial, that means you have a good idea of what you want to do with Mozenda. Now is the perfect time to outline your project goals, if you haven’t already.

Here is a checklist of the items you should have ready to begin a new project in Mozenda:

A list of all the websites from which the data will be scraped.
A list of the fields that will be needed from the above websites. Consider printing out key web pages so they can be marked or highlighted for future reference, and consider collecting any data that is not needed right now but could be useful later on.
The names and formats (number, time, etc.) of the fields. Field names in Mozenda can contain letters, numbers, spaces and hyphens, and can be up to 50 characters long.
The actions used to reach the required data fields. Mozenda automates manual interaction, and will need to be “trained” to find the data it needs through navigation, menus, search queries, and so on.
The frequency and timing of each scrape. (E.g., every Monday at 12:00 a.m.)
The file type and destination of the data. Mozenda can provide CSV, TSV and JSON document formats and supports direct file downloads, email, FTP, Azure, Amazon S3 and Dropbox.
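The naming rule in the checklist can be checked before any agent is built. Here is a minimal sketch in Python; it is not part of Mozenda. The function name `is_valid_field_name` and the sample field names are hypothetical, and the pattern assumes the rule means 1 to 50 characters drawn from ASCII letters, digits, spaces and hyphens.

```python
import re

# Assumption: a valid field name is 1-50 characters long and uses only
# ASCII letters, digits, spaces and hyphens.
FIELD_NAME_PATTERN = re.compile(r"[A-Za-z0-9 -]{1,50}")


def is_valid_field_name(name: str) -> bool:
    """Return True if the name fits the assumed naming rule."""
    return FIELD_NAME_PATTERN.fullmatch(name) is not None


# Hypothetical field list drawn up during project planning.
planned_fields = ["Product Name", "Price", "Last-Updated", "Price ($)"]

# Collect the names that would need to be renamed.
invalid = [name for name in planned_fields if not is_valid_field_name(name)]
print(invalid)
```

Running a planned field list through a check like this catches names that would need to be changed before they are set up as template fields.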

2. Create a Template

First off, let’s talk about agent groups.

Agent groups make syncing settings between agents a simple process, and also allow you to create fields to be used in agents that are not built yet.

Start by creating a new agent group from the Agents tab. Next, refer to the list of fields from step 1 to set up template fields. Once these are in place, you can easily add them to an agent while it is being built.

This field template functionality makes it easier to combine data later and also ensures that important fields aren’t overlooked during the process of creating an agent.

For a thorough introduction to agent groups (including how to use them with existing agents), refer to this help article.

3. Create a Combined Collection (optional)

Building new agents within a group does not automatically aggregate their data; aggregation is handled by a combined collection.

When an agent runs, the data collected is placed in a collection. Combined collections offer a way to easily merge these collections into a single repository, which can then be used to remove duplicate entries (dedupe), sort and filter results, and publish the data.

This step is marked as optional due to the overhead it introduces to the process. If your agents are collecting vast amounts of data, using combined collections will lead to time delays and may not be appropriate for time-sensitive projects.

To create a new collection, click the Collections tab, click New collection and select From existing collections in the dropdown. Select the agents to be added, then click Create collection. You will then be prompted to add fields to the collection as needed. By repeating these steps, you can create different variations of the data gathered for teams or departments within your organization that have unique needs.
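Conceptually, a combined collection gathers the rows of several agent collections under one shared set of fields. The sketch below illustrates that idea in plain Python on in-memory data. It does not call Mozenda, and the function `combine_collections`, the agent names and the sample rows are all hypothetical. Tagging each row with `ItemSourceName` mimics the system-generated field of that name; how Mozenda fills it internally is not shown here.

```python
from typing import Dict, List

Row = Dict[str, str]


def combine_collections(
    collections: Dict[str, List[Row]], fields: List[str]
) -> List[Row]:
    """Merge per-agent rows into one list that shares the given fields."""
    combined: List[Row] = []
    for agent_name, rows in collections.items():
        for row in rows:
            # Keep only the chosen fields; fill missing ones with "".
            merged = {field: row.get(field, "") for field in fields}
            # Record which agent produced the row.
            merged["ItemSourceName"] = agent_name
            combined.append(merged)
    return combined


# Hypothetical collections from two agents with slightly different fields.
collections = {
    "Store A Agent": [{"Product Name": "Widget", "Price": "9.99"}],
    "Store B Agent": [{"Product Name": "Gadget"}],
}
combined = combine_collections(collections, ["Product Name", "Price"])
for row in combined:
    print(row)
```

Calling the function again with a different field list gives a different variation of the same data, much like creating several combined collections for different teams.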

4. Configure Settings

Agents, agent groups, and combined collections each have their own set of settings available, and there is some overlap. Here is a brief rundown of the options available for agents and agent groups:

Harvesting. Here you can choose how the agent behaves while harvesting (scraping). This includes requesting images (turned off by default), error handling options, and more.
Notifications. Specify an email address and select when you want to be notified of agent activity, such as when the agent completes successfully or stops when an error occurs. Global notifications can also be set up so that an email is sent for all agent activity.
Publishing. Choose where your data is sent when the agent completes. Mozenda supports email, FTP, Amazon S3, Azure, Dropbox and Google Drive. Publishing can be performed as soon as the agent completes or on a schedule.
Scheduling. Schedule an agent to run on specific days at certain times (or in intervals).

Keep in mind that the above settings can be configured once using an agent group and then applied to agents belonging to that group. If you create a new agent in the group, the settings are automatically applied.

Like agent collections, combined collections can be published, but always on their own schedule, which is separate from agent settings. Make sure that every agent feeding a combined collection completes before the collection publishes, so that the published data is up to date.
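One way to pick a publish time for a combined collection is to work out when the slowest agent is expected to finish and add a safety margin. The following Python sketch shows the arithmetic only. The function `earliest_safe_publish`, the start times, the run durations and the 15-minute buffer are all assumptions for illustration; real run times should come from your own agent history.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def earliest_safe_publish(
    agent_runs: List[Tuple[datetime, timedelta]],
    buffer: timedelta = timedelta(minutes=15),
) -> datetime:
    """Return the latest expected finish time plus a safety buffer.

    Each entry in agent_runs is (scheduled start, expected duration).
    """
    latest_finish = max(start + duration for start, duration in agent_runs)
    return latest_finish + buffer


# Hypothetical schedule: two agents that run early on a Monday morning.
runs = [
    (datetime(2017, 3, 27, 0, 0), timedelta(hours=2)),
    (datetime(2017, 3, 27, 1, 0), timedelta(minutes=30)),
]
publish_at = earliest_safe_publish(runs)
print(publish_at)  # 2017-03-27 02:15:00
```

If an agent’s run time varies a lot, a larger buffer is the safer choice.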

Combined collections also support removing duplicates (deduping) and custom views. After you select one or more fields as unique (steps 1-7 in this walkthrough), Mozenda clears out extra rows of data that contain duplicate values in the unique field(s). Custom views support sorting and filtering data in a collection and can be used during publishing. Follow this guide to create a new view.

5. Common Issues

If you’re running into trouble with the above steps, don’t panic. The following sections will cover potential problems you might encounter along with some helpful tips and resources.

Agent(s) Not Working

This can happen for a variety of reasons, and it may take some troubleshooting to pin down the problem. Check the agent dashboard for error details, then refer to our help center or contact our support team for assistance.

Missing Data

If you are using a combined collection with unique fields, remember that detected duplicates are removed. For example, if the product name is the only unique field and several variations of a product share the same name, only the first entry will be included in the collection. Where possible, base the unique fields on two or more values that will not overlap.
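The effect of the unique-field choice is easy to reproduce on a small data set. This Python sketch is an illustration, not Mozenda’s own deduping logic. It assumes that the first row seen for each combination of unique values is kept, and the function `dedupe`, the field names and the sample rows are hypothetical.

```python
from typing import Dict, List

Row = Dict[str, str]


def dedupe(rows: List[Row], unique_fields: List[str]) -> List[Row]:
    """Keep the first row for each combination of unique field values."""
    seen = set()
    kept: List[Row] = []
    for row in rows:
        key = tuple(row.get(field, "") for field in unique_fields)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Two color variations of one product, plus one true duplicate.
rows = [
    {"Product Name": "Trail Shoe", "Color": "Red", "Price": "59.00"},
    {"Product Name": "Trail Shoe", "Color": "Blue", "Price": "62.00"},
    {"Product Name": "Trail Shoe", "Color": "Red", "Price": "59.00"},
]

by_name = dedupe(rows, ["Product Name"])
by_name_and_color = dedupe(rows, ["Product Name", "Color"])
print(len(by_name))            # 1: the blue variation is lost
print(len(by_name_and_color))  # 2: only the true duplicate is removed
```

With the product name alone as the unique field, the blue variation disappears along with the real duplicate. Adding a second field keeps both variations.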

If you don’t see the expected data right away, give our system a few minutes after an agent finishes. The collection may also need to be reloaded using the refresh button just above the field list. If that doesn’t help, try rebuilding the combined collection (if you are using one). To do so, click the tools icon, select “Rebuild” and wait for the process to complete; extremely large collections will take some time. For more help, check out the troubleshooting section in our help center.

Wrong Data

This usually occurs when the structure of a website changes, and it will require adjustments to the agent. If it isn’t immediately clear which agent collected the data, the source can be traced by either creating a new view or editing the default one to display a system-generated field, ItemSourceName. This will display a new column that shows the name or ID of the agent that produced the data.
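Once ItemSourceName is included in a published file, the same tracing can be done offline. The Python sketch below assumes a CSV export that contains an `ItemSourceName` column. The function names, the `Price` field, the validity check and the sample data are hypothetical and only meant to show the approach.

```python
import csv
import io
from collections import Counter
from typing import Callable


def suspect_rows_by_agent(
    csv_text: str, field: str, is_valid: Callable[[str], bool]
) -> Counter:
    """Count rows per agent whose value in `field` fails the check."""
    counts: Counter = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = row.get(field) or ""  # treat a missing value as empty
        if not is_valid(value):
            counts[row["ItemSourceName"]] += 1
    return counts


def looks_like_price(value: str) -> bool:
    """Very rough check: the value parses as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


# Hypothetical export with the ItemSourceName column included.
sample = (
    "Product Name,Price,ItemSourceName\n"
    "Widget,9.99,Store A Agent\n"
    "Gadget,Add to cart,Store B Agent\n"
    "Gizmo,4.50,Store A Agent\n"
    "Doohickey,,Store B Agent\n"
)
suspects = suspect_rows_by_agent(sample, "Price", looks_like_price)
print(suspects.most_common())
```

In this sample, every suspect row comes from one agent, which points to the website that agent scrapes as the one that has likely changed.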