Lists: The Foundational Principle of Every Successful Website Scraping Agent

May 02, 2016

The Internet is full of lists. Think of the websites you visit regularly; aren’t almost all of them full of lists? Blogs, e-commerce sites, Facebook, Twitter, Instagram: they all contain lists of data in one form or another. That is precisely why the foundational action of almost every Agent you will ever build with Mozenda is a list. Some Agents contain a single list; other, more complex Agents may contain multiple lists of different types. Although Mozenda has several different kinds of lists, they all serve one simple purpose: to let the user automate repetitive tasks on a website. Users teach Mozenda once how to capture data or enter inputs, and Mozenda then repeats that process for every item in the list.

In this post, we’ll first examine the basic structure of all Mozenda lists. Next, we’ll discuss the purpose of the three most popular list types in Mozenda and how they can be used together:

  1. Item Lists
  2. Data Lists
  3. Input Lists

List Structure

Although there are several different kinds of lists in Mozenda, they all follow the same basic structure.

Begin List Action

First, they all start with a Begin List action. This action specifies what type of list it is and also defines the items that should be included in the list. On certain types of lists you can also create list refinements on the Begin List action that include or exclude items based on certain conditions.

Inner List Actions

Next, all lists have one or more actions after the Begin List action that are repeated for every item in the list. An inner action may instruct Mozenda to capture a particular piece of data, like a product name, or it may be a click action that instructs Mozenda to click on a link for each item in the list and navigate to a new web page.

End List Action

Finally, all lists have an End List action specifying where the list ends. This action tells Mozenda that all the work for a particular list item is complete and it needs to check with the Begin List action to see if there are any additional list items that need to be processed.
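
If you think in code, the three-part structure above behaves much like a filtered loop. The Python sketch below is only an analogy: Agents are built in the Mozenda Agent Builder, not written as code, and the names used here (run_list, inner_actions, refinement) are hypothetical and not part of Mozenda.

```python
from typing import Callable, Iterable, List, Optional


def run_list(
    items: Iterable[dict],
    inner_actions: List[Callable[[dict], dict]],
    refinement: Optional[Callable[[dict], bool]] = None,
) -> List[dict]:
    """Illustrative analogy of a list, not Mozenda's actual implementation."""
    results = []
    # "Begin List": defines which items belong to the list, applying an
    # optional refinement that includes or excludes items.
    for item in items:
        if refinement is not None and not refinement(item):
            continue
        # "Inner list actions": repeated once for every item.
        record = {}
        for action in inner_actions:
            record.update(action(item))
        results.append(record)
        # "End List": work for this item is complete, so control returns to
        # the top of the loop to check whether more items remain.
    return results


# Made-up sample data standing in for items found on a web page.
products = [
    {"name": "Widget", "price": 4.50},
    {"name": "Gadget", "price": 12.00},
    {"name": "Gizmo", "price": 7.25},
]

captured = run_list(
    products,
    inner_actions=[
        lambda p: {"product_name": p["name"]},  # capture the product name
        lambda p: {"price": p["price"]},        # capture the price
    ],
    refinement=lambda p: p["price"] < 10,       # keep only items under 10
)
print(captured)
```

The refinement plays the role of a condition on the Begin List action, and each function in inner_actions stands in for one capture action.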

Item Lists

An Item List is the most common of the list types used in Mozenda, likely because it is the most flexible. Most websites that contain large amounts of data that organizations would like to extract present that data in some type of list. Some examples of these list formats are shown in the image gallery above, and several situations where an Item List would be appropriate are described below:

  • Capturing and clicking into each product category on an e-commerce site.
  • Capturing a list of products along with each product's associated image and price.
  • Capturing a list of all the reviews left on a popular excursion on a travel site.
  • Capturing a list of images associated with a car listing on a used car website.

Data Lists

Data Lists allow users to upload a spreadsheet of inputs into Mozenda Collection and then use the values in the spreadsheet as inputs on a target website. Imagine you had the UPCs for 100,000 products and you wanted to search for each of them individually on a target website and gather the price and availability. Data Lists allow you to do just that. Furthermore, once you have an Agent set up to use inputs from a spreadsheet, you can change the contents of that spreadsheet without having to change the Agent at all.
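
The same separation of inputs from logic can be sketched in a few lines of Python. This is an analogy under stated assumptions, not Mozenda's mechanism: the in-memory spreadsheet, its upc column, the sample UPC values, and the search_site function are all hypothetical placeholders.

```python
import csv
import io

# Stand-in for an uploaded spreadsheet; the UPC values are made up.
SPREADSHEET = io.StringIO("upc\n012345678905\n036000291452\n")


def search_site(upc: str) -> dict:
    """Hypothetical placeholder for searching the target website.

    A real lookup would fill in price and availability; here they stay None.
    """
    return {"upc": upc, "price": None, "availability": None}


def run_data_list(spreadsheet) -> list:
    """Repeat the same search once for every row in the spreadsheet."""
    results = []
    for row in csv.DictReader(spreadsheet):
        # Values are read as strings, so leading zeros in UPCs are preserved.
        results.append(search_site(row["upc"]))
    return results


results = run_data_list(SPREADSHEET)
print(results)
```

Swapping in a different spreadsheet changes what gets searched without touching run_data_list, which mirrors how the Agent stays the same when the spreadsheet changes.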

Input Lists

Input Lists are similar to Data Lists because they allow users to set up a list of inputs to process; however, they differ in the way the list items are obtained. Input Lists receive their items either from the user, who enters them manually, or from a combo box or drop-down control on a website. For example, most auto-parts sites require the user to first enter the make, model, and year of the car before searching for relevant parts. Most of the time this is done by selecting the correct value from a drop-down control. Using Input Lists in Mozenda, the user can instruct the system to iterate through all possible combinations of those values.
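
Iterating through every combination of drop-down values amounts to a Cartesian product. The Python sketch below shows the idea with made-up option values; it assumes the three drop-downs are independent of each other, which may not hold on a real site where the available models depend on the selected make.

```python
from itertools import product

# Made-up option values standing in for three drop-down controls.
makes = ["Make X", "Make Y"]
models = ["Model A", "Model B"]
years = [2015, 2016]

# Every (make, model, year) combination, in drop-down order.
combinations = list(product(makes, models, years))

for make, model, year in combinations:
    # A real Agent would select these values and run the search here.
    print(make, model, year)
```

With two options in each of three drop-downs, the loop body runs 2 × 2 × 2 = 8 times.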

A Combination of Lists

As you become proficient with the Mozenda Agent Builder and have a few Agents under your belt, you’ll start to see how multiple types of lists can be used together in the same Agent to simplify and organize your data collection process even more.

For example, imagine you want to search several hundred zip codes to find all of the locations of a popular retailer. You’ll likely already have a spreadsheet containing the zip codes you want to search, and you can use it by creating a Data List in your Agent that systematically enters one zip code at a time and clicks the button to find nearby locations. Then, you can create an Item List on the results to cover all the locations that fall within that geographical area. For each location, you can capture its contact information and operating hours.
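
In code terms, this combination is a loop inside a loop: the Data List is the outer loop over zip codes and the Item List is the inner loop over the locations returned for each one. The Python sketch below is an analogy only; the zip codes, the FAKE_RESULTS data, and the find_locations function are invented for illustration and stand in for the retailer's store locator.

```python
# Made-up zip codes standing in for the uploaded spreadsheet.
ZIP_CODES = ["84057", "84058"]

# Made-up search results standing in for the retailer's store locator.
FAKE_RESULTS = {
    "84057": [{"phone": "555-0100", "hours": "9-5"}],
    "84058": [
        {"phone": "555-0101", "hours": "9-5"},
        {"phone": "555-0102", "hours": "10-6"},
    ],
}


def find_locations(zip_code: str) -> list:
    """Hypothetical placeholder for entering a zip code and clicking search."""
    return FAKE_RESULTS.get(zip_code, [])


def collect_locations(zip_codes: list) -> list:
    rows = []
    for zip_code in zip_codes:                     # outer loop: the Data List
        for location in find_locations(zip_code):  # inner loop: the Item List
            # Capture contact information and hours for each location.
            rows.append({"zip": zip_code, **location})
    return rows


rows = collect_locations(ZIP_CODES)
print(rows)
```

Each output row records which zip code produced it, so results from different searches remain distinguishable.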

Additional Information

This post provides a cursory introduction to the most common list types and their purposes. The Capture Name-Value Pair action and the Capture Table action are also types of lists, with more specialized use cases. In addition, some websites require nesting lists inside each other to achieve the desired results.
