How do I harvest data from a website?

December 17, 2015

Mozenda provides two tools to help you gather data from websites: the Agent Builder and the Web Console. The Agent Builder helps you construct automated scripts, called agents, to collect the data. These agents and the data they collect are managed through the Web Console, Mozenda’s online portal, where you can run or schedule agents, publish data, and more. This article demonstrates how to gather data from a website through the following three processes:

  1. Building an Agent
  2. Running the Agent
  3. Downloading the Results

 

Building an Agent

 

    1. Open the Agent Builder.

    2. Type (or copy and paste) a website URL into the Agent Builder’s address bar.

    3. Click Start a new Agent from this page.

    4. On the web page, click the first item in the list you want to capture.

    5. Click Capture List.

    6. Click a similar item further down the list. Generally, any item other than the first will work. The Agent Builder will compare the second clicked item with the first (which was clicked in step 4) and automatically build out the remainder of the list to include all similar items.

    7. Give the field a name. In the output spreadsheet, the field name will become the header for the column containing the data you just collected.

    8. Click Create Action.

    9. Click the desired data associated with the first item in the list.

    10. Click Capture Text. Alternatively, you can download images or capture the URLs of links.

    11. Give the field a name and click Save.

    12. Confirm that the data is captured correctly into a new column in the Captured Text Preview. The Agent Builder automatically associates the additional data with each item in the list. Data related to the first item in the list should appear in the first row, data for the second list item in the second row, and so on. You can repeat steps 9 through 12 for each additional piece of information you want to collect from the list.

    13. Scroll up or down the web page until you can see the button or link that navigates to the next page (hereafter referred to as the “next page button”).

    14. Click the next page button.

    15. Click Page List.

    16. Click Create Page List in the resulting window. The next page button in the web page should now be highlighted in blue, indicating that the agent is now configured to go to the next page (and all remaining pages) after collecting all data from the current page.

    17. Click the first item in the list. 

    18. Click Click Item.

    19. Click a block of text.

    20. Click Capture Text, give the field a name, and click Save. You do not need to send the agent back to the list (on the first page) to continue scraping data. The agent will return to the list automatically after gathering data from the details page.

    21. Click Test Agent to watch it work.

    22. When you are satisfied that the agent works properly, stop the test by clicking Stop.

    23. Click File.

    24. Click Save As.

    25. Give the agent a name. Agent names can only contain letters, numbers, spaces, and/or hyphens.

    26. (Optional) Give the agent a description.

    27. Click Save.

    28. Close the resulting window by clicking Close. You can now close the Agent Builder program. In the next section, you will use the Web Console to run the agent.
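
To make the behavior configured above easier to picture, here is a minimal conceptual sketch in Python of the order in which the agent works: it captures each list item on a page, visits that item's details page, returns to the list, and moves to the next page only after the current page is finished. This is not Mozenda code or a Mozenda API. The SITE data, the field names (Name, Price, Description), and the run_agent function are all hypothetical, and the assumption that details-page text lands in the same row as its list item is ours.

```python
from typing import Dict, List

# Hypothetical in-memory stand-in for a paginated website.
# Each inner list is one page; each dict is one list item on that page.
SITE: List[List[Dict[str, str]]] = [
    [  # page 1
        {"name": "Item A", "price": "10", "details": "About A"},
        {"name": "Item B", "price": "20", "details": "About B"},
    ],
    [  # page 2
        {"name": "Item C", "price": "30", "details": "About C"},
    ],
]


def run_agent(site: List[List[Dict[str, str]]]) -> List[Dict[str, str]]:
    """Return one output row per list item, in page order."""
    rows: List[Dict[str, str]] = []
    for page in site:  # page list: advance only after the current page is done
        for item in page:  # captured list: one row per item
            # Fields captured directly from the list (steps 4 to 12).
            row = {"Name": item["name"], "Price": item["price"]}
            # Click Item, capture text on the details page, then return
            # to the list automatically (steps 17 to 20).
            row["Description"] = item["details"]
            rows.append(row)
    return rows


rows = run_agent(SITE)
for row in rows:
    print(row)
```

Each field name becomes a column header, and each list item becomes one row, matching what the Captured Text Preview shows while you build the agent.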

Running the Agent

 

    1. Open the Web Console.

    2. Click the Agents tab.

    3. Click the agent to be run (e.g., the agent that was created in the Building an Agent section). This opens the agent dashboard in the Web Console. From the agent dashboard, you can manage the agent’s settings (including how it runs, how it displays data, and how it makes the data available for download), run or stop the agent, view the data gathered by the agent, and more.

    4. Click Run Now. This starts an agent job. Learn more about the difference between agents and jobs in this article’s Notes section.

    5. Wait a few minutes, then refresh the view to see incoming data. You can sort and filter the data before exporting it if you would like. You can also schedule the agent to run automatically at a specific time and/or at regular intervals.

 

Downloading the Results

 

    1. Click the Tools icon.

    2. Click Export.

    3. Click Download.

    4. Click Save.
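
Once the file is saved, you may want to process it with your own tools. The sketch below assumes the export was saved as a CSV file whose first row holds the field names you gave in the Agent Builder. The sample data and the column names (Name, Price, Description) are hypothetical, and load_rows is an illustrative helper, not part of any Mozenda product.

```python
import csv
import io
from typing import Dict, List, TextIO

# Hypothetical sample standing in for a downloaded export file.
SAMPLE_EXPORT = """Name,Price,Description
Item A,10,About A
Item B,20,About B
"""


def load_rows(text_file: TextIO) -> List[Dict[str, str]]:
    """Read CSV text with a header row into a list of dicts keyed by field name."""
    return list(csv.DictReader(text_file))


rows = load_rows(io.StringIO(SAMPLE_EXPORT))
for row in rows:
    print(row["Name"], row["Price"])
```

To read a real file instead of the sample, open it with open(path, newline="", encoding="utf-8") and pass the file object to load_rows. Adjust the encoding if your export uses a different one.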

 

Notes

Do you want your agent to publish its data automatically after finishing scraping its target website?

Publish an Agent’s Data to an Email
Publish an Agent’s Data to an FTP Server
Publish an Agent’s Data to an Amazon S3 Bucket
Publish an Agent’s Data to Microsoft Azure

 

The Difference Between Agents and Jobs
An agent is a set of instructions outlining the actions to be performed against a specific website. These instructions are saved in a file called an “agent definition file.” Settings for an agent, data extracted by an agent, and files associated with an agent are accessible via that agent’s dashboard in the Web Console.

A job is a single instance or copy of those instructions being actively used by the Mozenda harvesting servers to gather data from the target website. A single agent can have multiple jobs running at the same time (perhaps with differing parameters), each depositing data into the same agent’s data collection.
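
The relationship between agents and jobs can be sketched as a small data model. This is only an illustration of the description above, not how Mozenda implements it: the Agent and Job classes, their fields, and the "keyword" parameter are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Agent:
    """A saved set of instructions plus the data collection it owns."""
    name: str
    instructions: List[str]
    collection: List[Dict[str, str]] = field(default_factory=list)


@dataclass
class Job:
    """One run of an agent's instructions, possibly with its own parameters."""
    agent: Agent
    parameters: Dict[str, str]

    def run(self) -> None:
        # A real job would execute the agent's instructions against the
        # target website. Here it only deposits one placeholder row,
        # built from its parameters, into the shared collection.
        self.agent.collection.append(dict(self.parameters))


agent = Agent(name="Product-List", instructions=["capture list", "page list"])
job_a = Job(agent, {"keyword": "shoes"})
job_b = Job(agent, {"keyword": "hats"})
job_a.run()
job_b.run()
print(len(agent.collection))  # both jobs wrote to the same agent's collection
```

The point of the model is that jobs hold no data of their own: every job, whatever its parameters, writes into the collection that belongs to its agent.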
