New Features Release Description: Powerful data extraction tool
August 02, 2017
We are very excited to announce a significant new feature release. Though we originally thought it would be a relatively simple change, it has turned into one of our biggest releases in the last several years. We’ve made a lot of changes on our backend to improve the results you see, keeping Mozenda as the most robust and powerful tool in the marketplace for data extraction.
Our overarching goal in this new release is to give you the option to do more with your data while allowing you to process more data.
This new release focuses on bookmarks and storing the history of your data, allowing you to change how you view and filter it without losing information. The first thing to know is that Store Item History is not turned on by default.
Since your account subscription is partially determined by how much space it takes to store your collection data, we’re keeping you in control of what you want to keep in the system. Without Store Item History turned on, you have access only to your most recent data bookmark, and you will not be able to change unique fields after a run without rerunning your agents. To preserve your data over time, you’ll need to turn this feature on.
Here are a couple of graphics that will help you understand the new changes.
To activate this feature, go to your Harvesting Settings. Click the Behavior tab, and then click the checkbox next to “Store Item History.”
Once you’ve enabled Store Item History you’ll have a new option in your tab selector drop down—History. This new section of the web console enables you to look through all your past bookmarks. (Note that you could do this previously, but we’ve replaced the drop down selector that was available when you were on the Data tab with this new History tab.)
Back on the Data tab, you’ll also notice that by removing the bookmarks drop down we freed up space for new menu options. With item history tracking turned on, you’ll see three options here (if you don’t have history tracking on, you’ll only see the second two):
- All Items Ever Found
- Most Recent Completed Run
- Harvesting Results
Let’s start at the bottom and work our way up.
Harvesting Results
This filter lets you see your data as it comes in for the current bookmark. Once you hit Run on your agent, you’ll be taken to the Harvesting Results filter, and each time you hit refresh you’ll see your data come in line by line.
You could do this before—hitting refresh enabled you to see the data as it came in within your Data filter—but previously, you would only see cleaned up data (no duplicates, item statuses, etc.). The new Harvesting Results filter enables you to see just the raw data.
We’re speeding up the process by pulling in the raw data for the bookmark and then cleaning it up. This enables Mozenda to achieve a higher level of performance while large bookmarks of data are being collected. Watch the raw data come in on the Harvesting Results filter, but be aware that you may see duplicates and that there are no item statuses. This will be more consistent with the testing results that are seen in the Agent Builder. Once the data is in, Mozenda will apply your unique fields and clean things up—this is the “Refreshing Data” part of your run. Once everything’s cleaned up, you’ll be able to see the final data by switching to Most Recent Completed Run.
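The raw-then-clean flow described above can be illustrated with a small Python sketch. This is not Mozenda's implementation: the function name `apply_unique_fields`, the sample rows, and the choice to keep the first row seen for each key are all assumptions made for illustration.

```python
def apply_unique_fields(raw_rows, unique_fields):
    """Return rows with duplicates removed, judged by the unique fields.

    Assumption: the first row seen for each combination of unique-field
    values is the one that is kept.
    """
    if not unique_fields:
        # No unique fields configured: every raw row is kept.
        return list(raw_rows)
    seen = set()
    cleaned = []
    for row in raw_rows:
        key = tuple(row.get(field) for field in unique_fields)
        if key not in seen:
            seen.add(key)
            cleaned.append(row)
    return cleaned


# Hypothetical raw data, as it might appear in Harvesting Results.
raw = [
    {"Station": "A", "Price": 2.39},
    {"Station": "A", "Price": 2.39},  # duplicate visible in the raw view
    {"Station": "B", "Price": 2.45},
]

# The cleanup step ("Refreshing Data") applied after harvesting finishes.
cleaned = apply_unique_fields(raw, ["Station"])
print(len(raw), len(cleaned))  # prints: 3 2
```

The point of the sketch is the ordering: raw rows are collected first, and the unique-field cleanup runs as a separate pass afterward.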
Most Recent Completed Run
Just like it sounds, this is where you see your final data from your most recently completed bookmark. Duplicates are removed and item statuses are applied. We’ve also added a key icon so that you can easily see which fields, if any, are set as a unique field right from this view. (More on unique fields to come.)
Every time you complete a new run the data goes from Harvesting Results to Most Recent Completed Run. The data that was previously the most recently completed becomes a bookmark that can then be viewed by navigating to the History tab.
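As a rough mental model of this hand-off, here is a short Python sketch. The class `AgentData` and its attribute names are hypothetical, and the behavior when history is off (the older run is simply dropped) is an assumption based on the description of Store Item History earlier in this post.

```python
class AgentData:
    """Toy model of how one agent's data moves between views."""

    def __init__(self, store_item_history=False):
        self.store_item_history = store_item_history
        self.harvesting_results = []           # rows from the run in progress
        self.most_recent_completed_run = None  # rows from the last finished run
        self.history = []                      # older bookmarks (History tab)

    def complete_run(self):
        # The previous most recent run becomes a historical bookmark,
        # but only when history is being stored (assumption).
        if self.most_recent_completed_run is not None and self.store_item_history:
            self.history.append(self.most_recent_completed_run)
        # The finished harvest becomes the most recent completed run.
        self.most_recent_completed_run = self.harvesting_results
        self.harvesting_results = []


agent = AgentData(store_item_history=True)

agent.harvesting_results = [{"Price": 2.39}]
agent.complete_run()

agent.harvesting_results = [{"Price": 2.45}]
agent.complete_run()

print(agent.most_recent_completed_run)  # prints: [{'Price': 2.45}]
print(agent.history)                    # prints: [[{'Price': 2.39}]]
```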
All Items Ever Found
This filter gives you the option to see everything you’ve ever pulled in from all your bookmarks after applying the configured unique fields. Because it is an aggregation across all bookmarks, the Item Status column is not available.
We’ve also added the option to set either Most Recent Completed Run or All Items Ever Found as the default filter for your data. Any agents that are tracking history at the time of the release will be set to store item history. In addition, any agents that are not configured to delete old items will be set to the “All Items Ever Found” filter and automatically set to store item history; all other agents will be set to “Most Recent Completed Run.” You can change the default whenever you like by clicking the save icon that appears to the right when you select the non-default option.
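One way to read the migration rules in the paragraph above is as a small decision function. The Python sketch below is our interpretation only: the function name `migrate_agent` and its parameter names are hypothetical and do not correspond to actual Mozenda settings.

```python
def migrate_agent(tracks_history, deletes_old_items):
    """Return the settings an existing agent would receive at release time.

    tracks_history: the agent was tracking history before the release.
    deletes_old_items: the agent is configured to delete old items.
    """
    # History is stored if the agent already tracked history, or if it
    # is not configured to delete old items.
    store_item_history = tracks_history or not deletes_old_items
    # The default filter depends only on the delete-old-items setting.
    if not deletes_old_items:
        default_filter = "All Items Ever Found"
    else:
        default_filter = "Most Recent Completed Run"
    return {
        "store_item_history": store_item_history,
        "default_filter": default_filter,
    }


print(migrate_agent(tracks_history=False, deletes_old_items=False))
print(migrate_agent(tracks_history=False, deletes_old_items=True))
```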
The way unique fields now work is one of the best places to see how the new history tracking has real-world application for you. Previously, when you built your agents and set unique fields, any newly harvested items that did not include unique data in the unique field were identified as duplicates and immediately consolidated within the system. This caused data to be lost forever right off the bat. Now, you can change your unique fields any time your agents are not running, and all the data rows ever scraped will be used to determine the new set of unique items. That means you can go back, change your unique fields, and see that change reflected across all your bookmarks. It’s like Time Machine for a Mac, but Mozenda style. Let’s take a look at an example to see how this works.
Here we’re looking at five lines of data tracking gas prices. We pulled the data in without a unique field set to make sure we had a complete data set. We have “Store Item History” set to track information over time.
Once we have the data, we can set unique fields to view and consolidate the data in different ways in order to find actionable insights. For example, if we set City as the unique field our data is cut down to just two lines. (Notice the new key icon in the City column indicating the unique field.)
Think of this as similar to creating a spreadsheet Pivot Table to view and analyze correlations between your data columns and rows.
Before this release, when you set this field and changed your data, those other three lines of data would have been deleted and gone forever. However, with the new Store Item History architecture, we have the option to change the unique fields again, restoring the data and filtering it in a different way.
Now the unique field has been changed to Location rather than City, and you can see that the data has been restored, bringing us back to five items.
When you make these unique field changes, they’re applied across all your bookmarks. So, while in the past changing the unique fields configuration would delete information from your bookmarks, now you can make these changes to filter and view all your historical data in different ways without worrying about losing anything.
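To make the gas price example concrete, here is a Python sketch of the idea that unique fields act as a view over stored rows rather than a destructive filter. The city names, locations, and prices are invented for illustration (they are not the values from the example above), and `unique_view` is a hypothetical helper, not part of Mozenda.

```python
def unique_view(stored_rows, unique_fields):
    """Compute a de-duplicated view without modifying the stored rows."""
    if not unique_fields:
        return list(stored_rows)
    view = {}
    for row in stored_rows:
        key = tuple(row[field] for field in unique_fields)
        # Keep the first row seen for each key (an assumption).
        view.setdefault(key, row)
    return list(view.values())


# Five stored rows: two distinct cities, five distinct locations.
stored = [
    {"City": "Provo", "Location": "University Ave", "Price": 2.39},
    {"City": "Provo", "Location": "Center St", "Price": 2.41},
    {"City": "Provo", "Location": "State St", "Price": 2.37},
    {"City": "Orem", "Location": "800 North", "Price": 2.45},
    {"City": "Orem", "Location": "Geneva Rd", "Price": 2.43},
]

by_city = unique_view(stored, ["City"])
by_location = unique_view(stored, ["Location"])

print(len(by_city))      # prints: 2
print(len(by_location))  # prints: 5
print(len(stored))       # prints: 5 (the stored rows are untouched)
```

Because the view is recomputed from the full stored history each time, switching from City to Location brings back all five rows instead of only the two that survived the first setting.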
The goal of our new release is to help you prevent data loss and do more with your data while enabling you to track item history on larger data sets. With the system architectural changes that we’ve made to support these frontend features, Mozenda’s processing power has gone from hundreds of thousands of items to millions, keeping Mozenda as the most robust and powerful tool in the marketplace for data extraction.
Save Time & Money
This set of features is powerful enough that it may save you time and money. In some instances you will be able to decrease the number of times you need to run agents to get the data you need. There may be an increased storage cost depending on your agent settings (e.g., tracking history), but we don’t expect that to happen often, since most of our clients download the data after it’s harvested. Image storage has also been optimized. Please call or email us with any questions at 801-995-4550 or firstname.lastname@example.org.