Do More With Your Data With Item History Tracking (Webinar)
August 10, 2017
What You Need To Know About This Release
- Do more with your data: The main goal of this release is to help you gather more information in your collections with the flexibility to change your collection structure without losing data.
- Revamped item history tracking: We’ve restructured how Mozenda tracks item history. This new architecture will allow you to track history on more items for longer periods of time.
- Possibly one less column of data: If you’re tracking history but not deleting old items, you may no longer receive the column with item statuses.
Welcome, everybody! We are happy to have you at this webinar for our new, major release: how you can do more with your data and save more time and money doing it as well. While we are waiting for people to join, I know it has happened to me before where, you know, I get a reminder last minute and I am trying to download the app and trying to dial into the session. So while that is occurring, we will give everybody about a minute, and let me cover a few things. Number one, thank you for being a customer. We are excited to have you! A couple of things here also. First of all, where is everybody from? I would love to see where everybody is from. Please go ahead and send chats to me. I would love to see that. Hopefully, we can share that later.
Mozenda has been in development for over 10 years. Mozenda has also built the most powerful web scraping tool in the market. We also have world-class, in-house customer support. Just outside this door, we have an incredible staff, and they will answer your phone calls and your emails. We are happy to do so. This is why 27% of Fortune 500 companies and thousands of small/medium businesses trust Mozenda. Because we truly care about providing you with world-class service. It is important to note that our tool is very easy to use. For most websites and cases you don’t need to be a developer to use our tool. There are some websites that are a little harder to scrape, and that is when we are available for you to call us. We can even do a screen share and go through the process with you to make sure you get the data you are looking for.
Let’s go ahead and jump in. I think it’s been long enough to have everybody download the app and join us. This is the team that is going to be on the call today. By the way, all of these engineers have over 20 years of combined experience with our tool at Mozenda. They are extremely passionate about what they do. They know the code and all the features inside and out. Let me introduce Chris Curtis, development team lead; Joe Fullmer, senior software engineer; Nathan Barton, quality assurance engineer; and I am Mike Alvarez, director of marketing and customer success.
Let’s go through the agenda now. We’re first going to cover the why. I will cover that in just a second. Secondly, just for a couple of minutes, we’re going to cover the basics. Only a couple of minutes because there are a few of you that are new to Mozenda. Most of our attendees today are power users. The item history demo, that is what we are going to cover next. Then best practices. And the last one is going to be a question and answer section.
Let’s jump into this. The big why. We have a core belief in Mozenda that if we can make things faster, more efficient, and help you collect greater amounts of data that helps you make better business decisions, you will keep coming back for more. So this is what we have done. Engineers got together and tried to figure out what would be in the next release. And they decided we were going to add these features that are going to add greater flexibility, speed, and savings for you! Firstly, we are going to allow you to store larger amounts of historical data and secondly, allow you to view your item history without having to rerun your agents. Thus, saving you time, money, and processing credits. I don’t know about you, but I think everybody should be excited about that. Now I am going to turn the time over to Nathan, and he is going to cover the basics and the demonstration. Then Joe is going to cover best practices, and Chris will chime in whenever he is available. Go ahead Nathan.
Nathan Barton: 4:25
Great, thanks Mike! I appreciate it. We are really excited about this particular feature. Like Mike said I am going to cover some basic background then roll right into a show-and-tell demo. In order for us to really allow you to store larger amounts of history and change your configuration without rerunning your agent, we really need to understand why we would need to apply different filters to your data. We’ll start with a quick business case to help you grasp the concepts that we’re about to talk about. If I were working for a display manufacturer, for example, and I were to be approached by my boss and asked if I could prepare a data set that looks for a relationship between TV sets sold in retail and their particular availability, (whether they’re available online or in stores only, just as an example of our particular business case) my thought is, “Okay that is great. I can do that!” I can use Mozenda to gather that particular data set from the web and look for particular relationships.
Let’s go over to the web console and see the data set I have gathered here. Here I have a list of data that I have scraped from the web of retail TV sets. You can see that I have quite a few rows here. A few rows of products are actually the same. There are quite a few rows of the Samsung 64 inch 4K HDTV for example. But remember, I am specifically looking for products in relation to their availability. I want to collapse that down and only see those particular items. In order to do this, I am going to take advantage of the unique fields filter. We are going to go ahead and apply that now by going to the fields view here. I am going to start off right out of the gate by selecting product ID as the unique field, setting that, and allowing Mozenda to apply that filter based on the configurations we have set here. That is currently applying that unique field, cleaning that data. And here you see I got each of those products collapsed down into the unique identifier of product ID. We’re now only seeing one instance of that particular product, the Samsung 64 inch. That is just the basics. We’re going to jump into the concept of how Mozenda allows you to apply the unique fields filter, as well as a few other filters.
Let’s jump back to a brief graphic here. We just touched briefly on the unique fields filter. This graphic here is helping us describe the flow of data as it exists in Mozenda. There are several filters, we covered the first. Well, actually we will cover more on unique fields. If we go back to my business case we actually want to see products by availability. To start things off, we only applied a unique field to product ID. If you have used unique fields in the past, one of the things you will remember is that if I have applied a unique field and need to change that configuration to include another field as part of that unique set, then I would need to rerun the agent to gather the data again. One of the changes we have made in this update is that once you have scraped the data with a particular field you can change that unique fields configuration without having to rerun your agent. Let’s go ahead and apply both product ID and availability as our unique fields.
Go back to the data set here. Back at that Samsung 64 inch, you’ll see that there is actually one available both in-store and online. Now we have collapsed down to that combination of unique fields, which you can identify with the key icon in those field headers. That is one of the first filters that you can apply to a data set after it’s been scraped into the raw harvesting results. One of the other filters that you can apply after harvesting results is what we call “What do you want to include.” There are a couple of ways you can apply that filter. For example, if you want your agents to aggregate data for a period of time or over all your individual bookmarks, you can select a filter called “All items ever found.” This will let you view all of the unique items across all bookmarks, which configures that agent to aggregate data. You can also set it to “Most recently completed run,” which is the most recent complete bookmark for that particular agent and the one that has completed its bookends.
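The two filters described so far can be pictured with a minimal Python sketch. This is not Mozenda's implementation: the sample rows, the field names, and the `collapse` and `most_recent_run` helpers are all assumptions made up for illustration. The point it shows is that, because the raw rows are kept, the unique fields and the "what to include" choice can be changed and recomputed without scraping again.

```python
# Hypothetical raw harvesting results: every scraped row is kept, duplicates included.
RAW_ROWS = [
    {"bookmark": 1, "product_id": "S64", "name": "Samsung 64 inch 4K HDTV", "availability": "In-store"},
    {"bookmark": 1, "product_id": "S64", "name": "Samsung 64 inch 4K HDTV", "availability": "Online"},
    {"bookmark": 2, "product_id": "S64", "name": "Samsung 64 inch 4K HDTV", "availability": "Online"},
    {"bookmark": 2, "product_id": "SNY", "name": "Sony TV", "availability": "Online"},
]


def collapse(rows, unique_fields):
    """Keep one row per combination of unique field values (the latest row wins)."""
    collapsed = {}
    for row in rows:
        key = tuple(row[field] for field in unique_fields)
        collapsed[key] = row
    return list(collapsed.values())


def most_recent_run(rows):
    """Keep only the rows that belong to the highest bookmark number."""
    latest = max(row["bookmark"] for row in rows)
    return [row for row in rows if row["bookmark"] == latest]


# "All items ever found", unique on product ID only.
by_product = collapse(RAW_ROWS, ["product_id"])

# Same raw rows, new unique fields configuration, no rescrape needed.
by_product_and_availability = collapse(RAW_ROWS, ["product_id", "availability"])

# "Most recently completed run" with the same unique fields.
latest_only = collapse(most_recent_run(RAW_ROWS), ["product_id", "availability"])

print(len(by_product), len(by_product_and_availability), len(latest_only))
```

In this sketch the Samsung collapses to one row when only the product ID is unique, and splits into an in-store row and an online row once availability joins the unique set.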
In this particular case, this agent is currently set to “All items ever found.” If I were to change that to “Most recently completed run,” you will see the unique fields filter has been applied and you will see those collapsed down. You will also see the item status column. In this case all six of these items have not been changed since the previous bookmark. It’s important to keep in mind that the item status shown in this particular view is in relation to the bookmark previous to the most recently completed run. This is an example of another filter that you can apply to your raw harvesting results to be able to consume the data. Now, this “What to include” filter plays a role in the type of data that is sent to your combined collections if this agent sources a combined collection. It also plays a role in what data is sourced by a data item list or what is published and exported.
In this particular case, this agent is defaulted to “All items ever found.” I know that because when I select the most recently completed run, I see the save icon to the right of the drop down here. If I were to set this agent to default to the most recently completed run, I would simply select this icon, save this, review the information in this dialog, say ok, and now this agent is set to default to the most recently completed run. But don’t worry, you can always go back to view the “All items ever found.” Let’s go ahead and do that now and resave that configuration. We want this agent to aggregate data.
We’re doing this to show you that (again, the same concept I was talking about before) if you need to change the configuration for this particular agent, you can do so without needing to rerun the agent. The way we can do this is by setting your harvesting settings to “Store item history.” Let’s crack open the harvesting settings for this particular agent. In the behavior tab of the harvesting settings you may remember that we used to have “Track item history” and “Delete old items on successful completion” as settings. We have consolidated both of those settings into one setting, “Store item history.” With this setting enabled, we will store all the data that you have scraped for this particular agent, so that if you need to change your configuration of the unique fields and “What to include” filters, you can do so without having to rerun the agent. So we definitely want this agent to store item history.
Now I talked about how the “What to include” filter had an effect on both publishing and exporting. We made a few changes to those settings that we want to go through. The publishing dialog, if we open that up. If I want to set a publisher to email me with the data, I could also include the “All items ever found” or the “Most recently completed run.” Now this will default to whichever of those settings I have set the agent to persist. But don’t worry, you can choose either of those options for publishing. Let’s select the most recently completed run and save that publishing configuration. Once it has been saved, that configuration will store as a setting for your publisher.
Let’s take a look at the same thing in the export dialog. You can see there is the “What to include in the file” filter here. I have the same options, as well as the option to export harvesting results. That is one filter that we have not talked about yet. It’s a little bit different than the first two, so let’s go back to the data view, and change our view to harvesting results. It takes a little bit of time for these filters to be applied. The system, once you have chosen your configuration, will need to process through those and need to apply the unique fields and “What to include” filters. We know you want to see your data as the agent’s running, so we have included this option for you to be able to (as your agent is running in real time) refresh this view, go to the harvesting results and see the raw data coming in as it’s being scraped. Now a couple of things we would like to make known here. The harvesting results, this is raw data. The unique fields filter and the “What to include” filter are not applied here in this particular case. You can see that we have our duplicate items in our list. As the agent runs, it will gather this raw data. And eventually it will finish scraping and will refresh the data. [Refreshing Data…] You may have seen this in the status of your agents up at the top of the dashboard. The “Refreshing Data” step will take the raw harvesting results you can see in this particular view, clean them, add those filters and move them to the most recently completed run, as well as update the “All items ever found.” At that point, once the refreshing data status is completed, you will be able to see those raw results in those particular reports.
We’ve talked a little bit about those reports, let’s finish the life cycle of data here for Mozenda. One that you may be familiar with that we’ve had is item status filters. We wanted to touch on this here real quick. We’ve noticed the item status in the most recently completed run. You may not see that anymore in the view “All items ever found” because that is an aggregation of data. But we still allow you to have the ability to change those filters to be able to see which items have been added, changed, unchanged or even deleted when you’re storing this item history. We also still have the ability to iterate through different bookmarks to see the data for that particular snapshot for that particular point in time. We have broken that up into its own view. Let’s go to the history view. Here we can select a bookmark from this drop-down to view. Let’s review the most recent one. We currently have six items that have been unchanged. Let’s select the deleted filter and make sure we include that in here. Because we have one item that was removed from the website that was not scraped on this more recently completed bookmark. Let’s go down to the previous bookmark and see what those item statuses are. Here we can see a few of them that have changed with the majority of them unchanged. It’s important to keep in mind the item status is in relation to the previous bookmark. We can change this filter. This is just another filter we allow you to apply to be able to view and consume your data.
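As a rough illustration of how an item status relates to the previous bookmark, here is a small Python sketch. It is an assumption about the general idea, not Mozenda's code: the `item_statuses` function and the two sample bookmarks are hypothetical, and only the four status names (added, changed, unchanged, deleted) come from the demo.

```python
def item_statuses(previous, current):
    """Compare two bookmark snapshots keyed by their unique field value.

    Each snapshot maps a key (for example a product ID) to that item's data.
    The status of every item is relative to the previous snapshot only.
    """
    statuses = {}
    for key, item in current.items():
        if key not in previous:
            statuses[key] = "added"
        elif previous[key] != item:
            statuses[key] = "changed"
        else:
            statuses[key] = "unchanged"
    # Items seen last time but missing now were removed from the website.
    for key in previous:
        if key not in current:
            statuses[key] = "deleted"
    return statuses


# Hypothetical snapshots for two consecutive bookmarks.
bookmark_1 = {"A": {"price": 900}, "B": {"price": 500}, "C": {"price": 300}}
bookmark_2 = {"A": {"price": 900}, "B": {"price": 450}, "D": {"price": 700}}

statuses = item_statuses(bookmark_1, bookmark_2)
print(statuses)
```

Here item C only shows up when the deleted status is included, which mirrors selecting the deleted filter in the history view.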
One last filter we would like to talk about in the life cycle of data is our custom views. You may be familiar with custom views that you can create to apply criteria and sorting, and these apply on top of all the previous filters that we have talked about. Let’s really quickly finish off by creating a custom view. Let’s create a view that includes only items that we can purchase in bulk, so those that have a quantity greater than one. I am going to name this view, select the column headings that I want to include, and set a criterion (in this case in the quantity field I am going to set the criterion to be not equal to one). Let’s save that particular view. You’ll notice here that this filtered the data down even further to this single item. That is all of the filters that we allow you to configure for your agent. Once you have the raw harvesting results and store item history, you can then pick and choose your particular configuration and change that as you need so that you can consume it. That is all we had for the demo. Thanks, Mike!
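The custom view from the demo can be sketched in a few lines of Python. This is only an illustration under assumptions: `apply_view`, its parameters, and the sample rows are hypothetical names and data, and the rows stand in for data that has already passed through the earlier filters.

```python
def apply_view(rows, columns, criteria, sort_by=None):
    """Filter rows with a criteria function, optionally sort, then keep chosen columns."""
    matching = [row for row in rows if criteria(row)]
    if sort_by is not None:
        matching.sort(key=lambda row: row[sort_by])
    return [{column: row[column] for column in columns} for row in matching]


# Hypothetical rows that the unique fields and "what to include" filters already produced.
rows = [
    {"product_id": "S64", "availability": "Online", "quantity": 1},
    {"product_id": "S64", "availability": "In-store", "quantity": 1},
    {"product_id": "T55", "availability": "Online", "quantity": 4},
]

# The "bulk" view: quantity not equal to one, showing two columns.
bulk_view = apply_view(rows, ["product_id", "quantity"], lambda row: row["quantity"] != 1)
print(bulk_view)
```

Because the view works on the already filtered rows, changing the unique fields upstream changes what the view sees without redefining the view itself.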
Mike Alvarez: 21:22
Hey, this is fantastic! As you were going through the demo, I couldn’t help but think of those of you who are used to using Macs. That bookmarks feature is similar to Time Machine on the Mac: you go back in time and access previous data. That is a great feature. Joe, would you mind covering what you believe are some of the best practices to set this up and also other things that our clients may want to consider while using this?
Joe Fullmer: 21:59
Certainly, yeah, I would like to speak to that. As a developer, one of the most exciting things about this for me was to add this ability to change your uniques and then not have to go rerun your agents. As we saw previously, Nathan in his demo had the unique on product ID and so all of the items with that same product ID all collapsed down to a single one. But then we realized, wait! We wanted to have that same product ID distinguished by whether it’s available or not. We were able to apply that additional unique field and after the system does its recalculating, we have new results. Depending on how many millions of rows you have, it may take a little bit of time but will certainly be a lot less time than having to go and re-scrape all of that data. Previously the system would have already collapsed and burned data in that deduping process. Whereas now, once you have chosen your configuration, yes, we burn it in but we are able to go back to those original scraped items and change our configuration and burn it in with different uniques. So now you not only save processing credits from having to rescrape, but you also save time. And I guess the other benefit would be, let’s say the product is no longer available on the website. So when you go to rescrape, or rerun your agents, it doesn’t even pick it up. I think that is the number 1 feature here, the ability to have all of that history at your beck and call. You can change all of your uniques and look at your data in different ways, use different combinations of uniques, and learn different things from that.
Another feature that is not a main focus of this webinar is the improvement of image storage. Say you have an agent that scrapes images or downloads files and, for example, 100 of the items that were scraped all use the same image. Let’s say this Sony TV we’re looking at here. It was scraped from several different categories. Well, previously we would have saved that image as many times as we have scraped the item. Now we will save that image just once and save on some storage costs. Those are the big exciting benefits for me, keeping the history and being able to change your uniques on the fly, and being able to look at your different views, whether aggregated, the most recently completed run, or everything unfiltered as it’s coming in.
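The webinar does not say how the image storage improvement works internally. One common way to store an identical file only once is to key it by a hash of its contents, and the Python sketch below shows that idea. It is an assumption for illustration only: the `ImageStore` class, its methods, and the item IDs are hypothetical and do not describe Mozenda's actual storage.

```python
import hashlib


class ImageStore:
    """Toy content-addressed store: identical image bytes are kept only once."""

    def __init__(self):
        self._blobs = {}  # content digest -> image bytes (one copy per distinct image)
        self._refs = {}   # item id -> content digest

    def save(self, item_id, image_bytes):
        digest = hashlib.sha256(image_bytes).hexdigest()
        # Only store the bytes if this exact content has not been seen before.
        self._blobs.setdefault(digest, image_bytes)
        self._refs[item_id] = digest
        return digest

    def load(self, item_id):
        return self._blobs[self._refs[item_id]]

    def stored_copies(self):
        return len(self._blobs)


store = ImageStore()
sony_image = b"placeholder bytes standing in for the Sony TV image"

# The same image scraped under three hypothetical categories.
for item_id in ["smart-tvs/sony", "4k-tvs/sony", "52-inch/sony"]:
    store.save(item_id, sony_image)

print(store.stored_copies())
```

Each item keeps its own reference, so every scraped row can still show its image even though the bytes are stored a single time.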
Mike Alvarez: 25:22
So Joe, I guess bottom line is that duplicates we used to discard before, now we realized that they are just an absolute gold mine of information, right?
Joe Fullmer: 25:36
Yes, yes. And what we learned is that customers often look at their data and realize, “Oh, you know what, I actually needed to be more granular with my duplicates. I do want to see this same Sony TV more than once if it’s in different categories, for example, or if they have a different price.” So customers don’t want to have to go and rerun the agent to get that. That’s great!
You also asked if I would speak to best practices. One of the things that I think is very important when you design your agent, or first set it up, is including data or information that may not seem important right now but will be useful down the road. Nate set this up and scraped “available.” We initially did not have that in our unique fields but he saw that it was an available piece of information to scrape so he grabbed it. If he hadn’t done that and then we realized later, you know, the boss comes and says “I need to know whether they are available or not,” we would have had to go and rerun the agent. Obviously we cannot work miracles and have data that wasn’t scraped to begin with. That is one of my key points. Grab all the data you can as you scrape each item, as you set up your agent to capture these items. If there is a piece of information that is easy to grab, grab it!
Mike Alvarez: 27:20
That is an excellent point because a lot of times some of the information on that page is missed because it’s not asked of us as we were trying to collect this data. But if we can, we might as well grab more data. If we were to go back in time and plot all of this out and trend it over time you would be able to see some incredible information and be able to make business decisions. That’s great Joe, thanks so much for that!
Let’s go back to the PowerPoint and take over on this end. So just a quick summary. What you see in red is basically what is new. After selecting unique fields we allow you to do more custom filtering and sorting to be able to use the information in a consumable way. This provides a visual that makes it really easy to understand what happened before and what happens now. So before you would actually scrape all this data and sort it for example in a category called smart TVs. Now as you’re scraping data and you’re not deleting the duplicates, if you take a look at the new step 2, you are able to now see the data based off the categories smart TVs, 4K TVs or 52 inches. Previously you would have to run 3 agents to get that information. With the new feature you can run the agent once.
So let’s go ahead and jump into questions and answers and we will wrap this up. We do have a good question here. What if I set the product ID as a unique field in this data? Can my agent still run and scrape my other data?
Nathan Barton: 29:40
That’s an excellent question! I can talk about that. So one of the things that we want to talk about briefly is about setting up your agent. It’s important to make sure that if you would like to apply all of these filters that you have the data available in your raw harvesting results to apply these filters. Now what that means is that you may only be required to scrape certain amounts of data but you might need to be forward thinking about what fields you may want to include later down the road in case you need to change that configuration. Back to our business case, if we had never scraped quantity for example, if that wasn’t one of our fields, then we would not be able to apply those filters after the fact. So even if we don’t need that right away it might be beneficial for us to set up the agent to include that field. So step one really is to understand which fields are needed, which fields might be needed in the future, and which fields you might want to include even though that might be extra data you do not need right now. That is step one. Step two is in the case of the unique fields. If you would like to apply that unique fields filter, that happens after the scraping of the raw data. So in this particular case, if the product ID were to not exist on the webpage, then that unique fields filter will be applied at the end, identifying that particular line item as null or empty. The agent will still scrape the data as you’ve created that agent definition, but the unique fields filter will apply afterwards on top of that raw result. Hopefully that answers your question. Thanks Mike.
Mike Alvarez: 31:50
Great, there is another quick question here. Approximately what is the amount of time for an agent’s data to refresh?
Nathan Barton: 31:59
That is an excellent question! A lot of that really depends on the size of the raw harvesting results: how many rows of data, how many columns exist in that particular collection. Because it varies so much I don’t know if I could give an exact time because each collection is different. But like I have mentioned before, one of the main focuses of this development process is to allow you to see your raw harvesting results as they’re being scraped so that you can make sure that your agent is gathering the data that you want. There may be a period of collection maintenance if it’s a large collection. The refreshing data may take some time if it’s a large collection, or it may be fairly brief. I’m sorry I can’t give an exact time, it really does depend. We recommend knowing your collection and knowing your agent and the time associated with each of those particular actions.
Mike Alvarez: 33:11
Thank you Nathan! Thank you, everybody, for being on the phone call. Thank you for being a customer of Mozenda’s! We are excited to hear your feedback on everything to do with this release. If you find any gold nuggets of data we would love to hear it! We would love to hear your testimonials as well.
Again, my name is Mike Alvarez. Feel free to ask any of the support people how to get ahold of me if you would like to share a testimonial as well. Thanks again, we look forward to seeing you at the next webinar.