New Release Webinar Video: Scrape Web Data 5X Faster
October 16, 2017
Included in this video
- The Job Sequencer: Learn how to make your agents process and scrape more efficiently and effectively with the Job Sequencer.
- Request Blocker: Learn how to scrape web data up to 5X faster with the Request Blocker.
- New Publishing Features: See how you can now publish to Google Drive as well as to Excel.
Welcome everyone. My name is Mike Alvarez, Marketing Director here at Mozenda. We are breaking records again in attendance for our Webinars and that is very exciting!
Let’s quickly test the chat feature. Please send me a quick chat here and let me know what country you are connecting from and the company you represent.
We have a core belief at Mozenda: If we can make things faster, more efficient, and better help you collect greater amounts of data that helps you to make better business decisions, you will keep coming back for more. This drives our decision making on which features to develop.
Before we get started I just want to remind you of something I added to our newsletter about a special announcement we’ll have at the end of the webinar regarding a year-end special bonus on processing credits. You won’t want to miss this announcement, so I’d stay until the end of the presentation to get all the details on this if I were you.
Today we have so much material to cover, on incredibly powerful features, so we are going to jump right in.
Let me introduce our presenters today. Together they have more than 12 years of experience using and engineering at Mozenda.
Chris Curtis is our Systems Architect and has a wealth of knowledge and information about these tools, because he is the one who developed the core components.
Kenny Nielsen is an Account Manager in our Professional Services Department. He is a powerful user of these features and has been using them for years now to service the data needs of our high-volume Fortune 500 clients.
On the agenda today:
• Job Sequencer – How to make your life easier managing our tools. This is only available for Enterprise clients.
• Request Blocker – How to extract web data up to 5X faster.
• Publishing Updates
• Details on the year-end promotion
• Q&A – Through Chat – feel free to ask any questions you have during the webinar and we’ll pick several to answer at the end.
3:13 Kenny Nielsen
Thanks Mike. We are happy to be here and we are excited for all those that have joined the call. Like Mike said, my name is Kenny Nielsen and I have been using Mozenda for close to 4 years now and currently work in our Professional Services department as an Account Manager. I have been doing that for about 2 years. Our Professional Services Team has worked closely with Engineering on these features. We have used them on existing projects. Some of these projects take days if not weeks to run, and with these new features we have been able to cut some projects that collect hundreds of thousands, even millions, of rows of data down to hours and days. The results are impressive and we are excited that these features are now available to our customers.
I am going to be talking specifically about the sequencer. Before we jump into the sequencer, and talk about that tool, it’s important to understand that this is going to require a little bit of a shift in how you build agents and how you understand how your agents work. What I have done here is pulled up a traditional agent that hopefully most of our users are used to seeing. This is a common retail site, and the goal here is to capture all computer data. A typical Mozenda user would have a data list that they are using to search, say for example product ids or product names. I didn’t have that in this case so when I built this agent I just simply scraped the different categories, then you click through each of those, and then you click through all the products and their URLs. Then you get to the landing page for each product where you scrape all the data. Hopefully that makes sense to all of our users, as this is a pretty traditional way to build an agent. Think about each of these agent pages as an individual step. Now, you are going to obviously go through every single category. In this case, there are 6 categories. Then you’re going to iterate through the 100’s of products and you’re going to click on each one of those URLs. Then you’re going to scrape each of those individual product URLs for all of the details.
Now, we have introduced a new paradigm here with the sequencer where you can break this agent up into multiple agents and leverage the data collected from each agent, and put that into a step by step process. There’s a lot of unnecessary time that is spent in the clicking through each of the different products and the different categories one by one in a very linear fashion. It is really important to understand that as we introduce this new sequencer feature, we can break these agents up, collect URLs, and build data lists. We can then get into how the sequencer is very powerful for using that sort of agent paradigm when you are running your agent.
I’m going to switch over to the web console now. For our enterprise customers you will now see that up here you have a tab that’s called sequences. Now I have already built a couple of sequences and for the example agents that I just showed you, I have already built out the 3 agents that are needed to collect all of those different pieces of data. Let’s go into an existing sequence and see what it looks like when it is already built. From the sequencer’s list page, I am going to click on “computers.” Here is where you see the individual steps. Currently in the sequencer we can do 3 different things in regard to steps. We can run the agents, we can clear the collections, and we can publish the data. All of these features are currently available for all of our users when you build a traditional agent, but here is where we introduce the ability to do this on multiple levels. Meaning, you can clear multiple collections, you can run multiple agents, and you can publish in multiple publishing types. You have never been able to do that unless you were using the automation, API tools, things of that sort. Now you can do it just through the sequencer tool.
For example, to walk you through this, let’s say that with that agent we don’t really care so much about historical data. The thing that we want to do is clear all 3 agents. We will clear the category URLs, the product details, and the product URLs. The next step is that you want to run the first page of the agent, which is now its own individual agent, which is the category URLs. Then you run the Product URLs, which then feeds into the Product Details, and you scrape these one by one. Once you have finished the product details agent, you then publish. Now one thing to notice here is that you can set multiple publishers and publish to say an email and a FTP.
So that’s a sequence, hopefully that gets everyone kind of excited to see what you can do here at this point.
8:50- Kenny Nielsen
Not only that, there are a couple things that we can do with this existing sequence that we can show you. For example, we can click on all of the clear collections steps which do not need to run step by step, that can all happen at the same time. So, we click on the more button and we say run the selected 3 steps concurrently. We will group those and it will run each of those steps at the exact same time. Now our run agent step, those need to run in a sequential process. We don’t want to group those. But our publishers, why don’t we go ahead and group those. Once again, click on more. Run the selected two steps concurrently. So, it groups those and will publish those at the exact same time. Once again, that is a feature that you have never been able to do using Mozenda. Hopefully that makes sense.
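The grouping described above (groups run one after another, steps inside a group run at the same time) can be sketched in Python as follows. This is only an illustration of the idea, not Mozenda's implementation or API: `run_sequence`, `make_step`, and the step names are hypothetical, and each step here simply returns its own name instead of doing real work.

```python
from concurrent.futures import ThreadPoolExecutor


def run_sequence(groups):
    """Run groups in order; steps inside one group run concurrently.

    Each group is a list of zero-argument callables (hypothetical steps).
    Returns one list of results per group, in group order.
    """
    log = []
    for group in groups:
        # The next group starts only after every step in this group is done.
        with ThreadPoolExecutor(max_workers=len(group)) as pool:
            # pool.map returns results in the same order as `group`.
            log.append(list(pool.map(lambda step: step(), group)))
    return log


def make_step(name):
    """Stand-in for a real step: it just reports its own name."""
    return lambda: name


# Assumed layout, mirroring the "computers" sequence from the demo.
sequence = [
    [make_step("clear category URLs"),
     make_step("clear product URLs"),
     make_step("clear product details")],          # grouped: run together
    [make_step("run category URLs")],              # run agent steps stay sequential
    [make_step("run product URLs")],
    [make_step("run product details")],
    [make_step("publish email"), make_step("publish FTP")],  # grouped
]

result = run_sequence(sequence)
for group_result in result:
    print(group_result)
```

The run agent steps are kept in separate single-step groups because each one depends on the data collected by the one before it.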
Now, let’s go and design a sequence so that you can get an idea of what it’s like to create this. Let’s click on the sequences tab, and click new sequence. We will go ahead and name this “computers new.” Create. This is where you will start to design your sequence. You will want to start by clicking add a step. Now remember, we are going to recreate the same sequence, so we want to add our clear collection steps. The first thing that we do is click on clear collection, find our agent, and select “computer category URLs”. You are presented with a dialog box that you are used to seeing when clearing collections. Let’s click items. We will repeat this process for the “category URLs” and the “product URLs,” select items again. Add another step, clear collection, and “product details.” Choose items. Now we have set up all of our steps that will clear all of our data.
Now, our next steps are to add our run agent steps. Click on run agents, and we will want to add our “category URLs” first. We will select that. Here is where we really get into the core power of the sequencer. If you read this note, you will see that the selected agent does not contain any data lists. To gather data from the agent in less time, configure the agent to use the data list. So now you remember, when we talked about that original agent, I didn’t have a data list to start from. Some customers might start by searching product numbers or keywords, and they can feed that data list into that agent. I have simply scraped the category URLs. So, I don’t have a data list. We will go ahead and just click save. Let’s set up our next step. We are going to run the product URLs. Watch what happens when I click select here. This presents you with this dialog box. This agent is recognizing that it is using a data list. This agent is using the “category URLs” data list to process those different category URLs inside of this agent. It’s asking you, do you want to process that data list in one job? Which you are used to doing currently. Or, we can divide the category URLs agent into multiple jobs. There were 6 categories and for each of those categories, the agent is going to do the exact same process. It’s going to load that category URL, it is going to scrape those sub categories, and then it’s going to complete. Because that process is the exact same over all 6 of those different inputs, we can run those in 6 concurrent jobs at the exact same time. So, you choose 6. Click save. Let’s add our last run agent step. This is the “product details,” the final agent in this sequence, that collects all of the product information that is passed along through. Click select. Once again, this is using the subcategory data list, and that collects hundreds of product URLs so let’s just run that in 15 concurrent jobs and click save.
Now, like I said, that is the core power of the sequencer, and to give you a better visual of how you can understand this and the time it can save you, think of it as if you are painting a house. Say in this house, you have five rooms. Mozenda presented you with a tool or a machine that can paint all of those rooms for you. Now, that’s great, you don’t have to spend any time. However, that machine has to go through each room individually, one by one, to paint the entire house. Picture it like this now, so the blue paint brush represents the way that you are scraping data today. For example, let’s say you have a data list of 20 different inputs that you want to search across a site. Well you would have to go through those inputs one by one, just like the paintbrush is going one by one through each room. Now, when you create concurrent jobs, using your data list, Mozenda will break those up into different ranges and run them concurrently. For example, this paintbrush is doing one through 20 jobs. Using the sequencer, and running 4 concurrent jobs, you split that data list up into ranges 1-5, 6-10, etc. They all run concurrently, they all run at the same time, and they all reduce the amount of time that it takes to scrape, which increases the speed.
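The range splitting in the painting example can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Mozenda's actual algorithm: `split_into_ranges` is a hypothetical helper, and the rule for spreading leftover items over the first jobs is an assumption made so that uneven lists still split cleanly.

```python
def split_into_ranges(item_count, job_count):
    """Split items 1..item_count into contiguous 1-based inclusive ranges.

    Hypothetical helper: returns one (start, end) tuple per job. If the
    items do not divide evenly, the first jobs each take one extra item.
    """
    if item_count < 0 or job_count < 1:
        raise ValueError("item_count must be >= 0 and job_count >= 1")
    if item_count == 0:
        return []
    job_count = min(job_count, item_count)  # never create empty jobs
    base, extra = divmod(item_count, job_count)
    ranges = []
    start = 1
    for job in range(job_count):
        size = base + (1 if job < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


# The example from the talk: 20 inputs across 4 concurrent jobs.
print(split_into_ranges(20, 4))  # [(1, 5), (6, 10), (11, 15), (16, 20)]
```

With 4 jobs each range holds 5 inputs, so if every input takes about the same time, the whole list finishes in roughly a quarter of the time of a single job.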
Let’s go back to the web console. So, that’s where you are really going to get your gains in speed: in those run agent events, as long as you’re using your data list and splitting the jobs up and using concurrent jobs. Let’s go ahead and add our last step to publish the data. We are going to choose our product details agent, because that contains all of the data. We will select and then you have all of your traditional publishing types that you can select from. You will notice here that we have a new feature: we can now publish to Google Drive. So, you can link your Google Drive and publish directly to there. Another thing to point out here too is that we offer a new file format as well: we can now publish directly into Excel in XLSX format. So those are two other new features that we are also excited to be rolling out with this last release. Let’s go ahead and set an email publish event, and let’s go ahead and set another publish data collection to FTP, to Services and save. Then we can go through like I showed you and group these steps, and run them concurrently. This is essentially how you set up a sequence.
One thing I want to point out also is a couple of different functionalities in a sequence. I’m going to go to a different sequence so that you can see some of these. For example, I started this agent this morning. Essentially it is using a data list to search zip codes and gather different restaurant data. I’ll go ahead and refresh the page here. This is kind of a snapshot of time, as this sequence is running. You can see these 3 basic steps, we are clearing the data, and then we are running the restaurants agent, and then we are publishing it. Now, what you will notice here is that in the progress bar you will see that right now the run agent step has completed roughly 1,400 of about 2,000 items that are in that data list. So, we are searching about 2,000 zip codes. We are using concurrent jobs, you will notice that in this run agent step I am using 15 concurrent jobs. This is running much faster than if you were to be searching these zip codes one by one. Within a couple of hours, you have scraped almost half of this entire data list. A couple of other things to point out here too, just for functionality, this is the button where you are going to resume and run your sequences. From the sequencer icon you can have different settings, naming, you can go to the scheduler. You can schedule these sequences to run on the normal intervals that we offer.
If you come out to the main sequences list page, you can once again click on this sequencer icon, and you can have global notifications set. If the sequence advances to the next step, or if it completes successfully, it will notify you once that happens. So hopefully that sums up everything with the sequencer. We realize that there might be some questions. We are actually going to be doing further webinars on how to best build agents to really use and leverage the sequencer. Like I said, we have had dramatic increases in regards to speed. A lot of our projects were able to get out of the door much faster, giving us more time to evaluate the process and the data. We are really excited for these features, and hope that you will use them and find the same amount of satisfaction that we have.
Now, if that’s not fast enough for you, we have actually developed another feature which Chris is going to talk about that will further help increase the speed when it comes to gathering your data. So, I will now turn the time over to Chris.
Thanks Kenny. I agree that the sequencer technology is going to dramatically change how, hopefully, most of our users use our system. It eliminates a lot of the micromanagement that needs to happen, and allows you to introduce some level of concurrency into your processing. But as Kenny said, there is another part of this demo that we wanted to go through and discuss. That is another feature that is called request blocker. Now, in our communications leading up to this webinar you very well may have seen the tag-line saying something like “Mozenda will be 5 times faster.” Well, how is that possible? How can I take my existing agents, that have been running for maybe months or years, and make them run faster? Well, Kenny showed you one way to do that, and that is by adding some concurrency into your process so that multiple jobs can be processing parts of that process at the same time. But another fundamental way to improve the speed at which your agents process is by using this new feature called request blocking.
So, to do that I am going to move over into the agent builder, because that is where you can configure request blocking. Now, as kind of a best practice, the best place to start when using this new feature is to start from working agents. You are going to want to have an existing working agent, or to build a new agent that is working before you try to implement any of the request blocking functionality. At this point what you are seeing here on the screen is a very simple agent that goes to a common retail site, and it captures a list of product names. I have tested this agent and it is performing as I would like and gathering the data that I desire. So, at this point I am prepared to use this request blocking technology and feature to further enhance this agent to process more quickly.
The first step in using the request blocking feature is by making sure that the navigation requests window is available in your agent builder. That can be done by coming to the file menu, clicking on settings, and making sure that the navigation requests box is checked, and then clicking save. At that point, the navigation requests window will show up here at the bottom of your agent builder, and I can switch to that. We can start to see different requests that were made to load this particular agent. In this case, there were over 200 requests that were made to load this page. What happens is that when a web page loads, the website, Best Buy, will tell the browser, or Mozenda in our case, which other requests need to be made for the website to load. Now, generally, a lot of the requests that the website will say that the browser needs to make or Mozenda needs to make, are unnecessary for the content of the page to remain the same. The goal with using the request blocking feature, is to teach Mozenda which requests do not need to be fulfilled for the website to function properly and to continue to gather the data that we need. By telling Mozenda not to process or execute certain requests, it allows Mozenda to execute much more quickly, because it is not waiting for third party requests to come through.
Look to this list here, I am going to sort this by root domain. You are going to notice as we scroll up through this list that there are a lot of different domains that are being requested by this Best Buy web page. Now, as I mentioned just a second ago, a good portion of these are not needed for the website content to remain there so that the agent can gather the data that it is interested in gathering. My job here is to start teaching Mozenda which of those requests are needed, and which of them need to be blocked to further improve the performance of this agent. Now, this is going to take some time. It’s a process where you will start and you will make some changes, and then you will make sure that your agent continues to work, and you’ll further refine it as you want to optimize that agent and improve the performance as much as you can.
I will show you at the basic level how you would go about starting to do this. I am going to go in and look through my list of domains. I’m going to say, you know what, I am doubtful that this “247-inc.net” request, or the requests that are made to this domain are actually important for the website to perform properly. As you scroll down, you will see a lot of Ad traffic, like “double click.” You will see some social media traffic like “Facebook,” and these are all requests which are unnecessary for this website to function properly. So, I am going to show you an example of how you can teach Mozenda to block those requests. I have selected a few of those requests here. I am going to click on the block button here, and I am going to choose to block by the root domain of these requests that I have selected. That will open up the request blocking expressions window, and it has pre-populated it with 3 expressions based on the domains of the request that I selected. I am free to add, remove, and change these expressions.
Essentially, what is being defined here is a list of expressions that will be executed against every request that the website is trying to make to see if Mozenda should allow that request to actually be executed. In this case, if it matches these domains, Mozenda will choose to block those requests. Which will further improve the performance of this agent. After I have set up my expressions, I am going to select save and reload. That will use those expressions to block requests that the website is trying to make. You will see some flashes of red as they come up here. What that is indicating is that there are specific requests that have been blocked by Mozenda, as this web page was loaded, as a result of these request blocking expressions.
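The matching step described above, where every outgoing request is checked against a list of expressions, can be sketched in Python. This is a minimal sketch, not Mozenda's expression syntax or engine: it assumes each expression is a plain root domain, `is_blocked` is a hypothetical helper, the domain spellings are inferred from the names mentioned in the demo, and the URLs are made up.

```python
from urllib.parse import urlsplit

# Assumed expressions: root domains like the ones selected in the demo.
BLOCKED_DOMAINS = ["247-inc.net", "doubleclick.net", "facebook.com"]


def is_blocked(url, blocked_domains):
    """Return True if the URL's host is a blocked domain or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in blocked_domains)


# Hypothetical requests a page might make while loading.
requests_seen = [
    "https://www.bestbuy.com/site/computers",
    "https://ad.doubleclick.net/pixel.gif",
    "https://connect.facebook.com/sdk.js",
]
for url in requests_seen:
    print("BLOCK" if is_blocked(url, BLOCKED_DOMAINS) else "allow", url)
```

Matching on `"." + d` rather than a bare suffix keeps an unrelated host such as `notfacebook.com` from being blocked by accident.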
Now, again this is a process, so what you would want to do is continue to block more and more of these requests, and set up and refine these expressions, but always ensuring that you are testing your agent to make sure that it is continuing to gather the data that you are interested in. In the interest of time, I am going to switch over here to a video that our quality assurance team has created that shows a side by side comparison of two agents that are identical except for the fact that in the bottom one, the QA team has taken the time to test the request blocking feature and has set up many expressions that really reduce the number of requests that this website needs to make to execute. What I am going to do here now, is show you a time lapse of how this can affect the testing and performance of your agent. I am going to click play here, you are going to see on the top the traditional agent that doesn’t have any of the request blocking expressions in it, and the bottom which is what has our refinements. Now, both of these agents again are capturing the exact same data. There are 301 items of data that it is collecting. However, you will notice that they are running at dramatically different speeds. The bottom one, while using the request blocking technology, was able to gather all 301 items in 2 minutes and 11 seconds. Whereas the top one, without this request blocking feature enabled, took over 11 minutes for that same data to be gathered. Over a 5 times performance improvement. So, I hope that that is kind of an insight into what this request blocking feature can do.
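The "over 5 times" figure follows directly from the two timings quoted in the comparison. A quick check in Python, treating "over 11 minutes" as exactly 11 minutes (an assumption that makes the result a lower bound); the variable names are illustrative only:

```python
# Timings quoted in the side-by-side comparison.
baseline_seconds = 11 * 60     # "over 11 minutes", so this is a lower bound
blocked_seconds = 2 * 60 + 11  # 2 minutes 11 seconds with request blocking
items = 301

speedup = baseline_seconds / blocked_seconds
items_per_minute_blocked = items / (blocked_seconds / 60)
items_per_minute_baseline = items / (baseline_seconds / 60)

print(f"speedup: at least {speedup:.2f}x")
print(f"with blocking: {items_per_minute_blocked:.0f} items/min")
print(f"without blocking: at most {items_per_minute_baseline:.0f} items/min")
```

Since the unblocked run took more than 11 minutes, the real ratio is slightly higher than the value computed here.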
Now, we are very aware of the fact that using the request blocking feature is something that is going to take some time for our customers to go and use or implement into their existing agents. We know that it might require some help from our part to help our customers integrate this feature into their existing processes. Our training and support department have been fully trained on these features, and I would encourage each of you that have an interest in using the request blocking feature or the job sequencer feature, to reach out to your account manager, and set up a training. We want to help you see the vision inside of your own projects, and see how these features can improve your data collection processes.
I think that concludes the product demo portion of our webinar, so at this point I will turn the time back over to Mike to take it away.
28:34- Mike Alvarez
Perfect. Before I jump in and cover the year-end bonus that we have, someone sent us in a really nice compliment. They said, “Guys, I just want to compliment you on the release. These features are amazing, and just what we have been needing as our agents get more complex, and sites we are scraping get deeper.” This really is a game changer as our CEO Brett Haskins said. There is no one else in the industry that has technology like this. We are very excited about it.
So, the year-end bonus is for all manually paid processing credits through the end of the year. All that you have to do is use the code CLIENTAPPRECIATION, to get a 20% bonus on all manually paid processing credits. If you have an enterprise account, you will need to contact your account manager in order to get this. But this is from now until the end of the year. This only happens once a year, and this will help you to get your Q4 goals accomplished, and also to get a head start on your Q1 goals for next year.
Let’s go ahead and jump into the questions. Number one, “Can I skip a step in the sequence without deleting it?”
I can answer that. Let me jump back to the web console, to show how that is done. Let’s go back into that computer sequence that we talked about. Let’s say for example, you want to essentially start from your product URLs agent. You can just simply select the step that you want to start from, and you can skip all of these other steps that are part of that sequence, and start from this step. Another thing that I would mention as well is that you can move steps. Say for example you want to move around the order of your run agent steps, or your clear, whatever it is, you can move those around with this feature here. I hope that answers your question.
Great, thanks. “As I am looking at this here I see that there are icons under config, would you mind explaining those?”
Yeah, no problem. There are only a few new icons that came with the sequencer, and that is really the icon here that tells you that you are running concurrent jobs. Most of the other icons should be pretty familiar. The clear collection data is a new one that you will see specifically in the sequencer. Hovering over these icons will tell you what they are. This one signifies that you are publishing FTP, this is the new publish to Google Drive icon. If you are ever confused about what an icon means, just simply hover over it and it will tell you exactly what it is doing.
Great. Another great question here. This is probably for Chris. “Are there common requests in websites that we should always block?”
Yeah, that’s a great question. As your users become more familiar with looking at the navigation requests that take place across many of the agents in your account, I’m sure they will notice as we have that there are many patterns that come about. What we are finding is that in most of the modern websites, there is a lot that is going on that has very little to do with the content that is being displayed on the page. For instance, people are trying to gather analytics information or they are trying to produce ads. There is a lot of social media integration that is going on. So, I think over time what you will find is that you end up doing a lot of similar work on a lot of your agents to have it block a lot of similar requests. Over time, you will probably end up with some sort of template that you use across a lot of your agents. I think that is a process for each of your users to go through, because they will be very specific to the type of websites that you are scraping data from. There isn’t really a one size fits all for every website, but you will definitely start to see trends as you spend more time looking at each of the requests that actually allow an agent to execute.
This is excellent, thank you guys. Just to let you know, we are going to have more webinars about these features. We are going to have at least one each, to dive deeper into these features for those of you who would like to attend. We’ll go ahead and send an email invitation in the next couple of weeks for that.
Thank you for attending. Like I said, we have broken another record for attendance here at Mozenda, these are very exciting releases. We can’t wait to hear from you and to hear the success that you are having with these new features. These were launched on Friday, so they are available for you to access right now. My name is Mike Alvarez, the Marketing Director here. Feel free to send an email to Support or to email@example.com. I would love to hear your feedback, or any stories on how you are using Mozenda. It would be a pleasure. Thank you for this, and we look forward to the next webinar. Talk to you soon.