Screen scraping has had a bad rap for a long time, and its reputation is not entirely without merit. Ryanair, an Irish-based airline, announced in August of 2008 that it would cancel all tickets purchased through websites (e.g. BravoFly, Opodo, Atrapalo, OTBeach, et al.) that employed screen scraping techniques. There have also been countless examples of entire websites being duplicated using screen scraping. But does that mean that all screen scraping is bad? There are plenty of legitimate reasons to use techniques and technologies that allow you to get information from a website. Hopefully, this article will address the stigma attached to screen scraping by discussing some of its legitimate uses.
It’s been a long-standing practice of retail companies around the world to keep an eye on the pricing of the competition. By knowing your competitors’ prices, you’re able to make adjustments to your own pricing and remain an attractive shopping option. Now that most companies have moved their prices online, you no longer have to send “spies” into retail locations, spend hours leafing through newspaper inserts, or make price-inquiry phone calls. Many websites have no printed policy on the use of screen scraping techniques, and while that’s not an open invitation to do whatever you want on the site, it may mean that scraping is tolerated as long as you’re not causing an unreasonable strain on the site’s servers.
Forums can contain a wealth of useful information for product manufacturers, service providers, and marketers, but getting to that information is often clumsy and time-consuming. Provided the site doesn’t restrict the use of data extraction techniques, screen scraping can make a world of difference. Imagine you’re a cell phone manufacturer. You just released a new phone and want to keep an eye on the public’s reaction. Users are likely to be far more candid with the anonymity a forum offers than they would be in a more intimate setting such as a focus group. So, by monitoring a forum, the cell phone manufacturer may be able to find useful information such as design successes and flaws, manufacturing defects, and consumer demand. These same principles can be used to monitor blogs or blog comments in the event no RSS feed is available.
Getting product information to the people who need it can be a pain (especially if you’re one of the people who needs it). Distributors, wholesalers, and dropshippers often use archaic methods (CDs, Excel files, physical product catalogs) to get out product information. None of these methods gives people up-to-date information at the time they need it. This can make it impossible to determine inventory levels, adjust pricing, and be aware of new product offerings or discontinuations. Screen scraping can provide a rather elegant solution. Whether scraping your own site and providing the information to resellers or scraping the site of your distributor, you’re able to extract needed information in a timely, simple fashion. Some solutions, such as Mozenda, offer the ability not only to schedule a screen scraping agent to run regularly, but also to automatically export that information to a file or to a website. This means that you can either alert distributors to changes or, if you are a distributor, monitor suppliers’ changes, all without investing additional time and effort.
The above examples are only a handful of the thousands of legitimate uses for screen scraping. Hopefully in the future, responsible users will find new legal and ethical reasons to better organize and repurpose information from the web. Screen scraping, or whatever you choose to call it, won’t have such a stigma attached to it when that time arrives.
So, Mashup Camp ended about a week and a half ago and I’ve now had some time to reflect on what happened. While at the conference, we were running around all day doing demos, teaching people how to use the software, and meeting conference attendees and other sponsors. All in all the conference was a lot of fun. It was refreshing to be in a less structured conference environment even though it occasionally meant that things were a little more hectic. It was also refreshing to be around people who feel passionately about mashups.
We here at Mozenda have always felt that the Web Agent Builder would be very well suited for handling the data layer of mashups. The conference featured a number of sponsors whose software allowed users to easily link up data or data feeds to create the front-end mashup that users would see. That was exciting for us because we were able to show users how they could build those data feeds. It was also great to be able to show our software to people from Zembly, WetPaint, Calais, IBM, and Yahoo and see how impressed they were with it.
It was also interesting to listen to Tim O’Reilly’s keynote address on how the Internet is the new OS. He talked about how the web holds so many of the applications we use regularly: email, news, search, documents (with Google Docs), chat, etc. Mashups are also great candidates to contribute to the Internet OS because they allow users to derive more meaning from a single source than from multiple separate sources. As mashups help people gain ground on the massive amounts of information they are presented with, they will become a necessary destination and an integral part of people’s online usage.
All in all, I would say Mashup Camp was a successful unconference both in general and for Mozenda specifically. It was great to see people getting creative: gathering mashup data with Mozenda, plugging it into mashup-enabling software, and then wowing the attendees with it. Hopefully, we will continue to see innovation in the mashup space, and hopefully Mozenda can continue helping people get the data they need for the mashups they’re building.
The Internet is an ever-changing landscape. There is now more information available at a lower cost than ever before. With this wealth of data available, businesses are constantly looking for a way to turn information into a competitive edge. This article will discuss several uses of screen scraping technology that help businesses develop a greater competitive edge.
The first use of screen scraping that we’ll discuss is using it as a means for conducting market research. Companies create scraping “bots” or “agents” to extract helpful market information such as information about competitors (e.g. store locations, inventories, publicly available financial information, etc.), information about customers (e.g. discussions in forums, posts on blogs, etc.), and information about trends (e.g. changes in search traffic, online advertising, etc.). Companies are also able to use these agents to gather contact information (such as phone, fax, and e-mail) for individuals and businesses.
Another increasingly popular use of screen scraping is to automate repetitive tasks with scraping agents. For example, if there are websites from which you retrieve information on a frequent basis, you can replace that manual process with an agent that will automatically find and extract the data you need. Other companies use data extraction software to extract information from a competitor’s website and then use that pricing information to dynamically update price sheets or the prices of products on a website. This process of automating manual tasks can help ensure that companies remain competitive even when competitors run sales or offer rebates, and it can help reduce errors, save time, and save money.
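To make the price-sheet idea above more concrete, here is a minimal sketch in Python. It assumes competitor prices have already been scraped into a dictionary keyed by SKU. The function names, the one-cent undercut, and the per-product floor price are all hypothetical choices made for illustration, not the way any particular product works.

```python
from decimal import Decimal, ROUND_HALF_UP


def reprice(our_price, competitor_price, floor_price, undercut=Decimal("0.01")):
    """Undercut the competitor slightly, but never drop below our floor price."""
    if competitor_price >= our_price:
        return our_price  # already competitive, so leave the price alone
    target = competitor_price - undercut
    new_price = max(target, floor_price)
    return new_price.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def update_price_sheet(catalog, scraped_prices):
    """Return a SKU -> price mapping adjusted for any scraped competitor prices."""
    updated = {}
    for sku, item in catalog.items():
        if sku in scraped_prices:
            updated[sku] = reprice(item["price"], scraped_prices[sku], item["floor"])
        else:
            updated[sku] = item["price"]  # no competitor data for this product
    return updated


# Hypothetical data: our catalog and the prices an agent extracted.
our_catalog = {
    "A100": {"price": Decimal("24.99"), "floor": Decimal("18.00")},
    "B200": {"price": Decimal("5.00"), "floor": Decimal("4.75")},
    "C300": {"price": Decimal("9.99"), "floor": Decimal("7.00")},
}
competitor_prices = {"A100": Decimal("19.99"), "B200": Decimal("4.50")}

new_sheet = update_price_sheet(our_catalog, competitor_prices)
```

Using a floor price keeps an automated rule from following a competitor's sale below the point where the product is still worth selling.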
Another use of screen scraping tools is to create catalogs, lists, and databases of information. This data can be used in your own products, applications, and services. These catalogs, lists, and databases can then be updated dynamically simply by scheduling your agent to run regularly. This process of creating and updating data “collections” is gaining in popularity with financial firms, pharmaceutical companies, scientific institutions, and other research and information knowledge workers as it allows them to have constantly current information on which to base their decisions and recommendations.
Screen scraping can be used to create mashup websites, which are sites that combine data from more than one source to accomplish something that no single source could on its own. Some of the more popular web mashups include sites like Trulia, NetVibes, and RunningMap, each of which gathers information from several web sources and then consolidates it into a single website. Some screen scraping solutions on the market make using scraped data for a mashup easy by providing access to the data through an API.
Screen scraping can also be used to create custom data feeds. Many blogs use RSS to send notifications of new content out to readers, but most websites don’t offer feeds for the rest of their information. For information that doesn’t have a feed, screen scraping software can be used to pull the information into a database. Once the data is in a database, it can be easily fed out through a number of different methods (e.g. RSS, email, custom XML, FTP, etc.).
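As a rough illustration of the last step, the sketch below turns a list of already-scraped records into a small RSS 2.0 document using only Python's standard library. The record keys (`title`, `url`, `summary`) and the `build_rss` function are assumptions made for this example. A real feed would normally carry more fields, such as publication dates.

```python
import xml.etree.ElementTree as ET


def build_rss(channel_title, channel_link, records):
    """Build a minimal RSS 2.0 feed (as a string) from scraped records."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = f"Custom feed for {channel_title}"

    for record in records:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = record["title"]
        ET.SubElement(item, "link").text = record["url"]
        ET.SubElement(item, "description").text = record["summary"]

    # ElementTree escapes characters such as "&" when serializing.
    return ET.tostring(rss, encoding="unicode")


# Hypothetical rows, as they might come out of the database.
scraped = [
    {
        "title": "New widget",
        "url": "https://example.com/widget",
        "summary": "Now in stock & shipping",
    },
]
feed_xml = build_rss("Example store updates", "https://example.com/", scraped)
```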
Finally, uses for screen scraping software are not limited to getting information from other companies. Some businesses use screen scraping for combining data from incompatible software. Screen scraping can be used to search for items within your own company, interact with web-based programs from third parties, or test programs before distribution.
As more information is available through the Internet and more companies recognize the advantages to having up-to-date web data, there will be an increasing number of uses for screen scraping software. Using screen scraping will often allow you to save time and money and as a result improve profitability and decision making.
Web data extraction is commonly used by companies for market research. In other words, companies work to get company contact information such as phone, fax, and e-mail, as well as other company demographic information, production items, and anything else that could be used to sell products. But this is not the only way web data extraction can be used. Many of the new web data extraction software programs on the market are chock full of interesting goodies that make it more than just an extraction tool.
For one thing, web data extraction programs can impart your own company software on thousands of PCs already set up with your company information included. It will automate the program files of numerous small programs into one large installation application that can then be distributed out. Web data extraction works both ways when you have the right web data extraction program to work with.
A good web data extraction program can help you with detailed product testing and allows you to chain together tasks that are executed based on your parameters alone. This helps you determine whether or not the web data extraction is functioning properly when you send it out to search for the information you need on the Internet. It can help you tweak your programming and you have immediate results without having to make complicated changes over and over again.
If you need to move data from an obsolete system to a newer one, or from an incompatible system or website, the web data extraction program can extract the information for you using a variety of techniques, including screen scraping, and then send it to the computer system where you need it. This saves many companies time by keeping them from having someone rekey all of the data, which risks lost information and costly mistakes.
Web extraction programs can pull down information from websites that you check every day and dump the information into a spreadsheet for you, saving you the time it takes to check the same sites day in and day out. They can remember your passwords and log in for you, and if you need items submitted to search engines, good web extraction programs can provide this service as well. Once again, you save time and effort by automating some of the mundane and repetitive tasks you do with your web data extraction program.
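The "dump it into a spreadsheet" step can be as simple as writing a CSV file, which any spreadsheet program can open. The following is a minimal sketch assuming the extracted data is a list of dictionaries. The `records_to_csv` name and the sample fields are hypothetical.

```python
import csv
import io


def records_to_csv(records, fieldnames):
    """Serialize scraped records to CSV text, keeping only the listed columns."""
    buffer = io.StringIO()
    # extrasaction="ignore" drops any scraped keys that are not wanted columns.
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical output of a daily scraping run.
daily_rows = [
    {"site": "example.com", "price": "9.99", "debug": "ignored"},
    {"site": "example.org", "price": "10.49", "debug": "ignored"},
]
csv_text = records_to_csv(daily_rows, ["site", "price"])
```

To produce a file instead of a string, the same writer can be pointed at a file opened with `open(path, "w", newline="")`.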
You can import data from your other company programs into your web data extraction program and automate the running of inventory checks, customer record updates, report generation, and more. You simply program a macro like you would for web data extraction and screen scraping but use your company data instead, allowing the macro to search the company system.
Web data extraction programs are not limited to surfing the Internet. You can use them to search for items within your own company, automate a search on all of the websites you check on a daily basis, such as eBay, and automate other programs for distribution. They offer you more than just market research and demographics, and they can help save your company time and money.
Screen scraping and web data extraction are terms for the process of using a specialized computer program to extract information from a website. Screen scraping is different from using “spiders”, “miners”, or other similar techniques because, instead of indexing everything on the page, screen scrapers extract only the information selected by the user. A good screen scraping program is able to navigate through more than one page while extracting whatever content the user wishes. Screen scraping programs can save you time and money and are especially useful for automating human processes such as copying and pasting information from the web to a document or spreadsheet, keying in data from the web, or keeping up with regularly changing pages.
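The difference between indexing everything and extracting only what the user selected can be sketched with Python's built-in HTML parser. In this example the "selection" is a set of CSS class names. That convention, the `FieldExtractor` class, and the sample page are all assumptions for illustration. Commercial tools typically let you pick elements visually instead.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class FieldExtractor(HTMLParser):
    """Collect text only from elements whose class matches a wanted field."""

    def __init__(self, wanted_classes):
        super().__init__()
        self.wanted = set(wanted_classes)
        self.current = None  # name of the field being captured, if any
        self.depth = 0       # nesting depth inside the captured element
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.current is not None:
            self.depth += 1
            return
        for name in (dict(attrs).get("class") or "").split():
            if name in self.wanted:
                self.current = name
                self.depth = 1
                self.fields.setdefault(name, "")
                break

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.current is None:
            return
        self.depth -= 1
        if self.depth == 0:
            self.current = None

    def handle_data(self, data):
        if self.current is not None:
            self.fields[self.current] += data


def extract_fields(html, wanted_classes):
    parser = FieldExtractor(wanted_classes)
    parser.feed(html)
    parser.close()
    return {name: text.strip() for name, text in parser.fields.items()}


# Hypothetical product page: only two of the four elements are wanted.
page = """
<html><body>
<div class="nav">Home | Products</div>
<h1 class="product-name">Acme Phone</h1>
<span class="price">$199.00</span>
<p class="blurb">Marketing copy the user did not select</p>
</body></html>
"""
selected = extract_fields(page, ["product-name", "price"])
```

Everything outside the selected elements (navigation, marketing copy) is simply skipped, which is the behavior the paragraph above describes.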
Screen scraping is accomplished through building and running scripts that are commonly known as “agents”. Agents visit web pages and look for specific, user-defined content. Smart agents are able to navigate through a site much the same way a human would (i.e. clicking on links on the page). The agents, as they navigate through pages, ignore everything that is not needed and only capture the desired text.
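The navigation behavior described above (following links from page to page) can be sketched as a small breadth-first crawl. To keep the example runnable without network access, pages are supplied by a `fetch` function that is passed in. Here it is a dictionary lookup over a made-up three-page site. The `crawl` and `LinkCollector` names are hypothetical, and a real agent would fetch pages over HTTP and respect the site's terms.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch, max_pages=10):
    """Visit same-site pages breadth-first and return the URLs in visit order."""
    site = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be retrieved; skip it
            continue
        visited.append(url)
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            target = urljoin(url, href)  # resolve relative links
            if urlparse(target).netloc == site and target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


# Made-up site used in place of real HTTP requests.
fake_site = {
    "https://example.com/": '<a href="/products">Products</a> '
                            '<a href="https://other.example.org/">Partner</a>',
    "https://example.com/products": '<a href="/products/1">Item 1</a> <a href="/">Home</a>',
    "https://example.com/products/1": "<p>Item 1 details</p>",
}
order = crawl("https://example.com/", fake_site.get)
```

The `seen` set keeps the agent from revisiting pages, and the `max_pages` cap is one simple way to avoid placing an unreasonable load on a site.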
Screen scraping is not without its challenges. Pages that use a lot of JavaScript can be problematic for screen scraping technologies because they often change the information on the page without requiring the user to load a new page. This can cause agents that aren’t equipped to handle content changes made through JavaScript to completely miss or incorrectly record the data.
A second weakness of screen scraping is that an agent is built around the structure of the website at the time the agent was created. If, in the future, there are changes to the structure of the targeted website (e.g. links or content being moved to a different part of the page), the agent typically isn’t able to get the data because the page won’t match the expected layout. This means that the agent will either have to be fixed to adjust for the changes or be rebuilt from scratch. This process of having to “fix an agent” every time a website changes can be time-consuming. When selecting a screen scraping vendor, make sure to research what methods the software has to notify you of problems with agents and to let you quickly fix those problems. Finding software that helps you work around changing web pages will save you hours of frustration in the long run.
Finally, there are some anti-screen-scraping techniques that make extracting data difficult or impossible. One such example is the use of CAPTCHAs. Because agents cannot typically read or input the text displayed in the CAPTCHA, they can be prevented from logging in, submitting forms, or even continuing to view content on a site. Other websites employ programs that flag or block abnormal bursts of traffic that are typical of agents scraping data. Programs looking for these bursts in traffic will occasionally block the IP address of the computer running the agent to prevent further scraping. To get around this problem, some screen scraping software offers access to anonymous proxy systems, which make the traffic difficult to block.
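Returning to the second weakness above (agents breaking when a page’s structure changes), one common safeguard is to validate each run’s output and raise a flag when too many records look wrong. The sketch below is one possible way to do that, not a description of any vendor’s notification feature. The field names, the patterns, and the 20% failure threshold are assumptions chosen for the example.

```python
import re

# Hypothetical expectations for each extracted field.
REQUIRED_FIELDS = {
    "title": re.compile(r".+"),
    "price": re.compile(r"^\$\d+(\.\d{2})?$"),
}


def validate_record(record, required=REQUIRED_FIELDS):
    """Return a list of problems; an empty list means the record looks normal."""
    problems = []
    for field, pattern in required.items():
        value = (record.get(field) or "").strip()
        if not value:
            problems.append(f"missing field: {field}")
        elif not pattern.match(value):
            problems.append(f"unexpected format for {field}: {value!r}")
    return problems


def check_run(records, max_failure_rate=0.2):
    """Return (suspect, problems) for one scraping run.

    A run is suspect when it produced nothing, or when the share of
    records with problems exceeds max_failure_rate.
    """
    if not records:
        return True, ["no records extracted"]
    per_record = [validate_record(record) for record in records]
    bad = sum(1 for problems in per_record if problems)
    all_problems = [p for problems in per_record for p in problems]
    return bad / len(records) > max_failure_rate, all_problems


# A run against a page whose layout has changed: fields come back empty or wrong.
broken_run = [
    {"title": "Acme Phone", "price": ""},
    {"title": "", "price": "See cart"},
]
suspect, problems = check_run(broken_run)
```

When `suspect` is true, the run can be held back and someone notified before bad data reaches a price sheet or report.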
There are a few vendors in the market that are known to support a wide variety of sites and styles of programming. Some of these include Connotate (www.connotate.com), Kapow Technologies (www.kapowtech.com), Mozenda (www.mozenda.com), QL2 (www.ql2.com) and Screen Scraper (www.screen-scraper.com). Screen scraping scripts can also be written by contract programmers, who can be found on sites like Elance (www.elance.com) and Guru (www.guru.com). When beginning any screen scraping project, be sure to consider all the costs. Some contract programmers charge large maintenance fees. Some software vendors require expensive hardware. By having a clear definition of your screen scraping project and researching costs, you will be able to find the most appropriate solution.