What Is Screen Scraping?
Defining Screen Scraping
Screen scraping and web data extraction are terms for the process of using a specialized computer program to extract information from a website. Screen scraping differs from ‘spiders’, ‘miners’, and similar techniques because, instead of indexing everything on the page, a screen scraper extracts only the information the user selects. A good screen scraping program can navigate through more than one page while extracting whatever content the user wishes. Screen scraping programs can save you time and money and are especially useful for automating manual tasks such as copying and pasting information from the web into a document or spreadsheet, keying in data from the web, or keeping up with regularly changing pages.
How Does Screen Scraping Work?
Screen scraping is accomplished by building and running scripts commonly known as “agents”. Agents visit web pages and look for specific, user-defined content. Smart agents can navigate through a site much the same way a human would (for example, by clicking on links on the page). As they navigate through pages, the agents ignore everything that is not needed and capture only the desired text.
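To make the idea of an agent concrete, here is a minimal sketch in Python using only the standard library. The sample page, the "price" class name, and the PriceAgent class are all made up for illustration; a real agent would download the page over HTTP and would follow the links it collects rather than just listing them.

```python
from html.parser import HTMLParser

# Hypothetical sample page; a real agent would download this over HTTP.
SAMPLE_PAGE = """
<html><body>
  <div class="nav"><a href="/about">About us</a></div>
  <span class="price">19.99</span>
  <p>Unrelated marketing copy.</p>
  <span class="price">24.50</span>
  <a href="/products?page=2">Next page</a>
</body></html>
"""


class PriceAgent(HTMLParser):
    """Captures text inside <span class="price"> and records link targets."""

    def __init__(self):
        super().__init__()
        self.prices = []        # the user-defined content we want
        self.links = []         # pages the agent could visit next
        self._in_price = False  # True while inside a matching <span>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "price":
            self._in_price = True
        elif tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        # Everything outside a matching <span> is ignored.
        if self._in_price and data.strip():
            self.prices.append(data.strip())


def scrape(html):
    """Run the agent over one page and return (prices, links)."""
    agent = PriceAgent()
    agent.feed(html)
    agent.close()
    return agent.prices, agent.links


prices, links = scrape(SAMPLE_PAGE)
print(prices)  # ['19.99', '24.50']
print(links)   # ['/about', '/products?page=2']
```

The marketing paragraph never reaches the output because the agent only records text while it is inside an element it was told to look for.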
Shortcomings of Screen Scraping
Screen scraping is not without its challenges. Pages that use a lot of JavaScript can be problematic for screen scraping technologies because they often change the information on the page without requiring the user to load a new page. This can cause agents that aren’t equipped to handle content changes made through JavaScript to completely miss or incorrectly record the data.
A second weakness of screen scraping is that an agent is built around the structure of the website at the time the agent was created. If the structure of the targeted website later changes (e.g. links or content being moved to a different part of the page), the agent typically isn’t able to get the data because the page won’t match the expected layout. This means that the agent will either have to be adjusted for the changes or be rebuilt from scratch. Having to “fix an agent” every time a website changes can be time consuming. When selecting a screen scraping vendor, make sure to research what methods the software has to notify you of problems with agents and to let you fix those problems quickly. Finding software that helps you work around changing web pages will save you hours of frustration in the long run.
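One simple way an agent can notice that a site has changed is to sanity-check what it extracted before saving it. The sketch below is only an illustration under assumed names (find_extraction_problems, the "title" and "price" fields); it is not how any particular vendor's software reports problems.

```python
def find_extraction_problems(records, required_fields, min_records=1):
    """Return a list of human-readable problems found in scraped records.

    An empty list means the extraction looks plausible. A non-empty list
    suggests the page layout may have changed since the agent was built.
    """
    problems = []
    if len(records) < min_records:
        problems.append(
            f"expected at least {min_records} records, got {len(records)}"
        )
    for index, record in enumerate(records):
        for field in required_fields:
            value = record.get(field)
            # Treat missing keys and blank strings the same way.
            if value is None or not str(value).strip():
                problems.append(f"record {index}: missing '{field}'")
    return problems


# Hypothetical result of a run after the site moved its price element.
scraped = [
    {"title": "Widget", "price": "19.99"},
    {"title": "Gadget", "price": ""},
]
for problem in find_extraction_problems(scraped, ["title", "price"]):
    print("WARNING:", problem)
```

In practice the warnings would be sent by email or written to a log rather than printed, so that someone knows to repair the agent.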
Finally, there are some anti-screen scraping techniques that make extracting data difficult or impossible. One example is the use of CAPTCHAs. Because agents typically cannot read or input the text displayed in a CAPTCHA, they can be prevented from logging in, submitting forms, or even continuing to view content on a site. Other websites employ programs that flag or block the abnormal bursts of traffic that are typical of agents scraping data. Programs looking for these bursts will occasionally block the IP address of the computer running the agent to prevent further scraping. To get around this problem, some screen scraping software offers access to anonymous proxy systems, which make the traffic difficult to block.
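Proxies are one response to traffic-based blocking. Another, which the vendors above may or may not offer, is simply to keep the agent from producing bursts in the first place by spacing out its requests. The following is a rough sketch of that idea; the Throttle class and the two-second interval are assumptions made for the example.

```python
import time


class Throttle:
    """Enforces a minimum pause between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests
        self._clock = clock               # injectable for testing
        self._sleep = sleep
        self._last = None                 # time of the previous request

    def wait(self):
        """Block until the next request is allowed; return seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            if elapsed < self.min_interval:
                delay = self.min_interval - elapsed
                self._sleep(delay)
        self._last = self._clock()
        return delay


# Usage: call wait() before every page download.
throttle = Throttle(min_interval=2.0)
for url in ["/products?page=1", "/products?page=2"]:  # hypothetical URLs
    throttle.wait()
    print("would fetch", url)
```

Slowing an agent down this way makes a run take longer, but it also puts less load on the site being scraped.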
Starting Your Own Screen Scraping
There are a few vendors in the market that are known to support a wide variety of sites and styles of programming. These include Connotate (www.connotate.com), Kapow Technologies (www.kapowtech.com), Mozenda (www.mozenda.com), QL2 (www.ql2.com) and Screen Scraper (www.screen-scraper.com). Screen scraping scripts can also be written by contract programmers, who can be found on sites like Elance (www.elance.com) and Guru (www.guru.com). When beginning any screen scraping project, be sure to consider all the costs. Some contract programmers charge large maintenance fees. Some software vendors require expensive hardware. By clearly defining your screen scraping project and researching costs, you will be able to find the most appropriate solution.
2 Responses to “What Is Screen Scraping?”
October 23rd, 2008 at 2:53 pm
All in all a good summary of screen scraping, but I do have a few comments.
With regard to web site changes, your success depends a lot on the flexibility of the solution you use. With a good solution, you don’t have to rewrite your agent every time a small change occurs. Also, changes actually happen a lot less often than people think. I once wrote a screen scraping agent for Intel.com, and it worked consistently for 2 years.
I have worked with screen scraping for the last 5 years, and I spend far more time building new agents than maintaining old ones. If you don’t, I think it is an indication that you are using the wrong tools (or are heavily understaffed). I know of screen scraping projects where each developer maintains 60+ agents (running daily) and still spends more than 70% of their time developing new agents.
Screen scraping is not just scraping; it can also be data entry and input automation. Although the screen scraping approach may often be a more volatile solution than backend API integration, it has a lot of upsides. The web interface is the most used GUI form today, and since you are ‘programming’ directly on the GUI and not an API, it is much easier for business users to formulate the requirements.
For the developer it is also a major upside not to have to learn a new API for each application that must be integrated, since there is only one API which is the site itself. If you have ever tried to reverse engineer a 40+ table layout to discover the business rules required to extract/insert data, you will happily open your arms to screen scraping. Since you are operating at the GUI level all the business logic has already been applied (and tested) which is a huge time saver.
Back in 2005 I did an integration of two separate HR systems for a large government contractor. The project was completed in about 6 (man) weeks, since I didn’t have to spend any time learning any PeopleSoft APIs. In this particular case, getting API access to the servers (which were owned and hosted by two separate entities) would probably have taken months of bureaucracy. The cost of the solution was about $150,000; it sounds like a lot of money, but it freed up 6 (East Coast) contractors, so it offered an extremely high ROI.
I have gotten past CAPTCHAs before, but only if they were simple enough to be read by an external OCR which I could integrate with the screen scraping. If you’re automating a data entry process and encounter a CAPTCHA, your solution should be capable of suspending the agent, asking a user for input (for the CAPTCHA or other), and then continuing the process. Also, there are actually companies in India which offer CAPTCHA reading by humans for a very reasonable price.
One thing that you should know about screen scraping is that it is very hard to use in systems that require transactional logic. Screen scraping operates on the HTTP protocol, which cannot give any delivery guarantees. For instance, if an error occurs in the last step of an online ticket booking and you don’t get the booking number, you cannot know whether the ticket was booked by the remote system or not. If the booking number is the only identifier, you won’t even be able to have another agent check it for you, in which case you will probably have to grab the phone and have them look up the reservation using your Visa card (yes, I have tried that).
Good luck scraping
/Klaus
December 26th, 2008 at 9:00 am
I found more here if anyone’s interested