Screen scraping and web data extraction are terms for the process of using a specialized computer program to extract information from a website. Screen scraping differs from ‘spiders’, ‘miners’, and similar techniques because, instead of indexing everything on the page, a screen scraper extracts only the information the user selects. A good screen scraping program can navigate through more than one page while extracting whatever content the user wishes. Screen scraping programs can save you time and money and are especially useful for automating manual tasks such as copying and pasting information from the web into a document or spreadsheet, keying in data from the web, or keeping up with regularly changing pages.
Screen scraping is accomplished by building and running scripts commonly known as “agents”. Agents visit web pages and look for specific, user-defined content. Smart agents can navigate through a site much the same way a human would (e.g., by clicking on links on the page). As they navigate through pages, the agents ignore everything that is not needed and capture only the desired text.
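To make the idea of an agent concrete, here is a minimal sketch in Python using only the standard library. It is not how any particular vendor's product works. The page contents, the `example.test` URLs, the `price` and `next` class names, and the `PriceAgent` and `run_agent` names are all assumptions made up for illustration, and the "site" is an in-memory dictionary so the sketch runs without network access.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical two-page "site" held in memory (an assumption for this sketch).
PAGES = {
    "http://example.test/page1": (
        '<html><body><h1>Catalog</h1>'
        '<span class="price">$10</span>'
        '<div class="ad">Buy now!</div>'
        '<a class="next" href="page2">Next</a></body></html>'
    ),
    "http://example.test/page2": (
        '<html><body><span class="price">$12</span>'
        '<p>Unrelated text</p>'
        '<span class="price">$15</span></body></html>'
    ),
}


class PriceAgent(HTMLParser):
    """Captures text inside <span class="price"> and notes the 'next' link."""

    def __init__(self):
        super().__init__()
        self.prices = []
        self.next_href = None
        self._in_price = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "price" in classes:
            self._in_price = True
        elif tag == "a" and "next" in classes:
            self.next_href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        # Everything outside a price span is ignored.
        if self._in_price and data.strip():
            self.prices.append(data.strip())


def run_agent(start_url, fetch=PAGES.get, max_pages=10):
    """Visit pages by following the 'next' link, collecting only prices."""
    results, url, seen = [], start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        html = fetch(url)
        if html is None:
            break
        parser = PriceAgent()
        parser.feed(html)
        parser.close()
        results.extend(parser.prices)
        # Resolve the relative link against the current page, as a browser would.
        url = urljoin(url, parser.next_href) if parser.next_href else None
    return results


print(run_agent("http://example.test/page1"))  # ['$10', '$12', '$15']
```

The heading, the advertisement, and the unrelated paragraph are never recorded. Only the user-defined content (here, prices) is captured, and the agent moves from page to page by following a link, much as a person would click it. Against a real site, `fetch` would be replaced by a function that downloads the page.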
Screen scraping is not without its challenges. Pages that use a lot of JavaScript can be problematic for screen scraping technologies because they often change the information on the page without requiring the user to load a new page. This can cause agents that aren’t equipped to handle content changes made through JavaScript to completely miss or incorrectly record the data.
A second weakness of screen scraping is that an agent is built around the structure of the website at the time the agent was created. If the structure of the targeted website later changes (e.g., links or content are moved to a different part of the page), the agent typically can’t get the data because the page no longer matches the expected layout. The agent will then have to be either adjusted for the changes or rebuilt from scratch. Having to “fix an agent” every time a website changes can be time-consuming. When selecting a screen scraping vendor, make sure to research how the software notifies you of problems with agents and how quickly it lets you fix them. Finding software that helps you work around changing web pages will save you hours of frustration in the long run.
Finally, there are some anti-screen scraping techniques that make extracting data difficult or impossible. One example is the use of CAPTCHAs. Because agents typically cannot read or input the text displayed in the CAPTCHA, they can be prevented from logging in, submitting forms, or even continuing to view content on a site. Other websites employ programs that flag or block the abnormal bursts of traffic that are typical of agents scraping data. Programs looking for these bursts will occasionally block the IP address of the computer running the agent to prevent further scraping. To get around this problem, some screen scraping software offers access to anonymous proxy systems, which make the traffic difficult to block.
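The layout-change weakness described above can be caught early if an agent checks that a page still contains the structure it was built for before trying to extract anything. The following Python sketch shows one simple way to do that, assuming the agent locates its data by CSS class names. The sample pages, the class names, and the `ClassCounter` and `check_layout` helpers are hypothetical and are not taken from any vendor's software.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Counts how many elements carry each CSS class."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        for cls in (dict(attrs).get("class") or "").split():
            self.counts[cls] = self.counts.get(cls, 0) + 1


def check_layout(html, expected_classes):
    """Return the expected class names that no longer appear on the page."""
    counter = ClassCounter()
    counter.feed(html)
    counter.close()
    return [c for c in expected_classes if counter.counts.get(c, 0) == 0]


# Hypothetical page as it looked when the agent was built, and after a redesign.
OLD_PAGE = '<div class="product"><span class="price">$10</span></div>'
NEW_PAGE = '<div class="item"><span class="cost">$10</span></div>'
EXPECTED = ["product", "price"]

for name, page in [("old", OLD_PAGE), ("new", NEW_PAGE)]:
    missing = check_layout(page, EXPECTED)
    if missing:
        # A real agent might send an alert here instead of printing.
        print(f"{name}: layout changed, missing {missing}")
    else:
        print(f"{name}: layout OK")
```

A check like this does not repair the agent, but it turns a silent failure (empty or wrong data) into an explicit warning that names what has gone missing, which is the kind of notification worth looking for in scraping software.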
There are a few vendors in the market that are known to support a wide variety of sites and styles of programming. These include Connotate (www.connotate.com), Kapow Technologies (www.kapowtech.com), Mozenda (www.mozenda.com), QL2 (www.ql2.com) and Screen Scraper (www.screen-scraper.com). Screen scraping scripts can also be written by contract programmers, who can be found on sites like Elance (www.elance.com) and Guru (www.guru.com). When beginning any screen scraping project, be sure to consider all the costs. Some contract programmers charge large maintenance fees. Some software vendors require expensive hardware. By defining your screen scraping project clearly and researching costs, you will be able to find the most appropriate solution.