Screen scraping and web data extraction are terms for the process of using a specialized computer program to extract information from a website. Screen scraping differs from ‘spiders’, ‘miners’, and similar techniques because, instead of indexing everything on the page, a screen scraper extracts only the information the user selects. A good screen scraping program can navigate through more than one page while extracting whatever content the user wishes. Screen scraping programs can save you time and money and are especially useful for automating manual tasks such as copying and pasting information from the web into a document or spreadsheet, keying in data from the web, or keeping up with regularly changing pages.
Screen scraping is accomplished by building and running scripts commonly known as “agents”. Agents visit web pages and look for specific, user-defined content. Smart agents can navigate through a site much the same way a human would (e.g., by clicking on links on the page). As they navigate through pages, the agents ignore everything that is not needed and capture only the desired text.
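As a rough illustration of what such an agent does, here is a minimal Python sketch using only the standard library. Everything specific in it is an assumption made for the example: the two-page site held in the `PAGES` dictionary, the `price` and `next` class names, and the `PriceAgent` and `run_agent` names. A real agent would fetch pages over the network and would need far more robust matching rules.

```python
from html.parser import HTMLParser

# Hypothetical two-page site held in memory so the sketch runs without network access.
PAGES = {
    "/products?page=1": '<div class="ad">Buy now!</div>'
                        '<span class="price">19.99</span>'
                        '<span class="price">5.00</span>'
                        '<a class="next" href="/products?page=2">Next</a>',
    "/products?page=2": '<p>Footer text</p>'
                        '<span class="price">42.50</span>',
}


class PriceAgent(HTMLParser):
    """Collects text inside <span class="price"> and remembers the 'next' link."""

    def __init__(self):
        super().__init__()
        self.prices = []
        self.next_url = None
        self._in_price = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "price":
            self._in_price = True          # user-defined content starts here
        elif tag == "a" and attrs.get("class") == "next":
            self.next_url = attrs.get("href")  # link the agent will "click"

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        # Everything outside the selected elements is ignored.
        if self._in_price:
            self.prices.append(data.strip())


def run_agent(start_url, fetch):
    """Follow 'next' links from start_url, returning every captured price."""
    captured, url, seen = [], start_url, set()
    while url and url not in seen:
        seen.add(url)
        parser = PriceAgent()
        parser.feed(fetch(url))
        parser.close()
        captured.extend(parser.prices)
        url = parser.next_url
    return captured


prices = run_agent("/products?page=1", PAGES.__getitem__)
print(prices)
```

The `fetch` argument is passed in so that the in-memory lookup used here could be swapped for a real HTTP client without changing the agent itself.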
Screen scraping is not without its challenges. Pages that use a lot of JavaScript can be problematic for screen scraping technologies because they often change the information on the page without requiring the user to load a new page. This can cause agents that aren’t equipped to handle content changes made through JavaScript to completely miss or incorrectly record the data.
A second weakness of screen scraping is that an agent is based on the structure of the website at the time the agent was built. If the structure of the targeted website later changes (e.g., links or content are moved to a different part of the page), the agent typically can’t get the data because the page no longer matches the expected layout. The agent then has to be either adjusted for the changes or rebuilt from scratch. Having to “fix an agent” every time a website changes can be time consuming. When selecting a screen scraping vendor, make sure to research what methods the software has to notify you of problems with agents and to let you fix those problems quickly. Finding software that helps you work around changing web pages will save you hours of frustration in the long run.
Finally, there are some anti-screen scraping techniques that make extracting data difficult or impossible. One example is the use of CAPTCHAs. Because agents typically cannot read or input the text displayed in a CAPTCHA, they can be prevented from logging in, submitting forms, or even continuing to view content on a site. Other websites employ programs that flag or block the abnormal bursts of traffic that are typical of agents scraping data. Programs looking for these bursts will occasionally block the IP address of the computer running the agent to prevent further scraping. To get around this problem, some screen scraping software offers access to anonymous proxy systems, which make the traffic difficult to block.
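The second weakness described above, an agent breaking when a page's layout changes, is easier to live with if the agent reports the mismatch instead of silently recording nothing. The following Python sketch shows one way that could look. The `RULES` patterns, the `LayoutChanged` exception, and the sample pages are all hypothetical, and matching with regular expressions is a simplification chosen only to keep the example short.

```python
import re

# Hypothetical extraction rules: field name -> regex with one capture group.
RULES = {
    "title": re.compile(r'<h1 class="title">(.*?)</h1>', re.S),
    "price": re.compile(r'<span class="price">(.*?)</span>', re.S),
}


class LayoutChanged(Exception):
    """Raised when a page no longer matches the layout the agent was built for."""


def extract(html, rules=RULES):
    """Return one record, or raise LayoutChanged naming the fields not found."""
    record, missing = {}, []
    for field, pattern in rules.items():
        match = pattern.search(html)
        if match:
            record[field] = match.group(1).strip()
        else:
            missing.append(field)
    if missing:
        raise LayoutChanged("missing fields: " + ", ".join(missing))
    return record


# The page as it looked when the agent was built, and after a redesign.
old_page = '<h1 class="title">Widget</h1><span class="price">9.99</span>'
new_page = '<h1 class="title">Widget</h1><div class="cost">9.99</div>'

print(extract(old_page))
try:
    extract(new_page)
except LayoutChanged as err:
    # A real agent might send an email or log an alert here.
    print("agent needs attention:", err)
```

Failing loudly on the first missing field trades some resilience for early warning, which fits the advice above about being notified of problems quickly.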
There are a few vendors in the market that are known to support a wide variety of sites and styles of programming. These include Connotate (www.connotate.com), Kapow Technologies (www.kapowtech.com), Mozenda (www.mozenda.com), QL2 (www.ql2.com) and Screen Scraper (www.screen-scraper.com). Screen scraping scripts can also be written by contract programmers, who can be found on sites like Elance (www.elance.com) and Guru (www.guru.com). When beginning any screen scraping project, be sure to consider all the costs. Some contract programmers charge large maintenance fees. Some software vendors require expensive hardware. By clearly defining your screen scraping project and researching costs, you will be able to find the most appropriate solution.
Comments
March 21, 2009 @ 09:43 #
All in all a good summary of screen scraping, but I do have a few comments.
With regard to website changes, your success depends a lot on the flexibility of the solution you use. With a good solution, you don't have to rewrite your agent every time a small change occurs. Also, changes actually happen a lot less often than people think. I once wrote a screen scraping agent for Intel.com, and it worked consistently for 2 years. I have worked with screen scraping for the last 5 years, and I spend far more time building new agents than maintaining old ones. If you don't, I think it is an indication that you are using the wrong tools (or are heavily understaffed). I know of screen scraping projects where each developer maintains 60+ agents (running daily) and still spends more than 70% of their time developing new agents.
Screen scraping is not just scraping; it can also be data entry and input automation, and although a screen scraping approach may often be a more volatile solution than backend API integration, it has a lot of upsides. The web interface is the most used GUI form today, and since you are 'programming' directly on the GUI and not an API, it is much easier for business users to formulate the requirements. For the developer it is also a major upside not to have to learn a new API for each application that must be integrated, since there is only one API, which is the site itself. If you have ever tried to reverse engineer a 40+ table layout to discover the business rules required to extract/insert data, you will happily open your arms to screen scraping. Since you are operating at the GUI level, all the business logic has already been applied (and tested), which is a huge time saver.
Back in 2005 I did an integration of two separate HR systems for a large government contractor. The project was completed in about 6 (man) weeks, since I didn't have to spend any time learning any PeopleSoft APIs. In this particular case, getting API access to the servers (which were owned and hosted by two separate entities) would probably have taken months of bureaucracy. The cost of the solution was about $150,000; it sounds like a lot of money, but it freed up 6 (East Coast) contractors, so it offered an extremely high ROI.
I have gotten past CAPTCHAs before, but only if they were simple enough to be read by an external OCR, which I could integrate with the screen scraping. If you're automating a data entry process and encounter a CAPTCHA, your solution should be capable of suspending the agent, asking a user for input (for the CAPTCHA or other), and then continuing the process. Also, there are actually companies in India which offer CAPTCHA reading by humans for a very reasonable price.
One thing that you should know about screen scraping is that it is very hard to use in systems that require transactional logic. Screen scraping operates over HTTP, which cannot give any delivery guarantees. For instance, if an error occurs in the last step of an online ticket booking and you don't get the booking number, you cannot know whether the ticket was booked by the remote system or not. If the booking number is the only identifier, you won't even be able to have another agent check it for you, in which case you probably have to grab the phone and have them look up the reservation using your Visa card (yes, I have tried that).
Good luck scraping /Klaus
Klaus
March 21, 2009 @ 09:44 #
I found more here (www.sourcearticle.info/) if anyone's interested.
Ben Thompson
March 21, 2009 @ 09:45 #
Please, can you PM me and tell me a few more things about this?
Smartcardguy
We'd be happy to contact you. Send an email to support@mozenda.com
Nate Graves
June 3, 2009 @ 23:34 #
This is great information; thank you so much for posting! If I hadn't run into this post, I would have been looking for a long time! Also, thank you to Klaus for the useful tips. Just what I've been looking for!
education software
June 14, 2009 @ 09:35 #
In the 1980s, financial data providers such as Reuters, Telerate, and Quotron displayed data in 24x80 format intended for a human reader. Users of this data, particularly investment banks, wrote applications to capture this character data and convert it into numeric data for inclusion in calculations for trading decisions without re-keying the data. The common term for this practice, especially in the United Kingdom, was page shredding, since the results could be imagined to have passed through a paper shredder.
Precision Engineers
June 29, 2009 @ 12:44 #
I hadn't been using my RSS reader for a while and I have a huge backlog of stuff to catch up on. Glad to have taken the time to catch up on your blog though. Cheers.
Kim
July 22, 2009 @ 16:46 #
Excellent stuff. Thanks for sharing such useful information with us. Thanks!
online poker
August 10, 2009 @ 09:58 #
Thank you very much for this post.
Free Slots
August 12, 2009 @ 16:57 #
Of course, what a great site and advisory posts. Can I add a backlink or import your RSS feed? Regards, Reader.
bad credit loans
August 16, 2009 @ 16:03 #
Admiring the time and effort you put into your blog and detailed information you offer!
Web design
August 27, 2009 @ 06:22 #
Awesome, just awesome. I don't have words to express my appreciation for this post. I am really impressed by it; the person who created it is a great human. Thanks for sharing this with us. I found this blog informative and interesting, so I think it is very useful and knowledgeable. I would like to thank you for the effort you have put into writing this article. I am hoping for the same quality of work from you in the future as well. In fact, your creative writing abilities have inspired me. Blogging really is spreading its wings rapidly, and your write-up is a fine example of it.
Fashion Industry News
August 28, 2009 @ 22:41 #
This article sheds light by which we can observe the reality. It is a very nice one and gives in-depth information. Thanks for this nice article. Good post, with valuable information for all. I will recommend that my friends read this for sure…
giochi gratuiti del casino
September 1, 2009 @ 11:38 #
This process of extracting data from the HTML is called screen scraping because it's scraping the data off the screen instead of getting the data more directly.
gochi
September 27, 2009 @ 03:50 #
Nice post. Keep up the good work.
Jacob
September 30, 2009 @ 04:16 #
This blog on screen scraping is really interesting; it's great for someone who extracts information from websites. http://www.raidious.com/
Content solutions