Note, this post has been updated as of January 2009 to reflect changes in the Web Agent Builder version 1.8.128.
I recently added this entry as a post in our new forums (which by the way, we are very excited about!) and decided it deserved some attention here as well, given the increase in queries we receive about AJAX (no, not the cleaning powder) and how to handle it in the Web Agent Builder.
----
As web technology advances, many sites are using more advanced methods to display web content. For example, when using our web application to view a list of agents, when you click on an agent, the list disappears and is replaced with the details about the agent. The browser does not navigate to a new page, but rather, the webpage itself requests the new information from the server in the background and displays the new information by changing a part of the current webpage. This technology is known as AJAX (asynchronous JavaScript and XML; don't worry, it's not as scary as it sounds).
Because agents are primarily built using a Page structure, sites that rely heavily on AJAX to display content can be tricky. However, in most cases, sites that use AJAX do so lightly, and an agent can be designed to handle them. Here is a list of the cases where AJAX is most frequently used:
1) Paging a list of results (for example, clicking [b]next >>[/b] to get to the next set of results from a search). When paging, the next list of items simply replaces the old list without causing a new page to load. [action: Page List]
2) Clicking a list item causes the item details to appear somewhere on the same page. Often, there is a designated part of the existing page that the information appears in, or a box containing the information appears on the page covering the list (with some sort of 'close' button or link that causes it to go away). [action: Click Item]
3) Selecting a value from a drop-down causes a part of the page to change, or the values of another drop-down to populate (for example, selecting the automobile manufacturer in one drop-down causes another drop-down to populate with the available manufacturer models). [action: Set Element Value]
4) After a page loads, some of the page contents take additional time to finish loading. This often shows up when testing the agent in the builder: an Item Not Found error occurs for the first action on the page. [action: Page]
All cases can be handled by telling the agent to wait for AJAX to complete before proceeding to the next action. Most actions contain a property titled 'Wait for AJAX to alter the current web page'. To set it, double-click the specific action (or right-click the action and choose 'properties'), then click the 'Additional Settings' button in the properties panel. If this property is checked, the action will wait up to 2 seconds for AJAX requests by the webpage to begin, and then any additional time it takes for those AJAX calls to complete. For example, with a Click Item action that has this property checked, it may take less than a second after the click for an AJAX call to begin, but a total of 5 seconds for the AJAX call to finish. The next action will not be executed until any detected AJAX calls have completed.
On the other hand, the 'Wait x seconds before performing the next action' property of an action waits an absolute amount of time. You can also force an agent to wait an absolute number of seconds by inserting a Wait-Seconds action anywhere within your current list of actions. This can be done by right-clicking an action and choosing 'Insert a Wait-Seconds action after this action'.
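The two-phase wait described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Web Agent Builder's actual code: the `wait_for_ajax` function, the `pending_requests` callable (assumed to report how many AJAX requests are in flight) and the polling interval are all hypothetical.

```python
import time


def wait_for_ajax(pending_requests, start_timeout=2.0, poll=0.05,
                  clock=time.monotonic, sleep=time.sleep):
    """Wait up to start_timeout seconds for AJAX activity to begin,
    then wait until every detected request has finished.

    pending_requests is a hypothetical callable returning the number of
    in-flight requests. Returns True if any AJAX activity was seen.
    """
    deadline = clock() + start_timeout
    # Phase 1: give the page up to start_timeout seconds to start a request.
    while pending_requests() == 0:
        if clock() >= deadline:
            return False  # nothing started; move on to the next action
        sleep(poll)
    # Phase 2: no fixed limit here; wait for all detected calls to complete.
    while pending_requests() > 0:
        sleep(poll)
    return True
```

The `clock` and `sleep` parameters exist only so the sketch can be exercised with a simulated clock instead of real delays.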
Our many thanks to Tom at CodeSanity.net for his killer Mozenda review:
"Mozenda is a very powerful data scraping service. If you have ever found yourself writing scripts or manually copying and pasting data from one website to another then mozenda is for you. They have a very nice, full featured REST API which will be the focus of this article." Read more...
Tom wrote a nifty CodeIgniter Library (PHP) to easily interact with our Mozenda API. Download it here.
We look forward to his launch of MyGov365.
A few years ago, when InfoSquire was in its infancy, I received a phone call from a gentleman in New York named Jeff Stewart. He was curious if our technology was capable of 1) crawling multiple domains and discovering the presence of certain file formats and RSS/ATOM feeds, and 2) processing high volumes of web pages on demand.
At the time, InfoSquire was composed of no more than yours truly, and those of you who have run a tech sole-proprietorship know that when a potential client calls you asking if you have a feature that you know you could code up in a day or two, the answer is always yes. In this case yes and yes (this was early stage mind you).
Little did I know that I’d still be working with Jeff a few years later (and a couple of nice trips to Manhattan) and that I’d have built a whole system dedicated to managing, monitoring and saving the contents of (at one point) hundreds of thousands of RSS/ATOM feeds! The company that was using this custom feed “ping service”? You guessed it, Monitor110, a NY start-up specializing in real-time web-content monitoring/analysis and intelligence delivery to (primarily) the financial sector. Their goal was to provide a platform that gave knowledge workers a head-start on information that could impact their investments. This was accomplished at InfoSquire by “pinging” thousands of valuable resources to see when their contents had changed. Notification of changed resources was then passed on to Monitor110, which would then process the new contents of the resource.
At its height, our system was capable of pulling down well over a billion feeds per month, though the number of individual targeted feeds was eventually refined from hundreds of thousands to around 80,000 hand-picked premium feeds that contained the best information they were looking for. In turn, we provided a service to add/remove and update feed information in our system, as well as specify the interval at which a feed would be checked.
Last week Monitor110 announced that it was closing its doors. This is obviously sad to me for two reasons: 1) They had a great idea and had developed an awesome system, but unfortunately made some important direction changes too late in the game (see Roger Ehrenberg’s Monitor110: A Post Mortem) and 2) They were one of our best clients, and we loved working with that very talented and innovative group of individuals.
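For readers curious what such a ping service involves, here is a minimal Python sketch of the general idea: feeds are added or removed with a check interval, and a feed is reported as changed when a hash of its contents differs from the last one seen. It is not the InfoSquire implementation; the `FeedPinger` class, its method names and the use of SHA-256 are assumptions made for illustration.

```python
import hashlib


class FeedPinger:
    """Hypothetical sketch of a feed ping service with per-feed intervals."""

    def __init__(self):
        # url -> {"interval": seconds, "next_check": time, "digest": last hash}
        self.feeds = {}

    def add_feed(self, url, interval):
        self.feeds[url] = {"interval": interval, "next_check": 0.0, "digest": None}

    def remove_feed(self, url):
        self.feeds.pop(url, None)

    def ping_due(self, now, fetch):
        """Fetch every feed that is due and return the URLs whose contents changed.

        fetch is a callable taking a URL and returning the feed body as bytes.
        A feed counts as changed the first time it is fetched.
        """
        changed = []
        for url, feed in self.feeds.items():
            if now < feed["next_check"]:
                continue  # not due yet
            digest = hashlib.sha256(fetch(url)).hexdigest()
            if digest != feed["digest"]:
                changed.append(url)
            feed["digest"] = digest
            feed["next_check"] = now + feed["interval"]
        return changed
```

Only the list of changed URLs would be passed downstream, so the consumer processes a feed only when its contents actually differ.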
So I’ve been working lately on quite an overhaul to the InfoSquire backend. It involves the introduction of ‘Stateful Agent Executions’.
Essentially, it provides a mechanism for keeping track of exactly where an agent is in the execution process: which URLs have been visited, which URLs have yet to be visited, how URLs are related (their path), which records have been saved, are yet to be saved or had problems being saved, etc. The whole state of the agent execution can at any time be serialized and saved in case the execution needs to be stopped and resumed later. It also allows for a detailed analysis to be performed in the Agent debugger when a problem is reported.
In addition to these benefits, the stateful execution model will allow for the running of arbitrarily large or long-running agents since records will be saved to data repositories as they reach a ready-to-save state. Once records are saved, they can then be freed from memory.
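As a rough illustration of the idea (not the actual InfoSquire backend), the Python sketch below keeps the execution state in one serializable object and saves records as soon as they are captured. The `ExecutionState` class, the `run` function and the `fetch`/`save` callables are hypothetical names; `fetch` is assumed to return a pair of (records, links) for a URL.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class ExecutionState:
    pending: list = field(default_factory=list)   # URLs yet to be visited
    visited: list = field(default_factory=list)   # URLs already visited
    parents: dict = field(default_factory=dict)   # url -> url it was reached from
    saved_count: int = 0                          # records already written out

    def to_json(self):
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, text):
        return cls(**json.loads(text))


def run(state, fetch, save, max_pages=None):
    """Visit pending URLs, saving each record immediately instead of holding it."""
    pages = 0
    while state.pending and (max_pages is None or pages < max_pages):
        url = state.pending.pop(0)
        records, links = fetch(url)
        for record in records:
            save(record)              # written out now, so it can be freed
            state.saved_count += 1
        state.visited.append(url)
        for link in links:
            known = (link in state.parents or link in state.visited
                     or link in state.pending)
            if not known:
                state.parents[link] = url   # remember the path to this URL
                state.pending.append(link)
        pages += 1
    return state
```

Stopping after any page, writing `state.to_json()` to disk, and later calling `run(ExecutionState.from_json(text), ...)` continues from the same point rather than starting over.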
The current stateless execution model is not as efficient for 3 reasons. 1) All records are kept in memory until agent execution is complete. 2) Large or long-running agents can capture a lot of data and thus exhaust available memory, requiring multiple smaller agents to do the same work. 3) If something goes wrong during the execution process and causes it to crash, the process needs to start all over again.
Hooray for stateful agents! This will be rolling into production over the next few weeks.
"Gather the day," the analogy of plucking data from the web like fruit from a tree.
"Carpe diem is a phrase from a Latin poem by Horace (Odes 1.11). It is popularly translated to 'seize the day'. However, the most appropriate translation, considering the meaning of 'carpe' in the sentence as a whole, is believed to be 'gather the day', as in picking or plucking fruit." (wiki)
Web-data is literally analogous to low-hanging fruit in a few curious ways.
1) It's tangible. It's right there. It's in your browser. You can point your little finger at it, read it, copy it, paste it, print it...
It’s like a farmer: he can go from one tree to the next, plucking one piece of fruit at a time, just as a browser navigates from one website to another, one web page to the next. But at the end of the day everybody knows that the farmer will never produce anything useful unless he uses equipment to harvest the fruit, and A LOT of it quickly.
2) Fruit grows on trees. Websites are like trees (literally, they’re hierarchically shaped like trees!). Fruit typically grows on or near the ends of tree branches. Valuable web-data typically resides in the pages on or near the end of website navigation branches.
In other words, if you were to produce a 3D model out of the hierarchical page structure of most websites, you’d get a tree, and the valuable data would look like fruit on or near the ends of the branches.
This last analogy is useful when it comes to designing tools to harvest web-data. The inherent tree-structure of most websites goes a long way in determining the underlying infrastructure of the data represented by the website, not to mention the HTML itself.
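To make the tree picture concrete, here is a small Python sketch that walks a site map and collects whatever sits at the ends of the branches. The nested-dictionary representation and the `harvest` function are assumptions made for this example, not part of any InfoSquire tool.

```python
def harvest(tree, path=()):
    """Yield (path, data) for every leaf of a nested-dict site map."""
    for name, child in tree.items():
        here = path + (name,)
        if isinstance(child, dict):
            yield from harvest(child, here)   # a branch: keep descending
        else:
            yield "/".join(here), child       # a leaf: the "fruit"


# A toy site: category pages are branches, detail pages are leaves.
example_site = {
    "products": {"fruit": {"apple": "$1", "pear": "$2"}},
    "about": "company info",
}

for page, data in harvest(example_site):
    print(page, "->", data)
```

The walk never needs to know the site in advance; it simply follows each branch until it reaches a page that holds data.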