Note, this post has been updated as of January 2009 to reflect changes in the Web Agent Builder version 1.8.128.
I recently added this entry as a post in our new forums (which by the way, we are very excited about!) and decided it deserved some attention here as well, given the increase in queries we receive about AJAX (no, not the cleaning powder) and how to handle it in the Web Agent Builder.
----
As web technology advances, many sites are using more advanced methods to display web content. For example, when you view a list of agents in our web application and click on an agent, the list disappears and is replaced with the details about the agent. The browser does not navigate to a new page. Instead, the webpage itself requests the new information from the server in the background and displays it by changing a part of the current webpage. This technology is known as AJAX (asynchronous JavaScript and XML; don't worry, it's not as scary as it sounds).
Because agents are primarily built using a Page structure, sites that rely heavily on AJAX to display content can be tricky. However, in most cases, sites that use AJAX do so lightly, and an agent can be designed to handle them. Here is a list of cases where AJAX is most frequently used:
1) Paging a list of results (for example, clicking 'next >>' to get to the next set of results from a search). When paging, the next list of items simply replaces the old list without causing a new page to load. [action: Page List]
2) Clicking a list item causes the item details to appear somewhere on the same page. Often, there is a designated part of the existing page that the information appears in, or a box containing the information appears on the page covering the list (with some sort of 'close' button or link that causes it to go away). [action: Click Item]
3) Selecting a value from a drop-down causes a part of the page to change, or the values of another drop-down to populate (for example, selecting the automobile manufacturer in one drop-down causes another drop-down to populate with the available manufacturer models). [action: Set Element Value]
4) After a page loads, some of the page contents take additional time to finish loading. This often shows up when testing the agent in the builder: an Item Not Found error occurs for the first action on the page. [action: Page]
All cases can be handled by telling the agent to wait for AJAX to complete before proceeding to the next action. Most actions contain a property titled 'Wait for AJAX to alter the current web page'. To set it, double-click the specific action (or right-click the action and choose 'properties') and click the 'Additional Settings' button in the properties panel. If this property is checked, the action will wait up to 2 seconds for the webpage to begin making AJAX requests, and then any additional time it takes for those requests to complete. For example, with a Click Item action that has this property checked, an AJAX call may begin less than a second after the click but take a total of 5 seconds to finish. Either way, the next action will not be executed until any detected AJAX calls have completed.
On the other hand, the 'Wait x seconds before performing the next action' property of an action waits an absolute amount of time, whether or not any AJAX activity occurs. You can also force an agent to wait an absolute number of seconds by inserting a Wait-Seconds action anywhere within your current list of actions. This can be done by right-clicking an action and choosing 'Insert a Wait-Seconds action after this action'.
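For readers who think in code, here is a rough Python sketch of the two-phase wait described above. It is not the Web Agent Builder's actual implementation. The function `wait_for_ajax`, its parameters, and the `SimulatedPage` class are all hypothetical names made up for illustration, and the page is simulated so that the example runs without a browser.

```python
import time
from typing import Callable


def wait_for_ajax(
    pending_requests: Callable[[], int],
    start_timeout: float = 2.0,
    poll_interval: float = 0.05,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Wait for AJAX activity to begin, then for it to finish.

    Returns True if requests were detected and completed, or False if
    none began within start_timeout seconds.
    """
    deadline = clock() + start_timeout
    # Phase 1: give the page up to start_timeout seconds to start a request.
    while pending_requests() == 0:
        if clock() >= deadline:
            return False  # nothing started; move on to the next action
        sleep(poll_interval)
    # Phase 2: requests are in flight; wait however long they take.
    while pending_requests() > 0:
        sleep(poll_interval)
    return True


class SimulatedPage:
    """Hypothetical stand-in for a page whose AJAX call runs from start to end."""

    def __init__(self, start: float, end: float) -> None:
        self.now = 0.0
        self.start = start
        self.end = end

    def clock(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds  # advance simulated time instead of really sleeping

    def pending(self) -> int:
        return 1 if self.start <= self.now < self.end else 0


# The call begins 0.5 s after the click and finishes at the 5 s mark.
page = SimulatedPage(start=0.5, end=5.0)
finished = wait_for_ajax(page.pending, clock=page.clock, sleep=page.sleep)
print(finished, page.now)
```

The point of the two phases is that the 2-second limit only applies to waiting for a request to start. Once one is detected, the wait lasts as long as the request does.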
Screen scraping has gotten a bad rap for a long time, and its reputation is not entirely without merit. Ryanair, an Irish-based airline company, announced in August of 2008 that it would cancel all tickets purchased through websites (e.g. BravoFly, Opodo, Atrapalo, OTBeach, et al.) that employed screen scraping techniques. There have also been countless examples of entire websites being duplicated using screen scraping. But does that mean that all screen scraping is bad? There are plenty of legitimate reasons to use techniques and technologies that allow you to get information off a website. Hopefully, this article will address the stigma attached to screen scraping by discussing some of its legitimate uses.
It’s been a long-standing practice of retail companies around the world to keep an eye on the pricing of the competition. By knowing your competitors’ prices, you’re able to make adjustments to your own pricing and remain an attractive shopping option. Now that most companies have moved their prices online, you no longer have to send “spies” into retail locations, spend hours leafing through newspaper inserts, or make price-inquiry phone calls. Many websites have no printed policy on the use of screen scraping techniques. While that’s not an open invitation to do whatever you want on the site, it may mean that collecting prices this way is acceptable, as long as you’re not causing an unreasonable strain on the site’s servers.
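One simple way to avoid straining a site's servers is to space out your requests. The Python sketch below shows the idea under some assumptions: the `PoliteThrottle` class, the interval used, and the example URLs are all invented for illustration and are not part of any Mozenda product.

```python
import time
from typing import Callable, Optional


class PoliteThrottle:
    """Enforce a minimum delay between consecutive requests to one site."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request: Optional[float] = None

    def wait(self) -> float:
        """Block until the next request is allowed; return seconds waited."""
        waited = 0.0
        if self._last_request is not None:
            elapsed = self._clock() - self._last_request
            if elapsed < self.min_interval:
                waited = self.min_interval - elapsed
                self._sleep(waited)
        self._last_request = self._clock()
        return waited


# Hypothetical price pages; a real agent would fetch each one after wait().
price_pages = [
    "https://example.com/widgets",
    "https://example.com/gadgets",
    "https://example.com/gizmos",
]
throttle = PoliteThrottle(min_interval=0.05)  # tiny interval so the demo is quick
for url in price_pages:
    throttle.wait()
    print("would fetch", url)
```

In practice you would pick an interval measured in seconds rather than milliseconds, and err on the side of being slower than you need to be.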
Forums can contain a wealth of useful information for product manufacturers, service providers, and marketers, but getting to that information is often clumsy and time-consuming. Provided the site doesn’t restrict the use of data extraction techniques, using screen scraping can make a world of difference. Imagine you’re a cell phone manufacturer. You just released a new phone and want to keep an eye on the public’s reaction. Users are likely to be far more candid with the anonymity a forum offers than they would be in a more intimate setting such as a focus group. So, by monitoring a forum, the cell phone manufacturer may be able to find useful information such as design successes and flaws, manufacturing defects, and consumer demand. These same principles can be used to monitor blogs or blog comments in the event no RSS feed is available.
Getting product information to the people who need it can be a pain (especially if you’re one of the people who needs it). Distributors, wholesalers, and dropshippers often use archaic methods (CDs, Excel files, physical product catalogs) to get out product information. None of these methods give those who need up-to-date information what they need at the time they need it. This can make it impossible to determine inventory levels, adjust pricing, and be aware of new product offerings or discontinuations. Screen scraping can provide a rather elegant solution. Whether scraping your own site and providing the information to resellers or scraping the site of your distributor, you’re able to extract needed information in a timely, simple fashion. Some solutions, such as Mozenda, offer the ability to not only regularly schedule a screen scraping agent, but to also automatically export that information to a file or to a website. This means that you can either alert distributors to changes or, if you are a distributor, monitor suppliers’ changes, all without investing additional time and effort.
The above examples are only a handful of the thousands of legitimate uses for screen scraping. Hopefully in the future, responsible users will find new legal and ethical reasons to better organize and repurpose information from the web. Screen scraping, or whatever you choose to call it, won’t have such a stigma attached to it when that time arrives.
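To make the "monitor suppliers' changes" idea concrete, here is a small Python sketch that compares two scraped snapshots of a catalog and reports new products, discontinuations, and price changes. The function `diff_catalogs`, the SKU codes, and the prices are all made up for illustration; this is not how Mozenda itself stores or exports data.

```python
from typing import Dict, List, Tuple

Catalog = Dict[str, float]  # product SKU -> price


def diff_catalogs(old: Catalog, new: Catalog) -> Dict[str, list]:
    """Compare two snapshots and report what changed between them."""
    added: List[str] = sorted(sku for sku in new if sku not in old)
    discontinued: List[str] = sorted(sku for sku in old if sku not in new)
    repriced: List[Tuple[str, float, float]] = sorted(
        (sku, old[sku], new[sku])
        for sku in old
        if sku in new and old[sku] != new[sku]
    )
    return {"added": added, "discontinued": discontinued, "repriced": repriced}


# Hypothetical results of two scheduled runs of the same agent.
yesterday = {"A100": 19.99, "B200": 5.00, "C300": 42.50}
today = {"A100": 17.99, "C300": 42.50, "D400": 8.25}

changes = diff_catalogs(yesterday, today)
for kind, items in changes.items():
    print(kind, items)
```

Run on a schedule, a comparison like this turns a pile of scraped rows into a short list of things a reseller actually needs to act on.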
"Gather the day," the analogy of plucking data from the web like fruit from a tree.
"Carpe diem is a phrase from a Latin poem by Horace (Odes 1.11). It is popularly translated to 'seize the day'. However, the most appropriate translation, considering the meaning of 'carpe' in the sentence as a whole, is believed to be 'gather the day', as in picking or plucking fruit." (wiki)
Web-data is literally analogous to low-hanging fruit in a few curious ways.
1) It's tangible. It's right there. It's in your browser. You can point your little finger at it, read it, copy it, paste it, print it...
It’s like a farmer who can go from one tree to the next, plucking one piece of fruit at a time, just as a browser navigates from one website to another, one web page to the next. But at the end of the day, everybody knows that the farmer will never produce anything useful unless he uses equipment to harvest the fruit, and A LOT of it, quickly.
2) Fruit grows on trees. Websites are like trees (literally, they’re hierarchically shaped like trees!). Fruit typically grows on or near the ends of tree branches. Valuable web-data typically resides in the pages on or near the end of website navigation branches.
In other words, if you were to produce a 3D model out of the hierarchical page structure of most websites, you’d get a tree, and the valuable data would look like fruit on or near the ends of the branches.
This last analogy is useful when it comes to designing tools to harvest web-data. The inherent tree-structure of most websites goes a long way in determining the underlying infrastructure of the data represented by the website, not to mention the HTML itself.
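Here is the tree analogy as a short Python sketch: a site's navigation is modeled as nested dictionaries, and a depth-first walk collects the pages at the ends of the branches. The site layout, the page names, and the `leaf_pages` function are invented for illustration, and a real site would of course need to be crawled to discover this structure.

```python
from typing import Dict, Iterator, Tuple


def leaf_pages(tree: Dict[str, dict], path: Tuple[str, ...] = ()) -> Iterator[str]:
    """Yield the URL path of every page with no child pages (depth-first)."""
    for name, children in tree.items():
        current = path + (name,)
        if children:
            # A branch: keep following it toward the ends.
            yield from leaf_pages(children, current)
        else:
            # A leaf: this is where the "fruit" usually hangs.
            yield "/" + "/".join(current)


# Hypothetical site hierarchy: each key is a page, each value its child pages.
site = {
    "products": {
        "phones": {"model-a": {}, "model-b": {}},
        "tablets": {"model-c": {}},
    },
    "about": {},
}

leaves = list(leaf_pages(site))
for page in leaves:
    print(page)
```

The category pages ("products", "phones") are just branches to climb; the detail pages at the ends are where the harvest happens.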