Best Practices: XPath and Regex for Web Scraping
June 16, 2016
Both XPath and regular expressions are important tools in Mozenda. Whether you are just getting started or a seasoned pro, knowing more about these features and how to use them in the Agent Builder can save a lot of time and headache in the agent-building process.
XPath is a query language that uses path expressions to select nodes in a markup language document. Web pages contain HTML (hypertext markup language), so XPath can be used to select text or images based on the specified query. Mozenda uses this method to identify and capture information or interact with an element based on its location within a web page. Every action used in the Agent Builder uses XPath.
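To make the idea of a path expression concrete, here is a minimal sketch using Python's standard-library ElementTree, which supports a limited subset of XPath. The page fragment, class names, and image paths are invented for illustration, and this is not how the Agent Builder evaluates queries internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for illustration.
PAGE = """
<html>
  <body>
    <h1>Widgets</h1>
    <div class="product">
      <span class="name">Blue Widget</span>
      <img src="/img/blue.png" alt="Blue Widget" />
    </div>
    <div class="product">
      <span class="name">Red Widget</span>
      <img src="/img/red.png" alt="Red Widget" />
    </div>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)

# Select text: every <span class="name"> anywhere below the root element.
names = [span.text for span in root.findall(".//span[@class='name']")]

# Select images: the src attribute of each <img> directly inside a product <div>.
images = [img.get("src") for img in root.findall(".//div[@class='product']/img")]

print(names)
print(images)
```

Each query describes where an element sits in the document and which attributes it carries, which is the same kind of information an XPath in the Agent Builder encodes.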
If you’re new to XPath and want to read more, start with these help articles:
Useful XPath Extensions
Since XPath can be used in many programming situations, some dedicated developers have built free utilities to test queries in real time directly in a browser. Let’s cover a few that stand out and how they can be used to improve your agent-building processes.
Note: The following are all third-party extensions for the Chrome browser. Mozenda does not provide support or training for these tools. Use them at your own discretion.
XPather makes it easy to check an existing XPath or to write a custom one by hand and view the results. After installing it, click the new icon in the corner of the browser and an overlay pops up. As you can see in the screenshot above (callouts have been added), once an XPath is entered, the matching elements on the current page are highlighted and listed in a sidebar.
Although XPather is the most basic of the tools we’re looking at, it comes in handy for quickly checking an XPath from the Agent Builder and making modifications as needed.
This extension takes the concept of XPather a step further by allowing the user to define a container for the query. Mozenda uses the same idea in item lists, where the actions inside the list are performed relative to the item list’s XPath. As with XPather, you click an icon to open it, and it displays the total number of results and highlights the matching elements on the page.
Lists are used on practically every website, so most agents built using Mozenda also use list actions. If you’re running into problems with a list in the Agent Builder, this tool can be very useful for troubleshooting website lists that contain complex or inconsistent markup.
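The container idea can be sketched in a few lines of Python: one query selects the list items, and each field query is then evaluated relative to a single item rather than to the whole page. The markup and the names ITEM_XPATH and FIELD_XPATHS are assumptions made up for this example, and ElementTree only supports a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical list markup; the third entry is an ad that should be skipped.
PAGE = """
<ul id="results">
  <li class="item"><a href="/p/1">Blue Widget</a><span class="price">$4.00</span></li>
  <li class="item"><a href="/p/2">Red Widget</a><span class="price">$5.50</span></li>
  <li class="ad"><a href="/promo">Sponsored</a></li>
</ul>
""".strip()

root = ET.fromstring(PAGE)

# The container query: one match per list item.
ITEM_XPATH = "./li[@class='item']"

# Field queries are evaluated relative to each matched item, not the page.
FIELD_XPATHS = {"name": "./a", "price": "./span[@class='price']"}

rows = []
for item in root.findall(ITEM_XPATH):
    row = {}
    for field, xpath in FIELD_XPATHS.items():
        node = item.find(xpath)
        row[field] = node.text if node is not None else None
    rows.append(row)

print(rows)
```

If the container query matches the wrong elements (for example, the ad entry), every relative query inherits the problem, so the list XPath is usually the first thing to check when troubleshooting.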
While the above tools are useful for fine-tuning an existing XPath query, RexPath can be used to create custom queries from scratch. Just like XPather and XPath Helper, it highlights matching elements and displays the number of matches, but it also shows the attributes of each HTML element and updates the query on the fly based on the user’s selections.
After installing the extension, you can access it by right-clicking on the element you’re interested in and selecting “.RexPath” from the context menu. An overlay will appear at the bottom right of the page and display several explicit XPath queries that direct to that one element, but you’ll also notice a Custom Query Builder at the bottom, which is where the magic happens.
The default “short” XPath will usually be based on the HTML element and the text it contains, which often has just one match on a page. Clicking on the plus (+) button will add the element’s ancestor (the element that contains the selected element) to the XPath and remove the text requirement, which should yield more matches.
On more complex websites, pay close attention to the attributes of both the subject element and its ancestors. As you navigate through the ancestry of the element and build a more specific query, RexPath also displays the attributes of each element, giving you the opportunity to be even more specific based on CSS classes or IDs, visible text, and more. To apply an attribute to your query, click on the name or value shown in the corresponding table. View the image above for a quick demonstration of how the query matches change from 1 to 104, then 17, then the desired 10 per page.
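The narrowing process described above can be imitated outside the browser. The following sketch counts matches for a series of progressively adjusted queries against an invented page; the markup, IDs, and class names are hypothetical, and the counts are specific to this sample rather than the ones in the demonstration image.

```python
import xml.etree.ElementTree as ET

# Hypothetical page with links in the navigation, the results list, and the footer.
PAGE = """
<body>
  <div id="nav"><a href="/">Home</a><a href="/about">About</a></div>
  <div id="main">
    <ul class="results">
      <li><a class="title" href="/p/1">Blue Widget</a><a class="more" href="/p/1#reviews">Reviews</a></li>
      <li><a class="title" href="/p/2">Red Widget</a><a class="more" href="/p/2#reviews">Reviews</a></li>
      <li><a class="title" href="/p/3">Green Widget</a><a class="more" href="/p/3#reviews">Reviews</a></li>
    </ul>
  </div>
  <div id="footer"><a href="/contact">Contact</a></div>
</body>
""".strip()

root = ET.fromstring(PAGE)

QUERIES = [
    ".//a[.='Blue Widget']",                       # element plus its exact text
    ".//a",                                        # text requirement removed
    ".//ul[@class='results']//a",                  # restricted to an ancestor
    ".//ul[@class='results']//a[@class='title']",  # restricted by attribute
]

# Number of matching elements for each query, in order.
counts = [len(root.findall(query)) for query in QUERIES]

for query, count in zip(QUERIES, counts):
    print(f"{count:2d}  {query}")
```

Watching the count move as each condition is added or removed is a quick way to tell whether a query is too narrow, too broad, or matching exactly the items you want.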
RexPath is extremely useful for cases where the Agent Builder’s default XPath queries don’t consistently yield the desired results and require additional attention.
Once you have found the perfect XPath for your capture item, go back to the Agent Builder and copy/paste your query from your tool of choice into the XPath panel. Make sure that you replace existing entries and create backup queries as needed.
Note that none of the extensions listed above can completely replace a working knowledge of XPath. Treat these as supplemental tools that can help you get started, save you time, and assist you with troubleshooting.
A regular expression (or “regex” for short) is a special text string for describing a search pattern. You can think of regular expressions as a more powerful form of wildcards. You may have used wildcards in your file manager in the past to find your text files by typing “*.txt”. Regular expressions apply the same idea with a much richer pattern syntax.
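A short Python sketch shows how a file-manager wildcard and a regular expression relate. The file names are invented for the example; fnmatch and re are standard-library modules.

```python
import fnmatch
import re

# Hypothetical file names.
FILES = ["notes.txt", "report.txt", "photo.png", "txt_backup.zip", "log1.txt", "log22.txt"]

# File-manager style wildcard: "*" stands for any run of characters.
wildcard_hits = [name for name in FILES if fnmatch.fnmatchcase(name, "*.txt")]

# The same search as a regular expression: ".*" means "any characters" and
# "\." is a literal dot. fullmatch requires the whole name to fit the pattern.
regex_hits = [name for name in FILES if re.fullmatch(r".*\.txt", name)]

# Something harder to say with a simple wildcard:
# "log" followed by one or more digits, then ".txt".
log_hits = [name for name in FILES if re.fullmatch(r"log\d+\.txt", name)]

print(wildcard_hits)
print(regex_hits)
print(log_hits)
```

The first two searches return the same files; the third shows the extra precision that regular expressions add on top of the wildcard idea.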
In Mozenda, regular expressions are used to identify specific strings of text within a larger body of text. This often includes phone numbers, addresses, emails, product numbers, and any other important data. With regular expressions, these strings of text can be isolated and captured separately from unnecessary data.
Much like XPath, regular expressions are used in many facets of software engineering, and there are several websites that are worth looking at to learn more. Here are some that we recommend:
- RegexOne. A simple, course-style experience with interactive exercises.
- Regex101. Online testing tool with a handy reference guide and matching results window. Perfect for troubleshooting regular expressions on a specific body of text.
In the Agent Builder, each capture action can make use of regular expressions through the Refine Captured Text window via the right-click context menu. Check the box next to Regular Expression to use this feature. In the above example, a regular expression is being used to extract a phone number from a sentence. Also note that you can use multiple capture definitions for a single capture, which is useful when websites use different formatting for the same information (e.g. (123) 456 7890 vs. 123.456.7890).
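The idea of several definitions covering different formats of the same data can be sketched in Python as a list of patterns tried in order. The patterns, the extract_phone helper, and the normalized output format are assumptions for illustration; they are not Mozenda's built-in definitions.

```python
import re

# Hypothetical capture definitions, tried in order; each handles one style.
PHONE_PATTERNS = [
    re.compile(r"\((\d{3})\)\s*(\d{3})[\s-](\d{4})"),  # (123) 456 7890, (123) 456-7890
    re.compile(r"\b(\d{3})[.-](\d{3})[.-](\d{4})\b"),  # 123.456.7890, 123-456-7890
]


def extract_phone(text):
    """Return the first phone number found, normalized as 123-456-7890, or None."""
    for pattern in PHONE_PATTERNS:
        match = pattern.search(text)
        if match:
            # Join the three captured digit groups into one consistent format.
            return "-".join(match.groups())
    return None


print(extract_phone("Call us at (123) 456 7890 today."))
print(extract_phone("Fax: 123.456.7890"))
print(extract_phone("No number in this sentence."))
```

Putting the stricter pattern first and falling back to looser ones keeps each expression short and easy to test on its own in a tool such as Regex101.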
Using XPath and regular expressions well helps you collect data accurately and with less rework. Mozenda provides a solid base for getting started with these, but every website is different and some sites will require more attention than others.
Have questions or suggestions? Contact Mozenda Support at firstname.lastname@example.org or by calling (801) 995-4550, option 2.