Using XPath in Mozenda - Mozenda Help Center

Using XPath in Mozenda

February 22, 2016

XPath allows you to customize how an action finds its target object on the page (for example, how a Click Item action finds the link it’s supposed to click). Learning to customize or write XPaths will help you build agents that are accurate and efficient. This article provides an introduction to using XPath in Mozenda.

Introduction

Let’s consider the following key points regarding the use of XPath in Mozenda.

  • A street address is to a house what an XPath is to elements on a web page. Just as you can use an address to find a house, actions in the agent use an XPath to find objects (such as links, pictures, or text) in the HTML.
  • In the Agent Builder, we use a special window that makes viewing the HTML easier, called the DOM window.
  • Individual locations in the HTML (i.e., particular elements in the HTML) are represented by tags.

If you don’t understand HTML at this time, don’t worry. It will make more sense as we go along.

To show how XPath works, we’ll reference a simple web page:

[Figure: Sample web page]

The HTML for that web page is shown next. The DOM is the street map of the web page.

[Figure: HTML of the sample web page]

Learn About the Target Object

Let’s target the first item in the first list. In the HTML, that corresponds to the first <li> tag of the first <ul>. Going back to our map analogy, there are a few routes you can take to get there:

  • You can write an XPath that starts at the top of the HTML document and then steps down until it reaches the list item.

    /html[1]/body[1]/ul[1]/li[1]

    Translated to plain English, this means—


    Go to the first <html> tag, then go down one level to the first <body> tag, then go down one level to the first <ul> tag, then go down one level to the first <li> tag and select it.

  • You can use attributes in the tags as identifiers.

    //ul[@class='List1']/li[1]

    Translated to plain English, this means—


    Go to a <ul> tag whose class attribute is ‘List1’, then go down one level to the first <li> tag and select it.

    Alternatively, you could specify elements using text contained in the elements.

    //ul[@class='List1']/li[contains(.,'List Item 1')]

    Translated to plain English, this means—

    Go to the <ul> tag whose class attribute is “List1”, then go down one level to the <li> tag containing the text “List Item 1”.
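The three routes above can be tried outside Mozenda. The following is a minimal sketch using Python's standard-library ElementTree, not Mozenda itself. The sample markup is a hypothetical stand-in for the article's sample page, and ElementTree supports only a subset of XPath, so the expressions are adapted slightly (noted in the comments; the text predicate needs Python 3.7 or later).

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the article's sample page (an assumption).
SAMPLE = """
<html><body>
  <ul class="List1"><h2>List 1</h2><li><b>List Item 1</b></li><li><b>List Item 2</b></li></ul>
  <ul class="List2"><h2>List 2</h2><li><b>List Item 1</b></li><li><b>List Item 2</b></li></ul>
</body></html>
"""
root = ET.fromstring(SAMPLE.strip())  # root is the <html> element

# Route 1: step down from the top. ElementTree paths are relative to the
# element you search from, so the leading /html[1] becomes ".".
by_position = root.find("./body[1]/ul[1]/li[1]")

# Route 2: use the class attribute as an identifier.
by_attribute = root.find(".//ul[@class='List1']/li[1]")

# Route 3: use the text. ElementTree has no contains(), so this is an exact
# match on the element's full text rather than a substring test.
by_text = root.find(".//ul[@class='List1']/li[.='List Item 1']")

print(by_position is by_attribute is by_text)  # True: all three reach the same <li>
print("".join(by_position.itertext()))         # List Item 1
```

All three expressions arrive at the same element, which is the point of the map analogy: different routes, one destination.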

Note that a given expression can match more than one tag in the HTML. In other words, if it’s not specific enough, it might lead to more than one place. For example, you could remove the class test from the second XPath so that it reads as follows:

//ul/li[1]

This matches the first <li> tag in each of the two lists in the example DOM. This is a good way to conceptualize an Item List in Mozenda: the set of tags which match a given XPath. If the XPath for a capture or click action matches more than one tag, Mozenda performs the action on the first tag matching the XPath.
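To see the multiple matches concretely, here is a small sketch in Python's standard-library ElementTree. The markup is a hypothetical stand-in for the sample page, and taking the first element of the result only imitates the "first match" rule described above; it is not Mozenda's implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the article's sample page (an assumption).
SAMPLE = """
<html><body>
  <ul class="List1"><h2>List 1</h2><li><b>List Item 1</b></li><li><b>List Item 2</b></li></ul>
  <ul class="List2"><h2>List 2</h2><li><b>List Item 1</b></li><li><b>List Item 2</b></li></ul>
</body></html>
"""
root = ET.fromstring(SAMPLE.strip())

# //ul/li[1] written relative to the root element for ElementTree.
matches = root.findall(".//ul/li[1]")
print(len(matches))  # 2: the first <li> of each list

# A single capture or click would act on the first match only.
target = matches[0]
print(target is root.find(".//ul[@class='List1']/li[1]"))  # True
```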

Item Lists

The XPaths we have covered so far assume that they start from the top of the HTML document and go down looking for the target object. When writing XPaths for Item Lists and other list-type actions, this is not the case. When an Item List is created, all capture, click, and input actions inside are automatically defined relative to tags matched by the Item List XPath. The Item List XPath is essentially the “anchor” on which related actions will be built. For example, in the sample webpage HTML, you could define a list using the following XPath: //ul

Remember to reference the sample web page HTML at the top of the article when trying to understand the XPath examples shown in this walkthrough.

Mozenda Action          XPath      Explanation
Begin Item List         //ul       For each <ul> tag …
    Capture – Header    h2         Go down one level and capture the <h2> tag
    Capture – Item 1    li[1]/b    Go down to the first <li> then down to the <b>
    Capture – Item 2    li[2]/b    Go down to the second <li> then down to the <b>
End List

Output:

Header    Item 1         Item 2
List 1    List Item 1    List Item 2
List 2    List Item 1    List Item 2
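The way capture XPaths are evaluated relative to each Item List match can be sketched as a loop. This is an illustration in Python's standard-library ElementTree under assumed sample markup (hypothetical, reconstructed from the output shown), not a description of how Mozenda runs agents internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the article's sample page (an assumption).
SAMPLE = """
<html><body>
  <ul class="List1"><h2>List 1</h2><li><b>List Item 1</b></li><li><b>List Item 2</b></li></ul>
  <ul class="List2"><h2>List 2</h2><li><b>List Item 1</b></li><li><b>List Item 2</b></li></ul>
</body></html>
"""
root = ET.fromstring(SAMPLE.strip())

rows = []
for ul in root.findall(".//ul"):           # Begin Item List: //ul
    header = ul.find("h2").text            # Capture - Header: h2
    item_1 = ul.find("li[1]/b").text       # Capture - Item 1: li[1]/b
    item_2 = ul.find("li[2]/b").text       # Capture - Item 2: li[2]/b
    rows.append((header, item_1, item_2))  # one output row per <ul>

for row in rows:
    print(" | ".join(row))
# List 1 | List Item 1 | List Item 2
# List 2 | List Item 1 | List Item 2
```

Each capture path starts at the current <ul>, not at the top of the document, which is why the paths carry no leading slashes.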

Alternatively, if you wanted to capture the data in a more “vertical” format, you could define the Item List as shown below.

Mozenda Action          XPath      Explanation
Begin Item List         //ul/li    For each <li> tag under a <ul> tag …
    Capture – Header    ../h2      Go up one level and capture the <h2> tag
    Capture – Item      ./b        Go down one level and capture the <b> tag
End List

Output:

Header    Item
List 1    Item 1
List 1    Item 2
List 2    Item 1
List 2    Item 2
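The "vertical" layout can be sketched the same way, with one row per <li>. This is an illustration in Python's standard-library ElementTree with hypothetical sample markup, so the captured item text depends on that assumed markup. ElementTree elements cannot step above the element a search starts from, so the `..` step is emulated here with a parent lookup table; a full XPath 1.0 engine would evaluate `../h2` directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the article's sample page (an assumption).
SAMPLE = """
<html><body>
  <ul class="List1"><h2>List 1</h2><li><b>List Item 1</b></li><li><b>List Item 2</b></li></ul>
  <ul class="List2"><h2>List 2</h2><li><b>List Item 1</b></li><li><b>List Item 2</b></li></ul>
</body></html>
"""
root = ET.fromstring(SAMPLE.strip())

# Map each element to its parent so ".." can be emulated.
parent_of = {child: parent for parent in root.iter() for child in parent}

rows = []
for li in root.findall(".//ul/li"):          # Begin Item List: //ul/li
    header = parent_of[li].find("h2").text   # Capture - Header: ../h2 (emulated)
    item = li.find("./b").text               # Capture - Item: ./b
    rows.append((header, item))

print(len(rows))  # 4: one row per <li>
print(rows[0])    # ('List 1', 'List Item 1')
```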

Associated Text and List Pagers

List Pagers and capture or click actions which have been associated with nearby text have two XPaths associated with them: one which identifies the “Anchor” tag (e.g. the “nearby text or image”), and another which describes the location of the item to be clicked or captured relative to the anchor tag.

Relative XPaths work by first finding a known object, such as specific text, and then navigating from there to the desired object.

[Figure: associating the data for each list item with the anchor text “List Item”]

Set Operations in XPath 1.0

Mozenda uses XPath 1.0, which means that the set operators introduced in XPath 2.0 (such as intersect and except) are not* available. Fortunately, there are workarounds.

Let $X and $Y be XPath expressions which each select a set of tags on a web page.

XPath 2.0 Expression    XPath 1.0 Equivalent           Description
$X union $Y             $X | $Y                        Select the tags matched by $X and the tags matched by $Y.
$X intersect $Y         $X[count(.|$Y)=count($Y)]      Select the tags matched by both $X and $Y.
$X except $Y            $X[count(.|$Y)!=count($Y)]     Select the tags matched by $X which are not also matched by $Y.
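The count-based workarounds rely on one observation: adding a tag to $Y leaves the size of $Y unchanged only if that tag is already in $Y. The sketch below reproduces that reasoning in plain Python over node lists. Python's standard-library ElementTree does not support `|` or `count()`, so the helper functions (`union`, `intersect`, `except_`) are hypothetical stand-ins for the XPath 1.0 expressions, and the sample markup is assumed.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the article's sample page (an assumption).
SAMPLE = """
<html><body>
  <ul class="List1"><h2>List 1</h2><li><b>List Item 1</b></li><li><b>List Item 2</b></li></ul>
  <ul class="List2"><h2>List 2</h2><li><b>List Item 1</b></li><li><b>List Item 2</b></li></ul>
</body></html>
"""
root = ET.fromstring(SAMPLE.strip())

def union(a, b):
    """Mimics $X | $Y: every node in a or b, without duplicates (by identity)."""
    out = list(a)
    out.extend(n for n in b if not any(n is m for m in out))
    return out

def intersect(x, y):
    """Mimics $X[count(.|$Y)=count($Y)]: keep nodes whose union with y is no bigger than y."""
    return [n for n in x if len(union([n], y)) == len(y)]

def except_(x, y):
    """Mimics $X[count(.|$Y)!=count($Y)]: keep nodes whose union with y grows y."""
    return [n for n in x if len(union([n], y)) != len(y)]

X = root.findall(".//li")                    # every <li>: 4 nodes
Y = root.findall(".//ul[@class='List1']/*")  # children of the first list: <h2> plus 2 <li>

print(len(union(X, Y)))      # 5: four <li> tags plus the <h2>
print(len(intersect(X, Y)))  # 2: the <li> tags of the first list
print(len(except_(X, Y)))    # 2: the <li> tags of the second list
```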

Notes

 *  The union operator, “|”, is available.

 

Here are some third-party resources you may find helpful.