Objective: Learn how to get the current time in NST using XPath
Author: Josh
Last Revised: 2 November, 2014
1. XPath
XPath (XML Path Language) is a language for querying XML documents (or in our case HTML documents). An XPath expression describes a path through the document's tree of elements, the DOM (Document Object Model), using a few key selection characters.
Compared to regular expressions, XPath is stricter: a query is more likely to break than a regular expression when the referenced document changes (for example, when Neopets updates its HTML). The upside of that strictness is that you have much better control over what you're actually parsing, which reduces the chance that you'll match multiple strings when you're only expecting one.
Python's standard library does not include a full XPath implementation (xml.etree.ElementTree only supports a limited subset). Instead, for the duration of this tutorial we'll be using the widely popular library 'lxml'. To install it:
pip install lxml
If you don't have pip, go get it.
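With lxml installed, the trade-off against regular expressions described above can be sketched in a few lines. The markup below is made up for illustration (I'm assuming the clock lives in an element with an id of "nst", and I've added a second time string on purpose); the real Neopets page will look different.

```python
import re

import lxml.html

# Hypothetical markup for illustration only, not the real Neopets page.
SAMPLE = """
<html><body>
  <span id="nst">7:50:34 am NST</span>
  <p>Event starts at 9:00:00 pm NST</p>
</body></html>
"""

TIME_PATTERN = r"\d{1,2}:\d{2}:\d{2} [ap]m NST"
NST_XPATH = '//*[@id="nst"]/text()'

# A loose regular expression matches every time-like string on the page.
regex_hits = re.findall(TIME_PATTERN, SAMPLE)

# The XPath query only matches text inside the element whose id is "nst".
doc = lxml.html.document_fromstring(SAMPLE)
xpath_hits = doc.xpath(NST_XPATH)

print(regex_hits)  # ['7:50:34 am NST', '9:00:00 pm NST']
print(xpath_hits)  # ['7:50:34 am NST']

# If the site renames the id, the XPath query stops matching,
# while the regular expression keeps finding both strings.
changed = SAMPLE.replace('id="nst"', 'id="clock"')
regex_after_change = re.findall(TIME_PATTERN, changed)
xpath_after_change = lxml.html.document_fromstring(changed).xpath(NST_XPATH)

print(xpath_after_change)  # []
```

The regular expression picks up a string we didn't want, while the XPath query returns exactly one. On the other hand, the XPath query is the one that breaks when the markup changes.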
2. Understanding XPath
It's easiest to introduce with an example:
//*[@id="nst"]/text()
It looks rather cryptic at first, but this piece of XPath, when applied to the Neopets index page, will grab the current time in NST. This particular piece actually uses 6 different selection characters (remember I mentioned those in the beginning?). Let's break them down one by one:
- //: The '//' expression says, "Match all elements throughout the entire document." An element of course is the basic building block of HTML (like <b>, <table>, etc.). If you don't understand elements I would advise you to review an HTML tutorial before proceeding. Any identifiers after the '//' identifier will be applied across the whole document. The other version of this identifier is simply one forward slash, '/'. With just one forward slash, it says, "Match elements starting from the root node." In the case of an HTML document, the root node would be the '<html>' element.
- *: The '*' expression says, "Match any element of any type." Unlike the previous identifier, this identifier acts as a wildcard so that every single element, regardless of type, is matched. To better demonstrate, look at these two examples:
//b
and
//*
The first expression will match all bold elements throughout the entire document. The second expression will match every element throughout the entire document, regardless of type.
- [ ]: The brackets mean one of two things depending on usage. In the example we're using, the brackets say, "Look inside of the element." In other words, the identifiers inside the brackets are matched against the element itself (remember, elements can have attributes). This will make more sense with the next identifier. The other use of the brackets is to select one element by position from an expression that returns multiple elements. For example, we can go back to our example for selecting bold elements and modify it to select only the first bold element in the document like so:
(//b)[1]
The parentheses matter here. Without them, //b[1] matches every bold element that is the first bold child of its parent, which can be more than one element.
- @: The '@' identifier is used to match attributes of an element. It is usually followed by the name of the attribute we wish to match (in our case, the 'id' attribute); however, it can also be followed by our '*' identifier to match any attribute.
- =: The '=' identifier is pretty much what a programmer would expect (except usually we use ==): a comparison identifier. It is simply used to compare one thing to another. In our case we're comparing the 'id' attribute to 'nst'. In other words, we're saying: "Match an element whose 'id' attribute is equal to 'nst'."
- text(): This one is built into XPath and is written like a function call (strictly speaking it is a node test, and XPath also has a number of true built-in functions). It is used to grab any text directly within the bounds of an element. In our case it's grabbing the text within the bounds of the element that has an id of 'nst'.
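To see each of these identifiers in action, here is a minimal sketch that runs them against a small made-up page with lxml. The markup, ids, and class names are my own assumptions for illustration and are not taken from the real Neopets page.

```python
import lxml.html

# Hypothetical sample page, not the real Neopets markup.
SAMPLE = """
<html>
  <body>
    <div id="header">
      <b>Welcome</b>
      <span id="nst" class="clock">7:50:34 am NST</span>
    </div>
    <div id="content">
      <b>News</b>
      <b>Events</b>
    </div>
  </body>
</html>
"""

doc = lxml.html.document_fromstring(SAMPLE)

# '//' followed by an element name: every <b> anywhere in the document.
all_bold = [b.text for b in doc.xpath("//b")]
print(all_bold)  # ['Welcome', 'News', 'Events']

# '*' wildcard: every element in the document, regardless of type.
all_tags = [el.tag for el in doc.xpath("//*")]

# '[ ]' with '@' and '=': elements whose id attribute equals "nst".
nst_elements = doc.xpath('//*[@id="nst"]')

# '@*': the values of every attribute on the matched element.
nst_attrs = doc.xpath('//*[@id="nst"]/@*')

# Positional '[ ]': //b[1] is evaluated per parent, (//b)[1] over the whole result.
first_per_parent = [b.text for b in doc.xpath("//b[1]")]
first_overall = [b.text for b in doc.xpath("(//b)[1]")]
print(first_per_parent)  # ['Welcome', 'News']
print(first_overall)     # ['Welcome']

# text(): the text directly inside the matched element.
nst_text = doc.xpath('//*[@id="nst"]/text()')
print(nst_text)  # ['7:50:34 am NST']
```

Note that xpath() always hands back a list, even when only one thing matches, so you'll usually index into the result or check that it isn't empty.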
Of course, this example is nowhere near an exhaustive explanation of all the identifiers XPath has to offer. To get a more in-depth explanation of XPath, refer to the official W3C tutorial.
3. Testing the example
To wrap up this tutorial, we'll go ahead and run the above query as a proof of concept. To do this, I'm going to use the Python 'requests' library, which can be installed with:
pip install requests
Follow the below interactive example to run the test yourself:
>>> import requests
>>> import lxml.html
>>> r = requests.get('http://www.neopets.com/')
>>> d = lxml.html.document_fromstring(r.text)
>>> d.xpath('//*[@id="nst"]/text()')
['7:50:34 am NST']
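If you'd rather keep this in a script, the same steps could be wrapped up roughly like this. The function names parse_nst and fetch_nst are my own, the timeout value is arbitrary, and the sample markup is made up so the parsing part can be tried without hitting the site. Treat it as a sketch under those assumptions.

```python
import lxml.html

NST_XPATH = '//*[@id="nst"]/text()'


def parse_nst(html_text):
    """Return the NST clock text from an HTML string, or None if it isn't found."""
    doc = lxml.html.document_fromstring(html_text)
    matches = doc.xpath(NST_XPATH)
    # xpath() returns a list; it is empty when the id is missing from the page.
    return matches[0].strip() if matches else None


def fetch_nst(url="http://www.neopets.com/"):
    """Download the page and return the NST clock text, or None."""
    import requests  # imported here so parse_nst can be used without requests

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors
    return parse_nst(response.text)


# Hypothetical markup for an offline check, not the real Neopets page.
sample = '<html><body><span id="nst">7:50:34 am NST</span></body></html>'
print(parse_nst(sample))  # 7:50:34 am NST
print(parse_nst("<html><body><p>no clock here</p></body></html>"))  # None
```

Calling fetch_nst() performs the live request. Returning None instead of indexing blindly means the script won't crash with an IndexError if the page changes and the query stops matching.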
And that's how you use XPath to get the current time in NST!
Edited by Josh, 02 November 2014 - 08:26 AM.