[Tutorial] Parsing with Python and XPath



#1 Josh

  • 318 posts

Posted 02 November 2014 - 08:22 AM

Objective: Learn how to get the current time in NST using XPath

Author: Josh

Last Revised: 2 November, 2014

 

1. XPath

 

XPath (XML Path Language) is a language for querying XML documents (or, in our case, HTML documents). Expressions are written by referencing the DOM (Document Object Model) along with a few key selection characters.

 

Compared to regular expressions, XPath is stricter: an XPath expression is more likely to break when the referenced document changes (Neopets updating their HTML, for example). The upside of that strictness is that you have much better control over what you're actually trying to parse, and you reduce the chance of matching multiple strings when you're only expecting one.
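To make that trade-off concrete, here's a minimal sketch; the HTML fragment and the regex below are made up purely for illustration (this is not Neopets' actual markup). A careless, greedy regex happily matches across two elements, while the XPath query we'll build later in this tutorial only returns the element we ask for:

>>> import re
>>> import lxml.html
>>> # invented markup, for illustration only
>>> html = '<div id="nst">7:50:34 am NST</div><div id="other">not a time</div>'
>>> re.findall(r'<div[^>]*>(.*)</div>', html)      # greedy regex matches too much
['7:50:34 am NST</div><div id="other">not a time']
>>> lxml.html.document_fromstring(html).xpath('//*[@id="nst"]/text()')
['7:50:34 am NST']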

 

Python does not have a native implementation of XPath. Instead, for the duration of this tutorial we'll be using the widely popular library called 'lxml.' To install it:

pip install lxml

If you don't have pip, go get it

 

2. Understanding XPath

 

It's easiest to introduce with an example:

//*[@id="nst"]/text()

Looks rather cryptic at first, but when applied to the Neopets index page, this piece of XPath will grab the current time in NST. This particular expression actually uses six different selection characters (remember I mentioned those in the beginning?). Let's break them down one by one; there's a short runnable sketch after this list:

  • //: The '//' expression says, "Match all elements throughout the entire document." An element, of course, is the basic building block of HTML (like <b>, <table>, etc.). If you don't understand elements, I would advise you to review an HTML tutorial before proceeding. Any identifiers after the '//' identifier will be applied across the whole document. The other version of this identifier is simply one forward slash, '/'. With just one forward slash, it says, "Match all elements starting from the root node." In the case of an HTML document, the root node would be the '<html>' element.
  • *: The '*' expression says, "Match any element of any type." Unlike the previous identifier, this identifier acts as a wildcard so that every single element, regardless of type, is matched. To better demonstrate, look at these two examples:
    //b
    and
    //*
    The first expression will match all bold elements throughout the entire document. The second expression will match every element throughout the entire document, regardless of type.
  • [ ]: The brackets mean one of two things depending on usage. In our example, the brackets hold a condition (a predicate) that an element has to satisfy in order to be matched; in other words, the identifiers inside the brackets are matched against things inside the HTML element (remember, elements can have attributes). This will make more sense with the next identifier. The other use of the brackets is to select one element from an expression that returns multiple elements. For example, we can go back to our example for selecting bold elements and modify it to select only the first bold element, like so:
    //b[1]
  • @: The '@' identifier is used to match attributes of an element. This identifier is usually followed by the attribute name we wish to match (which in our case is the 'id' attribute); however, it can also be followed by our '*' identifier to match any attribute.
  • =: The '=' identifier is much what a programmer would expect (except we usually use ==): a comparison identifier. It is simply used to compare one thing to another. In our case we're comparing the 'id' attribute to 'nst'. In other words, we're saying: "Match an element whose 'id' attribute is equal to 'nst'."
  • text(): This is actually a built-in XPath function. XPath has several built-in functions; this particular one grabs any text within the bounds of an element. In our case it's grabbing the text within the bounds of the element whose id is 'nst'.
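Here's the promised sketch showing these identifiers in action without hitting Neopets. The HTML fragment is invented purely for illustration; only the selectors mirror the ones described above:

>>> import lxml.html
>>> # invented fragment, purely to demonstrate the identifiers above
>>> html = '<html><body><b>first bold</b><b>second bold</b><span id="nst">7:50:34 am NST</span></body></html>'
>>> doc = lxml.html.document_fromstring(html)
>>> [el.tag for el in doc.xpath('//*')]      # '*' matches every element
['html', 'body', 'b', 'b', 'span']
>>> doc.xpath('//b/text()')                  # all bold elements in the document
['first bold', 'second bold']
>>> doc.xpath('//b[1]/text()')               # [1] is evaluated per parent; both <b> share one parent here
['first bold']
>>> doc.xpath('//*[@id="nst"]/text()')       # any element whose id attribute equals 'nst'
['7:50:34 am NST']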

Of course, this example is nothing close to an exhaustive explanation of all the identifiers XPath has to offer. To get a more in-depth explanation of XPath, refer to the official W3C documentation.

 

3. Testing the example

 

To wrap up this tutorial, we'll go ahead and run the above query as a proof of concept. To do this, I'm going to use the Python 'requests' library, which can be installed with:

pip install requests

Follow the below interactive example to run the test yourself:

>>> import requests
>>> import lxml.html
>>> r = requests.get('http://www.neopets.com/')
>>> d = lxml.html.document_fromstring(r.text)
>>> d.xpath('//*[@id="nst"]/text()')
['7:50:34 am NST']

And that's how you use XPath to get the current time in NST!
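If you want to reuse this outside the interactive prompt, here's a minimal sketch that wraps the same query in a function. The function name, the timeout, and the error check are my own additions for illustration, not part of the example above:

import requests
import lxml.html

def get_nst_time():
    """Return the current NST time string from the Neopets index page, or None."""
    r = requests.get('http://www.neopets.com/', timeout=10)
    r.raise_for_status()  # bail out on HTTP errors instead of parsing an error page
    doc = lxml.html.document_fromstring(r.text)
    matches = doc.xpath('//*[@id="nst"]/text()')
    # xpath() always returns a list; the index page normally yields exactly one match
    return matches[0] if matches else None

if __name__ == '__main__':
    print(get_nst_time())  # e.g. '7:50:34 am NST'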


Edited by Josh, 02 November 2014 - 08:26 AM.


#2 jorrakay

  • 7 posts

Posted 01 February 2015 - 08:29 PM

Dude how about using BeautifulSoup like a sane pythonista.
 
>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://www.neopets.com/')
>>> page = BeautifulSoup(r.text, 'html.parser')
>>> # now, promise me you'll never write xpath again
>>> page.select('#nst')[0].text
'7:50:34 am NST'
If you're trying to teach beginners I think this is MUCH easier to wrap your mind around. With BeautifulSoup you can use CSS selectors like I illustrated above, so you can just use the developer console in your browser to get a CSS selector for a certain tag or DOM object and then be able to use that info--much simpler than trying to mess around with XPaths. And if you're worried about performance, you can specify a number of different parsers for BS to use on the back end, including lxml. I'm not sure what the default is but it's not like it's slow for most purposes.
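For what it's worth, here's a minimal sketch of what swapping backends looks like; 'html.parser' ships with Python, while 'lxml' is only available if the lxml package is installed:

import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.neopets.com/')

# same CSS selector, two different parser backends
for parser in ('html.parser', 'lxml'):
    page = BeautifulSoup(r.text, parser)
    nst = page.select_one('#nst')
    print(parser, nst.get_text() if nst else None)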

Edited by assertivist, 01 February 2015 - 08:30 PM.


#3 Josh

  • 318 posts

Posted 01 February 2015 - 09:21 PM

Dude how about using BeautifulSoup like a sane pythonista.
 

>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://www.neopets.com/')
>>> page = BeautifulSoup(r.text, 'html.parser')
>>> # now, promise me you'll never write xpath again
>>> page.select('#nst')[0].text
'7:50:34 am NST'
If you're trying to teach beginners I think this is MUCH easier to wrap your mind around. With BeautifulSoup you can use CSS selectors like I illustrated above, so you can just use the developer console in your browser to get a CSS selector for a certain tag or DOM object and then be able to use that info--much simpler than trying to mess around with XPaths. And if you're worried about performance, you can specify a number of different parsers for BS to use on the back end, including lxml. I'm not sure what the default is but it's not like it's slow for most purposes.

 

 

You just came into a tutorial clearly labeled as an XPath tutorial and insulted the writer as stupid for writing about XPath. To make it worse, you spent your one and only post doing that. Wow.

 

One advantage I like is that an XPath query is a single string that can be used and shared accordingly. That means I can write one library in Python and export 90% of the work to another language just by transferring the XPath strings (very powerful for cross-compatibility and great for sharing in a community setting like here on Neocodex).

 

Also, XPath is more widely used by developers than BeautifulSoup and lxml, as almost every modern language provides an interface for it (thus virtually future-proofing any work done with it). I'd bet my money on an entire cross-language ecosystem rather than on one library written by a handful of developers.

 

Finally, noting proficiency with XPath on a resume will undoubtedly go further than putting 'I know how to use BeautifulSoup.' (See this Q&A for more information on that.) The aforementioned link actually only advises against XPath while you're first learning, as XPath is used widely in various industries (Google being one of them, and the source of that information being me, as I work for Google :) ).

 

You can use whatever you want; choosing one or the other doesn't define a person's sanity. This is just one of many examples of how to parse HTML with Python. There are advantages and disadvantages to each side, but I have a feeling the insulting tone of your post was meant more to start a fight than anything else, so I will back out of this conversation now as I'm not interested in playing the e-peen game.

 

*EDIT* It's also worth noting that the word 'beginner' only showed up in your post. You've fabricated an entire argument around nothing, as my original post did not state it was designed for beginners.


Edited by Josh, 01 February 2015 - 09:37 PM.


#4 jorrakay

  • 7 posts

Posted 01 February 2015 - 09:41 PM

Chill out bro. I only need to make one thing clear:

 

 

parse HTML with python

 

XML black magic is clearly not the best way to do this. And CSS selectors, you know, like in my example, are another one of those "entire cross-language ecosystem" things. I'm just trying to help someone looking to parse HTML in Python. Maybe someone who isn't a masochist  :sarcasm_re:

 

Yeah it was my first post, thanks for welcoming me to the community Josh.  :rawr:



#5 Josh

  • 318 posts

Posted 02 February 2015 - 06:45 PM

Chill out bro. I only need to make one thing clear:

 

 

 

XML black magic is clearly not the best way to do this. And CSS selectors, you know, like in my example, are another one of those "entire cross-language ecosystem" things. I'm just trying to help someone looking to parse HTML in Python. Maybe someone who isn't a masochist  :sarcasm_re:

 

Yeah it was my first post, thanks for welcoming me to the community Josh.  :rawr:

 

CSS selectors don't have magical parsing powers. BeautifulSoup is what does the parsing, and CSS selectors (which, by the way, are not present on a large majority of elements on Neopets) are only used to select from the parsed result, not the other way around. BeautifulSoup is also not inherently cross-language, and it is not consistent in its parsing behavior (especially since multiple backends can be used), which can lead to unpredictable results (even with CSS selectors). XPath obviously doesn't have this problem as it's a W3C standard.

 

If you come into the community trolling with your first post then you shouldn't be surprised to be treated like someone who came to the community and trolled with their first post.


Edited by Josh, 02 February 2015 - 06:50 PM.


