XPath is a technology that uses path expressions to select nodes or node-sets in an XML document (or, in our case, an HTML document). Although XPath is not a programming language in itself, it lets you write expressions that directly reach a specific HTML element without having to walk through the entire HTML tree.
It looks like the perfect tool for web scraping, right? At ScrapingBee we love XPath!
In our previous article about web scraping with Python, we talked a little bit about XPath expressions. Now it's time to dig a bit deeper into this subject.
Why learn XPath
- Knowing how to use basic XPath expressions is a must-have skill when extracting data from a web page.
- It's more powerful than CSS selectors
- It allows you to navigate the DOM in any direction
- Can match text inside HTML elements
Entire books have been written on XPath, and I don't pretend to explain everything in depth. This is an introduction to XPath, and we will see through real examples how you can use it for your web scraping needs.
But first, let's talk a little about the DOM.
Document Object Model
I am going to assume you already know HTML, so this is just a small reminder.
As you already know, a web page is a document containing text wrapped in tags, which add meaning to the document by describing elements like titles, paragraphs, lists, links, etc.
Let's see a basic HTML page, to understand what the Document Object Model is.
This HTML code is basically HTML content encapsulated inside other HTML content. The HTML hierarchy can be viewed as a tree. We can already see this hierarchy through the indentation in the HTML code.
When your web browser parses this code, it will create a tree which is an object representation of the HTML document. It is called the Document Object Model.
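To make the tree idea concrete, here is a minimal Python sketch that parses a small page and prints its element hierarchy. The page is a made-up sample, not taken from any real site, and it is written as well-formed XHTML so that the standard library's xml.etree.ElementTree can parse it. Real-world HTML is rarely this tidy and usually needs a more forgiving parser such as lxml.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page (an assumption for illustration only).
PAGE = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <section>
      <p>Some text</p>
      <details>More text</details>
      <button>Click</button>
    </section>
    <ul>
      <li>One</li>
      <li>Two</li>
      <li>Three</li>
    </ul>
  </body>
</html>
"""


def outline(element, depth=0):
    """Return one indented line per element, mirroring the tree structure."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(PAGE.strip())
tree_lines = outline(root)
print("\n".join(tree_lines))
```

Each level of indentation in the printed outline corresponds to one level of nesting in the markup, which is the same hierarchy the browser exposes as the DOM.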
Below is the internal tree structure inside the Google Chrome inspector:
On the left we can see the HTML tree, and on the right we have the JavaScript object representing the currently selected element (in this case, the <p> tag), with all its attributes.
The important thing to remember is that the DOM you see in your browser when you right-click + inspect can be really different from the actual HTML that was sent. Maybe some JavaScript code was executed and dynamically changed the DOM! For example, when you scroll through your Twitter feed, your browser sends a request to fetch new tweets, and some JavaScript code dynamically adds those new tweets to the DOM.
XPath Syntax
First, let's look at some XPath vocabulary:
• In XPath terminology, as with HTML, there are different types of nodes: root nodes, element nodes, attribute nodes, and so-called atomic values, which is a synonym for text nodes in an HTML document.
• Each element node has one parent. In this example, the section element is the parent of p, details and button.
• Element nodes can have any number of children. In our example, li elements are all children of the ul element.
• Siblings are nodes that have the same parent. p, details and button are siblings.
• Ancestors: a node's parent, its parent's parent, and so on.
• Descendants: a node's children, its children's children, and so on.
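These relationships can be explored with a few lines of Python. The sketch below uses a hypothetical fragment containing the same element names mentioned above (section, p, details, button, ul, li) and the standard library's xml.etree.ElementTree. ElementTree elements do not keep a reference to their parent, so the parent lookup is built by hand; that helper map is an assumption of this sketch, not something XPath requires.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment, written as well-formed XML for the stdlib parser.
DOC = ET.fromstring(
    "<body>"
    "<section><p>text</p><details>more</details><button>ok</button></section>"
    "<ul><li>a</li><li>b</li></ul>"
    "</body>"
)

# Build a child -> parent map, since elements only know their children.
parent_of = {child: parent for parent in DOC.iter() for child in parent}

p = DOC.find(".//p")
parent = parent_of[p]                                # the section element
siblings = [el.tag for el in parent if el is not p]  # same parent, excluding p
children_of_ul = [li.text for li in DOC.find("ul")]  # direct children only


def ancestors(element):
    """Walk up the parent map and return tag names, nearest ancestor first."""
    chain = []
    while element in parent_of:
        element = parent_of[element]
        chain.append(element.tag)
    return chain


# Every element below the root, in document order.
descendants = [el.tag for el in DOC.iter() if el is not DOC]

print(parent.tag, siblings, children_of_ul, ancestors(p), descendants)
```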
There are different types of expressions to select a node in an HTML document. Here are the most important ones:
XPath expression | Description |
---|---|
nodename | The simplest one: selects all nodes with this name |
/ | Selects from the root node (useful for writing absolute paths) |
// | Selects matching nodes at any depth below the current node (or anywhere in the document when the expression starts with //) |
. | Selects the current node |
.. | Selects the current node's parent |
@ | Selects an attribute |
* | Matches any element node |
@* | Matches any attribute node |
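Here is a rough illustration of these expressions in Python, assuming a tiny made-up document. It uses xml.etree.ElementTree from the standard library, which only implements a subset of XPath: paths are always relative to an element (hence the leading dot), and @ can only appear inside a predicate, so attribute values are read with get() instead. A full XPath engine such as the one in lxml does not have these restrictions.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; element names and hrefs are invented for the example.
root = ET.fromstring(
    "<html><body>"
    "<div class='product'><a href='/item/1'>First</a></div>"
    "<div><a href='/item/2'>Second</a></div>"
    "</body></html>"
)

children_named_body = root.findall("body")  # nodename: direct children called body
all_links = root.findall(".//a")            # //: a elements at any depth
current = root.find(".")                    # . : the current node itself
link_parents = root.findall(".//a/..")      # ..: the parent of each a element
with_class = root.findall(".//*[@class]")   # * and @: any element with a class attribute

# Attribute values are read from the matched elements.
hrefs = [a.get("href") for a in all_links]

print([a.text for a in all_links], hrefs, [d.tag for d in link_parents])
```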
You can also use predicates to find a node that contains a specific value. Predicates are always in square brackets: [predicate]
Here are some examples:
XPath expression | Description |
---|---|
//li[last()] | Selects the last li element within each parent |
//div[@class='product'] | Selects all div elements whose class attribute is exactly product |
//li[3] | Selects the third li element within each parent (the index starts at 1) |
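The predicates from the table can be tried out in Python. This is a minimal sketch on a made-up fragment, again using the standard library's xml.etree.ElementTree, which supports positional predicates, last() and attribute comparisons. The leading dot is required because ElementTree evaluates paths relative to an element.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; the list items and class names are invented.
root = ET.fromstring(
    "<body>"
    "<ul><li>one</li><li>two</li><li>three</li><li>four</li></ul>"
    "<div class='product'>Laptop</div>"
    "<div class='banner'>Sale</div>"
    "<div class='product'>Phone</div>"
    "</body>"
)

last_item = root.findall(".//li[last()]")                # last li of its parent
third_item = root.findall(".//li[3]")                    # third li, counting from 1
products = root.findall(".//div[@class='product']")      # exact attribute match

print([li.text for li in last_item])
print([li.text for li in third_item])
print([div.text for div in products])
```

Note that positions are counted among the siblings with the same name, so a page with several lists would return one match per list.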
Now we will see some examples of XPath expressions. We can test XPath expressions inside Chrome DevTools, so it is time to fire up Chrome.
To do so, right-click on the web page -> Inspect, then press cmd + f on a Mac or ctrl + f on other systems. You can then enter an XPath expression, and the match will be highlighted in DevTools.
Tip
In DevTools, you can right-click on any DOM node and copy its full XPath expression, which you can then simplify.
XPath with Python
There are many Python packages that allow you to use XPath expressions to select HTML elements like lxml, Scrapy or Selenium. In these examples, we are going to use Selenium with Chrome in headless mode. You can look at this article to set up your environment: Scraping Single Page Application with Python
E-commerce product data extraction
In this example, we are going to see how to extract e-commerce product data from eBay.com with XPath expressions.
In these three XPath expressions, we are using // as an axis, meaning we are selecting nodes anywhere in the HTML tree. Then we are using a predicate [predicate] to match on specific IDs. IDs are supposed to be unique, so it's not a problem to do this.
But when you select an element by its class name, it's better to use a relative path, because the class name can be used anywhere in the DOM, so the more specific you are, the better. Not only that, but when the website changes (and it will), your code will be much more resilient.
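The difference between matching on an ID and matching on a class can be sketched as follows. Everything in this snippet is an assumption made for illustration: the page fragment, the ids (title, price, picture) and the class name (value) are invented and do not reflect eBay's real markup, and the parsing is done with the standard library's xml.etree.ElementTree rather than Selenium so that the example runs without a browser.

```python
import xml.etree.ElementTree as ET

# Hypothetical product page; ids and class names are invented.
PAGE = """
<html><body>
  <div id="product">
    <h1 id="title">Mechanical keyboard</h1>
    <span id="price">US $49.99</span>
    <img id="picture" src="/img/keyboard.jpg" />
    <ul class="specs"><li class="value">Wired</li></ul>
  </div>
  <div id="footer"><span class="value">unrelated</span></div>
</body></html>
"""

root = ET.fromstring(PAGE.strip())

# IDs are meant to be unique, so searching the whole tree is safe.
title = root.find(".//*[@id='title']").text
price = root.find(".//*[@id='price']").text
image = root.find(".//*[@id='picture']").get("src")

# Class names can repeat anywhere, so anchor on a known container first
# and evaluate the class lookup relative to it.
container = root.find(".//div[@id='product']")
scoped_values = [el.text for el in container.findall(".//*[@class='value']")]
global_values = [el.text for el in root.findall(".//*[@class='value']")]

print(title, price, image)
print(scoped_values, global_values)
```

The scoped lookup only returns the value inside the product block, while the global one also picks up the unrelated element in the footer.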
Automagically authenticate to a website
When you have to perform the same action on several websites, or extract the same type of information from them, we can be a little smarter with our XPath expressions and write generic ones instead of a specific XPath for each website.
In order to explain this, we're going to make a “generic” authentication function that will take a Login URL, a username and password, and try to authenticate on the target website.
To auto-magically log into a website with your scrapers, the idea is:
GET /loginPage
Select the first <input type='password'> tag
Select the first <input> before it that is not hidden
Set the value attribute for both inputs
Select the enclosing form and click on the submit button.
Most login forms will have an <input type='password'> tag. So we can select this password input with a simple: //input[@type='password']
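Once we have this password input, we can use a relative path to select the username/email input. It will generally be the first preceding input that isn't hidden: .//preceding::input[not(@type='hidden')]
Because preceding:: can match several inputs, appending [1] to the expression keeps only the one closest to the password field.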
It's really important to exclude hidden inputs, because most of the time you will have at least one hidden input holding a CSRF token. CSRF stands for Cross-Site Request Forgery. The token is generated by the server and is required in every form submission / POST request. Almost every website uses this mechanism to prevent CSRF attacks.
Now we need to select the enclosing form from one of the inputs:
.//ancestor::form
And with the form, we can select the submit input/button:
.//*[@type='submit']
Here is an example of such a function:
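The sketch below is not a Selenium listing. It is a browser-free stand-in that only illustrates the selection logic on a made-up login page, using the standard library's xml.etree.ElementTree. Since ElementTree does not support the preceding:: and ancestor:: axes, both are emulated in plain Python; with Selenium, you would pass the XPath expressions shown earlier to the driver instead. The page, the function name find_login_fields and the field names are all assumptions of this example.

```python
import xml.etree.ElementTree as ET

# Made-up login page, written as well-formed XHTML for the stdlib parser.
LOGIN_PAGE = """
<html><body>
  <form action="/search"><input type="text" name="q" /></form>
  <form action="/login" method="post">
    <input type="hidden" name="csrf_token" value="abc123" />
    <input type="text" name="email" />
    <input type="hidden" name="next" value="/home" />
    <input type="password" name="password" />
    <button type="submit">Sign in</button>
  </form>
</body></html>
"""


def find_login_fields(root):
    """Locate the login form with its username, password and submit elements.

    Returns a dict of elements, or None when no usable login form is found.
    """
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}

    # Same idea as //input[@type='password']
    password = root.find(".//input[@type='password']")
    if password is None:
        return None

    # Stand-in for preceding::input[not(@type='hidden')][1]: walk the inputs
    # in document order and keep the last visible one before the password.
    username = None
    for candidate in root.iter("input"):
        if candidate is password:
            break
        if candidate.get("type") != "hidden":
            username = candidate

    # Stand-in for ancestor::form: climb the parent map until we reach a form.
    form = password
    while form is not None and form.tag != "form":
        form = parent_of.get(form)
    if form is None:
        return None

    # Same idea as .//*[@type='submit'], evaluated relative to the form.
    submit = form.find(".//*[@type='submit']")
    return {"form": form, "username": username,
            "password": password, "submit": submit}


fields = find_login_fields(ET.fromstring(LOGIN_PAGE.strip()))
print(fields["form"].get("action"), fields["username"].get("name"))
```

In this sample the hidden CSRF and redirect inputs are skipped, and the search box in the first form is ignored because a visible input sits closer to the password field.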
Of course, it is far from perfect and won't work everywhere, but you get the idea.
Conclusion
XPath is very powerful when it comes to selecting HTML elements on a page, and often more powerful than CSS selectors.
One of the most difficult tasks when writing XPath expressions is not the expression itself, but making it precise enough to select the right element while keeping it resilient enough to survive DOM changes.
At ScrapingBee, depending on our needs, we use XPath expressions or CSS selectors for our ready-made APIs. We will discuss the differences between the two in another blog post!
I hope you enjoyed this article. If you're interested in CSS selectors, check out this BeautifulSoup tutorial.
Happy Scraping!