The joys of XPath and SEO Tools for Excel. Here we are going to talk through how to scrape the HREF attribute using the =XPathOnURL function, as this is often needed when you want to scrape the actual links from a website that match certain criteria.
If you just want the quick answer, here is how you do it:
=XPathOnURL("http://www.example.com", "//a", "href")
Well, something like that anyway, as it depends on the XPath you need to use to get the links you want. One thing to note about the XPathOnURL function is that it doesn’t work quite the same as standard XPath, but this is explained a little later.
Firstly, if you want to scrape the HREF attribute then you may be able to do this much more quickly using the Google Chrome plugin called XPath Helper, but that isn’t always the case.
I came across an example recently where I needed to scrape one HREF attribute on around 100 pages, so the XPath Helper plugin wouldn’t quite do it. In the setup I was working with, there were a lot of pages from which I needed to scrape some data, particularly one specific HREF attribute, as shown in the example below:
In the example above I want to scrape a link to an image file which just happens to be within the first div on the page (in this fictitious example!).
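Normally, if I wanted to do this via standard XPath, I would use the XPath //div/a/@href, which is saying “get the HREF attribute which is contained within an A tag which is contained within a DIV tag”. Note that the similar-looking //div/a[@href] means something slightly different: it selects the A tags that have an HREF attribute, not the attribute values themselves.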
When using the XPathOnURL function within SEO Tools, this doesn’t quite work in the same way. If you want to pull back an attribute instead of the content between the opening and closing tags, point the XPath at the tag itself (//div/a) and add an extra parameter to the function, "href", which tells the function to pull back the HREF attribute instead. For example: =XPathOnURL("http://www.example.com", "//div/a", "href")
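If it helps to see the idea outside of Excel, here is a rough Python sketch of the same "XPath for the tag, separate parameter for the attribute" pattern. To be clear, this is only an analogy and not how SEO Tools works internally: the extract helper, the PAGE snippet and its URLs are all made up for illustration, and Python's built-in ElementTree only understands a subset of XPath and needs well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment (not the page from the example above).
PAGE = """
<html>
  <body>
    <div>
      <a href="/images/photo.jpg">View image</a>
    </div>
    <div>
      <a>No link here</a>
      <a href="/about">About</a>
    </div>
  </body>
</html>
""".strip()


def extract(page, xpath, attribute=None):
    """Illustrative helper: select elements with `xpath`, then return either
    their text or, if `attribute` is given, the value of that attribute."""
    root = ET.fromstring(page)
    results = []
    for element in root.findall(xpath):
        if attribute is None:
            # No attribute requested: return the content between the tags.
            results.append((element.text or "").strip())
        else:
            # Attribute requested: skip elements that do not carry it.
            value = element.get(attribute)
            if value is not None:
                results.append(value)
    return results


# Without the extra parameter we get the link text...
link_texts = extract(PAGE, ".//div/a")
# ...and with it we get the HREF values instead.
hrefs = extract(PAGE, ".//div/a", "href")

print(link_texts)
print(hrefs)
```

The XPath picks out the A tags, and the optional third argument decides whether you get the text between the tags or the attribute value, which is roughly the role the extra "href" parameter plays in XPathOnURL.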
I am sure that you will come across a need for this at some point – especially if doing a lot of scraping!
That is all of the explanation I am going to do here. Go and give it a go yourself if you ever need to scrape the HREF attribute using XPathOnURL.