Select Page

How To Get The Domain Name From a URL in Excel

Often when analysing backlinks it is important to de-duplicate rows and rows of data into some kind of useful list of domains that you can work with. Often this can be quite a tiresome tasks but with a few Excel formulas it is possible to get the domain name from a long URL.

If you are just looking for the quick answer on how to get the domain name from a cell that contains a URL in Excel then it is;

 

=MID(A1, FIND(“//”, A1)+2, FIND(“/”, A1, 10)-8)

=MID({CELL OF FULL URL}, FIND(“//”, {CELL OF FULL URL})+2, FIND(“/”, {CELL OF FULL URL}, 10)-8)

 

 

 

If you are looking for a bit more of an understanding then read on a little further…

 

Explanation of Excel Formula

Lets break down the formula into the key components the finally join them all together into the full formula.

 

MID(A1, … , … )

The MID formula in Excel is translated as =MID(text, start_number, number_of_characters) and is designed to create a sub-string of the original string. This can be very handy if you know exactly what position you want to start at and how many characters you want to put into the sub-string, although domain names are different lengths which doesn’t help much for doing this on scale.

If we use the previous blog post as an example to illustrate this then we could use the formula  =MID(A1, 8, 24);

 

 

While this is great if wanting to get the domain name from lots of URLs on the same domain (although alternatively you could just type the domain out on your keyboard….) – this formula falls short when you are trying to get the domain name from different URLs from different domains as shown below;

 

 

Half of the domain name is missing…Not very useful…

Fortunately we can group this together with other formulas to figure out where the start position should be and how many characters should be included within the sub-string.

 

FIND(“//”, A1)+2

The next bit to the formula is figuring out where to start the sub-string. To do this we use the FIND function which as you may have guessed….finds something. So if we take the example FIND(“//”, A1)then what this formula is doing is finding the first occurrence of the text// within cell A1 – in this case, it is finding where the http:// bit ends. Why add 2? Because the FIND function in Excel identifies the starting point opposed to the end point as in the example below;

 

 

The 6th character in the domain name in the example above is the first forward slash. So if we used that in combination with the MID formula then we would end up with something like//www.michaelcropper.co.uk being returned which contains the leading slash which isn’t correct. So if we add 2 (as in two characters to the right) then the returned result ends up beingwww.michaelcropper.co.uk instead which is what we are looking for.

 

 FIND(“/”, A1, 10)-8

The next bit is figuring out the number of characters that should be used in the sub-string. This is using the same FIND function as in the previous step but using an additional argument within the formula (the 10). The FIND function within Excel works as follows: =FIND(sub-string, string, optional_start_position). So in this example, the 10 represents where the FIND function should begin looking for the first instance of the forward slash.

Since you cannot have a forward slash in the domain name itself and the two initial forward slashes within the http:// are at a maximum position of 8 characters along, then the first forward slash after this point has to be the end of the domain name and the start of the URI.

Why -8 characters? Because the full length of http:// is 8 characters and I am not that interested in that for this purpose so I get rid of it. If you want to keep this in then just remove the -8 bit of the formula.

 

Joining it all together

So while the initial formula may seem a little daunting at first, when you break it down into its smaller components it is not too difficult to figure out what is going on. If you want to think of the full function as a sentence it can be described as;

 

=MID(A1, FIND(“//”, A1)+2, FIND(“/”, A1, 10)-8)

 

Get me the domain name from the full URL which is contained within cell A1. Start the sub-string after the first occurrence of two forward slashes together and end the sub-string at the first occurrence of a forward slash after the first 10 characters.

 

Real world example

An example of how this can be used is with competitor backlink analysis to quickly identify all of the unique domains that are linking to your competitors websites but not your own. There will be a blog post to follow on this as it is a little more in-depth for what this post is looking at, but the Excel functions described here about how to get the domain name from a full URL can certainly help speed up that process massively.

How To Return Multiple Nodes Using XPathOnUrl

XPathOnURL is an amazing function within the amazing SEO Tools plugin for Excel that can scrape specific pieces of content that you need when entering in a URL and the XPath of the content that you want to retrieve. This post isn’t going to be a quite to XPathOnURL, nor is it going to be a guide to XPath – instead what it is looking at is a way to retrieve multiple pieces of data when using XPathOnURL. If you are interested in playing around with XPath then download the XPath Helper plugin for Google Chrome that will help you get the XPath of the items you want to scrape.

By default the XPathOnURL function within the SEO Tools plugin only brings back the first occurrence of the item within the DOM (Document Object Model – i.e the HTML of the page..). As an example lets take my blogroll;

 

 

When using the XPath Helper plugin for Google Chrome, and tweaking the XPath a little to get what I need, you can see that the XPath present is showing that there are four links present within the blogroll;

 

 

All sounds good right. Well not quite, since when you copy this XPath into the XPathOnURL formula as part of SEO Tools then you will only get back the first URL instead of all four of them. Note, you cannot simply copy the exact XPath as shown in the screenshot above since you need to do something slightly different to scrape the HREF attribute using XPathOnURL – luckily I wrote a blog post just about that a while ago!

For the purpose of this blog post, I will use the XPath version for use with XPathOnURL without further explanation about that – read the previous blog post if this doesn’t make sense.

Anyways, back to the point. So when trying to scrape all four of the blogroll links using XPathOnURL you would think it would be as simple as entering the correct formula right? Unfortunately not. Blow is what happens when you try and do this;

 

 

While it is useful to get something back, it isn’t quite what I wanted. Instead I wanted all of the links to come back with this query which is more useful than just the first one. Depending on what you are scraping this can be a bit of a pain if there are quite a few pieces of content you want to scrape.

Luckily there is another function within SEO Tools for Excel which is the StringJoin(“,”, “……”) function (information) which helpfully joins together data and separates them by whatever you want – in this case, I have chosen a comma to separate the results by.

How this function can be used in conjunction with XPathOnURL is that you simple wrap it around the previous formula shown above and tell Excel to separate the results using a comma as seen in the example below;

 

 

This is a little more useful! So now we have all of the scraped data that we need. From here you can easily count the number of occurrences of a comma in the cell which will tell you how many pieces of data you have scraped. Or alternatively you could use the handy text to columns tool that is built into Excel by default to split out the data into separate columns (if you would prefer it in Rows instead then use the Transposetool)

On a small scale like this it is easy enough to do this manually. But recently I had need to do this for around 500 URLs each with an unknown amount of ‘bits’ on that I needed to count. Life is too short to be doing that kind of stuff manually so found a nice quick way of doing this using a few formulas and a bit of Excel magic. It turned out that the results were over 6500 ‘bits’ which would have taken me an awful long time to count manually!

So hopefully this formula can help save you some time when trying to return multiple nodes using XPathOnURL.

 

As a final note, as I was writing this post on another computer (and creating the formulas and screenshots for this post etc.), I noticed that the latest version of SEO Tools for Excel actually does this out of the box automatically. So if you don’t want to be doing all of this work manually then maybe it is just easier to upgrade to the latest version of the plugin :-) Below is what it looks like when using the latest version with all of the pieces of data you have scraped separated by a semi-colon ;

 

 

Thought I would carry on writing the post anyway as it would have saved me a bit of time earlier if I came across this post as I didn’t realise I wasn’t running the latest version!

How to Scrape the HREF Attribute Using XPathOnURL SEO Tools

The joys of XPath and SEO Tools for Excel. Here we are going to talk through how to scrape the HREF attribute using the =XPathOnURLfunction as this is often needed when you want to scrape the actual links from a website which match a certain criteria.

If you just want the quick answer, here is how you do it;

 

=XPathOnURL(“http://www.example.com”, “//a”, “href”)

 

Well something like that anyways, as it depends on the actual XPath you need to use to get the actual links you need. One thing to note about the XPathOnURL function is that it doesn’t work quite the same as standard XPath but this will be explained a little later.

Firstly if you want to scrape the HREF attribute then you may actually be able to do this much quicker using the Google Chrome plugin calledXPath Helper, but that isn’t always the case.

 

Example

I came across an example recently when I needed to scrape one HREF attribute on around 100 pages, so the XPath Helper plugin wouldn’t quite do it. Below shows the setup that I was working with whereby there was a lot of pages where I needed to scrape some data, particularly one specific HREF attribute as is shown in the example below;

 

 

In the example above I am wanting to scrape a link to an image file which just happens to be within the first Div on the page (in this fictitious example!).

Normally if I wanted to do this via standard XPath then I would use the XPath of: //div/a[@href] – which is saying “get the HREF attribute which is contained within an A tag which is contained within a DIV tag.

When using the XPathOnURL function within SEO Tools then this doesn’t quite work in the same way. Instead if you want to pull back an attribute instead of the content between the opening and closing tags, then you need to add the extra parameter within the function which is: , “href” – which is telling the function to pull back the HREF attribute instead.

I am sure that you will come across a need for this at some point – especially if doing a lot of scraping!

That is all of the explanation I am going to do here. Go and give it a go yourself if you ever need to scrape the HREF attribute using XPathOnURL :-)

How to VLOOKUP Using Partial Match

If you have ever used the VLOOKUP function within Excel before and tried the “Approximate Match” type then you will have realised that this is about as useful as a chocolate tea cup. Below shows how you can easily and quickly perform a VLOOKUP using a partial match on the look up value.

For those of you who just want the answer quickly then here is the formula to VLOOKUP on a partial match;

 

=VLOOKUP(CONCATENATE(“*”, A2, “*”), ‘Tab2’!$A$2:$A$100, 1, FALSE)

 

So what does all of that mean?

The VLOOKUP formula is described as

 

=VLOOKUP({lookup this value or string}, {within this range of data}, {bring back this column number}, {TRUE (approximate match) or FALSE (exact match)}

 

So as an example if you have the following data to play around with;

 

Tab 1

Cell A2: “Jim”

 

Tab 2

Cell A2: “Jim Bob”

 

Then if you run the formula “=VLOOKUP(A2,  ‘Tab2’!$A$2:$A$100), 1, FALSE)” which is aiming to find the string “Jim” within the range of data in the second tab with an exact match. Since this doesn’t exist within the range of data in its exact form then this will return “#N/A” as the result since it cannot be found.

If you try the exact same method but use a partial match instead then you could get any number of results returned depending on the size of the range of data you are working with. I am not going to cover the details about why the partial match on VLOOKUPs using “True” as the match type is a waste of time in this blog post, just trust me – it is awful, I have never found a reason why this would exist (I’m sure they may be some reason, I just haven’t ever found a use for it!).

 

VLOOKUP Partial Match

So if you want to look up a partial match which doesn’t use the “Approximate Match” type then you need to do a bit of clever excel magic.

In this instance the * character represents a wildcard which means that when Excel is looking up the value it uses the lookup for a partial match yet still follows the strict criteria of the “Exact Match” type of the VLOOKUP formula.

Let me explain that in a little more detail.

So here is the original formula again;

 

=VLOOKUP(CONCATENATE(“*”, A2, “*”), ‘Tab2’!$A$2:$A$100, 1, FALSE)

 

What this formula is saying is as follows;

  • CONCATENATE(“*”, A2, “*”): Add an * at the start and end of the value which is being looked up, in this case cell A2
  • =VLOOKUP(…{above formula}…): Look up this value
  • ‘Tab2’!$A$2:$A$100: Lookup the original value within this range of data
  • 1: Bring back this column of data when the item is found
  • FALSE: Still use the “Exact Match” type as part of the VLOOKUP function

 

So it really is as simple as that to look up a partial match on a row of data

 

Alternative Uses

Alternative uses when you want to perform a partial match look up could be if you wanted to count the number of times where a certain piece of text was occurring within a range of data as follows;

 

Tab 1

A2: Jim Bob

A3: Jimmy

A4: Bob

A5: Jimmy Mallet

 

Then if you performed the following formula on the above range of data;

 

=IF(COUNTIF(A2:A5,CONCATENATE(“*”, “Jim”, “*”))>0,”Yes”,”No”)

 

Then this would count the number of cells which contain the word “Jim” within the range of data, in this case 3. In the example above it simply outputs a “Yes” or a “No” if there is at least one occurrence of the word within the range of data, although if you would like the exact figure then you can simply strip that part of the formula out which would become;

 

=COUNTIF(A1:A4,CONCATENATE(“*”, “Jim”, “*”))

 

A few nice quick Excel tips to help with looking up values in Excel using partial matches.

 

Previous Comments

34 thoughts on “How to VLOOKUP Using Partial Match”

  1. I knew TRUE was a waste of time! I’ve tried to implement it to be useful for this purpose so many times, but it’s completely useless!!!

    Thank you so much for this!

    You probably could add the point that you can leave the wildcard off either side if you know that the string you’re looking for is a prefix or suffix to the data in the table_array (i.e. the start of a postal code etc.).

  2. This solved my problem well where the middle part is text but it doesn’t work with numbers. For example in my specific case one of the partial numbers I want to look up is 50001 and the target value in the search range is 50001.801. Using *50001* returns #N/A
    I have a mixture of text and numbers so I need a solution that deals with both. Any suggestions?
    Thanks

  3. Further investigation into my problem reveals that in order for the partial lookup to work it’s not the value being looked up that needs to be text it is in fact the look up range. I have used “text to columns” to convert the numbers to text and now the partial lookup works perfectly for both numbers AND text so my problem is solved in a round about way. To save the extra step it would be nice to have a formula that deals with both if you know the answer!

    1. Hi Bob,

      Yes I see the problem and solution. I’ve had a play around with doing all of this in a single formula although haven’t managed to get that working yet. This is the inherent problem when working in Excel for things like this. It’s good up until a point, whereby an SQL database would be better as it is more flexible. There is the TEXT() formula, although this still stores the data as a ‘number’ and not as a ‘string’, it simply changes the visual formatting of the text opposed to actually changing the type of content that it is stored as.

      If you do find a solution to this, do pop back and let me know though

      Thanks
      Michael

  4. Hi,

    Thanks for this! Is there any way to tweak the formula so it searches on part of what is in cell A?

    E.G. in Cell A1 I have a text string like BXXXX-1XX-XX

    What I want to search for in the range is anything that matches only the BXXXX part and disregards the remainder of the value in Cell A1, and returns the value of column 3 in the range.Do you know if this is possible?

    Thanks,

    Babs

    1. Hi Babs,

      Yes this is possible. Rather than trying to do all of this in one mammoth formula (which is possible if you like), it may be easier to split the lookup column based on the ‘-‘, then you can simply lookup that column instead which will be the first bit you are looking for. A VLOOKUP will return the whole cell, so while the lookup on a mammoth formula would work, it would still return the whole contents of the cell where the result has been found.

      Hope that helps

  5. I was looking for just this very thing! I went down a number of pathways, but your answer was very simple and elegant….I studied math in college and I appreciate a simple and elegant answer. In particular, your CONCATENATE(“*”,A2,”*”) really helped me out. I needed a way to embed the actual cell reference within the wildcards. Great job and thanks!

  6. Michael,
    i’m trying to work this backward. IE. I have a list with “Jim Bob (12234)” and need to lookup Jim Bob in the longer list. Limiting the # of characters from left to right will help but some first_last name combos are much longer than others. If i restrict it to 7 characters to work for “Jim bob” it will find him, but it will pull all of the instances of Rachael regardless of last name. is there a way to capture everything up to a certian character… ie “(”
    thanks

    1. You could use the MID function along with FIND which would replace the static lookup value. See here for an idea for how to use – “How To Get The Domain Name From a URL in Excel”,http://www.michaelcropper.co.uk/seo-tools/excel-tips-tricks-for-seo.

      That said, the above solution isn’t that great. What I’d recommend is to split out your data using the Text to Columns button in the Excel navigation and structure your data better. I.e. 3 columns, First Name, Surname, Customer ID. Then you can look things up better.

  7. I have a master class list that I am creating that will contain columns for Group #, Day (there will be three different training days), Date, Location, Time, and Instructor. I want to be able to list multiple groups in the group numbers column. My goal is on subsequent worksheets have the groups class information reorganized based on there number and which day of the series the class is. So the columns would look like Group #, Day 1- Date, Day 1-Location, Day 1-Time, Day 2- Date, Day 2-Location, Day 2-Time, Day 3- Date, Day 3-Location, Day 3-Time.

    The reason this is so complicated and important to be able to break it down is that different groups will be combined in different days of the class. Example on day one 5/22 it might be groups 1, 2,3 but on day two for group 2 they might be joined by 5 & 6 and day three might be even different combos. I need to be able to say group 2 here are your three days in an easy way.

    Thank you.

    1. If you need to look up data which could be in multiple columns this is difficult as the VLOOKUP formula looks at the first column and brings back any column you choose once this data has been found. I’d suggest re-organising your initial lookup master list so that it just has something like; Group Number | Day | Date | Location | Time | Instructor. Then for anyone in “Group 1” (i.e. the lookup value) then you can pull all of the required information where ever you like.

      1. I have it organized similar to what you suggest but in the Group Number box there is multiple groups. I need it to search in the array of those group values to find a specific group. Once if finds those values I want it to populate the list on a different page.

        1. If you want ‘something’ to run through same data, then populate this data into another Sheet then you need to be using Visual Basic to run a script to do this. Formulas are used to pull data into a specific cell or range of cells. To push data into ‘somewhere’ (a Sheet or a Cell) then you need to be using VB to do this which is fun to work with.

          Depending on the scale of your application, maybe it would be worth looking at a real database solution such as MySQL?

  8. I need to use partial search, but in vice a versa way.
    I have in Tab 1 general data (column A product type; column B price; column C category):
    product type price category
    456389 600 a
    4A578964 500 b
    35289 700 b

    and in Tab 2 list of products
    product
    456389aa
    4A57896478
    35289ee
    456389ab
    35289ec
    How could I get price and category behind each product in Tab 2?

    1. Just use a normal VLOOKUP on Sheet 2, columns B and C;

      B2 (Price)…….=VLOOKUP(Sheet1!A2, Sheet2!A2:B2, 2, false)
      C2 (Category)….=VLOOKUP(Sheet1!A2, Sheet2!A2:B3, 3, false)

      As long as there is a match for A1 in both Sheet 1 and Sheet 2, then data will match OK.

      1. Hi Michael,

        Please can you assist me with a formula to VLOOKUP (A1 – which is a 5 or 6 digit number) from a the 1st 5 or 6 digits of a text and bring back the comment in the same row

  9. If I have understood correctly, this should work;

    =VLOOKUP(CONCATENATE(“*”, {the cell you want to lookup on partial match}, “*”), {the rows/columns you want to find the value in – the first column is the lookup column}, {the column number you want to look up, i.e. if you are in C2 is the first column and you want to bring back D2, then this number would be set to 2}, false)

  10. Dear Michael, thank you so much for posting this! It helped me quite a bit! However, just wanted to ask you this:
    When I try to look up ALWAYS PLASTICS COMPARY in a list that contains:
    ALWAYS PLASTIC COMPANY
    ALWAYS PLASTIC COMPANY(I)
    ALWAYS PLASTICS CO (LF)

    I don’t get a match, and I found out because I am looking up COMPARY with an R and also I have PLASTICS with an S at the end. What would you recommend in this situation? The same thing happens when I look up ALL IN 1 PRODUCTS LTD in a list that contains:
    ALL IN 1 LTD

    1. Hi Dan,

      Glad this has been of some use to you. Yes, you will find that if you are working with extremely bad data that any kind of VLOOKUP just isn’t going to be suitable. Really, what you need is to run through your data and clean it all up, which is unfortunately often a time consuming and manual(ish) job. Depending on what you are trying to do with this, it may be best to use an actual database (i.e. MySQL or similar) to accomplish your task. Bad data is very difficult to work with, so it would always be recommended to clean this up so that you can use the data well. Regular expression matching is possible within Excel, although it’s not straight forward,http://stackoverflow.com/questions/22542834/how-to-use-regular-expressions-regex-in-microsoft-excel-both-in-cell-and-loops.

      Hope that helps anyhow, let me know if you need any other pointers.

  11. Hi Michael, I am wondering if I am having the same question as some previous folks.
    I have column a and column b.
    Column A is a short list of company name I want to match up to Column B longer list of company names.

    The problem is it needs to be a partial match because some of the companies will be worded slightly differently – say Institute of California and then California Institute.

    Is it possible to have column c pull out from column b the likely match from column a?

    Thanks!
    Kat

    1. Hi Kat,

      The only way of doing this is to lookup a single word or phrase that is definitely going to be in the list. In that example you gave, “California” is always in the list. the problem being is that Column A is “Institute of California” and I’m guessing cell A2 could then be “San Francisco Institute” (the opposite of the first row).

      It sounds like a bad data problem again unfortunately. When using VLOOKUP using partial match, you need to have a defined point that can be checked.

  12. Hi Michael,

    I need your help as well. I have a long list of contact numbers from different countries and I need to match them with their respective country names. I defined a group with the international codes and their country names. Now I want a formula that only looks for the first 3 or 4 integers of the phone numbers (as theses are international codes) in the contact list to match with the international codes defined by me and pull the name of the country for each.

    I defined the group in both column A (International code) and B (Country name) and the list of phone numbers is in column C. I need the to see the country names in column D.
    Please assist with the formula. Thank you

    1. Hi AJ,

      Ok here is what I would do. Firstly, see about splitting Column C for the phone numbers into two columns, Column C for the country code portion of the phone number, and Column D for the rest of the phone number. This may take a bit of fiddling to get this data right but it will be much easier to then look up the correct info. So once you have, Column C for the Country Code Part of Phone Number, and Column D for the Phone Number, you can then run a standard VLOOKUP in Column E for “Country Name” which is along the lines of =VLOOKUP(C2, $A$1:$A$1000, 2, false), which should pull back the relevant country name.

      You could probably achieve the same thing with an enormous function trying to cram this in to take into account 3, 4, hyphenated country codes etc. but this is just a pain. So I’d recommend cleaning the data a little first to make your life easier.

      Hope that helps

  13. Hi Michael,

    I try to use the formula that you gave but it didn’t work in my excel sheet. For example, I try lookup these names:
    Jaspal Gurdev
    Nurul Amira Ahmad

    …… in a list that contains:
    Nurul Amira binti Ahmad
    Jaspal s/o Gurdev

    but it give me N/A values. I try to use LEFT and Vlookup together,
    but still it didn’t working out. Any suggestion?

    Thanks!

    1. Hi Dira,

      You’ll have trouble trying to do that. When you don’t have the full string to lookup it is difficult. If you had in th elist “Nurul Amira Ahmad binti” and “Jaspal Gurdev s/o” then this would be simple. The only other option you have would be to split the names into different columns first, so “Jaspal” and “Gurdev” then use a “*”, “Jaspal”, “*”, “Gurdev”, “*” approach although this is going to bring back some horrible results. I’d look at cleaning your data first as there is no guarantee that there isn’t another “Jaspal Bob Gurdev” in the lookup table which would bring back the wrong results than you were expecting.

      Hope that helps

  14. Michael – outstanding information. I have implemented this for VLOOKUP for – ahem – a Fantasy Football Draft board that yields on to a team roster after the selection is made. I am dealing with one column (to simplify), and (example) I want to pull the Quarter Backs (via another formula set as “QB”). The concatenate is great for the partial and will pull the 1st RB but I have tried everything to pull the 2nd etc and cannot even after using VLOOKUP, and a combo of INDEX/MATC. SMALLIF does not work because I really need text output. As an example Matt Ryan QB is put on the draft sheet, and is pulled over on to the Roster sheet with this code:

    =VLOOKUP(CONCATENATE(“*”,$B$12), Draft!G3:G17,1,FALSE)
    But I cannot get a formula to pull the next QB (Tom Brady) even after experimenting with the “+1),2)” and many iterations. Any fix here?

    Thanks, Andy

    1. Hi Andy,

      Glad you found the information useful. So to look at your query, here’s what I’d do. The following INDEX/MATCH combo works quite well to pull the next result of something without using VLOOKUP, “=INDEX(A2:A7, MATCH(C2, A2:A7)+1)” – Whereby you are looking to MATCH the value in “C2” from the array in “A2:A7”, then moving it down one with the “+1” in the INDEX part of the formula.

      The other option would be to re-jig the data a bit to add in some unique lookup values. For example, (I know nothing about English football, never mind American football!), if there are 2 Quarter Backs, could you not add in another column in the Team Roster table where you have Column A for all of the positions available in the team, then Column B would be an option to enter in the persons name, then Column C would be any other info you require to pull in about this person based on the unique lookup id of the person’s name?

      Another option may be to look at how to VLOOKUP inside a VLOOKUP

      Hope that helps get you on the right track :-)

How to Count the Number of Occurrences of Text in a Column in Excel

Often when working with large data sets in Excel you want to quickly identify a list of all duplicate pieces of content within a row or column. With a nice little formula and a bit of filtering this is quickly achievable. This is achieved with a simple =COUNTIF() formula as is shown below;

 

 

So what does all that mean? All the formula is saying is “Count the number of times the adjacent cell occurs in the whole range”. So in the example of cell B3, the formula is saying “Count the number of times the word ‘Orange’ appears within the range A2 to A8”.

Why are there dollar $ signs wrapped around the first part of the formula? This is simply telling Excel that this range is fixed. By default when you drag a formula down, all of the corresponding parts of the formula are also dragged down too, so if the dollar sign wasn’t included then then you drag down the list the results may be different as shown in another similar example below. As you can see, the results for what you would want to show the same answer are actually different – one cell says there are two occurrences of ‘Oranges’ while the other says that there is only one;

 

 

The next step that you have counted the number of occurrences of text in a range is to filter this data out by selecting the whole of Row 1 then clicking on Data > Filter as shown below;

 

 

This will then allow you to filter by showing all rows that have a number greater than 1. If you click on the small drop down arrow, then onNumber Filters, then on Greater Than then you will be able to enter in data as shown below;

 

 

Once you click this you will be presented with an input box where you can type the number 1 into this box then press OK so that the applied filter will show just the data you require;

 

 

Now that this filter has been applied correctly you will see all the rows which have multiple occurrences within the whole column as shown below;

 

 

The next and final step is to simply copy and paste all of the filtered data from column A into a new tab on the spreadsheet. From there you can then select the whole of column A in the new tab and select Data > Remove Duplicates from the Excel menu as shown below;

 

 

This will now leave you with a final list of unique items which occur multiple times in a range of data. Quite simple really and it can have many practical uses;

 

 

This type of work can be really useful when you need to find out what specific pages on a website have duplicate content on them or other similar tasks such as duplicate title tags or meta descriptions etc. There are endless ways that this can be used so this should provide as a starting point to guide you on some of the more useful Excel functions and opportunities.

How I Built SimpleSitemapGenerator.org in a Weekend

Whilst working in the SEO industry there are time when certain tools would make your life easier and you just can’t quite find a tool that does the job that you need. This is one occasion where I was looking for a simple sitemap generator and all of the tools that I could find were either limiting the number of URLs which could be contained to a really small number or didn’t allow me to tell the tool what the URLs actually were.

So that I why I built SimpleSitemapGenerator.org over a weekend. I’m sure there will be sitemap tools which can achieve a similar result out there that I simply have found but my patience was wearing thin searching :-)

Was it difficult? Not really. It was just working through some basic logic to build in exactly what I needed. Below explains how I built the tool.

 

What is it built with?

Simple Sitemap Generator is built on a Java platform running on an Apache Tomcat web server. Why? Because I know Java. The exact same task could be achieved using any programming language you choose if you require. My referred method of developing websites is using the Integrated Development Environment (IDE) called NetBeans.

Some hardcore programmers always prefer not to use these types of tools as they can get their self tied in knots sometimes which require a deeper understanding to untangle – so if you only use these tools you may find it difficult to figure out what is wrong. Personally, I prefer to make my life as simple as possible – why make things more difficult than they have to be to achieve the task in hand?

 

 

How does it work?

Quite simply really, the list of URLs are parsed using a Java program behind the scenes which separates all URLs by the new line character. The other items including the change frequency, last modified and the priority are also picked up from the main form then used in the program.

The program ultimately just runs through each of the URLs within the list (up to a maximum of 50,000 URLs due to this limitation within XML sitemaps) and wraps the correct tags around each item based on the latest XML sitemap specification.

Below is a simple diagram about how the program uses the data which has been entered on the form so you can see how the logic works in the program. I have excluded any of the Java code so it is a little easier to understand for the non-technical people.

(click on image for a larger view)

 

 

Then the sitemap is complete! So it is just about displaying that nicely to the user.

 

How did you design the logo?

I am not a big fan of designing anything and I am very poor at doing so. My preferred method of developing logos and nice graphics is using Microsoft Word combined with Paint.Net to achieve a few nicer effects if needed.

Why do I use these tools? Because using the more advanced tools are way beyond my skill set and I don’t have the time or desire to try and master these. The basics serve my purposes for the time being but not to say that I may not learn in the future – just not in the near future.

 

Why didn’t you build in a website scraper?

A lot of other sitemap generator tools have built in website scrapers and can identify all URLs on your website easily, although these are always limited by the number of URLs they can crawl. There are several reasons why I didn’t build in a web scraper to the tool;

The first reason being that by having a website that crawls the whole of a website leaves the tool open to abuse by people wanting to attack certain websites by making the tool send thousands of requests towards a certain website. This is more commonly known as a Denial of Service (DOS) attack. This amount of requests can bring websites to their knees or totally offline.

If I built in a scraper function into the tool then it would be very simple for someone to enter in “www.website.com” into the scraper tool and press ‘go’ and continue doing the same in endless tabs in their browser. The result from which would be thousands of requests going to www.website.com. There is always ways to get around this type of abuse but this requires more time to build into the tool.

The second reason why I didn’t build a web scraper into the tool is because there are already really good tools out there that can do this for you, namely Xenu Link Sleuth. Why re-invent the wheel?

I primarily built this tool for myself as I will find it useful as I work on a lot of different websites, so it makes my life simpler. I can quickly identify all URLs on a website using Xenu so I didn’t need to go re-designing this as I can simply use a combination of tools to achieve the task which works out quicker.

The third reason why I didn’t do this is because the actual server overheads to crawl an awful lot of URLs to scrape a website, then parse all of the information to use in the sitemap is an awful lot and since there will likely be very little income from the tool (advertising makes pennies!) then this would purely be a loss making exercise for me and that doesn’t sound like too much fun.

 

Why can’t you have different priorities for different URLs?

Because I didn’t build this in as (in my opinion – I’m sure there will be people with other opinions on this!) there is very little value in changing this from 0.1, 0.4, or 1.0. The aim of the tool is to quickly build an XML sitemap from a list of URLs so you can tell search engines about content they may not already be aware of. If you want to quickly tell them about content then why would you set a lower priority for content?

While it may be interesting to build into the tool a way to prioritise URLs based on their importance, there are no plans to do this in the near future. If you want to begin doing things like this then I suggest you build a custom XML sitemap generator which is more integrated into your content management system / database so that it can be continually upgraded.

 

 

How did you choose the font and colour scheme?

As you know already that I created the logo in Microsoft Word, well you may notice the font from another post I did a while ago about the 200 signals in Google ranking algorithm (and yes, that image was also created in Word). Why the font? Because I like it. Simple as that.

Why the colour scheme? For the same reason, I like that basic green colour in Word for colouring sections of text in (I usually use this for ticking off items on a to-do list or similar) so it seemed like a nice choice and I think it works quite well.

How about the main navigation colour scheme? Well I actually just pulled this whole navigation from another website I have developed as I wanted to quickly create a navigation menu and there was little value in creating one from scratch. So this was more of a quick and dirty approach which achieves the aim of being a navigation menu.

 

How did you get the XML sitemap to look pretty?

If you view the sitemap for the actual website, http://www.simplesitemapgenerator.org/sitemap.xml then you can see the sitemap is styled all nicely as is seen below;

 

 

Isn’t an XML sitemap supposed to look like a normal XML document though? Well usually yes, but it is possible to style up XML sitemaps so they look nice. This is using an XML Stylesheet which is achieved by adding a line of code to the top of the XML Sitemap as follows;

 

<?xml-stylesheet type=”text/xsl” href=”http://www.simplesitemapgenerator.org/sitemap-stylesheet.xsl”?>

 

This line of code is pulling in the stylesheet information from a separate stylesheet file which is making the XML document look a little nicer. I will be doing another post about how to create these as they are reasonably straight forward to implement and can make your XML sitemap a little more user friendly and they also have other SEO benefits such as being able to easily ping all of the URLs to ensure they are working etc.

 

Summary

So there is a bit of information about how I builtSimpleSitemapGenerator.org in a weekend. Quite simple really, it was just about allowing basic data to be entered onto a form then parsing the results and outputting to a nice format which is in line with the latest XML sitemap specifications.

I always encourage people to give something a go and try and solve a solution to a problem yourself as it really isn’t that difficult. The added bonus that it is fun doing so too!

This tool has certainly made my life easier and will continue to do so. I hope it can be of some use to you as well. If you do find it useful then please share :-)

Excel Tips & Tricks for SEO

Quick reference manual for myself as to regular Excel things I use and useful formulas for various tasks. If you find it useful too, then please share around :-) It will be a growing list as/when I find something useful that I use a lot but can never remember the exact way of doing it!

 

How to Count the Number of Occurrences of Text in a Cell

Quick reference tip;

 

=SUM(LEN(<range>)-LEN(SUBSTITUTE(<range>,”text”,””)))/LEN(“text”)+1

=SUM(LEN(B2)-LEN(SUBSTITUTE(B2,”|”,””)))/LEN(“|”)+1(example)

=SUM(LEN(B2)-LEN(SUBSTITUTE(B2,”separator”,””)))/LEN(“separator”)+1(example)

 

Read the full blog post about how to count the number of occurrences of text in a cell here about how to use this.

 

How to Count the Number of Occurrences of Text in a Column in Excel

Quick reference tip;

 

=COUNTIF(<range>, “text”)

=COUNTIF($A$2:$A$8, A2) (example)

=COUNTIF($A$2:$A$8, “Apples”) (example)

 

Read the full blog post about How to Count the Number of Occurrences of Text in a Column in Excel here about how to use this.

 

How to VLOOKUP Using Partial Matches in Excel

Quick reference tip;

 

=VLOOKUP(CONCATENATE(“*”, <lookup value>, “*”), <range>, 1, FALSE)

=VLOOKUP(CONCATENATE(“*”, A2, “*”), B2:B10, 1, FALSE)(example)

=VLOOKUP(CONCATENATE(“*”, “Jim”, “*”), B2:B10, 1, FALSE) (example)

 

Read the full blog post about How to VLOOKUP Using Partial Match for how to use this.

 

How To Get The Domain Name From a URL in Excel

Quick reference tip;

 

=MID({CELL OF FULL URL}, FIND(“//”, {CELL OF FULL URL})+2, FIND(“/”, {CELL OF FULL URL}, 10)-8)

=MID(A1, FIND(“//”, A1)+2, FIND(“/”, A1, 10)-8)

=MID(“http://www.michaelcropper.co.uk/2012/10/how-to-scrape-the-href-attribute-using-xpathonurl-seo-tools-1252.html“, FIND(“//”, “http://www.michaelcropper.co.uk/2012/10/how-to-scrape-the-href-attribute-using-xpathonurl-seo-tools-1252.html“)+2, FIND(“/”, “http://www.michaelcropper.co.uk/2012/10/how-to-scrape-the-href-attribute-using-xpathonurl-seo-tools-1252.html“, 10)-8)

 

Read the full blog post on How To Get The Domain Name From a URL in Excel for detailed information about how to use this formula.

How to Identify All External Links on a Website Using XPath

Often can be the case where interesting requests come in from people you are working with which there doesn’t appear to be a useful tool available for quickly gathering the information. This is an interesting required which has come in today about how to quickly identify all external links on a website.

This is actually quite a common issue since with various content management systems simply adding various content (and links) around hundreds/thousands of pages across a website, how do you easily find all of the external links correctly?

Fortunately, I have figured out a nice way using a bit of XPath, the SEO Tools plugin for Excel, and Xenu.

If you are looking for a quick answer then this is the XPath required to identify external links on a single web page;

 

//a[not(contains(@href, ‘www.michaelcropper.co.uk’))]/@href

 

So what does this actually mean?

  • //a : Get me any links that …
  • [not(contains( : … do not contain …
  • @href : … a link which ….
  • ‘www.michaelcropper.co.uk’))] : … contains this website address and …
  • /@href : … get the HREF attribute for this link

 

Make sense? Good. Lets look at actually using this XPath in a useful way.

 

SEO Tools Plugin

Now the interesting thing is when using XPathOnURL with SEO Tools, this doesn’t actually bring back the HREF attribute, instead it pulls back the first URL on the page which may be good enough for this purpose. So the function would be as follows when the URL you want to test against is in cell A1;

 

=XPathOnUrl(A1, “//a[not(contains(@href, ‘www.michaelcropper.co.uk’))]/@href”)

 

In the example above I was testing on the URL http://www.michaelcropper.co.uk/2012/06/googles-business-plan-steal-content-and-screw-publishers-1081.html as that contains a link to an external website. So now we want to look at scaling this up for a bunch of URLs on a website.

 

Xenu

Now you know how to check if a specific URL contains an external link, then the next step would be to do this for all URLs on the site you want to check.

Simply get Xenu installed and run the program on any website which you can then export all website URLs into an Excel file.

Now you will have a huge list of all URLs on a website, where you can then run the same XPathOnURL function on to identify all pages on a website which contain at least one external link.

 

Summary

This is likely only one solution to a problem and doesn’t actually allow you to create a definitive list of every single external link on every page of the website, but it does tell you which pages on a website contain an external link to another website.

Simple, but effective. :-)

How to Count The Number of Occurrences of Text in a Cell in Excel

With various SEO tasks it is often the case where you need to count the number of occurrences of text in a cell. Unfortunately there isn’t a nice and simply formula to do this, but there is a work around.

Take the example if you download all of the duplicate title tags from a website from Google Webmaster Tools. The way that Google presents this data is in a two column format as seen below;

 

 

As you can see in the image above, this isn’t that useful in itself as often the pages with duplicate title tags can be the same page simply with parameters appended. These pages can often be listed a number of times if it is something like a session ID as the parameter which Google has managed to identify. What would be nice is if Google actually listed a number next to the list which is downloaded telling you how many pages the title tag is duplicate over as this would be a nice starting point to prioritise fixing all of those issues – unfortunately they don’t.

Instead you have to use a bit of Excel Magic to calculate this information. Fortunately within the downloaded spreadsheet, the URLs listed in column B are separated by the pipe symbol ‘|’ which can be used as part of the formula below;

 

=SUM(LEN(<range>)-LEN(SUBSTITUTE(<range>,”text”,””)))/LEN(“text”)+1

 

So what does this mean? Well if we put this formula into cell C2 following on from the original example above, then this is what it would mean….

 

=SUM(LEN(B2)-LEN(SUBSTITUTE(B2,”|”,””)))/LEN(“|”)+1

 

What the formula is doing is counting the number of occurrences of the pipe symbol ‘|’ then adding 1 to that number. Why adding 1? Because if two URLs in (this example) are listed in cell B2, then there will only be 1 pipe symbol and we want to know how many URLs the title tag is duplicated over.

Below is the final example of how this looks when implementing this solution;

 

 

Now that you have the formula for doing this work, you can easily apply this same logic to any similar tasks where you need to count the number of occurrences of text contained within a cell in Excel.

HTML5 SEO Best Practices

HTML5 is one of the subjects that is going to gain more importance within the SEO area over the coming months and years with more people choosing to adopt the technology. So this is a guide to introduce you to HTML5 from an SEO point.

Firstly, if we look at what is HTML5 and who has adopted HTML5. Then we can look at the new HTML5 tags and finally what HTML5 means for SEO.

 

 

What is HTML5?

HTML5 is a better version of HTML4 (no surprises there!). HTML5 has been designed to standardise multimedia formats so they can be consistently understood by computers (i.e browsers). HTML5 makes it easy to include interactive multimedia elements within your website without the need for external plugins, such as Flash.

Whilst HTML5 is an evolving process with new features planning on being added over the coming months (and likely years), it makes sense to at least begin understanding the benefits of HTML5 sooner rather than later.

Below is a nice HTML5 infographic which you can see on Wikipedia’s page on HTML5. It gives a good idea about what s currently within the HTML specification and the status of other HTML5 ideas which are still on the drawing board.

 

 

There are some new cool features within HTML5 such as the

 

Who is already adopting HTML5?

Lots of people! BinVisions produced a report in October 2011 which accessed some of the top websites to see who was and who wasn’t adopting HTML5 as a new technology and found the following for the top 100 websites;

 

 

As you can see from the image above, only 34 out of the top 100 websites have fully adopted HTML5. These 34 websites are;

 

 

For a full breakdown of who is / isn’t using HTML5 yet then download BinVision’s PDF report (direct link to PDF doc).

 

New HTML5 Tags

If you are not familiar with HTML5 then I can recommend running through W3School’s HTML5 tutorial as it contains a bit more background than what is going to be covered in this post.

 

New HTML5 Tags;

 

  • <canvas> – This is similar to the <textarea> tag, but cooler. When linked up with JavaScript it allows you to draw different things on the canvas. This can be really useful when wanting to make your pages more interactive without
  • <video> – This tag is designed to replace typical Flash plugins to play videos.
  • <audio> – This tag is exactly the same as the video tag
Content specific HML5 tags;
  • <article>/strong> – Main content on the page
  • <section> – Main content on the page
  • <header> – Header (not the) part of your document, for example maybe your company logo would go in here
  • <footer> – Footer section of your page, for example you may want to put your footer links within here (although you could use the element for that as well)
  • <nav> – For your navigation menus on your site.
  • <hgroup>– This tag is designed to group normal heading tags together ( <h2>etc), for example if you have multiple headings within your article or content.
  • <aside> – For content which is not the main content on the page.
  • <summary> – This defines a visible heading for a <details> element (only works in Google Chrome!)
  • <details> – This is for additional details that a user can view or hide (only works in Google Chrome!)
  • <time> – This defines the date and time which has likely been introduced as Google hasn’t been brilliant at determining the date of blog posts in the past.

 

 

HTML5 SEO Best Practices

The new content specific tags listed above are a little easier to explain with a graphic which shows HTML5 SEO best practices;

 

 

When reading through the tutorials available online, the tags are a little ambigious so the information I have documented above is just my take on this. Please do have a read yourself and see if you come to the same conclusions (or not!).

 

HTML5 SEO Benefits

In short, there are none. You are not going to get better rankings with your website by simply marking up your content in HTML5 over HTML4 or even HTML1. In the same sense, there is no harm to marking up your content with HTML5 SEO wise as you can see from the quotes listed below.

 

What does Google say about HTML5?

Not much really as there have been no official blog posts released on the subject. That said, John Mueller from Google has commented on several threads about HTML5 which provides an insight into the pros and cons.

 

When asked “How well does Googlebot deal with non-standard tags“,  John Mueller replied with;

 

John Mueller – Google Webmaster Trends Analyst

In general, our crawlers are used to not being able to parse all HTML markup – be it from broken HTML, embedded XML

content or from the new HTML5 tags. Our general strategy is to wait to see how content is marked up on the web in practice and to adapt to that.

If we find that more and more content uses HTML5 markup, that this markup can give us additional information, and that it doesn’t cause problems if webmasters incorrectly use it (which is always a problem in the beginning), then over time we’ll attempt to work that into our algorithms.

With that in mind, I definitely wouldn’t want to stand in the way of your implementing parts of your site with HTML5, but I also wouldn’t expect to see special treatment of your content due to the HTML5 markup at the moment. HTML5 is still very much a work in progress, so it’s great to see bleeding-edge sites making use of the new possibilities :)

From 17th October 2010

 

Then a day later in 2010, Google was asked the question “Does semantic HTML5 matter to Google yet?” and again John was on hand to provide a few more nice snippets of information;

 

John Mueller – Google Webmaster Trends Analyst

“In general, we work hard to understand as much of the web as possible, but I have a feeling that HTML5 markup is not yet as widely in use (and in use correctly) that it would make sense for us to use it as a means of understanding content better. As HTML5 gains in popularity and as we recognize specific markup elements that provide value to our indexing system, this is likely to change, but at the moment I would not assume that you would have an advantage by using HTML5 instead of older variants. 

Returning different content to search engine crawlers as you would to similar users would be considered cloaking and against our Webmaster Guidelines. I would not recommend treating search engine crawlers different than you would similar users. 

Personally, I would recommend using HTML5 where you think that it already makes sense, perhaps reverting to HTML4 if you can determine that the browser won’t support the elements of HTML5 that you use properly. While this will not result in an advantage for your content in our search results, it generally wouldn’t be disadvantageous either. “

From 18th October 2010

 

Another one of the questions which keeps cropping up is about having multiple H1’s on a page due to the new HTML5 markup and it this is good or bad for SEO. Well here is what John said on the matter;

 

“Having multiple h1 tags on a page is fine. Matt also talked about this – not in the context of HTML5, but in a more general sense”

From 18th October 2010

 

For reference, here is the video John was referring to;

 

 

But can we really rely on information from over 18 months ago? 18 months online is like a lifetime, so I decided to see if I could get some more clarity on the issue with regards to how Google treats HTML5 with SEO in 2012.

I asked “How well does Google Bot handle HTML5” and John was back again to answer the question in great detail;

 

John Mueller – Google Webmaster Trends Analyst

“In short: our systems are pretty good at parsing the common kinds of HTML-like markup, including the various flavors of HTML and XHTML. As far as I know, our crawling and indexing systems currently don’t do anything special for HTML5, so there is no “bonus” for using HTML5 constructs, but similarly also generally no downside.

That said, most of the pages with HTML5 markup that I’ve seen tend to be very clean – with little “cruft” and unnecessary elements. While having clean markup isn’t something that our algorithms explicitly look for, we occasionally see pages that are almost unparseable.

Also, clean markup makes it much easier to maintain the website, easier to add new elements like microformats, and frequently makes the pages more portable across a variety of browsers and devices. So with that in mind, there’s nothing holding you back (from a search point of view) from creating an awesome site in HTML5 :-)”

From 21st March 2012

 

On that note, time to get crating some awesome websites :-)

Ian Hickson - Research and Standards Development at Google and Specification Editor at WHATWG
Ian Hickson – Research and Standards Development at Google and Specification Editor at WHATWG

I would have been very surprised if Google didn’t handle HTML5 since Google Chrome supports all the new features of HTML5 (the only browser that does at the moment!). PlusIan Hickson who works at Google within the Research and Standards Development field has been acting as the specification editor atWHATWG (The Web Hypertext Application Technology Working Group ) who have been developing the new HTML5 standard.

 

 

 

 

 

 

Issues with HTML5

As with most new technologies, cross browser compatibility issues are always an issue.  When reading through the HTML5 tutorial from W3Schools there are plenty of examples about what does and doesn’t work based on different browser types and versions.

So whilst HTML5 is the cool new technology on the block, it may not work for everyone. All of the new HTML5 markup works in Google Chrome and the other browsers will be updating in future releases.

 

 

Cool HTML5 Examples

Google’s Pac-Man example

 

 

http://www.google.com/pacman/ – How is this good for SEO? It’s cool, shareable, likeable, tweetable, playable and accessible without any plugins.

 

Command and Conquer HTML5 example

 

Play the full HTML5 Command and Conquer game.

 

Summary

Overall, HTML5 is a great new technology designed to make life easier and better for web users. Whilst I wouldn’t go and re-code the whole site into HTML5 since there is no real benefit for doing so (SEO or otherwise), instead for any new developments I would begin to adopt the new HTML5 technology.

As has been seen with other new technologies (Rich Snippets come to mind), if you can implement new technologies on the off-chance Google will give some additional bonus at some unknown point in the future then you are onto a winner. Google may never provide additional benefit to websites for designing in HTML5, but if you begin using this from now on then if they do begin to promote HTML5 websites then you will be one of the first to benefit from this.

If you are interested in looking at some cool things you can do with HTML5 then Google produced a nice Pac Man

7 thoughts on “HTML5 SEO Best Practices”

      1. Mick,
        Thank you for your time, it’s big job.

        Do you have any link list where I can get good material about HTML 5, Dummy SEO Friendly? :)

        Cheers,
        Ramūnas Bruzgys

        1. Hi Ramas,

          Your best bet if you want a dummy-friendly guide for HTML5 would be to run over the tutorial at W3Schools. SEO wise, there isn’t much benefit at the moment so I wouldn’t worry too much about that side of things.

          Michael