Those of us who do SEO have been increasingly pleased with the search engines for providing tools and protocols that help us direct, control, and manage how our sites are indexed. However, the search engines still need to keep much of their workings secret, out of fear of being exploited by ruthless black-hats who will seek to improve page rankings for keywords regardless of appropriateness. This often leaves the rest of us with tools that can be used in some limited cases, but with little or no documentation to tell us how those tools actually behave in the complex real world. The Robots META tag is a case in point.
The idea behind the protocol was simple and convenient. It’s sometimes hard to use a robots.txt file to manage all the types of pages delivered by large, dynamic sites. So, what could be better than using a tag directly on a page to tell the SE whether to spider and index the page or not? Here’s how the tag should look if you wanted a page to NOT be indexed, and for links found on it to NOT be crawled:
<meta content="noindex,nofollow" name="ROBOTS">
Alternatively, here’s the tag if you wanted to expressly tell the bot to index the page and crawl the links on it:
<meta content="index,follow" name="ROBOTS">
But, what if you wanted the page to not be indexed, while you still wanted the links to be spidered? Or, what if you needed the page indexed, but the links not followed? The major search engines don’t clearly describe how they treat these combinations, and the effects may not be what you’d otherwise expect. Read on and I’ll explain how using this simple protocol with the odd combos had some undesirable effects.
One of the sites that I manage has many millions of pages of content. In order to get that content exposed so that SEs can find it all, each of those pages must be linked to from other pages that the SEs spider. So, it’s necessary to build out pages which cascade the links out, unfortunately resulting in some usable (though not terribly desirable) link pages.
So, what could be better than asking the search engine bots to follow the links on those pages while not indexing them? One would use “NOINDEX,FOLLOW”, in that case. The bots would theoretically crawl the page, see the NOINDEX command, drop that page out of their index, yet still FOLLOW all the links on the page. One could even imagine that the Rank value of those pages might get released back to help apply to the rest of the indexed pages of the site as well.
But, would the spiders actually process the page in this way? Perhaps NOINDEXing the page might result in the links on the page not getting properly followed. Or, perhaps this would interfere with how the rank of the pages above would get trickled down to those linked pages?
What I found was that the use of that combination did not work out as I had hoped. NOINDEXing those vital link node pages resulted in all of the pages that they linked-to losing substantial amounts of traffic. If those terminus pages hadn’t had other links elsewhere pointing to them, I think they might’ve risked getting dropped from the indices as well! Whether the NOINDEXing broke the link weight processing or damaged the overall rankings of the pages involved is uncertain, but the overall effect pretty well established that this command combination does not work as one might expect.
I really wish each of the search engines would clearly state and document what outcomes should be expected for the various combinations allowed by the robots meta tag. The tag was provided to help us specify which pages should or shouldn’t be indexed, and the lack of transparency damages our ability to partner for the best end-user experience. The linking pages I described are not optimal for end users seeking information. We’ve done our best to make them entirely usable, but the SEs already consider those pages to be less relevant for the keywords involved, so it would be great if we could just not have them show up in SERPs at all. I’d prefer to not have users dropped on pages that are not ideal for what they’re seeking.
The robots.txt site describes the protocol as well, but doesn’t branch out into what effects may be expected from each of the four possible combinations.
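For anyone auditing which of the four combinations a page is actually sending, here is a minimal Python sketch that reads the Robots META tag out of a page and reports what the tag literally requests. The names RobotsMetaParser and robots_policy are my own, invented for illustration, and the sketch assumes that a missing tag or missing directive means the permissive default (index, follow). It only shows what the tag asks for; as described above, it says nothing about how any search engine will really treat the page.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags.

    Hypothetical helper for illustration; not part of any SE tooling.
    """

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values may be None when the attribute has no value.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_policy(html):
    """Return (index, follow) as the tag literally requests them.

    Assumption: anything not explicitly forbidden is allowed, so a page
    with no Robots META tag comes back as (True, True).
    """
    parser = RobotsMetaParser()
    parser.feed(html)
    index = "noindex" not in parser.directives
    follow = "nofollow" not in parser.directives
    return index, follow


if __name__ == "__main__":
    combos = ("index,follow", "index,nofollow",
              "noindex,follow", "noindex,nofollow")
    for content in combos:
        page = f'<head><meta content="{content}" name="ROBOTS"></head>'
        print(content, "->", robots_policy(page))
```

Running it over a crawl of your own site would at least tell you which link pages carry the NOINDEX,FOLLOW combination before you go looking for traffic changes on the pages they link to.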