Natural Search Blog Yahoo's Recent Spider Improvement Beats Google's


Yahoo!’s Search Blog announced yesterday ^[1] that they were making some final changes to their spider (named “Slurp”), standardizing their crawlers to provide a common DNS signature for identification and authorization purposes.

Previously, Slurp’s requests may have come from IP addresses associated with inktomisearch.com. Now they should all come from IPs whose hostnames follow this standard syntax:

[something].crawl.yahoo.net
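As a rough illustration of what matching that syntax could look like, here is a minimal Python sketch. The function name and the sample hostnames are hypothetical, and the check only tests the shape of a hostname string; it says nothing about whether the hostname was obtained in a trustworthy way.

```python
def is_yahoo_crawl_host(hostname: str) -> bool:
    """Return True if hostname has the form [something].crawl.yahoo.net.

    Hypothetical helper for illustration; it inspects the string only.
    """
    # DNS names are case-insensitive and may carry a trailing dot.
    name = hostname.strip().lower().rstrip(".")
    suffix = ".crawl.yahoo.net"
    # The leading dot in the suffix rejects look-alikes such as
    # "evil-crawl.yahoo.net"; the length check requires a non-empty
    # [something] label in front of the suffix.
    return name.endswith(suffix) and len(name) > len(suffix)


if __name__ == "__main__":
    # Made-up hostnames, not real crawler names.
    for host in ("a1.crawl.yahoo.net", "crawl.yahoo.net.example.com"):
        print(host, is_yahoo_crawl_host(host))
```

Anchoring the match at the end of the name matters: a plain substring test would accept a hostname like the second one above, which merely contains the Yahoo domain.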

What will this mean for most of us? In most cases, likely nothing. Most sites are probably not performing reverse DNS lookups to check whether search engine spiders are actually coming from the IPs and domains they’re supposed to, except when those spiders get really impolite and request too many pages per second. Most people identify bots only by their User-Agent strings.

In fact, Yahoo’s provision of this authoritative bot ID syntax is more advanced than Google’s! Google only recommends ^[2] that people identify its bot (aka “Googlebot”) through the User-Agent string, which is a bit unsatisfactory for a lot of webmasters out there. I’ve heard quite a number of webmasters ask what IP address block to expect Googlebot requests to originate from, and Google wouldn’t provide them with an authoritative answer.

Of course, one could take a visiting bot’s IP address, say “66.249.65.69”, and perform a Network WHOIS lookup on it to find out whether it’s in a block owned by Google. The Network WHOIS for 66.249.65.69 returns the following info (lookup info provided by Hexillion’s Domain Dossier ^[3]):

OrgName: Google Inc.
OrgID: GOGL
Address: 1600 Amphitheatre Parkway
City: Mountain View
StateProv: CA
PostalCode: 94043
Country: US

NetRange: 66.249.64.0 – 66.249.95.255
CIDR: 66.249.64.0/19
NetName: GOOGLE
NetHandle: NET-66-249-64-0-1
Parent: NET-66-0-0-0-0
NetType: Direct Allocation
NameServer: NS1.GOOGLE.COM
NameServer: NS2.GOOGLE.COM
NameServer: NS3.GOOGLE.COM
NameServer: NS4.GOOGLE.COM
Comment:
RegDate: 2004-03-05
Updated: 2007-04-10

OrgTechHandle: ZG39-ARIN
OrgTechName: Google Inc.
OrgTechPhone: +1-650-318-0200
OrgTechEmail: arin-contact@google.com

While webmasters could do this lookup for requests from bots displaying the Googlebot user-agent string, it’s still very unsatisfactory because Google does not state that all requests necessarily come from IP blocks that are identifiably owned by Google. So, webmasters would be nervous about blocking something that claimed to be Googlebot yet came from non-Google IP address ranges. After all, it’s possible that Google could have purchased IP addresses and domain names through a proxy in order to perform various types of investigative page requests on sites.
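For anyone who wants to automate the range check anyway, here is a minimal Python sketch using the standard library’s ipaddress module. The CIDR block is copied from the WHOIS output above; the function and constant names are hypothetical. A match only shows that an address falls inside that one block, and a miss does not prove the visitor isn’t Google, for the reasons just described.

```python
import ipaddress

# CIDR block taken from the WHOIS record quoted above (NetName: GOOGLE).
GOOGLE_BLOCK = ipaddress.ip_network("66.249.64.0/19")


def in_block(ip: str, block: ipaddress.IPv4Network = GOOGLE_BLOCK) -> bool:
    """Return True if the IPv4 address string falls inside the given block."""
    return ipaddress.ip_address(ip) in block


if __name__ == "__main__":
    print(in_block("66.249.65.69"))  # the example address from the lookup
```

The block would have to be kept up to date by hand, which is part of why a published, authoritative identification method is more appealing.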

There are cases where hostile dataminers set their user-agent strings to masquerade as major search engine spiders, so this newly authoritative method for IDing the bots places Yahoo one step ahead of the game for those webmasters who feel the need to ban the bad guys who are scraping their site’s content or requesting pages fast enough to amount to a de facto denial-of-service attack.
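A DNS-based check of this kind is commonly done in two steps: a reverse lookup on the visiting IP, then a forward lookup on the returned hostname to confirm that it maps back to the same IP. The following Python sketch shows one way that could look. The function name and its parameters are hypothetical, the two-step procedure is my assumption about how a webmaster would apply the announced syntax rather than something spelled out in Yahoo’s post, and the default forward lookup handles IPv4 only.

```python
import socket


def verify_crawler(ip, allowed_suffix, reverse=None, forward=None):
    """Return True if ip reverse-resolves to a host ending in allowed_suffix
    and that host forward-resolves back to the same ip.

    reverse/forward are injectable lookups (hypothetical parameters) so the
    logic can be exercised without touching the network.
    """
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse(ip).lower().rstrip(".")
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return False
    # A PTR record is controlled by whoever owns the IP block, so the
    # hostname alone proves nothing until the forward lookup confirms it.
    if not host.endswith(allowed_suffix):
        return False
    try:
        return ip in forward(host)
    except OSError:
        return False


if __name__ == "__main__":
    # Performs live DNS lookups; the address is the example used earlier.
    print(verify_crawler("66.249.65.69", ".crawl.yahoo.net"))
```

Since each check can cost two DNS round trips, it would make sense to cache results per IP and to run the check only for requests whose user-agent claims to be a major spider.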

. . . . . . . . . . . . . . . . . . . .

UPDATE: incrediBILL ^[4], one of the moderators at WebmasterWorld, kindly pointed out to me that Matt Cutts had provided the same sort of Googlebot authentication method ^[5] via the Webmaster Central Blog not long ago. I wish that Google would update their webmaster help section to reflect the same information, if this is indeed intended to be a trustworthy method for authenticating Googlebot. With the instruction found only in the blog and not in the actual help section, one is left with the uncomfortable feeling that it’s perhaps an informal method that might not hold true in all cases, or that could change abruptly. Hopefully, they’ll update the help pages so everything will be in sync!