<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd"
	xmlns:media="http://search.yahoo.com/mrss/"
>

<channel>
	<title>Natural Search Blog &#187; bots</title>
	<atom:link href="http://www.naturalsearchblog.com/tag/bots/rss2" rel="self" type="application/rss+xml" />
	<link>http://www.naturalsearchblog.com</link>
	<description>Thought leaders in search engine optimization weigh in with the latest SEO news and commentary</description>
	<pubDate>Fri, 05 Sep 2008 15:56:41 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.6.1</generator>
	<language>en</language>
		<!-- podcast_generator="podPress/8.8" -->
		<copyright>&#xA9; </copyright>
		<managingEditor>chris@netconcepts.com ()</managingEditor>
		<webMaster>chris@netconcepts.com()</webMaster>
		<category></category>
		<itunes:keywords></itunes:keywords>
		<itunes:subtitle></itunes:subtitle>
		<itunes:summary>Thought leaders in search engine optimization weigh in with the latest SEO news and commentary</itunes:summary>
		<itunes:author></itunes:author>
		<itunes:category text="Society &amp; Culture"/>
		<itunes:owner>
			<itunes:name></itunes:name>
			<itunes:email>chris@netconcepts.com</itunes:email>
		</itunes:owner>
		<itunes:block>No</itunes:block>
		<itunes:explicit>no</itunes:explicit>
		<itunes:image href="http://www.naturalsearchblog.com/wp-content/plugins/podpress/images/powered_by_podpress_large.jpg" />
		<image>
			<url>http://www.naturalsearchblog.com/wp-content/plugins/podpress/images/powered_by_podpress.jpg</url>
			<title>Natural Search Blog</title>
			<link>http://www.naturalsearchblog.com</link>
			<width>144</width>
			<height>144</height>
		</image>
		<item>
		<title>Yahoo&#8217;s Recent Spider Improvement Beats Google&#8217;s</title>
		<link>http://www.naturalsearchblog.com/archives/2007/06/06/yahoos-recent-spider-improvement-beats-googles/</link>
		<comments>http://www.naturalsearchblog.com/archives/2007/06/06/yahoos-recent-spider-improvement-beats-googles/#comments</comments>
		<pubDate>Wed, 06 Jun 2007 15:08:35 +0000</pubDate>
		<dc:creator>Chris Silver Smith</dc:creator>
		
		<category><![CDATA[Google]]></category>

		<category><![CDATA[Spiders]]></category>

		<category><![CDATA[Yahoo]]></category>

		<category><![CDATA[bot-detection]]></category>

		<category><![CDATA[bots]]></category>

		<category><![CDATA[Googlebot]]></category>

		<category><![CDATA[slurp]]></category>

		<category><![CDATA[spidering]]></category>

		<category><![CDATA[user-agents]]></category>

		<guid isPermaLink="false">http://www.naturalsearchblog.com/archives/2007/06/06/yahoos-recent-spider-improvement-beats-googles/</guid>
		<description><![CDATA[
Yahoo!&#8217;s Search Blog announced yesterday that they were making some final changes to their spider (named &#8220;Slurp&#8221;), standardizing their crawlers to provide a common DNS signature for identification/authorization purposes.
Previously, Slurp&#8217;s requests may have come from IP addresses associated with inktomisearch.com, and now they should all come from IPs associated with domains in this standard syntax:
[something].crawl.yahoo.net

What [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://farm2.static.flickr.com/1156/533230958_38914f7e6b_t.jpg" alt="Googlebot Spider" align="right" border="0" height="100" width="100" /></p>
<p>Yahoo!&#8217;s Search Blog <a href="http://www.ysearchblog.com/archives/000460.html" title="Yahoo! Search Blog" target="_blank">announced yesterday</a> that they were making some final changes to their spider (named &#8220;Slurp&#8221;), standardizing their crawlers to provide a common DNS signature for identification/authorization purposes.</p>
<p>Previously, Slurp&#8217;s requests may have come from IP addresses associated with inktomisearch.com, and now they should all come from IPs associated with domains in this standard syntax:</p>
<blockquote><p><strong>[something].crawl.yahoo.net</strong></p></blockquote>
<p><span id="more-257"></span></p>
<p>What will this mean to most of us? In most cases, likely nothing. Most sites out there are not likely to be currently performing reverse DNS lookups to check if search engine spiders are actually coming from the IPs/Domains they&#8217;re supposed to, except when those spiders get really impolite in requesting too many pages per second. Most people are only identifying bots by their User-Agent strings.</p>
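<p>For sites that do want to check, a minimal sketch of such a lookup in Python might look like the following. It assumes the <strong>crawl.yahoo.net</strong> suffix quoted above, and it adds a forward lookup to confirm the reverse result, a step not described in the announcement itself but commonly paired with reverse DNS checks. All function names here are hypothetical.</p>

```python
import socket


def hostname_in_domain(hostname, domain):
    """True if hostname equals domain or is a subdomain of it."""
    hostname = hostname.rstrip(".").lower()
    domain = domain.strip(".").lower()
    return hostname == domain or hostname.endswith("." + domain)


def reverse_lookup(ip):
    """Return the PTR hostname for ip, or None if there is none."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def forward_lookup(hostname):
    """Return the list of IPv4 addresses that hostname resolves to."""
    try:
        return socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return []


def is_verified_crawler(ip, domain="crawl.yahoo.net",
                        reverse=reverse_lookup, forward=forward_lookup):
    """Forward-confirmed reverse DNS check for a crawler IP.

    The domain default is an assumption taken from the syntax above;
    the resolver arguments can be swapped out for testing.
    """
    hostname = reverse(ip)
    if hostname is None or not hostname_in_domain(hostname, domain):
        return False
    # The forward lookup guards against a spoofed PTR record.
    return ip in forward(hostname)
```

<p>Comparing against the suffix with a leading dot matters: a plain substring test would also accept a hostname such as crawl.yahoo.net.example.com.</p>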
<p>In fact, Yahoo&#8217;s provision of this authoritative bot ID syntax is more advanced than Google&#8217;s! Google only <a href="http://www.google.com/support/webmasters/bin/answer.py?answer=33577&amp;topic=8460" title="Google Help on Googlebot" target="_blank">recommends</a> that people identify their bot (aka &#8220;Googlebot&#8221;) solely through the User-Agent String &#8212; a bit unsatisfactory for a lot of webmasters out there. I&#8217;ve heard quite a number of webmasters ask what IP address block to expect the Googlebot requests to originate from, and Google wouldn&#8217;t provide them with an authoritative answer.</p>
<p>Of course, one could take a visiting bot&#8217;s IP address, say &#8220;66.249.65.69&#8221;, and perform a Network WHOIS lookup on it to find out if it&#8217;s in a block owned by Google. The Network Whois for 66.249.65.69 returns the following info (lookup info provided by <a href="http://centralops.net/co/" title="Domain Dossier - DNS lookup and Network WHOIS" target="_blank">Hexillion&#8217;s Domain Dossier</a>):</p>
<blockquote><p>OrgName:    Google Inc.<br />
OrgID:      GOGL<br />
Address:    1600 Amphitheatre Parkway<br />
City:       Mountain View<br />
StateProv:  CA<br />
PostalCode: 94043<br />
Country:    US</p>
<p>NetRange:   66.249.64.0 - 66.249.95.255<br />
CIDR:       66.249.64.0/19<br />
NetName:    GOOGLE<br />
NetHandle:  NET-66-249-64-0-1<br />
Parent:     NET-66-0-0-0-0<br />
NetType:    Direct Allocation<br />
NameServer: NS1.GOOGLE.COM<br />
NameServer: NS2.GOOGLE.COM<br />
NameServer: NS3.GOOGLE.COM<br />
NameServer: NS4.GOOGLE.COM<br />
Comment:<br />
RegDate:    2004-03-05<br />
Updated:    2007-04-10</p>
<p>OrgTechHandle: ZG39-ARIN<br />
OrgTechName:   Google Inc.<br />
OrgTechPhone:  +1-650-318-0200<br />
OrgTechEmail:  <a  rel="nofollow" id="emailShroud1" stoDom="google.com" stoUser="arin-contact" href="http://www.somethinkodd.com/emailshroud/emailaddress.php?domainName=google.com&amp;userName=arin-contact&amp;ver=2.1.0" >arin-contact</a></p></blockquote>
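<p>Once a block like this is known, testing whether a visitor&#8217;s address falls inside it is simple arithmetic. The sketch below uses Python&#8217;s standard <code>ipaddress</code> module and the CIDR value from the WHOIS output above; that value is a 2007 snapshot and should be treated as an assumption rather than a current or complete list of Google ranges. The helper name <code>ip_in_block</code> is hypothetical.</p>

```python
import ipaddress

# CIDR block copied from the WHOIS output above (2007 data, assumed
# for illustration only).
GOOGLE_BLOCK = ipaddress.ip_network("66.249.64.0/19")


def ip_in_block(ip, block=GOOGLE_BLOCK):
    """True if the dotted-quad string ip falls inside the CIDR block."""
    return ipaddress.ip_address(ip) in block


# The example address from the lookup above, and one just past the range.
print(ip_in_block("66.249.65.69"))
print(ip_in_block("66.249.96.1"))
```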
<p>While webmasters could do this lookup for requests from bots displaying the Googlebot user-agent string, it&#8217;s still very unsatisfactory because Google does not state that all requests necessarily come from IP blocks that are identifiably owned by Google. So, webmasters would be nervous about blocking something that claimed to be Googlebot yet came from non-Google IP address ranges. After all, it&#8217;s possible that Google could have purchased IP addresses and domain names through a proxy in order to perform various types of investigative page requests on sites.</p>
<p>There are cases where hostile dataminers will set their user-agent strings up to masquerade as major search engine spiders, so this newly authoritative method for IDing the bots places Yahoo one step ahead of the game for those webmasters who feel the need to ban the bad guys who are scraping their site&#8217;s content or requesting pages fast enough to be a de facto denial of service attack.</p>
<p align="center">. . . . . . . . . . . . . . . . . . . .</p>
<p><strong><font color="red">UPDATE:</font></strong> <a href="http://incredibill.blogspot.com/" title="incrediBILL's blog">incrediBILL</a>, one of the moderators at WebmasterWorld, kindly pointed out to me that Matt Cutts had provided the same sort of <a href="http://googlewebmastercentral.blogspot.com/2006/09/how-to-verify-googlebot.html" title="Googlebot Authentication Method" target="_blank">Googlebot authentication method</a> via the Webmaster Central Blog not long ago. I wish that Google would update their webmaster help section to reflect the same information, if this is indeed intended to be a trustworthy method for authenticating Googlebot. With the instruction only to be found in the blog and not in the actual help section, it still leaves one with the uncomfortable feeling that it&#8217;s perhaps an informal method and might still not be depended upon to be true for all cases or it could abruptly change. Hopefully, they&#8217;ll update the help pages so everything will be in sync!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.naturalsearchblog.com/archives/2007/06/06/yahoos-recent-spider-improvement-beats-googles/feed/</wfw:commentRss>
		</item>
		<item>
		<title>AdSense Spider Cross-Pollinates for Google</title>
		<link>http://www.naturalsearchblog.com/archives/2006/04/19/adsense-spider-cross-pollinates-for-google/</link>
		<comments>http://www.naturalsearchblog.com/archives/2006/04/19/adsense-spider-cross-pollinates-for-google/#comments</comments>
		<pubDate>Thu, 20 Apr 2006 02:05:24 +0000</pubDate>
		<dc:creator>Chris Silver Smith</dc:creator>
		
		<category><![CDATA[Google]]></category>

		<category><![CDATA[Spiders]]></category>

		<category><![CDATA[AdSense]]></category>

		<category><![CDATA[bots]]></category>

		<category><![CDATA[Googlebot]]></category>

		<category><![CDATA[Robots.txt]]></category>

		<category><![CDATA[URL-submission]]></category>

		<guid isPermaLink="false">http://www.naturalsearchblog.com/archives/2006/04/19/adsense-spider-cross-pollinates-for-google/</guid>
		<description><![CDATA[A few bloggers such as Jenstar have just posted that pages spidered by Google&#8217;s AdSense bot are appearing in Google&#8217;s regular search results pages. Shoemoney just blogged that Matt Cutts has officially verified that this is happening, saying that this was done so that they wouldn&#8217;t have to spider the same content twice, and that [...]]]></description>
			<content:encoded><![CDATA[<p>A few bloggers such as Jenstar have <a href="http://www.jensense.com/archives/2006/04/matt_cutts_conf.html">just posted</a> that pages spidered by Google&#8217;s AdSense bot are appearing in Google&#8217;s regular search results pages. <a href="http://www.shoemoney.com/2006/04/18/matt-cutts-confirms-media-bot-crawling-for-big-daddy">Shoemoney just blogged</a> that Matt Cutts has officially verified that this is happening, saying that this was done so that they wouldn&#8217;t have to spider the same content twice, and that Google did this as part of their recent Big Daddy infrastructure improvements.</p>
<p>This has a couple of interesting ramifications for SEO professionals and those of us who are optimizing our sites for Google, since bot detection systems may now need to be updated and since this may essentially be a new way of committing site/page submissions into Google&#8217;s indices.  And we all thought automated URL submissions were dead!  I&#8217;ll explain further&#8230;.<span id="more-123"></span></p>
<p>First of all, quite a few people like to track the movements of bots through their site pages, in order to know the frequency of spider visits, and to confirm that a page has been spidered, period. For sites/pages which have frequent updates happening upon them, it&#8217;s also useful to know the date/time the page gets re-spidered and then to see when the updated text will typically appear in the SERPs. Also, some folks have set their robots.txt to disallow spiders into sections of their sites for various reasons.</p>
<p>So, this change in Google&#8217;s spidering functionality will be important for you if you have AdSense ads running on your site. You&#8217;ll want to update your robots.txt file to reflect the AdSense bot&#8217;s user agent string, and you&#8217;ll also want to make sure this user-agent string definition is present in the logfile analysis systems you&#8217;re using to track spider activity on your site.</p>
<p>Many of us are using the Web Robots Database at <a href="http://www.robotstxt.org/">Robotstxt.org</a> in order to identify bots and spiders passing through our sites, and it&#8217;s a great resource for all information about the robots exclusion protocol and related matters &#8212; Google even <a href="http://www.google.com/webmasters/bot.html">cites them as an information resource throughout their webmaster info pages</a>. However, the Robots Database has not been updated to include the definition of the AdSense bot as of the time that I&#8217;m writing this. (I&#8217;ve just reported the bot identification information over to them to add in, so hopefully this won&#8217;t be the case for long.)</p>
<p>(Some webmasters and systems are using the IP address of the bots instead of the User-Agent Strings, but I consider the preferred method to be to use the User Agent for this purpose. Otherwise, you risk counting a search engine employee who is browsing your site during their coffee break as their spider visiting you!)</p>
<p>Matt Cutts is apparently informally referring to this bot as &#8220;Mediabot&#8221; or &#8220;Media Bot&#8221;, but the bot is currently declaring itself with this User-Agent String:</p>
<blockquote><p><font face="courier">Mediapartners-Google/2.1</font></p></blockquote>
<p>If you want to specifically disallow this bot from some section of your site, you should wildcard the bot version number at the end in your robots.txt file like this:</p>
<blockquote><p><font face="courier">User-agent: Mediapartners-Google*<br />
Disallow: /dont-crawl-this-uri-on-my-site</font></p></blockquote>
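<p>One way to sanity-check a rule like this before deploying it is to run it through a robots.txt parser. The sketch below uses Python&#8217;s standard-library <code>urllib.robotparser</code> as a stand-in; it is not Google&#8217;s parser and may interpret rules differently. It assumes the agent name is written without the trailing asterisk, because this particular parser matches on the name before the slash and, as far as I can tell, does not treat a trailing asterisk as a wildcard. <code>build_parser</code> is a hypothetical helper name.</p>

```python
from urllib.robotparser import RobotFileParser

# Same rule as above, minus the trailing asterisk (an assumption made
# for this parser, not a statement about how Google reads the file).
ROBOTS_TXT = """\
User-agent: Mediapartners-Google
Disallow: /dont-crawl-this-uri-on-my-site
"""


def build_parser(robots_txt):
    """Parse robots.txt text held in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# May the AdSense bot fetch the disallowed path?
adsense_allowed = parser.can_fetch("Mediapartners-Google/2.1",
                                   "/dont-crawl-this-uri-on-my-site")
# A different bot is not covered by this rule.
other_allowed = parser.can_fetch("Googlebot/2.1",
                                 "/dont-crawl-this-uri-on-my-site")
print(adsense_allowed, other_allowed)
```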
<p>Another interesting point is raised due to the indexing of the Mediabot-spidered content:  doesn&#8217;t this basically provide a new way to, errrr, <strong>automatically submit pages to Google</strong>?!?</p>
<p>If you manually submit your site to Google using <a href="http://www.google.com/addurl/?continue=/addurl">their submission form</a>, you&#8217;re only allowed to provide the top-level domain name of your site. Using that method, Googlebot will initially visit your homepage, and likely only crawl through one or two levels of links out from the homepage in that initial spidering visit. If you&#8217;ve got a really deep site with thousands of pages of content, Googlebot might later revisit the site to try to spider more deeply, in a widening circle out from the homepage.  This existing process could ultimately take quite some time before all of your content gets spidered and can begin appearing in the SERPs.</p>
<p>I&#8217;m thinking that for cases like that where you have a lot of pages on a new or non-indexed site, adding the Google ads onto all your pages might actually result in them getting initially spidered more rapidly.</p>
<p>Also, quite a lot more pages could potentially get indexed if they have the ads on them, since there are situations where Googlebot will not or cannot spider pages on sites.  For instance, if your site content is accessible primarily only through a submission form on your homepage, or through Java/Javascripted menus, or you have only some Flash-enabled navigation system (of course, no SEO professional worth his or her salt would use a Flash-only nav system!) &#8212; if your site pages aren&#8217;t navigable through regular links displayed on your pages, Googlebot would otherwise never find and index their content.</p>
<p>But now, pages that were only accessible to users through a submission form on your site could potentially get indexed and appear in SERPs if they have AdSense ads on them!</p>
<p>This is mostly a good thing for those of us working hard to expose content through the SEs, but I bet it could create some havoc for webmasters who are AdSense publishers and who are taken unaware by the potential sudden influx in traffic which pounds their databases as pages suddenly become visible in SERPs.  Fun problem to have, though!</p>
<p>Google apparently uses other bots specifically for harvesting other types of content as well.  Two that I&#8217;ve come across are a bot which grabs images from websites to use in their Google Images section, and a bot for gathering RSS feeds for use in the personalized Google homepage or in Google Reader.</p>
<p>These are also not identified in Robotstxt.org yet, but their User-Agent strings are as follows:</p>
<blockquote><p><font face="courier">User-agent: Googlebot-Image<br />
User-agent: Feedfetcher-Google</font></p></blockquote>
<p>Note: Feedfetcher <strong>ignores</strong> robots.txt exclusion files! Google does this because:</p>
<blockquote><p><em>Feedfetcher retrieves feeds only after users have explicitly added them to their Google homepage or Google Reader. Feedfetcher behaves as a direct agent of the human user, not as a robot, so it ignores robots.txt entries. Feedfetcher does have one special advantage, though: because it&#8217;s acting as the agent of multiple users, it conserves bandwidth by making requests for common feeds only once for all users.</em></p></blockquote>
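<p>For the log-analysis side mentioned earlier, the User-Agent strings in this post could be tallied with something like the sketch below. It is only an illustration: the token list is limited to the names quoted here, the function names are hypothetical, and, as noted above, a User-Agent string can be set to anything by whoever sends the request.</p>

```python
from collections import Counter

# Tokens taken from this post; the more specific name is listed first so
# that "Googlebot-Image" is not counted as plain "Googlebot".
BOT_TOKENS = [
    "Mediapartners-Google",
    "Feedfetcher-Google",
    "Googlebot-Image",
    "Googlebot",
]


def classify_user_agent(user_agent):
    """Return the first matching bot token, or None for other clients."""
    lowered = user_agent.lower()
    for token in BOT_TOKENS:
        if token.lower() in lowered:
            return token
    return None


def count_bot_hits(user_agents):
    """Tally bot visits from an iterable of User-Agent strings."""
    hits = Counter()
    for user_agent in user_agents:
        bot = classify_user_agent(user_agent)
        if bot is not None:
            hits[bot] += 1
    return hits
```

<p>Feeding <code>count_bot_hits</code> the User-Agent column of an access log gives a per-bot visit count; requests from ordinary browsers are skipped.</p>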
<p>Do you know if Google is using other specialized bots for other sections of their site or other types of media? If so, I&#8217;d be interested in hearing about it.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.naturalsearchblog.com/archives/2006/04/19/adsense-spider-cross-pollinates-for-google/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
