A few bloggers, such as Jenstar, have just posted that pages spidered by Google’s AdSense bot are appearing in Google’s regular search results pages. Shoemoney just blogged that Matt Cutts has officially confirmed this, explaining that Google made the change as part of its recent Big Daddy infrastructure improvements so that it wouldn’t have to spider the same content twice.
This has a couple of interesting ramifications for SEO professionals and for those of us optimizing our sites for Google: bot detection systems may now need to be updated, and this may essentially be a new way of submitting sites and pages into Google’s indices. And we all thought automated URL submissions were dead! I’ll explain further.
First of all, quite a few people like to track the movements of bots through their site pages, both to know how frequently spiders visit and to confirm that a page has been spidered at all. For pages that are updated frequently, it’s also useful to know the date and time a page gets re-spidered, and then to see how long the updated text typically takes to appear in the SERPs. Also, some folks have set their robots.txt to disallow spiders from sections of their sites for various reasons.
So, this change in Google’s spidering functionality matters to you if you have AdSense ads running on your site. You’ll want to update your robots.txt file to account for the AdSense bot’s User-Agent string, and you’ll also want to make sure that string is defined in the logfile analysis systems you use to track spider activity on your site.
Many of us use the Web Robots Database at Robotstxt.org to identify bots and spiders passing through our sites, and it’s a great resource for information about the robots exclusion protocol and related matters. Google even cites it as an information resource throughout its webmaster info pages. However, as of the time I’m writing this, the Robots Database has not been updated to include a definition of the AdSense bot. (I’ve just reported the bot’s identification information to them, so hopefully this won’t be the case for long.)
(Some webmasters and systems identify bots by IP address instead of by User-Agent string, but I consider the User-Agent the preferred method for this purpose. Otherwise, you risk counting a search engine employee who browses your site during a coffee break as a spider visit!)
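If you’re rolling your own spider tracking rather than relying on a stats package, here’s a rough Python sketch of the idea: scan your access log for bot User-Agent tokens and tally visits per page, keeping the timestamp of the most recent one. It assumes the Apache-style “combined” log format. The `Mediapartners-Google` token for the AdSense bot and the function name `tally_bot_visits` are my own assumptions for illustration, so check them against what actually shows up in your logs.

```python
import re
from collections import Counter

# Tokens to look for inside the User-Agent field. "Mediapartners-Google" is an
# assumed token for the AdSense bot; adjust it to whatever your logs show.
BOT_TOKENS = ("Mediapartners-Google", "Googlebot")

# "Combined" log format: timestamp in brackets, then the request line,
# status, size, referrer, and finally the quoted User-Agent.
LOG_LINE = re.compile(
    r'\[(?P<when>[^\]]+)\] "(?:GET|HEAD|POST) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def tally_bot_visits(lines, tokens=BOT_TOKENS):
    """Count requests per (bot token, path) and remember the last visit time."""
    counts = Counter()
    last_seen = {}
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # not a line in the expected format
        agent = match.group("agent").lower()
        for token in tokens:
            if token.lower() in agent:
                key = (token, match.group("path"))
                counts[key] += 1
                last_seen[key] = match.group("when")
                break  # attribute each request to one bot only
    return counts, last_seen
```

Feed it an open logfile (for example `tally_bot_visits(open("access.log"))`, where the filename is just a placeholder) and compare the last-seen times against when your updated text shows up in the SERPs.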
Matt Cutts is apparently informally referring to this bot as “Mediabot” or “Media Bot”, but the bot is currently declaring itself with this User-Agent String:
If you want to specifically disallow this bot from some section of your site, you should wildcard the bot version number at the end in your robots.txt file like this:
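As a sanity check on a rule like that, here is a minimal Python sketch using the standard library’s `urllib.robotparser`. The `Mediapartners-Google` token, the `/2.1` version suffix, and the `/members/` path are assumptions for illustration only. Note that this particular parser matches on the name before the slash, so the rule in the sketch names the bot without a version number or trailing wildcard; other parsers may behave differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep the (assumed) AdSense bot out of /members/,
# and leave every other robot unrestricted.
ROBOTS_TXT = """\
User-agent: Mediapartners-Google
Disallow: /members/

User-agent: *
Disallow:
"""


def can_fetch(user_agent, path, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt lets user_agent request path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for agent in ("Mediapartners-Google/2.1", "Googlebot/2.1"):
        for path in ("/members/profile.html", "/articles/seo.html"):
            print(agent, path, can_fetch(agent, path))
```

Running a check like this before you deploy a robots.txt change is cheaper than waiting to see what turns up in the index.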
The indexing of Mediabot-spidered content raises another interesting point: doesn’t this basically provide a new way to, errrr, automatically submit pages to Google?!?
If you manually submit your site to Google using their submission form, you’re only allowed to provide the top-level domain name of your site. Using that method, Googlebot will initially visit your homepage, and likely only crawl through one or two levels of links out from the homepage in that initial spidering visit. If you’ve got a really deep site with thousands of pages of content, Googlebot might later revisit the site to try to spider more deeply, in a widening circle out from the homepage. With this existing process, it could ultimately take quite some time before all of your content gets spidered and can begin appearing in the SERPs.
I’m thinking that for cases like that where you have a lot of pages on a new or non-indexed site, adding the Google ads onto all your pages might actually result in them getting initially spidered more rapidly.
But pages that were previously accessible to users only through a submission form on your site could now potentially get indexed and appear in SERPs if they have AdSense ads on them!
This is mostly a good thing for those of us working hard to expose content through the SEs, but I bet it could create some havoc for webmasters who are AdSense publishers and who get caught unaware by a sudden influx of traffic pounding their databases as pages suddenly become visible in SERPs. Fun problem to have, though!
Google apparently uses other bots specifically for harvesting other types of content as well. Two that I’ve come across are a bot which grabs images from websites for use in Google Images, and a bot which gathers RSS feeds for use in the personalized Google homepage or in Google Reader.
These are also not identified in Robotstxt.org yet, but their User-Agent strings are as follows:
User-agent: Googlebot-Image
User-agent: Feedfetcher-Google
Note: Feedfetcher ignores robots.txt exclusion files! Google does this because:
Feedfetcher retrieves feeds only after users have explicitly added them to their Google homepage or Google Reader. Feedfetcher behaves as a direct agent of the human user, not as a robot, so it ignores robots.txt entries. Feedfetcher does have one special advantage, though: because it’s acting as the agent of multiple users, it conserves bandwidth by making requests for common feeds only once for all users.
Do you know if Google is using other specialized bots for other sections of their site or other types of media? If so, I’d be interested in hearing about it.