Natural Search Blog


Search Engine Crawling and Indexing Factors

Today's post is about getting a site crawled and indexed effectively by the major search engines. It can be frustrating for a site owner to find that her newly built site, with all its bells and whistles, is just not appearing on the Google SERPs for search queries relevant to her business.

It helps to understand the factors that influence how a site is crawled and indexed, since both must happen before the site can rank on the SERPs. A site can be built in a way that is user friendly and also lets the spiders know what to crawl and how frequently to crawl it.

Crawling Factors

  1. Links:
    All major search engines crawl the web by following links. If a site has a good link structure, starting broadly at the top and going down to the category and sub-category level, with all the money pages no more than three to four clicks away from the home page, the bots will find the site much easier to crawl. Linking to a sitemap from the home page further helps the bots find all the content on the site.
  2. Content Freshness and Updates:
    Regularly updated content is one of the best ways to keep the bots coming back to your site. A blog goes a long way towards achieving this. To Googlebot, new content is a sign that the site deserves more attention, so it visits more often.

    Google’s algorithm also has a Query Deserves Freshness (QDF) component that rewards sites with updated content (news sites, for example), which in turn invites the bots back to the site for repeated crawling and indexing.

  3. Feeds:
    If a site has a regularly updated blog or fresh articles posted on it at regular intervals, it would be ideal to have a feed and export it. Google Blog Search and feed tracking help in increasing the crawl activity. When a new post or article is published on the site, the search engine is pinged to let it know that the content has been updated.
  4. Importance of Domain:
    A powerful domain that has good quality links coming in from diverse trustworthy domains is very important and it affects both the crawl rate and indexing of the site that resides on that domain.
  5. Technical Factors:
    A site can contain spider traps, such as linking structures that form infinite loops. Crawling can be interrupted by broken links. A CMS can also create duplicate content by serving the same content on multiple URLs. All of these factors limit a bot’s ability to crawl the site exhaustively.
  6. Increase the Crawl Rate in Google Webmaster Tools:
    If you log in to Google Webmaster Tools, there is a setting to increase the spider’s crawl rate. It is small consolation if the site is affected by the problems listed above. On its own, it cannot influence the crawl rate to any great extent.
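
The link-structure and spider-trap points above can be checked with a small script. This is only a minimal sketch: it assumes you already have a map of which pages link to which (the `site` dictionary and its URLs below are hypothetical), and it measures click depth from the home page with a breadth-first walk. Pages already seen are skipped, so a loop in the linking structure cannot trap the walk.

```python
from collections import deque


def click_depths(links, home):
    """Return the minimum number of clicks from `home` to each reachable page."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            # Skip pages already seen, so link loops cannot trap the walk.
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


# Hypothetical site structure: page -> pages it links to.
site = {
    "/": ["/shoes", "/bags", "/sitemap"],
    "/shoes": ["/shoes/running", "/"],
    "/shoes/running": ["/shoes/running/model-x", "/shoes"],
    "/bags": ["/"],
    "/sitemap": ["/shoes", "/bags", "/shoes/running/model-x"],
    "/orphan": ["/"],
}

depths = click_depths(site, "/")

# Pages more than four clicks from the home page.
too_deep = sorted(page for page, depth in depths.items() if depth > 4)

# Pages that no chain of links from the home page ever reaches.
unreachable = sorted(set(site) - set(depths))

print(depths)
print("Too deep:", too_deep)
print("Unreachable:", unreachable)
```

In this made-up example the sitemap link shortens the path to the deepest money page, and the page that nothing links to shows up as unreachable, which is the kind of page a bot following links would never find.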

Indexing Factors
You can picture Google (and most likely Yahoo and Bing) as consisting of a Main Index and a Supplemental Index. The main index holds the pages that make up the top 10 or 20 results served for important search queries.

If Google thinks a page is not relevant or not of high quality, it places it in its Supplemental Index. This index is not visible, however. A good post by Aaron Wall on the Supplemental Index will give you a better idea.

A third scenario is one where a site’s pages are crawled and then dropped from the index.

  1. Content That is Valuable and Unique:
    To have your pages in the main index, you must provide valuable and unique content. Google is extremely good at identifying content that is unique. Gone are the days when content could be scraped and an introduction and conclusion added to make it look unique. Content that is engaging and valuable definitely plays a big part in keeping a page in the main index.
  2. Domain Importance:
    If a domain has a variety of good, trustworthy domains pointing to it, that helps Google retain its pages in the main index. A good example is Wikipedia. Some of its pages with just one line of content, or with duplicate content, get ranked at the top of the SERPs. This is due to the domain trust and authority that Wikipedia commands.
  3. PageRank:
    PageRank, or raw link juice, is determined by the number of links pointing to your site and their importance. The internal linking structure on your site is also part of the calculation. PageRank sculpting has become popular over the past few years as a way to direct link juice to the important pages on a site: links to pages perceived as less important are nofollowed. Aaron Wall has remarked that a certain PageRank threshold is required for a page to be crawled and indexed by Google.
  4. External Links:
    If your site lacks unique, quality content, sits on a not-so-strong domain and does not have enough PR, a few backlinks to your money pages from good or even lesser-known domains can be enough to get them indexed by Google.

    I have often seen new sites that are isolated on the web because they do not link out and have no incoming links. A simple step such as submitting the site to a popular local directory that the search engines crawl regularly is enough to get its pages indexed by Google.

    When the search engines see backlinks to your site, they treat them as votes for the site, since someone on the web thinks that your site is important. That is a key factor in getting your pages indexed and retained in the index.

  5. Other Signals:
    There is a belief in the wider search community that search and traffic volumes to a site, the number of clicks its pages earn on the SERPs, the average time spent on the site and so on are all signals that search engines use when deciding whether to retain pages in their index.

    If the number of visits to a site is increasing steadily and users are spending more time on the site, it is logical to assume that its pages will be found relevant to search queries and retained in the search engines’ index. There is no official confirmation of this from the search engines themselves, and there is always the prospect of these signals getting noisy over time.
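
To make the PageRank point above more concrete, here is a rough sketch of the textbook PageRank calculation. It is a simplified version of the originally published formula, not what Google runs in production. The damping value of 0.85 is the commonly cited one, and the `graph` of pages is hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified textbook PageRank over a dict of page -> outgoing links."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the small "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly across its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical internal link structure.
graph = {
    "home": ["category", "about"],
    "category": ["product", "home"],
    "product": ["home"],
    "about": ["home"],
}

ranks = pagerank(graph)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In this toy graph every page links back to the home page, so it collects the most rank. It also shows why internal linking matters: a page can only pass on what it has, split across however many links it carries.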

Rand has done a great video on this in his Whiteboard Friday post on Search Engine Crawling and Indexing factors.

Ravi Venkatesan is a senior SEO consultant at Netconcepts, a well established Auckland SEO company that has a great track record of optimising client sites for organic search and delivering great results.

Posted by of Netconcepts Ltd. on 07/05/2009

Filed under: Search Engine Optimization, SEO, Spiders

5 comments for Search Engine Crawling and Indexing Factors »

  1. I bookmarked this page. Thank you for giving this…

    Comment by Seo Firm — 7/6/2009 @ 3:29 am


  2. Great post. In your section on “Feeds” you write: “If a site has a regularly updated blog or fresh articles posted on it at regular intervals, it would be ideal to have a feed and export it.” My question is: how would you do this (export it) on a web site that is not a blog? Would one just manually post a relevant story on a regular basis, link to a blog, or embed an RSS feed from a blog? And if so, are all of these ways crawlable?

    Many thanks,
    Joel

    Comment by joel — 7/6/2009 @ 4:50 am


  3. Strange. I had a comment earlier and it’s not here. My question is about the “Feeds” section, where you say: “If a site has a regularly updated blog or fresh articles posted on it at regular intervals, it would be ideal to have a feed and export it.” If it’s a website, do you just write a post on the web site, or do you embed an RSS feed from one’s blog into the website, and will Google then crawl it?

    Thanks.

    Comment by Joel — 7/6/2009 @ 12:44 pm


  4. Is there a way to get Google to crawl a page, but not add that one page to its index (i.e. just indexing the linked pages, not the page itself)?

    Comment by Ron — 10/9/2009 @ 4:36 pm


  5. Hi Ron,
    Assuming that you have Page A with 2 links (link1 and link2) to other pages on your site or on other sites, you want only the pages pointed to by the links to be indexed.

    If this is the case, then you can put a meta robots tag on Page A that says
    <meta name="robots" content="noindex, follow">. This will keep Page A out of the index but will still follow the links (link1 and link2) on Page A. Hope this answers your question.
    Cheers
    Ravi

    Comment by Ravi — 10/11/2009 @ 1:28 pm
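
If you want to check that a page really carries the directives described in the reply above, a few lines of Python will do it. This is a rough sketch that uses only the standard library. The `page_a` markup is a made-up stand-in for Page A, and it assumes the tag uses the standard `noindex, follow` values.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag on a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            for token in attr_map.get("content", "").split(","):
                token = token.strip().lower()
                if token:
                    self.directives.add(token)


def robots_directives(html):
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical markup for Page A.
page_a = """<html><head>
<meta name="robots" content="noindex, follow">
</head><body><a href="/link1">link1</a> <a href="/link2">link2</a></body></html>"""

directives = robots_directives(page_a)

# "none" is shorthand for "noindex, nofollow".
indexable = "noindex" not in directives and "none" not in directives
links_followed = "nofollow" not in directives and "none" not in directives

print("Directives:", sorted(directives))
print("Indexable:", indexable)
print("Links followed:", links_followed)
```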

