The post today is about getting a site crawled and indexed effectively by the major search engines. It can be frustrating for a site owner to find that her newly built site with bells and whistles is just not appearing on the Google SERPs for a search query relevant to her business.
It helps to understand the factors that influence how a site is crawled and indexed, since both must happen before the site can rank on the SERPs. A site can be built in a spider-friendly way that tells the bots what to crawl and how frequently to crawl it.
All major search engines crawl the web by following links. If a site has a good link structure that starts broadly at the top and goes down to the category and sub-category level, with all the money pages three to four clicks away from the home page, the bots will find the site a lot easier to crawl. Placing a sitemap on the home page further helps the bots find all the content on the site.
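To make the "three to four clicks from the home page" idea concrete, here is a minimal sketch that measures click depth with a breadth-first search. It assumes you already have the internal links of the site as a dictionary; the `site` data, the URLs and the function names are all made up for illustration and do not come from any crawler or SEO tool.

```python
from collections import deque


def click_depths(links, home):
    """Breadth-first search from the home page.

    `links` maps each page to the pages it links to (hypothetical data).
    Returns {page: minimum number of clicks from `home`}.
    """
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:  # first visit is the shortest path
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def pages_deeper_than(links, home, max_clicks=4):
    """List reachable pages that sit more than `max_clicks` from home."""
    depths = click_depths(links, home)
    return sorted(page for page, d in depths.items() if d > max_clicks)


# Hypothetical link structure: home -> category -> sub-category -> money page
site = {
    "/": ["/shoes/", "/bags/"],
    "/shoes/": ["/shoes/running/"],
    "/shoes/running/": ["/shoes/running/trail-x"],
    "/bags/": [],
}

if __name__ == "__main__":
    for page, depth in sorted(click_depths(site, "/").items()):
        print(depth, page)
```

Pages that never show up in the result are not reachable by links from the home page at all, which is usually a bigger problem than being one click too deep.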
- Content Freshness and Updates:
This is one of the best ways to keep the bots coming back to your site regularly. It is vital to have fresh content updated regularly on a site, and a blog will go a long way in achieving this. To Googlebot, new content is a signal to attach more importance to the site and to visit it more often.
There is a Query Deserves Freshness (QDF) component in Google’s algorithm that rewards sites with updated content (news sites, for example), which in turn invites the bots back to the site for repeated crawling and indexing.
If a site has a regularly updated blog or fresh articles posted on it at regular intervals, it would be ideal to have a feed and export it. Google Blog Search and feed tracking help in increasing the crawl activity. When a new post or article is published on the site, the search engine is pinged to let it know that the content has been updated.
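As a rough illustration of what "having a feed" means in practice, the sketch below builds a bare-bones RSS 2.0 document with Python's standard library. The site name, URLs and post data are invented for the example, and a real blog platform would normally generate the feed for you; the ping step is left out because it depends on the service you notify.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime


def build_rss(site_title, site_url, description, posts):
    """Return a minimal RSS 2.0 feed as a string.

    `posts` is a list of dicts with "title", "url" and a timezone-aware
    "published" datetime (a made-up structure for this sketch).
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = site_title
    ET.SubElement(channel, "link").text = site_url
    ET.SubElement(channel, "description").text = description
    for post in posts:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = post["title"]
        ET.SubElement(item, "link").text = post["url"]
        # RSS expects RFC 822 style dates, which format_datetime produces
        ET.SubElement(item, "pubDate").text = format_datetime(post["published"])
    return ET.tostring(rss, encoding="unicode")


# Hypothetical post data
example_posts = [
    {
        "title": "New post",
        "url": "https://www.example.com/blog/new-post",
        "published": datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc),
    }
]

if __name__ == "__main__":
    print(build_rss("Example Blog", "https://www.example.com/",
                    "Latest articles", example_posts))
```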
- Importance of Domain:
A powerful domain, one with good quality links coming in from diverse, trustworthy domains, is very important: it affects both the crawl rate and the indexing of the site that resides on it.
- Technical Factors:
A site can have spider traps in the form of linking structures that loop infinitely. Crawling can be interrupted by broken links. Duplicate content is another possible problem, where a CMS serves the same content on multiple URLs. All these factors reduce a bot’s ability to crawl the site exhaustively.
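One simple way to spot the duplicate content problem is to fingerprint the text of each page and group the URLs that share a fingerprint. The sketch below assumes the page text has already been fetched into a dictionary; the URLs, the sample text and the function names are hypothetical, and exact hashing like this only catches identical copies, not near-duplicates.

```python
import hashlib
from collections import defaultdict


def content_fingerprint(text):
    """Hash the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicate_groups(pages):
    """Group URLs whose normalized text is identical.

    `pages` maps URL -> page text (hypothetical, already downloaded).
    Returns a sorted list of URL groups with more than one member.
    """
    groups = defaultdict(list)
    for url, text in pages.items():
        groups[content_fingerprint(text)].append(url)
    return sorted(sorted(urls) for urls in groups.values() if len(urls) > 1)


# Hypothetical CMS output: the same product reachable on two URLs
pages = {
    "/products/blue-widget": "Blue Widget.  A sturdy widget in blue.",
    "/product?id=7": "blue widget. a sturdy widget in blue.",
    "/products/red-widget": "Red Widget. A sturdy widget in red.",
}

if __name__ == "__main__":
    for group in find_duplicate_groups(pages):
        print("Same content on:", ", ".join(group))
```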
- Increase the Crawl Rate in Google Webmaster Tools:
If you log in to Google Webmaster Tools, there is a setting to increase the spider’s crawl rate. It is small consolation if the site is affected by the problems listed above. On its own, it cannot influence the crawl rate to any great extent.
You can picture Google (and most likely Yahoo and Bing) as consisting of a Main Index and a Supplemental Index. The main index consists of the top 10 or 20 results served for important search queries.
If Google thinks a page is not relevant or not of high quality, it places it in its Supplemental index. This index is not visible, however. A good post by Aaron Wall on the Supplemental index will give you a better idea.
The third scenario is where a site’s pages can be crawled and then dropped from the index.
- Content That is Valuable and Unique:
To have your pages in the main index, you must provide valuable and unique content. Google is extremely good at identifying content that is unique. Gone are the days when content could be scraped and an introduction and conclusion added to make it look unique. Content that is engaging and valuable definitely plays a big part in a page being part of the main index.
- Domain Importance:
If a domain has a variety of good, trustworthy domains pointing to it, it helps Google retain its pages in its main index. A good example is Wikipedia. Some of its pages with just one line of content or duplicate content get ranked at the top of the SERPs. This is due to the domain trust and authority that Wikipedia commands.
Pagerank, or raw link juice, is determined by the number of links pointing to your site and their importance. The internal linking structure on your site is also part of the calculation. Pagerank sculpting has become popular over the past few years as a way to direct link juice to the important pages on a site: links to pages perceived as less important are nofollowed. Aaron Wall has remarked that a certain Pagerank threshold is required for a page to be crawled and indexed by Google.
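To see how both the number and the importance of links feed into the score, here is a simplified, textbook-style Pagerank calculation. It is only a sketch of the published idea, not what Google actually runs: the damping factor of 0.85, the fixed number of iterations, the tiny example graph and the function name are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified Pagerank by repeated redistribution of scores.

    `links` maps each page to the pages it links to (hypothetical data).
    Each page splits its current score equally among its outgoing links;
    a page with no outgoing links spreads its score over every page.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical internal link graph: every page links back to the home page
graph = {
    "home": ["a", "b"],
    "a": ["home"],
    "b": ["home"],
    "c": ["home"],
}

if __name__ == "__main__":
    for page, score in sorted(pagerank(graph).items(), key=lambda kv: -kv[1]):
        print(f"{page}: {score:.3f}")
```

In this toy graph the home page ends up with the highest score because every other page links to it, while page "c", which nothing links to, ends up with the lowest.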
- External Links:
If your site lacks unique quality content, is on a not-so-strong domain and does not have enough PR, having a few backlinks from good or even lesser-known domains to your money pages will be sufficient to get them indexed by Google.
Oftentimes, I have seen new sites isolated on the web because they have not linked out or lack incoming links. A simple step such as submitting to a popular local directory that gets crawled by the search engines regularly is enough to get the site pages indexed by Google.
Search engines treat backlinks to your site as votes for it, since each one means that someone on the web thinks your site is important. That is a key factor in getting your site pages indexed and retained in the index.
- Other Signals:
There is a belief among the wider search community that search and traffic volumes to a site, the number of clicks its pages earn on the SERPs, the average time spent on the site and so on are all signals that search engines use when deciding whether to retain pages in their index.
If the number of visits to a site is increasing steadily and users are spending more time on the site, it is logical to assume that such pages will be found relevant to search queries and retained in the search engines’ index. There is no confirmation officially from the search engines themselves to this effect. There is always the prospect of the signals getting noisy over time.
Rand has done a great video on this in his Whiteboard Friday post on Search Engine Crawling and Indexing factors.
Ravi Venkatesan is a senior SEO consultant at Netconcepts, a well established Auckland SEO company that has a great track record of optimising client sites for organic search and delivering great results.