Utilize your 'robots.txt' file efficiently
- Friday, July 31, 2009
Never underestimate the value of the 'robots.txt' file if you want your website to rank well in search engine results. It is essential to understand how search engine spiders crawl your website; by learning how to direct the spiders, you improve your chances of rising in the rankings. After all, it is the crawlers (also called spiders or robots) that determine the relevancy of your website and decide where it should be ranked in the search engine results page.
Always keep in mind that the 'robots.txt' file is meant to prevent search engine spiders from crawling certain pages. By using the 'robots.txt' file properly, you can direct the spiders to the most important pages on your website and away from certain files. You can also use it to keep duplicate pages from being indexed, because duplicate content can reduce your search engine ranking. With the help of the 'robots.txt' file, you can tell the search engine spiders which pages they should and should not crawl and index. You can instruct the spiders to look only at the main/important web pages and folders and leave the rest alone.
To use the 'robots.txt' file successfully, first determine which pages you don't want the spiders to crawl. Then upload the 'robots.txt' file to the root folder/directory of your domain or of your sub-domain. Adding it to a subdirectory will not work.
For example, you can upload the 'robots.txt' file as http://www.yoursite.com/robots.txt or as http://subdomain.yoursite.com/robots.txt, but uploading it to a sub-folder such as http://www.yoursite.com/products/robots.txt will not work. With a single robots.txt file in your root folder/directory, you can manage your entire website. But if you have sub-domains, you need a robots.txt file for each sub-domain. You will also need separate robots.txt files for your secure (https) and non-secure (http) pages.
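To make the placement rule concrete, here is a minimal Python sketch that works out which robots.txt file governs a given page. The function name `robots_url` and the page paths are made up for illustration; only the yoursite.com addresses come from the examples above.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page URL."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; robots.txt always sits at the root path.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical pages on the main domain, a sub-domain and the https site.
for page in [
    "http://www.yoursite.com/products/page.html",
    "http://subdomain.yoursite.com/index.html",
    "https://www.yoursite.com/checkout/",
]:
    print(page, "->", robots_url(page))
```

Each combination of scheme and host yields its own robots.txt address, which is why sub-domains and https pages need their own files.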
How to create a 'robots.txt' file
Creating a robots.txt file is very simple: all you need to do is create a text file named robots.txt with any text editor, such as Notepad, EditPlus or TextPad. It needs to contain two lines in order to be effective. If you want to stop the spiders/crawlers from searching the 'library' folder in your website, add the following to your 'robots.txt' file:
User-agent: *
Disallow: /library/
The "User-agent" line defines which search engine spiders you want to block. By placing the asterisk (*) here, you are instructing all search engine spiders to avoid the specified files/folders. You can also target specific search engine spiders by replacing the asterisk with one of the following codes:
- Google: Googlebot
- Yahoo: Slurp
- Microsoft: msnbot
- Ask: Teoma
The "Disallow" line specifies which part of the site you want the spiders to ignore. So, if you want the spiders to ignore the 'resource' folder in your website, you would replace 'library' with 'resource' and so on. If you wanted to instruct the spiders to ignore multiple sections, you would simply add a new "Disallow" line for each area you want to be ignored.
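One way to check rules like these before uploading them is Python's standard `urllib.robotparser` module. The sketch below is only an illustration: the helper `allowed` and the file names are invented for the demonstration, and the rules block both the 'library' and 'resource' folders for every spider.

```python
from urllib.robotparser import RobotFileParser

SITE = "http://www.yoursite.com"  # placeholder domain from the article

# One Disallow line per folder that every spider should skip.
RULES = """\
User-agent: *
Disallow: /library/
Disallow: /resource/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(path: str, agent: str = "Googlebot") -> bool:
    """Return True if the given spider may fetch the path under RULES."""
    return parser.can_fetch(agent, SITE + path)


# The file names below are made up for the demonstration.
for path in ["/index.html", "/library/book.html", "/resource/notes.html"]:
    print(path, allowed(path))
```

Only /index.html should come back as allowed; the two pages inside the blocked folders should not.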
You may also want only Google to visit a folder and nobody else. In that case, you can use the asterisk to instruct all other search engines to avoid the folder while instructing a specific spider to crawl it. If you want only Google to access the folder, write the following:
User-agent: *
Disallow: /library/

User-agent: Googlebot
Allow: /library/
You can also use your 'robots.txt' file to prevent dynamic URLs from being indexed by the search engine spiders. You can do that with the following code:
User-agent: *
Disallow: /*&
With this command, you are instructing the spiders to index only one of the URLs that match the parameter you have set. For example, suppose you had several dynamic URLs leading to the same content, and only one of them had no ampersand (&) in it. Your 'robots.txt' instructions would tell the spiders to list only that one, because the rule disallows any URL that starts with a forward slash (/) and contains the (&) symbol. You can use the same strategy to block any URL containing a question mark by using the following:
User-agent: *
Disallow: /*?
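The effect of the asterisk inside a path can be sketched in a few lines of Python. This is a simplified matcher written for illustration, not the code any search engine actually runs, and the dynamic URLs in `PATHS` are invented for the example.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a Disallow pattern containing '*' matches the path."""
    # Escape each literal piece, then let every '*' stand for any characters.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    # A rule matches from the start of the path; anything may follow it.
    return re.match(regex, path) is not None


def crawlable(pattern: str, paths):
    """Return the paths that the Disallow pattern leaves open to spiders."""
    return [path for path in paths if not rule_matches(pattern, path)]


# Hypothetical dynamic URLs, invented for this sketch.
PATHS = [
    "/library.php?category=1&item=2",
    "/library.php?item=2&category=1",
    "/library.php?category=1",
    "/library.php",
]

print(crawlable("/*&", PATHS))  # paths without an ampersand
print(crawlable("/*?", PATHS))  # paths without a question mark
```

With the first pattern, only the paths without an ampersand remain crawlable; with the second, only the path without a query string does.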
You can also instruct the spiders to avoid an entire folder on your website while still allowing them to access specific pages in that folder. To do this, you would write something like:
User-agent: *
Disallow: /library/
Allow: /library/only-this-page.html
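Both uses of "Allow" described above, one spider allowed into a folder and one page allowed inside a blocked folder, can be tried out with Python's standard `urllib.robotparser`. Treat this as a sketch: the helper `build` and the page names book.html and other.html are hypothetical. In the second rule set the "Allow" line is listed first because this particular parser applies the first matching rule in file order, whereas Google documents that it picks the most specific matching rule.

```python
from urllib.robotparser import RobotFileParser

SITE = "http://www.yoursite.com"  # placeholder domain from the article


def build(rules: str) -> RobotFileParser:
    """Parse robots.txt text and return a parser ready for can_fetch()."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser


# Only Googlebot may enter /library/; every other spider is kept out.
google_only = build("""\
User-agent: *
Disallow: /library/

User-agent: Googlebot
Allow: /library/
""")
print(google_only.can_fetch("Googlebot", SITE + "/library/book.html"))
print(google_only.can_fetch("Slurp", SITE + "/library/book.html"))

# Folder blocked except for one page. Allow comes first because this
# parser stops at the first rule that matches.
one_page = build("""\
User-agent: *
Allow: /library/only-this-page.html
Disallow: /library/
""")
print(one_page.can_fetch("Slurp", SITE + "/library/only-this-page.html"))
print(one_page.can_fetch("Slurp", SITE + "/library/other.html"))
```

Listing the more specific "Allow" line first is a cautious habit, since it gives the same answer under both first-match and most-specific-match parsers.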
If you want the crawlers and spiders to avoid indexing certain types of files, you will need to use the dollar sign ($), which marks the end of the URL. Example: If you want the spiders to avoid PDF files, use the following:
User-agent: *
Disallow: /*.pdf$
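The role of the dollar sign is easiest to see next to the same rule without it. The Python sketch below is a simplified matcher for illustration only; `disallow_lines`, `rule_matches` and the file name guide.pdf are hypothetical names, not part of any real tool.

```python
import re

# File types to keep out of the crawl.
EXTENSIONS = [".pdf", ".gif", ".jpg", ".png"]


def disallow_lines(extensions):
    """Build one 'Disallow: /*<ext>$' line per file extension."""
    return [f"Disallow: /*{ext}$" for ext in extensions]


def rule_matches(pattern: str, path: str) -> bool:
    """Match a pattern that may contain '*' and may end with '$'."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    # With '$' the whole path must match; without it a prefix match is enough.
    matcher = re.fullmatch if anchored else re.match
    return matcher(regex, path) is not None


print("User-agent: *")
for line in disallow_lines(EXTENSIONS):
    print(line)

print(rule_matches("/*.pdf$", "/library/guide.pdf"))         # ends in .pdf
print(rule_matches("/*.pdf$", "/library/guide.pdf?page=2"))  # does not end in .pdf
print(rule_matches("/*.pdf", "/library/guide.pdf?page=2"))   # no '$', prefix match
```

Without the dollar sign the rule also catches URLs that merely contain ".pdf" somewhere, which may block more than intended.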
You can use the same rule for other file types that you don't want crawled by swapping .pdf for another extension, such as .gif$, .jpg$ or .png$.
Another very good use of the 'robots.txt' file is to tell the spiders where your XML sitemap is. You can do that by adding a "Sitemap" line that gives the full URL of your sitemap file.
By utilizing your 'robots.txt' file in this way, you can submit your XML sitemap to all search engines without registering with multiple Webmaster Tools programs. Make sure you upload your XML sitemap to the root folder.
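As a rough illustration, the sketch below embeds a "Sitemap" line in a small rule set and reads it back with Python's standard `urllib.robotparser` (the `site_maps()` method needs Python 3.8 or later). The file name sitemap.xml is an assumption made for the example; use whatever your sitemap file is actually called.

```python
from urllib.robotparser import RobotFileParser

# "sitemap.xml" is an assumed file name, placed in the root folder.
RULES = """\
User-agent: *
Disallow: /library/

Sitemap: http://www.yoursite.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# site_maps() returns the listed sitemap URLs, or None if there are none.
print(parser.site_maps())
```

The "Sitemap" line stands on its own and is not tied to any "User-agent" group, so it can sit anywhere in the file.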
It is still possible that a search engine will index pages that you have excluded in your robots.txt file. If somebody creates a page that links to one of those pages, the spiders can easily discover it through that link. Keep in mind that a spider can only read a META tag on pages it is allowed to fetch. To keep such a page out of the index, place a META noindex and nofollow tag on it:
<meta name="robots" content="noindex,nofollow">
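To confirm that a page really carries the tag, you could scan its HTML with Python's standard `html.parser` module. This is a minimal sketch: the class `RobotsMetaFinder`, the function `robots_directives` and the sample page are all made up for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # Split "noindex,nofollow" into separate lowercase directives.
            for part in content.split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def robots_directives(html: str) -> set:
    """Return the set of robots META directives found in the HTML."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.directives


# A made-up page used only for the demonstration.
PAGE = '<html><head><meta name="robots" content="noindex,nofollow"></head></html>'
print(sorted(robots_directives(PAGE)))
```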
It is important to be very careful when writing the 'robots.txt' file, as you may accidentally block web pages that you do want to be indexed.
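One possible safeguard is to keep a short list of pages that must stay visible and test every draft of the file against it. The sketch below uses Python's standard `urllib.robotparser`; the function `blocked_pages`, the draft rules and the page names are hypothetical. The draft deliberately omits the trailing slash after "library" to show how a small slip blocks more than the folder.

```python
from urllib.robotparser import RobotFileParser

SITE = "http://www.yoursite.com"  # placeholder domain from the article


def blocked_pages(rules: str, must_stay_crawlable, agent: str = "Googlebot"):
    """Return the paths from must_stay_crawlable that the rules would block."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return [path for path in must_stay_crawlable
            if not parser.can_fetch(agent, SITE + path)]


# The trailing slash is missing, so the rule is a prefix of more than the folder.
DRAFT_RULES = """\
User-agent: *
Disallow: /library
"""

# Hypothetical pages that should remain visible to search engines.
IMPORTANT = ["/index.html", "/library-news.html", "/products/"]

print(blocked_pages(DRAFT_RULES, IMPORTANT))
```

An empty list means none of the important pages is blocked; anything else is worth fixing before the file goes live.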