
Robots.txt

What is robots.txt?

Robots.txt is a small plain-text file placed in the root directory of your website. It tells search engine crawlers which files and folders they should and shouldn't access on your site. This is useful if you have sections of your site that you don't want search engines to crawl, or if you want to keep crawlers away from certain URLs.

However, this doesn't mean that these parts of your website won't appear in search engines. Robots.txt is a suggestion, not a command: not every crawler is well behaved, and some will ignore the rules laid out in the file and crawl the site anyway. Most major search engines do respect its directives, though, so it is still a good idea to use it.

There are a few different things that you can include in a robots.txt file, but the most common are the "Allow" and "Disallow" directives. These tell crawlers whether they may or may not crawl a certain file or folder. For example, if you had a folder full of images that you didn't want search engines to crawl, you could add a "Disallow" directive for that folder, as shown below.
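
For instance, to block all crawlers from a hypothetical /images/ directory, the file would contain:

User-agent: *
Disallow: /images/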

The basic syntax of robots.txt is quite simple, although it might not seem so at first:

User-agent: [user-agent name] 
Disallow: [URL string of what is not allowed to be crawled] 
Allow: [URL string of what is allowed to be crawled] 

The following is an example of a robots.txt file where some user agents are allowed and some disallowed to crawl certain directories:

User-agent: Googlebot
Disallow: /private/

User-agent: Bingbot
Allow: /public/

User-agent: *
Disallow: /cgi-bin/

This would allow Googlebot to crawl everything except the private directory. Bingbot would be allowed to crawl both the public and private directories (its Allow line is redundant, since crawling is permitted by default). Any other user agent would be disallowed from crawling the cgi-bin directory. Note that a crawler matching a specific group, such as Googlebot here, follows only that group and ignores the rules under User-agent: *.

If you want to include multiple URLs, you can list each of them on a separate line:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /admin/

The instructions above tell all robots not to crawl any of the specified directories.

As you can see, the formatting of a robots.txt file is pretty simple: you list the directives, one per line, and save the file. The directives themselves can be a little more involved.

The Allow directive specifies which files or directories a bot is allowed to access. In the example above, Bingbot is allowed to access the public directory.

The Disallow directive specifies which files or directories a bot is not allowed to access. In the examples above, all bots are disallowed from accessing the cgi-bin, tmp, and admin directories.
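
The two directives can also be combined. Most major crawlers, including Googlebot and Bingbot, apply the most specific matching rule, so an Allow line can carve out an exception inside an otherwise disallowed directory. The paths below are just an illustration:

User-agent: *
Disallow: /images/
Allow: /images/logo.jpg

Here everything under /images/ is blocked except the single file logo.jpg.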

You can also include a Crawl-delay directive, which tells bots how long to wait between requests when crawling your site. This can be useful if your site gets a lot of traffic and you don't want your server to be overloaded. Note that support varies: Bingbot honors Crawl-delay, but Googlebot ignores it.

The syntax for the Crawl-delay directive is as follows (the value is the number of seconds to wait between requests):

User-agent: Bingbot
Allow: /images/logo.jpg
Disallow: /cgi-bin/contact-form.pl
Crawl-delay: 5

Do not rely on robots.txt to hide private information

Robots.txt is not a secure way to hide information. If you have sensitive information on your website that you don't want anyone to find, don't rely on robots.txt to keep it hidden. Anyone can view your robots.txt file, so don't put anything in there that you wouldn't want someone to see.

More importantly, robots.txt cannot prevent a page from being indexed: if other sites link to a blocked page, search engines may still index its URL. The recommended way to keep a page out of the index is to block indexing with noindex, or even better, to password-protect the page.
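
For example, a page can opt out of indexing with a robots meta tag in its HTML head (or with the equivalent X-Robots-Tag HTTP response header):

<meta name="robots" content="noindex">

Keep in mind that crawlers can only see this tag if they are allowed to crawl the page, so a page carrying noindex should not also be blocked in robots.txt.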

Optimize crawl budget

Crawling is the process by which search engines discover new web pages. They do this by “crawling” the links on each page they find. When a search engine’s crawler (also called a “spider”) finds a link to another page, it follows that link to the new page and then adds it to the list of pages to be crawled.

The crawl budget is the number of pages a search engine will crawl on your site during a given time period. It’s important to optimize your crawl budget so that the spider isn’t wasting time on pages that aren’t important to search engines. 

You can optimize your crawl budget by using a robots.txt file to exclude pages that you don't want the spider to crawl, such as duplicate pages or pages that don't matter for your search rankings. This focuses the crawler on the pages you actually want crawled and included in the search index.
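
Internal search result pages are a classic example of crawl budget waste. A site that serves them under a hypothetical /search/ path could exclude them like this:

User-agent: *
Disallow: /search/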

Block crawling of duplicate pages

Duplicate content is a common issue for many website owners. This type of content can prevent search engines from indexing your pages correctly, which can impact your website's ability to appear in search results. 

Duplicate content can occur when there are multiple pages on your website with the same or similar content. This can happen with product listing pages, blog posts, and more. 

Fortunately, you can block crawling of this duplicate content with robots.txt, so bots can focus on the pages you actually want to appear in search results. Thanks to robots.txt, crawlers will skip the duplicates, and they won't cause problems during indexing.
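
For instance, if a site serves printer-friendly duplicates of its articles under a hypothetical /print/ path, they can be excluded like this:

User-agent: *
Disallow: /print/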

Include sitemaps in robots.txt file

Sitemaps are an important part of any website and can be a valuable tool for search engines. Listing them in the robots.txt file makes sure that crawlers can find them; a robots.txt file may reference multiple sitemaps, and the Sitemap directive applies to the whole file rather than to a specific user agent. Sitemaps help crawlers understand your website's structure and find all of the content you want them to index.

The syntax to include sitemaps in robots.txt files is simple and straightforward. Just add the following line to your robots.txt file:

Sitemap: http://www.example.com/sitemap.xml

This will tell search engine crawlers that your sitemap is located at the specified URL. Be sure to replace "www.example.com" with the actual URL of your website.
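
If your website has more than one sitemap, list each on its own line; the file names below are just placeholders:

Sitemap: http://www.example.com/sitemap-pages.xml
Sitemap: http://www.example.com/sitemap-posts.xml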

You can also submit your sitemap to Google via Google Search Console, which helps Google index your website more effectively. For other search engines, it is always a good idea to put the link to the sitemap file into the robots.txt file.

 
