Summary of robots.txt



robots.txt is a plain text file used to tell web crawlers which pages or directories of a website they are not allowed to crawl.


Website owners sometimes do not want certain pages to be crawled, such as product listing pages sorted by different conditions, low-value pages, or pages that are still in a testing stage. To keep search engines from spending crawl time on these pages instead of the rest of the site, or to avoid the server load caused by crawling them, you can use the robots.txt file to give instructions to web crawlers (also known as web spiders, web robots, or bots).



1. How robots.txt works
 

The work of a search engine can be roughly divided into several tasks:

  1. Crawling (retrieving) websites on the Internet and discovering the content of their pages
  2. Indexing these pages (including them in the search index)
  3. Presenting indexed pages in an appropriate order when a user searches


Before crawling a website's content, a search engine's crawler first looks for the robots.txt plain text file in the root directory of the website, and then crawls the site according to the instructions given in it.

However, the instructions in the robots.txt file are not mandatory. Well-behaved crawlers such as Googlebot will follow them, but not all crawlers will. You should also check whether a particular crawler ignores certain directives.


When the robots.txt file does not exist or is empty, search engines may crawl all the content of the website.

Crawling and indexing are different procedures. If you do not want a page to be crawled, use robots.txt; if you do not want a page to be indexed, use a noindex meta tag or another method.

Disallowing a page in robots.txt means that, to search engines that follow the directive, the page effectively has no content, which may cause its ranking to drop or the page to disappear from search results. However, there is no guarantee the page will never appear in results: search engines can still reach it through inbound links elsewhere, causing the page to be indexed.

 

 

 
2. The basic format of the file

User-agent: [name of the crawler]
Disallow: [URL path that should not be crawled]


Since the main function of robots.txt is to tell web crawlers which pages they "cannot" crawl, the two lines of directives above form a rule that can be regarded as the simplest possible robots.txt file.

After specifying the web crawler, each instruction for a directory or file should be written on its own line. Rules for different crawlers are separated by blank lines, as shown in the example below.
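
For instance, a file with one group of rules for Googlebot and another group for all other crawlers might look like the following sketch (the paths are purely illustrative):

User-agent: Googlebot
Disallow: /example-subfolder/

User-agent: *
Disallow: /private/
Disallow: /tmp/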


 

 
3. Directives you can use
1. User-agent

Required. You can specify one or more user-agents in each rule. Most user-agent names can be found in the Robots Database and in Google's list of crawlers. This directive can be used with the * wildcard character; for example, User-agent: * covers all crawlers except AdsBot.

Note: AdsBot is the crawler Google uses to evaluate landing-page experience. To avoid affecting ads, it ignores the * wildcard, so to prevent AdsBot from crawling pages you need to write rules that name it explicitly.
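
For example, a rule aimed specifically at Google's ads crawler might look like the sketch below; AdsBot-Google is the user-agent token commonly used for it (check Google's crawler list for the exact name), and the path is illustrative:

User-agent: AdsBot-Google
Disallow: /landing-test/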

 

2. Disallow

Each rule must contain at least one Disallow or Allow directive. Disallow specifies what the crawler is not allowed to crawl. For a single page, write its full relative path; for a directory, end the path with /.
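
As an illustration (both paths are made up), blocking one page and one whole directory could look like this:

User-agent: *
Disallow: /drafts/old-page.html
Disallow: /temp/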

 

3. Allow

Each rule must contain at least one Disallow or Allow directive. Allow specifies what the crawler is allowed to crawl and can override an item blocked by Disallow. For a single page, write its full relative path; for a directory, end the path with /.
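
For example, a rule might block a directory but still allow one page inside it; the paths here are illustrative:

User-agent: *
Disallow: /folder1/
Allow: /folder1/public-page.html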

 

4. Crawl-delay 

Optional. It tells the crawler how long to wait between requests; the value is in seconds. Googlebot ignores this directive, because Google Search Console already provides a setting for limiting the crawl rate.
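
A minimal sketch, assuming a crawler that honours Crawl-delay (Bingbot has historically supported it), asking it to wait 10 seconds between requests:

User-agent: Bingbot
Crawl-delay: 10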

 

5. Sitemap

Optional. This directive points out the location of the XML sitemap. You can provide several sitemaps at the same time by listing each one on its own line. The sitemap location must be an absolute URL.
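
For example (the sitemap URLs are illustrative):

Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/sitemap-news.xml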


In the Disallow and Allow directives mentioned above, the wildcard characters * and $ can be used, with the following meanings:

  • * matches zero or more valid characters.
  • $ marks the end of the URL.
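
As an illustrative combination of both wildcards, the following rule blocks every URL that contains a query string as well as every URL ending in .pdf:

User-agent: *
Disallow: /*?
Disallow: /*.pdf$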

 

 

4. Examples of common rules

Using www.example.com as an example, here are some common rules for reference.

1. Prohibit crawling the entire website

The following rules prohibit all crawlers from crawling the entire website (except Google's AdsBot crawler, which must be named explicitly).

User-agent: *
Disallow: /

 

2. Allow crawling of the entire website

The following rules allow all crawlers to crawl the entire website. Not creating a robots.txt file, or leaving it empty, has the same effect.

User-agent: *
Disallow:

 

3. Allow a single crawler to crawl the entire website

The following rules allow only baiduspider to crawl the website and prohibit all other crawlers.

User-agent: baiduspider
Allow: /

User-agent: *
Disallow: /

 

4. Prohibit specific crawlers from crawling specific directories

The following rules will prevent Google’s crawler (Googlebot) from crawling all web content starting with www.example.com/folder1/.

User-agent: Googlebot
Disallow: /folder1/

 

5. Prohibit specific crawlers from crawling specific pages

The following rules will prevent Bing's crawler (Bingbot) from crawling the page www.example.com/folder1/page1.html.

User-agent: Bingbot
Disallow: /folder1/page1.html

 

6. Specify URLs ending with a specific string

The following rules can block any URL ending with .gif, and the same approach can be applied to other specific file types.

User-agent: Googlebot
Disallow: /*.gif$

 

 

5. How to create a robots.txt file

You can create a robots.txt file with almost any text editor, or with a robots.txt testing tool. However, avoid word-processing software, because the formats it saves in, or the characters it inserts, can cause problems when the file is parsed.

The file must be named robots.txt (the name is case sensitive), a site can have only one such file, and it must be placed in the root directory of the website host.

Taking https://www.example.com/ as an example, the location of robots.txt must be https://www.example.com/robots.txt.


Each subdomain needs its own robots.txt file. For example, the file for https://blog.example.com/ should be located at https://blog.example.com/robots.txt.

The robots.txt file is public: anyone who appends /robots.txt to the root domain can see which pages the site blocks from crawling, so take this into account when writing the directives in the file.

 

 

6. SEO best practices for robots.txt

1. Make sure the pages you want crawled are not blocked by robots.txt.

2. Links on pages blocked by robots.txt will not be followed. This means that if a linked page is not also linked from other crawlable pages, it will not be discovered and may not be indexed.

3. If you want to prevent sensitive information from appearing in search results, do not use robots.txt; use other methods such as password protection or robots meta directives instead.

4. Search engines cache the content of robots.txt, but they usually refresh it within a day. If you change the file and want the change to take effect sooner, you can submit the updated file to Google.

5. There is more than one way to give instructions to search engine crawlers: robots.txt and robots meta directives. The difference is that robots.txt tells crawlers which directories and pages of a site may be crawled, while robots meta directives control whether individual pages are indexed (see the example after this list).
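
As an illustration, a page that should not be indexed can include a robots meta tag like the following in its <head>; this is the standard noindex form, shown here only as an example:

<meta name="robots" content="noindex">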
