
The robots.txt file is a text document with instructions for search robots. It contains rules (directives) that tell crawlers which pages may be scanned (crawled) so that they can appear in search results. It is especially useful if you need to:

prohibit displaying the site, or pages with non-unique or useless content or confidential information, in search results;
open some pages to Google while closing others;
indicate the preferred site address for indexing.

Robots.txt is a UTF-8 encoded text document. It works for both the http and https protocols. The file is placed in the root directory of the site and must be available at: https://site.com/robots.txt.

HOW ROBOTS.TXT INFLUENCES SITE INDEXING

When a search robot visits a site, it first checks whether a robots.txt file exists and, if so, reads it. If the file clearly states what should be indexed and what should not, the robot follows the rules of the document.

However, search engines sometimes index pages even though robots.txt directives prohibit processing them. This happens when there are direct links to these materials on your site or on other sites.

USER-AGENT

User-agent is a directive that specifies which robots must follow your instructions.

BASIC ROBOTS

  • User-agent: * – use this entry if we want all robots to follow the robots.txt rules.
  • User-agent: Googlebot – if the rule is only for the Google bot.
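
For example, a file can contain separate blocks for different robots. A minimal sketch, assuming the paths /drafts/ and /test-page.html are just placeholders:

User-agent: *
Disallow: /drafts/

User-agent: Googlebot
Disallow: /test-page.html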

OTHER ROBOTS

  • Mediapartners-Google – for the AdSense service;
  • AdsBot-Google – checks the quality of landing pages;
  • Googlebot-Image – for Google Images;
  • Googlebot-Video – for video;
  • Googlebot-Mobile – for the mobile version.
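
For instance, to keep Google Images away from pictures in one folder, a block like this could be used (the folder name is only an illustration):

User-agent: Googlebot-Image
Disallow: /private-photos/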

DISALLOW

Disallow is a command that closes a resource or its individual pages from indexing.

  • Disallow: / – to hide the entire site from all search engines. It comes in handy if the site is still under development and you are not yet ready to present it to the whole world.
  • Disallow: /folder/ – if you need to hide a folder; just replace folder with the name of the desired folder.
  • Disallow: /secret-info.html – to close a single page, specify its path after the directive.
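
Put together in one file, the rules from this list could look like the following sketch:

User-agent: *
Disallow: /folder/
Disallow: /secret-info.html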

ALLOW

Allow, on the other hand, allows data to be indexed. Let’s say we want robots to index only one page of the site.

Here you need to follow the logical order of the rules when applying Allow and Disallow together: first specify the command that applies to the entire site, then the command for the section, subsection or page.
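
A minimal sketch of that order, assuming the hypothetical page /catalog/page.html is the only one we want to keep open:

User-agent: *
Disallow: /
Allow: /catalog/page.html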

SITEMAP

This directive specifies the URL of the sitemap. On each visit, the robot goes to the specified URL and sees the links that need to be indexed.
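
For example, assuming the sitemap lives at the usual /sitemap.xml path:

Sitemap: https://site.com/sitemap.xml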

CLEAN-PARAM

This command helps to get rid of duplicate pages – cases when one and the same page is available at different addresses. Such addresses appear if the site uses different sort orders, session IDs, and so on. For example, the same page might be available at addresses like:

https://site.com/catalog/category/?sorts=price
https://site.com/catalog/category/?sorts=price&order=asc
https://site.com/catalog/category/?order=desc

Then the rule in robots.txt will look like this:

User-agent: *

Clean-param: sorts&order /catalog/category/

SPECIAL CHARACTERS IN ROBOTS.TXT

The main characters are “/, *, $, #”.

SLASH “/”

With this symbol we show what we want to hide from robots.

If we put a single slash in the Disallow rule, we prohibit indexing of the entire site.
If we enclose a folder name in slashes, we prohibit crawling of that part of the site. For example: /catalog/ – here crawling of the entire catalog folder is banned.

And if we write it like this: /catalog, it means that we do not want the robot to crawl any links that begin with /catalog.
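
For comparison, here is a sketch of both variants, with catalog used only as an example folder name. To ban the folder and everything inside it:

User-agent: *
Disallow: /catalog/

To ban every URL that begins with /catalog, including pages such as /catalog-sale:

User-agent: *
Disallow: /catalog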

STAR “*”

The asterisk stands for any sequence of characters; by default it is implied at the end of every rule. For example, the following entry means that all robots should NOT index files with the .gif extension in the /catalog/ folder:

User-agent: *
Disallow: /catalog/*.gif$

DOLLAR SIGN “$”

It limits the action of the asterisk. For example, we want to close the /catalog URL itself, while URLs that merely begin with /catalog (such as /catalog/shoes) should keep working:

User-agent: *
Disallow: /catalog$

HASH “#”

Robots ignore this sign. It is used for comments.
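
For instance, everything after the # on a line is ignored by crawlers (the folder here is hypothetical):

# close the drafts section from all robots
User-agent: *
Disallow: /drafts/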

BASIC RULES OF ROBOTS.TXT

  • Write the file name and its extension in lowercase letters;
  • Write each new directive on a new line;
  • Do not put a space before a directive;
  • Put an empty line after each block of rules;
  • Write the directive name with a capital letter; the rest of the letters must be lowercase;
  • Do not forget the slash / at the start of a path parameter;
  • Do not place more than one prohibition on a line;
  • Specify only one parameter in each Allow or Disallow directive;
  • Do not use characters from national alphabets in the file.
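
Put together, a correctly formatted file might look like this sketch; the paths and the sitemap URL are placeholders:

# rules for all robots
User-agent: *
Disallow: /drafts/
Allow: /drafts/public-page.html

# rules for Google's image bot
User-agent: Googlebot-Image
Disallow: /private-photos/

Sitemap: https://site.com/sitemap.xml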

Attention! Robots will ignore your file if it is unavailable for some reason or if it is larger than 32 KB.

WHAT A MINIMAL ROBOTS.TXT LOOKS LIKE

The site is open for indexing and the sitemap is indicated:

User-agent: *
Disallow:
Sitemap: http://site.com/sitemap.xml