Robots: How to Influence Crawling and Indexing on Google | SEO COURSE 2020 【Lesson #29】


In SEO terms, the crawling phase occurs when Googlebot accesses a page and analyzes it, while indexing occurs when the webpage is deemed suitable for inclusion in the search engine's index. Since the 1990s, webmasters around the world have placed a robots.txt file in the root of their websites in order to give bots instructions on how to access their content. In this very simple text file, a Disallow directive lists the paths of the pages or folders that the bot must not crawl, so as not to overload our server's resources. There is also a User-agent directive for addressing a specific bot, such as Googlebot or Bingbot. A minimal example:

User-agent: *
Disallow: /secret

To generate this file, you can use a dedicated generator tool.
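To see how a well-behaved crawler reads these directives, here is a small offline sketch using Python's standard urllib.robotparser; the domain example.com and the rules are hypothetical, and the rules are parsed from a list rather than fetched over the network:

```python
from urllib import robotparser

# Parse the rules directly instead of fetching them, so the sketch runs offline.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /secret",
])

# The wildcard group applies to Googlebot: the homepage may be crawled...
print(rp.can_fetch("Googlebot", "https://example.com/"))                  # True
# ...but anything under /secret may not.
print(rp.can_fetch("Googlebot", "https://example.com/secret/page.html"))  # False
```

This is exactly the check a compliant crawler performs before requesting a URL.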
In addition to the robots.txt file, we can communicate with crawlers through the robots meta tag, which must be inserted in the <head> section of the webpage. There are four combinations, alternating the index and follow values with their negative variants noindex and nofollow. For example, index, follow allows the bot to index the current webpage and also crawl the pages linked within its content, while noindex, nofollow suggests neither indexing the current content nor following its links. By adjusting the robots.txt file or the meta robots tag, we can choose which content to hide and which to index.
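The four combinations can be written as standard meta tags, each placed in the page's <head>:

```html
<!-- index the page and follow its links (the default behavior) -->
<meta name="robots" content="index, follow">
<!-- index the page, but do not follow its links -->
<meta name="robots" content="index, nofollow">
<!-- do not index the page, but follow its links -->
<meta name="robots" content="noindex, follow">
<!-- do not index the page and do not follow its links -->
<meta name="robots" content="noindex, nofollow">
```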
Now, this is the theory, because in practice many smaller bots simply ignore our directives and crawl wildly without rules, spoofing their User-Agent in order to avoid being blocked. Fortunately, at least the main search engines follow our indications, but I have had many experiences with minor bad actors, which are really annoying and difficult to block. This is not a simple technicality,
as it may have implications for daily SEO operations. For example, by blocking a page in the robots.txt file, you ensure that its content is never analyzed by Google, which therefore cannot even know whether there is a meta robots tag with the noindex value. In this situation the result can still appear in search, but with an empty snippet: since Google could not crawl the page, it has read neither the title tag nor the meta description, nor the meta robots tag in the HTML! These results show the classic formula "A description is not available for this result due to the website's robots.txt file", and they are the outcome of incorrect management. A common mistake for WordPress users is the
insertion of the /wp-content folder into the robots.txt file, because this blocks Googlebot's access to all the CSS and JS dependencies of the plugins and the graphic theme, preventing the page from being rendered correctly.
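If /wp-content has already been disallowed, one possible fix is sketched below; the paths are the WordPress defaults and should be adapted to your setup. Google resolves conflicting rules by the most specific (longest) match, so the Allow lines re-open the asset folders needed for rendering:

```
User-agent: *
# If part of /wp-content must stay blocked, re-allow the rendering assets:
Disallow: /wp-content/
Allow: /wp-content/themes/
Allow: /wp-content/plugins/
Allow: /wp-content/uploads/
```

The simplest remedy, of course, is to remove the Disallow rule on /wp-content entirely.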
Another common move is the inclusion of secret or private paths in this file, but in this way we are revealing to a potential attacker exactly where to strike, and this is not safe at all! We should rather protect such pages with a noindex robots meta tag and a password authentication system.
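As an alternative to the meta tag, Google also honors a noindex directive sent as an X-Robots-Tag HTTP header, which works even for non-HTML resources such as PDFs; a hypothetical response might look like this:

```
HTTP/1.1 200 OK
Content-Type: application/pdf
X-Robots-Tag: noindex
```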
Finally, I remember seeing sites that used the robots.txt file to remove content, which resulted in almost perpetual indexing, because the bot could no longer access the page to discover that it had been removed. In such cases, when a redirect to a new version of the content is not appropriate, the correct approach is to leave the bots free access and return a status code of 410 (Gone) rather than 404 (Not Found), since the resource will never be available again, not even in the future.
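A minimal sketch of this logic as a Python WSGI application, using only the standard library; the paths and messages are hypothetical. Removed URLs answer 410 Gone while everything else stays accessible to bots:

```python
# Hypothetical list of permanently removed paths.
REMOVED_PATHS = {"/old-article", "/discontinued-product"}

def app(environ, start_response):
    path = environ.get("PATH_INFO", "/")
    if path in REMOVED_PATHS:
        # 410 tells the crawler the resource is gone for good, so it can be
        # dropped from the index faster than with a generic 404.
        start_response("410 Gone", [("Content-Type", "text/plain")])
        return [b"This content has been permanently removed."]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Regular page content."]
```

Any WSGI server (for example, the standard library's wsgiref.simple_server) can serve this callable directly.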
