robots.txt Pattern Exclusion
robots.txt is a text file located in the root directory of a web site that adheres to the Robots Exclusion Standard. At the risk of repeating ourselves and generating a bit of “duplicate content,” here are three basic things to keep in mind regarding robots.txt:
- There can be only one robots.txt file per web site.
- The proper location of robots.txt is the root directory of a web site; robots.txt files located in subdirectories will not be accessed (or honored).
- The official resource with the official documentation of robots.txt is www.robotstxt.org. There you can find a Frequently Asked Questions page, the complete reference, and a list of the names of the robots crawling the web.
If you peruse your logs, you will see that search engine spiders visit this particular file very frequently. This is because they make an effort not to crawl or index any files that are excluded by robots.txt, and they want to keep a very fresh copy cached.
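To see this in your own logs, you can count robots.txt fetches per client. A minimal sketch in Python, using a few hypothetical Apache combined-format log lines (real log paths and formats vary):

```python
from collections import Counter

# Hypothetical excerpt of an Apache combined-format access log
log_lines = [
    '66.249.66.1 - - [10/Oct/2023:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 26',
    '66.249.66.1 - - [10/Oct/2023:12:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 26',
    '66.249.66.1 - - [10/Oct/2023:12:01:00 +0000] "GET /page.html HTTP/1.1" 200 5120',
    '157.55.39.2 - - [11/Oct/2023:09:30:00 +0000] "GET /robots.txt HTTP/1.1" 200 26',
]

# Count robots.txt requests per client IP (the first whitespace-separated field)
hits = Counter(
    line.split()[0]
    for line in log_lines
    if '"GET /robots.txt' in line
)
print(hits.most_common())  # [('66.249.66.1', 2), ('157.55.39.2', 1)]
```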
robots.txt excludes URLs from a search engine on a very simple pattern-matching basis, and it is frequently an easier method to use when eliminating entire directories from a site, or, more specifically, when you want to exclude many URLs that start with the same characters.
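That matching is plain prefix comparison: a URL path is excluded if it begins with the characters of a Disallow entry. A minimal sketch of the rule (illustrative only, not a full parser — real implementations also handle per-User-agent records):

```python
def is_excluded(path, disallow_prefixes):
    """Return True if the URL path starts with any disallowed prefix."""
    return any(path.startswith(prefix) for prefix in disallow_prefixes if prefix)

# "Disallow: /private/" excludes every URL under that directory
rules = ["/private/", "/tmp/"]
print(is_excluded("/private/report.html", rules))  # True
print(is_excluded("/public/index.html", rules))    # False
```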
Sometimes, for various internal reasons within a (usually large) company, it is not possible to gain access to modify this file in the root directory. In that case, so long as you have access to the source code of the part of the application in question, use the robots meta tag instead.

A robots.txt file contains User-agent specifications, which define your exclusion targets, and Disallow entries for one or more URLs you want to exclude therein. Lines in robots.txt beginning with # are comments, and are ignored.
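As an example of the meta-tag alternative, a page can exclude itself by placing a robots meta tag in its head section (the noindex and nofollow values tell spiders not to index the page or follow its links):

```html
<head>
  <!-- Keep this page out of the index and don't follow its links -->
  <meta name="robots" content="noindex,nofollow">
</head>
```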
The following robots.txt file, placed in the root folder of your site, would not permit any robots to access any files on the site:

# Forbid all robots from browsing your site
User-agent: *
Disallow: /
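You can check how such a file is interpreted with Python's standard-library robots.txt parser (the example.com URL below is just a placeholder):

```python
from urllib.robotparser import RobotFileParser

# Parse the example file above rather than fetching a live one
parser = RobotFileParser()
parser.parse([
    "# Forbid all robots from browsing your site",
    "User-agent: *",
    "Disallow: /",
])

print(parser.can_fetch("AnyBot", "http://www.example.com/page.html"))  # False
print(parser.can_fetch("AnyBot", "http://www.example.com/"))           # False
```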
robots.txt is not a form of security! It does not prevent access to any files. It merely stops a search engine from indexing the content, and therefore prevents users from navigating to those particular resources via a search engine results page. However, users could still access the pages by navigating directly to them. Also, robots.txt is a public resource, and anyone who wants to peruse it can do so by pointing their web browser at it. If anything, using it for “security” would only make those resources even more obvious to potential hackers. To protect content, you should use the traditional ways of authenticating users, and authorizing them to visit resources of your site.
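For completeness, here is a minimal sketch of such an authentication check, HTTP Basic style. The credential store and header values are hypothetical; a real application would store password hashes and lean on a web framework's own machinery:

```python
import base64

# Hypothetical credential store; real applications store password hashes
VALID_USERS = {"alice": "s3cret"}

def is_authorized(auth_header):
    """Validate an HTTP Basic 'Authorization' header value."""
    if not auth_header or not auth_header.startswith("Basic "):
        return False
    try:
        decoded = base64.b64decode(auth_header[6:]).decode("utf-8")
    except ValueError:  # covers bad base64 and bad UTF-8 alike
        return False
    username, _, password = decoded.partition(":")
    return VALID_USERS.get(username) == password

token = base64.b64encode(b"alice:s3cret").decode("ascii")
print(is_authorized("Basic " + token))  # True
print(is_authorized("Basic bogus"))     # False
```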
Chapter 5: Duplicate Content