robots.txt Pattern Exclusion
robots.txt is a text file located in the root directory of a web site that adheres to the Robots Exclusion Standard. At the risk of repeating ourselves and generating a bit of “duplicate content,” here are three basic things to keep in mind regarding robots.txt:
- There can be only one robots.txt file per web site.
- The proper location of robots.txt is the root directory of a web site; robots.txt files located in subdirectories will not be accessed (or honored).
- The official resource with the official documentation of robots.txt is www.robotstxt.org. There you can find a Frequently Asked Questions page, the complete reference, and a list of the names of the robots crawling the web.
If you peruse your logs, you will see that search engine spiders visit this particular file very frequently. This is because they make an effort not to crawl or index any files that are excluded by robots.txt, and they want to keep a very fresh copy cached.
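To see this in your own logs, you can count robots.txt fetches per client. A minimal sketch in Python, using a few hypothetical Apache combined-format log lines (real log paths and formats vary):

```python
from collections import Counter

# Hypothetical excerpt of an Apache combined-format access log
log_lines = [
    '66.249.66.1 - - [10/Oct/2023:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 26',
    '66.249.66.1 - - [10/Oct/2023:12:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 26',
    '66.249.66.1 - - [10/Oct/2023:12:01:00 +0000] "GET /page.html HTTP/1.1" 200 5120',
    '157.55.39.2 - - [11/Oct/2023:09:30:00 +0000] "GET /robots.txt HTTP/1.1" 200 26',
]

# Count robots.txt requests per client IP (the first whitespace-separated field)
hits = Counter(
    line.split()[0]
    for line in log_lines
    if '"GET /robots.txt' in line
)
print(hits.most_common())  # [('66.249.66.1', 2), ('157.55.39.2', 1)]
```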
robots.txt excludes URLs from a search engine on a very simple pattern-matching basis, and it is frequently an easier method to use when eliminating entire directories from a site, or, more specifically, when you want to exclude many URLs that start with the same characters.
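That matching is plain prefix comparison: a URL path is excluded if it begins with the characters of a Disallow entry. A minimal sketch of the rule (illustrative only, not a full parser — real implementations also handle per-User-agent records):

```python
def is_excluded(path, disallow_prefixes):
    """Return True if the URL path starts with any disallowed prefix."""
    return any(path.startswith(prefix) for prefix in disallow_prefixes if prefix)

# "Disallow: /private/" excludes every URL under that directory
rules = ["/private/", "/tmp/"]
print(is_excluded("/private/report.html", rules))  # True
print(is_excluded("/public/index.html", rules))    # False
```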
Sometimes, for various internal reasons within a (usually large) company, it is not possible to gain access to modify this file in the root directory. In that case, so long as you have access to the source code of the part of the application in question, use the robots meta tag instead.

A robots.txt file contains User-agent specifications, which define your exclusion targets, and Disallow entries for one or more URLs you want to exclude therein. Lines in robots.txt beginning with # are comments, and are ignored.
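As an example of the meta-tag alternative, a page can exclude itself by placing a robots meta tag in its head section (the noindex and nofollow values tell spiders not to index the page or follow its links):

```html
<head>
  <!-- Keep this page out of the index and don't follow its links -->
  <meta name="robots" content="noindex,nofollow">
</head>
```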
The following robots.txt file, placed in the root folder of your site, would not permit any robots to access any files on the site:

# Forbid all robots from browsing your site
User-agent: *
Disallow: /
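You can check how such a file is interpreted with Python's standard-library robots.txt parser (the example.com URL below is just a placeholder):

```python
from urllib.robotparser import RobotFileParser

# Parse the example file above rather than fetching a live one
parser = RobotFileParser()
parser.parse([
    "# Forbid all robots from browsing your site",
    "User-agent: *",
    "Disallow: /",
])

print(parser.can_fetch("AnyBot", "http://www.example.com/page.html"))  # False
print(parser.can_fetch("AnyBot", "http://www.example.com/"))           # False
```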
robots.txt is not a form of security! It does not prevent access to any files. It merely stops a search engine from indexing the content, and therefore prevents users from navigating to those particular resources via a search engine results page. However, users could still access the pages by navigating directly to them. Also, robots.txt is a public resource, and anyone who wants to peruse it can do so by pointing their web browser at it. If anything, using it for “security” would only make those resources even more obvious to potential hackers. To protect content, you should use the traditional ways of authenticating users, and authorizing them to visit resources of your site.
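For completeness, here is a minimal sketch of such an authentication check, HTTP Basic style. The credential store and header values are hypothetical; a real application would store password hashes and lean on a web framework's own machinery:

```python
import base64

# Hypothetical credential store; real applications store password hashes
VALID_USERS = {"alice": "s3cret"}

def is_authorized(auth_header):
    """Validate an HTTP Basic 'Authorization' header value."""
    if not auth_header or not auth_header.startswith("Basic "):
        return False
    try:
        decoded = base64.b64decode(auth_header[6:]).decode("utf-8")
    except ValueError:  # covers bad base64 and bad UTF-8 alike
        return False
    username, _, password = decoded.partition(":")
    return VALID_USERS.get(username) == password

token = base64.b64encode(b"alice:s3cret").decode("ascii")
print(is_authorized("Basic " + token))  # True
print(is_authorized("Basic bogus"))     # False
```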
Chapter 5: Duplicate Content