This bit of text first tells all crawlers to ignore the temporary directories, so every crawler reading the file will automatically skip them. But you've also told one specific crawler (indicated by CrawlerName) to stay out of both the temporary directories and the links on the Listing page. The problem is that the specified crawler will never see that second instruction, because it has already read the line telling all crawlers to ignore the temporary directories.
If you want to command multiple crawlers separately, begin by naming the specific crawlers you want to control. Only after they've been named should you leave your instructions for all crawlers. Written properly, the text from the preceding code should look like this:
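The original code listing is not reproduced in this excerpt, so the following is a sketch of what the corrected file would look like; CrawlerName stands in for the real crawler's name, and /temp/ and /listing/ are assumed paths for the site's temporary directories and Listing page.

```
# The named crawler's section comes first, so CrawlerName
# reads its own instructions before the catch-all rules.
User-agent: CrawlerName
Disallow: /temp/
Disallow: /listing/

# Instructions for all other crawlers come last.
User-agent: *
Disallow: /temp/
```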
If you have certain pages or links that you want the crawler to ignore, you can accomplish this without blocking a whole site or a whole directory, and without having to put a specific meta tag on each page.
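As a sketch of that approach, a robots.txt file can disallow individual pages by path; the file names below are placeholders, not paths from the original example:

```
# Block only these two pages for every crawler; the rest of
# the site, including the surrounding directory, stays crawlable.
User-agent: *
Disallow: /products/old-page.html
Disallow: /private.html
```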
Each search engine crawler goes by its own name, and if you look at your web server log, you'll probably see those names. Here's a quick list of some of the crawler names you're likely to encounter there:
Google: Googlebot
Yahoo! Web Search: Yahoo SLURP or just SLURP
SearchHippo: Fluffy the Spider
These are just a few of the search engine crawlers that might crawl across your site. You can find a more complete list, along with the text of the Robots Exclusion Standard document, on the Web Robots Pages (www.robotstxt.org). Take the time to read the Robots Exclusion Standard document.
It’s not terribly long, and reading it will help you understand how search crawlers interact with
your web site. That understanding can also help you learn how to control crawlers better when
they come to visit.
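To see how a crawler interprets these rules, the sketch below uses Python's standard-library robots.txt parser to check which URLs a given crawler may fetch. The crawler name "CrawlerName" and the /temp/ and /listing/ paths are placeholders carried over from the discussion above, not real crawler names or site paths.

```python
import urllib.robotparser

# A robots.txt file with a named-crawler section first,
# followed by the catch-all section for all other crawlers.
ROBOTS_TXT = """\
User-agent: CrawlerName
Disallow: /temp/
Disallow: /listing/

User-agent: *
Disallow: /temp/
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The named crawler obeys its own section: both paths are off limits.
print(parser.can_fetch("CrawlerName", "/listing/page1.html"))  # False
print(parser.can_fetch("CrawlerName", "/temp/scratch.html"))   # False

# Any other crawler falls through to the catch-all section,
# so only the temporary directory is blocked for it.
print(parser.can_fetch("OtherBot", "/listing/page1.html"))     # True
print(parser.can_fetch("OtherBot", "/temp/scratch.html"))      # False
```

Note that `can_fetch` matches a crawler against the most applicable `User-agent` group, which is why putting the named section first keeps the specific instructions from being lost.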
Robots, Spiders, and Crawlers