This chapter contains some references I consider useful, like the Robots Exclusion Standard specification, as well as a list of contributors to GNU Wget.
Since Wget is able to traverse the web, it counts as one of the Web robots. Thus Wget understands the Robots Exclusion Standard (RES)---the contents of `/robots.txt', used by server administrators to shield parts of their systems from Wget's wanderings.
Norobots support is turned on only when retrieving recursively, and never for the first page. Thus, you may issue:
wget -r http://fly.cc.fer.hr/
First the index of fly.cc.fer.hr will be downloaded. If Wget finds anything worth downloading on the same host, only then will it load the robots file and decide whether or not to load the links after all. `/robots.txt' is loaded only once per host. Wget does not support the robots META tag.
The description of the norobots standard was written, and is maintained, by Martijn Koster <m.koster@webcrawler.com>. With his permission, I contribute a (slightly modified) texified version of the RES.
WWW Robots (also called wanderers or spiders) are programs that traverse many pages in the World Wide Web by recursively retrieving linked pages. For more information see the robots page.
In 1993 and 1994 there were occasions where robots visited WWW servers where they weren't welcome for various reasons. Sometimes these reasons were robot specific, e.g. certain robots swamped servers with rapid-fire requests, or retrieved the same files repeatedly. In other situations robots traversed parts of WWW servers that weren't suitable, e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting).
These incidents indicated the need for established mechanisms for WWW servers to indicate to robots which parts of their server should not be accessed. This standard addresses this need with an operational solution.
This document represents a consensus on 30 June 1994 on the robots mailing list (robots@webcrawler.com), between the majority of robot authors and other people with an interest in robots. It has also been open for discussion on the Technical World Wide Web mailing list (www-talk@info.cern.ch). This document is based on a previous working draft under the same title.
It is not an official standard backed by a standards body, or owned by any commercial organization. It is not enforced by anybody, and there is no guarantee that all current and future robots will use it. Consider it a common facility the majority of robot authors offer the WWW community to protect WWW servers against unwanted accesses by their robots.
The latest version of this document can be found at:
http://info.webcrawler.com/mak/projects/robots/norobots.html
The format and semantics of the `/robots.txt' file are as follows:
The file consists of one or more records separated by one or more blank lines (terminated by CR, CR/NL, or NL). Each record contains lines of the form:
<field>:<optionalspace><value><optionalspace>
The field name is case insensitive. Comments can be included in the file using UNIX Bourne shell conventions: the `#' character is used to indicate that preceding space (if any) and the remainder of the line up to the line termination is discarded. Lines containing only a comment are discarded completely, and therefore do not indicate a record boundary.
The record starts with one or more User-agent lines, followed by one or more Disallow lines, as detailed below. Unrecognized headers are ignored.
The presence of an empty `/robots.txt' file has no explicit associated semantics; it will be treated as if it were not present, i.e. all robots will consider themselves welcome.
The value of this field is the name of the robot the record is describing access policy for.
If more than one User-agent field is present, the record describes an identical access policy for more than one robot. At least one field needs to be present per record.
The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended.
If the value is `*', the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the `/robots.txt' file.
The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example, `Disallow: /help' disallows both `/help.html' and `/help/index.html', whereas `Disallow: /help/' would disallow `/help/index.html' but allow `/help.html'.
An empty value indicates that all URLs can be retrieved. At least one Disallow field needs to be present in a record.
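The record semantics above can be sketched as a small parser. This is an illustrative Python sketch of the format as described here, not Wget's actual implementation; the function names are my own:

```python
def parse_robots(text):
    """Parse a `/robots.txt' body into (agents, disallows) records."""
    records, agents, disallows = [], [], []
    for raw in text.splitlines():
        if not raw.strip():                  # blank line: record boundary
            if agents:
                records.append((agents, disallows))
                agents, disallows = [], []
            continue
        line = raw.split("#", 1)[0].strip()  # strip trailing comment
        if not line:                         # comment-only lines are ignored,
            continue                         # and do not end a record
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            agents.append(value)
        elif field == "disallow":
            disallows.append(value)
        # unrecognized fields are ignored
    if agents:
        records.append((agents, disallows))
    return records


def allowed(records, robot_name, path):
    """Decide whether robot_name may fetch path: case-insensitive
    substring match on the robot name, `*' as the default record,
    and simple prefix match on Disallow values."""
    name = robot_name.lower()
    default = None
    for agents, disallows in records:
        if "*" in agents:
            default = disallows              # default record: used only if
            continue                         # no specific record matches
        if any(a.lower() in name for a in agents):
            return not any(d and path.startswith(d) for d in disallows)
    if default is not None:
        return not any(d and path.startswith(d) for d in default)
    return True                              # no applicable record: welcome
```

Consistent with the Disallow examples above, `Disallow: /help' makes this sketch refuse both `/help.html' and `/help/index.html', while `Disallow: /help/' refuses only the latter.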
The following example `/robots.txt' file specifies that no robots should visit any URL starting with `/cyberworld/map/' or `/tmp/':
# robots.txt for http://www.site.com/

User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space
Disallow: /tmp/ # these will soon disappear
This example `/robots.txt' file specifies that no robots should visit any URL starting with `/cyberworld/map/', except the robot called `cybermapper':
# robots.txt for http://www.site.com/

User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space

# Cybermapper knows where to go.
User-agent: cybermapper
Disallow:
This example indicates that no robots should visit this site further:
# go away
User-agent: *
Disallow: /
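As a cross-check, the cybermapper example above can be fed to Python's standard `urllib.robotparser' module, which implements this file format (the module belongs to Python, not to Wget):

```python
from urllib.robotparser import RobotFileParser

# The cybermapper example policy, as a list of lines.
lines = [
    "# robots.txt for http://www.site.com/",
    "User-agent: *",
    "Disallow: /cyberworld/map/ # This is an infinite virtual URL space",
    "",
    "# Cybermapper knows where to go.",
    "User-agent: cybermapper",
    "Disallow:",
]

rp = RobotFileParser()
rp.parse(lines)

# cybermapper is exempt; all other robots stay out of /cyberworld/map/.
print(rp.can_fetch("cybermapper", "http://www.site.com/cyberworld/map/x.html"))
print(rp.can_fetch("SomeBot", "http://www.site.com/cyberworld/map/x.html"))
```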
When using Wget, you must be aware that it sends unencrypted passwords through the network, which may present a security problem. Here are the main issues, and some solutions.
The passwords on the command line are visible using `ps'. If this is a problem, avoid putting passwords on the command line; you can use `.netrc' for this instead.
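For reference, a `.netrc' entry uses the `machine'/`login'/`password' keywords; the credentials below are placeholders:

```
machine fly.cc.fer.hr
login myusername
password mysecret
```

Since the file holds passwords in the clear, it should be readable only by its owner (e.g. `chmod 600 ~/.netrc').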
GNU Wget was written by Hrvoje Nikšić <hniksic@srce.hr>.
However, its development could never have gone as far as it has, were it
not for the help of many people, either with bug reports, feature
proposals, patches, or letters saying "Thanks!".
Special thanks goes to the following people (in no particular order):

Digest authentication.
The following people have provided patches, bug/build reports, useful suggestions, beta testing services, fan mail and all the other things that make maintenance so much fun:
Tim Adam, Martin Baehr, Dieter Baron, Roger Beeman and the Gurus at Cisco, Mark Boyns, John Burden, Wanderlei Cavassin, Gilles Cedoc, Noel Cragg, Kristijan Čonkaš, Damir Džeko, Andrew Davison, Ulrich Drepper, Marc Duponcheel, Aleksandar Erkalović, Andy Eskilsson, Masashi Fujita, Marcel Gerrits, Karl Heuer, Gregor Hoffleit, Erik Magnus Hulthen, Richard Huveneers, Simon Josefsson, Mario Jurić, Goran Kezunović, Robert Kleine, Fila Kolodny, Martin Kraemer, Tage Stabell-Kulo, Hrvoje Lacko, Jordan Mendelson, Charlie Negyesi, Francois Pinard, Andrew Pollock, Steve Pothier, Marin Purgar, Jan Prikryl, Keith Refson, Tobias Ringstrom, Robert Schmidt, Sven Sternberger, Markus Strasser, Mike Thomas, Russell Vincent, Douglas E. Wegscheid, Jasmin Zainul, Bojan Ždrnja, Kristijan Zimmer.
Thanks to everyone; I wouldn't have done it without you. Apologies to all whom I accidentally left out. Also thanks to all the subscribers of the Wget mailing list.