When retrieving recursively, one does not wish to retrieve loads of unnecessary data. Most of the time users know exactly what they want to download, and want Wget to follow only specific links.
For example, if you wish to download the music archive from `fly.cc.fer.hr', you will not want to download all the home pages that happen to be referenced by an obscure part of the archive.
Wget possesses several mechanisms that allow you to fine-tune which links it will follow.
When only relative links are followed (option `-L'), recursive retrieval will never span hosts. No time-expensive DNS lookups will be performed, and the process will be very fast, with minimum strain on the network. This will often suit your needs, especially when mirroring the output of various x2html converters, since they generally output relative links.
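For instance, a relative-links-only mirror of such converter output might be started along these lines (the page name is purely illustrative):
wget -r -L http://fly.cc.fer.hr/converted/index.html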
The drawback of following relative links only is that humans often tend to mix them with absolute links to the very same host, on the very same page. In this mode (which is the default mode for following links) all URLs that refer to the same host will be retrieved.
The problem with this option is host and domain aliases. There is no way for Wget to know that `regoc.srce.hr' and `www.srce.hr' are the same host, or that `fly.cc.fer.hr' is the same as `fly.cc.etf.hr'. Whenever an absolute link is encountered, the host is looked up with gethostbyname to check whether we are perhaps dealing with the same host. Although the results of gethostbyname are cached, this is still a great slowdown, e.g. when dealing with large indices of home pages on different hosts (because each of the hosts must be DNS-resolved to see whether it just might be an alias of the starting host).
To avoid the overhead you may use `-nh', which will turn off DNS-resolving and make Wget compare hosts literally. This will make things run much faster, but also much less reliable (e.g. `www.srce.hr' and `regoc.srce.hr' will be flagged as different hosts).
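For example, a fast mirror that compares hosts literally (accepting that aliases such as the above will be treated as different hosts) could be started like this; the URL is merely illustrative:
wget -r -nh http://www.srce.hr/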
Note that modern HTTP servers allow one IP address to host several virtual servers, each having its own directory hierarchy. Such "servers" are distinguished by their hostnames (all of which point to the same IP address); for this to work, a client must send a Host header, which is what Wget does. However, in that case Wget must not try to divine a host's "real" address, nor try to use the same hostname for each access, i.e. `-nh' must be turned on.
In other words, the `-nh' option must be used to enable retrieval from virtual servers distinguished by their hostnames. As the number of such server setups grows, the behavior of `-nh' may become the default in the future.
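In `.wgetrc', the corresponding setting is presumably the `simple_host_check' command (the name used for `-nh' in Wget 1.5-era configuration files); a sketch could be:
simple_host_check = on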
With the `-D' option you may specify the domains that will be followed. Hosts whose domain is not in this list will not be DNS-resolved. Thus you can specify `-Dmit.edu' just to make sure that nothing outside of MIT gets looked up. This is very important and useful. It also means that `-D' does not imply `-H' (span all hosts), which must be specified explicitly. Feel free to use this option, since it will speed things up, with almost all the reliability of checking for all hosts. Thus you could invoke
wget -r -D.hr http://fly.cc.fer.hr/
to make sure that only the hosts in the `.hr' domain get DNS-looked-up and checked for being equal to `fly.cc.fer.hr'. So `fly.cc.etf.hr' will be checked (only once!) and found equal, but `www.gnu.ai.mit.edu' will not even be checked.
Of course, domain acceptance can be used to limit the retrieval to particular domains, with host spanning within them, but then you must specify `-H' explicitly. E.g.:
wget -r -H -Dmit.edu,stanford.edu http://www.mit.edu/
will start with `http://www.mit.edu/', following links across MIT and Stanford.
If there are domains you want to exclude specifically, you can do it with `--exclude-domains', which accepts the same type of arguments as `-D', but will exclude all the listed domains. For example, if you want to download all the hosts from the `foo.edu' domain, with the exception of `sunsite.foo.edu', you can do it like this:
wget -rH -Dfoo.edu --exclude-domains sunsite.foo.edu http://www.foo.edu/
When `-H' is specified without `-D', all hosts are freely spanned. There are no restrictions whatsoever as to what part of the net Wget will go to fetch documents, other than the maximum retrieval depth. If a page references `www.yahoo.com', so be it. Such an option is rarely useful by itself.
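If you do let Wget span all hosts, it is prudent to keep the retrieval depth small; a sketch reusing the URL from the example above might be:
wget -r -l 2 -H http://www.mit.edu/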
When downloading material from the web, you will often want to restrict the retrieval to only certain file types. For example, if you are interested in downloading GIFs, you will not be overjoyed to get loads of PostScript documents, and vice versa.
Wget offers two options to deal with this problem: an accept list (`-A') and a reject list (`-R'). Each has a short name, a long name, and an equivalent command in `.wgetrc'.
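As a sketch of what the `.wgetrc' equivalents might look like, assuming the `accept' and `reject' commands correspond to `-A' and `-R', a setup that keeps GIFs and skips PostScript files could read:
accept = *.gif
reject = *.ps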
The `-A' and `-R' options may be combined to achieve even better fine-tuning of which files to retrieve. E.g. `wget -A "*zelazny*" -R .ps' will download all the files having `zelazny' as a part of their name, but not the PostScript files.
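A complete recursive invocation that accepts images by suffix might look like this (the host and directory are hypothetical):
wget -r -A .gif,.jpg http://www.server.com/pub/images/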
Note that these two options do not affect the downloading of HTML files; Wget must load all the HTML files to know where to go at all--recursive retrieval would make no sense otherwise.
Regardless of other link-following facilities, it is often useful to restrict which files are retrieved based on the directories those files are placed in. There can be many reasons for this--the home pages may be organized in a reasonable directory structure; or some directories may contain useless information, e.g. `/cgi-bin' or `/dev' directories.
Wget offers three different options to deal with this requirement: directory inclusion (`-I'), directory exclusion, and `--no-parent'. Each has a short name, a long name, and an equivalent command in `.wgetrc'.
The `-I' option accepts a comma-separated list of directories to be included in the retrieval; links pointing anywhere else will be ignored. For example, to download from `http://host/people/bozo/' while following only links within the `/people' and `/cgi-bin' directories, you could use:
wget -I /people,/cgi-bin http://host/people/bozo/
`--no-parent', on the other hand, stops Wget from ascending above the directory of the starting URL. Suppose you issue:
wget -r --no-parent http://somehost/~luzer/my-archive/
You may rest assured that none of the references to `/~his-girls-homepage/' or `/~luzer/all-my-mpegs/' will be followed. Only the archive you are interested in will be downloaded. Essentially, `--no-parent' is similar to `-I/~luzer/my-archive', only it handles redirections in a more intelligent fashion.
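The converse restriction, excluding particular directories, is available as well; assuming the exclusion option is spelled `-X' (as in current Wget versions), skipping CGI scripts on a hypothetical host might look like:
wget -r -X /cgi-bin http://www.server.com/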
The rules for FTP are somewhat specific, as they necessarily have to be. FTP links in HTML documents are often included for purposes of reference, and it is often inconvenient to download them by default.
To have FTP links followed from HTML documents, you need to specify the `--follow-ftp' option. Having done that, FTP links will span hosts regardless of the `-H' setting. This is logical, as FTP links rarely point to the same host where the HTTP server resides. For similar reasons, the `-L' option has no effect on such downloads. On the other hand, domain acceptance (`-D') and suffix rules (`-A' and `-R') apply normally.
Also note that followed links to FTP directories will not be retrieved recursively further.
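For instance, to mirror a set of pages and also pick up the compressed archives that their FTP links point to, an invocation along these lines could be used (the URL and suffix are illustrative):
wget -r --follow-ftp -A .gz http://www.server.com/software/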