(Short) Documentation
WinHTTrack is an easy way to use HTTrack, through a wizard-like
interface.
Launch WinHTTrack and choose an option: Mirror sites, Mirror with wizard (i.e. semi-automatic mode), or Get separated files.
Enter URLs (i.e. Internet addresses, such as
www.test.fr/~bob/) in the URL list.
Optionally, click the Filters.. button to
define filters for links.
Optionally, you can specify a limited link depth (if you do not, the entire site will be mirrored; e.g. www.test.abs/~mike/ will mirror all of Mike's site). You can also specify a proxy (ask your administrator). Do not forget to set the paths for mirror files (the files retrieved) and log files (files recording errors and actions performed).
Click the NEXT-> button. You can then start the mirror by clicking START, or define further options first.
Tip: You can enter more than one URL, by pressing Control-Enter after each line.
This will mirror several sites together.
Options: Many options can be defined (maximum file size, site size,
building options, timeout, etc.)
Proxy: Set the proxy field if you want to use a proxy (ask your Internet provider if
you do not know the proxy name or the proxy port)
* | any characters |
*[azerty] or *[a,z,e,r,t,y] | any letters among a,z,e,r,t,y |
*[a-z] | any letters |
*[0-9,azerty] | any characters among 0..9 and a,z,e,r,t,y |
You can now use more than one wildcard (joker) in a filter. Here are some examples:
www.thisweb.com* | This will refuse/accept this web site (all links located in it will be rejected/accepted) |
*.com/* | This will refuse/accept all links that contain .com in them |
*cgi-bin* | This will refuse/accept all links that contain cgi-bin in them |
www.*.com/*.zip | This will refuse/accept all zip files in .com addresses |
*myweb*/*.tar* | This will refuse/accept all tar (or tar.gz etc.) files in hosts containing myweb |
*/*mypage* | This will refuse/accept all links containing mypage (in the path, but not in the host address) |
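The wildcard rules in the two tables above can be made concrete with a small Python sketch. It is only an approximation built from those tables, not HTTrack's own matching code. It assumes that * stands for any run of characters, that *[...] stands for any run of characters taken from the bracketed set (commas being separators), and that a filter must cover the whole link. The function names filter_to_regex and matches, and the sample links, are made up for this example.

```python
import re


def filter_to_regex(pattern):
    """Translate a filter into a regular expression (assumed semantics only)."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("*[", i):
            end = pattern.find("]", i + 2)
            # Commas only separate the members of the set, so drop them.
            members = pattern[i + 2:end].replace(",", "") if end != -1 else ""
            if members:
                # Keep '-' unescaped so that ranges such as a-z still work.
                char_class = "".join(c if c == "-" else re.escape(c) for c in members)
                parts.append("[" + char_class + "]*")
                i = end + 1
                continue
        ch = pattern[i]
        # A plain '*' matches any run of characters; anything else is literal.
        parts.append(".*" if ch == "*" else re.escape(ch))
        i += 1
    return re.compile("".join(parts))


def matches(pattern, link):
    """Return True if the whole link is covered by the filter."""
    return filter_to_regex(pattern).fullmatch(link) is not None


if __name__ == "__main__":
    # Hypothetical links, used only to exercise the filters from the table.
    for pattern, link in [
        ("www.*.com/*.zip", "www.example.com/files/a.zip"),
        ("*cgi-bin*", "www.test.fr/cgi-bin/form"),
        ("*[a-z]", "abc1"),
    ]:
        print(pattern, link, matches(pattern, link))
```

Whether a matching link is then accepted or refused depends on where the filter was entered, as described in the questions below.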
Tip: To use WinHTTrack as a spider (for checking links), just set the scan mode to
"Just scan", check the boxes "Log files" and "Test all links",
and uncheck the "Cache" box.
Combine these options to get different results.
Tip: If problems occur during a transfer, you can have a look at
the hts-err.txt (and hts-log.txt) files to see what happened. These log files report all
events that may help to diagnose a problem.
Troubleshooting:
When I use HTTrack, nothing is mirrored (no files). What's happening?
Some pages can't be seen, or are displayed with errors!
HTTrack is idle for a long time without transferring. What's
happening?
I am behind a firewall. What can I do?
Retrieve options:
I want to mirror a Web site, but there are some files outside the
domain, too. How do I retrieve them?
I have forgotten some URLs of files during a long mirror.. Should I redo
everything?
I just want to retrieve all ZIP files or other files in a web site/in a
page. How do I do it?
There are ZIP files in a page, but I don't want to transfer them. How do I do that?
I don't want to load gif files.. but what happens when I view the page?
When I use filters, I get too many files!
When I use filters, I can't access another domain, but I have filtered it!
Must I add a '+' or '-' in the filter list when I want to use
filters?
Troubleshooting:
Q: When I use HTTrack, nothing is mirrored (no files). What's
happening?
A: First, make sure that the URL you typed is correct. Then, check whether you need to use a
proxy server (see the proxy options in WinHTTrack or the -P proxy:port option in the
command line program). You can have a look at the hts-err.txt (and hts-log.txt) files to
see what happened.
Q: Some pages can't be seen, or are displayed with errors!
A: Some pages may include JavaScript or Java files that are not recognized, for
example generated filenames. There may be transfer problems, too (broken pipe, etc.). But
most mirrors do work. We are still working to improve the mirror quality of HTTrack.
Q: HTTrack is idle for a long time without
transferring. What's happening?
A: Maybe you are trying to reach some very slow sites. Try a lower TimeOut value (see
options, or the -Txx option in the command line program).
Q: I am behind a firewall. What can I do?
A: You need to use a proxy, too. Ask your administrator for the proxy server's
name/port. Then, use the proxy field in HTTrack or use the -P proxy:port option
in the command line program.
Retrieve options:
Q: I want to mirror a Web site, but there are some files outside
the domain, too. How do I retrieve them?
A: If you just want to retrieve files that can be reached through links, just activate
the 'get file near links' option. But if you want to retrieve html pages too, you can
use either wildcards or explicit addresses; e.g. add www.myweb.com/* to accept all
files and pages from www.myweb.com.
Q: I have forgotten some URLs of files during a long
mirror.. Should I redo everything?
A: No, if you have kept the 'cache' files (in hts-cache), cached files will not be
transferred again.
Q: I just want to retrieve all ZIP files or other files in a web
site/in a page. How do I do it?
A: You can use different methods. You can use the 'get files near a link' option if
files are in an outside domain. You can also use a filter address: adding +*.zip
to the URL list (or *.zip to the accept filter list) will accept all ZIP files, even if these
files are outside the address.
Q: There are ZIP files in a page, but I don't want to transfer
them. How do I do that?
A: Just filter them: add -*.zip in the URL list or add *.zip in
the exclude filter list.
Q: I don't want to load gif files.. but what happens when I
view the page?
A: If you have filtered gif files (-*.gif), links to gif files will be rebuilt so that
your browser can find them on the server.
Q: When I use filters, I get too many files!
A: Your filters are too broad; for example, *.html will get ALL html
files identified. If you want to get all html files at one address, use
www.<address>/*.html. There are lots of possibilities using filters.
Q: When I use filters, I can't access another domain, but I
have filtered it!
A: You may have made a mistake when declaring filters, for example www.myweb.com
instead of www.myweb.com/*
Q: Must I add a '+' or '-' in the filter list when I want
to use filters?
A: NO. '+' and '-' must be typed only when you place filters directly in the main URL
list, but not in the two filter lists (in the shell).
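As an illustration of that rule, here is a minimal Python sketch that turns the content of the two filter lists into the equivalent lines for the main URL list, adding the '+' or '-' prefix that is only needed there. The function name url_list_entries and its parameters are invented for this example and are not part of HTTrack.

```python
def url_list_entries(urls, accept=(), exclude=()):
    """Return the lines to type in the main URL list.

    Filters written without a sign (as in the accept and exclude filter
    lists) get the '+' or '-' prefix required in the main URL list.
    """
    entries = list(urls)
    entries += ["+" + pattern for pattern in accept]
    entries += ["-" + pattern for pattern in exclude]
    return entries


if __name__ == "__main__":
    # Mirror one site, accept ZIP files from anywhere, and skip gif files.
    for line in url_list_entries(["www.myweb.com/"], accept=["*.zip"], exclude=["*.gif"]):
        print(line)
```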
The command-line program is available for many systems (PC, Linux PC, Sun Solaris, AIX) and allows you to control the robot from the command line. This can be useful for automatically mirroring a web site.
You are a webmaster, and you would like to make a mirror of a web site:
Every week (or every day), you can launch (e.g. from crontab):
httrack --update www.myweb.abc -O /public_html/,/home/root/
This will maintain an up-to-date copy of the web site on your host.
You are a simple user, and you would like to make a mirror of a web site for your
own use:
Just type:
httrack www.myweb.abc
When you want to update it, just launch: httrack --update and httrack will
automatically update it.
You want to check links in a site/web page:
Just type:
httrack www.myweb.abc --spider
And look at the file hts-err.txt: all errors will be reported there.
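For scripted use, the command lines in this section can be assembled programmatically. The Python sketch below only builds the argument list and does not run anything. It uses only options that appear in this document (--update, --spider, -O, -P proxy:port, -Txx), and the helper names build_httrack_command and command_line are hypothetical.

```python
import shlex


def build_httrack_command(url=None, *, update=False, spider=False,
                          output=None, proxy=None, timeout=None):
    """Build an httrack argument list from the options shown in this document."""
    argv = ["httrack"]
    if update:
        argv.append("--update")
    if url:
        argv.append(url)
    if spider:
        argv.append("--spider")
    if output:
        argv += ["-O", output]          # path(s) for mirror and log files
    if proxy:
        argv += ["-P", proxy]           # written as proxy:port
    if timeout is not None:
        argv.append("-T%d" % timeout)   # the -Txx timeout option
    return argv


def command_line(argv):
    """Render an argument list as a single shell-quoted command line."""
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    print(command_line(build_httrack_command(
        "www.myweb.abc", update=True, output="/public_html/,/home/root/")))
    print(command_line(build_httrack_command("www.myweb.abc", spider=True)))
```

The resulting list could be passed to subprocess.run from a scheduled job, assuming httrack is installed and on the PATH.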
Comments, problems and bug reports are welcome, for both the shell and the robot.