HTTrack
The Web Mirror Utility

(Short) Documentation


I- How to use WinHTTrack (on Windows95/98)

WinHTTrack is an easy way to use HTTrack through a wizard-like interface.

  1. Launch WinHTTrack and choose an option: Mirror sites, Mirror with wizard (i.e. semi-automatic mode), or Get separated files.

  2. Enter URLs (i.e. Internet addresses, such as www.test.fr/~bob/) in the URL list.
    Optionally, click the Filters.. button to define filters for links.

  3. Optionally, specify a limited link depth (if you do not, the entire site will be mirrored; e.g. www.test.abs/~mike/ will mirror all of Mike's site). You can also specify a proxy (ask your administrator). Do not forget to set the paths for the mirror files (the files retrieved) and the log files (files reporting errors and actions performed).

  4. Click the NEXT-> button. You can then start the mirror by clicking START, or first define additional options.


Tip: You can enter more than one URL, by pressing Control-Enter after each line. This will mirror several sites together.


Options: Many options can be defined (maximum file size, site size, build options, timeout, etc.)


Proxy: Set the proxy field if you want to use a proxy (ask your Internet provider if you do not know the proxy name or the proxy port).

Filters: Clicking this button lets you fill in two list boxes: one for forbidden links, the other for accepted links. You can use wildcards ("jokers", *) to refuse or accept multiple links:

*                             any characters
*[azerty] or *[a,z,e,r,t,y]   any letters among a,z,e,r,t,y
*[a-z]                        any letters
*[0-9,azerty]                 any characters among 0..9 and a,z,e,r,t,y


You can use more than one wildcard in a filter. Here are some examples:

www.thisweb.com*    This will refuse/accept this web site (all links located in it will be rejected or accepted)
*.com/*             This will refuse/accept all links that contain .com/
*cgi-bin*           This will refuse/accept all links that contain cgi-bin
www.*.com/*.zip     This will refuse/accept all zip files in .com addresses
*myweb*/*.tar*      This will refuse/accept all tar (or tar.gz etc.) files on hosts containing myweb
*/*mypage*          This will refuse/accept all links containing mypage in the path (but not in the host name)
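
To make the wildcard syntax above concrete, here is a minimal Python sketch that translates a filter pattern into a regular expression. It is an illustration only, not HTTrack's actual matcher: the function names (wildcard_to_regex, matches) are hypothetical, and it assumes that *[set] means "zero or more characters from the set" and that a pattern has to match the whole link.

```python
import re


def _class_item(item: str) -> str:
    """Turn one comma-separated item ('a-z' or 'azerty') into regex class text."""
    if len(item) == 3 and item[1] == "-":
        # A range such as a-z or 0-9.
        return re.escape(item[0]) + "-" + re.escape(item[2])
    # A plain run of allowed characters such as azerty.
    return "".join(re.escape(ch) for ch in item)


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate a filter pattern into a compiled regex (assumed semantics)."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "*" and i + 1 < len(pattern) and pattern[i + 1] == "[":
            end = pattern.find("]", i + 2)
            if end == -1:
                raise ValueError("unterminated '[' in pattern")
            # Commas only separate items inside the brackets.
            items = pattern[i + 2:end].split(",")
            char_class = "".join(_class_item(item) for item in items)
            if not char_class:
                raise ValueError("empty character set in pattern")
            # Assumption: *[set] matches zero or more characters of the set.
            out.append("[" + char_class + "]*")
            i = end + 1
        elif ch == "*":
            # A bare * matches any run of characters.
            out.append(".*")
            i += 1
        else:
            out.append(re.escape(ch))
            i += 1
    return re.compile("".join(out))


def matches(pattern: str, link: str) -> bool:
    """Return True if the whole link matches the pattern."""
    return wildcard_to_regex(pattern).fullmatch(link) is not None


print(matches("www.*.com/*.zip", "www.myweb.com/files/a.zip"))  # True
print(matches("*/*mypage*", "www.mypage.fr"))                   # False
```

The second call returns False because the pattern requires a "/" before mypage, which is why that filter does not match mypage in the host name.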


Tip: To use WinHTTrack as a spider (for checking links), set the scan mode to "Just scan", mark the boxes "Log files" and "Test all links", and unmark the "Cache" box.
Combine these options to get different results.

 

Ib- FAQ (WinHTTrack and HTTrack)


Tip: If you run into problems during a transfer, look at the hts-err.txt and hts-log.txt files to see what happened. These log files report all events that may help to detect a problem.

Troubleshooting:
When I use HTTrack, nothing is mirrored (no files). What's happening?
Some pages can't be seen, or are displayed with errors!
HTTrack stays idle for a long time without transferring. What's happening?
I am behind a firewall. What can I do?

Retrieve options:
I want to mirror a Web site, but there are some files outside the domain, too. How do I retrieve them?
I forgot some file URLs during a long mirror. Do I have to redo everything?
I just want to retrieve all ZIP files or other files in a web site/in a page. How do I do it?
There are ZIP files in a page, but I don't want to transfer them. How do I do that?
I don't want to download gif files, but what happens when I view the page?
When I use filters, I get too many files!
When I use filters, I can't access another domain, even though I added a filter for it!
Must I add a '+' or '-' in the filter list when I want to use filters?

Troubleshooting:

Q: When I use HTTrack, nothing is mirrored (no files). What's happening?
A: First, make sure that the URL you typed is correct. Then check whether you need to use a proxy server (see the proxy options in WinHTTrack or the -P proxy:port option in the command-line program). You can also look at the hts-err.txt and hts-log.txt files to see what happened.

Q: Some pages can't be seen, or are displayed with errors!
A: Some pages may include JavaScript or Java files that are not recognized, for example when filenames are generated by a script. There may be transfer problems, too (broken pipe, etc.). Most mirrors do work, however, and we are still working to improve the mirror quality of HTTrack.

Q: HTTrack stays idle for a long time without transferring. What's happening?
A: You may be trying to reach some very slow sites. Try a lower TimeOut value (see the options, or the -Txx option in the command-line program).

Q: I am behind a firewall. What can I do?
A: You need to use a proxy. Ask your administrator for the proxy server's name and port. Then fill in the proxy field in WinHTTrack, or use the -P proxy:port option in the command-line program.


Retrieve options:

Q: I want to mirror a Web site, but there are some files outside the domain, too. How do I retrieve them?
A: If you just want to retrieve files that can be reached through links, activate the 'get files near a link' option. If you want to retrieve html pages too, you can use either wildcards or explicit addresses; e.g. add www.myweb.com/* to accept all files and pages from www.myweb.com.

Q: I forgot some file URLs during a long mirror. Do I have to redo everything?
A: No. If you have kept the 'cache' files (in hts-cache), cached files will not be transferred again.

Q: I just want to retrieve all ZIP files or other files in a web site/in a page. How do I do it?
A: There are several methods. You can use the 'get files near a link' option if the files are in an outside domain. You can also use a filter address: adding *.zip in the URL list (or in the accept filter list) will accept all ZIP files, even if these files are outside the address.

Q: There are ZIP files in a page, but I don't want to transfer them. How do I do that?
A: Just filter them: add -*.zip in the URL list, or add *.zip in the exclude filter list.

Q: I don't want to download gif files, but what happens when I view the page?
A: If you have filtered gif files (-*.gif), links to gif files will be rebuilt so that your browser can find them on the server.

Q: When I use filters, I get too many files!
A: Your filters are too broad. For example, *.html will get ALL html files identified, whatever the site. If you want only the html files at a given address, use www.<address>/*.html. There are lots of possibilities using filters.

Q: When I use filters, I can't access another domain, even though I added a filter for it!
A: You may have made a mistake when declaring the filter, for example www.myweb.com instead of www.myweb.com/*

Q: Must I add a '+' or '-' in the filter list when I want to use filters?
A: NO. '+' and '-' must be typed only when you place filters directly in the main URL list, not in the two filter lists (in the WinHTTrack shell).
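
The two scope mistakes described in the filter answers above (a filter that is too broad, and a host name without a trailing /*) can be illustrated with a short Python sketch. It uses the standard fnmatch module as a stand-in for HTTrack's own matcher, which is an approximation that only holds for simple * patterns; the link list and the selected helper are made up for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical links found during a mirror.
links = [
    "www.myweb.com/index.html",
    "www.myweb.com/docs/guide.html",
    "www.other.org/news.html",
    "www.myweb.com/logo.gif",
]


def selected(pattern):
    """Return the links that the whole pattern matches."""
    return [link for link in links if fnmatchcase(link, pattern)]


# Too broad: matches html files on every host.
too_broad = selected("*.html")

# Scoped: only html files on www.myweb.com.
scoped = selected("www.myweb.com/*.html")

# Host name alone: matches no link, because every link continues after the host.
host_only = selected("www.myweb.com")

# Host name plus /*: matches everything on that host.
whole_site = selected("www.myweb.com/*")

print(len(too_broad), len(scoped), len(host_only), len(whole_site))
```

With this sample list the four counts are 3, 2, 0 and 3: the bare host name selects nothing, while adding /* selects the whole site.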


II- How to use HTTrack (the command-line version)

The command-line program is available for many systems (PC, Linux PC, Sun Solaris, AIX) and lets you control the robot from the command line. This can be useful for mirroring a web site automatically.

IIb- Example: Use of HTTrack (the command-line version)


You are a webmaster, and you would like to make a mirror of a web site:
Every week (or every day), you can launch (e.g. from crontab):

httrack --update www.myweb.abc -O /public_html/,/home/root/

This will keep an up-to-date copy of the web site on your host.


You are an ordinary user, and you would like to make a mirror of a web site for your own use:
Just type:

httrack www.myweb.abc


When you want to update it, just launch httrack --update and HTTrack will update the mirror automatically.


You want to check links in a site/web page:
Just type:

httrack www.myweb.abc --spider

Then look at the file hts-err.txt: all errors are reported there.
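
For scripted use (for instance a job started from crontab), the command lines above can be assembled programmatically. The Python sketch below is only one possible way to do it: build_httrack_command is a hypothetical helper, it uses only the options mentioned in this document (--update, --spider, -O, -P proxy:port, -Txx and +/- filters in the URL list), and it builds the argument list without running anything.

```python
import shlex
from typing import List, Optional, Sequence


def build_httrack_command(
    url: str,
    output_dir: Optional[str] = None,
    update: bool = False,
    spider: bool = False,
    proxy: Optional[str] = None,
    timeout: Optional[int] = None,
    filters: Sequence[str] = (),
) -> List[str]:
    """Build an httrack argument list (hypothetical helper, nothing is executed)."""
    cmd = ["httrack", url]
    # Filters placed in the URL list carry their own '+' or '-' prefix.
    cmd.extend(filters)
    if update:
        cmd.append("--update")
    if spider:
        cmd.append("--spider")
    if output_dir is not None:
        cmd += ["-O", output_dir]
    if proxy is not None:
        # proxy is expected in the form name:port.
        cmd += ["-P", proxy]
    if timeout is not None:
        cmd.append(f"-T{timeout}")
    return cmd


# The webmaster example: periodic update into fixed mirror and log paths.
weekly = build_httrack_command(
    "www.myweb.abc", output_dir="/public_html/,/home/root/", update=True
)

# The link-checking example.
check = build_httrack_command("www.myweb.abc", spider=True)

print(shlex.join(weekly))
print(shlex.join(check))
```

The resulting list can be handed to subprocess.run on a machine where httrack is installed; keeping the arguments as a list avoids shell quoting problems with filters such as -*.zip.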

 


Comments, problem reports and bug reports are welcome, for both the shell and the robot.