Questions tagged [httrack]

HTTrack (Website copier)

HTTrack is a free and open source Web crawler and offline browser, developed by Xavier Roche and licensed under the GNU General Public License Version 3.

HTTrack allows users to download World Wide Web sites from the Internet to a local computer. By default, HTTrack arranges the downloaded site by the original site's relative link-structure. The downloaded (or "mirrored") website can be browsed by opening a page of the site in a browser.

HTTrack can also update an existing mirrored site and resume interrupted downloads. HTTrack is configurable by options and by filters (include/exclude), and has an integrated help system. There is a basic command-line version and two GUI versions (WinHTTrack and WebHTTrack); the command-line version can be used in scripts and cron jobs.

HTTrack uses a Web crawler to download a website. Some parts of the website may not be downloaded by default because of the robots exclusion protocol, unless that behavior is disabled in the program's options. HTTrack can follow links that are generated with basic JavaScript and inside Applets or Flash, but not complex links (generated using functions or expressions) or server-side image maps.
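As a rough illustration of the options and filters described above, the sketch below assembles a command line for the command-line version from Python without running it. The helper name `build_httrack_command`, the example URL, the output directory, and the filter patterns are all assumptions made for illustration; the flags used (`-O` for the output path, `-rN` for mirror depth, `-s0` to ignore robots.txt rules, and `+pattern`/`-pattern` scan rules) are taken from the HTTrack manual, but check `httrack --help` for your installed version.

```python
import shlex


def build_httrack_command(url, output_dir, depth=None, filters=(), follow_robots=True):
    """Assemble an httrack argument list (hypothetical helper; nothing is executed)."""
    cmd = ["httrack", url, "-O", output_dir]
    if depth is not None:
        cmd.append(f"-r{depth}")  # -rN: limit the mirror depth to N
    if not follow_robots:
        cmd.append("-s0")  # -s0: never follow robots.txt rules
    cmd.extend(filters)  # scan rules: "+pattern" includes, "-pattern" excludes
    return cmd


# Example values below are placeholders, not taken from any real project.
cmd = build_httrack_command(
    "https://example.com/",
    "./mirror",
    depth=2,
    filters=["+*.css", "-*.zip"],
)
print(shlex.join(cmd))  # shell-quoted form, e.g. for pasting into a script or cron job
```

If HTTrack is installed, the resulting list could be passed to `subprocess.run(cmd, check=True)`; building the list separately keeps the example runnable on machines without HTTrack.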

References:

http://www.httrack.com/

http://en.wikipedia.org/wiki/HTTrack

74 questions

votes

2 answers

How can I mirror the results of MOSS plagiarism detection?

MOSS is a well-known server for checking software plagiarism. It allows teachers to send homework submissions, calculates the similarity between different submissions, and colors code blocks that are very similar. Here is an example of the results…

asked May 02 '21 at 18:48

Erel Segal-Halevi

33,955
36
114
183

votes

0 answers

Download Webpage with HTTrack executed JavaScript

I want to save a webpage with httrack including the executed JavaScript-Output. I'm using: httrack -r1 URL -O PATH Currently I'm only getting the .js-source: " Is there any option I can add to…

javascript download html httrack

asked Sep 21 '17 at 09:01

night4awk

votes

2 answers

cookies.txt not working on Httrack version 3.49-2

Hi guys, I'm using httrack with cookies for authentication, but it seems my cookie doesn't work. The syntax I'm using: httrack -b1 -%M -r1 −F "Mozilla/4.5 (compatible; MSIE 4.01; Windows NT)" …

curl cookies httrack

asked Oct 23 '19 at 12:16

dmh

votes

0 answers

Unable to login to a website using HTTrack

I am trying to download the content of a website using HTTrack software. The website requires login details. After selecting the directory to save in, I added the URL "http://*****/login". I selected capture URL and temporarily added the temporary…

httrack

asked Jan 22 '19 at 13:27

Olfa Fdhila

votes

3 answers

Using HTTrack to mirror a single page

I've been attempting to use HTTrack to mirror a single page (downloading html + prerequisites: style sheets, images, etc), similar to the question "mirror single page with httrack". However, the accepted answer there doesn't work for me, as I'm…

python http command-line wget httrack

asked Jan 14 '16 at 17:33

Empiromancer

3,778
1
22
53

votes

0 answers

Mirror a website with httrack while executing javascript

I want to save a mirror of www.youtube.com/tv. I obviously do not want to save the videos. I want the code running the website in a local copy; everything else can stay remote. The code I want is mainly contained in 2 files: live.js and…

javascript http download youtube httrack

asked Nov 13 '13 at 15:51

Martin

3,960
7
43
43

vote

0 answers

Using HTTrack to clone secure websites (https://)?

I'm a beginner in web scraping and cloning. I was using HTTrack to clone a website such as classcentral.com (I want to clone the website only 2 levels deep). But that website uses a secure protocol, i.e., https://. I am doing this as a task to convert…

web-scraping httrack

asked Mar 05 '23 at 11:19

Supreeth N

vote

0 answers

Web scraping Obsidian Published vaults

I'm trying to download Obsidian public vaults like this: https://publish.obsidian.md/bryan-jenks/Z/INDEX I would like to get in each folder all its .md (markdown) notes. I have tried with Httrack and with wget without success, only some files are…

web-scraping wget httrack

asked Nov 17 '21 at 00:01

AMGMNPLK

1,974
3
11
22

vote

1 answer

Using HTTrack to download links only under a certain subdomain (nothing external)

So, this is what I am trying to download - https://www.slader.com/textbook/9781337624183-calculus-9th-edition/ Looks fairly simple, I tried adding a few lines to "scan rules" to force it to download everything under it but for some reason, the…

web download httrack

asked Aug 30 '20 at 08:04

EvilRaceHorse

vote

2 answers

'x86_64-linux-gnu-gcc' error in installing a package using pip3

When I tried to install httrack in Ubuntu 16.04 I was not able to get those packages: pip3 install httrack-py Collecting httrack-py Using cached…

python pip httrack

asked Jul 20 '18 at 03:38

Neeraj Nair

vote

3 answers

wrong srcset attributes from httrack

I have spidered a website with httrack and a lot of files on different levels are generated. But the website uses picture / source tags with srcset attributes, which httrack does not handle, so all those pictures do not work well offline. httrack…

bash sed httrack

asked Sep 20 '17 at 11:56

Bernd Wilke πφ

10,390
1
19
38

vote

1 answer

HTTrack gives 404 on Unicode URLs with German special characters

I've realized that HTTrack can't download files if URLs have special characters in them, like the German ß; it returns a 404 response. The errors look like those in the screenshot: Is there any setting in HTTrack to make it able to deal with such characters? PS: I…

url unicode httrack

asked Aug 04 '17 at 13:50

Evgeniy

2,337
2
28
68

vote

2 answers

How do I get httrack to save files with their original names rather than index****.html?

I'm following the HTTrack docs example here: http://httrack.kauler.com/help/User-defined_structure The site I need to scrape has URLs in this…

html web-scraping wget httrack

asked Jul 11 '17 at 19:28

BlueDogRanch

vote

0 answers

Httrack faulty when encountering Japanese-encoded URLs

I usually don't have any problem with Httrack, but this time, I found out that it doesn't manage to grab pages with non-ASCII characters, like this Japanese URL: domain.com/リーク情報の真偽のほ/ (read by the browser this way:…

url character-encoding httrack

asked Sep 29 '16 at 01:47

majimekun

vote

0 answers

Mirroring websites - 403 Forbidden with user agent strings

I'm working on an application to mirror US university academic catalogs. To do this, I have a cluster of Celery workers that use wget or httrack to mirror the content, styles and scripts, then upload to our S3 bucket. For a small number of…

wget mirroring httrack

asked May 27 '16 at 16:51

Jason

11,263
21
87
181
