Questions tagged [httrack]

HTTrack (Website copier)

HTTrack is a free and open source Web crawler and offline browser, developed by Xavier Roche and licensed under the GNU General Public License Version 3.

HTTrack allows users to download World Wide Web sites from the Internet to a local computer. By default, HTTrack arranges the downloaded site by the original site's relative link-structure. The downloaded (or "mirrored") website can be browsed by opening a page of the site in a browser.

HTTrack can also update an existing mirrored site and resume interrupted downloads. HTTrack is configurable by options and by filters (include/exclude), and has an integrated help system. There is a basic command line version and two GUI versions (WinHTTrack and WebHTTrack); the command line version can be used in scripts and cron jobs.

HTTrack uses a Web crawler to download a website. By default, some parts of the website may not be downloaded because HTTrack honors the robots exclusion protocol, unless that behavior is disabled in the program's options. HTTrack can follow links that are generated with basic JavaScript and inside Applets or Flash, but not complex links (generated using functions or expressions) or server-side image maps.
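As a rough illustration of how the command line version can be driven from a script, the sketch below assembles an httrack argument list in Python. The helper name `build_httrack_command`, the example URL, and the output directory are hypothetical. The flags used (`-O` for the output path, `-rN` for mirror depth, `-s0` to ignore robots.txt rules, and `+pattern` / `-pattern` scan rules) are the ones I understand the httrack CLI to accept; check them against `httrack --help` for your version before relying on them.

```python
import shlex
from typing import List, Optional, Sequence


def build_httrack_command(
    url: str,
    output_dir: str,
    depth: Optional[int] = None,
    filters: Sequence[str] = (),
    obey_robots: bool = True,
) -> List[str]:
    """Assemble an argument list for the httrack command line client.

    Hypothetical helper: it only builds the list and does not run httrack.
    """
    cmd = ["httrack", url, "-O", output_dir]
    if depth is not None:
        # -rN limits how many link levels deep the mirror goes.
        cmd.append(f"-r{depth}")
    if not obey_robots:
        # -s0 asks httrack never to follow robots.txt rules (assumed flag).
        cmd.append("-s0")
    for rule in filters:
        # Scan rules start with "+" (include) or "-" (exclude).
        if not rule.startswith(("+", "-")):
            raise ValueError(f"filter must start with '+' or '-': {rule!r}")
        cmd.append(rule)
    return cmd


if __name__ == "__main__":
    example = build_httrack_command(
        "https://example.com/",
        "./mirror",
        depth=2,
        filters=["+*.example.com/*", "-*.zip"],
    )
    # Print a shell-quoted form that could be pasted into a script or crontab.
    print(shlex.join(example))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where httrack is installed. Building a list rather than a single string avoids shell quoting problems with the `*` characters in scan rules.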

References:

http://www.httrack.com/

http://en.wikipedia.org/wiki/HTTrack

74 questions
3
votes
2 answers

How can I mirror the results of MOSS plagiarism detection?

MOSS is a well-known server for checking software plagiarism. It allows teachers to send homework submissions, calculates the similarity between different submissions, and colors code blocks that are very similar. Here is an example of the results…
Erel Segal-Halevi
  • 33,955
  • 36
  • 114
  • 183
3
votes
0 answers

Download Webpage with HTTrack executed JavaScript

I want to save a webpage with httrack including the executed JavaScript-Output. I'm using: httrack -r1 URL -O PATH Currently I'm only getting the .js-source: " Is there any option I can add to…
night4awk
  • 51
  • 1
  • 7
2
votes
2 answers

cookies.txt not working on Httrack version 3.49-2

Hi guys, I'm using httrack with cookies for authentication, but it seems my cookie doesn't work. The syntax I'm using: httrack -b1 -%M -r1 −F "Mozilla/4.5 (compatible; MSIE 4.01; Windows NT)" …
dmh
  • 430
  • 10
  • 31
2
votes
0 answers

Unable to login to a website using HTTrack

I am trying to download the content of a website using HTTrack software. The website requires login details. After selecting the directory to save in, I added the URL. "http://*****/login I selected capture URL and temporarily added the temporary…
Olfa Fdhila
  • 119
  • 2
  • 9
2
votes
3 answers

Using HTTrack to mirror a single page

I've been attempting to use HTTrack to mirror a single page (downloading html + prerequisites: style sheets, images, etc), similar to the question [mirror single page with httrack][1]. However, the accepted answer there doesn't work for me, as I'm…
Empiromancer
  • 3,778
  • 1
  • 22
  • 53
2
votes
0 answers

Mirror a website with httrack while executing javascript

I want to save a mirror of www.youtube.com/tv. I obviously do not want to save the videos. I want the code running the website in a local copy, everything else can stay remote. The code I want is mainly contained in 2 files: live.js and…
Martin
  • 3,960
  • 7
  • 43
  • 43
1
vote
0 answers

Using HTTrack to clone secure websites (https://)?

I'm a beginner in web scraping and cloning. I was using HTTrack to clone a website such as classcentral.com (I want to clone the website only 2 levels deep). But that website uses a secure protocol, i.e., https://. I am doing this as a task to convert…
Supreeth N
  • 11
  • 1
1
vote
0 answers

Web scraping Obsidian Published vaults

I'm trying to download Obsidian public vaults like this: https://publish.obsidian.md/bryan-jenks/Z/INDEX I would like to get in each folder all its .md (markdown) notes. I have tried with Httrack and with wget without success, only some files are…
AMGMNPLK
  • 1,974
  • 3
  • 11
  • 22
1
vote
1 answer

Using HTTrack to download links only under a certain subdomain (nothing external)

So, this is what I am trying to download - https://www.slader.com/textbook/9781337624183-calculus-9th-edition/ Looks fairly simple, I tried adding a few lines to "scan rules" to force it to download everything under it but for some reason, the…
1
vote
2 answers

'x86_64-linux-gnu-gcc' error in installing a package using pip3

When I tried to install httrack in Ubuntu 16.04 I was not able to get those packages: pip3 install httrack-py Collecting httrack-py Using cached…
Neeraj Nair
  • 195
  • 1
  • 10
1
vote
3 answers

wrong srcset attributes from httrack

I have spidered a website with httrack and a lot of files on different levels are generated. But the website uses picture / source tags with srcset attributes, which httrack does not handle, so those pictures do not work well offline. httrack…
Bernd Wilke πφ
  • 10,390
  • 1
  • 19
  • 38
1
vote
1 answer

HTTrack gives 404 on unicode urls with german special characters

I've realized that HTTrack can't download files if URLs have special characters in them, like the German ß - it returns a 404 response. The errors look like those in the screenshot: Is there any setting in HTTrack to make it able to deal with such characters? ps: I…
Evgeniy
  • 2,337
  • 2
  • 28
  • 68
1
vote
2 answers

How do I get httrack to save files with their original names rather than index****.html?

I'm following the HTTrack docs example here: http://httrack.kauler.com/help/User-defined_structure The site I need to scrape has URLs in this…
BlueDogRanch
  • 721
  • 1
  • 16
  • 43
1
vote
0 answers

Httrack faulty when encountering japanese encoded URLS

I usually don't have any problem with Httrack, but this time, I found out that it doesn't manage to grab pages with non-ASCII characters like this Japanese URL : domain.com/リーク情報の真偽のほ/ ( read by the browser this way :…
majimekun
  • 210
  • 2
  • 10
1
vote
0 answers

Mirroring websites - 403 Forbidden with user agent strings

I'm working on an application to mirror US university academic catalogs. To do this, I have a cluster of Celery workers that use wget or httrack to mirror the content, styles and scripts, then upload to our S3 bucket. For a small number of…
Jason
  • 11,263
  • 21
  • 87
  • 181