Questions tagged [httrack]

HTTrack (Website copier)

HTTrack is a free and open source Web crawler and offline browser, developed by Xavier Roche and licensed under the GNU General Public License Version 3.

HTTrack allows users to download World Wide Web sites from the Internet to a local computer. By default, HTTrack arranges the downloaded site by the original site's relative link-structure. The downloaded (or "mirrored") website can be browsed by opening a page of the site in a browser.

HTTrack can also update an existing mirrored site and resume interrupted downloads. HTTrack is configurable by options and by filters (include/exclude), and has an integrated help system. There is a basic command line version and two GUI versions (WinHTTrack and WebHTTrack); the command line version can be used in scripts and cron jobs.

HTTrack uses a Web crawler to download a website. By default, some parts of the website may not be downloaded because HTTrack honors the robots exclusion protocol; this behavior can be disabled in the program's options. HTTrack can follow links that are generated with basic JavaScript and inside Applets or Flash, but not complex links (generated using functions or expressions) or server-side image maps.
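
The options and filters described above are passed as command-line arguments. The following is a minimal sketch in Python of how such an argument list might be assembled, not a definitive recipe. The helper name `build_httrack_command`, the example URL, and the output directory are hypothetical. The `-O` (output path), `-rN` (mirror depth), `-s0` (do not follow robots.txt rules), and `+pattern`/`-pattern` filter forms are taken from HTTrack's command-line conventions; verify them against `httrack --help` for the installed version.

```python
from typing import List, Optional, Sequence


def build_httrack_command(
    url: str,
    output_dir: str,
    include: Sequence[str] = (),
    exclude: Sequence[str] = (),
    ignore_robots: bool = False,
    depth: Optional[int] = None,
) -> List[str]:
    """Assemble an argument list for an httrack invocation (hypothetical helper)."""
    # Assumes an executable named "httrack" is reachable on the PATH.
    cmd = ["httrack", url, "-O", output_dir]
    if depth is not None:
        cmd.append(f"-r{depth}")  # limit the mirror depth
    if ignore_robots:
        cmd.append("-s0")  # ignore the robots exclusion protocol
    cmd.extend(f"+{pattern}" for pattern in include)  # include filters
    cmd.extend(f"-{pattern}" for pattern in exclude)  # exclude filters
    return cmd


if __name__ == "__main__":
    print(build_httrack_command(
        "https://example.com/",
        "./mirror",
        include=["*.example.com/*"],
        exclude=["*.zip"],
        ignore_robots=True,
        depth=2,
    ))
```

Returning a list rather than a single string avoids shell quoting problems when a filter contains wildcard characters.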

References:

http://www.httrack.com/

http://en.wikipedia.org/wiki/HTTrack

74 questions
1
vote
0 answers

HTTrack wait until page search completed

I'm trying to download with HTTrack the results of a search request at the URL here. Unfortunately, the download starts immediately and doesn't get the search result (as the page is still showing a wheel). Question: is it possible to force a pause…
Tom
  • 1,375
  • 3
  • 24
  • 45
1
vote
1 answer

Using subprocess to run HTTrack from python in Windows

I'm in the process of writing a web scraping python script, and one of the things I'd like it to be able to do is have it take a snapshot of certain pages (all of the html, style sheets, and images necessary to view that particular page properly…
Empiromancer
  • 3,778
  • 1
  • 22
  • 53
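
Launching HTTrack from a Python script, as this question asks, could be sketched with the standard `subprocess` module. This assumes an `httrack` executable is reachable on the PATH (its actual location on a given Windows machine is an assumption to verify). `run_httrack` and its `runner` parameter are hypothetical names; `runner` is injectable so the function can be exercised without HTTrack installed.

```python
import shutil
import subprocess
from typing import Callable, List, Sequence


def run_httrack(
    args: Sequence[str],
    executable: str = "httrack",
    runner: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> int:
    """Run httrack with the given arguments and return its exit code."""
    # Use the full path if the executable is found on the PATH;
    # otherwise fall back to the bare name and let the OS resolve it.
    resolved = shutil.which(executable) or executable
    cmd: List[str] = [resolved, *args]
    # Passing a list (no shell=True) sidesteps quoting issues on Windows.
    completed = runner(cmd, capture_output=True, text=True, check=False)
    return completed.returncode
```

A call such as `run_httrack(["https://example.com/page.html", "-O", "./snapshot"])` would then return the process exit code; the URL and output directory here are placeholders.
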
1
vote
1 answer

How do I enter the variable value of my bash command into MySQL?

The following code extracts all the domain names from a website and sets them to the value of $domain from a httrack data stream. domain=$(httrack --skeleton http://www.ilovefreestuff.com -V "cat \$0" | grep -iEo '[[:alnum:]-]+\.(com|net|org)') The…
Wyatt Jackson
  • 303
  • 1
  • 2
  • 11
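
The `grep -iEo` step in the excerpt above can be mirrored in Python with the standard `re` module. This is only a sketch of the extraction step: it assumes the page text has already been captured as a string, and `extract_domains` is a hypothetical helper name. The POSIX class `[[:alnum:]]` is approximated by `[A-Za-z0-9]`.

```python
import re
from typing import List

# Same pattern as the grep call: a run of letters, digits, or hyphens,
# followed by .com, .net, or .org (case-insensitive, like grep -i).
DOMAIN_RE = re.compile(r"[A-Za-z0-9-]+\.(?:com|net|org)", re.IGNORECASE)


def extract_domains(text: str) -> List[str]:
    """Return every non-overlapping match, like grep -o (hypothetical helper)."""
    return DOMAIN_RE.findall(text)


if __name__ == "__main__":
    print(extract_domains("see http://www.example.com and Foo-Bar.ORG/x"))
```

As with the original pattern, only the label directly before the suffix is captured, so `www.example.com` yields `example.com`.
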
0
votes
1 answer

When httrack downloads the site, it says forbidden or access denied error code: 1020

When I try to download this site https://www.classcentral.com/ with httrack, I get either a forbidden or an access denied (error code 1020) error in the index.html file. How can I solve this? I tried all the programs I could find, but none of them worked.
0
votes
0 answers

I am trying to download a website for scraping purposes but I keep getting an error

I am trying to download https://www.classcentral.com/ for scraping purposes, but I am not able to download it: it shows me a mirror error, and the log shows a 403 Forbidden error. I was expecting the website to be downloaded so that I can…
0
votes
1 answer

What depth level should I set in httrack?

I am trying to clone the website: https://www.classcentral.com/ using httrack. I only want to get the main page and the pages of the links belonging to that main page. How should I set "Max Depth" and "Maximum external depth" for this purpose? I…
Diego L
  • 185
  • 1
  • 9
0
votes
1 answer

How can I select all texts on VSCode which are color coded in white? (to ease manual webpage translation)

I am working on a project where it is required to copy a particular website using HTTrack (2 levels - Main Page and 1 link deep web pages only). In the copied website, I need to translate all texts into Hindi. It is recommended to hardcode the…
0
votes
0 answers

Configuring winhttrack to download a website that uses a localStorage access token

I came across a website that uses a localStorage access token when a user logs in, rather than cookies in the request header. My question is how to configure winhttrack to use the access token from localStorage to download webpages that require…
ben39
  • 11
  • 1
  • 6
0
votes
0 answers

Clone websites with HTTrack that use cloudflare

Every time I try to download websites that use Cloudflare, HTTrack gives me an error message saying that the mirror is empty. I suppose this is because it tries to copy the bot protection page instead of the real one. Is there…
Syrex
  • 1
  • 3
0
votes
0 answers

Decode Httrack encoded urls in Python?

I have downloaded a full website using Httrack Website Copier and now I want to retrieve all image source ('src') URLs using Python 3.7. I have already done that, but for further use I need those URLs to be in plain text; instead they are something like…
YoYoYo
  • 439
  • 2
  • 11
0
votes
1 answer

How do I enable JavaScript in a .html document?

I wanted to learn a little bit of website coding, so I decided to see how a website is written. I used HTTrack Website Copier to copy a website and then I opened the index.html document. Now I saw a row where it says "-- Please enable Javascript…
Mr G
  • 1
0
votes
0 answers

Using HTTRACK on shopify to clone a shopify store

I am trying to clone a Shopify store, but the images are served from a Shopify CDN and are not being downloaded. How can I fix that? I tried adding scan rules, but it is still not working.
0
votes
0 answers

HTTrack doesn't get iframe game assets

I am using HTTrack to download a simple web page that we host ourselves, which has an iframe with a simple game. I notice that it gets all the content except some of the content loaded in our game. I added domain filters to reach the iframe domain.…
anonymous-dev
  • 2,897
  • 9
  • 48
  • 112
0
votes
0 answers

How to force httrack to create directory names that aren't valid domains?

I am using the following to save a webpage: httrack "someurl.com/foo.html" -O "./SaveToPath" --replace-external -v -s0 --depth=1 -n It is saving everything locally as expected however all images on someurl.com are stored in folder called…
SoOhNo
  • 41
  • 4
0
votes
1 answer

How can I grab this website content without losing JavaScript content?

I want to download this website, http://websdr.uk:8074/, and I tried IDM and httrack, but they didn't work for the JavaScript content. Can anyone help me download this frequency streaming content? Thank you
qvws
  • 1
  • 1