I am looking to crawl an entire website and save it locally for offline use. The task has two parts:
- Authentication
This needs to be implemented in Java. I need to override the HttpsURLConnection logic to add a couple of lines of Hadoop authentication (keytab-based Kerberos) in order to fetch each URL's response. Something like the below:
import java.net.URL;
import javax.net.ssl.HttpsURLConnection;
import org.apache.hadoop.security.authentication.client.AuthenticatedURL;

AuthenticatedURL.Token token = new AuthenticatedURL.Token();
URL target = new URL(url);
//HttpsURLConnection.setDefaultHostnameVerifier(new HostnameVerifierSSL());
HttpsURLConnection con = (HttpsURLConnection) new AuthenticatedURL().openConnection(target, token);
- Crawling: once every request goes through the authentication above, crawl the entire website down to depth = 3 and save it locally as a zip archive for offline use.
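For the crawl-and-zip part, one possible shape is a breadth-first crawler with a pluggable fetcher, so the authenticated HttpsURLConnection logic above can be dropped in as a String -> String function (URL in, HTML out). This is only a minimal sketch under that assumption; the class name, the regex-based link extraction, and the file-naming scheme are all illustrative, not a full implementation (a real crawler would resolve relative URLs, stay on the same host, and parse HTML properly):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;
import java.util.function.Function;
import java.util.regex.*;
import java.util.zip.*;

// Hypothetical sketch: depth-limited BFS crawl, pages written into a zip.
// The fetcher abstracts the authenticated fetch (e.g. AuthenticatedURL).
class OfflineCrawler {
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");
    private final Function<String, String> fetcher;
    private final int maxDepth;

    OfflineCrawler(Function<String, String> fetcher, int maxDepth) {
        this.fetcher = fetcher;
        this.maxDepth = maxDepth;
    }

    // BFS from startUrl; each visited page becomes one zip entry.
    void crawlToZip(String startUrl, OutputStream out) {
        Set<String> visited = new HashSet<>();
        Deque<String[]> queue = new ArrayDeque<>();   // {url, depth}
        queue.add(new String[]{startUrl, "0"});
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            while (!queue.isEmpty()) {
                String[] item = queue.poll();
                String url = item[0];
                int depth = Integer.parseInt(item[1]);
                if (depth > maxDepth || !visited.add(url)) continue;
                String html = fetcher.apply(url);     // authenticated fetch goes here
                if (html == null) continue;
                // Sanitize the URL into a zip entry name.
                zip.putNextEntry(new ZipEntry(url.replaceAll("[^A-Za-z0-9.-]", "_") + ".html"));
                zip.write(html.getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
                // Naive link extraction; real code should resolve relative URLs.
                Matcher m = HREF.matcher(html);
                while (m.find()) {
                    queue.add(new String[]{m.group(1), String.valueOf(depth + 1)});
                }
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    // Helper to list entry names from a finished zip (used for inspection).
    static Set<String> entryNames(byte[] zipBytes) {
        Set<String> names = new LinkedHashSet<>();
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            for (ZipEntry e; (e = zin.getNextEntry()) != null; ) names.add(e.getName());
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return names;
    }
}
```

With maxDepth = 3 and a fetcher built around the AuthenticatedURL connection, this would save every reachable page within three link hops into the zip. Alternatively, an off-the-shelf crawler such as crawler4j can be configured with a custom page fetcher to achieve the same split between authentication and crawling.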
Let me know possible solutions.