Questions tagged [websphinx]

WebSPHINX is a Java class library for building web crawlers.

4 questions
10 votes, 6 answers

How to crawl entire Wikipedia?

I've tried the WebSPHINX application. I found that if I put wikipedia.org as the starting URL, it does not crawl any further. So how can I actually crawl the entire Wikipedia? Can anyone give me some guidelines? Do I need to specifically go and find those URLs…
Mr CooL
1 vote, 0 answers

Use Java to Crawl and download entire website overriding the HttpsURLConnection

I am looking to crawl an entire website and save it locally for offline use. It should have 2 parts: Authentication: This needs to be implemented using Java, and I need to override the HttpsURLConnection logic to add a couple of lines of authentication (Hadoop) in…
Spartan
0 votes, 1 answer

How to do form authentication by entering username and password while web crawler is crawling pages

I have downloaded WebSPHINX to do this, but I need it to ask me for the website's username and password, submit them to the website, and, once authenticated, start crawling the internal links and sublinks and save the…
saum22
-2 votes, 1 answer

Regex works in the test program but not in the WebSPHINX crawler

Here is my code for Regex matching which worked for a webpage: public class RegexTestHarness { public static void main(String[] args) { File aFile = new File("/home/darshan/Desktop/test.txt"); FileInputStream inFile = null; …
darshan