WebSPHINX is a Java class library for building web crawlers.
Questions tagged [websphinx]
4 questions
10
votes
6 answers
How to crawl entire Wikipedia?
I've tried the WebSPHINX application.
I found that if I put wikipedia.org as the starting URL, it will not crawl any further.
So how do I actually crawl the entire Wikipedia? Can anyone give me some guidelines? Do I need to specifically go and find those URLs…

Mr CooL
1
vote
0 answers
Use Java to crawl and download an entire website, overriding HttpsURLConnection
I am looking to crawl an entire website and save it locally for offline use. It should have 2 parts:
Authentication
This needs to be implemented in Java, and I need to override the HttpsURLConnection logic to add a couple of lines of authentication (Hadoop) in…

Spartan
0
votes
1 answer
How to do form authentication by entering a username and password while a web crawler is crawling pages
I have downloaded WebSPHINX to do this, but I need it to ask me for the website's username and password, submit them to the website, and, once authenticated, start crawling the internal links and sublinks and save the…

saum22
-2
votes
1 answer
Regex working in the test program but not in the WebSPHINX crawler
Here is my regex-matching code, which worked for a webpage:
public class RegexTestHarness {
    public static void main(String[] args) {
        File aFile = new File("/home/darshan/Desktop/test.txt");
        FileInputStream inFile = null;
…

darshan