I'm trying to use JWikiDocs as a focused crawler for downloading Wikipedia pages as text documents. I'm running it in a VirtualBox VM with Ubuntu 17.10.1.
I have cleaned and compiled JWikiDocs using
$ make clean
and
$ make all
Then, as per the README file, I specify a seed URL and the maximum number of documents to download in the options.txt file. For example:
totalPages=100
seedURL=http://en.wikipedia.org/wiki/Microsoft
This file is located in the JWikiDocs/data/Microsoft directory.
I then execute JWikiDocs with the following command from the Ubuntu Terminal:
$ java -classpath lib/htmlparser/htmlparser.jar:lib/jwikidocs.jar jwikidocs.JWikiDocs -d data/Microsoft
The problem I am having is that only the seed page is downloaded as a document. Even though I have specified 100 documents for crawling, the crawler does not seem to follow the URLs contained in the seed page; it just terminates after processing the seed.
I have tried various values for the totalPages parameter, as well as changing the value of maxDepth within Option.java from its default value of 4. I have also tried increasing the sleep period from 500 to 2000 ms.
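For reference, the change in Option.java was along the lines of the sketch below. The field names are my approximation of the source, not an exact copy, so treat this as illustrative only:

// Option.java (sketch; field names are approximate, values are the ones I tried)
public class Option {
    // maximum crawl depth from the seed URL (default 4)
    public static int maxDepth = 4;

    // sleep between transactions, in milliseconds (default 500; I also tried 2000)
    public static int timeSleep = 500;

    // sleep after an error transaction, in milliseconds
    public static int timeSleepOnError = 5000;
}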
I also notice that running $ make test behaves the same way: only the first document in each test directory is actually updated. The test directories do contain 100 output documents in their respective folders, but those come packaged with the downloadable tar file and are not regenerated during testing. I tried deleting them and running $ make test again, and they are not reproduced.
Does anyone know how to fix this so that JWikiDocs crawls the URLs within the specified seed page? I have contacted the publisher, but I figured SO might be able to help more quickly.
EDIT:
I've included the retrieval log so that all the crawling options are visible. As you can see, it processes the seed URL and then terminates. I suspect the issue lies somewhere in the underlying Java code.
RetrievalLog.txt
Root directory: ../data/Microsoft
Log file: ../data/Microsoft/retrievallog.txt
Data directory: ../data/Microsoft/data
Retrieval max depth: 4
Total number of retrieved pages: 100
Time sleep between transactions: 500 milliseconds
Time sleep after an error transaction: 5000 milliseconds
seedURL=http://en.wikipedia.org/wiki/Microsoft
Output encoding: UTF-8
Text including hyperlinks: true
Current queue size: 0
Downloading and processing URL: http://en.wikipedia.org/wiki/Microsoft ...
Downloading and processing completed! Save docID: 1
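In case it helps with diagnosis, below is a minimal standalone check of link extraction using the bundled htmlparser library. Everything in it (the class name LinkCheck and the /wiki/ filter) is my own illustration rather than JWikiDocs code; it simply reports how many links the parser can pull out of the seed page:

import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;

// Compile/run against the bundled parser, e.g. -cp lib/htmlparser/htmlparser.jar
public class LinkCheck {
    public static void main(String[] args) throws Exception {
        // Fetch and parse the seed page
        Parser parser = new Parser("http://en.wikipedia.org/wiki/Microsoft");
        NodeList links = parser.extractAllNodesThatMatch(new NodeClassFilter(LinkTag.class));

        // Count how many extracted hrefs look like article links
        int wikiLinks = 0;
        for (int i = 0; i < links.size(); i++) {
            String href = ((LinkTag) links.elementAt(i)).getLink();
            if (href.contains("/wiki/")) {
                wikiLinks++;
            }
        }
        System.out.println("Total links: " + links.size() + ", /wiki/ links: " + wikiLinks);
    }
}

If this prints zero links, the problem would seem to be in fetching or parsing the seed page rather than in the crawl loop itself.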