
I'm trying to use JWikiDocs as a focused crawler for downloading Wikipedia pages as text documents. I'm running it in a VirtualBox VM with Ubuntu 17.10.1.

I have cleaned and compiled JWikiDocs using

$ make clean

and

$ make all 

Then as per the README file, I specify a seed URL and the maximum number of documents to download within the options.txt file. For example:

totalPages=100
seedURL=http://en.wikipedia.org/wiki/Microsoft

This file is contained within the JWikiDocs/data/Microsoft directory.

I then execute JWikiDocs with the following command from the Ubuntu Terminal:

$ java -classpath lib/htmlparser/htmlparser.jar:lib/jwikidocs.jar jwikidocs.JWikiDocs -d data/Microsoft

The problem I am having is that only the seed page is downloaded as a document. Even though I have specified 100 documents for crawling, it does not seem to crawl the URLs contained in the seed page; it just terminates after the seed page.

I have tried various values for the totalPages parameter, as well as changing the value of maxDepth within Option.java from the default value of 4. I have also tried changing the sleep period from 500 to 2000 ms.

I also notice that running $ make test does the same thing; only the first document is actually updated. The test directories do contain 100 output documents in their respective folders, but these come packaged with the downloadable tar file and are not updated during testing. I tried deleting them and running $ make test again, and they are not reproduced.

Does anyone know how to fix this so that JWikiDocs crawls the URLs within the specified seed page? I have contacted the publisher, but I figured SO might be able to help more quickly.

EDIT:

I've included the retrieval log so that all the crawling options are visible. As you can see, it processes the seed URL and then terminates. I suspect that the issue lies within the underlying Java somewhere.

RetrievalLog.txt

Root directory: ../data/Microsoft
Log file: ../data/Microsoft/retrievallog.txt
Data directory: ../data/Microsoft/data
Retrieval max depth: 4
Total number of retrieved pages: 100
Time sleep between transactions: 500 milliseconds
Time sleep after an error transaction: 5000 milliseconds
seedURL=http://en.wikipedia.org/wiki/Microsoft
Output encoding: UTF-8
Text including hyperlinks: true
Current queue size: 0
Downloading and processing URL: http://en.wikipedia.org/wiki/Microsoft ...
Downloading and processing completed! Save docID: 1

  • I'm not sure if this should have a Java tag or not. If it should, feel free to edit and add it. – Scott Jan 16 '18 at 10:49

1 Answer


The issue is related to the year 2018, in which we all use https for safe browsing.

The pre-historic code is hard-coded to accept http URLs only, which does not work with Wikipedia in 2018.

To be able to crawl it, you have to make a few modifications to the original source:

  1. Change the URLChecker to treat https URLs as valid. This can be achieved with the following code fragment (a small standalone sketch of the same idea follows after this list):

    public static String wikiURL = "https://en.wikipedia.org";
    
  2. Modify the class Engine: replace <http: with <https: in line 108 and again in line 154. Note that only the opening tags are switched to https, since that is what Wikipedia issues nowadays; the closing wiki-tag must remain </http>, so do not do a blind search-and-replace.

  3. Modify option.txt to contain an https seed URL. For example, I used this file:

    totalPages=100
    seedURL=https://en.wikipedia.org/wiki/Microsoft
    
  4. Run make clean and make all again, then rerun the crawler as advised by the documentation.
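
For illustration, here is a minimal, self-contained sketch of the kind of check that steps 1 and 2 boil down to. The class and method names (WikiURLChecker, isValidWikiURL) are placeholders of mine, not the original JWikiDocs identifiers; the point is simply that links harvested from the seed page now carry an https prefix, so an http-only check silently rejects every one of them and the queue never grows:

    // Illustrative sketch only -- names are placeholders, not JWikiDocs code.
    public class WikiURLChecker {

        // Step 1: the accepted Wikipedia base URL must use https.
        public static String wikiURL = "https://en.wikipedia.org";

        // A candidate link is accepted only if it points into the English Wikipedia.
        public static boolean isValidWikiURL(String candidate) {
            return candidate != null && candidate.startsWith(wikiURL + "/wiki/");
        }

        public static void main(String[] args) {
            // Accepted with the https prefix:
            System.out.println(isValidWikiURL("https://en.wikipedia.org/wiki/Microsoft")); // true
            // An http-only check (the old behaviour) would reject exactly these links:
            System.out.println(isValidWikiURL("http://en.wikipedia.org/wiki/Microsoft"));  // false
        }
    }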

I tested it locally and it started crawling the pages, as you can see in the attached retrieval log output:

Current queue size: 0
Downloading and processing URL: https://en.wikipedia.org/wiki/microsoft ...
Downloading and processing completed! Save docID: 1

Current queue size: 859
Downloading and processing URL: https://en.wikipedia.org/wiki/microsoft_redmond_campus ...
Downloading and processing completed! Save docID: 2

Current queue size: 858
Downloading and processing URL: https://en.wikipedia.org/wiki/redmond,_washington ...
Downloading and processing completed! Save docID: 3

Current queue size: 857
Downloading and processing URL: https://en.wikipedia.org/wiki/list_of_business_entities ...
Downloading and processing completed! Save docID: 4

Current queue size: 1360
Downloading and processing URL: https://en.wikipedia.org/wiki/public_company ...
Downloading and processing completed! Save docID: 5

The steps above work, but they do require this local intervention in the source.

My personal remarks on the code base as provided by the original author:

It needs a lot of modernization in terms of code quality, code style, and performance (it is really slow compared to the multi-threaded crawler4j). Use it only for reconstructing your experiments; do not use it in production.
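
If you only need plain-text Wikipedia pages and are not tied to reproducing the JWikiDocs experiments exactly, a crawler4j setup roughly along the following lines is the kind of alternative I have in mind. Treat it as a sketch against the crawler4j 4.x API; the WikiCrawler class name, the storage folder, the limits and the thread count are placeholder choices, not values taken from JWikiDocs:

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.crawler.CrawlController;
    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.fetcher.PageFetcher;
    import edu.uci.ics.crawler4j.parser.HtmlParseData;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
    import edu.uci.ics.crawler4j.url.WebURL;

    public class WikiCrawlExample {

        // Only follow links that stay inside the English Wikipedia article namespace.
        public static class WikiCrawler extends WebCrawler {
            @Override
            public boolean shouldVisit(Page referringPage, WebURL url) {
                return url.getURL().startsWith("https://en.wikipedia.org/wiki/");
            }

            @Override
            public void visit(Page page) {
                if (page.getParseData() instanceof HtmlParseData) {
                    HtmlParseData html = (HtmlParseData) page.getParseData();
                    // Plain text of the page; write it wherever you keep your documents.
                    System.out.println(page.getWebURL().getURL()
                            + " -> " + html.getText().length() + " characters of text");
                }
            }
        }

        public static void main(String[] args) throws Exception {
            CrawlConfig config = new CrawlConfig();
            config.setCrawlStorageFolder("/tmp/wiki-crawl"); // placeholder path
            config.setMaxDepthOfCrawling(4);                 // comparable to maxDepth=4
            config.setMaxPagesToFetch(100);                  // comparable to totalPages=100
            config.setPolitenessDelay(500);                  // ms between requests

            PageFetcher fetcher = new PageFetcher(config);
            RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);
            CrawlController controller = new CrawlController(config, fetcher, robots);

            controller.addSeed("https://en.wikipedia.org/wiki/Microsoft");
            controller.start(WikiCrawler.class, 4);          // 4 crawler threads
        }
    }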

  • I was so close to solving this myself then! I changed to https in the option.txt file, but I didn't think to check the protocols in the Java. Thanks very much, good work! – Scott Jan 18 '18 at 17:18