
I would like to open a ClueWeb09 WARC file in Python 3. I was able to open it in Python 2 using this library, but I need to do it in Python 3 because I depend on other libraries that are only available there.

I have tried to adapt this code to Python 3, but I did not get a working solution. I have also tried the warcio and warc3-wet libraries, but neither of them works with the ClueWeb09 format.

My final goal is to extract some features from this collection.

  • The warcio library works for a lot of people, perhaps you could give a few more details as to what didn't work with it? – Greg Lindahl May 03 '19 at 15:42
  • The main problem is that the ClueWeb09 collection is in WARC 0.18 format, and warcio only supports versions 1.0 and 1.1. Moreover, I have read that it doesn't use the standard \r\n end-of-line markers and that some of its records are ill-formed. – Roberta Parisi May 07 '19 at 06:08
  • That's good to know, I am extending warcio to handle records with bad formats! However, it's not sufficiently openly available for me to actually look at it. – Greg Lindahl May 08 '19 at 05:48
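
For illustration, the issues raised in the comments (a `WARC/0.18` version line, bare `\n` line endings, ill-formed records) could be tolerated by a small hand-written reader along the following lines. This is only a minimal sketch and has not been run against real ClueWeb09 files: the names `iter_warc_records` and `open_warc` are hypothetical, and it assumes that each record starts with a line beginning with `WARC/`, that the header block ends at the first blank line, and that `Content-Length` is at least roughly right. After each payload it simply skips forward to the next `WARC/` line, so stray bytes between records are ignored.

```python
import gzip
import io


def iter_warc_records(stream):
    """Yield (headers, payload) pairs from a binary WARC-like stream.

    Hypothetical helper: tolerant of "\n" or "\r\n" line endings and of
    junk between records, but it still trusts Content-Length.
    """
    while True:
        # Resynchronize: skip lines until one that looks like a version line.
        line = stream.readline()
        while line and not line.startswith(b"WARC/"):
            line = stream.readline()
        if not line:
            return  # end of stream

        headers = {"version": line.strip().decode("ascii", "replace")}

        # Read "Name: value" lines up to the first blank line.
        while True:
            line = stream.readline()
            if not line or line in (b"\n", b"\r\n"):
                break
            name, sep, value = line.partition(b":")
            if sep:  # silently ignore header lines without a colon
                key = name.strip().decode("utf-8", "replace").lower()
                headers[key] = value.strip().decode("utf-8", "replace")

        # A missing or malformed Content-Length is treated as zero.
        try:
            length = max(0, int(headers.get("content-length", "0")))
        except ValueError:
            length = 0

        yield headers, stream.read(length)


def open_warc(path):
    """Open a plain or gzip-compressed file in binary mode."""
    opener = gzip.open if path.endswith(".gz") else open
    return opener(path, "rb")


if __name__ == "__main__":
    # Tiny in-memory sample using bare "\n" line endings.
    sample = (
        b"WARC/0.18\nWARC-Type: response\nContent-Length: 5\n\nhello\n\n"
        b"WARC/0.18\nWARC-Type: response\nContent-Length: 3\n\nabc\n\n"
    )
    for headers, payload in iter_warc_records(io.BytesIO(sample)):
        print(headers["version"], headers.get("warc-type"), payload)
```

With a real file, usage would be something like `with open_warc(path) as f: for headers, payload in iter_warc_records(f): ...`, where `path` points to one of the collection's `.warc.gz` files. If `Content-Length` turns out to be too large for some records, the reader would swallow the start of the following record, so a stricter resynchronization strategy may be needed.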

0 Answers