How to handle binary data in commoncrawl using python

Question

I need to analyze Common Crawl data using Python 2.7. Looking at some WARC files, I noticed that the warc.gz files contain some binary data. I have to parse the HTML source with bs4, but how can I detect which records are textual and which are binary? For example, the record for this URL contains binary data: http://aa-download.avg.com/filedir/inst/avg_free_x86_all_2015_5315a8160.exe

How can I skip the binary data and process only the textual data in Python?

score 0 · Answer 1 · answered Jan 13 '17 at 12:45

You could use python-magic to identify the type of the data.

In [1]: import magic

In [2]: magic.from_file('places.sqlite')
Out[2]: b'SQLite 3.x database, user version 33, last written using SQLite version 3015001'

In [3]: magic.from_file('installed-port-list.txt')
Out[3]: b'ASCII text'

In [4]: magic.from_file('quotes.gz')
Out[4]: b'gzip compressed data, was "quotes", last modified: Tue Dec  6 20:35:44 2016, from Unix'

Note that while these examples use the from_file function, python-magic also has a from_buffer function, which works on data that is already in memory.
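As a rough sketch of how from_buffer could be applied to the question, the snippet below sniffs the first bytes of a record payload and keeps only payloads that look textual. The helper names (is_textual_mime, is_textual_payload), the sniff_bytes parameter, and the list of accepted MIME types are assumptions made for illustration; they are not part of python-magic or of any WARC library.

```python
def is_textual_mime(mime_type):
    """Return True for MIME types worth handing to an HTML parser.

    The accepted types are an assumption; adjust them to your needs.
    """
    return mime_type.startswith("text/") or mime_type in (
        "application/xhtml+xml",
        "application/xml",
    )


def is_textual_payload(payload, sniff_bytes=2048):
    """Guess from the first bytes of a payload whether it is textual."""
    # python-magic is imported here so that is_textual_mime can be used
    # (and tested) even where libmagic is not installed.
    import magic

    mime_type = magic.from_buffer(payload[:sniff_bytes], mime=True)
    # Some python-magic versions return bytes rather than str.
    if isinstance(mime_type, bytes):
        mime_type = mime_type.decode("ascii")
    return is_textual_mime(mime_type)
```

With something like this, you would call is_textual_payload on each record body and pass it to bs4 only when the result is True. The exact MIME label that libmagic reports for a given file type can differ between libmagic versions, so it is worth printing a few results from your own data first.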
