I need to analyze Common Crawl data, and I am using Python 2.7. Looking at some of the WARC files, I noticed that the warc.gz files contain binary data as well as text. I want to parse the HTML source with bs4, but how can I detect which records are textual and which are binary? For example, there is a request for this URL, whose response is binary data: http://aa-download.avg.com/filedir/inst/avg_free_x86_all_2015_5315a8160.exe
How can I skip the binary data and process only the textual data in Python?
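To make the question concrete, here is a rough sketch of the kind of check I have in mind. It assumes I already have the raw payload of a WARC response record as bytes (HTTP status line, headers, blank line, body). The function names and the list of "textual" content types are my own guesses, not part of any WARC library, and servers can mislabel content, so this is only a heuristic.

```python
# Content-Type prefixes treated as textual (an assumption; adjust as needed).
TEXT_TYPES = ("text/", "application/xhtml+xml", "application/xml")


def split_http_response(payload):
    """Split raw HTTP response bytes into (header_bytes, body_bytes)."""
    head, _sep, body = payload.partition(b"\r\n\r\n")
    return head, body


def get_content_type(header_bytes):
    """Return the lowercased media type from the headers, or None if absent."""
    # Skip the first line, which is the HTTP status line.
    for line in header_bytes.split(b"\r\n")[1:]:
        name, sep, value = line.partition(b":")
        if sep and name.strip().lower() == b"content-type":
            # Drop parameters such as "; charset=UTF-8".
            media_type = value.split(b";")[0].strip().lower()
            return media_type.decode("ascii", "replace")
    return None


def looks_textual(payload):
    """Heuristic: trust Content-Type if present, else look for NUL bytes."""
    head, body = split_http_response(payload)
    ctype = get_content_type(head)
    if ctype is not None:
        return ctype.startswith(TEXT_TYPES)
    # No Content-Type header: NUL bytes near the start suggest binary data.
    return b"\x00" not in body[:1024]
```

With something like this, I would call `looks_textual(payload)` for each response record and only pass the body to bs4 when it returns `True`. Is this a reasonable approach, or is there a more reliable way?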