Removing all non-printable ASCII UTF-8 characters in a text file

Question

I have a text file that I'm trying to view in notepad++. The file contains a list of reptile subspecies names in one column, and then I have DNA sequence IDs that are separated by what I thought was just white space. But when I open the text file, it appears that the space is occupied with unprintable characters such as GS and VT.

For example:

Subspecies name, unprintable characters, sequence ID, unprintable char, sequence ID... and so forth until the next line:

Ablepharus bivittatus GSGSGSGSGS 36660VT4560VT23400

Is there a way I can remove all of these unprintable GS and VT characters from my text file? When I try to print every line in the file, I keep getting weird spacing due to these control characters that I see in notepad++. Any way I can make it just print normally without all of the spacing disruptions from the unprintable characters?

Updates:

I used user312016's advice and installed chardet for Python. I found out the file is encoded in UTF-16LE.

I got the file from a professor off of dropbox. The file was zipped, and all I did was unzip the file. It was a .txt file and I'm sure he didn't mention using another script to parse the data.

When I click on the unzipped .txt file to open it in regular notepad, it displays weird symbols that I assume are the GSs and VTs that I see when I open the file in notepad++.

How are `GS` and `VT` unreadable characters? Also what you see in notepad does not really have anything to do with python — Padraic Cunningham, Apr 24 '16 at 22:08
UTF-8 is backward compatible with ASCII, like many other encodings and code pages. All ASCII characters (char codes <= 127) are left untouched by UTF-8, and only character codes 128 and above are encoded as multi-byte sequences. All byte values UTF-8 uses in those sequences are >= 128, so a lot of functions that care only about ASCII characters will behave correctly with UTF-8 encoded byte arrays. For this reason you could simply read in your file as a byte array (without caring about encoding) and process the ASCII 'GS's and 'VT's. – pasztorpisti, Apr 24 '16 at 22:09
Okay so I used user312016's advice and installed chardet. I found out the file is encoded in UTF-16LE. When I try to print every line in the file I keep getting weird spacing due to these unreadable characters that I see in notepad++. Any way I can make it just print normally without all of the spacing disruptions from the unreadable characters? Thanks — SweetJD14, Apr 24 '16 at 22:13
@PadraicCunningham: If SweetJD14 was referring to Group Separator (`'\x1d'` or `^]`) and Vertical Tab (`'\x0b'` or `^K`), then he or she probably meant "unprintable" or "not _human_-readable". Wikipedia's got them all (with some dodgy links to details) on its [ASCII control code chart](https://en.wikipedia.org/wiki/ASCII#ASCII_control_code_chart). – Kevin J. Chase, Apr 24 '16 at 22:15
As an aside, this is probably the first time I've heard of anyone encountering ASCII characters like the group and record separators used for their _intended purpose_ in real life. — Kevin J. Chase, Apr 24 '16 at 22:19
I just don't understand how to target and eliminate the unreadable characters. I'm sure that it's UTF-16LE encoded because I used chardet to confirm. — SweetJD14, Apr 24 '16 at 22:24
I got the file from a professor off of dropbox. The file was zipped, and all I did was unzip the file. When I click on the unzipped file to open regular notepad it returns weird symbols which I'm assuming is the GS's and VT's that I see when I open the file on notepad++ — SweetJD14, Apr 24 '16 at 22:27
You might want to ask your professor how they encoded the file — Padraic Cunningham, Apr 24 '16 at 22:40
@SweetJD14: What was the filename's extension? Did it end in `.txt` or something else? It's possible that it was intended as input for some program that already exists and that your professor assumed you would use, instead of expecting you to write your own parser. — Kevin J. Chase, Apr 24 '16 at 22:50
It was a `.txt` file and I'm sure he didn't mention another script to parse the data. — SweetJD14, Apr 24 '16 at 22:52

Kevin J. Chase · Answer 1 · 2016-04-25T00:37:38.630

When encountering strange characters in a "text" file, the right thing to do is to contact whoever created the file (possibly just by reading elsewhere on their Web site) to find out what they were trying to send you. Meta-information like character encoding, let alone more complex ideas like file and record format, are mostly transmitted out-of-band, meaning at best you will find only hints of them in the file itself.

In this case, however, you might have a "plain text" file that uses some of the more obscure ASCII control codes to separate records and fields in a table.

The Group Separator you've encountered, along with its siblings, was intended to separate fields and rows (and weirder subdivisions) of ASCII text data like what you have. Here are the relevant rows from the Wikipedia chart I linked above, stripped down some:

       Python
Dec    String    Abbr  Keyboard  Name
--------------------------------------
11     '\x0b'    VT    Ctrl-K    Vertical Tab

28     '\x1c'    FS    Ctrl-\    File Separator
29     '\x1d'    GS    Ctrl-]    Group Separator
30     '\x1e'    RS    Ctrl-^    Record Separator
31     '\x1f'    US    Ctrl-_    Unit Separator

That string of Group Separators you encountered could indicate a bunch of empty groups next to each other, in the same way that a bunch of commas next to each other ('Obama,Barack,,,,44') indicate empty cells in the CSV representation of a spreadsheet. The Vertical Tabs might separate "rows" (instead of, or in addition to, one of the separators above).
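If that guess is right, each line could be split on the separators rather than stripped of them. The following is only a sketch under that assumption: the layout (name, then a run of GS, then VT-delimited IDs) is inferred from the single example line in the question, and `parse_line` is a hypothetical helper name, not anything the file's author documented.

```python
GS = "\x1d"  # Group Separator (Ctrl-])
VT = "\x0b"  # Vertical Tab (Ctrl-K)


def parse_line(line):
    """Split one line into (name, [sequence IDs]).

    Assumed layout: subspecies name, one or more GS characters,
    then sequence IDs separated by VT characters.
    """
    # Everything before the first GS is taken to be the name.
    name, _, rest = line.partition(GS)
    # Drop any remaining GS characters, then split the IDs on VT.
    ids = [tok.strip() for tok in rest.replace(GS, "").split(VT) if tok.strip()]
    return name.strip(), ids


if __name__ == "__main__":
    sample = "Ablepharus bivittatus \x1d\x1d\x1d\x1d\x1d 36660\x0b4560\x0b23400\n"
    print(parse_line(sample))
```

If the empty groups turn out to carry meaning (like empty CSV cells), the `replace(GS, "")` step would need to become a real split so the empty fields are preserved.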

But this is all just guesswork. It's just as likely this file is not "plain text" at all, but the export format of some database or spreadsheet program. Again, whoever published the data ought to have also explained the file format somewhere... If not, and if you can't contact them, then educated guesswork is all you've got.

Thanks for the information, I emailed the professor. So now we wait — SweetJD14, Apr 24 '16 at 22:44
@SweetJD14: I'm kind of surprised he or she didn't just tell you what format the file was in to begin with. — Kevin J. Chase, Apr 24 '16 at 22:46
@SweetJD14 Did your professor ever reply? What kind of file was it? Should this question still be open? — Kevin J. Chase, May 30 '16 at 12:52

Pierre Barre · Answer 2 · 2016-04-24T22:10:48.143

You have to know which encoding your file was written in. Your issue comes from decoding the file with a different, incompatible encoding from the one used when it was written to the storage device.

Then, you will just have to do something like this:

# Open in binary mode so the bytes can be decoded explicitly.
with open('file.txt', 'rb') as f:
    file_decoded = f.read().decode('the_encoding_of_the_file')

If you don't know the encoding, there is no way to do this reliably. But you can still use a library such as chardet, which tries to guess the encoding.
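Once the encoding is known, removing the control characters is a small step after decoding. Here is a minimal sketch using only the standard library. It assumes the file really is UTF-16LE, as chardet reported to the asker (that guess could be wrong), and `clean_text` is a hypothetical helper name. It replaces each run of control characters with a single space so the fields stay separated, and it leaves tab, line feed, and carriage return alone.

```python
import re

# C0 control characters and DEL, excluding tab (\x09), LF (\x0a), and CR (\x0d).
# This range covers GS (\x1d) and VT (\x0b).
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]+")


def clean_text(raw, encoding="utf-16-le", replacement=" "):
    """Decode raw bytes and replace runs of control characters."""
    text = raw.decode(encoding)
    # 'utf-16-le' does not consume a byte order mark, so drop it if present.
    text = text.lstrip("\ufeff")
    return CONTROL_CHARS.sub(replacement, text)


if __name__ == "__main__":
    # Assumption: 'file.txt' stands in for the real file name.
    with open("file.txt", "rb") as f:
        print(clean_text(f.read()))
```

Passing `replacement=""` would delete the characters outright, but that would glue neighbouring sequence IDs together.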

edited Apr 24 '16 at 22:10

answered Apr 24 '16 at 22:03

Ah, how would I go about finding how the file was encoded? It was in a zip file and then I unzipped it and saved as a text file. Is there some way I can work backwards to find out how the file was encoded? – SweetJD14 Apr 24 '16 at 22:06
How would this change what the op sees in notepad? – Padraic Cunningham Apr 24 '16 at 22:12
@PadraicCunningham That's pretty much the same issue; the OP just has to change the Notepad settings to open the file with the correct encoding. – Pierre Barre Apr 24 '16 at 22:13
Notepad++ doesn't encode for UTF-16LE. What IDE should I use instead? Thanks – SweetJD14 Apr 24 '16 at 22:15
@SweetJD14 Are you sure your file is encoded using a Unicode-family encoding? – Pierre Barre Apr 24 '16 at 22:17
I wonder if `chardet` saw a bunch of rarely-used ASCII control codes and guessed at a "least wrong" UTF encoding, instead of identifying it as "ASCII text that actually _uses_ some of those weird data separators". (I'm not claiming it _isn't_ UTF-16LE at this point, or even "WordStar 3.0 for DOS off a professor's slightly corrupted 3.5″ DOS floppy".) – Kevin J. Chase Apr 24 '16 at 22:57
