Remove CP932-encoded text from data file comments with np.genfromtxt

Question

I am trying to import a data file using np.genfromtxt.

The data file contains a large commented header, with each line beginning with the comment character *.

Even when I pass comments='*' to genfromtxt, I still get a UnicodeDecodeError because of the header's encoding. After some googling, the encoding turns out to be CP932 (Japanese characters).

An example of such a line is b'*HW_ATTACHMENT_NAME "\x95W\x8f\x80|Standard"\r\n'. It can be decoded with _.decode('cp932') to '*HW_ATTACHMENT_NAME "標準|Standard"\r\n'.
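To make the byte-level behaviour concrete, here is a minimal sketch using only the example line above. The variable names are illustrative, and the UTF-8 attempt is an assumption about what a default decode might look like on some systems, not something confirmed in the question.

```python
# The header line exactly as it appears in the file (raw bytes).
raw = b'*HW_ATTACHMENT_NAME "\x95W\x8f\x80|Standard"\r\n'

# Decoding with the matching codec recovers the Japanese text.
text = raw.decode('cp932')

# A UTF-8 decode of the same bytes fails, because 0x95 is not a valid
# UTF-8 start byte.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Dropping non-ASCII bytes does not raise, but it leaves a stray 'W' behind:
# 0x57 ('W') is the second byte of the first two-byte CP932 character.
stripped = raw.decode('ascii', errors='ignore')

print(repr(text))
print(utf8_ok)
print(repr(stripped))
```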

However, genfromtxt does not seem to recognize cp932 as an encoding: passing encoding='cp932' still raises a UnicodeDecodeError.

So, is there a way to force genfromtxt not to read these characters? If not, is there a way to remove all CP932-encoded characters?

Something like this:

with open(file) as f:
    #some code to remove cp932 encoded text
    data = np.genfromtxt(f, comments='*', dtype='float')

Edit: This does not work; perhaps it is the wrong way to go about it:

with open(file) as f:  
    lines = (line.decode('cp932') for line in f)  
    data = np.genfromtxt(lines, comments='*', dtype='float')

Have you tried passing the encoding to `genfromtxt`? https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.genfromtxt.html — tkausl, Jan 22 '19 at 16:45
Yes, that's what I meant by genfromtxt not recognizing cp932 as an encoding. I'll edit my post to make that more clear — Brooks, Jan 22 '19 at 16:54
Alright. According to the documentation, `genfromtxt` also accepts a generator, so you could write a generator to read and decode the file on-the-fly. — tkausl, Jan 22 '19 at 16:56
`genfromtxt` accepts anything that feeds it lines. For demos I just make a list of lines, but it could just as well be a function that reads lines of `f`, tweaks them, and passes them on. — hpaulj, Jan 22 '19 at 17:06
What is wrong with your last attempt? Some sort of error? Keep in mind that most of us don't have a `csv` file like yours, so we can't replicate your problem. You have to provide all the debugging details. — hpaulj, Jan 22 '19 at 18:09
Gives the same error as before. It's as if it is still reading the whole file, which is obviously not true. I tried to pass only the data as well with the statement: — Brooks, Jan 22 '19 at 21:38
`lines = (line.decode('ascii', 'ignore') for line in f if not line.startswith('*'))` — Brooks, Jan 22 '19 at 21:52
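The generator idea from the comments can be sketched as follows. This is only a sketch under stated assumptions: the helper name `data_lines`, the temporary stand-in file, and its whitespace-separated numeric rows are all hypothetical, since the real file is not shown. It also assumes the earlier attempts failed because `open(file)` without `'rb'` decodes each line with the platform default encoding before any filter runs, which the thread does not confirm.

```python
import os
import tempfile

import numpy as np


def data_lines(path, comment=b'*', encoding='cp932'):
    """Yield decoded, non-comment lines from a file opened in binary mode."""
    with open(path, 'rb') as f:  # binary mode: no implicit decoding on read
        for raw in f:
            if raw.lstrip().startswith(comment):
                continue  # skip header lines before any decoding happens
            yield raw.decode(encoding)


# Build a small stand-in file (hypothetical contents, for illustration only).
header = '*HW_ATTACHMENT_NAME "標準|Standard"\r\n'.encode('cp932')
body = b'1.0 2.0 3.0\r\n4.0 5.0 6.0\r\n'
with tempfile.NamedTemporaryFile(delete=False, suffix='.txt') as tmp:
    tmp.write(header + body)
    path = tmp.name

# genfromtxt consumes the generator; the header never reaches it.
data = np.genfromtxt(data_lines(path), dtype=float)
os.remove(path)
print(data)
```

Because the comparison is done on bytes (`b'*'`), the header lines are discarded without ever being decoded. Under the same assumptions, opening the file in text mode with `open(path, encoding='cp932')` and passing the handle to `genfromtxt` with `comments='*'` may also work, since the decoding is then done by `open` rather than by `genfromtxt`.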

