numpy genfromtxt issues with .txt input

Question

I'm trying to import a text file with string and numeric columns using the numpy.genfromtxt function. Essentially, I need an array of strings. Here is a sample file that is giving me trouble:

    H2S 1.4
    C1  3.6

The file is encoded as Unicode. Here's the code I'm using:

    import numpy as np
    decodf = lambda x: x.decode('utf-16')
    sample = np.genfromtxt(('ztest.txt'), dtype=str,
                           converters={0: decodf, 1: decodf},
                           delimiter='\t',
                           usecols=0)
    print(sample)

Here's the output:

    ['H2S' 'None']

I've tried several ways to fix this. With dtype=None and the converters removed, I get:

    [b'\xff\xfeH\x002\x00S' b'\x00g\x00\xe8\x00n']

I also tried removing the converters and setting dtype=str, and got:

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)

I understand this is a troublesome function. I saw different options (e.g. here) but couldn't get any of them to work.

What am I doing wrong? In the meantime, I'm looking into Pandas... Thanks in advance

Warren Weckesser · Answer 1 · 2015-12-03T14:53:00.937


Your file is encoded as UTF-16, and the first two characters are the BOM.
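As a quick check of the BOM claim, here is a minimal sketch using only the standard library. The helper name `sniff_utf16_bom` is made up for illustration, and the sample bytes are built by hand to mimic the `\xff\xfe` prefix visible in the question's output, which is the little-endian UTF-16 BOM.

```python
import codecs


def sniff_utf16_bom(raw):
    """Return the codec implied by a leading UTF-16 BOM, or None if absent.

    Hypothetical helper for illustration; not part of NumPy or the stdlib.
    """
    if raw.startswith(codecs.BOM_UTF16_LE):  # b'\xff\xfe'
        return 'utf-16-le'
    if raw.startswith(codecs.BOM_UTF16_BE):  # b'\xfe\xff'
        return 'utf-16-be'
    return None


# Sample bytes built to resemble the file in the question (an assumption):
# a little-endian BOM followed by the tab-separated text.
text = u'H2S\t1.4\nC1\t3.6\n'
raw = codecs.BOM_UTF16_LE + text.encode('utf-16-le')

print(sniff_utf16_bom(raw))       # codec name inferred from the BOM
print(repr(raw.decode('utf-16')))  # the 'utf-16' codec consumes the BOM
```

Decoding with the plain `'utf-16'` codec reads the BOM to pick the byte order and drops it from the result, which is why opening the file with `encoding='UTF-16'` avoids the stray `\xff\xfe` bytes.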

Try this (with Python 2.7):

    import io
    import numpy as np

    with io.open('ztest.txt', 'r', encoding='UTF-16') as f:
        data = np.genfromtxt(f, delimiter='\t', dtype=None, usecols=[0])  # or dtype=str

genfromtxt has some issues with Unicode files when run in Python 3. As a workaround, you can encode the lines before passing them to genfromtxt. For example, the following encodes each line as latin-1 first:

    import io
    import numpy as np

    with io.open('ztest.txt', 'r', encoding='UTF-16') as f:
        lines = [line.encode('latin-1') for line in f]
        data = np.genfromtxt(lines, delimiter='\t', dtype=None, usecols=[0])
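To see what the re-encoding step hands to genfromtxt, the following sketch reproduces it without NumPy. It writes a small UTF-16 file to a temporary directory (the path and contents are assumptions modeled on the question's sample), then compares the raw bytes with the latin-1 lines.

```python
import io
import os
import tempfile

# Create a throwaway UTF-16 file resembling the question's ztest.txt.
path = os.path.join(tempfile.mkdtemp(), 'ztest.txt')
with io.open(path, 'w', encoding='utf-16') as f:
    f.write(u'H2S\t1.4\nC1\t3.6\n')

# Raw view: a 2-byte BOM, then two bytes per character.
with io.open(path, 'rb') as f:
    raw = f.read()

# Decode while reading, then re-encode each line as single-byte latin-1.
with io.open(path, 'r', encoding='utf-16') as f:
    lines = [line.encode('latin-1') for line in f]

print(raw[:8])  # starts with the BOM; byte order depends on the platform
print(lines)    # [b'H2S\t1.4\n', b'C1\t3.6\n']
```

The latin-1 lines carry no BOM and no interleaved `\x00` bytes, so the tab delimiter can be matched byte for byte. Note that latin-1 can only represent code points up to U+00FF; a file containing other characters would raise `UnicodeEncodeError` at the encode step.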

edited Dec 03 '15 at 14:53

answered Dec 01 '15 at 15:06

Warren Weckesser


Hi, thanks for answering. Your code yields `TypeError: Can't convert 'bytes' object to str implicitly` – Tarifazo Dec 03 '15 at 08:46
Ah, right. I was using python 2.7. I get the same error when I use python 3.4. – Warren Weckesser Dec 03 '15 at 14:16
