
I'm trying to import a text file with string and numeric columns using the numpy.genfromtxt function. Essentially, I need an array of strings. Here is a sample file that is giving me trouble:

    H2S 1.4
    C1  3.6

The file is encoded as Unicode. Here's the code I'm using:

    import numpy as np
    decodf = lambda x: x.decode('utf-16')
    sample = np.genfromtxt('ztest.txt', dtype=str,
                           converters={0: decodf, 1: decodf},
                           delimiter='\t',
                           usecols=0)
    print(sample)

Here's the output:

    ['H2S' 'None']

I've tried several ways to fix this issue. Setting dtype=None and removing the converters, I get:

    [b'\xff\xfeH\x002\x00S' b'\x00g\x00\xe8\x00n']

I also tried removing the converters and setting dtype=str, and got:

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)

I understand this is a troublesome function. I saw different options (e.g. here) but couldn't get any of them to work.

What am I doing wrong? In the meantime, I'm looking into Pandas... Thanks in advance

Community
Tarifazo

1 Answer


Your file is encoded as UTF-16, and the first two bytes (\xff\xfe) are the byte order mark (BOM).
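To see why the raw bytes in the question look the way they do, here is a small standard-library sketch. It assumes the file is little-endian UTF-16, which is what the \xff\xfe prefix in the question's output suggests, and it builds the first sample row in memory rather than reading the real file.

```python
import codecs

# Build the first sample row as little-endian UTF-16 with an explicit BOM,
# so the bytes do not depend on the platform's native byte order.
raw = codecs.BOM_UTF16_LE + "H2S\t1.4\n".encode("utf-16-le")

# A parser that works on raw bytes splits on the single byte b"\t",
# so the BOM and the interleaved \x00 bytes stay in the first field.
first_field = raw.split(b"\t")[0]
print(first_field)

# Decoding the whole buffer first consumes the BOM and yields clean text.
decoded = raw.decode("utf-16")
print(decoded.split("\t")[0])
```

The first print shows the BOM followed by H, 2 and S padded with null bytes, much like the dtype=None output in the question. The second shows the field after proper decoding.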

Try this (with Python 2.7):

    import io
    import numpy as np

    with io.open('ztest.txt', 'r', encoding='UTF-16') as f:
        data = np.genfromtxt(f, delimiter='\t', dtype=None, usecols=[0])  # or dtype=str

genfromtxt has some issues when run in Python 3 with Unicode files. As a workaround, you can encode the lines before passing them to genfromtxt. For example, the following encodes each line as latin-1 before passing the lines to genfromtxt:

    import io
    import numpy as np

    with io.open('ztest.txt', 'r', encoding='UTF-16') as f:
        lines = [line.encode('latin-1') for line in f]
        data = np.genfromtxt(lines, delimiter='\t', dtype=None, usecols=[0])
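If a newer NumPy is available, a shorter route may work. The sketch below assumes NumPy 1.14 or later, where genfromtxt accepts an encoding argument, and it writes a throwaway copy of the two-row sample to a temporary path (hypothetical, not the original ztest.txt) so that it runs on its own.

```python
import os
import tempfile

import numpy as np

# Recreate the two-row sample from the question as a tab-separated UTF-16 file.
# The temporary directory is only for illustration.
path = os.path.join(tempfile.mkdtemp(), "ztest.txt")
with open(path, "w", encoding="utf-16") as f:
    f.write("H2S\t1.4\nC1\t3.6\n")

# Assumption: NumPy >= 1.14, so genfromtxt can decode the file itself.
sample = np.genfromtxt(path, delimiter="\t", dtype=str, usecols=0,
                       encoding="utf-16")
print(sample)
```

With the encoding given explicitly, no converters or manual re-encoding of the lines should be needed, and the first column comes back as an array of ordinary Python 3 strings.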
Warren Weckesser