I am trying to read a Unicode data file into a few lists. The file is tab-separated and mixes Unicode characters, integers, and floats, in this format:

Է   1335    1.1
դ   1380    1.2
    32  1.3
ն   1398    1.4
ե   1381    1.5
ր   1408    1.6

I am reading the file with numpy.genfromtxt, following the approach from a related question about numpy.genfromtxt:

import numpy as np
# Converter for column 0: decode the raw bytes as UTF-8.
decodef = lambda x: x.decode("utf-8")
arr = np.genfromtxt("./data_files/data", delimiter="\t", dtype="U1, i4, f8", converters={0: decodef})

This gives me a numpy.ndarray in which the space in the first column of the third row has been replaced by an empty string:

('Է', 1335, 1.1)
('դ', 1380, 1.2)
('', 32, 1.3)
('ն', 1398, 1.4)
('ե', 1381, 1.5)
('ր', 1408, 1.6)

I have already tried to fix this with the autostrip=False (the default value), missing_values=" ", and replace_space='_' parameters, but I still get the same array, with an empty string where the space should be. I guess all these parameters are intended only for delimiter manipulation?

Any ideas how to overcome this?

I am using Python 3.4.5.

  • What is the problem? This is a structured array. The empty string in the 3rd record? Given the dtype the array display looks normal. – hpaulj Jan 03 '17 at 14:58
  • Yes, the empty string in the third record. For other symbols everything works as expected. Edited that part to clarify. –  Jan 03 '17 at 15:00
  • Some parameters apply to field names, not values. Is there a fill value parameter? – hpaulj Jan 03 '17 at 15:12
  • No, no fill value parameter is set. –  Jan 03 '17 at 15:21

1 Answer

Apparently the genfromtxt method somehow removes the space.
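One way a space can get lost before a converter ever sees it is if each line is stripped of surrounding spaces and newlines before being split on the delimiter. The toy snippet below only illustrates that effect with plain string methods. It is an assumption about the mechanism, not NumPy's actual code, and the variable names are made up for the illustration.

```python
# Toy model only: assumes the reader strips spaces and newlines from each
# line before splitting on the delimiter. This is NOT NumPy's real code.
line = " \t32\t1.3\n"  # third record of the sample data: space, 32, 1.3

# Strip spaces and newlines first, then split: the leading space is gone.
stripped_first = line.strip(" \r\n").split("\t")

# Strip only the line ending, then split: the space survives.
newline_only = line.rstrip("\r\n").split("\t")

print(stripped_first)  # ['', '32', '1.3']
print(newline_only)    # [' ', '32', '1.3']
```

If something like this happens internally, the first field of that record arrives at the converter already empty, which would match the output shown in the question.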

If you use

# `if x` is false for both b'' and '', so an empty first field is turned back into a space.
decodef = lambda x: x.decode("utf-8") if x else " "
arr = np.genfromtxt("text", delimiter="\t", dtype="U1, i4, f8", converters={0: decodef})

it works. I still do not understand exactly why, though.
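If the end goal is just a few lists, as in the question, a different route is to skip genfromtxt and read the file with the standard-library csv module, which keeps a field that consists of a single space. This is a minimal sketch of that swapped-in technique, not a fix for genfromtxt itself. The function name read_columns and the in-memory SAMPLE string are hypothetical, chosen only to keep the example self-contained.

```python
import csv
import io

# Sample mirroring the question's file; the third row's first field is a single space.
SAMPLE = "Է\t1335\t1.1\nդ\t1380\t1.2\n \t32\t1.3\nն\t1398\t1.4\n"


def read_columns(handle):
    """Read tab-separated (char, int, float) rows into three lists."""
    chars, codes, values = [], [], []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:  # skip completely blank lines
            continue
        char, code, value = row
        chars.append(char)  # kept verbatim, including a lone space
        codes.append(int(code))
        values.append(float(value))
    return chars, codes, values


chars, codes, values = read_columns(io.StringIO(SAMPLE))
print(chars)
print(codes)
print(values)
```

For the real file, the handle would come from something like open("./data_files/data", encoding="utf-8", newline="") instead of io.StringIO. The lists can still be turned into NumPy arrays afterwards if needed.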