
I am trying to import data using numpy's genfromtxt with header names and non-homogeneous data types. Every time I run the program I get the error:

Traceback (most recent call last):
    raise ValueError(errmsg)
ValueError: Some errors were detected !
    Line #8 (got 6 columns instead of 1)
    Line #9 (got 6 columns instead of 1)
    Line #10 (got 6 columns instead of 1)
    Line #11 (got 6 columns instead of 1)
    Line #12 (got 6 columns instead of 1)

I have already gone through this question, but it didn't solve my problem. It seems like a very simple problem, but I can't figure out what is wrong. The code and data are included:

Code

import numpy as np
data = np.genfromtxt('Data.dat', comments='#', delimiter='\t', names=True, dtype=None).transpose()
print(data)

Tab-separated data

# -----
# -----
# -----
# -----
# -----
# -----
# -----
column_1    column_2    column_3    column_4    column_5    column_6
1   2   3   A   1   F
4   3   2   B   2   G
1   4   3   C   3   H
5   6   4   D   4   I

Update

In short, what I need is a way to make names=True read the field names from the first uncommented valid line, rather than from the first valid line after skip_header.

Tom Kurushingal

3 Answers


When names=True, genfromtxt expects the first line (after skip_header lines) to contain the field names, even if that line is a comment. Apparently it is pretty common for field names to be specified in a comment. If you have a variable number of comments before your uncommented field names, you'll have to work around this quirk of genfromtxt. The following shows one way you could do this.

Here's my test file. (The file is space-delimited. Add delimiter='\t' in the call to genfromtxt for a tab-delimited file).

In [12]: cat with_comments.dat
# Some
# comments
# here
foo bar baz
1.0 2.0 3.0
4.0 5.0 6.0
7.0 8.0 9.0

Open the file, and read lines until the line is not a comment:

In [13]: f = open("with_comments.dat", "r")

In [14]: line = f.readline()

In [15]: while line.startswith('#'):
   ....:     line = f.readline()
   ....: 

line now holds the line of field names:

In [16]: line
Out[16]: 'foo bar baz\n'

Convert it to a list of names:

In [17]: names = line.split()

Give those names to genfromtxt, and read the rest of the file:

In [18]: data = genfromtxt(f, names=names)

In [19]: data
Out[19]: 
array([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0), (7.0, 8.0, 9.0)], 
      dtype=[('foo', '<f8'), ('bar', '<f8'), ('baz', '<f8')])

Don't forget to close the file (or better, use with open("with_comments.dat", "r") as f: instead):

In [20]: f.close()
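The steps above can be wrapped into one reusable helper (the name genfromtxt_after_comments is hypothetical; keyword arguments such as delimiter are simply forwarded to genfromtxt):

```python
import numpy as np

def genfromtxt_after_comments(path, comment_char='#', **kwargs):
    """Skip a variable number of leading comment lines, read the first
    uncommented line as the field names, then hand the rest of the
    file to np.genfromtxt. Sketch only; kwargs go to genfromtxt."""
    with open(path, 'r') as f:
        line = f.readline()
        while line.startswith(comment_char):
            line = f.readline()
        names = line.split()  # first uncommented line holds the field names
        return np.genfromtxt(f, names=names, **kwargs)
```

Because the file object f is already positioned past the header when genfromtxt is called, only the data rows are parsed, regardless of how many comment lines the file starts with.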
Warren Weckesser
  • How did you get Out[19]? And now, how do I convert the structured array to numpy array? – Tom Kurushingal Mar 02 '15 at 18:44
  • I'm not sure what you mean by "get Out[19]". That just shows the value of `data`. (I was working in an ipython (http://ipython.org/) shell when I did this example.) Your question about converting the structured array (which *is* a numpy array) to a "numpy array" should be a new question. But search for similar questions first--it has been asked before. – Warren Weckesser Mar 02 '15 at 19:24

Ok, a little prodding around has revealed the answer. From the genfromtxt() documentation (http://docs.scipy.org/doc/numpy/user/basics.io.genfromtxt.html):

Note: There is one notable exception to this behavior: if the optional argument names=True, the first commented line will be examined for names.

Therefore, to make your code work, your data should be in the following format:

#column_1   column_2    column_3    column_4    column_5    column_6
#   -----
#   -----
#   -----
#   -----
#   -----
#   -----
1   2   3   A   1   F
4   3   2   B   2   G
1   4   3   C   3   H
5   6   4   D   4   I

Alternatively, if you have a variable number of header/comments rows, but the columns are all the same, then you could define the column names in the genfromtxt arguments:

data = np.genfromtxt(
    path, comments='#', delimiter='\t', 
    names='column_1,column_2,column_3,column_4,column_5,column_6',
    dtype=None
)

However, with explicit names and the comments keyword, genfromtxt still reads the first row after the last comment line, which is your row of column headers, as if it were data. Every column then contains at least one string, so each column's dtype is inferred as a string, and your data at this stage will look like this:

array([('column_1', 'column_2', 'column_3', 'column_4', 'column_5', 'column_6'),
       ('1', '2', '3', 'A', '1', 'F'), ('4', '3', '2', 'B', '2', 'G'),
       ('1', '4', '3', 'C', '3', 'H'), ('5', '6', '4', 'D', '4', 'I')], 
      dtype=[('column_1', 'S8'), ('column_2', 'S8'), ('column_3', 'S8'), ('column_4', 'S8'), ('column_5', 'S8'), ('column_6', 'S8')])

If you know what datatype each column should be, first take a slice that excludes the header row:

data1 = data[1:]

Then convert the dtypes (note that astype returns a new array rather than modifying data1 in place):

data1.astype(np.dtype([('column_1', 'i4'),('column_2', 'i4'), ('column_3', 'i4'), ('column_4', 'S10'), ('column_5', 'i4'), ('column_6', 'S10')]))

Output:

array([(1, 2, 3, 'A', 1, 'F'), (4, 3, 2, 'B', 2, 'G'),
       (1, 4, 3, 'C', 3, 'H'), (5, 6, 4, 'D', 4, 'I')], 
      dtype=[('column_1', '<i4'), ('column_2', '<i4'), ('column_3', '<i4'), ('column_4', 'S10'), ('column_5', '<i4'), ('column_6', 'S10')])
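Putting both steps together as a self-contained sketch (the helper name read_data is for illustration; the dtype list assumes the integer/string layout of the sample data, and encoding=None requests unicode strings on newer NumPy):

```python
import numpy as np

def read_data(path):
    # Explicit names: comments='#' skips the commented lines, but the
    # uncommented header row is still read as a data row, so every
    # column is inferred as a string.
    raw = np.genfromtxt(
        path, comments='#', delimiter='\t',
        names='column_1,column_2,column_3,column_4,column_5,column_6',
        dtype=None, encoding=None)
    # Drop the header row, then cast to the intended dtypes
    # (astype returns a new array, so keep the result).
    return raw[1:].astype(np.dtype([
        ('column_1', 'i4'), ('column_2', 'i4'), ('column_3', 'i4'),
        ('column_4', 'U10'), ('column_5', 'i4'), ('column_6', 'U10')]))
```

This avoids counting comment lines entirely, at the cost of hard-coding the column names and dtypes.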
FuzzyDuck

According to the documentation of genfromtxt:

If names is True, the field names are read from the first valid line after the first skip_header lines.

In your example there are seven comment lines before the header row, so adding skip_header=7 to the genfromtxt call makes it work.
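As a self-contained sketch (it first recreates the sample file from the question so the call can run on its own; encoding=None requests unicode strings on newer NumPy):

```python
import numpy as np

# Recreate the sample file from the question: 7 comment lines,
# then the header row, then tab-separated data.
sample = "\n".join(["# -----"] * 7 + [
    "column_1\tcolumn_2\tcolumn_3\tcolumn_4\tcolumn_5\tcolumn_6",
    "1\t2\t3\tA\t1\tF",
    "4\t3\t2\tB\t2\tG",
]) + "\n"
with open("Data.dat", "w") as f:
    f.write(sample)

# skip_header=7 jumps past the comment lines, so names=True reads
# the next line as the header row.
data = np.genfromtxt("Data.dat", delimiter="\t", skip_header=7,
                     names=True, dtype=None, encoding=None)
```

This only works when the number of comment lines is fixed and known in advance.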

mkrieger1
  • I have a variable number of commented lines depending on the data file; in other words, the skip_header value varies. I have more than 5000 files, so this method is impossible. – Tom Kurushingal Mar 02 '15 at 16:41