
So, I've been writing up code to read in a dataset from a file and separate it out for analysis.

The data in question is read from a .dat file, and looks like this:

14        HO2       O3        OH        O2        O2
15        HO2       HO2       H2O2      O2
16        H2O2      OH        HO2       H2O
17        O         O         O2
18        O         O2        O3
19        O         O3        O2        O2

The code I've written looks like this:

edge_data=np.genfromtxt('Early_earth_reaction.dat', dtype = str, 
missing_values=True, filling_values=bool)

The plan was that I'd then run the values from the dataset and build a paired list from them.

edge_list=[]
for i in range(360):
    edge_list.append((edge_data[i,0],edge_data[i,2]))
    edge_list.append((edge_data[i,1],edge_data[i,2]))
    print edge_data[i,0]
    if edge_data[i,3] != None:
        edge_list.append((edge_data[i,0],edge_data[i,3]))
        edge_list.append((edge_data[i,1],edge_data[i,3]))
    if edge_data[i,4]!= None:
        edge_list.append((edge_data[i,0],edge_data[i,4]))
        edge_list.append((edge_data[i,1],edge_data[i,4]))

However, upon running it, I get the error message

File "read_early_earth.py", line 52, in main
edge_data=np.genfromtxt('Early_earth_reaction.dat', dtype = str,  
usecols=(1,2,3,4,5), missing_values=True, filling_values=bool)
File "/usr/lib/python2.7/dist-packages/numpy/lib/npyio.py", line 1667, 
in genfromtxt
raise ValueError(errmsg)
ValueError: Some errors were detected !
Line #6 (got 4 columns instead of 5)
Line #14 (got 6 columns instead of 5)
Line #17 (got 4 columns instead of 5)

And so on and so forth. As far as I can tell, this is happening because there are rows where not all the columns have values in them, which apparently throws numpy for a loop.

Is there a work-around for this in numpy? Alternatively, is there another way to accomplish this task? I know that, if worst comes to worst, I can torture some regular expressions into doing the job, but I'd prefer a method that's a bit more efficient if at all possible.

Thanks!

Tessa

2 Answers

Looks like you've already read the genfromtxt documentation about missing values. Does it say anything about the use of delimiters?

I think it can handle missing values with lines like

'one, 1, 234.4, , ,'
'two, 3, , 4, 5'

but when the delimiter is the default 'white-space' it can't. One of the first steps after reading a line is

 strings = line.split(delimiter)

and it objects if len(strings) doesn't match the expected number of columns. Apparently it does not try to guess that you want to pad the line with n-len(strings) missing values.
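To make the difference concrete, here is a small sketch in plain Python (no numpy involved) of what splitting does to a short row. The comma-separated line is a made-up rewrite of row 17, used only for illustration.

```python
# Whitespace-delimited: trailing missing fields leave no trace at all,
# so the row just comes out shorter than the others.
ws_line = "17        O         O         O2"
ws_fields = ws_line.split()
print(len(ws_fields), ws_fields)    # 4 ['17', 'O', 'O', 'O2']

# Explicit delimiter: each missing value still occupies a slot as an
# empty string, so the column count stays at 6.
csv_line = "17,O,O,O2,,"
csv_fields = csv_line.split(",")
print(len(csv_fields), csv_fields)  # 6 ['17', 'O', 'O', 'O2', '', '']
```

With whitespace splitting there is simply nothing left in the line to say that two fields are missing, which is why the column count check fails.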

Options that come to mind:

  • try Pandas; it may make more effort to guess your intentions

  • write your own reader. Pandas is compiled; genfromtxt is plain numpy Python. It reads the file line by line, splits and converts fields, and appends each list to a master list. It converts that list of lists into an array at the end. Your own reader should be just as efficient.

  • preprocess your file to add the missing values or change the delimiter. genfromtxt accepts anything that feeds it lines. So it works with a list of strings, a file reader that yields modified lines, etc. This may be simplest.

    def foo(astr):
        strs=astr.split()
        if len(strs)<6:
            strs.extend([b' ']*(6-len(strs)))
        return b','.join(strs)

Simulating with a list of strings (in Py3):

In [139]: txt=b"""14        HO2       O3        OH        O2        O2
     ...: 15        HO2       HO2       H2O2      O2
     ...: 16        H2O2      OH        HO2       H2O
     ...: 17        O         O         O2
     ...: 18        O         O2        O3
     ...: 19        O         O3        O2        O2""".splitlines()

In [140]: [foo(l) for l in txt]
Out[140]: 
[b'14,HO2,O3,OH,O2,O2',
 b'15,HO2,HO2,H2O2,O2, ',
 b'16,H2O2,OH,HO2,H2O, ',
 b'17,O,O,O2, , ',
 b'18,O,O2,O3, , ',
 b'19,O,O3,O2,O2, ']

In [141]: np.genfromtxt([foo(l) for l in txt], dtype=None, delimiter=',')
Out[141]: 
array([(14, b'HO2', b'O3', b'OH', b'O2', b'O2'),
       (15, b'HO2', b'HO2', b'H2O2', b'O2', b''),
       (16, b'H2O2', b'OH', b'HO2', b'H2O', b''),
       (17, b'O', b'O', b'O2', b' ', b''),
       (18, b'O', b'O2', b'O3', b' ', b''),
       (19, b'O', b'O3', b'O2', b'O2', b'')], 
      dtype=[('f0', '<i4'), ('f1', 'S4'), ('f2', 'S3'), ('f3', 'S4'), ('f4', 'S3'), ('f5', 'S2')])
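For the "write your own reader" option, a minimal sketch might look like the following. The names `read_rows` and `build_edges` are hypothetical, and the layout (an index, two reactants, then up to three products) is an assumption taken from the loop in the question. The sample text stands in for the real file.

```python
import io

# Stand-in for the .dat file; only a few rows from the question.
SAMPLE = """\
14        HO2       O3        OH        O2        O2
15        HO2       HO2       H2O2      O2
17        O         O         O2
"""

def read_rows(lines, ncols=6, fill=""):
    """Split each line on whitespace and pad short rows up to ncols."""
    rows = []
    for line in lines:
        fields = line.split()
        if not fields:              # skip blank lines
            continue
        if len(fields) > ncols:
            raise ValueError("too many columns: %r" % line)
        rows.append(fields + [fill] * (ncols - len(fields)))
    return rows

def build_edges(rows):
    """Pair each of the two reactants with every non-empty product."""
    edges = []
    for _, r1, r2, *products in rows:
        for product in products:
            if product:             # padded (missing) fields are skipped
                edges.append((r1, product))
                edges.append((r2, product))
    return edges

rows = read_rows(io.StringIO(SAMPLE))
edges = build_edges(rows)
print(rows[2])    # ['17', 'O', 'O', 'O2', '', '']
print(edges[:2])  # [('HO2', 'OH'), ('O3', 'OH')]
```

For the real file you would pass an open file object instead of the `io.StringIO`. If you still want an array, `np.array(rows)` on the padded list should work, since every row then has the same length.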
hpaulj

It looks like your data is nicely aligned in fields of exactly 10 characters. If that is always the case, you can tell genfromtxt the field widths to use by specifying the sequence of field widths in the delimiter argument.

Here's an example.

First, your data file:

In [20]: !cat reaction.dat
14        HO2       O3        OH        O2        O2
15        HO2       HO2       H2O2      O2
16        H2O2      OH        HO2       H2O
17        O         O         O2
18        O         O2        O3
19        O         O3        O2        O2

For convenience, I'll define the number of fields and the field width here. (In general, it is not necessary that all the fields have the same width.)

In [21]: numfields = 6

In [22]: fieldwidth = 10

Tell genfromtxt that the data is in fixed width columns by passing in the argument delimiter=(10, 10, 10, 10, 10, 10):

In [23]: data = genfromtxt('reaction.dat', dtype='S%d' % fieldwidth, delimiter=(fieldwidth,)*numfields)

Here's the result. Note that "missing" fields are empty strings. Also note that non-empty fields include the white space, and the last non-empty field in each row includes the newline character:

In [24]: data
Out[24]: 
array([[b'14        ', b'HO2       ', b'O3        ', b'OH        ',
        b'O2        ', b'O2\n'],
       [b'15        ', b'HO2       ', b'HO2       ', b'H2O2      ',
        b'O2\n', b''],
       [b'16        ', b'H2O2      ', b'OH        ', b'HO2       ',
        b'H2O\n', b''],
       [b'17        ', b'O         ', b'O         ', b'O2\n', b'', b''],
       [b'18        ', b'O         ', b'O2        ', b'O3\n', b'', b''],
       [b'19        ', b'O         ', b'O3        ', b'O2        ',
        b'O2\n', b'']], 
      dtype='|S10')

In [25]: data[1]
Out[25]: 
array([b'15        ', b'HO2       ', b'HO2       ', b'H2O2      ', b'O2\n',
       b''], 
      dtype='|S10')

We could clean up the strings in a second step, or we can have genfromtxt do it by providing a converter for each field that simply strips the white space from the field:

In [26]: data = genfromtxt('reaction.dat', dtype='S%d' % fieldwidth, delimiter=(fieldwidth,)*numfields,
    ...:                   converters={k: lambda s: s.strip() for k in range(numfields)})

In [27]: data
Out[27]: 
array([[b'14', b'HO2', b'O3', b'OH', b'O2', b'O2'],
       [b'15', b'HO2', b'HO2', b'H2O2', b'O2', b''],
       [b'16', b'H2O2', b'OH', b'HO2', b'H2O', b''],
       [b'17', b'O', b'O', b'O2', b'', b''],
       [b'18', b'O', b'O2', b'O3', b'', b''],
       [b'19', b'O', b'O3', b'O2', b'O2', b'']], 
      dtype='|S10')

In [28]: data[:,0]
Out[28]: 
array([b'14', b'15', b'16', b'17', b'18', b'19'], 
      dtype='|S10')

In [29]: data[:,5]
Out[29]: 
array([b'O2', b'', b'', b'', b'', b''], 
      dtype='|S10')
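For intuition, fixed-width splitting can be mimicked in plain Python by slicing each line at multiples of the field width. This is only a sketch of the idea, not how genfromtxt is implemented; `split_fixed` and `make_line` are hypothetical helpers, and row 20 is a made-up row with a gap in the middle.

```python
def split_fixed(line, width=10, nfields=6):
    """Cut a line into nfields chunks of `width` characters, stripping each."""
    line = line.rstrip("\r\n")
    # Slicing past the end of a string yields '', so short rows are
    # padded with empty fields automatically.
    return [line[i * width:(i + 1) * width].strip() for i in range(nfields)]

def make_line(fields, width=10):
    """Build a left-aligned fixed-width line like those in the data file."""
    return "".join(f.ljust(width) for f in fields).rstrip() + "\n"

short_row = split_fixed(make_line(["17", "O", "O", "O2"]))
gap_row = split_fixed(make_line(["20", "O", "", "O2"]))  # made-up row
print(short_row)  # ['17', 'O', 'O', 'O2', '', '']
print(gap_row)    # ['20', 'O', '', 'O2', '', '']
```

The second row shows one advantage of treating the file as fixed width: a field missing from the middle of a row stays in its own column, whereas whitespace splitting would shift everything after it to the left.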
Warren Weckesser