Here is a snippet of a large data set I am working with:

# p*  T*     P*         U*          P*_cs    U*_cs  Steps  dt*
0.1   6.0    0.6499(6) -0.478(2)    0.6525  -0.452  30000  0.002
0.2   6.0    1.442(1)  -0.942(2)    1.452   -0.890  30000  0.002
0.3   6.0    2.465(3)  -1.376(1)    2.489   -1.298  30000  0.002
0.4   6.0    3.838(5)  -1.785(3)    3.880   -1.681  20000  0.002
0.5   6.0    5.77(1)   -2.131(3)    5.84    -2.000  20000  0.002
0.6   6.0    8.51(2)   -2.382(5)    8.60    -2.225  20000  0.002
0.7   6.0   12.43(2)   -2.501(4)   12.56    -2.318  20000  0.002
0.8   6.0   18.05(2)   -2.416(4)   18.22    -2.207  20000  0.002
0.9   6.0   26.00(2)   -2.058(4)   26.21    -1.823  20000  0.004
1.0   6.0   37.06(3)   -1.361(6)   37.32    -1.100  20000  0.002
1.1   6.0   52.25(2)   -0.216(4)   52.57     0.072  20000  0.002
1.2   6.0   72.90(5)    1.502(9)   73.28     1.816  20000  0.002
1.25  6.0   85.71(5)    2.612(8)   86.12     2.939  20000  0.002

Loading in this data set using np.loadtxt fails because of the uncertainties for the P* and U* values. Is there a built-in tool for handling this to avoid manually editing the data files?

I am looking at the uncertainties package as a possible solution but I wonder if numpy already has something for this.

Steven C. Howell
  • You could write a converter that would strip off that () part. An alternative is a file/line preprocessor that splits the column into two. – hpaulj Oct 01 '16 at 03:05

2 Answers

In [1]: txt=b"""# p*  T*     P*         U*          P*_cs    U*_cs  Steps  dt*
   ...: 0.1   6.0    0.6499(6) -0.478(2)    0.6525  -0.452  30000  0.002
   ...: 0.2   6.0    1.442(1)  -0.942(2)    1.452   -0.890  30000  0.002
   ...: 0.3   6.0    2.465(3)  -1.376(1)    2.489   -1.298  30000  0.002"""
In [2]: txt=txt.splitlines()

txt is a substitute for a file (a list of bytestrings in Python 3).

In [3]: data=np.genfromtxt(txt, dtype=None, names=True)
In [4]: data
Out[4]: 
array([(0.1, 6.0, b'0.6499(6)', b'-0.478(2)', 0.6525, -0.452, 30000, 0.002),
       (0.2, 6.0, b'1.442(1)', b'-0.942(2)', 1.452, -0.89, 30000, 0.002),
       (0.3, 6.0, b'2.465(3)', b'-1.376(1)', 2.489, -1.298, 30000, 0.002)], 
      dtype=[('p', '<f8'), ('T', '<f8'), ('P', 'S9'), ('U', 'S9'), ('P_cs', '<f8'), ('U_cs', '<f8'), ('Steps', '<i4'), ('dt', '<f8')])

'P' and 'U' are loaded as strings because they can't be parsed as numbers.

Now define a converter that strips off the () part (again with bytestrings)

def rmvpar(astr):
   return float(astr.split(b'(')[0])

In [9]: data=np.genfromtxt(txt, dtype=None, names=True, 
      converters={2:rmvpar, 3:rmvpar})
In [10]: data
Out[10]: 
array([(0.1, 6.0, 0.6499, -0.478, 0.6525, -0.452, 30000, 0.002),
       (0.2, 6.0, 1.442, -0.942, 1.452, -0.89, 30000, 0.002),
       (0.3, 6.0, 2.465, -1.376, 2.489, -1.298, 30000, 0.002)], 
      dtype=[('p', '<f8'), ('T', '<f8'), ('P', '<f8'), ('U', '<f8'), ('P_cs', '<f8'), ('U_cs', '<f8'), ('Steps', '<i4'), ('dt', '<f8')])

Now those two fields are floats.

But converters can't return two numbers, so I can't keep the uncertainty this way.

Another approach is to pass the lines through a filter function

def splt(astr):
    strs=astr.split()
    def foo(astr):
        if b'(' in astr:
            astr = astr.strip(b')').split(b'(')
            return b','.join(astr)
        return astr
    return b','.join([foo(a) for a in strs])

In [26]: [splt(line) for line in txt]
Out[26]: 
[b'#,p*,T*,P*,U*,P*_cs,U*_cs,Steps,dt*',
 b'0.1,6.0,0.6499,6,-0.478,2,0.6525,-0.452,30000,0.002',
 b'0.2,6.0,1.442,1,-0.942,2,1.452,-0.890,30000,0.002',
 b'0.3,6.0,2.465,3,-1.376,1,2.489,-1.298,30000,0.002']

To use this I have to skip the header because the new lines have two added columns

In [28]: data=np.genfromtxt([splt(line) for line in txt], delimiter=',',dtype=None, skip_header=1)
In [29]: data
Out[29]: 
array([(0.1, 6.0, 0.6499, 6, -0.478, 2, 0.6525, -0.452, 30000, 0.002),
       (0.2, 6.0, 1.442, 1, -0.942, 2, 1.452, -0.89, 30000, 0.002),
       (0.3, 6.0, 2.465, 3, -1.376, 1, 2.489, -1.298, 30000, 0.002)], 
      dtype=[('f0', '<f8'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<i4'), 
             ('f4', '<f8'), ('f5', '<i4'), ('f6', '<f8'), ('f7', '<f8'), 
             ('f8', '<i4'), ('f9', '<f8')])

But I could modify the original dtype to make the 2 fields (sub)arrays:

In [30]: dt=np.dtype([('p', '<f8'), ('T', '<f8'), ('P', '<f8',(2,)), 
                ('U', '<f8',(2,)), ('P_cs', '<f8'), ('U_cs', '<f8'), 
                ('Steps', '<i4'), ('dt', '<f8')])

In [31]: data = np.genfromtxt((splt(line) for line in txt), delimiter=',',dtype=dt, skip_header=1)
In [32]: data
Out[32]: 
array([(0.1, 6.0, [0.6499, 6.0], [-0.478, 2.0], 0.6525, -0.452, 30000, 0.002),
       (0.2, 6.0, [1.442, 1.0], [-0.942, 2.0], 1.452, -0.89, 30000, 0.002),
       (0.3, 6.0, [2.465, 3.0], [-1.376, 1.0], 2.489, -1.298, 30000, 0.002)], 
      dtype=[('p', '<f8'), ('T', '<f8'), ('P', '<f8', (2,)), ('U', '<f8', (2,)), 
             ('P_cs', '<f8'), ('U_cs', '<f8'), ('Steps', '<i4'), ('dt', '<f8')])

Such a field would look like:

In [33]: data['P']
Out[33]: 
array([[ 0.6499,  6.    ],
       [ 1.442 ,  1.    ],
       [ 2.465 ,  3.    ]])

I could define other dtypes, as long as the number of fields matches.

With a file, rather than these text lines, I would use something like (not tested):

with open(filename,'rb') as f:   # 'rb': read as bytes ('wb' would truncate the file)
    data = np.genfromtxt((splt(line) for line in f),...

Here, and above, I'm using the generator expression (splt(line) for line in x), though the list comprehension would be fine. Any code that opens the file and yields/returns the modified lines will work.
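
If the parenthesised digits should end up as an absolute uncertainty (so that 0.6499(6) reads as 0.6499 +/- 0.0006), the filter can scale them by the number of decimal places in the value. The following is only a minimal sketch of that idea: `parse_unc` and `expand_line` are hypothetical helper names, it works on str lines rather than bytestrings, and it assumes the digits in parentheses always refer to the last decimal places of the value.

```python
import numpy as np

def parse_unc(token):
    """Return [value] or [value, uncertainty] as strings for one column."""
    if "(" not in token:
        return [token]
    value, digits = token.rstrip(")").split("(")
    # The number of decimal places in the value sets the scale of the digits,
    # e.g. '0.6499(6)' has 4 decimals, so the uncertainty is 6e-4.
    ndec = len(value.split(".")[1]) if "." in value else 0
    return [value, f"{digits}e-{ndec}"]

def expand_line(line):
    """Rewrite a whitespace-separated line as comma-separated fields."""
    fields = []
    for token in line.split():
        fields.extend(parse_unc(token))
    return ",".join(fields)

lines = [
    "# p*  T*     P*         U*          P*_cs    U*_cs  Steps  dt*",
    "0.1   6.0    0.6499(6) -0.478(2)    0.6525  -0.452  30000  0.002",
    "0.5   6.0    5.77(1)   -2.131(3)    5.84    -2.000  20000  0.002",
]

# Drop the header myself, because the expanded rows have two extra columns.
rows = [expand_line(line) for line in lines if not line.startswith("#")]
data = np.genfromtxt(rows, delimiter=",")  # plain 2-D float array

P, P_err = data[:, 2], data[:, 3]
U, U_err = data[:, 4], data[:, 5]
```

Writing the uncertainty as a string such as '6e-4' and letting the float parser handle it avoids the rounding noise that multiplying 6 by 10**-4 would introduce.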

hpaulj
  • This provides an effective way to strip the uncertainty. The approach to keep the uncertainty would need to be modified slightly, to designate the decimal value of the uncertainty (i.e., the first value, 0.6499(6), should be converted to 0.6499 +/- 0.0006). Though this is a little complicated as the decimal place is not consistent across the values, it could be deduced. This is great, thanks! – Steven C. Howell Oct 03 '16 at 14:37
  • I have not previously seen the use of the data labels to index a NumPy array. Much more like Pandas. Nice! – Steven C. Howell Oct 03 '16 at 16:41

No, there is nothing like that in NumPy. You will either need an external package (even Pandas won't do it), or you can load the columns as strings instead of numbers and process them yourself. For the string approach, the str methods in Pandas would be of some use, e.g. http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.split.html

All that said, it's not even clear how you would want to represent this once the data are loaded. Do you want to simply ignore the parentheses? Do you want to record the number of uncertain digits in an additional column? Both are possible, and probably easier in Pandas than NumPy.
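
Either representation can be sketched with a regular expression once the columns are available as strings. This is a plain-Python illustration only; the pattern and the `split_value` name are my own assumptions, and the same two-group pattern should also be usable with the Pandas str methods (for example `Series.str.extract`).

```python
import re

# Group 1: the number itself; group 2 (optional): the digits in parentheses.
PATTERN = re.compile(r"^([-+]?\d*\.?\d+)(?:\((\d+)\))?$")

def split_value(text):
    """Return (value, uncertain_digits); digits is None if there are none."""
    match = PATTERN.match(text)
    if match is None:
        raise ValueError(f"unrecognised number format: {text!r}")
    value, digits = match.groups()
    return float(value), (int(digits) if digits is not None else None)

column = ["0.6499(6)", "-0.478(2)", "12.43(2)", "0.6525"]

# Option 1: ignore the parentheses and keep only the values.
values = [split_value(s)[0] for s in column]

# Option 2: record the uncertain digits in an additional column.
digits = [split_value(s)[1] for s in column]
```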

John Zwinck
  • It would be nice to either ignore it or add it to another column. Another column seems to be the norm when reporting error. This would be a nice addition to either NumPy or Pandas. – Steven C. Howell Oct 03 '16 at 14:40
  • @stvn66: I can tell you with some certainty that this will never be added to NumPy, and it is very unlikely to be added to Pandas, unless you can show that it is a widely adopted standard or convention for a large number of users. – John Zwinck Oct 04 '16 at 01:34