numpy.genfromtxt: are uneven spaces between columns causing dtype errors?

Question

The data I'm working with can be found at this gist,

And looks like:

07-11-2018 18:34:35 -2.001   5571.036 -1.987
07-11-2018 18:34:50 -1.999   5570.916 -1.988


When calling

import numpy as np

TB_CAL_array = np.genfromtxt('calbath_data/TB118192.TXT',
                             skip_header=10,
                             dtype=[("date", "<U10"), ("time", "<U8"),
                                    ("bathtemp", "<f8"), ("SBEfreq", "<f8"),
                                    ("SBEtemp", "<f8")])

The resulting array is:

array([('07-11-2018', '18:34:35', -2.001e+00, 5571.036, -1.987),
   ('07-11-2018', '18:34:50', -1.999e+00, 5570.916, -1.988),

The data is output as a structured ndarray whose records print as tuples. It is a non-homogeneous array because it contains both strings and floats (see the related question "numpy.genfromtxt produces array of what looks like tuples, not a 2D array: why?").

NOTE: The third column of data output has been treated as something other than the dtype specified.

The output should be -2.001 but instead it is -2.001e+00

NOTE: The fifth column has the same input format and dtype designation, yet genfromtxt did not transform its values.

The only difference I can find between "bathtemp" and "SBEtemp" is that there are two extra blank spaces after the "bathtemp" column...

However, based on the numpy.genfromtxt IO documentation, this shouldn't matter, because consecutive whitespace should automatically be treated as a single delimiter:

delimiter : str, int, or sequence, optional
    The string used to separate values. By default, any consecutive whitespaces act as delimiter. An integer or sequence of integers can also be provided as width(s) of each field.

Is the extra whitespace after the "bathtemp" column causing the error? If so how do I work around it?

I don't see any errors. You got an array that matches the dtype. — hpaulj, Dec 10 '19 at 03:46
https://stackoverflow.com/questions/59275231/pandas-suppress-scientific-notation - on pandas and numpy scientific notation controls — hpaulj, Dec 11 '19 at 01:51

score 0 · Answer 1 · answered Dec 09 '19 at 21:32

I was able to get the output I was looking for by switching to pd.read_csv and using its optional skipinitialspace=True argument, which the pandas documentation describes as:

skipinitialspace : bool, default False
    Skip spaces after delimiter.

Input

colnames = ['date', 'time', 'bathtemp', 'SBEfreq', 'SBEtemp']
TB_CAL   = pd.read_csv("calbath_data/TB118192.CAL", header=None, skiprows=10,
                       delimiter=" ", skipinitialspace=True, names=colnames)

Output

         date      time  bathtemp   SBEfreq  SBEtemp
0  07-11-2018  18:34:35    -2.001  5571.036   -1.987
1  07-11-2018  18:34:50    -1.999  5570.916   -1.988
2  07-11-2018  18:35:06    -1.997  5571.058   -1.987
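
As a possible alternative within pandas, a regular-expression separator should also cope with the uneven spacing. The sketch below is an assumption-laden illustration rather than what was run on the real file: it reads two sample rows from an in-memory string (the `sample` variable is hypothetical) instead of the .CAL file, and it uses sep=r"\s+" so that any run of whitespace counts as one delimiter.

```python
import io

import pandas as pd

# Two sample rows from the question; note the three spaces after the third column.
# (Hypothetical in-memory stand-in for the real data file.)
sample = (
    "07-11-2018 18:34:35 -2.001   5571.036 -1.987\n"
    "07-11-2018 18:34:50 -1.999   5570.916 -1.988\n"
)
colnames = ["date", "time", "bathtemp", "SBEfreq", "SBEtemp"]

# sep=r"\s+" treats any run of whitespace as a single delimiter,
# so the extra spaces do not create empty columns.
df = pd.read_csv(io.StringIO(sample), header=None, sep=r"\s+", names=colnames)

print(df)
print(df.dtypes)
```

With a real file, the StringIO object would be replaced by the file path and skiprows=10 added back, as in the call above.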

score 0 · Accepted Answer · answered Dec 10 '19 at 07:00

With your sample:

In [136]: txt="""07-11-2018 18:34:35 -2.001   5571.036 -1.987 
     ...: 07-11-2018 18:34:50 -1.999   5570.916 -1.988"""                       
In [137]: np.genfromtxt(txt.splitlines(), dtype=None, encoding=None)            
Out[137]: 
array([('07-11-2018', '18:34:35', -2.001, 5571.036, -1.987),
       ('07-11-2018', '18:34:50', -1.999, 5570.916, -1.988)],
      dtype=[('f0', '<U10'), ('f1', '<U8'), ('f2', '<f8'), ('f3', '<f8'), ('f4', '<f8')])

and with your dtype:

In [139]: np.genfromtxt(txt.splitlines(),
     ...:               dtype=[("date", "<U10"), ("time", "<U8"), ("bathtemp", "<f8"),
     ...:                      ("SBEfreq", "<f8"), ("SBEtemp", "<f8")],
     ...:               encoding=None)
Out[139]: 
array([('07-11-2018', '18:34:35', -2.001, 5571.036, -1.987),
       ('07-11-2018', '18:34:50', -1.999, 5570.916, -1.988)],
      dtype=[('date', '<U10'), ('time', '<U8'), ('bathtemp', '<f8'), ('SBEfreq', '<f8'), ('SBEtemp', '<f8')])
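
On the side question of why the result prints as tuples: with a compound dtype, genfromtxt returns a 1-D structured array with one record per row, and each field is accessed by name. The following is a minimal sketch, assuming the same two sample rows. The names `lines`, `dt` and `numeric` are only illustrative, and np.column_stack is one of several ways to collect the float fields into an ordinary 2-D array.

```python
import numpy as np

# Same two sample rows as above, passed as a list of strings.
lines = [
    "07-11-2018 18:34:35 -2.001   5571.036 -1.987",
    "07-11-2018 18:34:50 -1.999   5570.916 -1.988",
]
dt = [("date", "<U10"), ("time", "<U8"), ("bathtemp", "<f8"),
      ("SBEfreq", "<f8"), ("SBEtemp", "<f8")]

data = np.genfromtxt(lines, dtype=dt, encoding=None)

# One record per row, so the array is 1-D; columns are fields, not an axis.
print(data.shape)
print(data["bathtemp"])

# Stack the float fields side by side to get a plain 2-D float array.
numeric = np.column_stack([data[name] for name in ("bathtemp", "SBEfreq", "SBEtemp")])
print(numeric.shape)
```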

Values like -2.001e+00 are the same as -2.001; only the display differs, not the stored float. numpy switches an array's display to scientific notation when the range of values is wide enough, or when some values are too small to show well otherwise.

For example, if I change one of the values to something much smaller:

In [140]: data = _                                                              
In [141]: data['bathtemp']                                                      
Out[141]: array([-2.001, -1.999])
In [142]: data['bathtemp'][1] *= 0.001                                          
In [143]: data['bathtemp']                                                      
Out[143]: array([-2.001e+00, -1.999e-03])

The -2.001 is unchanged (except display style).
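
If the scientific notation is unwanted, the display can be changed without touching the data. Here is a minimal sketch, assuming NumPy 1.15 or later for the np.printoptions context manager (np.set_printoptions(suppress=True) would apply the same setting globally). The array `a` is a hypothetical stand-in for the modified bathtemp column.

```python
import numpy as np

# The two spellings denote the same float value.
assert float("-2.001e+00") == -2.001

# Hypothetical stand-in for the bathtemp column after the change above.
a = np.array([-2.001, -1.999e-03])

# Default behaviour: the wide range of magnitudes triggers scientific notation.
with np.printoptions(suppress=False):
    default_text = np.array2string(a)

# suppress=True asks numpy to print small values in fixed-point notation.
with np.printoptions(suppress=True):
    fixed_text = np.array2string(a)

print(default_text)
print(fixed_text)
```

Either way, the stored values are identical; only the text representation changes.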

My guess is that some of the bathtemp values (that you don't show) are much closer to zero.

Thank you, that explanation makes a lot of sense! I do have values at 0.000 in the third column. Strange that numpy automatically uses scientific notation and pandas does not... — Sarah, Dec 10 '19 at 23:27
