String formatting issue (parentheses vs underline)

Question

I have a text file containing all my data:

data = 'B:/tempfiles/bla.dat'

From the text file, I list the column headers and their types with

col_headers = [('VW_3_Avg','<f8'),('Lvl_Max(1)','<f8')]

Then creating a dictionary variable holding the options:

kwargs = dict(delimiter=',',
              deletechars=' ',
              dtype=col_headers,
              skip_header=4,
              skip_footer=0,
              filling_values='NaN',
              missing_values={'"NAN"'},
              )

Now importing the data to the variable datafile

datafile = scipy.genfromtxt(data, **kwargs)

Then I assign the data with

VW1 = datafile['VW_3_Avg']
Lv1 = datafile['Lvl_Max(1)']

It works perfectly with the first one (containing underlines), but not with the second (parentheses). I get an error, not only with this entry, but with every field name that contains parentheses:

ValueError: field named Lvl_Max(1) not found

When I change those parentheses in the text file to underlines, it works perfectly. But I can't say why it won't let me use parentheses - and I can't change the text file formatting as this is produced externally. Of course I could change the parentheses to underlines with a script, but I think it shouldn't be a big issue to get this right. Where and why am I missing the correct formatting precedence in this case?

laike9m: my text file containing all the data. I will add this information in the initial post. Thought that was clear, sorry. — GeoEki, Sep 12 '15 at 11:31
[How to create a Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve) — Ashwini Chaudhary, Sep 12 '15 at 11:32
@PadraicCunningham: That does work too, it just won't work with parentheses. And I can't change the text file formatting as this is a standard output unfortunately. — GeoEki, Sep 12 '15 at 11:42
Wow...wait. Why does this work? Is this related with the options given in the dict variable? — GeoEki, Sep 12 '15 at 11:51
Please stop adding "Solved" to your *questions*. It invalidates Stack Overflow's status of a *question and answer site*. If a particular comment solved your issue, ask if the commenter can submit it as a proper answer. Otherwise, if an answer was helpful, mark it as "accepted" and don't edit your question to include it. (You've done this a number of times before.) — Jongware, Sep 12 '15 at 12:00

hpaulj · Answer 1 · 2015-09-12T22:54:12.193

When you have problems with genfromtxt the first thing you should do is print the shape and dtype.

Why do you have to use () in col_headers = [('VW_3_Avg','<f8'),('Lvl_Max(1)','<f8')]?

Is it because the file has those names in the header?

If you are giving your own dtype and using skip_header it doesn't matter what's on the file. It's the field names in the dtype that count, not the ones on the file.
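As a minimal sketch of that point (the two-column sample data and the field name Lvl_Max_1 are made up for illustration), the header line below is skipped entirely, so only the names in the dtype end up on the array:

```python
import numpy as np

# Made-up sample: one header line followed by two data rows.
lines = [
    "VW_3_Avg,Lvl_Max(1)",
    "1,2",
    "3,4",
]

# Field names chosen by us; they contain only letters, digits and underscores.
safe_dtype = [('VW_3_Avg', '<f8'), ('Lvl_Max_1', '<f8')]

# skip_header=1 discards the header line, so its spelling is irrelevant.
arr = np.genfromtxt(lines, delimiter=',', skip_header=1, dtype=safe_dtype)

print(arr.shape)         # always worth checking first
print(arr.dtype.names)   # the names from safe_dtype, not from the file
print(arr['Lvl_Max_1'])
```

Because the chosen names contain nothing that would be stripped, the lookup key is the same before and after loading.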

We could dig in to the dtype documentation and find just what characters are allowed. Field names that would work as Python variable names certainly will work. I'm not surprised the () would be disallowed or have problems, though I haven't tested that.

Actually 'Lvl_Max(1)' is acceptable as a dtype field name:

In [235]: col_headers = [('VW_3_Avg','<f8'),('Lvl_Max(1)','<f8')]
In [236]: A=np.zeros((3,),dtype=col_headers)
In [237]: A
Out[237]: 
array([(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)], 
      dtype=[('VW_3_Avg', '<f8'), ('Lvl_Max(1)', '<f8')])
In [238]: A['Lvl_Max(1)']
Out[238]: array([ 0.,  0.,  0.])

What you should have done, right from the start, is show us datafile.shape and datafile.dtype. 90% of these genfromtxt problems stem from a misunderstanding of what the function returns.

Let's try a simple fileread with this dtype:

In [239]: txt=b"""1 2
   .....: 3 4
   .....: 5 6
   .....: """
In [240]: np.genfromtxt(txt.splitlines(),dtype=col_headers)
Out[240]: 
array([(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)], 
      dtype=[('VW_3_Avg', '<f8'), ('Lvl_Max1', '<f8')])

Look at the dtype. genfromtxt has stripped off the parentheses from 'Lvl_Max(1)', leaving 'Lvl_Max1'. It looks like genfromtxt 'sanitizes' the field names, no doubt because names in a text file could contain all kinds of funny stuff.

From the genfromtxt docs:

Numpy arrays with a structured dtype can also be viewed as recarray, where a field can be accessed as if it were an attribute. For that reason, we may need to make sure that the field name doesn’t contain any space or invalid character, or that it does not correspond to the name of a standard attribute (like size or shape), which would confuse the interpreter.

genfromtxt takes a deletechars parameter that should let you control which characters are deleted from the field names. But its application is inconsistent.

In [282]: np.genfromtxt(txt.splitlines(),names=np.dtype(col_headers).names,deletechars=set(b' '),dtype=None)
Out[282]: 
array([(1, 2), (3, 4), (5, 6)], 
      dtype=[('VW_3_Avg', '<i4'), ('Lvl_Max(1)', '<i4')])

In [283]: np.genfromtxt(txt.splitlines(),names=np.dtype(col_headers).names,deletechars=set(b' '))
Out[283]: 
array([(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)], 
      dtype=[('VW_3_Avg', '<f8'), ('Lvl_Max1', '<f8')])

As the two calls show, dtype=None is required for deletechars to take effect here.

The default set is large:

defaultdeletechars = set("""~!@#$%^&*()-=+~\|]}[{';: /?.>,<""")

The problem is that deletechars is passed to the validator:

validate_names = NameValidator(...
                               deletechars=deletechars,...)

which is used to clean names from the header and the names parameter. But then the names (and dtype) are passed through

dtype = easy_dtype(dtype, defaultfmt=defaultfmt, names=names)

without the deletechars parameter. This issue was addressed about a year ago, https://github.com/numpy/numpy/pull/4649, so may be fixed in new(est) versions.
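Whichever version is installed, one way to get the original names back is to reapply the intended dtype after loading. The following is only a sketch under the assumption that the field types and order are unchanged and only the names were altered; the inline sample data is made up:

```python
import numpy as np

col_headers = [('VW_3_Avg', '<f8'), ('Lvl_Max(1)', '<f8')]

# Made-up sample rows standing in for the real file.
lines = ["1,2", "3,4", "5,6"]

# Depending on the NumPy version, the second field may come back renamed.
loaded = np.genfromtxt(lines, delimiter=',', dtype=col_headers)

# View the same memory with the intended dtype. This is valid because both
# dtypes have identical field types and layout; only the names may differ.
restored = loaded.view(np.dtype(col_headers))

print(restored.dtype.names)
print(restored['Lvl_Max(1)'])
```

The view shares memory with the loaded array, so nothing is copied.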

Padraic Cunningham · Accepted Answer · 2015-09-12T19:36:00.240

The behaviour is documented in the docstring of the NameValidator class in lib/_iotools.py, which parses the names passed in to genfromtxt:

class NameValidator(object):
    """
    Object to validate a list of strings to use as field names.
    The strings are stripped of any non alphanumeric character, and spaces
    are replaced by '_'. During instantiation, the user can define a list
    of names to exclude, as well as a list of invalid characters. Names in
    the exclusion list are appended a '_' character.
    Once an instance has been created, it can be called with a list of
    names, and a list of valid names will be created.  The `__call__`
    method accepts an optional keyword "default" that sets the default name
    in case of ambiguity. By default this is 'f', so that names will
    default to `f0`, `f1`, etc.

The relevant line in your case is "The strings are stripped of any non alphanumeric character".

You can see the behaviour by calling NameValidator.validate on a list of names containing other non-alphanumeric characters:

In [17]: from numpy.lib._iotools import NameValidator

In [18]: l = ["foo(1)","bar!!!","foo bar??"]

In [19]: NameValidator().validate(l)
Out[19]: ('foo1', 'bar', 'foo_bar')

And the same using genfromtxt:

In [24]: datafile = np.genfromtxt("foo.txt", dtype=[('foo!! bar??', '<f8'), ('foo bar bar$', '<f8')], delimiter=",",defaultfmt="%")

In [25]: datafile.dtype
Out[25]: dtype([('foo_bar', '<f8'), ('foo_bar_bar', '<f8')])
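If the file's headers cannot be changed, one option is to predict the cleaned name and use that as the lookup key. The helper below is a hypothetical pure-Python approximation of the cleaning described in the docstring, built on the default delete set quoted in this thread. It is not NumPy's implementation, and it ignores the exclusion list, duplicate names and empty names:

```python
# Delete set as quoted in this thread; the raw string keeps the backslash literal.
DEFAULT_DELETECHARS = set(r"""~!@#$%^&*()-=+~\|]}[{';: /?.>,<""")


def sanitize_name(name, deletechars=DEFAULT_DELETECHARS, replace_space="_"):
    """Hypothetical helper: replace spaces, then drop every deletable character."""
    name = name.strip().replace(" ", replace_space)
    return "".join(ch for ch in name if ch not in deletechars)


# Map each header as written in the file to the name expected on the array.
headers = ["VW_3_Avg", "Lvl_Max(1)"]
lookup = {header: sanitize_name(header) for header in headers}
print(lookup)
```

With such a mapping, datafile[lookup['Lvl_Max(1)']] would address the renamed field, provided the installed NumPy cleans names the same way.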
