
I am working on importing CSV files with numpy.genfromtxt.

The data to be imported has a header of column names, and some of those column names contain characters that genfromtxt considers invalid. Specifically, some of the names contain "#" and " ". The input data cannot be changed as it is generated by other sources that I do not control.

Using names=True and comments=None, I am unable to bring in all of the column names that I need.

I've tried overriding numpy.lib.NameValidator.deletechars=None, but this does not affect the NameValidator class instance that is actually in use.

I understand that deletechars exists due to the recarray potential to access a field as if it were an attribute. However, I simply must be able to read in column names that include invalid characters, even if the characters are stripped off when read in.

Is there a way to force the NameValidator to not check for invalid characters, or to modify the characters it checks for? I am unable to modify numpy/lib/_iotools.py as I am not root and it would be bad to modify a shared installation.

ecatmur
  • Couldn't you extract the header yourself and then skip it for pure data extraction by `genfromtxt`? – Jakob S. Aug 07 '12 at 06:32
  • @JakobS. - I'm currently experimenting with that, reading in the header row and then using regex to find which names contain the invalid characters and replace them. However, this doesn't feel like a homogenous solution to me and I'm hoping that numpy has a provision to bypass the NameValidator or at least redefine the deletechars. – freakinschweet Aug 07 '12 at 06:48
  • Hi, while I have given an answer based on my best guess of problem, it would probably help if you gave a simplified but full example of the csv file. – xubuntix Aug 07 '12 at 08:30
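
The workaround discussed in these comments (read the header row yourself, clean the names, then have `genfromtxt` skip the header) might look roughly like the sketch below. The sample data, the helper names `clean_name` and `load_with_clean_names`, and the regex are all illustrative assumptions, not part of the original question. The sketch also assumes a plain comma-separated header without quoted fields.

```python
import io
import re

import numpy as np

# Hypothetical sample data standing in for the real CSV file.
SAMPLE = "time,flow #1,temp (C)\n0.0,1.5,20.1\n1.0,1.7,20.4\n"


def clean_name(name):
    """Collapse every run of non-word characters into one underscore."""
    return re.sub(r"\W+", "_", name.strip()).strip("_")


def load_with_clean_names(text, delimiter=","):
    """Return (structured array, {original header name: cleaned field name})."""
    header, _, _ = text.partition("\n")
    original = [h.strip() for h in header.split(delimiter)]
    cleaned = [clean_name(h) for h in original]
    data = np.genfromtxt(
        io.StringIO(text),
        delimiter=delimiter,
        skip_header=1,   # the header row was already parsed above
        names=cleaned,   # only letters, digits and underscores remain
        dtype=float,
    )
    return data, dict(zip(original, cleaned))


data, name_map = load_with_clean_names(SAMPLE)
flow = data[name_map["flow #1"]]  # look a column up by its original name
```

Keeping the `name_map` dictionary means the original header names stay usable as lookup keys even though the array fields carry the cleaned names. If two headers clean to the same string, the sketch does not disambiguate them itself.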

3 Answers


You do not explicitly state that numpy.genfromtxt is a hard requirement, so let me suggest that you try asciitable.

This module has a way to replace certain entries before parsing: http://cxc.harvard.edu/contrib/asciitable/#replace-bad-or-missing-values

And you can also define your own readers based on the existing ones: http://cxc.harvard.edu/contrib/asciitable/#advanced-table-reading

The asciitable readers return numpy arrays, so you should be able to replace the functions you currently use more or less directly with asciitable.

xubuntix

NameValidator uses its default set for deletechars if constructed with deletechars=None, but if you pass in a non-None set it uses that instead. And np.genfromtxt takes a deletechars parameter, which it passes to NameValidator.

So, you should be able to write

np.genfromtxt(..., deletechars=set())

for an empty set, or some subset of the default set("""~!@#$%^&*()-=+~\|]}[{';: /?.>,<"""):

deletechars = np.lib._iotools.NameValidator.defaultdeletechars - set("# ")
np.genfromtxt(..., deletechars=deletechars)
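
To see what a reduced delete set does without involving a file, the validator can be exercised directly. This is only a sketch: `numpy.lib._iotools` is a private module whose layout may differ between NumPy versions, and the field names are made up. It also does not show whether a given `genfromtxt` version actually honors the argument; the asker's comment further down reports that theirs did not.

```python
from numpy.lib._iotools import NameValidator  # private module; may change

# Keep "#" by removing it from the default set of deleted characters.
keep_hash = set(NameValidator.defaultdeletechars) - set("#")
validator = NameValidator(deletechars=keep_hash)

# Hypothetical header names for illustration.
names = validator.validate(["flow #1", "temp/C", "time"])
print(names)  # ('flow_#1', 'tempC', 'time')
```

The space in "flow #1" becomes an underscore because the validator's `replace_space` argument defaults to '_' and is applied before any characters are deleted.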
ecatmur

IMHO, genfromtxt is often used in cases where some simpler solutions would do.

So, unless you have some troublesome datasets (missing entries, multiple unknown column types), you're better off coding a quick and dirty parser (i.e., skip some rows, parse the header, read the rest, and reorganize at the end).
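
A quick and dirty parser along those lines could be sketched as follows, assuming a purely numeric, comma-separated file with a single header row. The sample data and the function name `read_columns` are hypothetical.

```python
import csv
import io

import numpy as np

# Hypothetical sample data standing in for the real CSV file.
SAMPLE = "time,flow #1,temp (C)\n0.0,1.5,20.1\n1.0,1.7,20.4\n"


def read_columns(fileobj):
    """Return {original header name: 1-D float array}, names left untouched."""
    reader = csv.reader(fileobj)
    header = next(reader)  # parse the header row
    # Read the rest, skipping blank lines; float() raises on non-numeric cells.
    rows = [[float(cell) for cell in row] for row in reader if row]
    table = np.array(rows)
    # Reorganize: one array per column, keyed by the unmodified header name.
    return {name: table[:, i] for i, name in enumerate(header)}


columns = read_columns(io.StringIO(SAMPLE))
flow = columns["flow #1"]
```

Because the names are only dictionary keys here, characters such as "#" and " " survive unchanged. The cost is losing the handling of missing values and mixed column types that genfromtxt provides.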

Now, if you really need genfromtxt, @ecatmur rightly pointed out that the deletechars argument of genfromtxt is passed to _iotools.NameValidator to construct the set of characters to delete. Using deletechars=None tells NameValidator to use a default set. The first thing to try is to pass an empty set or '' instead of deletechars=None.

Note that, no matter what, double quotes (") and leading or trailing spaces will be deleted, and names that end up identical will be differentiated with a numeric suffix:

>>> import numpy as np
>>> fields = ["blah", "'blah'", "\"blah\"", "#blah", "blah "]
>>> np.lib._iotools.NameValidator(deletechars='').validate(fields)
('blah', "'blah'", 'blah_1', '#blah', 'blah_2')

The first, third and last entries would otherwise all produce columns named blah, so the validator renames the duplicates.

If this doesn't suit you, I'm afraid you're hitting a block: there's no current way to tell genfromtxt to accept a customized NameValidator. That could be a good idea, though, so you may want to raise the point on numpy's mailing list.

Pierre GM
  • @xubuntix - I will check my installation to see if asciitable is available. Unfortunately in my work environment, the process of adding a new module is very difficult. @ecatmur @Pierre GM - I have also tried `deletechars=set()` in `genfromtxt` but the NameValidator class simply performs a `set.extend("""~!@#$%^&*()-=+~\|]}[{';: /?.>,<""")` with whatever is passed in. As you can imagine, setting `deletechars=''` raises a ValueError due to no string.extend() method. Also, attempting to modify `np.lib._iotools` won't work because `genfromtxt` creates a new instance. Thank you all for the help! – freakinschweet Aug 07 '12 at 13:51
  • @user1580983 As a quick solution: make your own function with the code of `np.genfromtxt` and get rid of the `NameValidator`. That won't be portable but will keep you going. I would recommend posting about this issue on the numpy mailing list and eventually opening a ticket. – Pierre GM Aug 07 '12 at 14:08