
I have tabulated data with some strange delimiting (i.e. groups of values separated by commas, which are in turn separated from other values by tabs):

A,345,567   56  67  test

Is there a clean and clever way of handling multiple delimiters in any of the following: the csv module, numpy.genfromtxt, or numpy.loadtxt?

I have found methods such as this, but I'm hoping there is a better solution out there. Ideally I'd like to use genfromtxt with a regex as the delimiter.

ryanjdillon
  • Will using both tab and `,` as delimiter work? Check your data if it should be delimited by tab first or comma first, or anything goes. – nhahtdh Dec 23 '12 at 15:25

2 Answers


I’m afraid the answer is no for all three tools you asked about. However, you can simply do replace('\t', ',') (or the reverse) on the raw text before parsing. For example:

from StringIO import StringIO  # py3k: from io import StringIO
import csv

# Normalize the delimiters: replace every tab with a comma, then parse as ordinary CSV.
with open('./file') as fh:
    io = StringIO(fh.read().replace('\t', ','))

reader = csv.reader(io)

for row in reader:
    print(row)
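
If you want a NumPy array rather than a list of rows, the same normalized buffer can be handed to numpy.genfromtxt instead of csv.reader (the answer's author makes the same point in the comments below). A minimal Python 3 sketch; the file name, the utf-8 encoding, and letting dtype=None infer a type per column are assumptions about your data:

import io
import numpy as np

# Normalize tabs to commas, then let genfromtxt infer a dtype per column.
# './file' and the utf-8 encoding are assumptions.
with open('./file') as fh:
    buf = io.StringIO(fh.read().replace('\t', ','))

data = np.genfromtxt(buf, delimiter=',', dtype=None, encoding='utf-8')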
Chris Warrick
  • Thanks. Is there another package that allows this functionality, preferably with a numpy array as output? I found another that has been suggested ([asciitable](http://cxc.harvard.edu/contrib/asciitable/)), but it also appears to not support such a thing. – ryanjdillon Dec 23 '12 at 15:39
  • This is assuming that `,` and `\t` are equivalent. Then again, only the OP knows whether they are actually equivalent or not. – nhahtdh Dec 23 '12 at 15:42
  • In this particular case I am working on, they are equivalent. Is this approach resource intensive, though? – ryanjdillon Dec 23 '12 at 15:47
    @shootingstars: you can use [pandas.read_csv](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html); `pandas` has quickly become one of the most useful Python tools for data munging. It accepts regex delimiters; I've used them myself. And you can easily convert pandas `DataFrames` to `ndarray`s. – DSM Dec 23 '12 at 15:47
  • @DSM I have yet to look into `pandas`, despite a couple of other suggestions, but this would be another great reason for me to try it out. – ryanjdillon Dec 23 '12 at 15:49
  • @shootingstars Right from the docs you linked: `data = numpy.loadtxt(io)` or `data = numpy.genfromtxt(io)` in place of the `reader = ` line. – Chris Warrick Dec 23 '12 at 17:18
  • @Kwpolska Sorry, I mean another that supports regex input such as DSM mentioned. Perhaps you're saying this is the same thing. – ryanjdillon Dec 23 '12 at 18:39
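
For reference, a minimal sketch of the pandas approach DSM suggests in the comments above; the file name, the [,\t]+ separator pattern, and header=None are assumptions about the data:

import pandas as pd

# A regex separator requires the python engine; split on a comma or a run of tabs.
df = pd.read_csv('./file', sep=r'[,\t]+', header=None, engine='python')

# The DataFrame converts easily to a plain NumPy array
# (object dtype if the columns have mixed types).
arr = df.to_numpy()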

This is quite an old question, but numpy now has numpy.fromregex, which should handle this.

Using the sample data from my question, an almost-working example would be the following (note that the two writes lack newlines, so the records run together, and the S3 dtype truncates "test" to "tes"):

>>> from pathlib import Path
>>> from tempfile import TemporaryDirectory
>>> import numpy as np
>>> with TemporaryDirectory() as tmp_dir:
...     fp_csv = Path(tmp_dir, "temp.csv")
...     with fp_csv.open("w") as fh:
...         _ = fh.write("A,345,567   56  67  test")
...         _ = fh.write("B,345,567   56  67  test")
...     a = np.fromregex(str(fp_csv), r"([a-zA-Z\d.]+)|(\r\n|\n)", dtype="S3")
...
>>> a
array([[b'A', b''],
       [b'345', b''],
       [b'567', b''],
       [b'56', b''],
       [b'67', b''],
       [b'tes', b''],
       [b'345', b''],
       [b'567', b''],
       [b'56', b''],
       [b'67', b''],
       [b'tes', b'']], dtype='|S3')
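
If the layout is as regular as in the question (a letter, two comma-separated numbers, then three whitespace-separated fields), a regex that captures each field per line combined with a structured dtype avoids the truncation above. This is only a sketch under that assumption; the field names and the 8-character string width are illustrative:

import numpy as np

# One capture group per field; every data line must match this pattern.
# Field names ("label", "c1", ..., "tag") and "S8" widths are assumptions.
pattern = r"([A-Za-z]+),(\d+),(\d+)\s+(\d+)\s+(\d+)\s+(\w+)"
dtype = [("label", "S8"), ("c1", np.int64), ("c2", np.int64),
         ("c3", np.int64), ("c4", np.int64), ("tag", "S8")]

# "data.txt" is assumed to hold rows like the one in the question.
data = np.fromregex("data.txt", pattern, dtype)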
ryanjdillon