
I have tabulated data with some strange delimiting (i.e. groups of values separated by commas, which are in turn separated from other values by tabs):

A,345,567   56  67  test

Is there a clean and clever way of handling multiple delimiters in any of the following: the csv module, numpy.genfromtxt, or numpy.loadtxt?

I have found methods such as this, but I'm hoping there is a better solution out there. Ideally I'd like to use genfromtxt with a regex as the delimiter.

ryanjdillon
  • Will using both tab and `,` as delimiter work? Check your data if it should be delimited by tab first or comma first, or anything goes. – nhahtdh Dec 23 '12 at 15:25

2 Answers


I’m afraid the answer is no for all three tools you asked about. However, you can simply do replace('\t', ',') (or the reverse) on the raw text before parsing. For example:

from StringIO import StringIO  # py3k: from io import StringIO
import csv

# Normalize the delimiters: replace every tab with a comma, then parse as ordinary CSV.
with open('./file') as fh:
    io = StringIO(fh.read().replace('\t', ','))

reader = csv.reader(io)

for row in reader:
    print(row)
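
If you want a NumPy array rather than a list of rows, the same normalized buffer can be handed to numpy.genfromtxt instead of csv.reader (the answer's author makes the same point in the comments below). A minimal Python 3 sketch; the file name, the utf-8 encoding, and letting dtype=None infer a type per column are assumptions about your data:

import io
import numpy as np

# Normalize tabs to commas, then let genfromtxt infer a dtype per column.
# './file' and the utf-8 encoding are assumptions.
with open('./file') as fh:
    buf = io.StringIO(fh.read().replace('\t', ','))

data = np.genfromtxt(buf, delimiter=',', dtype=None, encoding='utf-8')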
Chris Warrick
  • Thanks. Is there another package that allows this functionality, preferably with a numpy array as output? I found another that has been suggested ([asciitable](http://cxc.harvard.edu/contrib/asciitable/)), but it also appears to not support such a thing. – ryanjdillon Dec 23 '12 at 15:39
  • This is assuming that `,` and `\t` are equivalent. Then again, only the OP knows whether they are actually equivalent or not. – nhahtdh Dec 23 '12 at 15:42
  • In this particular case I am working on, they are equivalent. Is this approach resource intensive, though? – ryanjdillon Dec 23 '12 at 15:47
    @shootingstars: you can use [pandas.read_csv](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html); `pandas` has quickly become one of the most useful Python tools for data munging. It accepts regex delimiters; I've used them myself. And you can easily convert pandas `DataFrames` to `ndarray`s. – DSM Dec 23 '12 at 15:47
  • @DSM I have yet to look into `pandas`, despite a couple of other suggestions, but this would be another great reason for me to try it out. – ryanjdillon Dec 23 '12 at 15:49
  • @shootingstars Right from the docs you linked: `data = numpy.loadtxt(io)` or `data = numpy.genfromtxt(io)` in place of the `reader = ` line. – Chris Warrick Dec 23 '12 at 17:18
  • @Kwpolska Sorry, I mean another that supports regex input such as DSM mentioned. Perhaps you're saying this is the same thing. – ryanjdillon Dec 23 '12 at 18:39
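
For reference, a minimal sketch of the pandas approach DSM suggests in the comments above; the file name, the [,\t]+ separator pattern, and header=None are assumptions about the data:

import pandas as pd

# A regex separator requires the python engine; split on a comma or a run of tabs.
df = pd.read_csv('./file', sep=r'[,\t]+', header=None, engine='python')

# The DataFrame converts easily to a plain NumPy array
# (object dtype if the columns have mixed types).
arr = df.to_numpy()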

This is quite an old question, but numpy now has numpy.fromregex, which should handle this.

Using the sample data from my question, an almost-working example would be the following (note that the two writes lack newlines, so the records run together, and the S3 dtype truncates "test" to "tes"):

>>> from pathlib import Path
>>> from tempfile import TemporaryDirectory
>>> import numpy as np
>>> with TemporaryDirectory() as tmp_dir:
...     fp_csv = Path(tmp_dir, "temp.csv")
...     with fp_csv.open("w") as fh:
...         _ = fh.write("A,345,567   56  67  test")
...         _ = fh.write("B,345,567   56  67  test")
...     a = np.fromregex(str(fp_csv), r"([a-zA-Z\d.]+)|(\r\n|\n)", dtype="S3")
...
>>> a
array([[b'A', b''],
       [b'345', b''],
       [b'567', b''],
       [b'56', b''],
       [b'67', b''],
       [b'tes', b''],
       [b'345', b''],
       [b'567', b''],
       [b'56', b''],
       [b'67', b''],
       [b'tes', b'']], dtype='|S3')
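
If the layout is as regular as in the question (a letter, two comma-separated numbers, then three whitespace-separated fields), a regex that captures each field per line combined with a structured dtype avoids the truncation above. This is only a sketch under that assumption; the field names and the 8-character string width are illustrative:

import numpy as np

# One capture group per field; every data line must match this pattern.
# Field names ("label", "c1", ..., "tag") and "S8" widths are assumptions.
pattern = r"([A-Za-z]+),(\d+),(\d+)\s+(\d+)\s+(\d+)\s+(\w+)"
dtype = [("label", "S8"), ("c1", np.int64), ("c2", np.int64),
         ("c3", np.int64), ("c4", np.int64), ("tag", "S8")]

# "data.txt" is assumed to hold rows like the one in the question.
data = np.fromregex("data.txt", pattern, dtype)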
ryanjdillon