
I want to read data from a (very large, whitespace-separated, two-column) text file into a Python dictionary. I tried to do this with a for-loop but that was too slow. MUCH faster is reading it with numpy loadtxt into a structured array and then converting it to a dictionary:

data = np.loadtxt('filename.txt', dtype=[('field1', 'a20'), ('field2', int)], ndmin=1)
result = dict(data)

But this is surely not the best way? Any advice?
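For reference, here is a minimal self-contained version of the approach above (the file contents are made up for illustration):

```python
import numpy as np

# Hypothetical sample file: two whitespace-separated columns (key, value).
with open('filename.txt', 'w') as f:
    f.write('alpha-beta 1\ngamma-delta 2\n')

# Read into a structured array; 'a20' stores the first column as (up to) 20-byte strings.
data = np.loadtxt('filename.txt', dtype=[('field1', 'a20'), ('field2', int)], ndmin=1)

# Each record behaves like a (key, value) pair, so dict() accepts the array directly.
result = dict(data)
print(result)  # note: the keys come out as bytes, e.g. b'alpha-beta'
```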

The main reason I need something else, is that the following does not work:

data[0]['field1'].split(sep='-')

It leads to the error message:

TypeError: Type str doesn't support the buffer API

If the split() method exists, why can't I use it? Should I use a different dtype? Or is there a different (fast) way to read the text file? Is there anything else I am missing?
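The mismatch can be reproduced without numpy at all: in Python 3, the 'a20' dtype yields bytes, and bytes.split() rejects a str separator (the exact wording of the error message varies between Python versions; 3.3 phrases it as above):

```python
# bytes.split() requires a bytes separator in Python 3.
raw = b'alpha-beta'

try:
    raw.split('-')          # str separator: raises TypeError
except TypeError as e:
    print('TypeError:', e)

print(raw.split(b'-'))      # bytes separator works: [b'alpha', b'beta']
```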

Versions: Python 3.3.2, NumPy 1.7.1

Edit: changed data['field1'].split(sep='-') to data[0]['field1'].split(sep='-')

Louic
  • One of these days I am going to have to try and understand unicode... By the way, the right thing to do is to write the answer as a proper answer and accept it, not to include it within your question. – Jaime Jul 30 '13 at 19:45

2 Answers


The standard-library split returns a variable number of substrings, depending on how many times the separator occurs in the string, and is therefore not very suitable for array operations. My char numpy arrays (I'm running 1.7) do not have a split method, by the way.

You do have np.core.defchararray.partition, which is similar but poses no problems for vectorization, as well as all the other string operations:

>>> a = np.array(['a - b', 'c - d', 'e - f'], dtype=np.string_)
>>> a
array(['a - b', 'c - d', 'e - f'], 
      dtype='|S5')
>>> np.core.defchararray.partition(a, '-')
array([['a ', '-', ' b'],
       ['c ', '-', ' d'],
       ['e ', '-', ' f']], 
      dtype='|S2')
Jaime
  • Thank you for your answer Jaime. What I meant was `data**[0]**['field1'].split(sep='-')`, not `data['field1'].split(sep='-') ` although the latter would be brilliant if it existed and was fast. I edited my above post accordingly. – Louic Jul 30 '13 at 18:11
  • With my made-up example I can run `a[0].split('-')`, which should be equivalent to `data['field1'][0].split(sep='-')`, so reversing the order of your indices. How many `-` are you expecting in your strings? – Jaime Jul 30 '13 at 18:18
  • With your example I get: `>>> np.core.defchararray.partition(a, '-') Traceback (most recent call last): File "", line 1, in File "/usr/lib/python3.3/site-packages/numpy/core/defchararray.py", line 1090, in partition _vec_string(a, object_, 'partition', (sep,))) TypeError: expected bytes, bytearray or buffer compatible object ` – Louic Jul 30 '13 at 18:46
  • Then go with `partition`, and split all your strings with a single call. – Jaime Jul 30 '13 at 18:47
  • Thanks, but it does not work, see above. I seem to end up with a different data type then you (that b in front of the strings). `>>> a = np.array(['a - b', 'c - d', 'e - f'], dtype=np.string_) >>> a array([b'a - b', b'c - d', b'e - f'], dtype='|S5') ` – Louic Jul 30 '13 at 18:56
  • It appears to be an issue with this being a string of bytes `b'a-b'`. The solution is `b'a-b'.decode('utf-8').split('-')`. Thanks for testing Jamie comparing to your output helped me solve this! Edited original post to include solution. – Louic Jul 30 '13 at 19:09
  • actually, just `b'a-b'.split(b'-')` is OK. – Louic Jul 30 '13 at 19:24

Because type(data[0]['field1']) gives <class 'numpy.bytes_'>, the split() method fails when it is given a "normal" (unicode) string as its separator. This is not a bug: in Python 3, bytes methods require bytes arguments.

The way I solved it: data[0]['field1'].split(sep=b'-') (the key is the b prefix on '-', which makes the separator a bytes literal)

And of course Jaime's suggestion to use np.core.defchararray.partition(a, '-') was very helpful, but here too b'-' is needed to make it work.
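A minimal sketch of both fixes side by side (np.char.partition is the same function as np.core.defchararray.partition, just under its shorter alias; the sample data is made up):

```python
import numpy as np

# A bytes array, as produced by an 'a20'/'S' dtype in Python 3.
a = np.array([b'a-b', b'c-d', b'e-f'])

# Element-wise: split one bytes entry with a bytes separator.
print(a[0].split(b'-'))            # [b'a', b'b']

# Vectorized: partition the whole array at the first b'-' in one call.
print(np.char.partition(a, b'-'))  # 3x3 array of [before, sep, after]
```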

In fact, a related question was answered here: Type str doesn't support the buffer API, although at first sight I did not realise it was the same issue.

Louic