
I am working on a python utility to get data from the Tycho 2 star catalogue. One of the functions I am working on queries the catalogue and returns all the information for a given star id (or set of star ids).

I'm currently doing this by looping through the lines of the catalogue file and parsing each line that matches the query into a numpy structured array. (If there is a better way to do this, let me know, even though that is not what this question is about. I'm doing it this way because the catalogue is too big to load into memory all at once.)

Anyway, once I have identified a record that I want to keep I've run into a problem... I can't figure out how to parse it into a structured array.

For instance, say the record I want to keep is:

record = '0002 00038 1| |  3.64121230|  1.08701186|   14.1|  -23.0| 69| 82| 1.8| 1.9|1968.56|1957.30| 3|1.0|3.0|0.9|3.0|12.444|0.213|11.907|0.189|999| |         |  3.64117944|  1.08706861|1.83|1.73| 81.0|104.7| | 0.0'

Now, I am trying to parse this into a numpy structured array with dtype:

dform = [('starid', [('TYC1', int), ('TYC2', int), ('TYC3', int)]),
         ('pflag', str),
         ('starBearing', [('rightAscension', float), ('declination', float)]),
         ('properMotion', [('rightAscension', float), ('declination', float)]),
         ('uncertainty', [('rightAscension', int), ('declination', int), ('pmRA', float), ('pmDc', float)]),
         ('meanEpoch', [('rightAscension', float), ('declination', float)]),
         ('numPos', int),
         ('fitGoodness', [('rightAscension', float), ('declination', float), ('pmRA', float), ('pmDc', float)]),
         ('magnitude', [('BT', [('mag', float), ('err', float)]), ('VT', [('mag', float), ('err', float)])]),
         ('starProximity', int),
         ('tycho1flag', str),
         ('hipparcosNumber', str),
         ('observedPos', [('rightAscension', float), ('declination', float)]),
         ('observedEpoch', [('rightAscension', float), ('declination', float)]),
         ('observedError', [('rightAscension', float), ('declination', float)]),
         ('solutionType', str),
         ('correlation', float)]

This seems like it should be a fairly simple thing to do but everything I try breaks...

I've tried:

np.genfromtxt(BytesIO(record.encode()),dtype=dform,delimiter=(' ','|'))
np.genfromtxt(BytesIO(record.encode()),dtype=dform,delimiter=(' ','|'),missing_values=' ',filling_values=None)

both of which give me

{TypeError}cannot perform accumulate with flexible type

which makes no sense since it shouldn't be doing any accumulation.

I've also tried

np.array(re.split('\|| ',record),dtype=dform)

which complains

{TypeError}a bytes-like object is required, not 'str'

and another variant

np.array([x.encode() for x in re.split('\|| ',record)],dtype=dform)

which doesn't throw an error but also certainly doesn't return the correct results:

[ ((842018864, 0, 0), '', (0.0, 0.0), (0.0, 0.0), (0, 0, 0.0, 0.0), (0.0, 0.0), 0, (0.0, 0.0, 0.0, 0.0), ((0.0, 0.0), (0.0, 0.0)), 0, '', '', (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), '', 0.0)...

So how can I do this? I think the genfromtxt option is the way to go (especially since there may be missing data occasionally) but I don't understand why it isn't working. Is this something that I'm just going to have to write a parser for on my own?

Andrew

1 Answer


Sorry, this answer is long and rambling, but that's what it took to figure out what is going on. The complexity of the dtype in particular was hidden by its length.


I get the TypeError: cannot perform accumulate with flexible type error when I try your list for delimiter. The details show the error occurs in LineSplitter: a sequence is taken as a list of integer field widths and summed cumulatively to find the column boundaries, which fails on strings. So the delimiter should be a single string (or the default 'whitespace').

From the genfromtxt docs:

delimiter : str, int, or sequence, optional The string used to separate values. By default, any consecutive whitespaces act as delimiter. An integer or sequence of integers can also be provided as width(s) of each field.

The genfromtxt splitter is a little more powerful than the string .split that loadtxt uses, but not as general as the re splitter.
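As a rough illustration of the two forms the docs describe, here is a sketch on made-up strings rather than catalogue rows. It assumes a NumPy recent enough to accept a list of str lines. A single string splits on that string, while a sequence of integers is read as fixed column widths.

```python
import numpy as np

# One delimiter string: split on every '|'.
piped = np.genfromtxt(["1.5|2.5|3.5"], delimiter="|")

# A sequence of integers: fixed column widths of 4, 5 and 1 characters.
# "0002000381" is a hypothetical fixed-width line, not a real catalogue row.
widths = np.genfromtxt(["0002000381"], delimiter=(4, 5, 1), dtype=int)

print(piped)   # three floats
print(widths)  # three ints
```

Neither form lets you mix two different separator characters, which is why the record has to be brought down to one delimiter first.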

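As for the {TypeError}a bytes-like object is required, not 'str': when np.array is given a structured dtype, it expects each record as a tuple. Anything else is treated as a raw buffer, and a Py3 unicode string is not one. Encoding to bytes, as in your last variant, silences the error but only copies the raw bytes into the record, which is why TYC1 came out as 842018864 (the ASCII bytes of '0002' read as a little-endian 32-bit integer). For genfromtxt you've already dealt with the bytes side by using BytesIO(record.encode()).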

I like to test genfromtxt cases with:

record = b'....'
np.genfromtxt([record], ....)

Or better yet

records = b"""one line
two line
three line
"""
np.genfromtxt(records.splitlines(), ....)

If I let genfromtxt deduce field types, and just use the one delimiter, I get 32 fields:

In [19]: A=np.genfromtxt([record],dtype=None,delimiter='|')
In [20]: len(A.dtype)
Out[20]: 32
In [21]: A
Out[21]: 
array((b'0002 00038 1', False, 3.6412123, 1.08701186, 14.1, -23.0, 69, 82, 1.8, 1.9, 1968.56, 1957.3, 3, 1.0, 3.0, 0.9, 3.0, 12.444, 0.213, 11.907, 0.189, 999, False, False, 3.64117944, 1.08706861, 1.83, 1.73, 81.0, 104.7, False, 0.0), 
      dtype=[('f0', 'S12'), ('f1', '?'), ('f2', '<f8'), ('f3', '<f8'), ('f4', '<f8'), ... ('f26', '<f8'), ('f27', '<f8'), ('f28', '<f8'), ('f29', '<f8'), ('f30', '?'), ('f31', '<f8')])

Once the byte and delimiter issues are worked out,

np.array([x for x in re.split(b'\|| ',record)],dtype=dform)

does run. I now see that your dform is complex, with nested compound fields.

But to define a structured array, you need to give it a list of records (tuples), e.g.

np.array([(record1...), (record2...), ....], dtype=[(field1),(field2 ),...])

Here you are trying to create one record. I could wrap your list in a tuple, but then I get a mismatch between that length and dform length, 66 v 17. If you count all the subfields dform might take 66 values, but we can't just do that with one tuple.
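To make the tuple requirement concrete, here is a minimal sketch with a cut-down dtype. The name `small` and its three fields are only for illustration and are not the full dform. Each compound field needs its own inner tuple, so the nesting of the data mirrors the nesting of the dtype.

```python
import numpy as np

# A cut-down, hypothetical version of dform with one level of nesting.
small = [('starid', [('TYC1', int), ('TYC2', int), ('TYC3', int)]),
         ('pos', [('ra', float), ('dec', float)]),
         ('numPos', int)]

# One record = one tuple; each compound field = one inner tuple.
rec = np.array([((2, 38, 1), (3.64121230, 1.08701186), 3)], dtype=small)

print(rec['starid']['TYC2'])  # access a nested field by chaining names
print(rec.shape)
```

For the full dform that means building 17 top-level items per record, some of them nested tuples, rather than one flat list of strings.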

I've never tried to create an array from such a complex dtype, so I'm fishing around for ways to make it work.

In [41]: np.zeros((1,),dform)
Out[41]: 
array([ ((0, 0, 0), '', (0.0, 0.0), (0.0, 0.0), (0, 0, 0.0, 0.0), (0.0, 0.0), 0, (0.0, 0.0, 0.0, 0.0), ((0.0, 0.0), (0.0, 0.0)), 0, '', '', (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), '', 0.0)], 
      dtype=[('starid', [('TYC1', '<i4'), ('TYC2', '<i4'), ('TYC3', '<i4')]), ('pflag', '<U'), ('starBearing', [('rightAscension', '<f8'), ('declination', '<f8')]), ('properMotion', [('rightAscension', '<f8'), ('declination', '<f8')]), ('uncertainty', [('rightAscension', '<i4'), ('declination', '<i4'), ('pmRA', '<f8'), ('pmDc', '<f8')]), ('meanEpoch', ....('solutionType', '<U'), ('correlation', '<f8')])

In [64]: for name in A.dtype.names:
    print(A[name].dtype)
   ....:     
[('TYC1', '<i4'), ('TYC2', '<i4'), ('TYC3', '<i4')]
<U1
[('rightAscension', '<f8'), ('declination', '<f8')]
[('rightAscension', '<f8'), ('declination', '<f8')]
[('rightAscension', '<i4'), ('declination', '<i4'), ('pmRA', '<f8'), ('pmDc', '<f8')]
[('rightAscension', '<f8'), ('declination', '<f8')]
int32
[('rightAscension', '<f8'), ('declination', '<f8'), ('pmRA', '<f8'), ('pmDc', '<f8')]
[('BT', [('mag', '<f8'), ('err', '<f8')]), ('VT', [('mag', '<f8'), ('err', '<f8')])]
int32
<U1
<U1
[('rightAscension', '<f8'), ('declination', '<f8')]
[('rightAscension', '<f8'), ('declination', '<f8')]
[('rightAscension', '<f8'), ('declination', '<f8')]
<U1
float64

I count 34 primitive dtype fields. Most are 'scalar', some 2-4 terms, one has a further level of nesting.
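The count can also be checked mechanically. The helper below, `count_leaf_fields`, is a hypothetical name of my own and not a NumPy function. It walks the dtype recursively and counts every field that has no sub-fields.

```python
import numpy as np

dform = [('starid', [('TYC1', int), ('TYC2', int), ('TYC3', int)]),
         ('pflag', str),
         ('starBearing', [('rightAscension', float), ('declination', float)]),
         ('properMotion', [('rightAscension', float), ('declination', float)]),
         ('uncertainty', [('rightAscension', int), ('declination', int), ('pmRA', float), ('pmDc', float)]),
         ('meanEpoch', [('rightAscension', float), ('declination', float)]),
         ('numPos', int),
         ('fitGoodness', [('rightAscension', float), ('declination', float), ('pmRA', float), ('pmDc', float)]),
         ('magnitude', [('BT', [('mag', float), ('err', float)]), ('VT', [('mag', float), ('err', float)])]),
         ('starProximity', int),
         ('tycho1flag', str),
         ('hipparcosNumber', str),
         ('observedPos', [('rightAscension', float), ('declination', float)]),
         ('observedEpoch', [('rightAscension', float), ('declination', float)]),
         ('observedError', [('rightAscension', float), ('declination', float)]),
         ('solutionType', str),
         ('correlation', float)]


def count_leaf_fields(dt):
    """Count the primitive (non-compound) fields of a possibly nested dtype."""
    dt = np.dtype(dt)
    if dt.names is None:  # a plain int/float/str field
        return 1
    return sum(count_leaf_fields(dt[name]) for name in dt.names)


n_top = len(np.dtype(dform).names)   # top-level fields
n_leaf = count_leaf_fields(dform)    # primitive fields after flattening
print(n_top, n_leaf)
```

The second number is the number of columns a text line has to supply.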

If I replace the first 2 splitting spaces with |, record.split(b'|') gives me 34 strings.
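That replacement can be sketched as follows. It assumes the star id always comes first and always contains exactly two internal spaces, as in the sample record. I use a str here, but the same calls work on bytes.

```python
record = ('0002 00038 1| |  3.64121230|  1.08701186|   14.1|  -23.0| 69| 82| 1.8| 1.9|'
          '1968.56|1957.30| 3|1.0|3.0|0.9|3.0|12.444|0.213|11.907|0.189|999| |         |'
          '  3.64117944|  1.08706861|1.83|1.73| 81.0|104.7| | 0.0')

# Only the first two spaces (those inside the star id) become delimiters;
# the padding spaces further along the line are left alone.
normalized = record.replace(' ', '|', 2)
fields = normalized.split('|')

print(len(fields))
print(fields[:3])
```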

Let's try that in genfromtxt:

In [79]: np.genfromtxt([record],delimiter='|',dtype=dform)
Out[79]: 
array(((2, 38, 1), '', (3.6412123, 1.08701186), (14.1, -23.0), 
   (69, 82, 1.8, 1.9), (1968.56, 1957.3), 3, (1.0, 3.0, 0.9, 3.0),
   ((12.444, 0.213), (11.907, 0.189)), 999, '', '', 
   (3.64117944, 1.08706861), (1.83, 1.73), (81.0, 104.7), '', 0.0), 
      dtype=[('starid', [('TYC1', '<i4'), ('TYC2', '<i4'), ('TYC3', '<i4')]), 
 ('pflag', '<U'), 
 ('starBearing', [('rightAscension', '<f8'), ('declination', '<f8')]),  
 ('properMotion', [('rightAscension', '<f8'), ('declination', '<f8')]),
 ('uncertainty', [('rightAscension', '<i4'), ('declination', '<i4'), ('pmRA', '<f8'), ('pmDc', '<f8')]), 
 ('meanEpoch', [('rightAscension', '<f8'), ('declination', '<f8')]),   
 ('numPos', '<i4'), 
 ('fitGoodness', [('rightAscension', '<f8'), ('declination', '<f8'), ('pmRA', '<f8'), ('pmDc', '<f8')]), 
 ('magnitude', [('BT', [('mag', '<f8'), ('err', '<f8')]), ('VT', [('mag', '<f8'), ('err', '<f8')])]), 
 ('starProximity', '<i4'), ('tycho1flag', '<U'), ('hipparcosNumber', '<U'), 
 ('observedPos', [('rightAscension', '<f8'), ('declination', '<f8')]),
 ('observedEpoch', [('rightAscension', '<f8'), ('declination', '<f8')]), 
 ('observedError', [('rightAscension', '<f8'), ('declination', '<f8')]), ('solutionType', '<U'), ('correlation', '<f8')])

That almost looks reasonable. genfromtxt can actually split the values up among the compound fields. That's more than what I'd want to try with np.array().

So if you get the delimiters and byte/unicode worked out, genfromtxt can handle this mess.
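Putting the pieces together for the line-by-line search described in the question, a rough sketch might look like this. Several things here are my own assumptions: the generator `matching_lines`, the in-memory `SAMPLE` text standing in for the catalogue file (its first row is made up and both rows are truncated), and the small flat `subset` dtype, which I use with usecols to keep the example short instead of the full dform.

```python
import io
import numpy as np

# Stand-in for the catalogue file. The first row is invented for illustration;
# the second is the start of the record from the question.
SAMPLE = (
    "0001 00001 1| |  1.00000000|  2.00000000|    0.0|    0.0\n"
    "0002 00038 1| |  3.64121230|  1.08701186|   14.1|  -23.0\n"
)


def matching_lines(handle, wanted_ids):
    """Yield normalized lines whose 12-character star id is in wanted_ids."""
    for line in handle:
        if line[:12] in wanted_ids:
            # Turn the two spaces inside the star id into '|' so that a
            # single delimiter is enough for genfromtxt.
            yield line.replace(" ", "|", 2)


subset = [("TYC1", int), ("TYC2", int), ("TYC3", int), ("ra", float), ("dec", float)]

# Only matching lines are kept, so the whole file never sits in memory.
lines = list(matching_lines(io.StringIO(SAMPLE), {"0002 00038 1"}))

rows = np.atleast_1d(  # a single match would otherwise come back 0-d
    np.genfromtxt(lines, delimiter="|", usecols=(0, 1, 2, 4, 5), dtype=subset)
)
print(rows)
```

With the real file you would pass the open file handle instead of the StringIO object, and you would want to handle the case where no line matches before calling genfromtxt.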

hpaulj
  • Thanks for the time in writing this out. It was really helpful. The problem was (as you stated) that genfromtxt can only handle a single delimiter. Things seem to be mostly working now except for another issue which I will ask in another question. – Andrew Dec 22 '15 at 14:30