I need to put a formatted string into a structured array (the string is a JSON-formatted 2D table in which every column has object dtype). This is what I do now:

import json
import numpy
json_string  = '{"SYM": ["this_string","this_string","this_string"],"DATE": ["NaN","NaN","NaN"],"YEST": ["NaN","NaN","NaN"],"other_DATE": ["NaN","NaN","NaN"],"SIZE": ["NaN","NaN","NaN"],"ACTIVITY": ["2019-09-27 14:18:28.000700 UTC","2019-09-27 14:18:28.000700 UTC","2019-09-27 14:18:28.000600 UTC"]}'
all_content  = json.loads(json_string)
dtype        = numpy.dtype(dict(names = list(all_content.keys()), formats = ['O'] * len(all_content.keys())))
this_bucket  = numpy.empty(shape = [len(all_content[next(iter(all_content.keys()))]), ], 
                                dtype = dtype)
for key in all_content.keys():
    this_bucket[key][:] = all_content[key]

but that seems extremely verbose. Is there a direct way?

user189035
  • You could consider using [`pandas`](https://pandas.pydata.org/pandas-docs/stable/index.html), which has built-in methods such as [`json_normalize`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.json.json_normalize.html) and [`read_json`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html) for loading JSON data. – GZ0 Oct 07 '19 at 23:14
  • In this case, I would like to not use pandas. I need to keep everything in straight numpy. – user189035 Oct 07 '19 at 23:20
  • Looks pretty good to me. Since the source is a dictionary you pretty well have to iterate on the keys. – hpaulj Oct 07 '19 at 23:47

1 Answer

There are essentially two ways to set the values of a structured array: assign them field by field (which you do), or supply a list of tuples, one tuple per record, which I'll demonstrate:

In [180]: all_content                                                           
Out[180]: 
{'SYM': ['this_string', 'this_string', 'this_string'],
 'DATE': ['NaN', 'NaN', 'NaN'],
 'YEST': ['NaN', 'NaN', 'NaN'],
 'other_DATE': ['NaN', 'NaN', 'NaN'],
 'SIZE': ['NaN', 'NaN', 'NaN'],
 'ACTIVITY': ['2019-09-27 14:18:28.000700 UTC',
  '2019-09-27 14:18:28.000700 UTC',
  '2019-09-27 14:18:28.000600 UTC']}

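Make an object dtype array (with `import numpy as np`), mainly for the 'column' indexing convenience. Pass `dtype=object` explicitly: the item tuples are ragged (a string paired with a list), and recent NumPy versions raise a `ValueError` rather than fall back to object dtype on their own.

In [181]: arr = np.array(list(all_content.items()), dtype=object)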
In [182]: arr                                                                   
Out[182]: 
array([['SYM', list(['this_string', 'this_string', 'this_string'])],
       ['DATE', list(['NaN', 'NaN', 'NaN'])],
       ['YEST', list(['NaN', 'NaN', 'NaN'])],
       ['other_DATE', list(['NaN', 'NaN', 'NaN'])],
       ['SIZE', list(['NaN', 'NaN', 'NaN'])],
       ['ACTIVITY',
        list(['2019-09-27 14:18:28.000700 UTC', '2019-09-27 14:18:28.000700 UTC', '2019-09-27 14:18:28.000600 UTC'])]],
      dtype=object)

Define the dtype as you do, or with:

In [183]: dt = np.dtype(list(zip(arr[:,0],['O']*arr.shape[0])))                 
In [184]: dt                                                                    
Out[184]: dtype([('SYM', 'O'), ('DATE', 'O'), ('YEST', 'O'), ('other_DATE', 'O'), ('SIZE', 'O'), ('ACTIVITY', 'O')])
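If the intermediate `arr` is not wanted, the same dtype can probably be built straight from the dictionary keys. This is only a minimal sketch: the two-column `all_content` below is a shortened stand-in for the dictionary in the question, not the real data.

```python
import numpy as np

# Shortened stand-in for the parsed JSON dictionary (assumption for illustration).
all_content = {"SYM": ["a", "b"], "DATE": ["NaN", "NaN"]}

# One (name, format) pair per key; 'O' means a Python-object field.
dt = np.dtype([(name, "O") for name in all_content])
print(dt)
```

Iterating over a dict yields its keys in insertion order, so the field order follows the order of the keys in the JSON string.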

List 'transpose' with `zip(*...)` turns the per-column lists into a list of tuples, one per record:

In [185]: list(zip(*arr[:,1]))                                                  
Out[185]: 
[('this_string', 'NaN', 'NaN', 'NaN', 'NaN', '2019-09-27 14:18:28.000700 UTC'),
 ('this_string', 'NaN', 'NaN', 'NaN', 'NaN', '2019-09-27 14:18:28.000700 UTC'),
 ('this_string', 'NaN', 'NaN', 'NaN', 'NaN', '2019-09-27 14:18:28.000600 UTC')]

This list is suitable as the data input:

In [186]: np.array(list(zip(*arr[:,1])),dtype=dt)                               
Out[186]: 
array([('this_string', 'NaN', 'NaN', 'NaN', 'NaN', '2019-09-27 14:18:28.000700 UTC'),
       ('this_string', 'NaN', 'NaN', 'NaN', 'NaN', '2019-09-27 14:18:28.000700 UTC'),
       ('this_string', 'NaN', 'NaN', 'NaN', 'NaN', '2019-09-27 14:18:28.000600 UTC')],
      dtype=[('SYM', 'O'), ('DATE', 'O'), ('YEST', 'O'), ('other_DATE', 'O'), ('SIZE', 'O'), ('ACTIVITY', 'O')])
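Putting the steps together, the list-of-tuples route might be wrapped up as below. This is a sketch under assumptions: the function name `json_columns_to_structured` is hypothetical, the sample string is a shortened version of the one in the question, and every column is assumed to have the same length.

```python
import json
import numpy as np

def json_columns_to_structured(json_string):
    """Hypothetical helper: column-oriented JSON -> structured array of object fields."""
    columns = json.loads(json_string)
    # One object ('O') field per JSON key, in the order the keys appear.
    dt = np.dtype([(name, "O") for name in columns])
    # Transpose the column lists into one tuple per record.
    rows = list(zip(*columns.values()))
    return np.array(rows, dtype=dt)

# Shortened sample input (assumption for illustration).
json_string = '{"SYM": ["x", "y", "z"], "SIZE": ["NaN", "NaN", "NaN"]}'
bucket = json_columns_to_structured(json_string)
print(bucket["SYM"])
```

Note that `zip` stops at the shortest column, so columns of unequal length would be truncated silently rather than raising an error.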

You can simplify getting the number of keys/fields with:

In [187]: len(all_content)                                                      
Out[187]: 6

Another way to get the number of 'records' is to unpack the first column and take its length (`len(first)`, here 3):

In [188]: first,*rest=all_content.values()                                      
In [189]: first                                                                 
Out[189]: ['this_string', 'this_string', 'this_string']

Your next(iter...) is probably as good.
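Either way only the first column is inspected. If the JSON might contain columns of different lengths, a check along these lines could be used instead. It is a sketch only, and the shortened `all_content` is a stand-in for the real dictionary.

```python
# Shortened stand-in for the parsed JSON dictionary (assumption for illustration).
all_content = {
    "SYM": ["this_string", "this_string", "this_string"],
    "DATE": ["NaN", "NaN", "NaN"],
}

# Collect the distinct column lengths; a well-formed table has exactly one.
lengths = {len(values) for values in all_content.values()}
if len(lengths) != 1:
    raise ValueError(f"columns have differing lengths: {sorted(lengths)}")
n_records = lengths.pop()
print(n_records)
```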

hpaulj
  • Thank you very much, it will take me a bit of time to go over your answer in detail but I already greatly appreciate the efforts you put into writing it. – user189035 Oct 08 '19 at 08:36