0

I've been struggling with this one for the whole day which i want to turn to a csv.

It represents the officers attached to company whose number is "OC418979" in the UK Company House API.

I've already truncated the json to contain just 2 objects inside "items".

What I would like to get is a csv like this

OC418979, country_of_residence, officer_role, appointed_on
OC418979, country_of_residence, officer_role, appointed_on
OC418979, country_of_residence, officer_role, appointed_on
OC418979, country_of_residence, officer_role, appointed_on
...

There are 2 extra complication: there are 2 types of "officers", some are people, some are companies, so not all key in people are present in the other and viceversa. I'd like these entries to be 'null'. Second complication is those nested objects like "name" which contains a comma in it! or address, which contains several sub-objects (which I guess I could flatten in pandas tho).

{
  "total_results": 13,
  "resigned_count": 9,
  "links": {
    "self": "/company/OC418979/officers"
  },
  "items_per_page": 35,
  "etag": "bc7955679916b089445c9dfb4bc597aa0daaf17d",
  "kind": "officer-list",
  "active_count": 4,
  "inactive_count": 0,
  "start_index": 0,
  "items": [
    {
      "officer_role": "llp-designated-member",
      "name": "BARRICK, David James",
      "date_of_birth": {
        "year": 1984,
        "month": 1
      },
      "appointed_on": "2017-09-15",
      "country_of_residence": "England",
      "address": {
        "country": "United Kingdom",
        "address_line_1": "Old Gloucester Street",
        "locality": "London",
        "premises": "27",
        "postal_code": "WC1N 3AX"
      },
      "links": {
        "officer": {
          "appointments": "/officers/d_PT9xVxze6rpzYwkN_6b7og9-k/appointments"
        }
      }
    },
    {
      "links": {
        "officer": {
          "appointments": "/officers/M2Ndc7ZjpyrjzCXdFZyFsykJn-U/appointments"
        }
      },
      "address": {
        "locality": "Tadcaster",
        "country": "United Kingdom",
        "address_line_1": "Westgate",
        "postal_code": "LS24 9AB",
        "premises": "5a"
      },
      "identification": {
        "legal_authority": "UK",
        "identification_type": "non-eea",
        "legal_form": "UK"
      },
      "name": "PREMIER DRIVER LIMITED",
      "officer_role": "corporate-llp-designated-member",
      "appointed_on": "2017-09-15"
    }
  ]
}

What I've been doing is creating new json objects extracting the fields I needed like this:

{officer_address:.items[]?.address, appointed_on:.items[]?.appointed_on, country_of_residence:.items[]?.country_of_residence, officer_role:.items[]?.officer_role, officer_dob:items.date_of_birth, officer_nationality:.items[]?.nationality, officer_occupation:.items[]?.occupation}

But the query runs for hours - and I am sure there is a quicker way.

Right now I am trying this new approach - creating a json whose root is the company number and as argument a list of its officers.

{(.links.self | split("/")[2]): .items[]}
Tytire Recubans
  • 967
  • 10
  • 27

2 Answers2

2

Using jq, it's easier to extract values from the top-level object that will be shared and generate the desired rows. You'll want to limit the amounts of times you go through the items to at most once.

$ jq -r '(.links.self | split("/")[2]) as $companyCode 
   | .items[]
   | [ $companyCode, .country_of_residence, .officer_role, .appointed_on ]
   | @csv
' input.json
peak
  • 105,803
  • 17
  • 152
  • 177
Jeff Mercado
  • 129,526
  • 32
  • 251
  • 272
  • That's it! I had gotten closer enough to the `[.key1, .key2, .key3] | @ csv` but didn't know how to not lose the company_code when piping ! Thank you so much! – Tytire Recubans Feb 28 '19 at 21:13
1

Ok, you want to scan the list of officers, extract some fields from there if they are present and write that in csv format.

First part is to extract the data from the json. Assuming you loaded it is a data Python object, you have:

print(data['items'][0]['officer_role'], data['items'][0]['appointed_on'],
      data['items'][0]['country_of_residence'])

gives:

llp-designated-member 2017-09-15 England

Time to put everything together with the csv module:

import csv
...
with open('output.csv', 'w', newline='') as fd:
    wr = csv.writer(fd)
    for officer in data['items']:
        _ = wr.writerow(('OC418979',
                 officer.get('country_of_residence',''),
                 officer.get('officer_role', ''),
                 officer.get('appointed_on', '')
                 ))

The get method on a dictionnary allows to use a default value (here the empty string) if the key is not present, and the csv module ensures that if a field contains a comma, it will be enclosed in quotation marks.

With your example input, it gives:

OC418979,England,llp-designated-member,2017-09-15
OC418979,,corporate-llp-designated-member,2017-09-15
Serge Ballesta
  • 143,923
  • 11
  • 122
  • 252