
I have a complex JSON file (~8GB) containing publicly available data for businesses. We have decided to split the file up into multiple CSV files (or tabs in a .xlsx) so clients can easily consume the data. These files will be linked by the NZBN column/key.

I'm using R and jsonlite to read a small sample in (before scaling up to the full file). I'm guessing I need some way to specify which keys/columns go in each file (i.e., the first file will have the headers australianBusinessNumber, australianCompanyNumber, australianServiceAddress; the second file will have the headers annualReturnFilingMonth, annualReturnLastFiled, countryOfOrigin, ...).

Here's a sample of two businesses/entities (I've scrambled some of the data, so ignore the actual values): test file

I've read almost every similar question on Stack Overflow and none has given me any luck. I've tried variations of purrr, the *apply commands, custom flattening functions and jqr (an R wrapper for 'jq', which looks promising, but I can't seem to get it running).

Here's an attempt at creating my separate files, but I'm unsure how to include the linking identifier (NZBN), and I keep running into further nested lists (I'm unsure how many levels of nesting there are):

bulk <- jsonlite::fromJSON("bd_test.json")

coreEntity <- data.frame(bulk$companies)
coreEntity <- coreEntity[,sapply(coreEntity, is.list)==FALSE] 

company <- bulk$companies$entity$company
company <- purrr::reduce(company, dplyr::bind_rows)

shareholding <- company$shareholding
shareholding <- purrr::reduce(shareholding, dplyr::bind_rows)

shareAllocation <- shareholding$shareAllocation
shareAllocation <- purrr::reduce(shareAllocation, dplyr::bind_rows)

I'm not sure if it's easier to split the files up during the flattening/wrangling process, or to completely flatten the whole file so I just have one line per business/entity (and then gather columns as needed). My only concern is that I need to scale this up to ~1.3 million nodes (the 8GB JSON file).

Ideally, the CSV files would be split every time there is a new collection, and the values in the collection would become the columns for the new CSV/tab.

Any help or tips would be much appreciated.

------- UPDATE ------

Updated, as my question was a little vague. I think all I need is some code to produce one of the CSVs/tabs, and I can replicate it for the other collections.

Say for example, I wanted to create a csv of the following elements:

  • entityName (unique linking identifier)
  • nzbn (unique linking identifier)
  • emailAddress__uniqueIdentifier
  • emailAddress__emailAddress
  • emailAddress__emailPurpose
  • emailAddress__emailPurposeDescription
  • emailAddress__startDate

How would I go about that?
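
To be concrete about the output I'm after, below is a rough jsonlite/tidyr attempt for this one collection. It appears to work on the small sample (assuming the entity node simplifies to a data frame with list-columns), but I don't know how to generalise it to the deeper collections like shareholding, or whether it will cope with the full file:

# Rough attempt for the emailAddress collection only. Assumes jsonlite has
# simplified $companies$entity into a data frame where emailAddress is a
# list-column of data frames.
library(dplyr)
library(tidyr)

entity <- jsonlite::fromJSON("bd_test.json")$companies$entity

emailAddress <- entity %>%
  select(entityName, nzbn, emailAddress) %>%
  unnest(cols = emailAddress)   # one row per email address, keyed by entityName/nzbn

readr::write_csv(emailAddress, "emailAddress.csv")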

2 Answers


"I'm unsure how many levels of nesting there are"

This will provide an answer to that quite efficiently:

jq '
  def max(s): reduce s as $s (null; 
    if . == null then $s elif $s > . then $s else . end);
   max(paths|length)' input.json

(With the test file, the answer is 14.)
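
If you can only run jq through the jqr package in R, the same program can, in principle, be evaluated there; a minimal sketch, assuming jqr will accept the def and that the file fits in memory as a single string:

library(jqr)

# Sketch: the whole file is read into one string and handed to jqr; for an
# 8GB file you would want to test memory use first.
json <- paste(readLines("bd_test.json", warn = FALSE), collapse = "\n")

jq(json, '
  def max(s): reduce s as $s (null;
    if . == null then $s elif $s > . then $s else . end);
  max(paths|length)')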

To get an overall view (schema) of the data, you could run:

 jq 'include "schema"; schema' input.json

where schema.jq is available at this gist. This will produce a structural schema.

"Say for example, I wanted to create a csv of the following elements:"

Here's a jq solution, apart from the headers:

.companies.entity[]
| [.entityName, .nzbn]
  + (.emailAddress[] | [.uniqueIdentifier, .emailAddress, .emailPurpose, .emailPurposeDescription, .startDate])
| @csv

shareholding

The shareholding data is complex, so in the following I've used the to_table function defined elsewhere on this page.

The sample data does not include a "company name" field, so in the following I've added a 0-based "company index" field:

  .companies.entity[]
  | [.entityName, .nzbn] as $ix
  | .company
  | range(0;length) as $cix
  | .[$cix]
  | $ix + [$cix] + (.shareholding[] | to_table(false))

jqr

The above solutions use the standalone jq executable but, all going well, it should be trivial to use the same filters with jqr. To use jq's include, though, it might be simplest to specify the search path explicitly, for example:

include "schema" {search: "~/.jq"};
peak
  • I don't have jq unfortunately - I think I need admin rights and this is a work PC. jqr (the R implementation of jq) doesn't give me what you get using the code above. Is the language roughly the same? – Oliver Mills Jul 04 '19 at 19:07
  • Thank you for the in-depth answer, including the to_table function below. I have not been able to get it working, but your jq solution above for emailAddress has worked with the other collections. cheers – Oliver Mills Jul 10 '19 at 16:36

If the input JSON is sufficiently regular, you might find the following flattening function helpful, especially as it can emit a header in the form of an array of strings based on the "paths" to the leaf elements of the input, which can be arbitrarily nested:

# to_table produces a flat array.
# If hdr == true, then ONLY emit a header line (in prettified form, i.e. as an array of strings);
# if hdr is an array, it should be the prettified form and is used to check consistency.
def to_table(hdr):
  def prettify: map( (map(tostring)|join(":") ));
  def composite: type == "object" or type == "array";

  def check:
     select(hdr|type == "array") 
     | if prettify == hdr then empty
       else error("expected head is \(hdr) but imputed header is \(.)")
       end ;

  . as $in
  | [paths(composite|not)]           # the paths in array-of-array form
  | if hdr==true then prettify
    else check, map(. as $p | $in | getpath($p))
    end;


For example, to produce the desired table (without headers) for .emailAddress, one could write:

.companies.entity[]
| [.entityName, .nzbn] as $ix
| $ix + (.emailAddress[] | to_table(false))
| @tsv

(Adding the headers and checking for consistency are left as an exercise for now, but are dealt with below.)

Generating multiple files

More interestingly, you could select the level you want, and produce multiple tables automagically. One way to partition the output into separate files efficiently would be to use awk. For example, you could pipe the output obtained using this jq filter:

["entityName", "nzbn"] as $common
| .companies.entity[]
| [.entityName, .nzbn] as $ix
| (to_entries[] | select(.value | type == "array") | .key) as $key
| ($ix + [$key] | join("-")) as $filename
| (.[$key][0]|to_table(true)) as $header

# First emit the line giving all the headers:
| $filename, ($common + $header | @tsv),
# Then emit the rows of the table:
  (.[$key][]
   | ($filename,  ($ix + to_table(false) | @tsv)))

to

awk -F\\t 'fn {print >> fn; fn=0;next} {fn=$1".tsv"}'

This will produce headers in each file; if you want consistency checking, change to_table(false) to to_table($header).
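
If awk isn't available either, the same splitting step could be done in R once the jq output has been captured to a file; a sketch, assuming the output (produced with jq -r) has been saved as rows.txt, with file-name and content lines alternating exactly as emitted above:

# Sketch: lines alternate between a file name and the TSV row (or header
# line) that belongs in that file, so read them in pairs and write each
# group out. Unlike `awk ... >>`, this overwrites existing files.
lines <- readLines("rows.txt", warn = FALSE)
fname <- lines[seq(1, length(lines), by = 2)]
row   <- lines[seq(2, length(lines), by = 2)]

for (f in unique(fname)) {
  writeLines(row[fname == f], paste0(f, ".tsv"))
}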

peak
  • Hi, I'm having some trouble with the shareholding part (a bit more complex than the emailAddress example): .companies.entity[] | [.entityName, .nzbn] + (.company.shareholding[] | .numberOfShares) doesn't yield anything from the file – Oliver Mills Jul 11 '19 at 10:51
  • I've added a "shareholding" section to the other response, as your jq suggests you are not trying to use `to_table`. – peak Jul 11 '19 at 11:26
  • Thanks a lot - wow, that's powerful. There's one last piece of complexity that I can't seem to figure out: within each shareholding, there can be multiple shareholders (up to 50). Your to_table function seems to just extend these out on the same line (as expected); however, this results in really wide CSVs for companies that have lots of shareholders/shareholdings. Is there any way to make it so multiple shareholders/shareholdings are repeated on lines below (same entity name and NZBN)? Here's an updated test file: https://www.dropbox.com/s/vlm7bj2ogme0bvn/bd_test%20%281%29.json?dl=0 – Oliver Mills Jul 11 '19 at 15:43
  • The section "Generating multiple files" shows how to avoid the stretching problem, but you have to decide at which point to begin the creation of new files. That is, the "Generating multiple files" section is illustrative -- you would have to understand it in order to tailor it to your needs. The point of `to_table` is to deal with whatever stretching may be needed to avoid losing data, assuming sufficient regularity of the JSON. – peak Jul 11 '19 at 16:00