Left join for D3 mapping returns null that prevents next step of data processing (ndjson-join or jq) command line

Question

I am doing D3 mapping on a state level. Here one problem that i met in data processing. For example, map data are like this, (dat1.ndjson)

{state: a, code: aa}
{state: b, code: bb}
{state: c, code: cc}

But usually the information we have are not complete, for example, there are no information in Antarctica usually but we still need to draw its contour when we do mapping. Information data is like, (dat2.ndjson)

{state: a, code: aa, count: 1}
{state: b, code: bb, count: 2}

So, when i try to do left join on these two data, it will returns (dat3.ndjson)

[{state: a, code: aa},{state: a, code: aa, count: 1}]
[{state: b, code: bb},{state: b, code: bb, count: 2}]
[{state: c, code: cc},null]

This is returned by

ndjson-join --left 'd.code' dat1.ndjson dat2.ndjson < merge.ndjson

The purpose is to connect this 'count' information to map data, so usually I will first assign all items a count = 0 in dat1.ndjson, like this, (dat11.ndjson)

{state: a, code: aa, count: 0}
{state: b, code: bb, count: 0}
{state: c, code: cc, count: 0}

and then use this left join method like the one I showed before to get something like this, (dat33.ndjson)

[{state: a, code: aa, count: 0},{state: a, code: aa, count: 1}]
[{state: b, code: bb, count: 0},{state: b, code: bb, count: 2}]
[{state: c, code: cc, count: 0},null]

But here comes the problem. If i use the following command to add all values together, it will return an error because of that null in the third line.

ndjson-map '{state: d[0].state, code: d[0].code, count: d[0].count + 
d[1].count}' < dat33.ndjson > merge.ndjson

Now I have to do this data processing in R which takes a lot of time as I need to do transformation between .ndjson and .csv. So I am looking for a better way to do this. I think there might be some ways by using 'ndjson-cli', 'jq' or 'awk' and 'sed' and etc.

Anyone have ideas? Thank you! :)

E.

peak · Accepted Answer · 2017-12-09T16:33:05.870

Here is a solution that has several parts:

Transforming your input into valid JSON.
Library functions for performing joins.
Running jq to produce the desired output, assuming your version of jq is sufficiently recent.
What to do if you only have access to jq 1.5

To illustrate how straightforward everything is once you have the fluff issues resolved, here is the "main" jq program:

join(.state) | .count //= 0

In effect, this says: perform the join using .state as the join-key, and then ensure .count is set.

The output from the above one-liner will be the NDJSON:

{"state":"a","code":"aa","count":1}
{"state":"b","code":"bb","count":2}
{"state":"c","code":"cc","count":0}

Part 1: dat1.json and dat2.json

I am going to assume that you can produce valid JSON from your inputs. For the sample data, I used sed:

for i in 1 2 ; do
  sed -e 's/state/"state"/' -e 's/code/"code"/' -e 's/count/"count"/' \
      -e 's/ \([a-z]*\)\([,}]\)/ "\1"\2/g' dat$i.ndjson > dat$i.json
done

The following at any rate assumes you have two files, dat1.json and dat2.json, containing streams of valid JSON.

Part 2: `join`

Here is a small library of filters for producing joins: the first works on streams, and the others on arrays. These definitions assume your jq has INDEX/2. See Part 4 if that is not the case.

def joins(s1; s2; filter1; filter2):
  # combine two dictionaries using `add`
    def merge: . as $in
    | reduce (add|keys_unsorted[]) as $k ({}; .[$k] = ([$in[] | .[$k]] | add));
  [INDEX(s1; filter1 | select(. != null)), INDEX(s2; filter2 | select(. != null))]
  | merge[] ;

def join(filter1; filter2):
  joins(.[0][]; .[1][]; filter1; filter2);

def join(f): join(f; f);

Part 3. Solution

First, let's keep things simple. If you place the above definitions for join and joins in a file, say d3.jq, followed by the one-line program given in the preamble, then the following invocation will do the trick, assuming your jq has INDEX:

jq -c -s -f d3.jq <(jq -s . dat1.json) <(jq -s . dat2.json)

This assumes you are using a shell that supports process substitution. If not, then you can first run the "." programs separately, e.g. if you have sponge:

 for i in 1 2 ; do jq -s . dat$i.json | sponge dat$i.json ; done

Using `include`

If your jq supports include, and if you have the above definitions of join in a private standard library such as ~/.jq/jq/jq.jq then your main jq program becomes the two-liner:

include "jq";
join(.state) | .count //= 0'

This means you could dispense with d3.jq and use the invocation:

jq -c -s 'include "jq"; join(.state)|.count //= 0' \
   <(jq -s . dat1.json) <(jq -s . dat2.json)

Part 4: `INDEX`

Here is a copy of INDEX as provided by recent versions of jq. You could add these definitions to d3.jq (before the "main" part of the program), or to your library file, and so on:

def INDEX(stream; idx_expr):
  reduce stream as $row ({};
    .[$row|idx_expr|
      if type != "string" then tojson
      else .
      end] |= $row);

def INDEX(idx_expr): INDEX(.[]; idx_expr);

Left join for D3 mapping returns null that prevents next step of data processing (ndjson-join or jq) command line

1 Answers1

Part 1: dat1.json and dat2.json

Part 2: join

Part 3. Solution

Using include

Part 4: INDEX

Part 2: `join`

Using `include`

Part 4: `INDEX`