
I have a set of pricing data for a lot of stocks (around 1.1 million lines).

I'm having trouble parsing all of this data in memory, so I'd like to split it by stock symbol into individual files and import the data only as it is needed.

From:

stockprices.json

To:

AAPL.json
ACN.json
...

etc.

stockprices.json has this structure currently:

[{
    "date": "2016-03-22 00:00:00",
    "symbol": "ACN",
    "open": "121.029999",
    "close": "121.470001",
    "low": "120.720001",
    "high": "122.910004",
    "volume": "711400.0"
},
{
    "date": "2016-03-23 00:00:00",
    "symbol": "AAPL",
    "open": "121.470001",
    "close": "119.379997",
    "low": "119.099998",
    "high": "121.470001",
    "volume": "444200.0"
},
{
    "date": "2016-03-24 00:00:00",
    "symbol": "AAPL",
    "open": "118.889999",
    "close": "119.410004",
    "low": "117.639999",
    "high": "119.440002",
    "volume": "534100.0"
},
...{}....]

I believe that jq is the right tool for the job, but I'm having trouble understanding it.

How would I take the data above and use jq to split it by the symbol field?

For example, I'd like to end up with:

AAPL.json:

[{
    "date": "2016-03-23 00:00:00",
    "symbol": "AAPL",
    "open": "121.470001",
    "close": "119.379997",
    "low": "119.099998",
    "high": "121.470001",
    "volume": "444200.0"
},
{
    "date": "2016-03-24 00:00:00",
    "symbol": "AAPL",
    "open": "118.889999",
    "close": "119.410004",
    "low": "117.639999",
    "high": "119.440002",
    "volume": "534100.0"
}]

and ACN.json:

[{
    "date": "2016-03-22 00:00:00",
    "symbol": "ACN",
    "open": "121.029999",
    "close": "121.470001",
    "low": "120.720001",
    "high": "122.910004",
    "volume": "711400.0"
},
{
    "date": "2016-03-22 00:00:00",
    "symbol": "ACN",
    "open": "121.029999",
    "close": "121.470001",
    "low": "120.720001",
    "high": "122.910004",
    "volume": "711400.0"
}
]
peak
Matt

3 Answers


You could use a little shell loop:

#!/bin/bash
# List each symbol once (sort -u) so the inner jq runs once per
# unique symbol, not once per line of the input.
jq -r '.[].symbol' stockprices.json | sort -u | while read -r symbol ; do
    jq --arg s "${symbol}" \
        'map(if .symbol == $s then . else empty end)' \
        stockprices.json > "${symbol}.json"
done
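If re-reading stockprices.json once per symbol proves too slow at 1.1 million rows, the same split can be done in a single pass. Here is a stdlib-only Python sketch; the inline sample records are hypothetical stand-ins for `json.load(open("stockprices.json"))`:

```python
import json
from collections import defaultdict

# Hypothetical sample standing in for json.load(open("stockprices.json")):
records = [
    {"date": "2016-03-22 00:00:00", "symbol": "ACN", "close": "121.470001"},
    {"date": "2016-03-23 00:00:00", "symbol": "AAPL", "close": "119.379997"},
    {"date": "2016-03-24 00:00:00", "symbol": "AAPL", "close": "119.410004"},
]

# Bucket every record by its symbol in one pass over the data.
by_symbol = defaultdict(list)
for rec in records:
    by_symbol[rec["symbol"]].append(rec)

# Write one JSON array per symbol: AAPL.json, ACN.json, ...
for symbol, recs in by_symbol.items():
    with open(symbol + ".json", "w") as out:
        json.dump(recs, out, indent=4)
```

This trades the repeated scans for holding all records in memory at once, which is the same assumption peak's answer below makes.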
hek2mgl
  • Thank you, that was quick! I'm running your script right now and I'll report back with the result! – Matt Jan 30 '19 at 23:14
  • Ok. Make sure that you don't place any space after the \ chars. Those are line continuations for better readability. They must occur directly in front of the newline. – hek2mgl Jan 30 '19 at 23:15
  • The output is coming through now. It's worked perfectly. Thank you for your answer. – Matt Jan 30 '19 at 23:19

Here's a one-pass solution that assumes your RAM is big enough. The solution eschews using group_by, as that entails a sort operation, which is unnecessary and potentially costly in terms of time and memory.

To create the output files, awk is used here for efficiency, but is inessential to the approach.

split.jq

def aggregate_by(s; f; g):
  reduce s as $x (null; .[$x|f] += [$x|g]);

aggregate_by(.[]; .symbol; .)
| keys_unsorted[] as $k
| $k, .[$k]

Invocation using awk

jq -f split.jq stockprices.json | awk '
  substr($0,1,1) == "\"" {
    if (fn) {close(fn)};
    gsub(/^"|"$/,"",$0); fn=$0 ".json"; next;
  }
  {print >> fn}'
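For readers who find the awk state machine above cryptic, the same consumer can be sketched in Python. It follows the protocol split.jq emits: a quoted symbol line such as "AAPL", then the lines of that symbol's pretty-printed array (the function name and the sample lines in the test are mine, not part of the answer):

```python
def split_stream(lines):
    """Consume the alternating output of split.jq: a quoted symbol line
    such as "AAPL", then the lines of that symbol's pretty-printed array."""
    out = None
    for line in lines:
        if line.startswith('"'):        # a new symbol: open SYMBOL.json
            if out:
                out.close()
            out = open(line.strip().strip('"') + ".json", "w")
        elif out:
            out.write(line)             # body line of the current array
    if out:
        out.close()
```

Piping `jq -f split.jq stockprices.json` into a script that calls `split_stream(sys.stdin)` would reproduce the awk behaviour.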
peak

You would still need a shell loop to write the files, but jq itself only needs a single invocation:

jq -rc 'group_by(.symbol)[] | "\(.[0].symbol)\t\(.)"' stockprices.json |
while IFS=$'\t' read -r symbol content; do
    # printf rather than echo, so backslashes in the JSON survive intact
    printf '%s\n' "${content}" > "${symbol}.json"
done
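Each line the jq filter emits has the form SYMBOL, a tab, then that symbol's records as a compact JSON array; `IFS=$'\t' read -r symbol content` splits on the first tab only. A small Python illustration of that protocol, with a hypothetical sample line:

```python
import json

# Hypothetical sample of one line emitted by the jq -rc filter above:
line = 'AAPL\t[{"symbol": "AAPL", "close": "119.379997"}]'

# Split on the first tab only, like `IFS=$'\t' read -r symbol content`:
symbol, content = line.split("\t", 1)
records = json.loads(content)
```

Because `read -r` assigns everything after the first tab to `content`, the JSON array may itself contain further tabs without breaking the split.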
Jeff Mercado
  • Hey, this was super fast! – Matt Jan 31 '19 at 07:05
  • For documents of non-trivial size and/or a significant number of keys this might be the only practical solution. +1 – hek2mgl Jan 31 '19 at 16:01
  • @hek2mgl - the use of group_by, which entails a sort, is potentially problematic for large datasets. Also, if the original data is too large to fit into memory, a preliminary step (using jq's streaming parser, for example) would be needed. – peak Jan 31 '19 at 16:32
  • Yeah, that is true as I learned [here](https://stackoverflow.com/a/54451262/171318). I'll keep that in mind and star this thread (next to many others :) ) – hek2mgl Jan 31 '19 at 16:48