
I have a set of pricing data for a lot of stocks (around 1.1 million lines).

I'm having trouble parsing all of this data in memory, so I'd like to split it by stock symbol into individual files and import the data only as it is needed.

From:

stockprices.json

To:

AAPL.json
ACN.json
...

etc.

stockprices.json has this structure currently:

[{
    "date": "2016-03-22 00:00:00",
    "symbol": "ACN",
    "open": "121.029999",
    "close": "121.470001",
    "low": "120.720001",
    "high": "122.910004",
    "volume": "711400.0"
},
{
    "date": "2016-03-23 00:00:00",
    "symbol": "AAPL",
    "open": "121.470001",
    "close": "119.379997",
    "low": "119.099998",
    "high": "121.470001",
    "volume": "444200.0"
},
{
    "date": "2016-03-24 00:00:00",
    "symbol": "AAPL",
    "open": "118.889999",
    "close": "119.410004",
    "low": "117.639999",
    "high": "119.440002",
    "volume": "534100.0"
},
...{}....]

I believe that jq is the right tool for the job, but I'm having trouble understanding it.

How would I take the data above and use jq to split it by the symbol field?

For example, I'd like to end up with:

AAPL.json:

[{
    "date": "2016-03-23 00:00:00",
    "symbol": "AAPL",
    "open": "121.470001",
    "close": "119.379997",
    "low": "119.099998",
    "high": "121.470001",
    "volume": "444200.0"
},
{
    "date": "2016-03-24 00:00:00",
    "symbol": "AAPL",
    "open": "118.889999",
    "close": "119.410004",
    "low": "117.639999",
    "high": "119.440002",
    "volume": "534100.0"
}]

and ACN.json:

[{
    "date": "2016-03-22 00:00:00",
    "symbol": "ACN",
    "open": "121.029999",
    "close": "121.470001",
    "low": "120.720001",
    "high": "122.910004",
    "volume": "711400.0"
},
{
    "date": "2016-03-22 00:00:00",
    "symbol": "ACN",
    "open": "121.029999",
    "close": "121.470001",
    "low": "120.720001",
    "high": "122.910004",
    "volume": "711400.0"
}
]
peak
Matt

3 Answers


You could use a little shell loop:

#!/bin/bash
# List each symbol once (sort -u) so the inner jq runs once per
# unique symbol, not once per line of the input.
jq -r '.[].symbol' stockprices.json | sort -u | while read -r symbol ; do
    jq --arg s "${symbol}" \
        'map(if .symbol == $s then . else empty end)' \
        stockprices.json > "${symbol}.json"
done
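If re-reading stockprices.json once per symbol proves too slow at 1.1 million rows, the same split can be done in a single pass. Here is a stdlib-only Python sketch; the inline sample records are hypothetical stand-ins for `json.load(open("stockprices.json"))`:

```python
import json
from collections import defaultdict

# Hypothetical sample standing in for json.load(open("stockprices.json")):
records = [
    {"date": "2016-03-22 00:00:00", "symbol": "ACN", "close": "121.470001"},
    {"date": "2016-03-23 00:00:00", "symbol": "AAPL", "close": "119.379997"},
    {"date": "2016-03-24 00:00:00", "symbol": "AAPL", "close": "119.410004"},
]

# Bucket every record by its symbol in one pass over the data.
by_symbol = defaultdict(list)
for rec in records:
    by_symbol[rec["symbol"]].append(rec)

# Write one JSON array per symbol: AAPL.json, ACN.json, ...
for symbol, recs in by_symbol.items():
    with open(symbol + ".json", "w") as out:
        json.dump(recs, out, indent=4)
```

This trades the repeated scans for holding all records in memory at once, which is the same assumption peak's answer below makes.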
hek2mgl
  • Thank you, that was quick! I'm running your script right now and I'll report back with the result! – Matt Jan 30 '19 at 23:14
  • Ok. Make sure that you don't place any space after the \ chars. Those are line continuations for better readability. They must occur directly in front of the newline. – hek2mgl Jan 30 '19 at 23:15
  • The output is coming through now. It's worked perfectly. Thank you for your answer. – Matt Jan 30 '19 at 23:19

Here's a one-pass solution that assumes your RAM is big enough. The solution eschews using group_by, as that entails a sort operation, which is unnecessary and potentially costly in terms of time and memory.

To create the output files, awk is used here for efficiency, but is inessential to the approach.

split.jq

def aggregate_by(s; f; g):
  reduce s as $x (null; .[$x|f] += [$x|g]);

aggregate_by(.[]; .symbol; .)
| keys_unsorted[] as $k
| $k, .[$k]

Invocation using awk

jq -f split.jq stockprices.json | awk '
  substr($0,1,1) == "\"" {
    if (fn) {close(fn)};
    gsub(/^"|"$/,"",$0); fn=$0 ".json"; next;
  }
  {print >> fn}'
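For readers who find the awk state machine above cryptic, the same consumer can be sketched in Python. It follows the protocol split.jq emits: a quoted symbol line such as "AAPL", then the lines of that symbol's pretty-printed array (the function name and the sample lines in the test are mine, not part of the answer):

```python
def split_stream(lines):
    """Consume the alternating output of split.jq: a quoted symbol line
    such as "AAPL", then the lines of that symbol's pretty-printed array."""
    out = None
    for line in lines:
        if line.startswith('"'):        # a new symbol: open SYMBOL.json
            if out:
                out.close()
            out = open(line.strip().strip('"') + ".json", "w")
        elif out:
            out.write(line)             # body line of the current array
    if out:
        out.close()
```

Piping `jq -f split.jq stockprices.json` into a script that calls `split_stream(sys.stdin)` would reproduce the awk behaviour.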
peak

You would still need a shell loop to write the files, but jq itself only needs a single invocation:

jq -rc 'group_by(.symbol)[] | "\(.[0].symbol)\t\(.)"' stockprices.json |
while IFS=$'\t' read -r symbol content; do
    # printf rather than echo, so backslashes in the JSON survive intact
    printf '%s\n' "${content}" > "${symbol}.json"
done
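Each line the jq filter emits has the form SYMBOL, a tab, then that symbol's records as a compact JSON array; `IFS=$'\t' read -r symbol content` splits on the first tab only. A small Python illustration of that protocol, with a hypothetical sample line:

```python
import json

# Hypothetical sample of one line emitted by the jq -rc filter above:
line = 'AAPL\t[{"symbol": "AAPL", "close": "119.379997"}]'

# Split on the first tab only, like `IFS=$'\t' read -r symbol content`:
symbol, content = line.split("\t", 1)
records = json.loads(content)
```

Because `read -r` assigns everything after the first tab to `content`, the JSON array may itself contain further tabs without breaking the split.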
Jeff Mercado
  • Hey, this was super fast! – Matt Jan 31 '19 at 07:05
  • For documents of non-trivial size and/or a significant number of keys this might be the only practical solution. +1 – hek2mgl Jan 31 '19 at 16:01
  • @hek2mgl - the use of group_by, which entails a sort, is potentially problematic for large datasets. Also, if the original data is too large to fit into memory, a preliminary step (using jq's streaming parser, for example) would be needed. – peak Jan 31 '19 at 16:32
  • Yeah, that is true as I learned [here](https://stackoverflow.com/a/54451262/171318). I'll keep that in mind and star this thread (next to many others :) ) – hek2mgl Jan 31 '19 at 16:48