I want to efficiently convert a large JSON file, whose records always have the same field names, into a CSV that contains only the values and not the keys.

To give a concrete example, here is a large JSON file (tempST.json): https://gist.githubusercontent.com/pedro-roberto/b81672a89368bc8674dae21af3173e68/raw/e4afc62b9aa3092c8722cdbc4b4b4b6d5bbc1b4b/tempST.json

If I write just the fields time, ancestorcount and descendantcount from this JSON to a CSV, I should get:

1535995526,1,1
1535974524,1,1
1535974528,1,2
...
1535997274,1,1

The following script, tempSpeedTest.sh, writes the values of the fields time, ancestorcount and descendantcount to the CSV, one line per record:

rm tempOutput.csv
jq -c '.[]' < tempST.json | while read line; do 
descendantcount=$(echo $line | jq '.descendantcount')
ancestorcount=$(echo $line | jq '.ancestorcount')
time=$(echo $line | jq '.time')
echo "${time},${ancestorcount},${descendantcount}" >> tempOutput.csv
done

However, the script takes around 3 minutes to run, which is too slow:

>time bash tempSpeedTest.sh
real    2m50.254s
user    2m43.128s
sys     0m34.811s

What is a faster way to achieve the same result?

Pedro
  • BTW, `echo $line` is inherently buggy -- if your line contains a whitespace-surrounded `*`, for example, that would be replaced with a list of files in the current directory. Always quote -- `echo "$line"` -- if not using a more appropriate approach like `<<<"$line"`; see also [BashPitfalls #14](http://mywiki.wooledge.org/BashPitfalls#echo_.24foo). – Charles Duffy Sep 05 '18 at 01:10
  • The `<<<"$line"` trick is good to know, but the script still takes the same time as with echo – Pedro Sep 05 '18 at 01:30
  • Sure -- that's why this is a comment, not an answer. The point of changing `echo $line` to either `echo "$line"` or `<<<"$line"` is correctness, not performance. (When `$TMPDIR` is on a high-performance filesystem or a ramdisk, `<<<` is a fair bit faster than setting up a pipeline, but not so much so that you'd notice in most cases). – Charles Duffy Sep 05 '18 at 01:40
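
To make the quoting point above concrete, here is a minimal sketch of how an unquoted `echo $line` can corrupt a record. The scratch directory, the file names `a.txt` and `b.txt`, and the sample record are all hypothetical, chosen only so the glob has something predictable to match; none of them come from the question's data.

```shell
#!/usr/bin/env bash
# Work in an empty scratch directory so the glob matches known files only.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
touch a.txt b.txt

# Hypothetical record containing a whitespace-surrounded *.
line='{"note": "2 * 3"}'

unquoted=$(echo $line)        # word splitting + globbing: * becomes "a.txt b.txt"
quoted=$(echo "$line")        # quoted: the record is preserved verbatim
herestring=$(cat <<<"$line")  # here-string: also preserved, with no pipeline

printf '%s\n' "$unquoted" "$quoted" "$herestring"
# {"note": "2 a.txt b.txt 3"}
# {"note": "2 * 3"}
# {"note": "2 * 3"}

# Clean up the scratch directory.
cd / && rm -rf "$tmpdir"
```

Only the unquoted form changes the data. The quoted form and the here-string pass the same bytes on, so choosing between those two is a matter of style and, as noted in the comment above, a small performance difference.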

1 Answer

jq -r '.[] | [.time, .ancestorcount, .descendantcount] | @csv' <tempST.json >tempOutput.csv

See this running at https://jqplay.org/s/QJz5FCmuc9

Charles Duffy
  • Tremendous improvement, thanks. By the way, is there a way to speed it up while keeping the while loop? I have other processing to do where I will run a command on the values of each entry... – Pedro Sep 05 '18 at 01:13
  • The key to fixing it is having only one `jq` instance. You can have a `while` loop in front of `jq` generating its input, or you can have a `while` loop *after* `jq` consuming its output, but you can't invoke `jq` once per loop iteration and have anything that'll perform reasonably. – Charles Duffy Sep 05 '18 at 01:41
  • Anyhow, `jq` is a powerful enough language that whatever manipulation you might be thinking has to be done in native bash can almost certainly be implemented in `jq` itself. – Charles Duffy Sep 05 '18 at 01:42
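
As a rough sketch of the "`while` loop after `jq`" arrangement described in the comments above: a single `jq` process emits one tab-separated line per record, and bash reads the fields directly into variables. The inline `json` variable is a hypothetical stand-in for tempST.json that reuses the question's field names, and the `printf` inside the loop is only a placeholder for whatever per-entry command is actually needed.

```shell
#!/usr/bin/env bash
# Hypothetical sample standing in for tempST.json (same field names).
json='[{"time":1535995526,"ancestorcount":1,"descendantcount":1},
       {"time":1535974528,"ancestorcount":1,"descendantcount":2}]'

# jq runs once for the whole input; the loop only consumes its output.
output=$(
  jq -r '.[] | [.time, .ancestorcount, .descendantcount] | @tsv' <<<"$json" |
  while IFS=$'\t' read -r time ancestorcount descendantcount; do
    # Placeholder for per-entry processing; any command could run here.
    printf '%s,%s,%s\n' "$time" "$ancestorcount" "$descendantcount"
  done
)

printf '%s\n' "$output"
# 1535995526,1,1
# 1535974528,1,2
```

One caveat with this sketch: tab counts as IFS whitespace in bash, so `read` collapses consecutive tabs and an empty field would shift the later values into the wrong variables. That is harmless for the numeric fields here but worth checking if a field can be empty.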