
I am trying to process a large 3 GB JSON file. Currently, the jq utility takes nearly 40 minutes to load and process the entire file. Now I want to know how I can use a parallel/multithreaded approach with jq to complete the process in less time. I am using jq v1.5.

Command Used:

jq.exe -r -s "map(.\"results\" | map({\"ID\": (((.\"body\"?.\"party\"?.\"xrefs\"?.\"xref\"//[] | map(select(ID))[]?.\"id\"?))//null), \"Name\": (((.\"body\"?.\"party\"?.\"general-info\"?.\"full-name\"?))//null)} | [(.\"ID\"//\"\"|tostring), (.\"Name\"//\"\"|tostring)])) | add[] | join(\"~\")" "C:\InputFile.txt" > "C:\OutputFile.txt"

My data:

{
  "results": [
    {
      "_id": "0000001",
      "body": {
        "party": {
          "related-parties": {},
          "general-info": {
            "last-update-ts": "2011-02-14T08:21:51.000-05:00",
            "full-name": "Ibercaja Gestion SGIIC SAPensiones Nuevas Oportunidades",
            "status": "ACTIVE",
            "last-update-user": "TS42922",
            "create-date": "2011-02-14T08:21:51.000-05:00",
            "classifications": {
              "classification": [
                {
                  "code": "PENS"
                }
              ]
            }
          },
          "xrefs": {
            "xref": [
              {
                "type": "LOCCU1",
                "id": "X00893X"
              },
              {
                "type": "ID",
                "id": "1012227139"
              }
            ]
          }
        }
      }
    },
    {
      "_id": "000002",
      "body": {
        "party": {
          "related-parties": {},
          "general-info": {
            "last-update-ts": "2015-05-21T15:10:45.174-04:00",
            "full-name": "Innova Capital Sp zoo",
            "status": "ACTIVE",
            "last-update-user": "jw74592",
            "create-date": "1994-08-31T00:00:00.000-04:00",
            "classifications": {
              "classification": [
                {
                  "code": "CORP"
                }
              ]
            }
          },
          "xrefs": {
            "xref": [
              {
                "type": "ULTDUN",
                "id": "144349875"
              },
              {
                "type": "AVID",
                "id": "6098743"
              },
              {
                "type": "LOCCU1",
                "id": "1001210218"
              },
              {
                "type": "ID",
                "id": "1001210218"
              },
              {
                "type": "BLMBRG",
                "id": "10009050"
              },
              {
                "type": "REG_CO",
                "id": "0000068508"
              },
              {
                "type": "SMCI",
                "id": "13159"
              }
            ]
          }
        }
      }
    }
  ]
}

Can someone please tell me which command I need to use with v1.5 in order to achieve parallelism/multithreading?

Saikat Saha
  • You're not going to achieve that using jq alone... that's not what it does. You'd have to break the data out so that you can process each part in separate jq processes. – Jeff Mercado Jun 22 '15 at 16:08
  • @JeffMercado Is it possible to break the data from a single file into separate files using jq or any other utility? Do you have a reference for this functionality? – Saikat Saha Jun 22 '15 at 16:55
  • I don't think it's possible to output to multiple files using jq all at once; you'll have to use other tools to do that. – Jeff Mercado Jun 22 '15 at 17:07
  • Are you using a Windows environment, or do you have access to a bash-like environment? I could see how this could be done relatively easily with bash; Windows might take a little bit of work. – Jeff Mercado Jun 22 '15 at 17:18
  • I am using a Windows environment. Are you aware of any other tool which can be used to split the big JSON file into multiple small files? – Saikat Saha Jun 22 '15 at 17:27
  • You could use jq to select individual results, then pipe that out to a tool that can split the files. Bash has `split`, which could do that. I'd suggest looking into installing something like MinGW to help out here; you could script this out in bash. On Windows, on the other hand, you'd have to write or find programs to do a lot of this for you. – Jeff Mercado Jun 22 '15 at 17:35
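A minimal sketch of the splitting approach described in these comments, assuming a bash-like environment (e.g. MinGW on Windows), that the input is in data.json, and that the per-record extraction filter lives in a hypothetical extract.jq. Note that the first jq invocation still has to parse the whole document once:

# Emit one result object per line, then split into chunks of 10,000 lines.
jq -c '.results[]' data.json | split -l 10000 - part_

# Process each chunk in its own background jq process.
for f in part_*; do
  jq -r -f extract.jq "$f" > "$f.out" &
done
wait

# Stitch the partial outputs back together.
cat part_*.out > OutputFile.txt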

2 Answers


Here is a streaming approach which assumes your 3 GB data file is in data.json and that the following filter is in filter1.jq:

  select(length==2)                                                  # keep only [path, leaf-value] events
| . as [$p, $v]                                                      # $p is the path array, $v the leaf value
| {r:$p[1]}                                                          # $p[1] is the record index within .results
| if   $p[2:6] == ["body","party","general-info","full-name"]       then .name = $v
  elif $p[2:6] == ["body","party","xrefs","xref"] and $p[7] == "id" then .id   = $v
  else  empty
  end

When you run jq with

$ jq -M -c --stream -f filter1.jq data.json

jq will produce a stream of results with just the minimal details you need:

{"r":0,"name":"Ibercaja Gestion SGIIC SAPensiones Nuevas Oportunidades"}
{"r":0,"id":"X00893X"}
{"r":0,"id":"1012227139"}
{"r":1,"name":"Innova Capital Sp zoo"}
{"r":1,"id":"144349875"}
{"r":1,"id":"6098743"}
{"r":1,"id":"1001210218"}
{"r":1,"id":"1001210218"}
{"r":1,"id":"10009050"}
{"r":1,"id":"0000068508"}
{"r":1,"id":"13159"}

which you can convert to your desired format using a second filter, filter2.jq:

foreach .[] as $i (
     {c: null, r:null, id:null, name:null}                          # state: current event plus last record, id, name

   ; .c = $i
   | if .r != .c.r then .id=null | .name=null | .r=.c.r else . end  # control break: reset state on a new record
   | .id   = if .c.id == null   then .id   else .c.id   end         # carry forward the most recent id
   | .name = if .c.name == null then .name else .c.name end         # carry forward the most recent name

   ; [.id, .name]
   | if contains([null]) then empty else . end                      # emit only when both id and name are known
   | join("~")
)

which consumes the output of the first filter when run with

$ jq -M -c --stream -f filter1.jq data.json | jq -M -s -r -f filter2.jq

and produces

X00893X~Ibercaja Gestion SGIIC SAPensiones Nuevas Oportunidades
1012227139~Ibercaja Gestion SGIIC SAPensiones Nuevas Oportunidades
144349875~Innova Capital Sp zoo
6098743~Innova Capital Sp zoo
1001210218~Innova Capital Sp zoo
1001210218~Innova Capital Sp zoo
10009050~Innova Capital Sp zoo
0000068508~Innova Capital Sp zoo
13159~Innova Capital Sp zoo

This might be all you need, using just two jq processes. If you need more parallelism, you could use the record number (r) to partition the data and process the partitions in parallel. For example, if you save the intermediate output into a temp.json file

$ jq -M -c --stream -f filter1.jq data.json > temp.json

then you could process temp.json in parallel with filters such as

$ jq -M 'select(0==.r%3)' temp.json | jq -M -s -r -f filter2.jq > result0.out &
$ jq -M 'select(1==.r%3)' temp.json | jq -M -s -r -f filter2.jq > result1.out &
$ jq -M 'select(2==.r%3)' temp.json | jq -M -s -r -f filter2.jq > result2.out &

and concatenate your partitions into a single result at the end if necessary. This example uses 3 partitions, but you could easily extend the approach to any number of partitions if you need more parallelism.
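For example, once the three background jobs above have finished (assuming they were started from the same shell):

$ wait
$ cat result0.out result1.out result2.out > result.out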

GNU parallel is also a good option. As mentioned in the jq Cookbook, jq-hopkok's parallelism folder has some good examples.
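For instance, here is a minimal sketch that runs the same three modulo partitions through GNU parallel instead of hand-managed background jobs, again assuming the temp.json from above:

$ parallel 'jq -M "select({}==.r%3)" temp.json | jq -M -s -r -f filter2.jq > result{}.out' ::: 0 1 2

Whatever partitioning you use must keep all lines for a given record in the same chunk, since filter2.jq relies on a record's lines arriving contiguously; splitting on .r, as here, preserves that.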

jq170727

For a file of this size, you need to stream the file in and process one item at a time. First seek to `"results": [`, then use a function called something like `readItem` that uses a stack to match braces: append each character to a buffer, and once the opening brace is closed by its matching brace, deserialize the buffered item.

I recommend Node.js + lodash as the implementation language.
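Alternatively, if you would rather not hand-roll a parser, jq 1.5 itself can produce the same one-item-at-a-time effect: --stream combined with truncate_stream/fromstream rebuilds each element of .results individually without slurping the whole document. A minimal sketch, assuming the input shape shown in the question:

$ jq -cn --stream 'fromstream(2|truncate_stream(inputs))' data.json

Each output line is one complete element of .results, which you can then pipe into a per-item filter or into the splitting scheme from the comments above.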

Eric Hartford