1

I Have the following nested json file which I want to parse with jq tool and print in table form like I show at the end

The input.json structure is like this:

{
 "document":{
  "page":[
     {
        "@index":"0",
        "image":{
           "@data":"ABC",
           "@format":"png",
           "@height":"620.00",
           "@type":"base64encoded",
           "@width":"450.00",
           "@x":"85.00",
           "@y":"85.00"
        }
     },
     {
        "@index":"1",
        "row":[
           {
              "column":[
                 {
                    "text":""
                 },
                 {
                    "text":{
                       "#text":"Text1",
                       "@fontName":"Arial",
                       "@fontSize":"12.0",
                       "@height":"12.00",
                       "@width":"71.04",
                       "@x":"121.10",
                       "@y":"83.42"
                    }
                 }
              ]
           },
           {
              "column":[
                 {
                    "text":""
                 },
                 {
                    "text":{
                       "#text":"Text2",
                       "@fontName":"Arial",
                       "@fontSize":"12.0",
                       "@height":"12.00",
                       "@width":"101.07",
                       "@x":"121.10",
                       "@y":"124.82"
                    }
                 }
              ]
           }
        ]
     },
     {
        "@index":"2",
        "row":[
           {
              "column":{
                 "text":{
                    "#text":"Text3",
                    "@fontName":"Arial",
                    "@fontSize":"12.0",
                    "@height":"12.00",
                    "@width":"363.44",
                    "@x":"85.10",
                    "@y":"69.62"
                 }
              }
           },
           {
              "column":{
                 "text":{
                    "#text":"Text4",
                    "@fontName":"Arial",
                    "@fontSize":"12.0",
                    "@height":"12.00",
                    "@width":"382.36",
                    "@x":"85.10",
                    "@y":"83.42"
                 }
              }
           },
           {
              "column":{
                 "text":{
                    "#text":"Text5",
                    "@fontName":"Arial",
                    "@fontSize":"12.0",
                    "@height":"12.00",
                    "@width":"435.05",
                    "@x":"85.10",
                    "@y":"97.22"
                 }
              }
           }
        ]
     },
     {
        "@index":"3"
     }
  ]
 }
}

Following the answers of the following question (Parsing nested json with jq) I've tried this code but doesn't work

$ cat file.json | jq .document.page[].row | ["#text", "@x", "@y"] | @csv

The output I'm trying to get is:

#text @x     @y
Text1 121.10 83.42
Text2 121.10 124.82
Text3 65.10  69.62
Text4 85.10  83.42
Text5 85.10  97.22

How can achieve this?

Thanks

UPDATE

Thanks so much for the help. I've tried with a real file a bit longer.

I was able to adapt the first peak's solution like below:

["#text", "@data", "@fontName", "@fontSize", "@format", "@height", "@type", "@width", "@x", "@y"], 
( .. 
| objects 
| select(has("#text","@data")) 
| [.["#text", "@data", "@fontName", "@fontSize", "@format", "@height", "@type", "@width", "@x", "@y"]]
)  
| @tsv

and with new input I get this table:

+---------------+-------+-----------+-----------+---------+---------+---------------+--------+--------+--------+
| #text         | @data | @fontName | @fontSize | @format | @height | @type         | @width | @x     | @y     |
+---------------+-------+-----------+-----------+---------+---------+---------------+--------+--------+--------+
|               | ABC   |           |           | png     | 620     | base64encoded | 450    | 85     | 85     |
+---------------+-------+-----------+-----------+---------+---------+---------------+--------+--------+--------+
| Text ä 1      |       | Tahoma    | 12        |         | 12      |               | 427.79 | 85.1   | 69.62  |
+---------------+-------+-----------+-----------+---------+---------+---------------+--------+--------+--------+
| Text ¢76      |       | Tahoma    | 12        |         | 12      |               | 270.5  | 85.1   | 690.72 |
+---------------+-------+-----------+-----------+---------+---------+---------------+--------+--------+--------+
| Text % 5      |       | Tahoma    | 12        |         | 12      |               | 130.84 | 358.86 | 690.72 |
+---------------+-------+-----------+-----------+---------+---------+---------------+--------+--------+--------+
| Text 7Ç8      |       | Tahoma    | 12        |         | 12      |               | 115.95 | 85.1   | 704.52 |
+---------------+-------+-----------+-----------+---------+---------+---------------+--------+--------+--------+
| Text • 2 Wñ79 |       | Tahoma    | 8         |         | 8.04    |               | 398.16 | 121.1  | 68.06  |
+---------------+-------+-----------+-----------+---------+---------+---------------+--------+--------+--------+
| Text          |       | Tahoma    | 12        |         | 12      |               | 101.5  | 85.1   | 83.42  |
|   » 1 A\\\\CÓ |       |           |           |         |         |               |        |        |        |
+---------------+-------+-----------+-----------+---------+---------+---------------+--------+--------+--------+
| Text 12       |       | Tahoma    | 12        |         | 12      |               | 312.26 | 189.83 | 83.42  |
+---------------+-------+-----------+-----------+---------+---------+---------------+--------+--------+--------+
| Text 82       |       | Tahoma    | 12        |         | 12      |               | 44.99  | 85.1   | 97.22  |
+---------------+-------+-----------+-----------+---------+---------+---------------+--------+--------+--------+
| Text 31       |       | Tahoma    | 8         |         | 8.04    |               | 381.83 | 133.1  | 95.66  |
+---------------+-------+-----------+-----------+---------+---------+---------------+--------+--------+--------+

If possible, how to add the follwing 3 columns (counter, page and row) to know the corresponding page and row for each line?

The expected output would be like this:

+-------+------+-----+-------------------+-------+-----------+-----------+---------+---------+---------------+--------+--------+--------+
| counter | page | row | #text             | @data | @fontName | @fontSize | @format | @height | @type         | @width | @x     | @y     |
+-------+------+-----+-------------------+-------+-----------+-----------+---------+---------+---------------+--------+--------+--------+
| 1     | 0    |     |                   | ABC   |           |           | png     | 620     | base64encoded | 450    | 85     | 85     |
+-------+------+-----+-------------------+-------+-----------+-----------+---------+---------+---------------+--------+--------+--------+
| 2     | 1    | 0   | Text ä 1          |       | Tahoma    | 12        |         | 12      |               | 427.79 | 85.1   | 69.62  |
+-------+------+-----+-------------------+-------+-----------+-----------+---------+---------+---------------+--------+--------+--------+
| 3     | 1    | 1   | Text ¢76          |       | Tahoma    | 12        |         | 12      |               | 270.5  | 85.1   | 690.72 |
+-------+------+-----+-------------------+-------+-----------+-----------+---------+---------+---------------+--------+--------+--------+
| 4     | 1    | 1   | Text % 5          |       | Tahoma    | 12        |         | 12      |               | 130.84 | 358.86 | 690.72 |
+-------+------+-----+-------------------+-------+-----------+-----------+---------+---------+---------------+--------+--------+--------+
| 5     | 2    | 2   | Text 7Ç8          |       | Tahoma    | 12        |         | 12      |               | 115.95 | 85.1   | 704.52 |
+-------+------+-----+-------------------+-------+-----------+-----------+---------+---------+---------------+--------+--------+--------+
| 6     | 2    | 0   | Text • 2 Wñ79     |       | Tahoma    | 8         |         | 8.04    |               | 398.16 | 121.1  | 68.06  |
+-------+------+-----+-------------------+-------+-----------+-----------+---------+---------+---------------+--------+--------+--------+
| 7     | 2    | 1   | Text  » 1 A\\\\CÓ |       | Tahoma    | 12        |         | 12      |               | 101.5  | 85.1   | 83.42  |
+-------+------+-----+-------------------+-------+-----------+-----------+---------+---------+---------------+--------+--------+--------+
| 8     | 2    | 1   | Text 12           |       | Tahoma    | 12        |         | 12      |               | 312.26 | 189.83 | 83.42  |
+-------+------+-----+-------------------+-------+-----------+-----------+---------+---------+---------------+--------+--------+--------+
| 9     | 2    | 2   | Text 82           |       | Tahoma    | 12        |         | 12      |               | 44.99  | 85.1   | 97.22  |
+-------+------+-----+-------------------+-------+-----------+-----------+---------+---------+---------------+--------+--------+--------+
| 10    | 2    | 2   | Text 31           |       | Tahoma    | 8         |         | 8.04    |               | 381.83 | 133.1  | 95.66  |
+-------+------+-----+-------------------+-------+-----------+-----------+---------+---------+---------------+--------+--------+--------+

This is new more representative input file input2.json.

And seeing the Json structure in image below gives and idea about page number and row number present in json file and values within them.

enter image description here

Ger Cas
  • 2,188
  • 2
  • 18
  • 45
  • The sample output clearly isn't CSV and yet your program uses `@csv`. It might be helpful if you resolved the discrepancy. Thanks. – peak May 30 '19 at 06:40

5 Answers5

2

Here's a simple (perhaps too simple?) approach that focuses on the embedded JSON objects that have a "#text" attribute:

["#text", "@x", "@y"],       # the header
( ..
  | objects
  | select(has("#text"))  
  | [.["#text", "@x", "@y"]] # a row
) 
| @csv

When given this program and the sample input, an invocation of jq using the -r option would produce:

"#text","@x","@y"
"Text1","121.10","83.42"
"Text2","121.10","124.82"
"Text3","85.10","69.62"
"Text4","85.10","83.42"
"Text5","85.10","97.22"

If you don't want the quotation marks and are willing to risk that the output is not strictly CSV, then one option would be to use join(",") instead of @csv at the end of the pipeline.

Variants

You might want to use @tsv instead of @csv.

If a more restrictive approach to selecting the relevant embedded objects is needed, then perhaps replacing .. with .. | .text? will suffice.

If not, additional filters can be added depending on the detailed requirements.

peak
  • 105,803
  • 17
  • 152
  • 177
  • Thanks so much for your help. I see I was going completely for another road. It works juste perfect. I was wondering if you can help me with one more thing. Adding a counter, page and row related with each line for the new input that is more representative to a real file, please see below my UPDATE in original post. Thanks again – Ger Cas May 30 '19 at 19:18
1

Here's a solution that uses a "drill-down" and therefore rather tedious approach:

["#text", "@x", "@y"],
( .document.page[]
  | .row[]?
  | .column
  | (if type == "array" then .[] else . end)
  | .text
  | objects
  | [.["#text", "@x", "@y"]]
)
| @tsv

This would be used in conjunction with the -r command-line option.

I've used @tsv as this produces output that resembles the given expected output. As mentioned elsewhere on this page, there are other alternatives, e.g. using join/1.

peak
  • 105,803
  • 17
  • 152
  • 177
1

for those who are interested in the alternative solutions, here's how to achieve the same ask using a walk-path unix tool for JSON: jtc.

bash $ jtc -qq -w'<>a' -T'"#text\t@x\t@y"' -w'<@x>l:<x>v[-1][@y]<y>v[-1][#text]' -T'"{}\t{x}\t{y}"' file.json
#text   @x      @y
Text1   121.10  83.42
Text2   121.10  124.82
Text3   85.10   69.62
Text4   85.10   83.42
Text5   85.10   97.22
bash $ 

walk path (-w) breakdown:

  • <@x>l: <x>v find each label @x and memorize found JSON value in the namespace x
  • [-1][@y]<y>v address a parent (from last found value), then address JSON by label @y and memorize its value in the namespace y
  • [-1][#text]do the same for #text label (note: not memorizing the last value)

- -T'"{}\t{x}\t{y}"': apply template with interpolation ({} will interpolate the last found value, hence there was no need memorizing it in the namespace)

- -qq will unquote the resulting JSON string (dropping quotation marks and translating \t into tabs)

- first walk (-w'<>a') is just a dummy one to trigger template interpolation for the header line.

PS> Disclosure: I'm the creator of the jtc - shell cli tool for JSON operations

Dmitry
  • 1,275
  • 1
  • 5
  • 14
  • Thanks for your help. I'll try to test your solution since currently I only have installed jq. But nice to know there are more tools for this task that I didn't know. – Ger Cas May 30 '19 at 19:24
  • @Ger Cas: sure, `jtc` was developed and published relatively recently (about half a year) and still under active development – Dmitry May 30 '19 at 23:01
1

Handling input2.json

Since some context-dependent information is required for the second set of requirements corresponding to input2.json, the context cannot be ignored, and so the following solution uses a "drill-down" approach. The following will be a bit difficult to understand unless you understand foreach, so let me just mention that the approach essentially uses a state variable {counter, page, row} to keep track of the three counters.

["counter", "page", "row", "#text", "@data", "@fontName", "@fontSize", "@format", "@height", "@type", "@width", "@x", "@y"], 
(foreach (.document.page[] | objects) as $page ({page: -1, counter: 0};
  .page += 1
  | foreach ($page | .row[]?) as $row (.row=-1;
    .row += 1
    | foreach ($row | (.column | (if type == "array" then .[] else . end )) | .text | objects) as $x (.;
      .counter += 1
      | .out = [.counter, .page, .row, $x["#text", "@data", "@fontName", "@fontSize", "@format", "@height", "@type", "@width", "@x", "@y"]]
      ; . )
      ; . )
      ; .out )
)
| @tsv

This produces the desired TSV except for the very first line of data as that has no row. One way to include the first line is shown in my answer at Relate elements in table form from Json file with jq

peak
  • 105,803
  • 17
  • 152
  • 177
-1

In this command:

$ cat file.json | jq .document.page[].row | ["#text", "@x", "@y"] | @csv

Everything after the jq is supposed to be the first argument to jq, which means you need to enclose it in quotation marks. Moreover, cat file.json | is here a Useless Use of Cat; just pass the filename as an argument to jq. Hence, the correct command is:

$ jq '.document.page[].row | ["#text", "@x", "@y"] | @csv' file.json
jwodder
  • 54,758
  • 12
  • 108
  • 124
  • 1
    Hello. Thanks for answer. I tested as you said and I'm getting as output 4 lines containing this ` "\"#text\",\"@x\",\"@y\""` . What could be missing? – Ger Cas May 30 '19 at 03:21