
I want to extract contents from large JSON files that appear to editors as one line (so I can't operate on a line basis), e.g.

{"license": 2, "file_name": "COCO_test2014_000000523573.jpg", "coco_url": "http://mscoco.org/images/523573", "height": 500, "width": 423, "date_captured": "2013-11-14 12:21:59", "id": 523573}, {"license . . .

For example, is there a way (sed, grep, ...?) I can search for the word 000000523573 and print the 100 characters preceding and 200 characters succeeding occurrences of the word?

Benjamin W.
Ben
    Could you please include the code you've produced so far, the results you're getting, and an example of the results you're after? Check out the [MCVE](http://stackoverflow.com/help/mcve) description and SO's [How To Ask](http://stackoverflow.com/help/how-to-ask) page for guidance on how to make this a great question. – ghoti Dec 31 '16 at 01:31

3 Answers


jq is the tool you want to use to parse JSON natively. If it's a structured format, don't treat it like random text.

$ jq . < input.json
{
  "license": 2,
  "file_name": "COCO_test2014_000000523573.jpg",
  "coco_url": "http://mscoco.org/images/523573",
  "height": 500,
  "width": 423,
  "date_captured": "2013-11-14 12:21:59",
  "id": 523573
}
$ jq .height < input.json
500

To find the JSON record whose file_name field contains a particular string, you might do something like this:

jq 'select(.file_name|contains("000000523573"))' < input.json

The notation here takes longer to explain than makes sense for a single SO answer. Do have a look at the jq query structure if you're interested in using this tool.
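The sample in the question shows several objects separated by commas, which is only valid JSON if they sit inside an array. A minimal sketch for that case, assuming the file's top level is such an array (the real file layout is not shown in the question); `find_image` and its two arguments are hypothetical names used for illustration:

```shell
# find_image FILE ID: hypothetical helper, not part of jq itself.
# Assumes FILE holds a top-level JSON array of objects that each have a
# "file_name" string field.
find_image() {
  # .[] iterates over the array elements; select() keeps only the objects
  # whose file_name contains the ID passed in through --arg.
  # -c prints each matching object compactly on one line.
  jq -c --arg id "$2" '.[] | select(.file_name | contains($id))' "$1"
}
```

It would be called as `find_image input.json 000000523573`. If the array is nested under a key instead, the `.[]` part of the filter would need to be adjusted to that path.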

ghoti
  • `+1` nice solution... please notice that jq has to be installed, as it's not there by default in any system – Flash Thunder Dec 30 '16 at 23:51
  • @FlashThunder - yes, absolutely. One of the reasons I provided the link. :) (I don't know what platform you're on, but I expect you should be able to find jq in your friendly neighbourhood package repository.) – ghoti Dec 30 '16 at 23:52

data.txt:

{"license": 2, "file_name": "COCO_test2014_000000523573.jpg", "coco_url": "http://mscoco.org/images/523573", "height": 500, "width": 423, "date_captured": "2013-11-14 12:21:59", "id": 523573}, {"license": 2, "file_name": "COCO_test2014_000000523574.jpg", "coco_url": "http://mscoco.org/images/523574", "height": 500, "width": 423, "date_captured": "2013-11-14 12:21:59", "id": 523574}

command (needs GNU sed for \s and for \n in the replacement):

sed 's/},\s{/}\n{/g' data.txt | grep "000000523573"

output:

{"license": 2, "file_name": "COCO_test2014_000000523573.jpg", "coco_url": "http://mscoco.org/images/523573", "height": 500, "width": 423, "date_captured": "2013-11-14 12:21:59", "id": 523573}
Flash Thunder

As demonstrated in ghoti's answer, jq is definitely your best bet.

As for your exact question ("search for the word 000000523573 and print the 100 characters preceding and 200 characters succeeding"): you could use grep -o as follows:

grep -Eo '.{100}000000523573.{200}' infile

This has a few drawbacks:

  • If 000000523573 occurs fewer than 100 characters after the start of the line or fewer than 200 characters before its end, it will not be matched (grep works line by line, and here the whole file is a single line).
  • If fewer than 300 characters separate two occurrences, the later one will be skipped, because grep -o does not report overlapping matches.
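Both drawbacks are easy to reproduce on a toy line. The following is a scaled-down sketch, not the real data: X stands in for 000000523573, and the context is shrunk to 3 characters before and 5 after so the effect is visible at a glance.

```shell
# Scaled-down stand-in for the real command: 3 characters of context before
# the match and 5 after, with X playing the role of 000000523573.
line='abXcdefghXijklXmnopqrstu'

# Same shape as '.{100}000000523573.{200}', only with smaller counts.
printf '%s\n' "$line" | grep -Eo '.{3}X.{5}'
```

This prints only `fghXijklX`. The first X has just two characters in front of it, so it cannot match, and the third X is swallowed as trailing context of the second, so it is never reported on its own.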

These can be alleviated somewhat by loosening the requirements to "print up to 100/200 characters before/after occurrences":

grep -Eo '.{0,100}000000523573.{0,200}' infile
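Even the loosened pattern cannot report an occurrence that falls inside the context of an earlier match. If every occurrence matters, one option is to leave grep and walk the line with awk instead. This is a rough sketch of a different technique, not what the answer above does; `context` and its arguments are hypothetical names:

```shell
# context WORD BEFORE AFTER FILE: hypothetical helper that prints every
# occurrence of WORD with up to BEFORE characters in front of it and up to
# AFTER characters behind it, including occurrences that overlap.
# WORD is treated as a literal string, not a regular expression.
context() {
  awk -v w="$1" -v b="$2" -v a="$3" '
    {
      pos = 1                                  # where the next search starts
      while ((i = index(substr($0, pos), w)) > 0) {
        hit = pos + i - 1                      # 1-based offset of this occurrence
        start = (hit > b) ? hit - b : 1        # clamp at the start of the line
        # substr() stops at the end of the line by itself, so no clamp is
        # needed on the right-hand side.
        print substr($0, start, (hit - start) + length(w) + a)
        pos = hit + 1                          # advance one character to keep overlaps
      }
    }' "$4"
}
```

For the question's numbers this would be `context 000000523573 100 200 infile`. Two caveats: each search copies the rest of the line, so a file with very many occurrences will be slow, and awk interprets backslash escapes in `-v` values, which is harmless for a purely numeric search word but matters for others.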

But, again, the proper approach is to use jq. See also this question about command line JSON parsing.

Benjamin W.