How to extract contents of a large text file that appears to editors as only one line

Question

I want to extract contents from large JSON files that appear to editors as one line (so I can't operate on a line basis), e.g.

{"license": 2, "file_name": "COCO_test2014_000000523573.jpg", "coco_url": "http://mscoco.org/images/523573", "height": 500, "width": 423, "date_captured": "2013-11-14 12:21:59", "id": 523573}, {"license . . .

For example, is there a way (sed, grep, ...?) I can search for the word 000000523573 and print the 100 characters preceding and 200 characters succeeding occurrences of the word?

Could you please include the code you've produced so far, the results you're getting, and an example of the results you're after? Check out the [MCVE](http://stackoverflow.com/help/mcve) description and SO's "[How To Ask](http://stackoverflow.com/help/how-to-ask)" for guides on how to make this a great question. – ghoti, Dec 31 '16 at 01:31

ghoti · Answer 1 · 2016-12-31T01:29:23.557

2

jq is the tool you want to use to parse JSON natively. If it's a structured format, don't treat it like random text.

$ jq . < input.json
{
  "license": 2,
  "file_name": "COCO_test2014_000000523573.jpg",
  "coco_url": "http://mscoco.org/images/523573",
  "height": 500,
  "width": 423,
  "date_captured": "2013-11-14 12:21:59",
  "id": 523573
}
$ jq .height < input.json
500

To find the JSON record whose file_name field contains a particular string, you might do something like this:

jq 'select(.file_name|contains("000000523573"))' < input.json

The notation here is ... longer to explain than makes sense for a single SO answer. Do have a look at the JQ query structure if you're interested in using this tool.
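The filter above expects input.json to be a stream of bare objects. If the records instead sit inside one top-level JSON array (an assumption; the question only shows a fragment of the file), each element has to be unpacked with `.[]` first. A minimal sketch, where the function name `find_by_file_name` is hypothetical:

```shell
# Hypothetical helper: print every record whose file_name contains the needle.
# Assumes the file holds one top-level JSON array of objects: [ {...}, {...} ]
# Usage: find_by_file_name NEEDLE FILE
find_by_file_name() {
  # --arg passes the shell value in as the jq variable $needle (always a string);
  # -c prints each matching object compactly, one per line.
  jq -c --arg needle "$1" \
    '.[] | select(.file_name | contains($needle))' "$2"
}
```

It would be called as `find_by_file_name 000000523573 input.json`. Passing the search string with `--arg` rather than pasting it into the filter avoids quoting problems if the string ever contains quotes or backslashes.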

edited Dec 31 '16 at 01:29

answered Dec 30 '16 at 23:50

ghoti


`+1` nice solution... please notice that jq has to be installed, as it's not there by default in any system – Flash Thunder Dec 30 '16 at 23:51
@FlashThunder - yes, absolutely. One of the reasons I provided the link. :) (I don't know what platform you're on, but I expect you should be able to find jq in your friendly neighbourhood package repository.) – ghoti Dec 30 '16 at 23:52

score 0 · Answer 2 · answered Dec 30 '16 at 23:44

data.txt:

{"license": 2, "file_name": "COCO_test2014_000000523573.jpg", "coco_url": "http://mscoco.org/images/523573", "height": 500, "width": 423, "date_captured": "2013-11-14 12:21:59", "id": 523573}, {"license": 2, "file_name": "COCO_test2014_000000523574.jpg", "coco_url": "http://mscoco.org/images/523574", "height": 500, "width": 423, "date_captured": "2013-11-14 12:21:59", "id": 523574}

command (GNU sed; \s in the pattern and \n in the replacement are GNU extensions):

sed 's/\},\s{/}\n{/g' data.txt | grep "000000523573"

output:

{"license": 2, "file_name": "COCO_test2014_000000523573.jpg", "coco_url": "http://mscoco.org/images/523573", "height": 500, "width": 423, "date_captured": "2013-11-14 12:21:59", "id": 523573}
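The same split-then-filter idea can be sketched without relying on sed's handling of `\n`, by letting `grep -o` print each brace-delimited record on its own line. This assumes the records are flat, with no nested objects and no braces inside string values, as in the sample data; the function name `records_matching` is hypothetical.

```shell
# Hypothetical helper: print each flat {...} record that contains the needle.
# Assumes records have no nested braces and no braces inside string values.
# Usage: records_matching NEEDLE FILE
records_matching() {
  # First grep: -o prints every brace-delimited chunk on its own line.
  # Second grep: -F treats the needle as a fixed string, not a regex.
  grep -o '{[^{}]*}' "$2" | grep -F -- "$1"
}
```

Calling `records_matching 000000523573 data.txt` on the data.txt shown above should print only the first record. For anything with nested structure, a real JSON parser such as jq remains the safer choice.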

score 0 · Answer 3 · edited May 23 '17 at 12:01

As demonstrated in ghoti's answer, jq is definitely your best bet.

As for your exact question ("search for the word 000000523573 and print the 100 characters preceding and 200 characters succeeding"): you could use grep -o as follows:

grep -Eo '.{100}000000523573.{200}' infile

This has a few drawbacks:

- If 000000523573 occurs within the first 100 characters of the line or within the last 200 characters before its end, that occurrence will be ignored.
- If the distance between two occurrences is less than 300 characters, the later occurrence will be ignored (grep -o does not report overlapping matches).

These can be alleviated somewhat by loosening the requirements to "print up to 100/200 characters before/after occurrences":

grep -Eo '.{0,100}000000523573.{0,200}' infile
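Occurrences that sit close together are still a problem for grep -o, because it never reports overlapping matches. If every occurrence needs its own context window, one option is to walk the line with awk's index() and substr(). This is a rough sketch under a few assumptions: the function name `context_around` is hypothetical, the needle is a plain string without backslashes (awk -v interprets escape sequences), and the awk in use can hold the whole line in memory.

```shell
# Hypothetical helper: for every occurrence of a fixed string, print up to
# BEFORE characters preceding it and up to AFTER characters following it.
# Usage: context_around NEEDLE BEFORE AFTER FILE
context_around() {
  awk -v needle="$1" -v before="$2" -v after="$3" '
  {
    line = $0
    offset = 0    # position of the previous hit (0 = none yet)
    # Search only the part of the line after the previous hit.
    while ((pos = index(substr(line, offset + 1), needle)) > 0) {
      hit = offset + pos              # 1-based position of this hit
      start = hit - before
      if (start < 1) start = 1        # clamp at the start of the line
      len = (hit - start) + length(needle) + after
      print substr(line, start, len)  # substr clamps at the end of the line
      offset = hit
    }
  }' "$4"
}
```

For the question as asked, this would be `context_around 000000523573 100 200 infile`. Each call to substr() copies the rest of the line, so on a very large file with many hits this will be slow; it is meant to illustrate the idea rather than to replace jq.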

But, again, the proper approach is to use jq. See also this question about command line JSON parsing.
