Script for extracting information of specific pattern from a text file

Question

Hi, I am working on a project that deals with a large amount of data. I have a text file of around 2 GB containing key-value pairs, where each key has multiple values. I need to extract all the keys into a separate file, as I need them for testing a particular feature.

The format of the file is:

:k: k1 :v: {XYZ:{id:"k1",score:0e0,tags:null},ABC:[{XYZ:{id:"k1",score:0e0,tags:null},PQR:[{id:"ID1",score:71.85e0,tags:[{color:"DARK"},{Type:"S1"},{color:"BLACK"}]},MetaData:{RuleId:"R3",Score:66.26327129015809e0,Quality:"GOOD"}},{XYZ:{id:"k1",score:0e0,tags:null},PQR:[..(same as above format)..],MetaData:{RuleId:"R3",Score:65.8234565409752e0,Quality:"GOOD"}} ::

(The same pattern repeats on each new line, with a different key.)

When I search for ":k: " in the file using CTRL+F, only these keys get highlighted, so I think this pattern occurs nowhere in the file except at the start of a line.

There are thousands of keys like this.

I want all of these keys (k1, k2, ...) extracted to a separate file for testing.

There are multiple lines for :k: and I want to separate (k1, k2, ..) into a separate file. How can I do this?

Python is also fine for me. I could use regular expressions in Python, or perhaps the "sed" shell command. Please help me understand how I can use these to extract the keys.

Can someone help me write a shell/Python script for this? I know it's very trivial, but I'm a novice at this kind of data processing.

I am also focusing on optimizing the run time, as the data is very large.

I wouldn't call that _very trivial_. Can you provide a real example of a file (without the `...`) — Jean-François Fabre, Oct 03 '16 at 19:34
I have updated the post! Let me know if anything else you want to know! — user2621826, Oct 03 '16 at 19:59
so you want to create a separate file for each "first word" (:k:) in the file? And there are multiple lines for `:k:` ? If so do you want the separate file to contain all `:k:` records, the first, the last or ??? . (your requirement is unclear). Good luck. — shellter, Oct 04 '16 at 03:32
Yes correct. There are multiple lines for :k: and want to separate (k1, k2, ..) in a separate file — user2621826, Oct 04 '16 at 05:39

shellter · Answer 1 · 2016-10-04T21:21:40.160

Assuming a file like

:k: k1 :v: {XYZ:{id:
:k2: k1 :v: {XYZ:{id:
:k: k1 :v: {XYZ:{id:
:k3: k1 :v: {XYZ:{id:
:k: k1 :v: {XYZ:{id:

You can easily do this in one pass, with no memory restrictions, using

awk '{fName=$1; gsub(/:/,"",fName); print >> fName ; close(fName)}' inFile

which gives the following output

$ cat k
:k: k1 :v: {XYZ:{id:
:k: k1 :v: {XYZ:{id:
:k: k1 :v: {XYZ:{id:
$ cat k2
:k2: k1 :v: {XYZ:{id:
$ cat k3
:k3: k1 :v: {XYZ:{id:

Depending on how many keys you have, you may not need the close(fName), but if you don't want to spend time testing what your limit on open files is, then this is the safe way to do the process.
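Since the question says Python is also fine, here is a minimal sketch of the same idea in Python, not a tuned implementation. It reads the input one line at a time, so the 2 GB file is never loaded into memory, and it opens each output file in append mode and closes it again after every write, which mirrors the `>>` plus close(fName) pattern above. The function name `split_by_first_field` and the `out_dir` parameter are made up for illustration. Unlike the awk one-liner, this sketch skips blank lines, and it assumes the first field never contains a path separator.

```python
import os


def split_by_first_field(in_path, out_dir="."):
    """Append each line of in_path to a file named after its first field.

    Colons are stripped from the first field, so ":k:" becomes the file "k".
    (Hypothetical helper; the name and the out_dir parameter are illustrative.)
    """
    with open(in_path, "r", encoding="utf-8") as src:
        for line in src:  # the file object yields one line at a time
            fields = line.split()
            if not fields:
                continue  # blank line: nothing to name the output file after
            name = fields[0].replace(":", "")
            if not name:
                continue  # first field was only colons
            # Open in append mode and close right away (via the with block),
            # so only one output file is ever open at any moment.
            with open(os.path.join(out_dir, name), "a", encoding="utf-8") as dst:
                dst.write(line.rstrip("\n") + "\n")
```

Calling `split_by_first_field("inFile")` on the sample above should produce the same three files k, k2 and k3 in the current directory. Reopening the output file for every line is the safe choice with respect to the open-file limit, but it is likely to be slower than keeping handles open, so it is worth timing on a slice of the real data first.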

IHTH
