How to extract values from a JSON file? Is regex a solution?

Question

I have a file that's quite large with entries that look like this:

{
  "_id": {
    "$oid": "572a5b93ae5174d3c4177da3"
  },
  "email": "removed@gmail.com",
  "gender": "F",
  "zip": "32934",
  "state": "FL",
  "city": "EAU GALLIE",
  "address1": "removed",
  "last_name": "removed",
  "first_name": "removed",
  "updatedAt": {
    "$date": "2016-05-04T20:29:02.061Z"
  },
  "__v": 0,
  "createdAt": {
    "$date": "2016-05-04T20:28:54.948Z"
  }
}
{
  "_id": {
    "$oid": "57a49bed913aebc7257145b9"
  },
  "email": "removed@gmail.com",
  "dob": "11/06/1996",
  "gender": "F",
  "zip": "SN14 8BZ",
  "address1": "removed",
  "last_name": "removed",
  "first_name": "removed",
  "updatedAt": {
    "$date": "2016-08-16T23:53:30.161Z"
  },
  "__v": 0,
  "createdAt": {
    "$date": "2016-08-05T14:00:13.130Z"
  }
}
{
  "_id": {
    "$oid": "57a49bed913aebc7257145d3"
  },
  "email": "removed@netzero.net",
  "zip": "NULL",
  "state": "NULL",
  "city": "NULL",
  "address1": "NULL",
  "last_name": "removed",
  "first_name": "removed",
  "updatedAt": {
    "$date": "2016-08-05T14:00:13.467Z"
  },
  "__v": 0,
  "createdAt": {
    "$date": "2016-08-05T14:00:13.467Z"
  }
}
{
  "_id": {
    "$oid": "57ab71379f7474b50eef976d"
  },
  "updatedAt": {
    "$date": "2016-08-16T23:40:55.851Z"
  },
  "createdAt": {
    "$date": "2016-08-10T18:23:51.177Z"
  },
  "email": "removed@hotmail.co.uk",
  "ip": "0.0.0.0",
  "first_name": "removed",
  "last_name": "removed",
  "address1": "removed",
  "city": "",
  "state": "",
  "zip": "removed",
  "gender": "F",
  "__v": 0,
  "dob": "03/01/1973"
}
{
  "_id": {
    "$oid": "57ab7137913aebc725194a20"
  },
  "email": "removed@gmail.com",
  "job": "DeliveryDriver",
  "zip": "24401",
  "state": "VA",
  "city": "FISHERSVILLE",
  "updatedAt": {
    "$date": "2016-09-16T12:45:50.984Z"
  },
  "__v": 0,
  "createdAt": {
    "$date": "2016-08-10T18:23:50.813Z"
  },
  "gender": "M",
  "last_name": "removed",
  "first_name": "removed"
}

The fields are in no particular order. I removed names, addresses, IPs, and emails for privacy reasons. The lines are all over the place, and there are more than 20M of them.

How can I parse this properly? I'm looking to extract only Email, IP, Phone number, Name (First and Last) and Address (Zip, Address1, Address2, City).

Some of these records only have Email and IP, some have Email, IP and Name, some have Email, Name and Address, and so on, including some with all the fields. They all have some junk data such as OID, created and updated dates, gender, etc.

What would be the best way of parsing this? I've been trying for a while now and I know it's been done. Thank you!

I agree with Randall's comment, you do not parse JSON with regex! — Nir Alfasi, Feb 15 '18 at 22:09
https://stackoverflow.com/questions/48817992/jq-c-compact-output-not-working-properly-json-parsing — RayCrush, Feb 16 '18 at 02:32

Gilles Quénot · Accepted Answer · 2018-02-15T23:10:43.993

Don't try to parse JSON with regex; use jq instead.

It's cross-platform.

Example (adapt the command to your needs):

$ jq '(.email, .first_name, .last_name)' file.json

Output:

"removed@gmail.com"
"removed"
"removed"
"removed@gmail.com"
"removed"
"removed"
"removed@netzero.net"
"removed"
"removed"
"removed@hotmail.co.uk"
"removed"
"removed"
"removed@gmail.com"
"removed"
"removed"

Check https://stedolan.github.io/jq/

Or you can use Node.js and JavaScript code.
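
If jq is not at hand, the same extraction can be sketched in Python. This is only an illustrative sketch, not part of the original answer: it assumes the file is a run of JSON objects written back to back, as in the question, and the helper names `iter_json_objects` and `pick_fields` are hypothetical. It relies on the standard library's `json.JSONDecoder.raw_decode`, which parses one value from a string and returns the index where that value ended.

```python
import json

# Field names taken from the sample records in the question; "phone" and
# "address2" are assumed spellings for fields the sample does not show.
FIELDS = ["email", "ip", "phone", "first_name", "last_name",
          "address1", "address2", "city", "zip"]


def iter_json_objects(text):
    """Yield each top-level JSON value from a string of concatenated values."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so skip it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def pick_fields(record, fields=FIELDS):
    """Return only the wanted fields of one record; missing ones become ""."""
    return {name: record.get(name, "") for name in fields}


if __name__ == "__main__":
    sample = '''
    {"_id": {"$oid": "572a5b93ae5174d3c4177da3"},
     "email": "removed@gmail.com", "zip": "32934", "__v": 0}
    {"_id": {"$oid": "57ab71379f7474b50eef976d"},
     "email": "removed@hotmail.co.uk", "ip": "0.0.0.0", "city": ""}
    '''
    for rec in iter_json_objects(sample):
        print(pick_fields(rec))
```

The sketch holds the whole text in memory for brevity; for a file with more than 20M lines, reading and decoding in chunks would probably be needed.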

edited Feb 15 '18 at 23:10

answered Feb 15 '18 at 22:09

Thank you, it seems to be working. This is what I came up with: jq '(.email, .first_name, .last_name, .ip, .address, .address1, .address2, .city, .zip, .state, .phone)' file.json > file2.json but it's exporting each column on a new line. How can I get it all on one line? Instead of "removed@gmail.com", "removed" and "removed" each on its own line, I want it to look like: "removed@gmail.com" "removed" "removed". I know the option --compact-output, but this is my first time using it and I don't know how to make it work; no matter where I put it, it still outputs newlines. – RayCrush Feb 15 '18 at 23:05
I have done the main job; I'll leave you with the `jq` documentation. – Gilles Quénot Feb 15 '18 at 23:09
I know, I checked the documentation; I'm just asking where I'd put the option --compact-output. I've tried everything and I'm unsure if I'm understanding this correctly. – RayCrush Feb 15 '18 at 23:10
That's how I'm doing it, but it's still saving each column on a new line rather than compact :/ Is this a bug? Here's the full command: "jq --compact-output '(.email, .first_name, .last_name, .ip, .address, .address1, .address2, .city, .zip, .state, .phone)' file.json > file2.json" – RayCrush Feb 15 '18 at 23:13
Ahh, okay, seems to be a bug – RayCrush Feb 15 '18 at 23:14
No, it's logical: what you have in the output is no longer JSON, just strings. If you need to replace newlines with spaces, then use: `jq ... | tr $'\n' ' '` (if you are using bash) – Gilles Quénot Feb 15 '18 at 23:15
Still not understanding, I just wanted it implemented in my command, but it's fine. I'll just google it. I won't ever use JSON, so there's no point in my spending hours learning how to use a program for one-time use. Thanks for your time :) – RayCrush Feb 15 '18 at 23:18
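
The one-line-per-record output asked about in these comments can also be sketched outside jq. The following Python sketch is illustrative only and was not proposed in the thread: it assumes the records have already been parsed into dictionaries, and `records_to_tsv` and the `COLUMNS` list are hypothetical names. Each record becomes one tab-separated line, with an empty column wherever a field is missing.

```python
import csv
import io

# Column order is an assumption; adjust it to the fields you need.
COLUMNS = ["email", "first_name", "last_name", "ip", "city", "zip"]


def records_to_tsv(records, columns=COLUMNS):
    """Return tab-separated text with exactly one line per record."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    for rec in records:
        # Missing or null fields become empty strings so columns stay aligned.
        writer.writerow(
            ["" if rec.get(col) is None else rec[col] for col in columns]
        )
    return buf.getvalue()


if __name__ == "__main__":
    records = [
        {"email": "removed@gmail.com", "first_name": "removed",
         "last_name": "removed", "zip": "32934"},
        {"email": "removed@hotmail.co.uk", "ip": "0.0.0.0"},
    ]
    print(records_to_tsv(records), end="")
```

Within jq itself, the usual route to the same result is to collect the fields into an array and format it, for example `jq -r '[.email, .first_name, .last_name] | @tsv' file.json`; unlike piping through `tr`, this keeps one line per record instead of merging the whole file into a single line.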
