
I have a file that's quite large with entries that look like this:

{
  "_id": {
    "$oid": "572a5b93ae5174d3c4177da3"
  },
  "email": "removed@gmail.com",
  "gender": "F",
  "zip": "32934",
  "state": "FL",
  "city": "EAU GALLIE",
  "address1": "removed",
  "last_name": "removed",
  "first_name": "removed",
  "updatedAt": {
    "$date": "2016-05-04T20:29:02.061Z"
  },
  "__v": 0,
  "createdAt": {
    "$date": "2016-05-04T20:28:54.948Z"
  }
}
{
  "_id": {
    "$oid": "57a49bed913aebc7257145b9"
  },
  "email": "removed@gmail.com",
  "dob": "11/06/1996",
  "gender": "F",
  "zip": "SN14 8BZ",
  "address1": "removed",
  "last_name": "removed",
  "first_name": "removed",
  "updatedAt": {
    "$date": "2016-08-16T23:53:30.161Z"
  },
  "__v": 0,
  "createdAt": {
    "$date": "2016-08-05T14:00:13.130Z"
  }
}
{
  "_id": {
    "$oid": "57a49bed913aebc7257145d3"
  },
  "email": "removed@netzero.net",
  "zip": "NULL",
  "state": "NULL",
  "city": "NULL",
  "address1": "NULL",
  "last_name": "removed",
  "first_name": "removed",
  "updatedAt": {
    "$date": "2016-08-05T14:00:13.467Z"
  },
  "__v": 0,
  "createdAt": {
    "$date": "2016-08-05T14:00:13.467Z"
  }
}
{
  "_id": {
    "$oid": "57ab71379f7474b50eef976d"
  },
  "updatedAt": {
    "$date": "2016-08-16T23:40:55.851Z"
  },
  "createdAt": {
    "$date": "2016-08-10T18:23:51.177Z"
  },
  "email": "removed@hotmail.co.uk",
  "ip": "0.0.0.0",
  "first_name": "removed",
  "last_name": "removed",
  "address1": "removed",
  "city": "",
  "state": "",
  "zip": "removed",
  "gender": "F",
  "__v": 0,
  "dob": "03/01/1973"
}
{
  "_id": {
    "$oid": "57ab7137913aebc725194a20"
  },
  "email": "removed@gmail.com",
  "job": "DeliveryDriver",
  "zip": "24401",
  "state": "VA",
  "city": "FISHERSVILLE",
  "updatedAt": {
    "$date": "2016-09-16T12:45:50.984Z"
  },
  "__v": 0,
  "createdAt": {
    "$date": "2016-08-10T18:23:50.813Z"
  },
  "gender": "M",
  "last_name": "removed",
  "first_name": "removed"
}

The entries are in no particular order. I obviously removed names, addresses, IPs, and emails for privacy reasons. The fields are all over the place, and there are more than 20M lines.

How can I parse this properly? I'm looking to extract only Email, IP, Phone number, Name (First and Last), and Address (Zip, Address1, Address2, City).

Some of these entries only have email and IP, some have email, IP, and name, some have email, name, and address, and so on, including some with all the fields (they all have some junk data like OID, created and updated dates, gender, etc.).

What would be the best way to parse this? I've been trying for a while now and I know it's been done. Thank you!

Gilles Quénot
RayCrush

1 Answer

Don't try to parse this by hand; instead, try `jq`, a dedicated command-line JSON processor.

It's cross-platform.

Example (adapt the command to your needs):

$ jq '(.email, .first_name, .last_name)' file.json

Output:

"removed@gmail.com"
"removed"
"removed"
"removed@gmail.com"
"removed"
"removed"
"removed@netzero.net"
"removed"
"removed"
"removed@hotmail.co.uk"
"removed"
"removed"
"removed@gmail.com"
"removed"
"removed"

Check https://stedolan.github.io/jq/

Or you can use a programming language with a JSON library and write the extraction code yourself.
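For the write-your-own-code route, here is a minimal sketch in Python (my choice of language; the thread does not name one). It assumes the file is a sequence of JSON objects written back to back, as in the sample, and that fields not shown in the sample (`phone`, `address2`) follow the same naming scheme. The helper names `iter_objects` and `extract` are made up for illustration. It reads the input in chunks so a 20M-line file never has to fit in memory, and writes one tab-separated line per record, leaving missing fields empty.

```python
import csv
import io
import json

# Fields requested in the question. "phone" and "address2" do not appear in
# the sample data, so their names are an assumption.
FIELDS = ["email", "ip", "phone", "first_name", "last_name",
          "address1", "address2", "city", "state", "zip"]


def iter_objects(stream, chunk_size=1 << 16):
    """Yield each top-level JSON object from a stream of concatenated objects.

    Assumes every top-level value is an object ({...}), as in the sample.
    """
    decoder = json.JSONDecoder()
    buf = ""
    while True:
        chunk = stream.read(chunk_size)
        buf = (buf + chunk).lstrip()
        while buf:
            try:
                obj, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                if not chunk:
                    raise  # end of input: data is truncated or malformed
                break      # object is incomplete so far: read another chunk
            yield obj
            buf = buf[end:].lstrip()
        if not chunk:
            return


def extract(stream, out):
    """Write one tab-separated line per record, in FIELDS order."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for record in iter_objects(stream):
        writer.writerow([record.get(name, "") for name in FIELDS])


# Usage sketch (file names are placeholders):
#   with open("file.json", encoding="utf-8") as src, \
#        open("out.tsv", "w", newline="", encoding="utf-8") as out:
#       extract(src, out)
```

Placeholder values such as `"NULL"` or `""` in the source are passed through unchanged; filter them in `extract` if you want them treated as missing.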

Gilles Quénot
  • Thank you, it seems to be working. This is what I came up with: `jq '(.email, .first_name, .last_name, .ip, .address, .address1, .address2, .city, .zip, .state, .phone)' file.json > file2.json`, but it's exporting each column as a new line. How can I get it all on one line? Instead of `"removed@gmail.com"`, `"removed"`, `"removed"` each on its own line, I want it to look like `"removed@gmail.com" "removed" "removed"`. I know about the `--compact-output` option, but this is my first time using it and I don't know how to make it work; no matter where I put it, it still outputs newlines. – RayCrush Feb 15 '18 at 23:05
  • I've done the main job; I'll leave you with the `jq` documentation. – Gilles Quénot Feb 15 '18 at 23:09
  • I know, I checked the documentation; I'm just asking where I'd put the `--compact-output` option. I've tried everything and I'm unsure if I'm understanding this correctly. – RayCrush Feb 15 '18 at 23:10
  • That's how I'm doing it, but it's still saving each column as a new line rather than compact :/ Is this a bug? Here's the full command: `jq --compact-output '(.email, .first_name, .last_name, .ip, .address, .address1, .address2, .city, .zip, .state, .phone)' file.json > file2.json` – RayCrush Feb 15 '18 at 23:13
  • ahh, okay, seems to be a bug – RayCrush Feb 15 '18 at 23:14
  • No, it's logical: what you have in the output is no longer JSON objects, just separate strings. If you need to replace newlines with spaces, then use: `jq ... | tr $'\n' ' '` (if you are using bash) – Gilles Quénot Feb 15 '18 at 23:15
  • Still not understanding, I just wanted it implemented in my command, but it's fine. I'll just google it; I won't ever use JSON again, so there's no point in spending hours learning a program for one-time use. Thanks for your time :) – RayCrush Feb 15 '18 at 23:18