I have a file that's quite large with entries that look like this:
{
"_id": {
"$oid": "572a5b93ae5174d3c4177da3"
},
"email": "removed@gmail.com",
"gender": "F",
"zip": "32934",
"state": "FL",
"city": "EAU GALLIE",
"address1": "removed",
"last_name": "removed",
"first_name": "removed",
"updatedAt": {
"$date": "2016-05-04T20:29:02.061Z"
},
"__v": 0,
"createdAt": {
"$date": "2016-05-04T20:28:54.948Z"
}
}
{
"_id": {
"$oid": "57a49bed913aebc7257145b9"
},
"email": "removed@gmail.com",
"dob": "11/06/1996",
"gender": "F",
"zip": "SN14 8BZ",
"address1": "removed",
"last_name": "removed",
"first_name": "removed",
"updatedAt": {
"$date": "2016-08-16T23:53:30.161Z"
},
"__v": 0,
"createdAt": {
"$date": "2016-08-05T14:00:13.130Z"
}
}
{
"_id": {
"$oid": "57a49bed913aebc7257145d3"
},
"email": "removed@netzero.net",
"zip": "NULL",
"state": "NULL",
"city": "NULL",
"address1": "NULL",
"last_name": "removed",
"first_name": "removed",
"updatedAt": {
"$date": "2016-08-05T14:00:13.467Z"
},
"__v": 0,
"createdAt": {
"$date": "2016-08-05T14:00:13.467Z"
}
}
{
"_id": {
"$oid": "57ab71379f7474b50eef976d"
},
"updatedAt": {
"$date": "2016-08-16T23:40:55.851Z"
},
"createdAt": {
"$date": "2016-08-10T18:23:51.177Z"
},
"email": "removed@hotmail.co.uk",
"ip": "0.0.0.0",
"first_name": "removed",
"last_name": "removed",
"address1": "removed",
"city": "",
"state": "",
"zip": "removed",
"gender": "F",
"__v": 0,
"dob": "03/01/1973"
}
{
"_id": {
"$oid": "57ab7137913aebc725194a20"
},
"email": "removed@gmail.com",
"job": "DeliveryDriver",
"zip": "24401",
"state": "VA",
"city": "FISHERSVILLE",
"updatedAt": {
"$date": "2016-09-16T12:45:50.984Z"
},
"__v": 0,
"createdAt": {
"$date": "2016-08-10T18:23:50.813Z"
},
"gender": "M",
"last_name": "removed",
"first_name": "removed"
}
The fields are not in any particular order. I have obviously removed the names, addresses, IPs, and emails for privacy reasons. The lines are all over the place, and there are more than 20M of them.
How can I parse this properly? I'm looking to extract only Email, IP, Phone number, Name (First and Last), and Address (Zip, Address1, Address2, City).
Some of these entries have only email and IP, some have email, IP, and name, some have email, name, and address, and so on, including some with every field. (They all have some junk data such as the OID, created and updated dates, gender, etc.)
What would be the best way to parse this? I've been trying for a while now, and I know it's been done. Thank you!
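For illustration, here is a rough sketch of one possible approach in Python, not a definitive solution. It assumes the file is a stream of back-to-back JSON objects (as in the sample above, with no enclosing array and no commas between entries) and reads it in chunks so the whole file never has to fit in memory. The key names "phone" and "address2" are guesses, since neither appears in the sample, and treating "NULL" and empty strings as missing is also an assumption. The helper names iter_objects and extract are made up for this example.

```python
import json

# Keys to keep. "phone" and "address2" are assumed names; they do not
# appear in the sample data and may be spelled differently in the real file.
WANTED = ("email", "ip", "phone", "first_name", "last_name",
          "address1", "address2", "city", "zip")


def iter_objects(fp, chunk_size=1 << 20):
    """Yield each top-level JSON object from a stream of concatenated objects."""
    decoder = json.JSONDecoder()
    buf = ""
    while True:
        chunk = fp.read(chunk_size)
        buf += chunk
        pos = 0
        while True:
            # Skip whitespace between objects.
            while pos < len(buf) and buf[pos].isspace():
                pos += 1
            if pos >= len(buf):
                break
            try:
                obj, pos = decoder.raw_decode(buf, pos)
            except json.JSONDecodeError:
                if not chunk:
                    raise  # end of file: the data is truncated or malformed
                break      # object is incomplete so far; read another chunk
            yield obj
        buf = buf[pos:]    # keep only the unparsed tail
        if not chunk:
            break


def extract(record):
    """Return only the wanted fields, dropping empty and "NULL" values."""
    out = {}
    for key in WANTED:
        value = record.get(key)
        # Assumption: "NULL" and blank strings mean "no data".
        if isinstance(value, str) and value.strip() and value != "NULL":
            out[key] = value
    return out
```

Usage would look something like opening the file with open(path, encoding="utf-8"), looping over iter_objects(fp), and writing each extract(record) out (for example with csv.DictWriter using WANTED as the field names). Note that a malformed entry in the middle of the file would make this sketch keep buffering until the end of the file before raising, so real data may need extra error handling.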