0

Trying to use Open Refine to analyze a data set of messy JSON strings (40k lines), however due to JSONs' nature of being unordered, some of the lines of JSON objects were mixed up when returned and recorded to a file.

Some objects are missing keys, some objects have incorrect order. Example:

1   {"about":"foo", "category":"bar", "id":"123", "cat_list": ["category1":"foo2"]}
2   {"id":"22","about":"barFoo", "category":"NotABar"}
3   {"about":"barbar", "category":"website", "id":"3333", "cat_list": ["category1":"foo22"]}
....
....
....
40,000 {"about":"bar123", "category":"publish", "id":"3323", "cat_list": ""}

ISSUE:

Importing the data into Open Refine, the program asks for a specific schema to compare to when it reads the file. It then reads the supplied file, comparing each JSON object on the line to the schema and imports or discards depending on how well it matches the schema! As a result many entries are left out!

IDEALLY:

Using Python, I would like to reorder the JSON objects to a specific schema which I specify.

Example:

Specified Schema

{"about":"", "category":"", "id":"", "cat_list": ""}

Which would then rearrange each line of JSON and its key-values to be in this specific format:

1   {"about": ....
2   {"about": ....
3   {"about": ....
....
....
....
40,000 {"about": ....

I am not entirely sure how I can do this efficiently?

EDIT:

I decided to just write a script to organize this. I removed some of the complex fields and have a full .JSON file:

{"name":"Carstar Bridgewater", 
"category":"Automotive", 
"about":"We are Bridgewaters largest professional collision centre and are committed to being there for customer cars and communities when they need us.", 
"country":"Canada", 
"state":"NS", 
"city":"Bridgewater
"}, 
{"name":"Febreze", 
"category":"Product/Service
", 
"about":"Freshness that eliminates odorsso you can breathe happy.", 
"country":"Added Nothing", 
"state":"Added Nothing", 
"city":"Added Nothing"},
{"name":"Custom Wood & Acrylic Turnings", 
"category":"Professional Services", 
"about":"Hand crafted item turned on a wood lath pen pencil bottle stopper cork screw bottle opener perfume applicator or other custom turnings", 
"country":"Canada", 
"state":"NS
", 
"city":"Middle Sackville"},
{"name":"The Hunger Games", 
"category":"Movie
", 
"about":"THE HUNGER GAMES: MOCKINGJAY - PART 1 - In theatres November 2 2014. www.hungergamesmovie.ca", 
"country":"Added Nothing", 
"state":"Added Nothing", 
"city":"Added Nothing"},

Yet. Google-Refine still refuses to accept my file? What is it that I am doing incorrectly?

aquaflamingo
  • 792
  • 6
  • 17
  • Objects have no inherent order in JSON, only arrays do. – Barmar Jun 14 '16 at 19:04
  • Your `cat_list` values are not valid JSON. Arrays can't contain `key:value` pairs like that. And on line 40,000, the value is a string instead of an array, which probably violates the schema. I think the problems you're having are related to these problems, not the order of elements in the objects. – Barmar Jun 14 '16 at 19:10
  • As @Barmar says, your problem might not be ordering related. ... but if you're using the plain-old 'json' module, then it simply orders the keys in the order that dict.items() / dict.iteritems() provides unless you tell it to sort. You could use a collections.OrderedDict which 'remembers' the insertion order, or make a dict wrapper which returns the keys in the order you desire. – Wuggy Jun 14 '16 at 19:20
  • @bramar thanks for replying, the examples given were not my real data apologizes, was just being a bit lazy ! – aquaflamingo Jun 14 '16 at 19:27
  • Trying to recreate the issue you have importing this to OpenRefine and I can't. Imports OK if I use (based on a corrected version of your JSON): [ { "about": "foo", "category": "bar", "id": "123", "cat_list": [ "category1" ] }, { "id": "22", "about": "barFoo", "category": "NotABar" }, { "about": "barbar", "category": "website", "id": "3333", "cat_list": [ "category1" ] }, { "about": "bar123", "category": "publish", "id": "3323", "cat_list": "" } ] – Owen Stephens Jun 15 '16 at 08:55

2 Answers2

0

"Importing the data into Open Refine, the program asks for a specific schema to compare to when it reads the file."

This sounds like it accidentally detected it as XML rathar than as JSON or even Lines.

However, You can choose which importer you wish to use (like Line based or JSON), not just the auto-picked importer that OpenRefine tries to guess at and sometimes gets wrong.

To my eyes, it looks like you might be dealing with the new upcoming "JSON Lines" or "newline-delimited JSON" format such as documented here: http://jsonlines.org/

We have an issue open to add JSON Lines support to OpenRefine eventually: https://github.com/OpenRefine/OpenRefine/issues/1135

In the meantime, look at the section On the Web at the jsonlines.org site to get find tooling support to help you with your needs.

Thad Guidry
  • 579
  • 4
  • 8
0

Not sure if you resolved this.

The JSON needs to be valid before it can be successfully imported - at the moment the text you've posted in the Q above doesn't validate with a tool such as http://jsonlint.com.

The issue you have in terms of importing this to OpenRefine (aka Google Refine) is that the JSON objects have to be in an array:

[{"name":"Carstar Bridgewater", 
"category":"Automotive", 
"about":"We are Bridgewaters largest professional collision centre and are committed to being there for customer cars and communities when they need us.", 
"country":"Canada", 
"state":"NS", 
"city":"Bridgewater"},
{"name":"Febreze", 
"category":"Product/Service", 
"about":"Freshness that eliminates odorsso you can breathe happy.", 
"country":"Added Nothing", 
"state":"Added Nothing", 
"city":"Added Nothing"},
{"name":"Custom Wood & Acrylic Turnings", 
"category":"Professional Services", 
"about":"Hand crafted item turned on a wood lath pen pencil bottle stopper cork screw bottle opener perfume applicator or other custom turnings", 
"country":"Canada", 
"state":"NS", 
"city":"Middle Sackville"},
{"name":"The Hunger Games", 
"category":"Movie", 
"about":"THE HUNGER GAMES: MOCKINGJAY - PART 1 - In theatres November 2 2014. www.hungergamesmovie.ca", 
"country":"Added Nothing", 
"state":"Added Nothing", 
"city":"Added Nothing"}]

I can successfully import this JSON as posted here into OpenRefine it works fine - screenshots:

enter image description here enter image description here

Owen Stephens
  • 1,550
  • 1
  • 8
  • 10
  • That's really weird, but I gave you the answer even though I decided to write a tool to convert my big json file to csv :) – aquaflamingo Jul 08 '16 at 20:18