0

i have JSON data on user profiles that i want to eventually analyze with SPSS. Currently i imported the data in Google Refine, to run some data cleansing. My problem is however that the original JSON consists of nested objects, namely e.g. the "professional_experience" section with "companies", that includes several sub objects/arrays (see example). Google refine handles this by creating additional rows with that information. This is however in no way consistent with a "relational" (in terms of SQL) view/table structure that i'd need to analyze the data with SPSS or Excel or whatever, as there are other sub-objects (schools, awards, etc.) which are also "stupidly" filled in the rows below the high level "main" record, but do not have a direct (row/column wise) relationship to one another (considering analysis).

As i see it i would need to extract those (sub object) columns and rows to an own table and create some n:m relationship, or at least normalize it into ONE table (then of course with accepting the redundancies of the other unnested attributes of course).

What I want to end up with is one consistent table to run statistical analysis/clustering on certain attributes. I assume map reduce is not really an option here.

Does anyone of you have an idea on how to handle this issue or is there maybe an easier way directly to work on the JSON data?

{ "users": [
{
  "id": "123456_abcdef",
  "first_name": "Max",
  "last_name": "Mustermann",
  "display_name": "Max Mustermann",
  "page_name": "Max_Mustermann",
  "permalink": "https://www.xing.com/profile/Max_Mustermann",
  "employment_status": "EMPLOYEE",
  "gender": "m",
  "birth_date": {
    "day": 12,
    "month": 8,
    "year": 1963
  },
  "active_email": "max.mustermann@xing.com",
  "time_zone": {
    "name": "Europe/Copenhagen",
    "utc_offset": 2.0
  },
  "premium_services": [
    "SEARCH",
    "PRIVATEMESSAGES"
  ],
  "badges": [
    "PREMIUM",
    "MODERATOR"
  ],
  "wants": "einen neuen Job",
  "haves": "viele tolle Skills",
  "interests": "Flitzebogen schießen and so on",
  "organisation_member": "ACM, GI",
  "languages": {
    "de": "NATIVE",
    "en": "FLUENT",
    "fr": null,
    "zh": "BASIC"
  },
  "private_address": {
    "city": "Hamburg",
    "country": "DE",
    "zip_code": "20357",
    "street": "Privatstraße 1",
    "phone": "49|40|1234560",
    "fax": "||",
    "province": "Hamburg",
    "email": "max@mustermann.de",
    "mobile_phone": "49|0155|1234567"
  },
  "business_address": {
    "city": "Hamburg",
    "country": "DE",
    "zip_code": "20357",
    "street": "Geschäftsstraße 1a",
    "phone": "49|40|1234569",
    "fax": "49|40|1234561",
    "province": "Hamburg",
    "email": "max.mustermann@xing.com",
    "mobile_phone": "49|160|66666661"
  },
  "web_profiles": {
    "qype": [
      "http://qype.de/users/foo"
    ],
    "google+": [
      "http://plus.google.com/foo"
    ],
    "other": [
      "http://blog.example.org"
    ],
    "homepage": [
      "http://example.org",
      "http://other-example.org"
    ]
  },
  "instant_messaging_accounts": {
    "skype": "1122334455",
    "googletalk": "max.mustermann"
  },
  "professional_experience": {
    "primary_company": {
      "id": "1_abcdef",
      "name": "XING AG",
      "title": "Softwareentwickler",
      "company_size": "201-500",
      "tag": null,
      "url": "http://www.xing.com",
      "career_level": "PROFESSIONAL_EXPERIENCED",
      "begin_date": "2010-01",
      "description": null,
      "end_date": null,
      "industry": "AEROSPACE",
      "form_of_employment": "FULL_TIME_EMPLOYEE",
      "until_now": true
    },
    "companies": [
      {
        "id": "1_abcdef",
        "name": "XING AG",
        "title": "Softwareentwickler",
        "company_size": "201-500",
        "tag": null,
        "url": "http://www.xing.com",
        "career_level": "PROFESSIONAL_EXPERIENCED",
        "begin_date": "2010-01",
        "description": null,
        "end_date": null,
        "industry": "AEROSPACE",
        "form_of_employment": "FULL_TIME_EMPLOYEE",
        "until_now": true
      },
      {
        "id": "24_abcdef",
        "name": "Ninja Ltd.",
        "title": "DevOps",
        "company_size": null,
        "tag": "NINJA",
        "url": "http://www.ninja-ltd.co.uk",
        "career_level": null,
        "begin_date": "2009-04",
        "description": null,
        "end_date": "2010-07",
        "industry": "ALTERNATIVE_MEDICINE",
        "form_of_employment": "OWNER",
        "until_now": false
      },
      {
        "id": "45_abcdef",
        "name": null,
        "title": "Wiss. Mitarbeiter",
        "company_size": null,
        "tag": "OFFIS",
        "url": "http://www.uni.de",
        "career_level": null,
        "begin_date": "2007",
        "description": null,
        "end_date": "2008",
        "industry": "APPAREL_AND_FASHION",
        "form_of_employment": "PART_TIME_EMPLOYEE",
        "until_now": false
      },
      {
        "id": "176_abcdef",
        "name": null,
        "title": "TEST NINJA",
        "company_size": "201-500",
        "tag": "TESTCOMPANY",
        "url": null,
        "career_level": "ENTRY_LEVEL",
        "begin_date": "1998-12",
        "description": null,
        "end_date": "1999-05",
        "industry": "ARTS_AND_CRAFTS",
        "form_of_employment": "INTERN",
        "until_now": false
      }
    ],
    "awards": [
      {
        "name": "Awesome Dude Of The Year",
        "date_awarded": 2007,
        "url": null
      }
    ]
  },
  "educational_background": {
    "degree": "MSc CE/CS",
    "primary_school": {
      "id": "42_abcdef",
      "name": "Carl-von-Ossietzky Universtät Schellenburg",
      "degree": "MSc CE/CS",
      "notes": null,
      "subject": null,
      "begin_date": "1998-08",
      "end_date": "2005-02"
    },
    "schools": [
      {
        "id": "42_abcdef",
        "name": "Carl-von-Ossietzky Universtät Schellenburg",
        "degree": "MSc CE/CS",
        "notes": null,
        "subject": null,
        "begin_date": "1998-08",
        "end_date": "2005-02"
      }
    ],
    "qualifications": [
      "TOEFLS",
      "PADI AOWD"
    ]
  }
}

] }

pnuts
  • 58,317
  • 11
  • 87
  • 139
kreck
  • 68
  • 4

2 Answers2

0

I am not sure to correctly understand the format you want to create but I can provide two leads you can start exploring on:

Using fill down

OpenRefine let you easily fill down within records, so your information stored in a top master rows is carried over multiple rows withing each record. Once you fill down field you need, the user (id) for example, go back in row mode and by using the facet, export only a subset of your row you need in each table of your database.

Template export

The template option let you export your data in a XML / JSON format so you can nest element. However this feature doesn't support one to many relationship.

Let us know if this help.

magdmartin
  • 1,712
  • 3
  • 20
  • 43
  • Thank you very much for your reply! I commented under Thads post. My problem really goes down to the fact that array data is pushed into new rows (in the recordset) which then "mimics" a relation (e.g. between schools and employers) that is semantically not there. What i'd need is a transformation to a relation-like model that connects those array items to the user (id) but not to one another by putting them in rows. If any of you has any suggestions, that would be great. In the meantime i'll try to handle it with VBA/Excel. But as this won't be the last time i need to do this ideas are apprec – kreck Nov 08 '14 at 15:23
  • ok I edited my answer with more details. let me know if this help – magdmartin Nov 09 '14 at 14:45
0

You should be able to import the JSON, using the gui, select just under the user{} object (not on the user object itself), so that you have some fairly clean "record" rows that also hold the header information for each record row, such as the "__professional experience", etc.

Download and import this project with your example data...make sure to change the grid view to RECORD mode. (top left corner of grid)

Example OpenRefine Project showing JSON imported as RECORD rows: https://drive.google.com/file/d/0B533WzlrxWraQnF0NHN4anpFNHM/view?usp=sharing

From there you can export and change the template how you wish, even selecting only the Professional Experience rows, if you want, along with the users id.

If you need to deal with importing JSON record rows in bulk or from a data endpoint or web service, then I would suggest downloading and using the community versions of Pentaho, or Talend ETL to handle this for you....but you can certainly use OpenRefine for the initial inspection and alignment. (My preference is Pentaho, which has excellent live right click previewing of record rows within a transformation and Martin prefers Talend).

Thad Guidry
  • 579
  • 4
  • 8
  • Thanks a lot for your reply! Pentaho and Talend are really useful, however i can't quite figure out how to use them with my problem. What i'm doing now is writing a VBA script that "normalizes" the data! However i'm really interested if there is a tool that offers out-of-the box transformation from JSON to a relational model. Precisely: i have those array fields, which are added as new rows in refine, however from a semantic standpoint this makes no sense, as e.g. Schools and Employers are put in one "extra row" despite the fact that they are only related to the user but not to one another.... – kreck Nov 08 '14 at 15:17
  • Yes, that's what you can use Pentaho and Talend for. Get some training or ask on their forums. No free Pizza. Best of Luck. – Thad Guidry Nov 09 '14 at 16:31