Unable to parse JSON file, keep getting ValueError: Extra Data

Question

So, leading on from my prior issue [found here][1], I'm attempting to parse a JSON file that I've managed to download with @SiHa's help. The JSON is structured like so:

{"properties": [{"property": "name", "value": "A random company name"}, {"property": "companyId", "value": 123456789}]}{"properties": [{"property": "name", "value": "Another random company name"}, {"property": "companyId", "value": 31415999}]}{"properties": [{"property": "name", "value": "Yet another random company"}, {"property": "companyId", "value": 10101010}]}

I've been able to get this by slightly modifying @SiHa's code:

def get_companies():
            create_get_recent_companies_call = "https://api.hubapi.com/companies/v2/companies/?hapikey={hapikey}".format(hapikey=wta_hubspot_api_key)
            headers = {'content-type': 'application/json'}
            create_get_recent_companies_response = requests.get(create_get_recent_companies_call, headers=headers)
            if create_get_recent_companies_response.status_code == 200:
                while True:
                    for i in create_get_recent_companies_response.json()[u'companies']:

                        all_the_companies = { "properties": [
                                                    { "property": "name", "value": i[u'properties'][u'name'][u'value'] },
                                                    { "property": "companyId", "value": i[u'companyId'] }
                                                ]
                                            }

                        with open("all_the_companies.json", "a") as myfile:
                            myfile.write(json.dumps(all_the_companies))
                        #print(companyProperties)
                    offset = create_get_recent_companies_response.json()[u'offset']
                    hasMore = create_get_recent_companies_response.json()[u'has-more']
                    if not hasMore:
                        break
                    else:
                        create_get_recent_companies_call = "https://api.hubapi.com/companies/v2/companies/?hapikey={hapikey}&offset={offset}".format(hapikey=wta_hubspot_api_key, offset=offset)
                        create_get_recent_companies_response = requests.get(create_get_recent_companies_call, headers=headers)


            else:
                print("Something went wrong, check the supplied field values.\n")
                print(json.dumps(create_get_recent_companies_response.json(), sort_keys=True, indent=4))

So that was part one. Now I'm trying to use the code below to extract two things: 1) the name and 2) the companyId.

#!/usr/bin/python
# -*- coding: utf-8 -*-

import sys
import os.path
import requests
import json
import csv
import glob2
import shutil
import time
import time as howLong
from time import sleep
from time import gmtime, strftime

# Local Testing Version
findCSV = glob2.glob('*contact*.csv')

theDate = time=strftime("%Y-%m-%d", gmtime())
theTime = time=strftime("%H:%M:%S", gmtime())

# Exception handling
try:
    testData = findCSV[0]
except IndexError:
    print ("\nSyncronisation attempted on {date} at {time}: There are no \"contact\" CSVs, please upload one and try again.\n").format(date=theDate, time=theTime)
    print("====================================================================================================================\n")
    sys.exit()

for theCSV in findCSV:

    def process_companies():
        with open('all_the_companies.json') as data_file:
            data = json.load(data_file)
            for i in data:
                company_name = data[i][u'name']
                #print(company_name)
                if row[0].lower() == company_name.lower():
                    contact_company_id = data[i][u'companyId']
                    #print(contact_company_id)
                    return contact_company_id

                else:
                    print("Something went wrong, check the \"get_companies()\" function.\n")
                    print(json.dumps(create_get_recent_companies_response.json(), sort_keys=True, indent=4))

    if __name__ == "__main__":
        start_time = howLong.time()
        process_companies()
        print("This operation took %s seconds.\n" % (howLong.time() - start_time))
        sys.exit()

Unfortunately, it's not working. I'm getting the following traceback:

Traceback (most recent call last):
  File "wta_parse_json.py", line 62, in <module>
    process_companies()
  File "wta_parse_json.py", line 47, in process_companies
    data = json.load(data_file)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 290, in load
    **kw)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 369, in decode
    raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 1 column 130 - line 1 column 1455831 (char 129 - 1455830)

I've made sure that I'm using json.dumps rather than json.dump to write the file, but it's still not working. :(

I've now given up on JSON, and am trying to export a simple CSV with the code below:

    def get_companies():
            create_get_recent_companies_call = "https://api.hubapi.com/companies/v2/companies/?hapikey={hapikey}".format(hapikey=wta_hubspot_api_key)
            headers = {'content-type': 'application/json'}
            create_get_recent_companies_response = requests.get(create_get_recent_companies_call, headers=headers)
            if create_get_recent_companies_response.status_code == 200:
                while True:
                    for i in create_get_recent_companies_response.json()[u'companies']:

                        all_the_companies = "{name},{id}\n".format(name=i[u'properties'][u'name'][u'value'], id=i[u'companyId'])
                        all_the_companies.encode('utf-8')

                        with open("all_the_companies.csv", "a") as myfile:
                            myfile.write(all_the_companies)
                        #print(companyProperties)
                    offset = create_get_recent_companies_response.json()[u'offset']
                    hasMore = create_get_recent_companies_response.json()[u'has-more']
                    if not hasMore:
                        break
                    else:
                        create_get_recent_companies_call = "https://api.hubapi.com/companies/v2/companies/?hapikey={hapikey}&offset={offset}".format(hapikey=wta_hubspot_api_key, offset=offset)
                        create_get_recent_companies_response = requests.get(create_get_recent_companies_call, headers=headers)
  [1]: http://stackoverflow.com/questions/36148346/unable-to-loop-through-paged-api-responses-with-python

But it looks like this isn't right either. Even though I've read up on the formatting issues and have added the .encode('utf-8') call, I still end up getting the following traceback:

Traceback (most recent call last):
  File "wta_get_companies.py", line 78, in <module>
    get_companies()
  File "wta_get_companies.py", line 57, in get_companies
    all_the_companies = "{name},{id}\n".format(name=i[u'properties'][u'name'][u'value'], id=i[u'companyId'])
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 3: ordinal not in range(128)

JSON in example is incorrect - it is 3 json objects merged together. — Eugene Lisitsky, Mar 22 '16 at 15:00
Your input string contains `}{` as a substring, implying two objects' representations are running together. Whether this accounts for your error I can't say, but it's clearly not valid JSON — holdenweb, Mar 22 '16 at 15:00

score 2 · Answer 1 · answered Mar 22 '16 at 15:01

The JSON data has three objects one after the other; simplified:

{ .. }{ .. }{ .. }

That's not something that's supported by the JSON standard. How is Python supposed to parse that? Automatically wrap it in an array? Assign it to three different variables? Just use the first one?
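
If the file has already been written in this concatenated form, one way to read it without editing it by hand is to decode one object at a time with `json.JSONDecoder.raw_decode`, which returns the parsed value together with the index where parsing stopped. A minimal sketch; the helper name `iter_concatenated_json` and the sample string are made up for illustration:

```python
import json


def iter_concatenated_json(text):
    """Yield each top-level JSON value from a string of back-to-back documents."""
    decoder = json.JSONDecoder()
    pos = 0
    length = len(text)
    while pos < length:
        # raw_decode does not skip leading whitespace, so skip it here.
        while pos < length and text[pos].isspace():
            pos += 1
        if pos >= length:
            break
        # Returns the decoded value and the index just past it.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Hypothetical sample in the same shape as the file: objects with no separator.
sample = '{"companyId": 1}{"companyId": 2} {"companyId": 3}'
companies = list(iter_concatenated_json(sample))
print([c["companyId"] for c in companies])
```

This only works around the problem on the reading side; the file itself is still not valid JSON for any other tool.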

You probably want to wrap it in an array, simplified:

[{ .. },{ .. },{ .. }]

Or full:

[{"properties": [{"property": "name", "value": "A random company name"}, {"property": "companyId", "value": 123456789}]},{"properties": [{"property": "name", "value": "Another random company name"}, {"property": "companyId", "value": 31415999}]},{"properties": [{"property": "name", "value": "Yet another random company"}, {"property": "companyId", "value": 10101010}]}]
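
On the writing side, this could be done by collecting every record in a list and calling `json.dump` once at the end, instead of appending a separate `json.dumps` result per company. A rough sketch, assuming the company names and ids have already been pulled out of the API response; `save_companies`, `find_company_id` and the temporary file path are hypothetical names, not part of the original script:

```python
import json
import os
import tempfile


def save_companies(companies, path):
    """Write all (name, company_id) pairs as ONE JSON array."""
    records = [
        {"properties": [
            {"property": "name", "value": name},
            {"property": "companyId", "value": company_id},
        ]}
        for name, company_id in companies
    ]
    # Mode "w", not "a": the array must be written in a single piece.
    with open(path, "w") as out:
        json.dump(records, out)


def find_company_id(path, wanted_name):
    """Return the companyId whose name matches case-insensitively, else None."""
    with open(path) as data_file:
        records = json.load(data_file)  # a list, because the file holds an array
    for record in records:
        # Turn the list of property/value pairs into a plain dict.
        props = {p["property"]: p["value"] for p in record["properties"]}
        if props["name"].lower() == wanted_name.lower():
            return props["companyId"]
    return None


# Demo data taken from the example above, written to a throwaway directory.
demo_path = os.path.join(tempfile.mkdtemp(), "all_the_companies.json")
save_companies(
    [("A random company name", 123456789),
     ("Another random company name", 31415999)],
    demo_path,
)
print(find_company_id(demo_path, "another random company name"))  # 31415999
```

With paged API responses, the list would be extended inside the paging loop and dumped once after the loop finishes.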

answered Mar 22 '16 at 15:01

Martin Tournoij

Hi @Carpetsmoker, I tried doing that manually (because I couldn't get the script above to add the commas automatically), but then I read somewhere else that JSON is self-delimiting. Is that correct? – Marko Mar 22 '16 at 15:07
@Marko "Self delimiting" merely means that the JSON parser doesn't need any help to determine where the message ends. i.e. you can just write `[ ... ]` and not `[ .. ] END_OF_JSON` or `data is 42 bytes: [ .. ]`. It doesn't mean you can just concatenate a bunch of JSON objects and parse it in one go. – Martin Tournoij Mar 22 '16 at 15:15
Thanks for the clarification @Carpetsmoker, I thought it sounded a bit too easy. I've since given up on JSON and am attempting to export a simple CSV, which also seems to be fraught with problems... – Marko Mar 22 '16 at 15:36
@Marko Hah, okay ;-) Millions of people use JSON daily, and it's a better format than CSV in many ways, especially for more complex data. – Martin Tournoij Mar 22 '16 at 15:38
I totally agree :) but for the purposes of this project which I'm working on, I need an immediate solution, and don't have enough time to continue experimentation with JSON. All I need is a simple two column CSV... – Marko Mar 22 '16 at 15:43
