
I have the following JSON (for simplicity's sake I'll only show one entry, but there are 100 in reality):

{
    "Active": false, 
    "Book": "US Derivat. London, Mike Übersax/Michael Jealous", 
    "ExpirationDate": "2006-10-12", 
    "Isin": "CH0013096497", 
    "IssueDate": "2001-10-09", 
    "KbForXMonths": "0", 
    "KbPeriodDay": "Period", 
    "KbType": "Prozent", 
    "KbYear": "0.5", 
    "Keyinvest_IssueRetro": "0.50%", 
    "Keyinvest_RecurringRetro": "1.00% pro rata temporis", 
    "Keyinvest_RetroPayment": "Every month", 
    "LastImportDate": "2008-12-31", 
    "LiberierungDate": "1900-01-01", 
    "NominalCcy": "USD", 
    "NominalStueck": "5,000", 
    "PrimaryCCR": "0", 
    "QuoteType": "Nominal", 
    "RealValor": "0", 
    "Remarks": "", 
    "RwbeProductId_CCR": "034900", 
    "RwbeProductId_EFS": "034900", 
    "SecName": "Cliquet GROI on Nasdaq", 
    "SecType": "EQ", 
    "SubscriptionEndDate": "1900-01-01", 
    "TerminationDate": "2003-10-19", 
    "TradingCcy": "USD", 
    "Valor": 1309649
}

I'm trying to read this JSON in order to save it as a .csv (so that I can import it into a database).

However, when I try to write this JSON data as a CSV like so:

import codecs
import csv

with codecs.open('EFSDUMP.csv', 'w', 'utf-8-sig') as csv_file:
    content_writer = csv.writer(csv_file, delimiter=',')
    content_writer.writerow(data.values())

I get an error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xdc' in position 25: ordinal not in range(128)

That is because there's an umlaut in the JSON (see attribute "Book").

I try to read the JSON like this:

data = json.loads(open('EFSDUMP.json').read().decode('utf-8-sig'))
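As a side note, the `utf-8-sig` codec strips a leading BOM if one is present and behaves like plain `utf-8` otherwise, so it is a safe choice either way. A minimal sketch of that behaviour (in Python 3 syntax, with an inline byte string standing in for the file contents):

```python
import codecs
import json

# UTF-8 bytes for {"Book": "Übersax"}, with a BOM prepended
raw = codecs.BOM_UTF8 + b'{"Book": "\xc3\x9cbersax"}'

# 'utf-8-sig' strips the BOM; it also works when no BOM is present
data = json.loads(raw.decode('utf-8-sig'))
print(data['Book'])  # Übersax
```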

What's interesting is that this:

print data

Gives me this:

{u'PrimaryCCR': u'0', u'SecType': u'EQ', u'Valor': 1309649, u'KbType': u'Prozent', u'Book': u'US Derivat. London, Mike \xdcbersax/Michael Jealous', u'Keyinvest_RecurringRetro': u'1.00% pro rata temporis', u'TerminationDate': u'2003-10-19', u'RwbeProductId_CCR': u'034900', u'SubscriptionEndDate': u'1900-01-01', u'ExpirationDate': u'2006-10-12', u'Keyinvest_RetroPayment': u'Every month', u'Keyinvest_IssueRetro': u'0.50%', u'QuoteType': u'Nominal', u'KbYear': u'0.5', u'LastImportDate': u'2008-12-31', u'Remarks': u'', u'RealValor': u'0', u'SecName': u'Cliquet GROI on Nasdaq', u'Active': False, u'KbPeriodDay': u'Period', u'Isin': u'CH0013096497', u'LiberierungDate': u'1900-01-01', u'IssueDate': u'2001-10-09', u'KbForXMonths': u'0', u'NominalCcy': u'USD', u'RwbeProductId_EFS': u'034900', u'TradingCcy': u'USD', u'NominalStueck': u'5,000'}

Clearly, the umlaut became a `\xdc`.

However when I do this:

print data['Book']

meaning I access the attribute directly, I get:

US Derivat. London, Mike Übersax/Michael Jealous

So the umlaut is an actual umlaut again.
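This is expected: printing the whole dict shows the `repr()` of each value, which in Python 2 escapes non-ASCII characters, while printing a single string shows its actual characters. The underlying data is identical in both cases. A sketch of the same distinction, using Python 3's `ascii()` (which produces the same escaped form Python 2's `repr()` did):

```python
s = '\xdcbersax'   # the string 'Übersax'

# Printing the string itself shows the real character
print(s)           # Übersax

# ascii() escapes non-ASCII, like Python 2's repr of a unicode string
print(ascii(s))    # '\xdcbersax'
```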

I'm pretty sure that the JSON is UTF-8 without BOM (Notepad++ claims so).

I have already tried all of the suggestions from Python load json file with UTF-8 BOM header, without any success.

How can I properly read the UTF-8 JSON file in order to be able to write it as .csv?

Any help is greatly appreciated.

Python version: 2.7.2


1 Answer

In Python 2, the csv module does not support writing Unicode. You need to encode it manually here; otherwise your Unicode values are encoded for you using ASCII, which is why you got the encoding exception.

This also means you need to write the UTF-8 BOM manually, and only if you really need it: a byte order mark is not required to read UTF-8 files, because UTF-8 can only be written one way. Microsoft likes to add it to files so that its tools can detect the encoding more easily, but the UTF-8 BOM may actually make it harder for other tools to work correctly, as they won't ignore the extra initial character.
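For reference, the UTF-8 BOM is just the three bytes `EF BB BF`; `codecs.BOM_UTF8` holds exactly that sequence:

```python
import codecs

# The UTF-8 signature some Microsoft tools write at the start of files
print(repr(codecs.BOM_UTF8))  # b'\xef\xbb\xbf' in Python 3
```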

Use:

import codecs
import csv

with open('EFSDUMP.csv', 'wb') as csv_file:
    csv_file.write(codecs.BOM_UTF8)
    content_writer = csv.writer(csv_file)
    content_writer.writerow([unicode(v).encode('utf8') for v in data.values()])

Note that this'll write your values in arbitrary (dictionary) order. The unicode() call converts non-string types (such as the boolean and the integer in your data) to unicode strings before encoding.
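If the arbitrary ordering is a problem (e.g. for importing into a database with a fixed column layout), you can pin the order yourself by iterating over a sorted list of keys and writing a header row first. A sketch in Python 3 syntax, where the csv module handles Unicode natively and the manual encoding step disappears (the data dict is abbreviated here):

```python
import csv
import io

data = {'Book': 'US Derivat. London, Mike \xdcbersax/Michael Jealous',
        'Valor': 1309649,
        'Active': False}

fieldnames = sorted(data)   # fixed, reproducible column order
buf = io.StringIO()         # stands in for the output file
writer = csv.writer(buf)
writer.writerow(fieldnames)                     # header row
writer.writerow([data[k] for k in fieldnames])  # values in the same order

print(buf.getvalue())
```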

To be explicit: you've loaded the JSON data just fine. It is the CSV writing that failed for you.

Martijn Pieters
  • Thank you very much for clarifying where I'm failing :). However I now get `_csv.Error: sequence expected` from `content_writer.writerow(v.encode('utf8') for v in data.values())`. Please note that there is a boolean value as well as an integer in the JSON, so the encode might be failing there – emazzotta Mar 13 '14 at 11:25
  • @gta99: your `data` object is empty in that case. Make sure you actually have a non-empty `data` dictionary. – Martijn Pieters Mar 13 '14 at 11:26
  • `data` is not empty but the encode seems to be failing for the boolean and integer value – emazzotta Mar 13 '14 at 11:29
  • @gta99: that'd be a different exception (`AttributeError: 'bool' object has no attribute 'encode'`). Use `unicode(v).encode('utf8')` if you have more than just strings. I do see you have an integer value in your sample input; I updated the code to handle that. – Martijn Pieters Mar 13 '14 at 11:31
  • `print [unicode(v).encode('utf8') for v in data.values()]` actually gives me the values however when using `content_writer.writerow(unicode(v).encode('utf8') for v in data.values())` on the next line I get `_csv.Error: sequence expected` – emazzotta Mar 13 '14 at 11:35
  • @gta99: ah, indeed, my mistake. Python 2 `csv.writer.writerow()` cannot handle iterators. Will update. – Martijn Pieters Mar 13 '14 at 11:36