I have to handle a big JSON file (approx. 47GB) and it seems as if I found the solution in ijson.

However, when I want to go through the objects I get the following error:

byggesag = (o for o in objects if o["h�ndelse"] == 'Byggesag')
                                                             ^
SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0xe6 in position 12: invalid continuation byte

Here is the code I am using so far:

import ijson

with open("C:/Path/To/Json/JSON_20220703180000.json", "r", encoding="cp1252") as json_file:
    objects = ijson.items(json_file, 'SagList.item')
    byggesag = (o for o in objects if o['hændelse'] == 'Byggesag')

How can I deal with the encoding of the input file?

TomGeo
  • JSON by definition is utf-8, don't know how you'd be getting cp1252 in there. – Mark Ransom Jul 11 '22 at 15:47
  • @MarkRansom after following the advice of Rodrigo, and changing the encoding in the open call 'back' to utf-8, it works. Don't get me wrong, but in the end *.json is only a text file and can be saved in whatever encoding. – TomGeo Jul 11 '22 at 17:45
  • JSON is a standard, and I thought part of it was how to deal with Unicode. I don't work closely with it though so I could be wrong on the details. – Mark Ransom Jul 11 '22 at 18:56
  • @MarkRansom I'm not saying you are wrong. The issue is that outside the English-speaking world there is a good chance that files are saved in the local encoding, especially in Windows environments. In my case that would be cp1252, and when I see characters such as æ, ø, and å my encoding flag is immediately raised. – TomGeo Jul 11 '22 at 19:04

1 Answer

The problem is with the Python script itself, which is saved as cp1252 while Python expects source files to be UTF-8 by default. Your handling of the input JSON file looks correct, but you won't be able to tell until the script actually runs.

First, note that the error is a SyntaxError, which Python raises while it parses your script/module, before any of its code runs.

Secondly, note how hændelse appears scrambled in the first bit of code you shared, and how Python complains that the 'utf-8' codec can't decode byte 0xe6. This is because the character æ (U+00E6, https://www.compart.com/de/unicode/U+00E6) is encoded as the single byte 0xe6 in cp1252. In UTF-8, 0xe6 opens a three-byte sequence and must be followed by continuation bytes, but here it is followed by the plain ASCII letter n; hence the "invalid continuation byte" error.
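The byte-level mismatch can be reproduced outside of any script file. The following is only a minimal sketch: the helper name try_decode_utf8 is made up for illustration, and the word is written with a \u escape so the snippet itself stays pure ASCII.

```python
# "h\u00e6ndelse" is the word from the question; the escape avoids
# depending on how this snippet itself is saved.
word = "h\u00e6ndelse"

cp1252_bytes = word.encode("cp1252")  # the letter at index 1 becomes one byte, 0xE6
utf8_bytes = word.encode("utf-8")     # the same letter becomes two bytes, 0xC3 0xA6


def try_decode_utf8(data):
    """Hypothetical helper: return (text, None) on success, (None, error) on failure."""
    try:
        return data.decode("utf-8"), None
    except UnicodeDecodeError as exc:
        return None, exc


# Decoding the cp1252 bytes as UTF-8 fails at the 0xE6 byte.
text, error = try_decode_utf8(cp1252_bytes)
print(cp1252_bytes)
print(error)
```

The same word encoded as UTF-8 decodes without trouble, which is why re-saving the script as UTF-8 makes the error go away.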

To solve it, save your Python script with UTF-8 encoding, or declare at the top of the file that it is saved as cp1252 (see https://peps.python.org/pep-0263/ for reference).
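As a rough illustration of the second option, the sketch below compiles the same cp1252-encoded source bytes with and without a PEP 263 declaration. The one-line script, the "<demo>" filename and the run_source helper are all hypothetical; in a real file the declaration is simply a comment on the first or second line.

```python
# Hypothetical one-line script, stored as cp1252 bytes (the special
# letter is the single byte 0xE6).
body = 'key = "h\u00e6ndelse"\n'.encode("cp1252")

# PEP 263 declaration; it must sit on the first or second line of the file.
declaration = b"# -*- coding: cp1252 -*-\n"


def run_source(source_bytes):
    """Compile and run raw source bytes; return (namespace, error)."""
    try:
        code = compile(source_bytes, "<demo>", "exec")
    except SyntaxError as exc:
        return None, exc
    namespace = {}
    exec(code, namespace)
    return namespace, None


# Without a declaration the bytes are treated as UTF-8 and rejected.
undeclared_ns, undeclared_error = run_source(body)

# With the declaration the same bytes decode to the intended text.
declared_ns, declared_error = run_source(declaration + body)

print(type(undeclared_error).__name__)
print(declared_ns["key"] == "h\u00e6ndelse")
```

Re-saving the script as UTF-8 is usually the simpler fix, since then no declaration is needed at all.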

Rodrigo Tobar