Python load json file with UTF-8 BOM header

Question

I need to parse files generated by another tool, which unconditionally writes JSON files with a UTF-8 BOM header (EF BB BF). I soon found that the BOM was the problem, as the Python 2.7 json module can't seem to parse it:

>>> import json
>>> data = json.load(open('sample.json'))

ValueError: No JSON object could be decoded

Removing the BOM solves it, but is there another way to parse a JSON file with a BOM header?

[Python : How to fix Unexpected UTF-8 BOM error when using json.loads](https://www.howtosolutions.net/2019/04/python-fixing-unexpected-utf-8-bom-error-when-loading-json-data/) — Grijesh Chauhan, Nov 27 '19 at 09:43

Pavel Anossov · Accepted Answer · 2012-10-31T11:25:36.320

96

You can open with codecs:

import json
import codecs

json.load(codecs.open('sample.json', 'r', 'utf-8-sig'))

or decode with utf-8-sig yourself and pass to loads:

json.loads(open('sample.json').read().decode('utf-8-sig'))
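On Python 3, `str` has no `.decode()`, so the second form needs the file opened in binary mode. A minimal sketch of that variant, assuming a throwaway temp file with made-up contents in place of sample.json:

```python
import codecs
import json
import os
import tempfile

# Create a small JSON file that starts with the UTF-8 BOM (EF BB BF).
# The temp file stands in for 'sample.json'; its contents are invented.
fd, path = tempfile.mkstemp(suffix='.json')
with os.fdopen(fd, 'wb') as f:
    f.write(codecs.BOM_UTF8 + b'{"key": "value"}')

# Read raw bytes, then let the utf-8-sig codec drop the BOM while decoding.
with open(path, 'rb') as f:
    data = json.loads(f.read().decode('utf-8-sig'))

os.remove(path)
print(data)
```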

edited Oct 31 '12 at 11:25

answered Oct 31 '12 at 11:20

Pavel Anossov


22

I strongly recommend using `io.open()` over `codecs.open()`: `json.load(io.open('sample.json', 'r', encoding='utf-8-sig'))`. The `io` module is more robust and faster. – Martijn Pieters Mar 02 '17 at 09:53
@MartijnPieters: Thanks for that comment, good to know. I found this discussion of the differences that might be useful: https://groups.google.com/forum/#!topic/comp.lang.python/s_eIyt3KoLE – Bdoserror Apr 04 '17 at 18:02

score 39 · Answer 2 · edited Apr 10 '21 at 07:21


Simple! You don't even need to import codecs.

with open('sample.json', encoding='utf-8-sig') as f:
    data = json.load(f)
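It may help to know that utf-8-sig only skips a BOM when one is actually there, so the same call should also be safe for files written without it. A small sketch to check this (Python 3; `load_json` and the sample payload are hypothetical, not part of the answer):

```python
import codecs
import json
import os
import tempfile

payload = b'{"name": "caf\xc3\xa9"}'  # UTF-8 encoded JSON, invented content

def load_json(path):
    # utf-8-sig skips a leading BOM if present and otherwise acts like utf-8.
    with open(path, encoding='utf-8-sig') as f:
        return json.load(f)

results = []
for prefix in (codecs.BOM_UTF8, b''):  # once with a BOM, once without
    fd, path = tempfile.mkstemp(suffix='.json')
    with os.fdopen(fd, 'wb') as f:
        f.write(prefix + payload)
    results.append(load_json(path))
    os.remove(path)

print(results[0] == results[1])
```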

edited Apr 10 '21 at 07:21

John R Perry


answered Jun 06 '19 at 22:38

aerin


score 5 · Answer 3 · answered Oct 31 '12 at 11:21


Since json.load(stream) uses json.loads(stream.read()) under the hood, it isn't hard to write a small helper function that lstrips the BOM:

from codecs import BOM_UTF8

def lstrip_bom(str_, bom=BOM_UTF8):
    if str_.startswith(bom):
        return str_[len(bom):]
    else:
        return str_

json.loads(lstrip_bom(open('sample.json').read()))

In other situations where you need to wrap a stream and fix it somehow you may look at inheriting from codecs.StreamReader.
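The helper above works on Python 2 byte strings. A rough Python 3 counterpart, assuming the file was already read as text with plain utf-8 (in which case the BOM shows up as the single character U+FEFF), might look like this; `lstrip_bom_text` is a hypothetical name:

```python
import codecs
import json

# After decoding as plain utf-8, the three BOM bytes become one character.
BOM_TEXT = codecs.BOM_UTF8.decode('utf-8')  # same as '\ufeff'

def lstrip_bom_text(text, bom=BOM_TEXT):
    """Return text without a leading BOM character (hypothetical helper)."""
    if text.startswith(bom):
        return text[len(bom):]
    return text

raw = '\ufeff{"a": 1}'  # invented sample input with a leading BOM
data = json.loads(lstrip_bom_text(raw))
print(data)
```

Text without a BOM passes through unchanged, so the helper can be applied unconditionally.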

answered Oct 31 '12 at 11:21

newtover


Why not use the string strip function? – Sam Stoelinga Oct 21 '13 at 14:20
3

@SamStoelinga, because the strip function takes a set of characters to remove, not a prefix. That is, you need to either decode the byte string into `unicode` or use the approach above to be sure you left-strip only the UTF-8 BOM. – newtover Oct 21 '13 at 18:31
I'm getting an error that says TypeError: expected str,bytes or os.Pathlike object, not _io.TextIOWrapper – Zypps987 Jun 04 '17 at 16:14
@Zypps987, the snippet assumes Python 2, where `read()` returns bytes. To make the snippet work in Python 3 you would need to decode `BOM_UTF8` as 'utf-8' so that it is compared against text. But you don't need this when you have the `utf-8-sig` encoding. – newtover Jun 04 '17 at 19:44

score 4 · Answer 4 · edited May 04 '19 at 06:47


You can also do it with the `with` keyword:

import codecs
with codecs.open('samples.json', 'r', 'utf-8-sig') as json_file:
    data = json.load(json_file)

or better:

import io
with io.open('samples.json', 'r', encoding='utf-8-sig') as json_file:
    data = json.load(json_file)
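One reason the io.open form carries over cleanly is that on Python 3 it is the built-in open() itself, so the same line runs on both 2.7 and 3. A short check, as a sketch on Python 3 with an invented temp file standing in for samples.json:

```python
import codecs
import io
import json
import os
import tempfile

# On Python 3, io.open is the very same object as the built-in open().
same_function = io.open is open

# Made-up file standing in for 'samples.json', written with a leading BOM.
fd, path = tempfile.mkstemp(suffix='.json')
with os.fdopen(fd, 'wb') as f:
    f.write(codecs.BOM_UTF8 + b'[1, 2, 3]')

with io.open(path, 'r', encoding='utf-8-sig') as json_file:
    data = json.load(json_file)

os.remove(path)
print(same_function, data)
```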

edited May 04 '19 at 06:47

Ray Hulha


answered Mar 30 '19 at 15:24

Mohamed Ali Mimouni


score 0 · Answer 5 · answered Dec 04 '17 at 08:51


If this is a one-off, a very simple super high-tech solution that worked for me...

Open the JSON file in your favorite text editor.
Select-all
Create a new file
Paste
Save.

BOOM, BOM header gone!

answered Dec 04 '17 at 08:51

Mike N


score 0 · Answer 6 · answered Mar 24 '20 at 01:01

I removed the BOM manually with Linux commands.

First I checked whether the file starts with the bytes ef bb bf, using head i_have_BOM | xxd.

Then I ran dd bs=1 skip=3 if=i_have_BOM.json of=I_dont_have_BOM.json.

bs=1 processes one byte at a time, and skip=3 skips the first 3 bytes.
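Note that dd with skip=3 drops the first three bytes whether or not they are a BOM, which is why the xxd check comes first. If you would rather do both steps in Python, a sketch along these lines should work; `strip_bom_file` is a hypothetical helper, and the demo file contents are invented:

```python
import codecs
import os
import shutil
import tempfile

def strip_bom_file(src, dst):
    """Copy src to dst, dropping a leading UTF-8 BOM if there is one.

    Hypothetical helper; it reads the whole file into memory.
    Returns True when a BOM was found and removed.
    """
    with open(src, 'rb') as f:
        raw = f.read()
    had_bom = raw.startswith(codecs.BOM_UTF8)
    if had_bom:
        raw = raw[len(codecs.BOM_UTF8):]
    with open(dst, 'wb') as f:
        f.write(raw)
    return had_bom

# Demo in a temp directory, reusing the file names from the dd command.
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, 'i_have_BOM.json')
dst = os.path.join(tmpdir, 'I_dont_have_BOM.json')
with open(src, 'wb') as f:
    f.write(codecs.BOM_UTF8 + b'{"ok": true}')

removed = strip_bom_file(src, dst)
with open(dst, 'rb') as f:
    cleaned = f.read()

shutil.rmtree(tmpdir)
print(removed, cleaned)
```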

score 0 · Answer 7 · answered Jun 10 '20 at 11:20


I use utf-8-sig with nothing more than import json:

import json

with open('estados.json', encoding='utf-8-sig') as json_file:
    data = json.load(json_file)
    print(data)

answered Jun 10 '20 at 11:20

Rodrigo Grossi

