
Our tables have been loaded with broken special characters, and I'm trying to find a way to fix them through Standard SQL, a temp JS function, or otherwise in the BigQuery web UI.

The broken text is this: DÃ¨cor but it should be this: Dècor

I've already tried some solutions I've found on SO, including this one, but none of them work.

Is there a way to fix this using Standard SQL, a temp JS function, or another way in the BigQuery web UI?

cchan
  • show more examples of broken stuff – Mikhail Berlyant Apr 16 '20 at 15:27
  • @MikhailBerlyant, this is it for now, but I'm looking for a comprehensive function where all special characters passed through it are converted correctly. I've searched quite a bit, but it doesn't seem like there's one that consistently works across use cases – cchan Apr 16 '20 at 18:19

2 Answers


If you don't have a lot of columns, you can just use REPLACE(brokencolumn, 'Ã¨', 'è').

You can still use it if you have a lot of columns, but it might be better to find a way to automate it :)
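A minimal sketch of such an automated approach, assuming a hypothetical table mydataset.products with a broken column name (both identifiers are placeholders): wrap the chained REPLACE calls in a temporary SQL function, so each newly discovered broken sequence only has to be added in one place.

CREATE TEMP FUNCTION fix_chars(s STRING) AS (
  -- extend this chain with every broken sequence you discover
  REPLACE(
    REPLACE(s, 'Ã¨', 'è'),
    'Ã©', 'é')
);

SELECT fix_chars(name) AS name_fixed
FROM mydataset.products;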

Sabri Karagönen
  • hi Sabri, this works, but I'm looking for a more comprehensive solution where the error is unknown: as long as the special characters are passed through it, they're converted correctly – cchan Apr 16 '20 at 18:21

Not sure if this answers your question, but I'd focus my attention elsewhere: namely, on why the word "Dècor" becomes "DÃ¨cor" (if I understood correctly) when it is loaded into your BigQuery table.

Let's say you have a CSV file with the following content:

Dècor|Dècor|Dècor
Dècor|Dècor|Dècor

If you load it in BigQuery with encoding "ISO-8859-1" (while the file itself is UTF-8), it gets corrupted.

bq load --autodetect --source_format=CSV --field_delimiter="|" --encoding='ISO-8859-1' mydataset.test_french gs://my-bucket/broken_french.csv

And here's what the table looks like inside BigQuery:

Row string_field_0  string_field_1  string_field_2
1    DÃ¨cor          DÃ¨cor          DÃ¨cor
2    DÃ¨cor          DÃ¨cor          DÃ¨cor

On the other hand, if you use 'UTF-8' encoding, like so:

bq load --autodetect --source_format=CSV --field_delimiter="|" --encoding='UTF-8' mydataset.test_french2 gs://my-bucket/broken_french.csv

the result in BigQuery looks as it should:

Row string_field_0  string_field_1  string_field_2
1    Dècor           Dècor           Dècor
2    Dècor           Dècor           Dècor

So, in the case where you're using the wrong encoding to load your data, I'd reload it using the correct one.
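If reloading is not an option, this particular corruption is also mechanically reversible in Standard SQL, since every broken character corresponds to exactly one byte of the original UTF-8. Below is a minimal sketch (the function name un_mojibake is a placeholder); it assumes the broken values contain only characters up to U+00FF, which is exactly what an ISO-8859-1 misload produces:

CREATE TEMP FUNCTION un_mojibake(s STRING) AS (
  -- Re-encode each code point as a single Latin-1 byte, then re-read the
  -- resulting byte sequence as UTF-8. CODE_POINTS_TO_BYTES errors on code
  -- points above 255, so apply this only to values that are actually broken;
  -- SAFE_CONVERT_BYTES_TO_STRING returns NULL for bytes that aren't valid UTF-8.
  SAFE_CONVERT_BYTES_TO_STRING(CODE_POINTS_TO_BYTES(TO_CODE_POINTS(s)))
);

SELECT un_mojibake('DÃ¨cor');  -- returns 'Dècor'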

milia