
Our tables have been loaded with broken special characters, and I'm trying to find a way to fix them through Standard SQL, a temp JS function, or otherwise in the BigQuery web UI.

The broken text is this: DÃ¨cor but it should be this: Dècor

I've already tried some solutions I've found on SO, including this one, but none of them work.

Is there a way to fix this using Standard SQL, a temp JS function, or another way in the BigQuery web UI?

cchan
  • show more examples of broken stuff – Mikhail Berlyant Apr 16 '20 at 15:27
  • @MikhailBerlyant, this is it for now, but I'm looking for a comprehensive function where all special characters passed through it are converted correctly. I've searched quite a bit, but it doesn't seem like there's one that consistently works across use cases – cchan Apr 16 '20 at 18:19

2 Answers


If you don't have a lot of columns, you can just use REPLACE(brokencolumn, 'Ã¨', 'è').

You can still use it if you have a lot of columns, but it might be better to find a way to automate it :)
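A minimal sketch of such an automated approach, assuming a hypothetical table mydataset.products with a broken column name (both identifiers are placeholders): wrap the chained REPLACE calls in a temporary SQL function, so each newly discovered broken sequence only has to be added in one place.

CREATE TEMP FUNCTION fix_chars(s STRING) AS (
  -- extend this chain with every broken sequence you discover
  REPLACE(
    REPLACE(s, 'Ã¨', 'è'),
    'Ã©', 'é')
);

SELECT fix_chars(name) AS name_fixed
FROM mydataset.products;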

Sabri Karagönen
  • hi Sabri, this works, but I'm looking for a more comprehensive solution where the error is unknown: as long as the special characters are passed through it, they're converted correctly – cchan Apr 16 '20 at 18:21

Not sure if this answers your question, but I'd focus my attention elsewhere: namely, on why the word "Dècor" becomes "DÃ¨cor" (if I understood correctly) when it is loaded into your BigQuery table.

Let's say you have a CSV file with the following content:

Dècor|Dècor|Dècor
Dècor|Dècor|Dècor

If you load it in BigQuery with encoding "ISO-8859-1" (while the file itself is UTF-8), it gets corrupted.

bq load --autodetect --source_format=CSV --field_delimiter="|" --encoding='ISO-8859-1' mydataset.test_french gs://my-bucket/broken_french.csv

And here's what the table looks like inside BigQuery:

Row string_field_0  string_field_1  string_field_2
1    DÃ¨cor          DÃ¨cor          DÃ¨cor
2    DÃ¨cor          DÃ¨cor          DÃ¨cor

On the other hand, if you use 'UTF-8' encoding, like so:

bq load --autodetect --source_format=CSV --field_delimiter="|" --encoding='UTF-8' mydataset.test_french2 gs://my-bucket/broken_french.csv

the result in BigQuery looks as it should:

Row string_field_0  string_field_1  string_field_2
1    Dècor           Dècor           Dècor
2    Dècor           Dècor           Dècor

So, in the case where you're using the wrong encoding to load your data, I'd reload it using the correct one.
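If reloading is not an option, this particular corruption is also mechanically reversible in Standard SQL, since every broken character corresponds to exactly one byte of the original UTF-8. Below is a minimal sketch (the function name un_mojibake is a placeholder); it assumes the broken values contain only characters up to U+00FF, which is exactly what an ISO-8859-1 misload produces:

CREATE TEMP FUNCTION un_mojibake(s STRING) AS (
  -- Re-encode each code point as a single Latin-1 byte, then re-read the
  -- resulting byte sequence as UTF-8. CODE_POINTS_TO_BYTES errors on code
  -- points above 255, so apply this only to values that are actually broken;
  -- SAFE_CONVERT_BYTES_TO_STRING returns NULL for bytes that aren't valid UTF-8.
  SAFE_CONVERT_BYTES_TO_STRING(CODE_POINTS_TO_BYTES(TO_CODE_POINTS(s)))
);

SELECT un_mojibake('DÃ¨cor');  -- returns 'Dècor'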

milia