1

We receive a survey Web-hook data in Bigquery. The comment in local language is captured as unicode and we do have special character in that comment.

  • Example

    • Comment in survey- "别老是晚点,现场补行李费太贵"
    • Comment in Bigquery data- "\u522b\u8001\u662f\u665a\u70b9\uff0c\u73b0\u573a\u8865\u884c\u674e\u8d39\u592a\u8d35"

We found a solution for decode individual comment :-

    CREATE TEMPORARY FUNCTION utf8convert(s STRING)
    RETURNS STRING
    LANGUAGE js AS """
    return unescape( ( s ) );
    """;
    with sample AS (SELECT '\u522b\u8001\u662f\u665a' AS S)
    SELECT utf8convert(s) from sample

When implement this code in comment field with thousand of comment and different languages its not working.

    CREATE TEMPORARY FUNCTION utf8convert(s STRING)
    RETURNS STRING
    LANGUAGE js AS """
    return unescape( ( s ) );
    """;
   SELECT Comment, utf8convert(Comment) as Convert
   FROM `airasia-nps.nps_production.NPSDashboard_Webhook_Data1`
   where Comment is not null 

no error while running but the result is in Unicode does not change to local language. Result: local language in Unicode

  • I have try this code

      CREATE TEMP FUNCTION DecodeUnicode(s STRING) AS (
      IF(s NOT LIKE '%\\u%', s,
      (SELECT CODE_POINTS_TO_STRING(ARRAY_AGG(CAST(CONCAT('0x', x) AS INT64)))
      FROM UNNEST(SPLIT(s, '\\u')) AS x
       WHERE x != ''))
      );
    
      SELECT
      original,
      DecodeUnicode(original) AS decoded
      FROM (
      SELECT trim(r'$-\u6599\u91d1\u304c\u9ad8\u3059\u304e\uff01\uff01\uff01') AS original UNION ALL
      SELECT trim(r'abcd')
      );
    

shows error i think its because the comment start with special character?.

  • 1
    Can you explain why the solution you found is not working? Any way to reproduce? What's the error message? – Felipe Hoffa Nov 07 '17 at 05:02
  • No error while running but the result is in Unicode does not change to local language. Here with i have attach the code and result. – Sharmila Naidu AP Jayaveran - Nov 07 '17 at 06:11
  • weird thing is that if we try decoding the characters that didn't work for you they still work normally. Just out of curiosity I wonder what would happen if you select only one row that didn't encode properly and try to decode just that to see what happens ( i tried simulating broken characters but as far as i could see it just throws exceptions). – Willian Fuks Nov 07 '17 at 14:59

1 Answers1

1

See if this works. It does the "manual" decoding for strings that have \u in them by converting to Unicode code points and then to a string. It should be faster than using JavaScript, too.

CREATE TEMP FUNCTION DecodeUnicode(s STRING) AS (
  IF(s NOT LIKE '%\\u%', s,
     (SELECT CODE_POINTS_TO_STRING(ARRAY_AGG(CAST(CONCAT('0x', x) AS INT64)))
      FROM UNNEST(SPLIT(s, '\\u')) AS x
      WHERE x != ''))
);

SELECT
  original,
  DecodeUnicode(original) AS decoded
FROM (
  SELECT r'\u522b\u8001\u662f\u665a\u70b9\uff0c\u73b0\u573a\u8865\u884c\u674e\u8d39\u592a\u8d35' AS original UNION ALL
  SELECT r'abcd'
);

As output, this returns 别老是晚点,现场补行李费太贵 and abcd.

Elliott Brossard
  • 32,095
  • 2
  • 67
  • 99