0

I want to select the records containing non-alphanumeric and remove those symbols from strings. The result I expecting is strings with only numbers and letters.

I'm not really familiar with regular expression and sometime it's really confusing. The code below is from answers to similar questions. But it also returns records having only letters and space. I also tried to use /s in case some spaces are not spaces but tabs instead. But I got the same result.

Also, I want to remove all symbols, characters excepting letters, numbers and spaces. I found a function named removesymbols from google could reference. But it seems this function does not exist at all. The website introduces removesymbols is https://cloud.google.com/dataprep/docs/html/REMOVESYMBOLS-Function_57344727. How can I remove all symbols? I don't want to use replace because there are a lot of symbols and I don't know all kinds of non-alphanumeric they have.

-- the code here only shows I want to select all records with non-alphanumeric
SELECT EMPLOYER
FROM fec.work
WHERE EMPLOYER NOT LIKE '[^a-zA-Z0-9/s]+'
GROUP BY 1;
Min Sun
  • 7
  • 1
  • 5
  • for the regex to avoid the \t, have you tried `[^a-zA-Z0-9 ]+` instead of `[^a-zA-Z0-9\s]+`? – IWHKYB Jul 25 '19 at 14:59

2 Answers2

3

Below is for BigQuery Standard SQL

SELECT 
  REGEXP_REPLACE(EMPLOYER, '[^a-zA-Z\\d\\s\\t]', ''), -- option 1
  REGEXP_REPLACE(EMPLOYER, r'[^a-zA-Z\d\s\t]', ''),   -- option 2
  REGEXP_REPLACE(EMPLOYER, r'[^\w]', ''),             -- option 3
  REGEXP_REPLACE(EMPLOYER, r'\W', '')                 -- option 4
FROM fec.work

As you can see - option 1 is most verbose and you can avoid double escaping by using r in front of string regular expression as it is in option 2
To further simplify - you can use \w or directly \W as in options 3 and 4

Note: BigQuery provides regular expression support using the re2 library; see that documentation for its regular expression syntax.

Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
  • I've edited my answer to take `r` use into account. You need to add `\s` to option3. Also, using `\w`, undescrores will be considered alphanumerical. Option 4 will not replace spaces and tab. – Rafaël Jul 26 '19 at 08:07
  • i used those for purpose - if you agree to consider space as a alphanumeric, you can easily extend that to underscore too - that's what I did based on question and comments (obviously I could be wrong with such an assumption of OP's case). otherwise - nice job :o) – Mikhail Berlyant Jul 26 '19 at 15:09
1

I suggest using REGEXP_REPLACE for select, to remove the characters, and using REGEXP_CONTAINS to get only the one you want.

SELECT REGEXP_REPLACE(EMPLOYER, r'[^a-zA-Z\d\s]', '') 
FROM fec.work
WHERE REGEXP_CONTAINS(EMPLOYER, r'[^a-zA-Z\d\s]')

You say you don't want to use replace because you don't know how many alphanumerical there is. But instead of listing all non-alphanumerical, why not use ^ to get all but alphanumerical ?

EDIT :

To complete with what Mikhail answered, you have multiple choices for your regex :

'[^a-zA-Z\\d\\s]'  // Basic regex
r'[^a-zA-Z\d\s]'   // Uses r to avoid escaping
r'[^\w\s]'         // \w = [a-zA-Z0-9_] (! underscore as alphanumerical !)

If you don't consider underscores to be alphanumerical, you should not use \w

Rafaël
  • 977
  • 8
  • 17
  • Thank you! One said `NOT LIKE` and `[^a-zA-Z0-9]` can select all records with non-alphanumeric. But I didn't get the result I want, so I think it's incorrect. I tried your code but I don't know if it's because BigQuery cannot identify `/d/s/t`, the results removed all numbers, spaces and tabs. Also it did no selection I think. – Min Sun Jul 25 '19 at 15:32
  • Damn right, you need to escape the backslash character like `'\\'`. So for example, digits are `\\d`. I'm updating my answer. Checked and it worked for me. Could you check with updated version ? – Rafaël Jul 25 '19 at 15:38
  • No worries ! I've edited my answer to add regex format proposed by Mikhail. – Rafaël Jul 26 '19 at 07:53