0

I need to implement a simple text classification using regex, and for this I though to apply a simple CASE WHEN statement, but rather than in case 1 condition is met, I want to iterate over all the CASEs

for example

with `table` as(
SELECT 'It is undeniable that AI will change the landscape of the future. There is a frequent increase in the demand for AI-related jobs, especially in data science and machine learning positions. It is believed that artificial intelligence will change the world, just like how electricity changed the world about 100 years ago. As Professor Andrew NG has famously stated multiple times “Artificial Intelligence is the new electricity.” We have advanced immensely in the field of artificial intelligence. With the increase in the processing and computational power, thanks to graphical processing units (GPUs), and also due to the abundance of data, we have reached a position of supremacy in Deep Learning and modern algorithms.' as text
)
SELECT 
  CASE 
    WHEN REGEXP_CONTAINS(text, r'(?i)ai') THEN 'AI'
    WHEN REGEXP_CONTAINS(text, r'(?i)computational power') THEN 'Engineering'
    WHEN REGEXP_CONTAINS(text, r'(?i)deep learning') THEN 'Deep Learning'
  END as topic,
  text
FROM `table`

with this query, the text is classified as AI because is the first condition that is met, but it should be classified as AI, Engineering and Deep Learning in an Array or in 3 different rows, because all 3 conditions are met.

How can I classify the text applying all the regex/conditions?

Racana
  • 317
  • 3
  • 12

3 Answers3

1

One method is string concatenation:

SELECT CONCAT(CASE WHEN REGEXP_CONTAINS(text, r'(?i)ai') THEN 'AI;' ELSE '' END,
              CASE WHEN REGEXP_CONTAINS(text, r'(?i)computational power') THEN 'Engineering;' ELSE '' END,
              CASE WHEN REGEXP_CONTAINS(text, r'(?i)deep learning') THEN 'Deep Learning;' ELSE '' END
            ) as topics, text
FROM `table`;

Actually, this constructs a string. You can use similar-ish logic to construct an array instead.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • exactly what I needed, I use your answer and modified a little to get an array like result – Racana Nov 17 '20 at 18:46
  • this looks unreadable, but gets me where I want `SELECT ARRAY(SELECT CASE WHEN REGEXP_CONTAINS(text, r'(?i)ai') THEN 'AI' ELSE '' END as topics FROM table UNION ALL SELECT CASE WHEN REGEXP_CONTAINS(text, r'(?i)computational power') THEN 'Engineering' ELSE '' END as topics FROM table UNION ALL SELECT CASE WHEN REGEXP_CONTAINS(text, r'(?i)deep learning') THEN 'Deep Learning' ELSE '' END as topics FROM table ) as topics, text FROM table;` – Racana Nov 17 '20 at 18:49
1

Below is for BigQuery Standard SQL

#standardSQL
select 
   array_to_string(array(select distinct lower(topic)
    from unnest(regexp_extract_all(text, r'(?i)ai|computational power|deep learning')) topic
   ), ', ') topics,
  text
from `table`   

if to apply to sample data from your question - output is

enter image description here

Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
1

I feel below is most generic and reusable solution (BigQuery Standard SQL)

#standardSQL
with `table` as(
select 'It is undeniable that AI will change the landscape of the future. There is a frequent increase in the demand for AI-related jobs, especially in data science and machine learning positions. It is believed that artificial intelligence will change the world, just like how electricity changed the world about 100 years ago. As Professor Andrew NG has famously stated multiple times “Artificial Intelligence is the new electricity.” We have advanced immensely in the field of artificial intelligence. With the increase in the processing and computational power, thanks to graphical processing units (GPUs), and also due to the abundance of data, we have reached a position of supremacy in Deep Learning and modern algorithms.' as text
), classification as (
  select 'ai' term, 'AI' topic union all
  select 'computational power', 'Engineering' union all
  select 'deep learning', 'Deep Learning'  
), pattern as (
  select r'(?i)' || string_agg(term, '|') as regexp_pattern
  from classification
)
select 
   array_to_string(array(
    select distinct topic
    from unnest(regexp_extract_all(lower(text), regexp_pattern)) term
    join classification using(term)
   ), ', ') topics,
  text
from `table`, pattern     

with output

enter image description here

Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230