Get rid of blank result from reg_ex_split_table output

Question

I have a query that counts all of the words in one column and gives me the frequency and frequency rank of the word as the result. For some reason, I keep getting a row that has no word in it. How do I get rid of it?

Table:

CREATE TABLE dummy (
created_at TIMESTAMPTZ,
tweet TEXT);

Insert:

INSERT INTO dummy VALUES ('2020-12-18 00:00:00+00', 'foo squared');
INSERT INTO dummy VALUES ('2020-12-18 00:00:00+00', 'foo foo');
INSERT INTO dummy VALUES ('2020-12-18 00:00:00+00', 'foo foo');
INSERT INTO dummy VALUES ('2020-12-18 00:00:00+00', 'foo bar');

Query:

select *
from (
    select date_trunc('day', created_at) as created_day, word, count(*) as cnt,
        rank() over(partition by date_trunc('day', created_at) order by count(*) desc) rn
    from dummy d
    cross join lateral regexp_split_to_table(
        regexp_replace(tweet, '\y(rt|co|https|bar|none)\y', '', 'g'),
        '\s+'
    ) w(word)
    group by created_day, word
) d
where created_day = CURRENT_DATE and word IS NOT NULL
order by rn
LIMIT 10;

Returns:

      created_day       |  word   | cnt | rn
------------------------+---------+-----+----
 2020-12-18 00:00:00+00 | foo     |   4 |  1
 2020-12-18 00:00:00+00 |         |   2 |  2
 2020-12-18 00:00:00+00 | arm     |   1 |  3
 2020-12-18 00:00:00+00 | squared |   1 |  3

I'd like to get rid of the blank word:

      created_day       |  word   | cnt | rn
------------------------+---------+-----+----
 2020-12-18 00:00:00+00 | foo     |   4 |  1
 2020-12-18 00:00:00+00 | arm     |   1 |  2
 2020-12-18 00:00:00+00 | squared |   1 |  3

Are you able to provide a minimum reproducible example so one can investigate the issue? Also, seeing that you are going for a new question, I would recommend accepting an answer on [your previous question](https://stackoverflow.com/questions/65354100/get-ranking-of-words-over-date-based-on-frequency-in-postgresql). — GMB, Dec 18 '20 at 19:45
Apologies, I've updated the question to include a reproducible example — mehsheenman, Dec 18 '20 at 20:55

score 1 · Accepted Answer · answered Dec 18 '20 at 21:43

The problem is with the inner regexp_replace(); when the matching part is at the end of the string, you end up with a trailing space at the end of the strings. Basically, when applied to 'foo bar', it generates 'foo '.

Then when parsed, this generates a final word whose value is the empty string ('').

A simple workaround is to trim() the output of regexp_replace(), so basically replace:

cross join lateral regexp_split_to_table(
    regexp_replace(tweet, '\y(rt|co|https|bar|none)\y', '', 'g'),
    '\s+'
) w(word)

With:

cross join lateral regexp_split_to_table(
    trim(regexp_replace(tweet, '\y(rt|co|https|bar|none)\y', '', 'g')),
    '\s+'
) w(word)

Demo on DB Fiddle

score 0 · Answer 2 · answered Dec 18 '20 at 18:19

0

Can you use it in where clause?

where created_day = CURRENT_DATE 
  And word is not null -- this
order by rn;

Or you can use same condition here also.

) w(word)
word is not null -- this
group by created_day, word

answered Dec 18 '20 at 18:19

Popeye

35,427
4
10
31

I tried that but unfortunately, I still get the same result – mehsheenman Dec 18 '20 at 18:40

Get rid of blank result from reg_ex_split_table output

2 Answers2