3

How can I extract the text up to the 4th instance of a character in a column?

I'm selecting text out of a column called filter_type up to the fourth > character.

To accomplish this, I've been trying to find the position of the fourth > character, but it's not working:

select substring(filter_type from 1 for position('>' in filter_type))
Mark Rotteveel
  • 100,966
  • 191
  • 140
  • 197
Zay Atia
  • 61
  • 1
  • 7

5 Answers5

4

You can use the pattern matching function in Postgres.

First figure out a pattern to capture everything up to the fourth > character.

To start your pattern you should create a sub-group that captures non > characters, and one > character:

([^>]*>)

Then capture that four times to get to the fourth instance of >

([^>]*>){4}

Then, you will need to wrap that in a group so that the match brings back all four instances:

(([^>]*>){4})

and put a start of string symbol for good measure to make sure it only matches from the beginning of the String (not in the middle):

^(([^>]*>){4})

Here's a working regex101 example of that!

Once you have the pattern that will return what you want in the first group element (which you can tell at the online regex on the right side panel), you need to select it back in the SQL.

In Postgres, the substring function has an option to use a regex pattern to extract text out of the input using a 'from' statement in the substring.

To finish, put it all together!

select substring(filter_type from '^(([^>]*>){4})')
from filter_table

See a working sqlfiddle here


If you want to match the entire string whenever there are less than four instances of >, use this regular expression:

 ^(([^>]*>){4}|.*)
Josh Withee
  • 9,922
  • 3
  • 44
  • 62
  • There is a catch, if the field to be filtered don't have the given character >= 4 times, the substring will return null. – chavocarlos Jul 20 '18 at 15:36
  • Yes the original answer assumed there are at least 4 instances of the given character. I have updated the answer to handle the situation in which the given character appears less that 4 times. – Josh Withee Jul 20 '18 at 19:52
  • 1
    Check out my new answer to this question - you might wish to revise your one? – Vérace Mar 08 '22 at 14:08
3

You can also use a simple, non-regex solution:

SELECT array_to_string((string_to_array(filter_type, '>'))[1:4], '>')

The above query:

  1. splits your string into an array, using '>' as delimeter
  2. selects only the first 4 elements
  3. transforms the array back to a string
treecon
  • 2,415
  • 2
  • 14
  • 28
1

You can also split the string, then choose the N'th element inside the result list. For example:

SELECT SPLIT_PART('aa,bb,cc', ',', 2)

will return: bb.

This function is defined as:

SPLIT_PART(string, delimiter, position)
Navid Farhadi
  • 3,397
  • 2
  • 28
  • 34
0
substring(filter_type from '^(([^>]*>){4})')

This form of substring lets you extract the portion of a string that matches a regex pattern.

ysth
  • 96,171
  • 6
  • 121
  • 214
0

In order to look at this problem, I did the following (all of the code below is available on the fiddle here):

CREATE TABLE s
(
  a TEXT
);

I then created a PL/pgSQL function to generate random strings as follows.

CREATE FUNCTION f() RETURNS TEXT LANGUAGE SQL AS 
$$
  SELECT STRING_AGG(SUBSTR('abcdef>', CEIL(RANDOM() * 7)::INTEGER, 1), '')
  FROM GENERATE_SERIES(1, 40)
$$;

I got the code from here and modified it so that it would produce strings with lots of > characters for testing purposes.

I then manually inserted a few strings at the beginning so that a quick look would tell me if the code was working as anticipated.

INSERT INTO s VALUES 
('afsad>adfsaf>asfasf>afasdX>asdffs>asfdf>'),
('23433>433453>4>4559>455>3433>'),
('adfd>adafs>afadsf>'),                           -- only 3 '>'s!
('babedacfab>feaefbf>fedabbcbbcdcfefefcfcd'),
('e>>>>>'),                                    -- edge case - multiple terminal '>'s
('aaaaaaa');                                   -- edge case - no '>'s whatsoever

The reason I put in the records with fewer than 4 >s is because the accepted answer (see discussion at the end of this answer) puts forward a solution which should return the entire string if this is the case!

On the fiddle, I then added 50,000 records as follows:

INSERT INTO s
SELECT f() FROM GENERATE_SERIES(1, 50000);

I also created a table s on a home laptop (16GB RAM, 500MB NVMe SSD) and populated it with 40,000,000 (50M) records - times also shown.

Now, my reading of the question is that we need to extract the string up to but not including the 4th > character.

The first solution (from treecon) was this one (I also show them running on the fiddle, but to save space here, I've only included the partial output of EXPLAIN (ANALYZE, BUFFERS, VERBOSE)) - the times shown are typical over a few runs:

EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT 
  ARRAY_TO_STRING((STRING_TO_ARRAY(a, '>'))[1:4], '>'),
  a
FROM s;

Result (only key parts included):

Seq Scan on public.s
Execution Time: 81.807 ms

40M Time:  46 seconds

A regex solution which works (significantly faster):

EXPLAIN (ANALYZE, BUFFERS, VERBOSE)    
SELECT
  SUBSTRING(a FROM '^(?:[^>]*>){0,3}[^>]*'),
  a
FROM s;

Result:

Seq Scan on public.s
Execution Time: 74.757 ms

40M Time:  32 seconds

The accepted answer fails on many levels (see the fiddle). It leaves a > at the end and fails on various strings even when modified. Also, the solution proposed to include strings with fewer than 4 >s (i.e. ^(([^>]*>){4}|.*)) merely returns the original string (see end of fiddle).

Vérace
  • 854
  • 10
  • 41