1

I was looking at this question and the accepted answer gives this as a solution for the case when there are fewer than n characters in the string:

^(([^>]*>){4}|.*)

However, I have done a fiddle here, and it shows that this regex will just simply return the entire string all of the time.

This code:

SELECT
  SUBSTRING(a FROM '^(([^>]*>){4}|.*)'),
  a,
  LENGTH(SUBSTRING(a FROM '^(([^>]*>){4}|.*)')),
  LENGTH(a),

  LENGTH(SUBSTRING(a FROM '^(([^>]*>){4}|.*)')) = LENGTH(a)
  
FROM s
WHERE   LENGTH(SUBSTRING(a FROM '^(([^>]*>){4}|.*)')) = LENGTH(a) IS false;

after several runs returns no records - meaning that the regex is doing nothing.

Question:

I would like a regex which returns up to the fourth > character (not including it) OR the entire string if the string only contains 3 or fewer > characters. RTRIM() can always be used to trim the final > if not including it is too tricky - having an answer which gives both possibilities would help me to deepen my understanding of regexes!

This is not a duplicate - it's certainly related, but I'd like to correct the error in the original answer - and provide a correct answer of my own.

Vérace
  • 854
  • 10
  • 41
  • An alternative solution without a regex: https://dbfiddle.uk/?rdbms=postgres_14&fiddle=21ede6b762854e40952648841a06484f –  Mar 08 '22 at 10:28
  • @a_horse_with_no_name - thanks for your input, but the original question contains a [simpler version](https://stackoverflow.com/a/65266216/470530) of this which works very well - if there are fewer than 4 `>`s, it includes the whole string. I have used CARDINALITY in the past and it can be very handy! – Vérace Mar 08 '22 at 10:55

2 Answers2

1

You can repeat matching 0-3 times including the > using

^(?:[^>]*>){0,3}[^>]*
  • ^ Start of string
  • (?:[^>]*>){0,3} Repeat 0 - 3 times matching any character except > and then match >
  • [^>]* Optionally match any char except >

See a regex demo.

If there should be at least a single > then the quantifier can be {1,3}

The fourth bird
  • 154,723
  • 16
  • 55
  • 70
  • Thanks for your answer. Would you care to explain the regex? Also, how would I modify it to include the final (i.e. 4th) `>` character? – Vérace Mar 08 '22 at 09:00
  • @Vérace You could include the 4th `>` character repeating the first part 0 - 4 times like `^(?:[^>]*>){0,4}` If the `>` is optional at the end `^(?:[^>\n]*>){0,3}[^>\n]*>?` – The fourth bird Mar 08 '22 at 09:03
1

You can use

REGEXP_REPLACE(a, '^((?:[^>]*>){4}).*', '\1') 

See the regex demo. Details:

  • ^ - start of string
  • ((?:[^>]*>){4}) - Group 1 (\1): four sequences of any chars other than > and then a > char
  • .* - the rest of the line.

Here is a test:

CREATE TABLE s
(
  a TEXT
);
INSERT INTO s VALUES 
('afsad>adfsaf>asfasf>afasdX>asdffs>asfdf>'),
('23433>433453>4>4559>455>3433>'),
('adfd>adafs>afadsf>');
SELECT REGEXP_REPLACE(a, '^((?:[^>]*>){4}).*', '\1') as Output FROM s;

Output:

enter image description here

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
  • Hi, and thanks for your input. I don't want to replace the last `>` - I want to `SELECT` the string - as per [here](https://dbfiddle.uk/?rdbms=postgres_14&fiddle=441449b6b9939a815656531d856de01e) (not really important - your regex works) - I'd just like to know how to include the final (4th) `>` character if I so wished? – Vérace Mar 08 '22 at 09:05
  • @Vérace It is there in the final output. If you want to *exclude* it, you would use `'^((?:[^>]*>){3}[^<]*).*'` – Wiktor Stribiżew Mar 08 '22 at 09:05
  • Not quite. It will include the final `>` if there are fewer than 4 `>`s - but if there are 4 `>`s, the regex removes it. I'd just like to know how to **keep** the final (4th) `>` if I wanted to? It would help me understand the subtleties of regexes. I'll be marking your answer as correct anyway, because you made the effort to explain it - I'd just appreciate the little extra snippet I've asked for? – Vérace Mar 08 '22 at 09:21
  • @Vérace I think you ask about `SELECT REGEXP_REPLACE(a, '^((?:[^>]*>){3}[^>]*).*', '\1') as output FROM s;`, right? See [this fiddle](https://dbfiddle.uk/?rdbms=postgres_14&fiddle=97f8ac38f2867eb1a7f7c1c7299d07f3). – Wiktor Stribiżew Mar 08 '22 at 09:24
  • In the fiddle, in the first record, say I want to keep the `>` after the X? – Vérace Mar 08 '22 at 09:26
  • @Vérace It is already there, the `>` after `X` is not removed, see the demo in my answer. – Wiktor Stribiżew Mar 08 '22 at 09:28
  • I don't understand. If I run `SELECT SUBSTRING(a, '^((?:[^>]*>){3}[^>]*).*') as Output FROM s;` I get `afsad>adfsaf>asfasf>afasdX` - I want `afsad>adfsaf>asfasf>afasdX>` - note `>` at end! See [here](https://dbfiddle.uk/?rdbms=postgres_14&fiddle=4d930f89d7992312c60d7ba4457fd6d4) - hope this is now clear? It's the cherry on the icing! :-) – Vérace Mar 08 '22 at 09:32
  • @Vérace `SUBSTRING` *extracts* the match. `REGEXP_REPLACE` *replaces* the found match(es) with replacement pattern. So, you'd just need `SUBSTRING(a, '^(?:[^>]*>){4}')` as 1) you are extracting, and 2) you need to match the fourth `>`. – Wiktor Stribiżew Mar 08 '22 at 09:34
  • Correct - I see what you've done. Thanks for your input and patience on this! I wanted to extract - not replace! BTW, it should be {0,4} or else strings with fewer than 4 `>` s come up as NULL with SUBSTRING! – Vérace Mar 08 '22 at 09:38