0

Products have names like

Product one white Adidas
Other product black Hill sheet
Nice T-shirt blue Brower company

How to get starting part of product name before first uppercase word starting from second word and up to first uppercase word excluding word T-shirt. Result from strings above should be

Product one white
Other product black
Nice T-shirt blue

Using Bohemian answer to question

Substring before first uppecase word excluding first word

regexp_replace('Nice T-shirt blue Brower company', '(?<!^)\m[A-ZÕÄÖÜŠŽ].*', '')

returns wrong result

Nice

How to modify regex so that it returns

Nice T-shirt blue

Using Postgres 12

Andrus
  • 26,339
  • 60
  • 204
  • 378

3 Answers3

1

This regex works for your test cases:

^[A-Z][a-z ]*(T-shirt)?[a-z ]*

Explanation:

^: Start of line

[A-Z]: Any capital letter

[a-z ]*: zero or more characters that are either a lowercase letter or space

(T-shirt)?: The phrase T-shirt 0 or 1 times

[a-z ]*: same again

dc-ddfe
  • 487
  • 1
  • 11
  • I added national letters to regex: ^[A-ZÕÄÖÜŠŽ][a-zõäöüšž ]*(T-shirt)?[a-zõäöüšž ]* . Is this ok? Words may contain also some other characters like - / and + . Should all possible characters listed separately in regex? – Andrus Dec 06 '22 at 22:15
  • 1
    Please test on your data, and make another comment if it doesn't work. It's difficult to fix issues when 1, I don't know if it's a problem in the first place, and 2, I don't have sample inputs to check. – dc-ddfe Dec 06 '22 at 22:28
  • regexp_replace('Shoes Attalla', '^[A-Z][a-z ]*(T-shirt)?[a-z ]*', '') returns **Attalla** but should return Shoes. It looks like this is totally wrong. – Andrus Dec 07 '22 at 22:29
  • 1
    That's because you haven't applied the `regexp_replace` correctly. The regex I've provided selects the part of the next you want to keep, not the text to be removed. To fix that, run `regexp_replace('Shoes Attalla', '^([A-Z][a-z ]*(T-shirt)?[a-z ]*).*$', '$1')`. (To briefly explain, capture the part you need to keep, and use `.*$` to consume the rest of the line. Then just replace with the first capture group) – dc-ddfe Dec 07 '22 at 23:13
1

Use a negative look ahead:

select regexp_replace('Nice T-shirt blue Brower company', '(?<!^)\m(?!T-shirt)[A-ZÕÄÖÜŠŽ].*', '')

See live demo.

(?!T-shirt) means the following characters must not be 'T-shirt'

You can add other capitalised terms to ignore:

(?!T-shirt|BB gun|BBQ tongs)
Bohemian
  • 412,405
  • 93
  • 575
  • 722
0

Instead of doing a direct match, you can simply remove the part of the string (in this case from the last uppercase word to the end) that you do not need:

select regexp_replace(name, '(?<=\s)[A-Z]+[a-zA-Z\s]+$', '') from tbl

See fiddle.

Ajax1234
  • 69,937
  • 8
  • 61
  • 102
  • This returns wrong result if there are no oper uppercase words. For Nice T-shirt blue brower company it returns Nice but correct result is whole string Nice T-shirt blue brower company – Andrus Dec 06 '22 at 22:12
  • It returns incorrect result for Product ONE white Adidas. Correct result is **Product** but it returns **Product ONE white** – Andrus Dec 06 '22 at 22:40
  • @Andrus Please see my recent edit. Also, as dc-ddfe pointed out, consider posting your *all* your sample inputs. – Ajax1234 Dec 06 '22 at 22:52