Hello StackOverflow community, kindly review the following screenshot: [image]

As you can see, I'm capturing everything between the <title> and </title> tags, but I want to avoid capturing any commas that might exist in the text.

Currently I get:

Kincrome K1500G - Tool Workshop Contour 472 Piece 15 Drawer 1/4", 3/8" &amp; 1/2" Drive Monster Green

What I want to get:

Kincrome K1500G - Tool Workshop Contour 472 Piece 15 Drawer 1/4" 3/8" &amp; 1/2" Drive Monster Green

I need a one-line regex that does that for me. Any ideas?

This is the regex command that I use:

(?<=<title\>)(.*?)(?=\s*\<)

Sample text is:

<title>Kincrome K1500G - Tool Workshop Contour 472 Piece 15 Drawer 1/4", 3/8" &amp; 1/2" Drive Monster Green</title>

I'm using Kantu Browser Automation to extract the title of some webpages. Bear in mind that I'm scraping the whole web page HTML.

If this is not possible, what about matching only up to the first comma, so that it returns this, for example:

Kincrome K1500G - Tool Workshop Contour 472 Piece 15 Drawer 1/4"

Thank you for your time.

Adriano
Jackknife
  • Can you post your input as text, not an image? (Text can't be copied from an image...) – CertainPerformance Aug 09 '18 at 00:51
  • The sample text is now posted as text, not only as an image. – Jackknife Aug 09 '18 at 00:57
  • Regular expressions can't do that. They either match or they don't, and they return whatever was matched. If you need to edit the result, you need to do that with something else that processes it. – Barmar Aug 09 '18 at 01:08
  • Hello Barmar, thank you for your reply. But what about the expression /[^,]/g? It matches everything except commas, right? Why can't I do the same, but only between two strings (in this case the title tags)? If it matches only up to the first comma, that's fine for me. – Jackknife Aug 09 '18 at 01:15
  • As @Barmar says, you can't do this with a one-line regex. You can match with one regex, and then replace to remove the commas with a second one, but you cannot do it with a single regex, which is your requirement as stated. – Ken White Aug 09 '18 at 01:22
  • Fair enough, what about matching until the first comma? – Jackknife Aug 09 '18 at 01:24
  • @Jackknife If you want to notify someone of a comment, put `@` before their name. – Barmar Aug 09 '18 at 01:26
  • You can use `/(?<=<title>)([^,]*?)/` to match everything until the first comma, but that doesn't seem like what you really need. You'll just get `Kincrome K1500G - Tool Workshop Contour 472 Piece 15 Drawer 1/4"` – Barmar Aug 09 '18 at 01:28
  • @Barmar when I try this line on https://regexr.com/ it doesn't capture anything. I'm using Chrome; where are you testing your regex lines? – Jackknife Aug 09 '18 at 01:33
  • @Barmar, using `(?<=<title>)([^,]*)` works! I know it's not what I really need, but it's better to have something than to let a comma ruin my whole CSV file. I can get by with your tip. Thank you – Jackknife Aug 09 '18 at 01:35
  • You could capture the parts between commas if the regex engine you use supports the `\G` construct (start exactly where the last match left off), then join the array items. Or you could just match the whole thing, then strip out the commas. Gee, what sounds easiest? –  Aug 09 '18 at 01:37
  • If you're putting this into a CSV file, the double quotes might also cause problems. – Barmar Aug 09 '18 at 01:40
  • @sln The problem is that I can't reprocess the line after capturing if it has double quotes (I get string errors in Kantu like crazy). Kantu is not very robust at dealing with strings, you see; that's why I needed a one-liner – Jackknife Aug 09 '18 at 01:45
  • @Barmar I thought it might cause problems, but I've checked several times and only the commas are messing things up. – Jackknife Aug 09 '18 at 01:46
  • @Jackknife - Sigh, everything I said went right over your head.. amazing. –  Aug 09 '18 at 01:49
  • @sln mind elaborating on what exactly went over my head, or are you here just for trolling? – Jackknife Aug 09 '18 at 01:53
  • Sure bud, `you could just match the whole thing, then strip out the commas` went right over your head. –  Aug 10 '18 at 15:13
  • @sln maybe my reply went over your head. I understood what you wrote; my reply was `I can't reprocess the line after capturing if it has double quotes (I get string errors in Kantu)`. Have another read, would you? What you proposed is the most obvious thing to do, and I had tried it many times even before your suggestion, but I just can't do it, for many reasons too long to explain here. I even wrote that that's the reason I needed the one-liner... Please make sure you understand the replies before replying in an arrogant way, it's embarrassing – Jackknife Aug 13 '18 at 01:21
  • See this `I can't reprocess the line after capturing if it has double quotes` means jack buddy. Anything that is within a variable is not subject to parsing from an IDE. It's a problem only you possess the tribal knowledge .. like only you have insight into whatever junk you're using. If you can't do anything with a variable because it's not a language, then say so. But stop the BS about quotes being in a variable .. –  Aug 13 '18 at 23:54
  • @sln what part of `I can't reprocess the line after capturing if it has double quotes` did you not get? It's simple: I just can't, that's it. Even if you keep insisting, it's not magically going to happen, buddy. In my question I clearly stated I needed a one-liner; that's the condition for the solution and there is a good reason for it. It's too long to explain why, and even if I did, you wouldn't care, so why bother? You call it BS just because you don't have any idea. Inform yourself before posting; this is just wasteful, not contributing anything to the answer, just to prove your little point. – Jackknife Aug 14 '18 at 06:48
  • @Jackknife - I guess it just went right over your head bud. –  Aug 15 '18 at 00:40
  • @sln sure thing bud, have it your way. have a nice day – Jackknife Aug 16 '18 at 01:02
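
The two-step approach suggested in the comments (match the whole title, then strip the commas in a second pass) can be sketched as follows. This is only an illustration using Python's `re` module, not Kantu syntax; the variable names are made up for the example, and it assumes the tool allows a second processing step, which the asker says Kantu does not handle well.

```python
import re

# Sample text from the question.
html = ('<title>Kincrome K1500G - Tool Workshop Contour 472 Piece 15 Drawer '
        '1/4", 3/8" &amp; 1/2" Drive Monster Green</title>')

# Step 1: match everything between the <title> tags
# (the question's pattern, minus the unneeded backslashes).
match = re.search(r'(?<=<title>)(.*?)(?=\s*<)', html)
title = match.group(1) if match else ''

# Step 2: remove the commas in a separate replacement pass.
title_no_commas = title.replace(',', '')
print(title_no_commas)
```

The regex itself only selects text; the comma removal happens in the plain string replacement afterwards, which is why it cannot be folded into a single match.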

1 Answer

As mentioned in the comments, a regular expression can't alter the text that was matched; it just matches something or not.

If you're willing to stop the match at the first comma, rather than including all the rest with the commas removed, you can use this:

(?<=<title\>)(.*?)(?=(,|\s*<\/title>))

https://regex101.com/r/PPb1ba/1
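
As a quick check of the pattern above, here is a minimal sketch in Python's `re` module (chosen only for illustration; Kantu runs the regex in the browser's engine). The helper name `title_up_to_first_comma` is hypothetical, and the backslashes before `>` and `/` are dropped because Python does not need them.

```python
import re

# The answer's pattern: stop at the first comma, or at the closing tag
# (ignoring trailing whitespace) when the title contains no comma.
TITLE_UP_TO_COMMA = re.compile(r'(?<=<title>)(.*?)(?=(,|\s*</title>))')

def title_up_to_first_comma(html):
    """Hypothetical helper: return the title text before the first comma, or None."""
    m = TITLE_UP_TO_COMMA.search(html)
    return m.group(1) if m else None

sample = ('<title>Kincrome K1500G - Tool Workshop Contour 472 Piece 15 Drawer '
          '1/4", 3/8" &amp; 1/2" Drive Monster Green</title>')

print(title_up_to_first_comma(sample))
print(title_up_to_first_comma('<title>Plain title </title>'))
```

For the sample text this yields the title up to `1/4"`, and for a title without commas it yields the whole title with trailing whitespace left out.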

Barmar