I hope everyone is having a blast. I am facing this challenge:

I want to extract one portion of a string in the following manner:

  1. The string may contain no dots, one dot, or several
  2. I want to extract the part of the string before the first dot; if there is no dot, I want the whole string
  3. I want to use a regex to achieve this
    test<-c("This_This-This.Not This",
            "This_This-This.not_.this",
            "This_This-This",
            "this",
            "this.Not This")

Since I need to use a regex, I have been trying this expression:

str_match(test,"(^[a-zA-Z].+)[\\.\\b]?")[,2]

but what I get is:

> str_match(test,"(^[a-zA-Z].+)[\\.\\b]?")[,2]
[1] "This_This-This.Not This"  "This_This-This.not_.this"
[3] "This_This-This"           "this"
[5] "this.Not This"
> 

My desired output is:

"This_This-This"
"This_This-This"
"This_This-This"
"this"
"this"

This is my thought process behind the regex

str_match(test,"(^[a-zA-Z].+)[\\.\\b]?")[,2]

(^[a-zA-Z].+) = the capture group for the part before the dot. The string always starts with an uppercase or lowercase letter, followed by any other characters, hence the .+

[\.\b]? = a dot or a word boundary that may or may not be present, hence the ?

This is not giving me what I want. I would be very grateful if you could help me understand my mistake here. Thank you so much!

R_Student

2 Answers

Actually, rather than extracting, a regex replacement should work well here:

test <- c("This_This-This.Not This",
          "This_This-This.not_.this",
          "This_This-This",
          "this",
          "this.Not This")
output <- sub("\\..*", "", test)
output

[1] "This_This-This" "This_This-This" "This_This-This" "this"          
[5] "this"

Replacement works well here because it no-ops for any input not having any dots, in which case the original string is returned.
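
As a minimal sketch of that no-op behaviour, the replacement can be compared with a direct extraction in base R. The extraction pattern `^[^.]*` (a leading run of non-dot characters) and the names `by_replacement` and `by_extraction` are assumptions added for illustration; they are not part of the original answer.

```r
test <- c("This_This-This.Not This",
          "This_This-This.not_.this",
          "This_This-This",
          "this",
          "this.Not This")

# Replacement: delete the first dot and everything after it.
# Strings without a dot have nothing to replace and come back unchanged.
by_replacement <- sub("\\..*", "", test)

# Extraction: pull out the leading run of characters that are not dots.
by_extraction <- regmatches(test, regexpr("^[^.]*", test))

print(by_replacement)
print(identical(by_replacement, by_extraction))
```

Note that regmatches() drops elements that have no match at all. The two results line up here only because `^[^.]*` can match an empty string and therefore always matches.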

Tim Biegeleisen

My regex is "match anything up to either a dot or the end of the line".

library(stringr)
str_match(test, "^(.*?)(\\.|$)")[, 2]

Result:

[1] "This_This-This" "This_This-This" "This_This-This" "this"          
[5] "this"
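
To see what the lazy quantifier buys here, a small sketch can compare the full match of a greedy and a lazy version of the pattern on one of the test strings. It uses base R's regmatches() and regexpr() with perl = TRUE instead of stringr, purely to keep the example dependency-free; the variable names are illustrative.

```r
x <- "This_This-This.not_.this"

# Greedy: .* first grabs the whole string, and "$" then succeeds at the end.
greedy <- regmatches(x, regexpr("^.*(\\.|$)", x, perl = TRUE))

# Lazy: .*? grows one character at a time and stops at the first dot.
lazy <- regmatches(x, regexpr("^.*?(\\.|$)", x, perl = TRUE))

print(greedy)  # the whole string
print(lazy)    # "This_This-This." (the dot is part of the full match)

# The first capture group excludes the dot, so it is the group that is wanted,
# not the full match.
group1 <- sub("^(.*?)(\\.|$).*", "\\1", x, perl = TRUE)
print(group1)
```

The full lazy match still includes the dot, which is why the stringr call takes column 2 (the first capture group) rather than column 1.
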
neilfws
  • neilfws, thank you so much! I have a question while I wrap my head around this: if I use str_match(test, "^(.*?)(\\.|$)") it works perfectly, but if I use str_match(test, "^(.*?)(\\.|\\b)") it returns "", "", "", "", "". Isn't the latter telling R to look for a dot or a word boundary, whichever happens first? Also, please let me know what the ? after .* is for. My understanding is that .* means match any character (.) zero or more times (*), but ? means zero or one time, so I don't get it. I would be very grateful if you could explain. Thank you in advance from the bottom of my heart! – R_Student Oct 13 '22 at 02:57
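
A sketch of what seems to be happening with the word boundary, using base R's sub() with perl = TRUE as a stand-in for str_match(). That substitution is an assumption made to avoid the stringr dependency; stringr uses the ICU engine, but both treat `\\b` outside a character class as a zero-width word boundary. Such a boundary also exists at the very start of a string that begins with a letter, so the lazy group can stop immediately with nothing captured. As for the `?`: placed directly after another quantifier such as `*`, it does not mean "zero or one" but makes that quantifier lazy, so it matches as little as possible.

```r
test <- c("This_This-This.Not This",
          "This_This-This.not_.this",
          "This_This-This",
          "this",
          "this.Not This")

# "\\b" matches at position 0 (before the first letter), so the lazy
# group (.*?) is allowed to stay empty for every string.
with_boundary <- sub("^(.*?)(\\.|\\b).*", "\\1", test, perl = TRUE)
print(with_boundary)

# Without an alternative that can succeed at position 0, the group has to
# grow until it reaches the first dot.
dot_only <- sub("^(.*?)\\..*", "\\1", test[1], perl = TRUE)
print(dot_only)
```

Using `$` instead of `\\b` works because the end of the string is the only place it can match, so the lazy group has to keep growing until it meets a dot or runs out of characters.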