I have a character variable (companies) with observations that look like this:

  1. "612. Grt. Am. Mgt. & Inv. 7.33"
  2. "77. Wickes 4.61"
  3. "265. Wang Labs 8.75"
  4. "9. CrossLand Savings 6.32"
  5. "228. JPS Textile Group 2.00"

I'm trying to split these strings into 3 parts:

  1. all the digits before the first "." ,
  2. everything between the first "." and the next number (consistently formatted #.##), and
  3. that last number itself (format #.##).

Using the first obs as an example, I'd like: "612", "Grt. Am. Mgt. & Inv.", "7.33"

I've tried defining the pattern in rebus and using str_match, but the code below only works on cases like obs #2 and #3. It doesn't cover all the variation in the middle part of the string, so it misses the other obs.

pattern2 <- capture(one_or_more(DGT)) %R% DOT %R% SPC %R% 
            capture(or(one_or_more(WRD), one_or_more(WRD) %R% SPC 
            %R% one_or_more(WRD))) %R% SPC %R% capture(DGT %R% DOT 
            %R% one_or_more(DGT))

str_match(companies, pattern = pattern2)

Is there a better way to split the strings into these 3 parts?

I'm not familiar with regex, but I've seen that suggested here a lot (I'm brand new to R and Stack Overflow)

Nina

5 Answers

You can use a regex to rewrite each string with a delimiter between the three parts, and then split on that delimiter to get your results:

delimitedString = gsub("^([0-9]+)\\. (.*) ([0-9.]+)$", "\\1,\\2,\\3", companies)

do.call( 'rbind', strsplit(split = ",", x = delimitedString) )
#      [,1]  [,2]                   [,3]  
#[1,] "612" "Grt. Am. Mgt. & Inv." "7.33"
#[2,] "77"  "Wickes"               "4.61"
#[3,] "265" "Wang Labs"            "8.75"
#[4,] "9"   "CrossLand Savings"    "6.32"
#[5,] "228" "JPS Textile Group"    "2.00" 

Regex explanation:

  • ^[0-9]+ : one or more digits (0 to 9) at the beginning (i.e. ^) of your string
  • .* : a greedy match of any characters; in the case above, everything between the space after the first dot and the last space
  • [0-9.]+$ : one or more digits or dots at the end (i.e. $) of your string

Parentheses mark the parts of the string I want to capture. The captured substrings are then pasted back together, delimited by commas. Finally, we split each string with the strsplit function and bind the rows with do.call.
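The comma-delimited intermediate string assumes that no company name contains a comma. As a hedged alternative, the sketch below swaps in base R's regexec() and regmatches() to pull the capture groups out directly, with no delimiter involved. It is not the method used above; the names pattern, matches and parts are illustrative only.

```r
companies <- c("612. Grt. Am. Mgt. & Inv. 7.33", "77. Wickes 4.61",
               "265. Wang Labs 8.75", "9. CrossLand Savings 6.32",
               "228. JPS Textile Group 2.00")

# Same three groups as above, with the first dot escaped
pattern <- "^([0-9]+)\\. (.*) ([0-9.]+)$"

# regexec() locates the full match and each capture group;
# regmatches() returns them as one character vector per input string
matches <- regmatches(companies, regexec(pattern, companies))

# Stack the vectors into a matrix and drop column 1 (the full match)
parts <- do.call(rbind, matches)[, -1, drop = FALSE]
colnames(parts) <- c("id", "name", "value")
print(parts)
```

A string that does not match the pattern yields an empty vector here, so rbind would silently skip it; checking lengths(matches) first is a sensible guard on real data.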

Ulises Rosas-Puchuri

Instead of splitting the text, you can match the information using a grouping regex and extract the information from three groups you want. Try using this regex,

(.+?)\.\s+(.+)\s+(\d+\.\d+)

This captures your information in three groups: group 1 is the number before the company name, group 2 is the company name, and group 3 is the last number, of the form #.##.

Check this R code:

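library(stringr)  # provides str_match

companies = c("612. Grt. Am. Mgt. & Inv. 7.33")
result <- str_match(companies, pattern = "(.+?)\\.\\s+(.+)\\s+(\\d+\\.\\d+)")
result[,2]  # group 1: leading number
result[,3]  # group 2: company name
result[,4]  # group 3: trailing #.## number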

Prints,

[1] "612"
[1] "Grt. Am. Mgt. & Inv."
[1] "7.33"
Pushpesh Kumar Rajwanshi

Use the following regex:

^(.*?)\.(.*?)(?=\d)(.*)$

The three capturing groups contain the desired information: the first group captures everything until it finds the first '.', the second group captures everything until it finds a digit (this is done via positive lookahead, which ensures that the digit isn't consumed since we need to capture it in the next group), and the third group captures everything till the end.
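To try this in R, here is a minimal sketch. It assumes base R with perl = TRUE, since the lookahead (?=\d) needs a PCRE-style engine, and it trims the whitespace that group 2 picks up on both sides. The variable names are illustrative, and the pattern relies on the company name containing no digits.

```r
companies <- c("612. Grt. Am. Mgt. & Inv. 7.33", "77. Wickes 4.61",
               "265. Wang Labs 8.75", "9. CrossLand Savings 6.32",
               "228. JPS Textile Group 2.00")

# Backslashes are doubled because this is an R string literal
pattern <- "^(.*?)\\.(.*?)(?=\\d)(.*)$"

# perl = TRUE enables the lookahead syntax
m <- regmatches(companies, regexec(pattern, companies, perl = TRUE))

# One row per string; column 1 (the full match) is dropped
parts <- do.call(rbind, m)[, -1, drop = FALSE]

# Group 2 keeps the spaces around the name, so trim every cell
parts[] <- trimws(parts)
print(parts)
```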

CinCout

You might use 3 capturing groups:

([^.]+)\.\s+(\D+)\s+(\d\.\d{2})

For example

library(stringr)  # provides str_match

companies=c("612. Grt. Am. Mgt. & Inv. 7.33")
pattern="([^.]+)\\.\\s+(\\D+)\\s+(\\d\\.\\d{2})"
str_match(companies, pattern)

Result

     [,1]                             [,2]  [,3]                   [,4]  
[1,] "612. Grt. Am. Mgt. & Inv. 7.33" "612" "Grt. Am. Mgt. & Inv." "7.33"

Explanation

  • ([^.]+) Group 1: one or more characters other than a dot (to exclude newlines as well, use [^.\r\n]+ )
  • \.\s+ Match a dot followed by one or more whitespace characters
  • (\D+) Group 2: one or more non-digit characters
  • \s+ Match one or more whitespace characters
  • (\d\.\d{2}) Group 3: a digit, a dot and 2 digits (format #.##)
The fourth bird

You should be able to debug the regex you wrote.

> as.regex(pattern2)
<regex> ([\d]+)\.\s((?:[\w]+|[\w]+\s[\w]+))\s(\d\.[\d]+)

Plug it in at regex101 and you will see that your strings do not always match. The explanation on the right tells you that you only allow 1 or 2 space-separated words between the dot and the number. Also, WRD (the [\w]+ pattern) only matches letters, digits and _, so it cannot match dots or other characters such as &. So you need to match your strings with

^(\d+)\.(.*?)\s*(\d\.\d{2})$
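The same comparison can be run without leaving R. The sketch below is only an illustration in base R: it assumes the five sample strings from the question, and it copies the regex printed by as.regex(pattern2) and the revised one above into string literals with doubled backslashes. The names old_regex and new_regex are made up here.

```r
companies <- c("612. Grt. Am. Mgt. & Inv. 7.33", "77. Wickes 4.61",
               "265. Wang Labs 8.75", "9. CrossLand Savings 6.32",
               "228. JPS Textile Group 2.00")

# Regex generated from the original rebus pattern
old_regex <- "([\\d]+)\\.\\s((?:[\\w]+|[\\w]+\\s[\\w]+))\\s(\\d\\.[\\d]+)"
# Revised regex
new_regex <- "^(\\d+)\\.(.*?)\\s*(\\d\\.\\d{2})$"

# grepl() reports, per string, whether the regex finds a match
old_ok <- grepl(old_regex, companies, perl = TRUE)
new_ok <- grepl(new_regex, companies, perl = TRUE)

print(data.frame(string = companies, old = old_ok, new = new_ok))
```

Rows where old is FALSE are the strings whose middle part has more than two words or contains characters outside \w.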

Translated into rebus:

library(rebus)

pattern2 <- START %R%            # ^ - start of string
 capture(one_or_more(DGT)) %R%   # (\d+) - Group 1: one or more digits
 DOT %R%                         # \. - a dot
 "(.*?)" %R%                     # (.*?) - Group 2: any 0+ chars as few as possible
 zero_or_more(SPC) %R%           # \s* - 0+ whitespaces 
 capture(DGT %R% DOT %R% repeated(DGT, 2)) %R% # (\d\.\d{2}) - Group 3: #.## number
 END                             # $ - end of string

Checking:

> pattern2
<regex> ^([\d]+)\.(.*?)[\s]*(\d\.[\d]{2})$

> companies <- c("612. Grt. Am. Mgt. & Inv. 7.33","77. Wickes 4.61","265. Wang Labs 8.75","9. CrossLand Savings 6.32","228. JPS Textile Group 2.00")
> str_match(companies, pattern = pattern2)
     [,1]                             [,2]  [,3]                    [,4]  
[1,] "612. Grt. Am. Mgt. & Inv. 7.33" "612" " Grt. Am. Mgt. & Inv." "7.33"
[2,] "77. Wickes 4.61"                "77"  " Wickes"               "4.61"
[3,] "265. Wang Labs 8.75"            "265" " Wang Labs"            "8.75"
[4,] "9. CrossLand Savings 6.32"      "9"   " CrossLand Savings"    "6.32"
[5,] "228. JPS Textile Group 2.00"    "228" " JPS Textile Group"    "2.00"
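Note that column 3 above keeps the space that follows the first dot. One possible way to drop it, sketched here as a plain regex string in base R rather than rebus, is to consume optional whitespace right after the dot. The data frame layout and the column names id, name and value are assumptions made for illustration.

```r
companies <- c("612. Grt. Am. Mgt. & Inv. 7.33", "77. Wickes 4.61",
               "265. Wang Labs 8.75", "9. CrossLand Savings 6.32",
               "228. JPS Textile Group 2.00")

# Same regex as above, plus \s* after the dot so group 2 starts at the name
pattern <- "^(\\d+)\\.\\s*(.*?)\\s*(\\d\\.\\d{2})$"

# Column 1 is the full match; columns 2 to 4 are the capture groups
m <- do.call(rbind, regmatches(companies,
                               regexec(pattern, companies, perl = TRUE)))

# Convert the numeric parts so they can be used in calculations
result <- data.frame(id    = as.integer(m[, 2]),
                     name  = m[, 3],
                     value = as.numeric(m[, 4]),
                     stringsAsFactors = FALSE)
print(result)
```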

WARNING: capture(lazy(zero_or_more(ANY_CHAR))) returns the pattern ([.]*?), which matches 0 or more literal dots as few as possible instead of any 0+ chars. This is a rebus bug: it wraps every repeated (one_or_more or zero_or_more) char in [ and ], a character class, and inside a character class . is a literal dot. That is why (.*?) is added "manually".

This can be resolved, or worked around, using a common construct like [\w\W] / [\s\S] or [\d\D]:

pattern2 <- START %R%                          # ^ - start of string
 capture(one_or_more(DGT)) %R%                 # (\d+) - Group 1: one or more digits
 DOT %R%                                       # \. - a dot
 capture(                                      # Group 2 start:
  lazy(zero_or_more(char_class(WRD, NOT_WRD))) #  - [\w\W] - any 0+ chars as few as possible
 ) %R%                                         # End of Group 2
 zero_or_more(SPC) %R%                         # \s* - 0+ whitespaces 
 capture(DGT %R% DOT %R% repeated(DGT, 2)) %R% # (\d\.\d{2}) - Group 3: #.## number
END

Check:

> as.regex(pattern2)
<regex> ^([\d]+)\.([\w\W]*?)[\s]*(\d\.[\d]{2})$

Wiktor Stribiżew
  • Thanks for this super detailed explanation! It worked well, and I learned a lot from your comments. – Nina Feb 19 '19 at 18:03