I have some text like the following:

foo_text <- c(
  "73000 PARIS   74000 LYON",
  "75 000 MARSEILLE 68483 LILLE",
  "60  MARSEILLE 68483 LILLE"
)

I'd like to separate each element in two after the first word. Expected output:

"73000 PARIS" "74000 LYON" "75000 MARSEILLE" "68483 LILLE" "60 MARSEILLE" "68483 LILLE"

Note that the number of spaces between two elements in the original text is not always the same (e.g., the number of spaces between PARIS and 74000 differs from the number between MARSEILLE and 68483). Also, the first number sometimes contains a space (e.g., 75 000) and sometimes does not (e.g., 73000).

I tried to adapt this answer but without success:

(delimitedString = gsub( "^([a-z]+) (.*) ([a-z]+)$", "\\1,\\2", foo_text))

Any idea how to do that?

maydin
bretauv

4 Answers

We can try using strsplit here as follows:

foo_text <- c(
    "73000 PARIS   74000 LYON",
    "75 000 MARSEILLE 68483 LILLE",
    "60  MARSEILLE 68483 LILLE"
)
output <- unlist(strsplit(foo_text, "(?<=[A-Z])\\s+(?=\\d)", perl=TRUE))
output

[1] "73000 PARIS"      "74000 LYON"       "75 000 MARSEILLE" "68483 LILLE"
[5] "60  MARSEILLE"    "68483 LILLE"

The regex pattern used here says to split when:

(?<=[A-Z])  what precedes is an uppercase letter
\\s+        split (and consume) on one or more whitespace characters
(?=\\d)     what follows is a digit
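
The split above keeps the inner spaces ("75 000", "60  MARSEILLE"), whereas the expected output in the question has them cleaned up. A minimal sketch of one way to finish the job in base R, assuming that whitespace between two digits can always be dropped and that any other run of whitespace should become a single space (the variable name parts is only illustrative):

```r
foo_text <- c(
  "73000 PARIS   74000 LYON",
  "75 000 MARSEILLE 68483 LILLE",
  "60  MARSEILLE 68483 LILLE"
)

# Split where whitespace sits between an uppercase letter and a digit
parts <- unlist(strsplit(foo_text, "(?<=[A-Z])\\s+(?=\\d)", perl = TRUE))

# Assumption: whitespace between two digits belongs to one number ("75 000" -> "75000")
parts <- gsub("(?<=\\d)\\s+(?=\\d)", "", parts, perl = TRUE)

# Collapse any remaining run of whitespace to a single space
parts <- gsub("\\s+", " ", parts, perl = TRUE)

parts
# [1] "73000 PARIS"     "74000 LYON"      "75000 MARSEILLE" "68483 LILLE"
# [5] "60 MARSEILLE"    "68483 LILLE"
```

The order matters here: the split has to happen before the digit-joining step, otherwise nothing changes for this data but inputs such as "PARIS 75 000" would still be handled consistently only after splitting.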
Tim Biegeleisen

Another possible solution, based on tidyverse:

library(tidyverse) 

foo_text <- c(
  "73000 PARIS   74000 LYON",
  "75 000 MARSEILLE 68483 LILLE",
  "60  MARSEILLE 68483 LILLE"
)

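foo_text %>% 
  str_split("(?<=[:alpha:])\\s+(?=\\d)") %>%  # split between a letter and a digit
  unlist() %>%                                 # flatten the list into a character vector
  str_remove_all("(?<=\\d)\\s+(?=\\d)")        # drop spaces inside numbers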

#> [1] "73000 PARIS"     "74000 LYON"      "75000 MARSEILLE" "68483 LILLE"    
#> [5] "60  MARSEILLE"   "68483 LILLE"
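
The fifth element above still contains a double space ("60  MARSEILLE"), while the question asks for "60 MARSEILLE". A hedged sketch of one possible extra step, using str_squish() from stringr to collapse repeated whitespace; it is written with intermediate assignments rather than a pipe, and the name parts is only illustrative:

```r
library(stringr)

foo_text <- c(
  "73000 PARIS   74000 LYON",
  "75 000 MARSEILLE 68483 LILLE",
  "60  MARSEILLE 68483 LILLE"
)

# Split between a letter and a digit, then flatten to a character vector
parts <- unlist(str_split(foo_text, "(?<=[:alpha:])\\s+(?=\\d)"))

# Drop whitespace that sits between two digits ("75 000" -> "75000")
parts <- str_remove_all(parts, "(?<=\\d)\\s+(?=\\d)")

# Trim the ends and collapse inner runs of whitespace to one space
parts <- str_squish(parts)

parts
#> [1] "73000 PARIS"     "74000 LYON"      "75000 MARSEILLE" "68483 LILLE"
#> [5] "60 MARSEILLE"    "68483 LILLE"
```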
PaulS

The pattern ^([a-z]+) (.*) ([a-z]+)$ that you pass to gsub is anchored to both ends of the string and requires one or more lowercase letters [a-z] at the start and at the end. It makes no allowance for digits, and because of the anchors it can match at most once per string, so it cannot pick out several parts of the same string.
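
This is easy to check: when a pattern matches nothing, gsub returns its input unchanged. A small sketch using the data from the question (the names attempt and foo_text are just for illustration):

```r
foo_text <- c(
  "73000 PARIS   74000 LYON",
  "75 000 MARSEILLE 68483 LILLE",
  "60  MARSEILLE 68483 LILLE"
)

# Every string starts with a digit, so ^([a-z]+) already fails at position 1
attempt <- gsub("^([a-z]+) (.*) ([a-z]+)$", "\\1,\\2", foo_text)

# TRUE: no element matched, so nothing was replaced
identical(attempt, foo_text)
```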

For your example data, you could instead match each part directly: a digit followed by any further digits and whitespace, then one or more whitespace-separated words that contain no digits.

library(stringr)
s <- c(
  "73000 PARIS   74000 LYON",
  "75 000 MARSEILLE 68483 LILLE",
  "60  MARSEILLE 68483 LILLE"
)
unlist(str_match_all(s, "\\b\\d[\\d\\s]*(?:\\s+[^\\d\\s]+)+"))

Output

[1] "73000 PARIS"      "74000 LYON"       "75 000 MARSEILLE" "68483 LILLE"     
[5] "60  MARSEILLE"    "68483 LILLE" 

The fourth bird

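Here are some other base R options. Both replace the whitespace between a non-digit and a digit with a newline and then read the result back one line at a time: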

> scan(text = gsub("(?<=\\D)\\s+(?=\\d)", "\n", foo_text, perl = TRUE), sep = "\n", what = "character")
Read 6 items
[1] "73000 PARIS"      "74000 LYON"       "75 000 MARSEILLE" "68483 LILLE"
[5] "60  MARSEILLE"    "68483 LILLE"

> read.delim2(text = gsub("(?<=\\D)\\s+(?=\\d)", "\n", foo_text, perl = TRUE), header = FALSE)$V1
[1] "73000 PARIS"      "74000 LYON"       "75 000 MARSEILLE" "68483 LILLE"
[5] "60  MARSEILLE"    "68483 LILLE"
ThomasIsCoding