Split string without delimeters

Question

I have df which contains a column with metro lines. The problem is that L9N and L10N appear sometimes in the same row and I want to split them into two different rows. I have been trying many things but I can't figure out how to do it.

Type	Lines	Year
METRO	L5	1959
METRO	L5	1959
METRO	L5	1959
METRO	L9NL10N	2009
METRO	L9S	2016
METRO	L10S	2018
METRO	L10N	2010
METRO	L4	1926
METRO	L1	1926
METRO	L1	1926

Apart from "L9NL10N", are there other "Lines" with such issues? So that we can post more generalised solutions. Do metro line names always start with letter "L"? — zx8754, Apr 01 '22 at 08:21
All lines start with "L" and only lines 9 and 10 have such issues, but it appears in two cases: "L9NL10N" and "L9SL10S" — Berta_94, Apr 01 '22 at 08:57

Wimpel · Answer 1 · 2022-04-01T09:05:11.773

4

A data.table oneliner.
It uses a lookbehind-lookahead regex "(?<=\[NS])(?=L)" to identify splitting points (between capital N (or S) and capital L), and then splits those rows, keeping the delimeter.

library(data.table)
setDT(mydata)[, .(Lines = unlist(tstrsplit(Lines, "(?<=[NS])(?=L)", perl = TRUE))), by = .(Type, Year)][]

#     Type Year Lines
# 1: METRO 1959    L5
# 2: METRO 1959    L5
# 3: METRO 1959    L5
# 4: METRO 2009   L9N
# 5: METRO 2009  L10N
# 6: METRO 2016   L9S
# 7: METRO 2018  L10S
# 8: METRO 2010  L10N
# 9: METRO 1926    L4
#10: METRO 1926    L1
#11: METRO 1926    L1

edit:
Use regex "(?<=[A-Z])(?=L)" for splitting after any capital letter [A-Z], followed by the capital letter L.

edited Apr 01 '22 at 09:05

answered Apr 01 '22 at 08:20

Wimpel

26,031
1
20
37

updated code, to the S followed by an L is also a delimeter. – Wimpel Apr 01 '22 at 09:02
OP clarified it always starts with "L", so we can just look for L and split: `strsplit(c("L5", "L9NL10N", "L9SL10S"), '(?<=.)(?=[L])', perl = TRUE)` – zx8754 Apr 01 '22 at 09:04
That is correct... but what if a line also ends with an L , like `L9LL10N` (ok, now I'm just making up usecases ;-) ) – Wimpel Apr 01 '22 at 09:07
1

I think @jpiversen guessed it right, last letter is Cardinal directions = N,E,S,W. So metro going Louth is unlikely on Earth. :D – zx8754 Apr 01 '22 at 09:17

score 2 · Answer 2 · answered Apr 01 '22 at 08:30

Here is a tidy and efficient way for data.frames:

library(dplyr)

df %>% 
  mutate(Lines = stringr::str_extract_all(Lines, "L\\d*([NSEW]?)")) %>% 
  tidyr::unnest(Lines)

#> # A tibble: 11 × 3
#>    Type  Lines  Year
#>    <chr> <chr> <dbl>
#>  1 METRO L5     1959
#>  2 METRO L5     1959
#>  3 METRO L5     1959
#>  4 METRO L9N    2009
#>  5 METRO L10N   2009
#>  6 METRO L9S    2016
#>  7 METRO L10S   2018
#>  8 METRO L10N   2010
#>  9 METRO L4     1926
#> 10 METRO L1     1926
#> 11 METRO L1     1926

^{Created on 2022-04-01 by the reprex package (v2.0.1)}

It will work for any duplicated line following the pattern: L <som number> <possible one of N, S, E or W>.

Data

df <- tibble::tribble(
  ~Type, ~Lines, ~Year,
  "METRO",  "L5", 1959,
  "METRO",  "L5", 1959,
  "METRO",  "L5", 1959,
  "METRO",  "L9NL10N",  2009,
  "METRO",  "L9S",  2016,
  "METRO",  "L10S", 2018,
  "METRO",  "L10N", 2010,
  "METRO",  "L4", 1926,
  "METRO",  "L1", 1926,
  "METRO",  "L1",   1926
)

PaulS · Answer 3 · 2022-04-01T11:28:14.300

Another possible solution:

library(tidyverse)

df %>% 
  separate_rows(Lines, sep="(?<=[N|S])(?=L)")

#> # A tibble: 11 × 3
#>    Type  Lines  Year
#>    <chr> <chr> <int>
#>  1 METRO L5     1959
#>  2 METRO L5     1959
#>  3 METRO L5     1959
#>  4 METRO L9N    2009
#>  5 METRO L10N   2009
#>  6 METRO L9S    2016
#>  7 METRO L10S   2018
#>  8 METRO L10N   2010
#>  9 METRO L4     1926
#> 10 METRO L1     1926
#> 11 METRO L1     1926

Split string without delimeters

3 Answers3

Data