1

I have df which contains a column with metro lines. The problem is that L9N and L10N appear sometimes in the same row and I want to split them into two different rows. I have been trying many things but I can't figure out how to do it.

Type Lines Year
METRO L5 1959
METRO L5 1959
METRO L5 1959
METRO L9NL10N 2009
METRO L9S 2016
METRO L10S 2018
METRO L10N 2010
METRO L4 1926
METRO L1 1926
METRO L1 1926
zx8754
  • 52,746
  • 12
  • 114
  • 209
Berta_94
  • 21
  • 3
  • Apart from "L9NL10N", are there other "Lines" with such issues? So that we can post more generalised solutions. Do metro line names always start with letter "L"? – zx8754 Apr 01 '22 at 08:21
  • Related post https://stackoverflow.com/q/41912524/680068 – zx8754 Apr 01 '22 at 08:28
  • 2
    All lines start with "L" and only lines 9 and 10 have such issues, but it appears in two cases: "L9NL10N" and "L9SL10S" – Berta_94 Apr 01 '22 at 08:57

3 Answers3

4

A data.table oneliner.
It uses a lookbehind-lookahead regex "(?<=\[NS])(?=L)" to identify splitting points (between capital N (or S) and capital L), and then splits those rows, keeping the delimeter.

library(data.table)
setDT(mydata)[, .(Lines = unlist(tstrsplit(Lines, "(?<=[NS])(?=L)", perl = TRUE))), by = .(Type, Year)][]

#     Type Year Lines
# 1: METRO 1959    L5
# 2: METRO 1959    L5
# 3: METRO 1959    L5
# 4: METRO 2009   L9N
# 5: METRO 2009  L10N
# 6: METRO 2016   L9S
# 7: METRO 2018  L10S
# 8: METRO 2010  L10N
# 9: METRO 1926    L4
#10: METRO 1926    L1
#11: METRO 1926    L1

edit:
Use regex "(?<=[A-Z])(?=L)" for splitting after any capital letter [A-Z], followed by the capital letter L.

Wimpel
  • 26,031
  • 1
  • 20
  • 37
  • updated code, to the S followed by an L is also a delimeter. – Wimpel Apr 01 '22 at 09:02
  • OP clarified it always starts with "L", so we can just look for L and split: `strsplit(c("L5", "L9NL10N", "L9SL10S"), '(?<=.)(?=[L])', perl = TRUE)` – zx8754 Apr 01 '22 at 09:04
  • That is correct... but what if a line also ends with an L , like `L9LL10N` (ok, now I'm just making up usecases ;-) ) – Wimpel Apr 01 '22 at 09:07
  • 1
    I think @jpiversen guessed it right, last letter is Cardinal directions = N,E,S,W. So metro going Louth is unlikely on Earth. :D – zx8754 Apr 01 '22 at 09:17
2

Here is a tidy and efficient way for data.frames:

library(dplyr)

df %>% 
  mutate(Lines = stringr::str_extract_all(Lines, "L\\d*([NSEW]?)")) %>% 
  tidyr::unnest(Lines)

#> # A tibble: 11 × 3
#>    Type  Lines  Year
#>    <chr> <chr> <dbl>
#>  1 METRO L5     1959
#>  2 METRO L5     1959
#>  3 METRO L5     1959
#>  4 METRO L9N    2009
#>  5 METRO L10N   2009
#>  6 METRO L9S    2016
#>  7 METRO L10S   2018
#>  8 METRO L10N   2010
#>  9 METRO L4     1926
#> 10 METRO L1     1926
#> 11 METRO L1     1926

Created on 2022-04-01 by the reprex package (v2.0.1)

It will work for any duplicated line following the pattern: L <som number> <possible one of N, S, E or W>.

Data

df <- tibble::tribble(
  ~Type, ~Lines, ~Year,
  "METRO",  "L5", 1959,
  "METRO",  "L5", 1959,
  "METRO",  "L5", 1959,
  "METRO",  "L9NL10N",  2009,
  "METRO",  "L9S",  2016,
  "METRO",  "L10S", 2018,
  "METRO",  "L10N", 2010,
  "METRO",  "L4", 1926,
  "METRO",  "L1", 1926,
  "METRO",  "L1",   1926
)
jpiversen
  • 3,062
  • 1
  • 8
  • 12
0

Another possible solution:

library(tidyverse)

df %>% 
  separate_rows(Lines, sep="(?<=[N|S])(?=L)")

#> # A tibble: 11 × 3
#>    Type  Lines  Year
#>    <chr> <chr> <int>
#>  1 METRO L5     1959
#>  2 METRO L5     1959
#>  3 METRO L5     1959
#>  4 METRO L9N    2009
#>  5 METRO L10N   2009
#>  6 METRO L9S    2016
#>  7 METRO L10S   2018
#>  8 METRO L10N   2010
#>  9 METRO L4     1926
#> 10 METRO L1     1926
#> 11 METRO L1     1926
PaulS
  • 21,159
  • 2
  • 9
  • 26