Splitting a Column According to the Natural Format of its Characters in R

Question

I have the following dataframe:

library(rvest)
library(XML)
library(tidyr)
library(zoo)
library(chron)
library(lubridate)
library(stringr)
page.201702050atl = read_html("http://www.pro-football-reference.com/boxscores/201702050atl.htm")
comments.201702050atl = page.201702050atl %>% html_nodes(xpath = "//comment()")
pbp.201702050atl = comments.201702050atl[45] %>% html_text() %>% read_html() %>% html_node("#pbp") %>% html_table()
colnames(pbp.201702050atl) = c('Quarter', 'Time', 'Down', 'ToGo', 'Location', 'Detail', 'Away.Score', 'Home.Score', 'EPB', 'EPA', 'Win.pct')
pbp.201702050atl.a = pbp.201702050atl[-union(which(pbp.201702050atl$Quarter == '1st Quarter'), which(pbp.201702050atl$Quarter == 'Quarter')), ]
pbp.201702050atl.b = pbp.201702050atl.a[-union(which(pbp.201702050atl.a$Quarter == '2nd Quarter'), which(pbp.201702050atl.a$Quarter == '3rd Quarter')), ]
pbp.201702050atl.c = pbp.201702050atl.b[-union(which(pbp.201702050atl.b$Quarter == '4th Quarter'), which(pbp.201702050atl.b$Quarter == 'Overtime')), ]
pbp.201702050atl.d = pbp.201702050atl.c[-which(pbp.201702050atl.c$Quarter == 'End of Overtime'), ]

I would like to make a new dataframe that splits pbp.201702050atl.d$Location into two columns so that the character elements compose one and the numeric elements compose the other, like so:

     V1    V2
1    "ATL" "35"
2    "NWE" "25"
3    "NWE" "34"
4    "NWE" "34"
5    "NWE" "34"
6    "NWE" "34"
7    "ATL" "34"
8    "ATL" "34"
9    "ATL" "34"
10   ""    "50"
...

To do this, I've written:

Location.201702050atl = as.data.frame(str_split_fixed(as.character(pbp.201702050atl.d$Location), boundary("word"), n = 2))

While close to what I desire, this function results in:

     V1    V2
1    "ATL" "35"
2    "NWE" "25"
3    "NWE" "34"
4    "NWE" "34"
5    "NWE" "34"
6    "NWE" "34"
7    "ATL" "34"
8    "ATL" "34"
9    "ATL" "34"
10   "50"  ""
...

Notice Location.201702050atl[10,]. This function only places characters in Location.201702050atl$V2 if, for that row, the original column consists of two sets of characters seperated by a space. Instead, I would like to place similar (text) characters in Location.201702050atl$V1 and similar (numeric) characters in Location.201702050atl$V2. How do I split the elements of one column according to the natural format of its characters when the entire column, practically, has to be formatted identically, regardles of the natural format of its composing characters? Your help is much appreciated, thank you.

score 1 · Accepted Answer · answered Mar 10 '17 at 11:43

If I understood you correctly, perhaps this can help

library(data.table)
DT <- data.table(C1=replicate(10, paste0(sample(99,1), paste0(sample(LETTERS,2), collapse = "")) ) )
# Simulating a white space
DT$C1[10] <- "84 ME"
DT
    C1
 1:  38XT
 2:  29XL
 3:  24XH
 4:  14SC
 5:  34SY
 6:  80WB
 7:  23VB
 8:  23WR
 9:  19KJ
10: 84 ME
DT[, `:=` (C1_1 = gsub("[\\d]", "", C1, perl = T), C1_2 = gsub("[^\\d]", "", C1, perl = T)) ]
DT
       C1 C1_1 C1_2
 1:  38XT   XT   38
 2:  29XL   XL   29
 3:  24XH   XH   24
 4:  14SC   SC   14
 5:  34SY   SY   34
 6:  80WB   WB   80
 7:  23VB   VB   23
 8:  23WR   WR   23
 9:  19KJ   KJ   19
10: 84 ME   ME   84

If you need to drop the original column, you can just

DT[, C1:=NULL]

Notice that this regular expression will remove all digits in the first, and all non-digits in the second. This will not take into account the order. For instance, D7M8, will return, DM and 78.

Splitting a Column According to the Natural Format of its Characters in R

1 Answers1