0

I have a data frame that looks like this basketball scores

Reproduceable code can be found here: https://gist.github.com/jeffgswanson/703bb9eb1698518d1dd9aec43e91fefd

I want to break this up into four columns: Away Score, Away Team, Home Score, Home Team.

The problem I'm running into is some of the teams I've scraped also have a ranking next to them in parentheses. I want to remove the rankings and then split into four columns. I've tried countless ways to either avoid scraping what's in the parentheses in the first place or remove them after the fact but to no avail.

UPDATE 8/15/2023: I was able to get the parentheses moved and now my data table looks like this:

scores <- data.frame(V1 = c("52 Gering 38 Alliance", "60 Wahoo 43 Bennington", "58 Gothenburg 43 Lexington", "73 Plattview 62 Elkhorn North"))

How can I break this single column up into four columns: Away Score, Away Team, Home Score, Home Team

Jeff Swanson
  • 45
  • 1
  • 8
  • 2
    I tried loading the data from your gist, and something either timed out or was just taking too long, so I killed it. I don't think we need all of your data, could you please add the output from `dput(head(x, 30))` to your question? – r2evans Aug 13 '23 at 19:23
  • Yeah it scrapes scores from the web and takes a minute to put together. I'm not sure how to use dput but will see if I can figure out how to get you a data frame in another way. – Jeff Swanson Aug 14 '23 at 16:41
  • Actually use this: V1 <- c("52 (#18) Gering 38 (#22) Alliance", "60 Wahoo 43 (#5) Bennington", "47 Madison 37 Schulyer") scores <- data.frame(V1) – Jeff Swanson Aug 14 '23 at 16:50

2 Answers2

0

Here is a two step approach of adding an "_" at the desired locations and then using tidyr's separate() function to break the single column into 4.

I am adding the _ after the space after a number and before a letter. And the second adding the _ after the space after a letter and before a number. This avoids any confusion with the numbers in the ().

library(tidyr)
#1 line sample data
score <- c("52 (#18) Lincoln Lutheran 38 (#22) Gross Alliance", "52 (#18) Gering 38 (#22) Alliance", "60 Wahoo 43 (#5) Bennington", "47 Madison 37 Schulyer")

df <- data.frame(score)

#remove the parentheses and contents
score <- gsub( "\\(.+?\\) ", "", df$score)

#add a " _" at the desired breaks
score<-gsub( "(\\d )(\\D)", "\\1_\\2", score)
score<-gsub( "(\\D )(\\d)", "\\1_\\2", score)

df$score <- score

#Separate the single column into 4
separate(df, score, into=c("Away score", "Away Team", "Home score", "Home Team"), sep=" _")
#    Away score        Away Team Home score      Home Team
#  1         52 Lincoln Lutheran         38 Gross Alliance
Dave2e
  • 22,192
  • 18
  • 42
  • 50
  • I did not articulate this very well above, but I need everything in the parentheses, including the parentheses, removed because later on I need to summarize the data. So for example "Lincoln Lutheran" would not align with "(#18 Lincoln Lutheran)" – Jeff Swanson Aug 14 '23 at 16:02
  • @JeffSwanson, Ok, see edit above. – Dave2e Aug 14 '23 at 21:58
0

base R

scores2 <- cbind(
  scores,
  strcapture("^ *([0-9]+) +(.*) +([0-9]+) +(.*) *$", scores$V1, 
             proto = list(awayscore=0L, awayteam="", homescore=0L, hometeam=""))
)
teams <- grep("team$", colnames(scores2), value = TRUE)
scores2[,teams] <- lapply(scores2[,teams], gsub, pattern = "\\([^)]*\\)", replacement = "")
scores2
#                                  V1 awayscore awayteam homescore    hometeam
# 1 52 (#18) Gering 38 (#22) Alliance        52   Gering        38    Alliance
# 2       60 Wahoo 43 (#5) Bennington        60    Wahoo        43  Bennington
# 3            47 Madison 37 Schulyer        47  Madison        37    Schulyer

dplyr

library(dplyr)
strcapture("^ *([0-9]+) +(.*) +([0-9]+) +(.*) *$", scores$V1, 
           proto = list(awayscore=0L, awayteam="", homescore=0L, hometeam="")) %>%
  bind_cols(scores, .) %>%
  mutate(across(ends_with("team"), ~ gsub("\\([^)]*\\)", "", .)))
#                                  V1 awayscore awayteam homescore    hometeam
# 1 52 (#18) Gering 38 (#22) Alliance        52   Gering        38    Alliance
# 2       60 Wahoo 43 (#5) Bennington        60    Wahoo        43  Bennington
# 3            47 Madison 37 Schulyer        47  Madison        37    Schulyer

(If you don't want to preserve the V1 column, you can get rid of the bind_cols line.)

r2evans
  • 141,215
  • 6
  • 77
  • 149
  • Appreciate this. It works when I use it on the example three rows I provided above. However, when I try to use this on my data frame with all of the scores (code in initial post above) it return "NA"s for all of the columns except V1. There is something odd going on with my full data frame that I can't put my finger on. – Jeff Swanson Aug 14 '23 at 19:16
  • 1
    @JeffSwanson that's one reason it's best to provide a good variety of sample data in your question (not as a comment, it renders poorly). I suggest you test on batches of rows at a time until you find a row that fails. Add that row to the sample data you provided, and please [edit](https://stackoverflow.com/posts/76894746/edit) your question and place it there (using `dput`, please). Thanks! – r2evans Aug 14 '23 at 19:18
  • I have updated the original question and provided sample data as well. I was able to remove the parentheses / rankings, now I just need to break the single column into four columns. – Jeff Swanson Aug 15 '23 at 20:57
  • The four rows you provide in the sample `score` work well with both the base R and dplyr code in my answer, both returning `structure(list(awayscore = c(52L, 60L, 58L, 73L), awayteam = c("Gering", "Wahoo", "Gothenburg", "Plattview"), homescore = c(38L, 43L, 43L, 62L), hometeam = c("Alliance", "Bennington", "Lexington", "Elkhorn North")), class = "data.frame", row.names = c(NA, -4L))` (minus the original `V1` column, for brevity). – r2evans Aug 15 '23 at 21:15