Automatically fill a column with the same information AND use NA for missing values in R

Question

I'm trying to scrape all competitor information, such as each competitor's division, gender, belt, weight, and other details, from this website. The end goal is to put all competitor information from this page into one data frame.

First Question: The division, gender, belt, and weight appear only once at the top of the page, but I want R to fill in this information automatically next to each competitor's name in a data frame. How can I code this so that the appropriate information is correctly filled in next to each competitor?

Second Question: How can I input NA for missing information, like the date or competitor number?

Because the scraped vectors have different lengths, my code cannot combine any of the scraped data into a data frame.

library(rvest)
library(tidyverse)

MensUrl <- read_html('https://www.bjjcompsystem.com/tournaments/1869/categories/2053147')

## SCRAPE FIGHT INFO -------------------------------------------
ageDivision <- MensUrl %>% 
  html_nodes('.category-title__age-division') %>% 
  html_text()

gender <- MensUrl %>% 
  html_nodes('.category-title__age-division+ .category-title__label') %>% 
  html_text()

belt <- MensUrl %>% 
  html_nodes('.category-title__label:nth-child(3)') %>% 
  html_text()

weight <-  MensUrl %>% 
  html_nodes('.category-title__label:nth-child(4)') %>% 
  html_text()

fightAndMat <- MensUrl %>% 
  html_nodes('.bracket-match-header__where , .bracket-match-header__fight') %>% 
  html_text()

date = MensUrl %>% 
  html_nodes('.bracket-match-header__when') %>% 
  html_text()

CompetitorNo = MensUrl %>% 
  html_nodes('.match-card__competitor-n') %>% 
  html_text()

name = MensUrl %>% 
  html_nodes('.match-card__competitor-description div:nth-child(1)') %>% 
  html_text()

gym = MensUrl %>% 
  html_nodes('.match-card__club-name') %>% 
  html_text()

# create match df 
matches = data.frame('division' = ageDivision,
                     'gender' = gender,
                     'belt' = belt,
                     'weight' = weight,
                     'fightAndMat' = fightAndMat,
                     'date' = date,
                     'competitor' = CompetitorNo,
                     'name' = name,
                     'gym' = gym)

This is similar to what the end data frame should look like:

Simply searching for all of the information individually is likely a bad approach. Look for related information (e.g. name, gym) in the same node: find the node first, then extract both name and gym from it, so you know they are related (and in the same quantity). - dcsuka, Sep 11 '22 at 02:16
I tried this too, but my problem still circles back to dealing with missing data. There are some instances where no gym name is listed, so I'm not sure how to handle that when it comes time to separate the strings. - bandcar, Sep 11 '22 at 02:42

dcsuka · Answer 1 · 2022-09-11T03:42:43.277

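Passing flatten = FALSE to xml2's xml_find_all() is what you need to handle the missing data: it returns one result per parent node, so an empty result can be replaced with NA. Here is an example for two of the variables, first finding the nodes, then the subnodes: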

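library(xml2)
# rvest and tidyverse are already loaded in the question (html_nodes(), %>%, tibble())

# One parent node per competitor slot
des <- MensUrl %>% 
  html_nodes(xpath = "//span[contains(@class, 'match-card__competitor-description')]")

# flatten = FALSE returns one nodeset per parent; an empty nodeset marks a missing value
comp_names <- des %>%
  xml_find_all(xpath = "./div[contains(@class, 'match-card__competitor-name')]", flatten = FALSE) %>%
  lapply(function(x) {
    txt <- xml_text(x)
    if (length(txt) == 0) NA_character_ else txt
  }) %>%
  unlist()

# Same pattern for the club name, which is absent for some competitors
club_names <- des %>%
  xml_find_all(xpath = "./div[contains(@class, 'match-card__club-name')]", flatten = FALSE) %>%
  lapply(function(x) {
    txt <- xml_text(x)
    if (length(txt) == 0) NA_character_ else txt
  }) %>%
  unlist()

# Both vectors have one entry per parent node, so they line up row by row
tibble(A = comp_names, B = club_names)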

# # A tibble: 30 × 2
#    A                          B                   
#    <chr>                      <chr>               
#  1 Francisco J. O'Ryan Lesser "Cohab Chile"       
#  2 Bryan Nguyen               "Logic"             
#  3 Lio Alexander Duarte       "Brazilian Top Team"
#  4 NA                          NA                 
#  5 Mark Luebking              "CheckMat"          
#  6 Francisco J. O'Ryan Lesser "Cohab Chile"       
#  7 Lio Alexander Duarte       "Brazilian Top Team"
#  8 Omar Eduardo Alfaro Solis  "LEAD BJJ"          
#  9 Lucas John Wagner          "Alliance "         
# 10 Lio Alexander Duarte       "Brazilian Top Team"
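
To see the same pattern without hitting the website, here is a minimal sketch on a small inline HTML snippet. The markup, the class names (card, name, club), and the helper text_or_na() are all made up for illustration; only read_html(), xml_find_all() with flatten = FALSE, and xml_text() come from xml2.

```r
library(xml2)

# Hypothetical markup: the second card has no club element
page <- read_html('
<div class="card"><div class="name">Alice</div><div class="club">Club A</div></div>
<div class="card"><div class="name">Bob</div></div>
<div class="card"><div class="name">Cara</div><div class="club">Club C</div></div>
')

# One parent node per competitor
cards <- xml_find_all(page, "//div[@class='card']")

# Hypothetical helper: one value per parent node, NA when the child is absent
text_or_na <- function(nodes, xpath) {
  found <- xml_find_all(nodes, xpath, flatten = FALSE)  # list with one nodeset per parent
  vapply(found, function(x) {
    txt <- xml_text(x)
    if (length(txt) == 0) NA_character_ else txt[1]
  }, character(1))
}

demo <- data.frame(
  name = text_or_na(cards, "./div[@class='name']"),
  club = text_or_na(cards, "./div[@class='club']")
)
print(demo)
```

Because every column is built from the same set of parent nodes, all columns have the same length and the missing club becomes NA instead of shifting the later rows up. As far as I know, rvest's html_element() (singular) applied to the parent nodes gives a similar one-result-per-parent behaviour, which may be worth trying as an alternative.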

As for your first question, just scrape each single value (division, gender, belt, weight) once, then add it as a column to the tibble; a length-one value is recycled to every row.
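
A minimal sketch of that step, assuming each of the four header selectors returns exactly one string. The values below are placeholders, not data from the page, and the short competitor vectors stand in for comp_names and club_names from above.

```r
library(tibble)

# Placeholder stand-ins for the single scraped header values (assumed length 1 each)
ageDivision <- "Master 1"
gender      <- "Male"
belt        <- "BLACK"
weight      <- "Middle"

# Placeholder stand-ins for the per-competitor vectors, with NA where data is missing
comp_names <- c("Competitor A", NA, "Competitor C")
club_names <- c("Club A", NA, "Club C")

# tibble() recycles the length-one values to match the longer columns
matches <- tibble(
  division = ageDivision,
  gender   = gender,
  belt     = belt,
  weight   = weight,
  name     = comp_names,
  gym      = club_names
)
print(matches)
```

If a header selector ever returns more than one string, tibble() will stop with a size error rather than recycle silently, which is a useful check that the selector matched only the page title.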
