I created a data frame of reviews from a website. The three columns are Date, Rating, and Text. I want to see only the 1-star and 5-star reviews. I have tried everything below and get roughly the same result each time:

df %>% filter(Rating = '1 star', Rating = '5 star')

df$Rating

[1] Date   Rating Text  
<0 rows> (or 0-length row.names)

None have worked. Here's the full code. The bit with the df is at the very bottom:

library(rvest)
library(tidyverse)

# Create url object ---------------------------------
url = "https://www.yelp.com/biz/24th-st-pizzeria-san-antonio?osq=Worst+Restaurant"

# Convert url to html object ------------------------
page <- read_html(url)

# Number of pages -----------------------------------
pageNums = page %>%
  html_elements(xpath = "//div[@class=' border-color--default__09f24__NPAKY text-align--center__09f24__fYBGO']") %>%
  html_text() %>%
  str_extract('of.*') %>% 
  str_remove('of ') %>% 
  as.numeric() 

# Create page sequence ------------------------------
pageSequence <- seq(from=0, to=(pageNums * 10)-10, by = 10)

# Create empty vectors to store data ----------------
review_date_all = c()
review_rating_all = c()
review_text_all = c()

# Create for loop -----------------------------------
for (i in pageSequence){
  if (i==0){
    page <- read_html(url) 
  } else {
    page <- read_html(paste0(url, '&start=', i))
  }
  
  # Review date ----
  review_dates <- page %>%
    html_elements(xpath = "//*[@class=' css-chan6m']") %>%
    html_text() %>%
    .[str_detect(., "^\\d+[/]\\d+[/]\\d{4}$")]
  
  # Review Rating ----
  review_ratings <- page %>%
    html_elements(xpath = "//div[starts-with(@class, ' review')]") %>%
    html_elements(xpath = ".//div[contains(@aria-label, 'rating')]") %>%
    html_attr('aria-label') %>%
    str_remove('rating')
  
  # Review text ----
  review_text = page %>%
    html_elements(xpath = "//p[starts-with(@class, 'comment')]") %>%
    html_text()
  
  # For each page, append these to appropriate vectors----
  review_date_all = append(review_date_all, review_dates)
  review_rating_all = append(review_rating_all, review_ratings)
  review_text_all = append(review_text_all, review_text)
}

# Create data frame ---------------------------------
df <- data.frame('Date' = review_date_all,
                 'Rating' = review_rating_all,
                 'Text'= review_text_all)
View(df)

What am I overlooking?

bandcar
  • If your question is about an error, it will be useful to share what that error is. In this case I suspect the issue is you are using `filter(Rating = '1 star')` when you want `filter(Rating == '1 star')`. See here: https://stackoverflow.com/questions/28176650/what-is-the-difference-between-and – Jon Spring Mar 18 '22 at 05:34
  • You are so right! I must have deleted it when I was editing my post. I just added it back, but thankfully, someone was able to answer. – bandcar Mar 18 '22 at 06:07
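
The two pitfalls in the first attempt can be reproduced on a toy data frame. This is only a sketch: `demo` and its values are made up for illustration and deliberately contain no stray whitespace, so that the `=` versus `==` issue and the comma-versus-`|` issue can be seen in isolation.

```r
library(dplyr)

# Made-up ratings without stray whitespace, for illustration only
demo <- data.frame(Rating = c("1 star", "3 star", "5 star"))

# `=` inside filter() is read as a named argument, which dplyr rejects with an error
assign_attempt <- try(filter(demo, Rating = "1 star"), silent = TRUE)
inherits(assign_attempt, "try-error")

# Comma-separated conditions are AND-ed: no row is both "1 star" and "5 star"
and_result <- filter(demo, Rating == "1 star", Rating == "5 star")
nrow(and_result)

# `|` expresses OR, which is what the question is after
or_result <- filter(demo, Rating == "1 star" | Rating == "5 star")
nrow(or_result)
```

So even with `==` restored, separating the two conditions with a comma would still return zero rows; they need to be joined with `|`.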

1 Answer

The problem is with the Rating values in your df: every rating ends with an extra space, so a comparison against '1 star' never matches.

One option is to include that trailing space in the comparison:

df1 <- df %>%
  filter(Rating == '1 star ' | Rating == '5 star ')
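
To see where the space probably comes from, here is a minimal sketch. It assumes the scraped `aria-label` values look like "1 star rating"; the `labels` vector below is made up for illustration and is not taken from the actual page.

```r
library(stringr)

# Hypothetical aria-label values, assumed to resemble what the scraper returns
labels <- c("1 star rating", "5 star rating", "3 star rating")

# Same cleaning step as in the question: only the word "rating" is removed,
# so the space that preceded it stays behind
ratings <- str_remove(labels, "rating")

# print() quotes the strings, which makes the trailing space visible
print(ratings)

# Comparing against the unpadded string fails for every element
ratings == "1 star"

# Removing the space together with the word avoids the problem at the source
ratings_clean <- str_remove(labels, " rating")
ratings_clean == "1 star"
```

If the labels really have this shape, changing the pattern in the scraping loop to `' rating'` would make the plain `'1 star'` comparison work without any later clean-up.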

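Alternatively, strip the whitespace first with the stringr package (str_squish removes leading and trailing whitespace and also collapses repeated internal spaces), then compare against the plain values: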

library(stringr)
df1 <- df %>%
  mutate(Rating = str_squish(Rating)) %>%
  filter(Rating == '1 star' | Rating == '5 star')
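
A slightly more compact variant of the same idea, shown on a small made-up data frame (`df_demo` and its contents are hypothetical stand-ins for the scraped reviews): `str_trim()` is enough when only leading or trailing whitespace is the problem, and `%in%` replaces the chain of `==` comparisons joined by `|`.

```r
library(dplyr)
library(stringr)

# Small made-up data frame standing in for the scraped reviews
df_demo <- data.frame(
  Date   = c("1/2/2022", "1/5/2022", "2/9/2022", "3/1/2022"),
  Rating = c("1 star ", "5 star ", "3 star ", "1 star "),
  Text   = c("Cold pizza", "Great crust", "It was fine", "Never again")
)

df_demo_filtered <- df_demo %>%
  mutate(Rating = str_trim(Rating)) %>%      # strip leading/trailing whitespace
  filter(Rating %in% c("1 star", "5 star"))  # keep rows matching either value

df_demo_filtered
```

With `%in%`, adding another rating to keep only means extending the vector rather than adding another `|` clause.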
Vishal A.