
I want to analyse data from a website regarding visitors. Unfortunately, I'm not sure if I can post the df publicly, so I'll describe it the best I can.

I basically have three columns:

  • date: containing the date (YYYY-MM-DD),
  • url: Containing the full url of the page
  • views: The number of visits for that url for that day

What I want to do is categorize the data based on the url by adding new columns. To take Stack Overflow as an example, I have urls like:

  • stackoverflow.com/questions
  • stackoverflow.com/job
  • stackoverflow.com/users

For these I want to create a new categorical variable 'Main_cat' with 'Questions', 'Jobs' and 'Users' as its levels. For that I'm currently using the following, which I found in another question here.

   df <- df %>%
     mutate(Main_cat = case_when(
       grepl(".*flow.com/questions.*", url) ~ "Questions",
       grepl(".*flow.com/jobs.*", url) ~ "Jobs",
       grepl(".*flow.com/users.*", url) ~ "Users")) %>%
     mutate(Main_cat = as.factor(Main_cat))

This works decently, though not great. The number of main categories I'm working with is about twelve, and my full dataset is about 220,000 observations, so processing in a set-up like this takes a while. It feels like I'm working very inefficiently.

In addition I'm working with sub-categories based on countries:

  • stackoverflow.com/job/belgium
  • stackoverflow.com/job/brazil
  • stackoverflow.com/job/china
  • stackoverflow.com/job/germany
  • stackoverflow.com/job/france

These I want to split into new variables like Continent and Country, since the countries also have subdivisions (...job/belgium/retail, ...job/belgium/it). In the end I would like to sort my data by country, or by sector, or both, using filter() and then perform an analysis.
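Because the path segments sit at fixed positions, one way to get all of these columns at once is to split the url on "/" instead of pattern-matching each level. A sketch with invented data and assumed column names:

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the real data frame.
df <- data.frame(
  url = c("stackoverflow.com/jobs/belgium/retail",
          "stackoverflow.com/jobs/belgium/it",
          "stackoverflow.com/jobs/germany/retail")
)

# separate() splits the url on "/" in one vectorised pass; with
# fill = "right" shorter urls get NA in the trailing columns, and
# extra = "drop" discards any segments beyond the ones named here.
df <- df %>%
  separate(url, into = c("domain", "Main_cat", "Country", "Sector"),
           sep = "/", fill = "right", extra = "drop", remove = FALSE)
```

A Continent column wouldn't come from the url itself; it could be added afterwards by left_join()-ing a small country-to-continent lookup table on Country.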

I can use mutate()/case_when()/grepl() for all of the above, but judging from how long it takes R to finish, something doesn't seem right. I'm hoping there's a better way that takes less time to process.
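With around twelve categories, another option that scales better than twelve grepl() passes is extracting the relevant path segment once and joining it against a lookup table. A hypothetical sketch (the lookup contents and column names are made up for illustration):

```r
library(dplyr)

# Hypothetical lookup table mapping the first path segment to a label;
# extending it to twelve categories is just twelve rows, not twelve
# regex scans over the whole column.
lookup <- data.frame(
  segment  = c("questions", "jobs", "users"),
  Main_cat = c("Questions", "Jobs", "Users")
)

df <- data.frame(
  url = c("stackoverflow.com/questions/1",
          "stackoverflow.com/jobs/belgium")
)

# Pull out the segment after the first "/" with a single sub() call,
# then map it to its label with one join.
df <- df %>%
  mutate(segment = sub("^[^/]*/([^/]*).*$", "\\1", url)) %>%
  left_join(lookup, by = "segment")
```

Urls whose segment isn't in the lookup simply get NA, which mirrors the behaviour of an exhausted case_when().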

Hope this is clear enough, thanks in advance!

Justin
  • If all the urls are of similar structure, you can try splitting the string on "/", assigning the second element to "Main_cat", the third element to "country" and so on. This might help https://stackoverflow.com/questions/33683862/first-entry-from-string-split – Shubham Pujan Oct 08 '20 at 16:35
  • All your `grepl` patterns are of `.*PATTERN.*` kind and that is inefficient as `grepl` regex does not need to match the whole string. Remove all `.*` in the regexps to make them work faster. Actually, those are literal texts, add `fixed=TRUE` as an argument to `grepl`. – Wiktor Stribiżew Oct 08 '20 at 20:47

0 Answers