
I want to analyse data from a website regarding visitors. Unfortunately, I'm not sure if I can post the df publicly, so I'll describe it the best I can.

I basically have three columns:

  • date: containing the date (YYYY-MM-DD),
  • url: Containing the full url of the page
  • views: The number of visits for that url for that day

What I want to do is categorize the data based on the url by adding new columns. To take Stack Overflow as an example, I have urls like:

  • stackoverflow.com/questions
  • stackoverflow.com/job
  • stackoverflow.com/users

For these I want to create a new categorical variable 'Main_cat' with 'Questions', 'Jobs' and 'Users' as its levels. For that I'm currently using the following, which I found in another question here.

   df <- df %>%
     mutate(Main_cat = case_when(
       grepl(".*flow.com/questions.*", url) ~ "Questions",
       grepl(".*flow.com/jobs.*", url) ~ "Jobs",
       grepl(".*flow.com/users.*", url) ~ "Users")) %>%
     mutate(Main_cat = as.factor(Main_cat))

This works decently, though not great. The number of main categories I'm working with is about twelve, and my full dataset is about 220,000 observations, so processing in a set-up like this takes a while. It feels like I'm working very inefficiently.

In addition I'm working with sub-categories based on countries:

  • stackoverflow.com/job/belgium
  • stackoverflow.com/job/brazil
  • stackoverflow.com/job/china
  • stackoverflow.com/job/germany
  • stackoverflow.com/job/france

These I want to split into new variables like Continent and Country, since the countries also have subdivisions (...job/belgium/retail, ...job/belgium/it). In the end I would like to sort my data by country, or by sector, or both, using filter() and then perform an analysis.
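Because the path segments sit at fixed positions, one way to get all of these columns at once is to split the url on "/" instead of pattern-matching each level. A sketch with invented data and assumed column names:

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the real data frame.
df <- data.frame(
  url = c("stackoverflow.com/jobs/belgium/retail",
          "stackoverflow.com/jobs/belgium/it",
          "stackoverflow.com/jobs/germany/retail")
)

# separate() splits the url on "/" in one vectorised pass; with
# fill = "right" shorter urls get NA in the trailing columns, and
# extra = "drop" discards any segments beyond the ones named here.
df <- df %>%
  separate(url, into = c("domain", "Main_cat", "Country", "Sector"),
           sep = "/", fill = "right", extra = "drop", remove = FALSE)
```

A Continent column wouldn't come from the url itself; it could be added afterwards by left_join()-ing a small country-to-continent lookup table on Country.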

I can use mutate()/case_when()/grepl() for all of the above, but judging from how long it takes R to finish, something doesn't seem right. I'm hoping there's a better way that takes less time to process.
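With around twelve categories, another option that scales better than twelve grepl() passes is extracting the relevant path segment once and joining it against a lookup table. A hypothetical sketch (the lookup contents and column names are made up for illustration):

```r
library(dplyr)

# Hypothetical lookup table mapping the first path segment to a label;
# extending it to twelve categories is just twelve rows, not twelve
# regex scans over the whole column.
lookup <- data.frame(
  segment  = c("questions", "jobs", "users"),
  Main_cat = c("Questions", "Jobs", "Users")
)

df <- data.frame(
  url = c("stackoverflow.com/questions/1",
          "stackoverflow.com/jobs/belgium")
)

# Pull out the segment after the first "/" with a single sub() call,
# then map it to its label with one join.
df <- df %>%
  mutate(segment = sub("^[^/]*/([^/]*).*$", "\\1", url)) %>%
  left_join(lookup, by = "segment")
```

Urls whose segment isn't in the lookup simply get NA, which mirrors the behaviour of an exhausted case_when().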

Hope this is clear enough, thanks in advance!

Justin
  • If all the urls are of similar structure, you can try splitting the string on "/", assigning the second element to "Main_cat", the third element to "country" and so on. This might help https://stackoverflow.com/questions/33683862/first-entry-from-string-split – Shubham Pujan Oct 08 '20 at 16:35
  • All your `grepl` patterns are of `.*PATTERN.*` kind and that is inefficient as `grepl` regex does not need to match the whole string. Remove all `.*` in the regexps to make them work faster. Actually, those are literal texts, add `fixed=TRUE` as an argument to `grepl`. – Wiktor Stribiżew Oct 08 '20 at 20:47

0 Answers