I want to analyse visitor data from a website. Unfortunately, I'm not sure if I can post the df publicly, so I'll describe it as best I can.
I basically have three columns:
- date: containing the date (YYYY-MM-DD),
- url: Containing the full url of the page
- views: The number of visits for that url for that day
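To make the structure concrete, here's a tiny mock of what such a df could look like (the dates, urls and view counts are all invented for illustration):

```r
# Invented sample data with the same three columns as described above
df <- data.frame(
  date  = as.Date(c("2023-01-01", "2023-01-01", "2023-01-02")),
  url   = c("stackoverflow.com/questions",
            "stackoverflow.com/jobs/belgium/retail",
            "stackoverflow.com/users"),
  views = c(120, 15, 48)
)
```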
What I want to do is categorize the data based on the url, by making new columns. To take stackoverflow as an example, I have urls like:
- stackoverflow.com/questions
- stackoverflow.com/job
- stackoverflow.com/users
For these I want to create a new categorical variable 'Main_cat' with 'Questions', 'Jobs' and 'Users' as its levels. For that I'm currently using this, which I found in another topic here.
df <- df %>%
  mutate(Main_cat = case_when(
    grepl("flow\\.com/questions", url) ~ "Questions",
    grepl("flow\\.com/jobs", url) ~ "Jobs",
    grepl("flow\\.com/users", url) ~ "Users"
  )) %>%
  mutate(Main_cat = as.factor(Main_cat))
This works decently, though not great. The number of main categories I'm working with is about twelve, and my full dataset is about 220,000 observations, so processing in a set-up like this takes a while. It feels like I'm working very inefficiently.
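One alternative I've been considering (a sketch, assuming every url starts with the domain followed by the category segment, and using stringr, which I'd have to add): extract the first path segment once, instead of running a separate grepl() per category over all rows. The helper column name `segment` is just an example.

```r
library(dplyr)
library(stringr)

df <- df %>%
  mutate(
    # Capture the first path segment after the domain, e.g. "questions"
    segment  = str_match(url, "^[^/]+/([^/]+)")[, 2],
    # Capitalize it and turn it into a factor in one go
    Main_cat = factor(str_to_title(segment))
  )
```

This way one regex pass replaces the twelve grepl() calls, which should scale better on 220,000 rows.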
In addition I'm working with sub-categories based on countries:
- stackoverflow.com/job/belgium
- stackoverflow.com/job/brazil
- stackoverflow.com/job/china
- stackoverflow.com/job/germany
- stackoverflow.com/job/france
These I want to split into new variables like Continent and Country, since the countries also have subdivisions (...job/belgium/retail, ...job/belgium/it). In the end I would like to sort my data by country, by sector, or both using filter() and then perform an analysis.
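For the sub-categories, one idea (a sketch, assuming the path components always appear in the same order; the column names Country and Sector are my own examples) would be to split the url on "/" in a single pass with tidyr, and derive Continent from a small lookup table afterwards:

```r
library(dplyr)
library(tidyr)

# Split the url into its path components; urls with fewer segments get NA
df <- df %>%
  separate(url, into = c("site", "Main_cat", "Country", "Sector"),
           sep = "/", fill = "right", remove = FALSE) %>%
  mutate(across(c(Main_cat, Country, Sector), as.factor))

# Continent could then come from a small lookup table joined on Country, e.g.
# continents <- data.frame(Country   = c("belgium", "brazil"),
#                          Continent = c("Europe", "South America"))
# df <- df %>% left_join(continents, by = "Country")
```

After that, filtering works as hoped, e.g. `df %>% filter(Country == "belgium", Sector == "retail")`.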
I can use the mutate/case_when/grepl approach for all of the above, but judging from how long it takes R to finish, something doesn't seem right. I'm hoping there's a better way that takes less time to process.
Hope this is clear enough, thanks in advance!