Tracking observations across different datasets (e.g. sub-reddits)

Question

I have three datasets from three different sub-reddits, and my goal is 1-to check how many users are active in df1 (i.e. a sub-reddit), active in df2, and/or df3 (i.e. another subreddit). Another goal is that once I merge all datasets, I am able to know in which sub-reddit was each user's post written at. For example, I would be interested in knowing if user X is active in sub-reddits 2 and 3, but not 1. And user Y is active in sub-reddits 1 and 3 but not 2.

In each dataset, I have 3 variables shown below:

post   date        username

Here is a sample of df1

post   date          username
xyz    1-03-2016     crashbash
mnz   1-03-2016      crashbash
mnc   1-03-2016      crashbash

Here is a sample of df2

post   date          username
yzh  1-05-2016      crashbash
wzh  1-05-2016      costanza89
zya  1-05-2016      costanza89

Here is a sample of df3

post   date             username
Fleabag is bad          1-05-2016      costanza89
southpark is the bestt!  1-08-2016      crashbash
fleabag is ok           1-08-2016      skunk49

Here is my code:

#Clearing out environment
rm(list = ls())
#Loading packages
library(tidyverse)
library(readxl)
library(writexl)
library(quanteda)
library(stringr)
library(textclean)
library(lubridate)
library(zoo)

## importing 3 datasets
df1 <- read_excel("df1.xlsx")
df2 <- read_excel("df2.xlsx")
df3 <- read_excel("df3.xlsx")

I currently wrote the code below, which works well but it only tells me if a given user has more than one post in a given sub-reddit, but it does not make a distinction between users who have multiple posts within one sub-reddit, versus those who are active in more than one sub-reddit. I am mainly interested in learning the latter group.

all_subreddits <-
  bind_rows(df1,df2,df3,.id = "origin") %>% 
  group_by(username) %>% 
  mutate(active = +(n_distinct(origin) == 2), .keep = "unused")

After the code above, the data looks as follows, where active= 1 if a user appears more than once and 0 otherwise.

sapply(all_subreddits, class)
       post        date    username      active 
"character" "character" "character"   "integer"

Ideally, however, I would like to have the following outcome with a variable indicating the sub-reddit where each user has been active in:

post              date           username               active
xyz               1-03-2016     crashbash         in df1 & df2
zya               1-05-2016      costanza89       in df1 and df3
fleabag is ok     1-08-2016      skunk49          in df3

After running the great solution proposed below, I get the following output:

sapply(all_subreddits, class)
 origin        post        date    username 
"character" "character"      "Date" "character"

print(all_subreddits)
A tibble: 1,037 x 4
   origin post                                                               date   username    
   <chr>  <chr>                                                              <date> <chr>       
748
df2
الشكوى لله ذلونا صراحه
27-09-2012
هتلر المخاريم
678
df2
اقتباس: المشاركة الأصلية كتبت بواسطة حظها العاثر (المشاركة 6775851) ^ والله صادقه يا اختي حسبي الله ونعم الوكيل انا واختي الشئ نفسه غير مؤهلين عشان راتب بابا التقاعدي الله يرحمه والله ظلم :( حسبي الله عليهم انا وخواتي مثلك يارب ياخذ حقنا منهم بالدنيا قبل الآخرة 
23-09-2012
هتانه
679
df2
اقتباس: المشاركة الأصلية كتبت بواسطة بنت القنفذة (المشاركة 6811014) خلاص ماعاد فيه خوف من الله الطمع اعمى عيونهم حسبي الله عليهم اللهم عليك بالظالمين فإنهم لايعجزونك يارب خذ بحقنااا اللهم إنا مظلومون فأنتصر لنا يااااارب والله مقهوووووره يكفي اننا ايتام حرااااام عليهم الله مايرضى بالظلم امين يارب
23-09-2012
هتانه
936
df3
مشكوووووووووووور وماقصرت .
22-01-2012
هانيبال ليكتر
450
df2
نفس الطريقه انا غير مؤهله بسبب راتب التقاعد بس اللي ابغى افهمه احنا ورثه ونصرف على البيت من الراتب على بالهم انو دخل ثابت لنا خاص !!! طيب عاطلين احنا مالنا شي !!
11-09-2012
هاربر
452
df2
^^ الدعوه كلها اذلال والله ذلونا على الـ 2000 المنحوسه !!
11-09-2012
هاربر

After merging the data, I tried the R base solution and it works perfectly!

xtabs( ~ username + origin, hafiz1_combined)

output:

    origin
username           df1 df2 df3
  Ξ الأدميرالي Ξ     1   0   0
  آ فوفه             0   0   1
  آبو طيف            0   0   3
  آر اندرويد         0   0   1
  آلبدر              0   2   0

score 2 · Accepted Answer · answered May 07 '22 at 11:37

Here are tidyverse solutions to both goals.

x <- "
post   date          username
xyz    1-03-2016     crashbash
mnz   1-03-2016      crashbash
mnc   1-03-2016      crashbash"
y <- "post   date          username
yzh  1-05-2016      crashbash
wzh  1-05-2016      costanza89
zya  1-05-2016      costanza89"
z <- "post   date             username
'Fleabag is bad'          1-05-2016      costanza89
'southpark is the best!'  1-08-2016      crashbash
'fleabag is ok'           1-08-2016      skunk49"

df1 <- read.table(textConnection(x), header = TRUE)
df2 <- read.table(textConnection(y), header = TRUE)
df3 <- read.table(textConnection(z), header = TRUE)

suppressPackageStartupMessages(library(tidyverse))

all_subreddits <- bind_rows(df1, df2, df3, .id = "origin") %>%
  mutate(origin = paste0("df", origin),
         date = as.Date(date, "%m-%d-%Y"))

all_subreddits %>% 
  group_by(origin) %>%
  summarise(n_users = n_distinct(username))
#> # A tibble: 3 × 2
#>   origin n_users
#>   <chr>    <int>
#> 1 df1          1
#> 2 df2          2
#> 3 df3          3

all_subreddits %>%
  group_by(username) %>%
  summarise(active = paste(unique(origin), collapse = ", "))
#> # A tibble: 3 × 2
#>   username   active       
#>   <chr>      <chr>        
#> 1 costanza89 df2, df3     
#> 2 crashbash  df1, df2, df3
#> 3 skunk49    df3

^{Created on 2022-05-07 by the reprex package (v2.0.1)}

Base R

There are also base R solutions, very simple after having the data in all_subreddits, but with different output formats.

aggregate(origin ~ username, all_subreddits, \(x) c(unique(x)))
#>     username        origin
#> 1 costanza89      df2, df3
#> 2  crashbash df1, df2, df3
#> 3    skunk49           df3

xtabs( ~ username + origin, all_subreddits)
#>             origin
#> username     df1 df2 df3
#>   costanza89   0   2   1
#>   crashbash    3   1   1
#>   skunk49      0   0   1

^{Created on 2022-05-07 by the reprex package (v2.0.1)}

Thanks, the tidyverse code works well in identifying where each post comes from (i.e. which df), but for some reason I can't reproduce the "active" variable shown below: #> # A tibble: 3 × 2 #> username active #> #> 1 costanza89 df2, df3 #> 2 crashbash df1, df2, df3 #> 3 skunk49 df3 I also ran the same tidyverse code except for the first part which I didn't fully understand: df1 <- read.table(textConnection(x), header = TRUE) df2 <- read.table(textConnection(y), header = TRUE) df3 <- read.table(textConnection(z), header = TRUE) — maldini1990, May 07 '22 at 12:43
@maldini1992 What is your output? The `read.table` code is to read from the character strings just above it, to have the data in the R session without creating xlsx files. — Rui Barradas, May 07 '22 at 12:46

Tracking observations across different datasets (e.g. sub-reddits)

1 Answers1

Base R