I have three datasets from three different sub-reddits, and my goal is 1-to check how many users are active in df1 (i.e. a sub-reddit), active in df2, and/or df3 (i.e. another subreddit). Another goal is that once I merge all datasets, I am able to know in which sub-reddit was each user's post written at. For example, I would be interested in knowing if user X is active in sub-reddits 2 and 3, but not 1. And user Y is active in sub-reddits 1 and 3 but not 2.
In each dataset, I have 3 variables shown below:
post date username
Here is a sample of df1
post date username
xyz 1-03-2016 crashbash
mnz 1-03-2016 crashbash
mnc 1-03-2016 crashbash
Here is a sample of df2
post date username
yzh 1-05-2016 crashbash
wzh 1-05-2016 costanza89
zya 1-05-2016 costanza89
Here is a sample of df3
post date username
Fleabag is bad 1-05-2016 costanza89
southpark is the bestt! 1-08-2016 crashbash
fleabag is ok 1-08-2016 skunk49
Here is my code:
#Clearing out environment
rm(list = ls())
#Loading packages
library(tidyverse)
library(readxl)
library(writexl)
library(quanteda)
library(stringr)
library(textclean)
library(lubridate)
library(zoo)
## importing 3 datasets
df1 <- read_excel("df1.xlsx")
df2 <- read_excel("df2.xlsx")
df3 <- read_excel("df3.xlsx")
I currently wrote the code below, which works well but it only tells me if a given user has more than one post in a given sub-reddit, but it does not make a distinction between users who have multiple posts within one sub-reddit, versus those who are active in more than one sub-reddit. I am mainly interested in learning the latter group.
all_subreddits <-
bind_rows(df1,df2,df3,.id = "origin") %>%
group_by(username) %>%
mutate(active = +(n_distinct(origin) == 2), .keep = "unused")
After the code above, the data looks as follows, where active= 1 if a user appears more than once and 0 otherwise.
sapply(all_subreddits, class)
post date username active
"character" "character" "character" "integer"
Ideally, however, I would like to have the following outcome with a variable indicating the sub-reddit where each user has been active in:
post date username active
xyz 1-03-2016 crashbash in df1 & df2
zya 1-05-2016 costanza89 in df1 and df3
fleabag is ok 1-08-2016 skunk49 in df3
After running the great solution proposed below, I get the following output:
sapply(all_subreddits, class)
origin post date username
"character" "character" "Date" "character"
print(all_subreddits)
A tibble: 1,037 x 4
origin post date username
<chr> <chr> <date> <chr>
748
df2
الشكوى لله ذلونا صراحه
27-09-2012
هتلر المخاريم
678
df2
اقتباس: المشاركة الأصلية كتبت بواسطة حظها العاثر (المشاركة 6775851) ^ والله صادقه يا اختي حسبي الله ونعم الوكيل انا واختي الشئ نفسه غير مؤهلين عشان راتب بابا التقاعدي الله يرحمه والله ظلم :( حسبي الله عليهم انا وخواتي مثلك يارب ياخذ حقنا منهم بالدنيا قبل الآخرة
23-09-2012
هتانه
679
df2
اقتباس: المشاركة الأصلية كتبت بواسطة بنت القنفذة (المشاركة 6811014) خلاص ماعاد فيه خوف من الله الطمع اعمى عيونهم حسبي الله عليهم اللهم عليك بالظالمين فإنهم لايعجزونك يارب خذ بحقنااا اللهم إنا مظلومون فأنتصر لنا يااااارب والله مقهوووووره يكفي اننا ايتام حرااااام عليهم الله مايرضى بالظلم امين يارب
23-09-2012
هتانه
936
df3
مشكوووووووووووور وماقصرت .
22-01-2012
هانيبال ليكتر
450
df2
نفس الطريقه انا غير مؤهله بسبب راتب التقاعد بس اللي ابغى افهمه احنا ورثه ونصرف على البيت من الراتب على بالهم انو دخل ثابت لنا خاص !!! طيب عاطلين احنا مالنا شي !!
11-09-2012
هاربر
452
df2
^^ الدعوه كلها اذلال والله ذلونا على الـ 2000 المنحوسه !!
11-09-2012
هاربر
After merging the data, I tried the R base solution and it works perfectly!
xtabs( ~ username + origin, hafiz1_combined)
output:
origin
username df1 df2 df3
Ξ الأدميرالي Ξ 1 0 0
آ فوفه 0 0 1
آبو طيف 0 0 3
آر اندرويد 0 0 1
آلبدر 0 2 0