I am working on streaming web-server records with PySpark in real time, and I want to reduce/filter the data of a certain period (let's say 1 week, which is about 10M records) down to about 1M records, so that the sampled data represents normal traffic with the most-used characteristics. I tried the following strategies in Python:
- find the most used username, let's say the top n like Ali & Eli ----> df['username'].value_counts()
- find the most used APIs (api) that Ali & Eli accessed individually.
- At first we need to filter the records belonging to Ali & Eli, e.g. df_filter_Ali = df[df["username"] == "Ali"], and find the most used APIs (api) by Ali ----> df_filter_Ali['api'].value_counts(), let's say \a\s\d\ & \a\b\c\
- filter the records of Ali which contain the most-accessed APIs \a\s\d\ & \a\b\c\
- but do them separately; in other words (roughly, as PySpark-style pseudocode; a cleaner sketch follows right below):
df.filter(username=Ali).filter(api=/a).sample(0.1)
  .union(df.filter(username=Ali).filter(api=/b).sample(0.1))
  .union(df.filter(username=Pejman).filter(api=/a).sample(0.1))
  .union(df.filter(username=Ali).filter(api=/z).sample(0.1))
  .union(df.filter(username=Pejman or Ali).filter(api=/a, /b, /z))
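In actual PySpark syntax I imagine that chain looking roughly like the sketch below (just a sketch: the usernames, API paths, and the 0.1 fraction are the illustrative values from above, and I call the API column api here, although my later code uses normalizedApi):

from pyspark.sql import functions as F

# union of per-(user, API) samples, mirroring the pseudocode chain above
sampled = (
    df.filter((F.col("username") == "Ali") & (F.col("api") == "/a")).sample(fraction=0.1)
    .unionByName(df.filter((F.col("username") == "Ali") & (F.col("api") == "/b")).sample(fraction=0.1))
    .unionByName(df.filter((F.col("username") == "Pejman") & (F.col("api") == "/a")).sample(fraction=0.1))
    .unionByName(df.filter((F.col("username") == "Ali") & (F.col("api") == "/z")).sample(fraction=0.1))
)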
Then we can expect the other features belonging to these events to be representative of the normal data distribution.
I think groupby() doesn't give us the right distribution.
# Task 1: normal data sampling
import pandas as pd

df = pd.read_csv("df.csv", sep=";")

# Strategy 1: sample 10% inside every (top-50 user, top-10 API) pair
df1 = []
for first_column in df["username"].value_counts().index[:50]:
    second_column_most_values = df.loc[df["username"] == first_column]["normalizedApi"].value_counts().index
    for second_column in second_column_most_values[:10]:
        sample = df.loc[(df["username"] == first_column) & (df["normalizedApi"] == second_column)].sample(frac=0.1)
        df1.append(sample)
df1 = pd.concat(df1)

# Strategy 2: gather each user's top-10 APIs first, then sample 10% per user
df2 = []
for first_column in df["username"].value_counts().index[:50]:
    second_column_most_values = df.loc[df["username"] == first_column]["normalizedApi"].value_counts().index
    user_specific_data = []
    for second_column in second_column_most_values[:10]:
        sample = df.loc[(df["username"] == first_column) & (df["normalizedApi"] == second_column)]
        user_specific_data.append(sample)
    df2.append(pd.concat(user_specific_data).sample(frac=0.1))
df2 = pd.concat(df2)

# Strategy 3: gather everything for the top users/APIs, then sample 10% globally
df3 = []
for first_column in df["username"].value_counts().index[:50]:
    second_column_most_values = df.loc[df["username"] == first_column]["normalizedApi"].value_counts().index
    user_specific_data = []
    for second_column in second_column_most_values[:10]:
        sample = df.loc[(df["username"] == first_column) & (df["normalizedApi"] == second_column)]
        user_specific_data.append(sample)
    df3.append(pd.concat(user_specific_data))
df3 = pd.concat(df3)
df3 = df3.sample(frac=0.1)

# combine the three samples and deduplicate
sampled_napi_df = pd.concat([df1, df2, df3])
sampled_napi_df = sampled_napi_df.drop_duplicates()
sampled_napi_df = sampled_napi_df.reset_index(drop=True)
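As a side note, on pandas >= 1.1 the per-(user, API) 10% sample from the first strategy can probably be written more compactly with GroupBy.sample; a minimal sketch under the same assumptions (top-50 users, top-10 APIs per user, same df as above):

# restrict to the top-50 users and each user's 10 most-used APIs
top_users = df["username"].value_counts().index[:50]
subset = df[df["username"].isin(top_users)]
top_apis = (
    subset.groupby("username")["normalizedApi"]
    .value_counts()                      # per-user API counts, sorted descending within each user
    .groupby(level="username")
    .head(10)                            # keep the 10 most-used APIs per user
    .index.to_frame(index=False)         # -> DataFrame with columns username, normalizedApi
)
subset = subset.merge(top_apis, on=["username", "normalizedApi"])

# one 10% sample per (username, normalizedApi) group, analogous to df1 above
df1_alt = subset.groupby(["username", "normalizedApi"]).sample(frac=0.1)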
I checked the existing posts in this regard, but I couldn't find any interesting approach except a few: post1, Filtering streaming data to reduce noise, kalman filter, and How correctly reduce stream to another stream, which are C++ or Java solutions!
Edit 1: I tried Scala: pick the top 50 usernames, loop over the top APIs each of them accessed, take a small random sample per (user, API) pair, and union everything back into a filtered df:
// (in spark-shell these implicits are already in scope)
import spark.implicits._

// top 50 users by record count
val users = df.groupBy("username").count.orderBy($"count".desc).select("username").as[String].take(50)

// for each of those users, their 50 most-accessed APIs
val user_apis = users.map { user =>
  val users_apis = df.filter($"username" === user).groupBy("normalizedApi").count.orderBy($"count".desc).select("normalizedApi").as[String].take(50)
  (user, users_apis)
}

import org.apache.spark.sql.functions.rand

// 10 random records per (user, API) pair, unioned together
val df_sampled = user_apis.map { case (user, userApis) =>
  userApis.map { api =>
    df.filter($"username" === user).filter($"normalizedApi" === api).orderBy(rand()).limit(10)
  }.reduce(_ union _)
}.reduce(_ union _)
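The closest I can get in PySpark so far is the sketch below, which replaces the driver-side loop with a window ranking, but I am not sure it is the efficient or idiomatic way (column names as above; the 50/10 limits and the 10% fraction are illustrative):

from pyspark.sql import Window, functions as F

# top 50 users by record count
top_users = (df.groupBy("username").count()
             .orderBy(F.desc("count")).limit(50)
             .select("username"))

# each of those users' 10 most-accessed APIs, via a per-user ranking window
api_counts = (df.join(top_users, "username")
              .groupBy("username", "normalizedApi").count())
api_rank = Window.partitionBy("username").orderBy(F.desc("count"))
top_apis = (api_counts
            .withColumn("rank", F.row_number().over(api_rank))
            .filter(F.col("rank") <= 10)
            .select("username", "normalizedApi"))

# keep only events for those (user, API) pairs and take a 10% sample
normal_df = (df.join(top_apis, ["username", "normalizedApi"])
             .sample(withReplacement=False, fraction=0.1))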
I still can't figure out how this can be done efficiently in PySpark. Any help would be appreciated.
Edit 2:
// desired number of users: 100
val users = df.groupBy("username").count.orderBy($"count".desc).select("username").as[String].take(100)

// desired number of APIs accessed by each selected user: 100
val user_apis = users.map { user =>
  val users_apis = df.filter($"username" === user).groupBy("normalizedApi").count.orderBy($"count".desc).select("normalizedApi").as[String].take(100)
  (user, users_apis)
}

import org.apache.spark.sql.functions._

val users_and_apis_of_interest = user_apis.toSeq.toDF("username", "apisOfInters")

val normal_df = df.join(users_and_apis_of_interest, Seq("username"), "inner")
  .withColumn("keep", array_contains($"apisOfInters", $"normalizedApi"))
  .filter($"keep" === true)
  .distinct
  .drop("keep", "apisOfInters")
  .sample(true, 0.5)
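For completeness, my rough PySpark transliteration of the Edit 2 code is below (a sketch only: same column names, spark is the active SparkSession, and I go through expr for array_contains so it also works on older Spark versions):

from pyspark.sql import functions as F

# desired number of users: 100
users = [r["username"] for r in
         df.groupBy("username").count().orderBy(F.desc("count")).limit(100).collect()]

# desired number of APIs accessed by each selected user: 100
user_apis = []
for user in users:
    apis = [r["normalizedApi"] for r in
            df.filter(F.col("username") == user)
              .groupBy("normalizedApi").count()
              .orderBy(F.desc("count")).limit(100).collect()]
    user_apis.append((user, apis))

users_and_apis_of_interest = spark.createDataFrame(user_apis, ["username", "apisOfInters"])

normal_df = (df.join(users_and_apis_of_interest, ["username"], "inner")
             .withColumn("keep", F.expr("array_contains(apisOfInters, normalizedApi)"))
             .filter(F.col("keep"))
             .distinct()
             .drop("keep", "apisOfInters")
             .sample(withReplacement=True, fraction=0.5))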