I am working on streaming web-server records with PySpark in real time, and I want to reduce/filter the data of a certain period (let's say 1 week, which is about 10M records) down to about 1M records, so that the sampled data represents normal traffic with the most-used characteristics. I tried the following strategies in Python:
- find the most used username, let's say the top n like Ali & Eli ----> df['username'].value_counts()
- find the most used APIs (api) that Ali & Eli accessed individually.
- At first we need to filter the records belonging to Ali & Eli, e.g. df_filter_Ali = df[df["username"] == "Ali"], and find the most used APIs (api) by Ali ----> df_filter_Ali['api'].value_counts(), let's say \a\s\d\ & \a\b\c\
- filter the records of Ali which contain the most-accessed APIs \a\s\d\ & \a\b\c\
- but do them separately; in other words (roughly, as PySpark-style pseudocode; a cleaner sketch follows right below):
df.filter(username=Ali).filter(api=/a).sample(0.1)
  .union(df.filter(username=Ali).filter(api=/b).sample(0.1))
  .union(df.filter(username=Pejman).filter(api=/a).sample(0.1))
  .union(df.filter(username=Ali).filter(api=/z).sample(0.1))
  .union(df.filter(username=Pejman or Ali).filter(api=/a, /b, /z))
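In actual PySpark syntax I imagine that chain looking roughly like the sketch below (just a sketch: the usernames, API paths, and the 0.1 fraction are the illustrative values from above, and I call the API column api here, although my later code uses normalizedApi):

from pyspark.sql import functions as F

# union of per-(user, API) samples, mirroring the pseudocode chain above
sampled = (
    df.filter((F.col("username") == "Ali") & (F.col("api") == "/a")).sample(fraction=0.1)
    .unionByName(df.filter((F.col("username") == "Ali") & (F.col("api") == "/b")).sample(fraction=0.1))
    .unionByName(df.filter((F.col("username") == "Pejman") & (F.col("api") == "/a")).sample(fraction=0.1))
    .unionByName(df.filter((F.col("username") == "Ali") & (F.col("api") == "/z")).sample(fraction=0.1))
)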
Then we can expect the other features belonging to these events to be representative of the normal data distribution.
I think groupby() doesn't give us the right distribution.
# Task 1: normal data sampling
import pandas as pd

df = pd.read_csv("df.csv", sep=";")

# Strategy 1: sample 10% inside every (top-50 user, top-10 API) pair
df1 = []
for first_column in df["username"].value_counts().index[:50]:
    second_column_most_values = df.loc[df["username"] == first_column]["normalizedApi"].value_counts().index
    for second_column in second_column_most_values[:10]:
        sample = df.loc[(df["username"] == first_column) & (df["normalizedApi"] == second_column)].sample(frac=0.1)
        df1.append(sample)
df1 = pd.concat(df1)

# Strategy 2: gather each user's top-10 APIs first, then sample 10% per user
df2 = []
for first_column in df["username"].value_counts().index[:50]:
    second_column_most_values = df.loc[df["username"] == first_column]["normalizedApi"].value_counts().index
    user_specific_data = []
    for second_column in second_column_most_values[:10]:
        sample = df.loc[(df["username"] == first_column) & (df["normalizedApi"] == second_column)]
        user_specific_data.append(sample)
    df2.append(pd.concat(user_specific_data).sample(frac=0.1))
df2 = pd.concat(df2)

# Strategy 3: gather everything for the top users/APIs, then sample 10% globally
df3 = []
for first_column in df["username"].value_counts().index[:50]:
    second_column_most_values = df.loc[df["username"] == first_column]["normalizedApi"].value_counts().index
    user_specific_data = []
    for second_column in second_column_most_values[:10]:
        sample = df.loc[(df["username"] == first_column) & (df["normalizedApi"] == second_column)]
        user_specific_data.append(sample)
    df3.append(pd.concat(user_specific_data))
df3 = pd.concat(df3)
df3 = df3.sample(frac=0.1)

# combine the three samples and deduplicate
sampled_napi_df = pd.concat([df1, df2, df3])
sampled_napi_df = sampled_napi_df.drop_duplicates()
sampled_napi_df = sampled_napi_df.reset_index(drop=True)
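As a side note, on pandas >= 1.1 the per-(user, API) 10% sample from the first strategy can probably be written more compactly with GroupBy.sample; a minimal sketch under the same assumptions (top-50 users, top-10 APIs per user, same df as above):

# restrict to the top-50 users and each user's 10 most-used APIs
top_users = df["username"].value_counts().index[:50]
subset = df[df["username"].isin(top_users)]
top_apis = (
    subset.groupby("username")["normalizedApi"]
    .value_counts()                      # per-user API counts, sorted descending within each user
    .groupby(level="username")
    .head(10)                            # keep the 10 most-used APIs per user
    .index.to_frame(index=False)         # -> DataFrame with columns username, normalizedApi
)
subset = subset.merge(top_apis, on=["username", "normalizedApi"])

# one 10% sample per (username, normalizedApi) group, analogous to df1 above
df1_alt = subset.groupby(["username", "normalizedApi"]).sample(frac=0.1)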
I checked the existing posts in this regard, but I couldn't find any interesting approach except a few: post1, Filtering streaming data to reduce noise, kalman filter, and How correctly reduce stream to another stream, which are C++ or Java solutions!
Edit 1: I tried Scala: pick the top 50 usernames, loop over the top APIs each of them accessed, take a small random sample per (user, API) pair, and union everything back into a filtered df:
// (in spark-shell these implicits are already in scope)
import spark.implicits._

// top 50 users by record count
val users = df.groupBy("username").count.orderBy($"count".desc).select("username").as[String].take(50)

// for each of those users, their 50 most-accessed APIs
val user_apis = users.map { user =>
  val users_apis = df.filter($"username" === user).groupBy("normalizedApi").count.orderBy($"count".desc).select("normalizedApi").as[String].take(50)
  (user, users_apis)
}

import org.apache.spark.sql.functions.rand

// 10 random records per (user, API) pair, unioned together
val df_sampled = user_apis.map { case (user, userApis) =>
  userApis.map { api =>
    df.filter($"username" === user).filter($"normalizedApi" === api).orderBy(rand()).limit(10)
  }.reduce(_ union _)
}.reduce(_ union _)
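The closest I can get in PySpark so far is the sketch below, which replaces the driver-side loop with a window ranking, but I am not sure it is the efficient or idiomatic way (column names as above; the 50/10 limits and the 10% fraction are illustrative):

from pyspark.sql import Window, functions as F

# top 50 users by record count
top_users = (df.groupBy("username").count()
             .orderBy(F.desc("count")).limit(50)
             .select("username"))

# each of those users' 10 most-accessed APIs, via a per-user ranking window
api_counts = (df.join(top_users, "username")
              .groupBy("username", "normalizedApi").count())
api_rank = Window.partitionBy("username").orderBy(F.desc("count"))
top_apis = (api_counts
            .withColumn("rank", F.row_number().over(api_rank))
            .filter(F.col("rank") <= 10)
            .select("username", "normalizedApi"))

# keep only events for those (user, API) pairs and take a 10% sample
normal_df = (df.join(top_apis, ["username", "normalizedApi"])
             .sample(withReplacement=False, fraction=0.1))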
I still can't figure out how this can be done efficiently in PySpark. Any help would be appreciated.
Edit 2:
// desired number of users: 100
val users = df.groupBy("username").count.orderBy($"count".desc).select("username").as[String].take(100)

// desired number of APIs accessed by each selected user: 100
val user_apis = users.map { user =>
  val users_apis = df.filter($"username" === user).groupBy("normalizedApi").count.orderBy($"count".desc).select("normalizedApi").as[String].take(100)
  (user, users_apis)
}

import org.apache.spark.sql.functions._

val users_and_apis_of_interest = user_apis.toSeq.toDF("username", "apisOfInters")

val normal_df = df.join(users_and_apis_of_interest, Seq("username"), "inner")
  .withColumn("keep", array_contains($"apisOfInters", $"normalizedApi"))
  .filter($"keep" === true)
  .distinct
  .drop("keep", "apisOfInters")
  .sample(true, 0.5)
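For completeness, my rough PySpark transliteration of the Edit 2 code is below (a sketch only: same column names, spark is the active SparkSession, and I go through expr for array_contains so it also works on older Spark versions):

from pyspark.sql import functions as F

# desired number of users: 100
users = [r["username"] for r in
         df.groupBy("username").count().orderBy(F.desc("count")).limit(100).collect()]

# desired number of APIs accessed by each selected user: 100
user_apis = []
for user in users:
    apis = [r["normalizedApi"] for r in
            df.filter(F.col("username") == user)
              .groupBy("normalizedApi").count()
              .orderBy(F.desc("count")).limit(100).collect()]
    user_apis.append((user, apis))

users_and_apis_of_interest = spark.createDataFrame(user_apis, ["username", "apisOfInters"])

normal_df = (df.join(users_and_apis_of_interest, ["username"], "inner")
             .withColumn("keep", F.expr("array_contains(apisOfInters, normalizedApi)"))
             .filter(F.col("keep"))
             .distinct()
             .drop("keep", "apisOfInters")
             .sample(withReplacement=True, fraction=0.5))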