Any suggestions, help, or references for the problem statement below are most welcome. I am performing big-data analysis on data currently stored on Azure. My actual implementation is more complex than the equations given here, but a solution to this simplified problem would go a long way toward solving my real set of equations.

Problem Statement:

Develop a program using PySpark on Databricks that solves a system of multi-variable equations and generates columns x and y based on the provided formulas.

Requirements:

Input columns: Two columns v and z are provided, both containing valid floating-point values and having the same number of entries.

Output columns:

Two columns x and y, with the first entry of x initialized to 0 and the first entry of y initialized to 1.

System of equations:

x[k+1] = v[k] * y[k] + x[k]

y[k+1] = 0.01 * z[k] + y[k]
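
As a quick sanity check: with the initial values x = 0 and y = 1, the first input row (v = 1, z = 10) gives x = 1 * 1 + 0 = 1.0 and y = 0.01 * 10 + 1 = 1.1, which matches the first row of the expected output shown further below.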

Output:

Show all the columns: v, z, x and y.

Question:

How would you solve these equations using PySpark? What would you suggest differently from my chosen approach?

My research so far suggests that parallelizing this computation is not an option in PySpark, since each row depends on the previous one. In plain Python I would definitely go with pandas, but pandas runs on a single node, which raises memory and resource concerns for large data. Databricks created the Koalas library as a distributed equivalent of pandas, and it was eventually integrated into Spark 3.2+ as pyspark.pandas. Since I cannot parallelize the computation, I have decided to go with pyspark.pandas: although I have not benchmarked it, it runs on the cluster rather than on a single node. Even though this approach does not exploit the true distributed-computing capability of PySpark, it is the only option I could find for my problem statement.
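
For reference, since the real data lives on Azure and would typically arrive as a Spark DataFrame, a pandas-on-Spark frame can be obtained from it directly. A minimal sketch (the storage path and format are placeholders; 'spark' is the session Databricks provides):

import pyspark.pandas as ps

# Hypothetical load of the real data from Azure storage as a Spark DataFrame;
# the path and format below are placeholders, not my actual layout
sdf = spark.read.parquet("abfss://container@account.dfs.core.windows.net/path")

# Convert it to a pandas-on-Spark frame; on Spark 3.2 the method is named
# to_pandas_on_spark(), later versions call it pandas_api()
psdf = sdf.pandas_api()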

The solution using pandas was coded as follows.

import pandas as pd

# Sample input data
data = [(0.0,1,10),
        (0.5,1,20),
        (1.0,2,10),
        (1.3,2,20),
        (1.6,2,30),
        (2.0,1,10),
        (2.5,1,20),
        (3.0,2,10),
        (3.3,2,20),
        (3.6,2,30)]

# Create a pandas DataFrame with input columns t, v, and z
df = pd.DataFrame(data, columns=["t", "v", "z"])

# Create placeholder columns x and y (every row is overwritten in the loop)
df["x"] = 0.0
df["y"] = 0.0

# Initial state: x starts at 0, y starts at 1
x = 0.0
y = 1.0

# Track the current integer second so the state can be reset when it changes
prev_second = int(df.at[0, "t"])

# System of equations, with the index shifted so that row k stores the
# updated state (equivalent to x[k+1] = v[k] * y[k] + x[k] above):
# x(k) = v(k) * y(k-1) + x(k-1)
# y(k) = y(k-1) + 0.01 * z(k)

# Apply the equations to generate the values for columns x and y using the index
for i in df.index:
    current_second = int(df.at[i, "t"])
    
    # Check if a new second has started
    if current_second > prev_second:
        x = 0.0
        y = 1.0
        prev_second = current_second
    
    # Calculate x and y
    x = df.at[i, "v"] * y + x
    df.at[i, "x"] = x
    y = 0.01 * df.at[i, "z"] + y 
    df.at[i, "y"] = y

# Show the resulting DataFrame
print(df)
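
As a side note on the pandas version: for large frames, the same loop can be written with itertuples(), which avoids the repeated .at scalar lookups and is usually much faster. A sketch with identical logic and output:

# Same recurrence using itertuples(); xs/ys buffer the results and are
# assigned to the frame once at the end
xs, ys = [], []
x, y, prev_second = 0.0, 1.0, int(df.at[0, "t"])
for row in df.itertuples(index=False):
    if int(row.t) > prev_second:          # new second: reset the state
        x, y, prev_second = 0.0, 1.0, int(row.t)
    x = row.v * y + x
    y = 0.01 * row.z + y
    xs.append(x)
    ys.append(y)
df["x"], df["y"] = xs, ys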

The solution using Koalas was coded as follows.

import databricks.koalas as ks

# Sample input data
data = [(0.0,1,10),
        (0.5,1,20),
        (1.0,2,10),
        (1.3,2,20),
        (1.6,2,30),
        (2.0,1,10),
        (2.5,1,20),
        (3.0,2,10),
        (3.3,2,20),
        (3.6,2,30)]

# Create a Koalas DataFrame with input columns t, v, and z
df = ks.DataFrame(data, columns=["t", "v", "z"])

# Define a class to encapsulate the state and calculation logic
class diffEq_ks:
    def __init__(self):
        self.x = 0.0
        self.y = 1.0
        self.prev_second = None

    def calculate_values(self, row):
        if self.prev_second is None:
            self.prev_second = int(row["t"])
        current_second = int(row["t"])
        
        # Check if a new second has started
        if current_second > self.prev_second:
            self.x = 0.0
            self.y = 1.0
            self.prev_second = current_second
        
        # Calculate x and y
        self.x = row["v"] * self.y + self.x
        row["x"] = self.x
        self.y = 0.01 * row["z"] + self.y
        row["y"] = self.y
        return row

# Initialize the calculator
Eq_ks = diffEq_ks()

# Apply the calculate_values function to the DataFrame
df = df.apply(Eq_ks.calculate_values, axis=1)

# Show the resulting DataFrame
print(df)

The solution using pyspark.pandas was coded as follows. This is the approach I have chosen.

import pyspark.pandas as ps

# Sample input data
data = [(0.0,1,10),
        (0.5,1,20),
        (1.0,2,10),
        (1.3,2,20),
        (1.6,2,30),
        (2.0,1,10),
        (2.5,1,20),
        (3.0,2,10),
        (3.3,2,20),
        (3.6,2,30)]

# Create a pyspark.pandas DataFrame with input columns t, v, and z
df = ps.DataFrame(data, columns=["t", "v", "z"])

# Define a class to encapsulate the state and calculation logic
class diffEq_pyps:
    def __init__(self):
        self.x = 0.0
        self.y = 1.0
        self.prev_second = None

    def calculate_values(self, row):
        if self.prev_second is None:
            self.prev_second = int(row["t"])
        current_second = int(row["t"])
        
        # Check if a new second has started
        if current_second > self.prev_second:
            self.x = 0.0
            self.y = 1.0
            self.prev_second = current_second
        
        # Calculate x and y
        self.x = row["v"] * self.y + self.x
        row["x"] = self.x
        self.y = 0.01 * row["z"] + self.y
        row["y"] = self.y

        return row
    
# Initialize the calculator
Eq_pyps = diffEq_pyps()

# Apply calculate_values row by row. Note that this carries Python-level
# state (x, y, prev_second) across rows, which is only safe if the rows are
# processed in time order within a single partition
df = df.apply(Eq_pyps.calculate_values, axis=1)

# Show the resulting DataFrame
print(df)
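
One caveat worth flagging with this choice: pandas-on-Spark's apply(..., axis=1) can process the frame in several partitions, in which case the state carried inside diffEq_pyps would restart per partition. A defensive sketch (assuming the .spark accessor's coalesce() on pandas-on-Spark frames; this funnels the work through a single task, which matches the inherently sequential computation, and the exact ordering guarantees should be verified on a real cluster):

# Keep the rows time-ordered and in a single partition so the carried
# state (x, y, prev_second) sees every row exactly once, in order
df = df.sort_values("t")
df = df.spark.coalesce(1)
df = df.apply(diffEq_pyps().calculate_values, axis=1)  # fresh state object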

All three implementations produced the output below, which is the required result:

     t    v     z    x    y
0  0.0  1.0  10.0  1.0  1.1
1  0.5  1.0  20.0  2.1  1.3
2  1.0  2.0  10.0  2.0  1.1
3  1.3  2.0  20.0  4.2  1.3
4  1.6  2.0  30.0  6.8  1.6
5  2.0  1.0  10.0  1.0  1.1
6  2.5  1.0  20.0  2.1  1.3
7  3.0  2.0  10.0  2.0  1.1
8  3.3  2.0  20.0  4.2  1.3
9  3.6  2.0  30.0  6.8  1.6

1 Answer

I'm not sure how helpful this answer is, since I couldn't cast your iterative equations into a standard form or find an off-the-shelf iterative equation solver. But you can definitely use scipy's fsolve to solve systems of non-linear equations.

Here's an example:

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType
from scipy.optimize import fsolve
from math import exp

sc = SparkContext('local')
sqlContext = SQLContext(sc)

# Sample input data
data = [(0.0, 1, 10),
        (0.5, 1, 20),
        (1.0, 2, 10),
        (1.3, 2, 20),
        (1.6, 2, 30),
        (2.0, 1, 10),
        (2.5, 1, 20),
        (3.0, 2, 10),
        (3.3, 2, 20),
        (3.6, 2, 30)]

# Create a Spark DataFrame with input columns t, v, and z
column_list = ["t", "v", "z"]

df_spark = sqlContext.createDataFrame(data=data, schema =column_list)
print("Printing out df_spark")
df_spark.show(20, truncate=False)

def equations_provided(args_tuple):
    # Example non-linear system (not the asker's recurrence):
    #   x + y^2 - 4 = 0
    #   exp(x) + x*y - 3 = 0
    x, y = args_tuple[0], args_tuple[1]
    eq1 = x + y**2 - 4
    eq2 = exp(x) + x*y - 3
    return [eq1, eq2]

def fsolve_call_scipy(x, y):
    # https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.fsolve.html
    # The row values passed in are used as the initial guess for the solver
    result = fsolve(equations_provided, (x, y))
    return result.tolist()

# Declare the return type so Spark gets a real array column instead of a string
solver_udf = udf(fsolve_call_scipy, ArrayType(DoubleType()))

answer_df = df_spark.select(*df_spark.columns, solver_udf(df_spark["t"], df_spark["v"]).alias("soln"))
print("answer_df dataframe")
answer_df.show(n=100, truncate=False)

Output is as follows:

Printing out df_spark
+---+---+---+
|t  |v  |z  |
+---+---+---+
|0.0|1  |10 |
|0.5|1  |20 |
|1.0|2  |10 |
|1.3|2  |20 |
|1.6|2  |30 |
|2.0|1  |10 |
|2.5|1  |20 |
|3.0|2  |10 |
|3.3|2  |20 |
|3.6|2  |30 |
+---+---+---+

answer_df dataframe
+---+---+---+----------------------------------------+
|t  |v  |z  |soln                                    |
+---+---+---+----------------------------------------+
|0.0|1  |10 |[0.6203445234850499, 1.838383930661961] |
|0.5|1  |20 |[0.6203445234785517, 1.8383839306684822]|
|1.0|2  |10 |[0.6203445234852288, 1.8383839306615979]|
|1.3|2  |20 |[0.6203445234852818, 1.8383839306616214]|
|1.6|2  |30 |[0.6203445234851762, 1.838383930661591] |
|2.0|1  |10 |[0.6203445234852258, 1.8383839306615946]|
|2.5|1  |20 |[0.620344523485133, 1.8383839306616825] |
|3.0|2  |10 |[0.6203445234858643, 1.8383839306615082]|
|3.3|2  |20 |[0.6203445234852366, 1.8383839306615928]|
|3.6|2  |30 |[0.6203445234863979, 1.8383839306614038]|
+---+---+---+----------------------------------------+
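
One further thought: since the recurrence in the question resets at each new integer second, the per-second groups are independent of one another. That suggests a way to keep the sequential loop while still parallelizing across groups: PySpark's groupBy().applyInPandas (Spark 3.0+) runs a plain pandas function once per group. A minimal sketch of that idea (not benchmarked; the grouping key "second" and the schema are assumptions derived from the sample data and reset logic):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pandas as pd

spark = SparkSession.builder.getOrCreate()

data = [(0.0, 1, 10), (0.5, 1, 20), (1.0, 2, 10), (1.3, 2, 20), (1.6, 2, 30),
        (2.0, 1, 10), (2.5, 1, 20), (3.0, 2, 10), (3.3, 2, 20), (3.6, 2, 30)]
sdf = spark.createDataFrame(data, ["t", "v", "z"]) \
           .withColumn("second", F.floor("t"))

def integrate_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # Each group covers one integer second, so x and y reset naturally
    pdf = pdf.sort_values("t")
    x, y = 0.0, 1.0
    xs, ys = [], []
    for v, z in zip(pdf["v"], pdf["z"]):
        x = v * y + x
        y = 0.01 * z + y
        xs.append(x)
        ys.append(y)
    return pdf.assign(x=xs, y=ys)

# Groups run in parallel across the cluster; the loop inside each group
# stays sequential, which is all the recurrence actually requires
result = sdf.groupBy("second").applyInPandas(
    integrate_group,
    schema="t double, v long, z long, second long, x double, y double")
result.orderBy("t").show()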