Querying a big data table using PySpark

Question

I have two tables that I'm working with in PySpark.

File 1:

Schema: CustomerName:STRING, DOB:STRING, UIN:STRING, MailID:STRING, PhoneNumber:LONG, City:STRING, State:STRING, LivingStatus:STRING, PinCode:STRING, LoanAmount:LONG

Sample Data:
Sakshi, 22-03-86, UIN0043, Sakshi@mail.com, 3344990876, Ahmedabad, Gujarat, BPL, 380001, 23000
Shivani, 22-02-83, UIN0044, Shivani@mail.com, 3344990876, Thiruvananthpuram, Kerala, APL, 695001, 24500

File 2:

Schema: CustomerName:STRING, DOB:STRING, UIN:STRING, City:STRING, State:STRING, PinCode:LONG, CibilScore:LONG, DefaulterFlag:STRING

Sample Data:
Shubham, 23-08-86, UIN0007, Thiruvananthpuram, Kerala, 695001, 3530, N
Anushka, 25-08-82, UIN0008, Thiruvananthpuram, Kerala, 695001, 1530, Y

I need to set the status to Approved if the client is not a defaulter and the credit score is more than 800, using both PySpark core and SQL.

I'm new to this. I tried solving it with PySpark core but got wrong results.

I solved the same problem with SQL after loading the dataset into a MySQL database and got correct results. However, I can't get it to work with PySpark core.

score 0 · Answer 1 · edited May 03 '23 at 12:34

from collections import namedtuple

HomeLoanApplicationData = namedtuple('HomeLoanApplication',['CustomerName', 'DOB', 'UIN', 'MailID', 'PhoneNumber', 'City', 'State', 'LivingStatus', 'PinCode', 'LoanAmount'])

ClientReference = namedtuple('ClientReference',['CustomerName', 'DOB','UIN', 'City', 'State', 'PinCode', 'CibilScore', 'DefaulterFlag'])

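# HomeLoanRDD and ClientReferenceRDD are assumed to be RDDs of raw CSV lines (for example, from sc.textFile).
# Python 3 has no long type, so use int; strip() removes the spaces that follow each comma in the data.

HomeLoanDF = HomeLoanRDD.map(lambda l: [f.strip() for f in l.split(',')]).map(lambda c: HomeLoanApplicationData(c[0], c[1], c[2], c[3], int(c[4]), c[5], c[6], c[7], c[8], int(c[9]))).toDF()

ClientReferenceDF = ClientReferenceRDD.map(lambda l: [f.strip() for f in l.split(',')]).map(lambda c: ClientReference(c[0], c[1], c[2], c[3], c[4], int(c[5]), int(c[6]), c[7])).toDF()

HomeLoanDF.createOrReplaceTempView('Home')

# The view name 'Client' is illustrative; the rule needs CibilScore and DefaulterFlag from this table.
ClientReferenceDF.createOrReplaceTempView('Client')

Once these views are created, you can query them like normal SQL tables using: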

spark.sql("Enter query here").show()
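For the rule in the question, a minimal sketch of both routes (core RDD and SQL) could look like the following. It assumes a DefaulterFlag of 'N' means the client is not a defaulter, uses 'Rejected' as a placeholder label for everyone else (the question only names 'Approved'), and registers the second table under a hypothetical view name 'Client'. The helper names parse_client, approval_status, status_with_core and status_with_sql are illustrative and not part of any library.

```python
from collections import namedtuple

ClientReference = namedtuple(
    'ClientReference',
    ['CustomerName', 'DOB', 'UIN', 'City', 'State',
     'PinCode', 'CibilScore', 'DefaulterFlag'])

# The two File 2 sample rows from the question, one record per line.
SAMPLE_LINES = [
    "Shubham, 23-08-86, UIN0007, Thiruvananthpuram, Kerala, 695001, 3530, N",
    "Anushka, 25-08-82, UIN0008, Thiruvananthpuram, Kerala, 695001, 1530, Y",
]


def parse_client(line):
    """Split one CSV line, strip stray spaces, and cast the numeric fields."""
    c = [field.strip() for field in line.split(',')]
    return ClientReference(c[0], c[1], c[2], c[3], c[4],
                           int(c[5]), int(c[6]), c[7])


def approval_status(client):
    """Assumption: 'N' means not a defaulter; 'Rejected' is a placeholder label."""
    approved = client.DefaulterFlag == 'N' and client.CibilScore > 800
    return 'Approved' if approved else 'Rejected'


# Same rule expressed in SQL against the hypothetical 'Client' view.
STATUS_QUERY = """
SELECT UIN,
       CustomerName,
       CASE WHEN DefaulterFlag = 'N' AND CibilScore > 800
            THEN 'Approved' ELSE 'Rejected' END AS Status
FROM Client
"""


def status_with_core(sc, lines):
    """Core (RDD) route: returns a list of (UIN, status) pairs."""
    return (sc.parallelize(lines)
              .map(parse_client)
              .map(lambda c: (c.UIN, approval_status(c)))
              .collect())


def status_with_sql(spark, lines):
    """SQL route: registers the 'Client' view and returns a DataFrame."""
    df = spark.sparkContext.parallelize(lines).map(parse_client).toDF()
    df.createOrReplaceTempView('Client')
    return spark.sql(STATUS_QUERY)
```

In a pyspark shell, where spark and sc already exist, status_with_core(sc, SAMPLE_LINES) and status_with_sql(spark, SAMPLE_LINES).show() should give the same status per UIN. Stripping whitespace in parse_client matters here: without it the flag is read as ' N' and the comparison with 'N' silently fails, which is one plausible cause of wrong results with the core approach.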
