I am new to (Py)Spark. I have two very large tables, residing in a SQL Server database, that I want to join. I am working in a Jupyter Notebook. I want to load from each SQL table only the columns I need for my analysis, not the whole table:
```python
vod_raw_data = spark.read.jdbc(url="jdbc:sqlserver://000.110.000.71",
                               table="BBBBBBB",
                               properties={"user": "uuu",
                                           "password": "xxxx"})
```
First question
- How can I load only the columns I need (e.g. `SELECT cola, colb, colc` in SQL) instead of the whole table, do the same for the second table, and then join them?
Second question
- Should I load both tables into PySpark and join them there, or can I do the join some other way (e.g. push the whole join down to the database)?
Thanks in advance.