I am new to (Py)Spark. I have two very large tables, residing in a SQL Server database, that I want to join. I am working in a Jupyter Notebook. I want to load from each SQL table only the columns I need for my analysis, not the whole table:
```python
vod_raw_data = spark.read.jdbc(url="jdbc:sqlserver://000.110.000.71",
                               table="BBBBBBB",
                               properties={"user": "uuu",
                                           "password": "xxxx"})
```
First question
- How can I load only the columns I need (e.g. `SELECT cola, colb, colc` in SQL) instead of the whole table, do the same for the second table, and then join them?
Second question
- Should I load both tables into PySpark and join them there, or can I do the join some other way (e.g. push the whole join down to the database)?
Thanks in advance.