In short, my need is to create a Spark DataFrame from a more or less complex T-SQL query (SQL Server) and/or from the output of a SQL Server stored procedure.
As far as I understand, Spark does not let me execute arbitrary statements in the dialect of the underlying data source: the JDBC reader wraps whatever it is given in a subquery, so a single SELECT gets pushed down fine, but a multi-statement batch or a stored procedure call does not work. Yes, there is a way to get hold of a low-level connection object and execute stored procedures through it, but in that manner I don't get a Spark DataFrame as output.
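For reference, below is the standard JDBC path, which covers plain SELECTs but not what I need; the URL and authentication settings are placeholders for my environment, and the Microsoft JDBC driver is assumed to be on the classpath:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Fine for a single SELECT, because Spark wraps it in a subquery and pushes it
# to SQL Server; a multi-statement batch or "EXEC dbo.some_proc" fails for the
# same reason. URL/authentication are placeholders for my environment.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://LOCALHOST;databaseName=master;integratedSecurity=true")
    .option("query", "SELECT TOP 10 name FROM sys.objects")
    .load()
)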
So I thought of running the query the classical pyodbc way, fetching the results, and then building the Spark DataFrame with SparkSession.createDataFrame(data, schema=None, samplingRatio=None, verifySchema=True), providing both the data and the schema. I can obtain the data, but I can't build the schema (a list of (column name, data type) pairs) from the output cursor. Below is a working example that generates and extracts sample data from a local instance of SQL Server:
import pyodbc

# Legacy "SQL Server" ODBC driver; a newer one such as
# "ODBC Driver 17 for SQL Server" also works here, if installed.
connection_string = "Driver={SQL Server};Server=LOCALHOST;Database=master;Trusted_Connection=yes;"
db_connection = pyodbc.connect(connection_string)
# A single T-SQL batch: SET NOCOUNT ON suppresses the row-count messages,
# so the final SELECT's result set is the only result the cursor sees.
sql_query = """
SET NOCOUNT ON
DECLARE @TBL_TEST AS TABLE (
column_1 INT NOT NULL PRIMARY KEY CLUSTERED IDENTITY(1, 1),
column_2 VARCHAR(10) NOT NULL,
column_3 VARCHAR(20) NULL,
column_4 INT NOT NULL
)
INSERT INTO @TBL_TEST (column_2, column_3, column_4)
VALUES
('test1_col2', 'test1_col3', 100),
('test2_col2', 'test2_col3', 200),
('test3_col2', NULL, 300)
SET NOCOUNT OFF
SELECT t.* FROM @TBL_TEST AS t
"""
cursor = db_connection.cursor()
rows = cursor.execute(sql_query).fetchall()  # list of pyodbc.Row objects
cursor.close()
db_connection.close()
print(rows)
How can I extract the schema from the returned cursor and build a schema object to pass to the createDataFrame() function?
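To show the direction I have in mind: pyodbc's cursor.description is a sequence of (name, type_code, display_size, internal_size, precision, scale, null_ok) tuples, where type_code is a plain Python type, so in principle it can be mapped by hand to Spark types. The sketch below is only my guess at such a mapping (the names _PY_TO_SPARK and schema_from_description are mine, and the DecimalType precision/scale is an arbitrary assumption), not something I know to be correct:

import datetime
from decimal import Decimal
from pyspark.sql.types import (
    StructType, StructField, LongType, StringType, DoubleType,
    DecimalType, TimestampType, DateType, BooleanType, BinaryType,
)

# Guessed mapping from pyodbc's type_code (a plain Python type) to Spark types;
# DecimalType(38, 18) and the binary cases are rough assumptions on my part.
_PY_TO_SPARK = {
    int: LongType(),
    str: StringType(),
    float: DoubleType(),
    Decimal: DecimalType(38, 18),
    datetime.datetime: TimestampType(),
    datetime.date: DateType(),
    bool: BooleanType(),
    bytearray: BinaryType(),
    bytes: BinaryType(),
}

def schema_from_description(description):
    # Each entry of cursor.description is a 7-tuple:
    # (name, type_code, display_size, internal_size, precision, scale, null_ok)
    fields = [
        StructField(name, _PY_TO_SPARK.get(type_code, StringType()), bool(null_ok))
        for name, type_code, _ds, _is, _prec, _scale, null_ok in description
    ]
    return StructType(fields)

The idea would then be schema = schema_from_description(cursor.description) (read before closing the cursor), followed by spark.createDataFrame([tuple(r) for r in rows], schema=schema), but I don't know whether this hand-rolled mapping is robust across all SQL Server types.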
Remember that my goal is the one stated at the top, so other approaches are also welcome!
Thank you in advance!