I've been working on a PySpark-based content pull/sync tool (akin to Sqoop, but applying some transforms as a pipeline). I'm creating a JDBC connection via:
datatype_boundry_mappings = {
    # https://learn.microsoft.com/en-us/sql/t-sql/data-types/int-bigint-smallint-and-tinyint-transact-sql
    # We are not using negative numbers below 10 anywhere, so our lower limits do not match the SQL Server ranges
    'sqlserver': {
        'tinyint': {
            'lower': 0,
            'upper': 255
        },
        'smallint': {
            'lower': -10,
            'upper': 32767
        },
        'int': {
            'lower': -10,
            'upper': 2147483647
        },
        'bigint': {
            'lower': -10,
            'upper': 9223372036854775807
        }
    }
}
(...)
jdbc_df = spark.read \
    .format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", table_name) \
    .option("user", db_username) \
    .option("password", db_password) \
    .option("numPartitions", partitions) \
    .option("partitionColumn", primary_key_column) \
    .option("lowerBound", datatype_boundry_mappings['sqlserver'][primary_key_type]['lower']) \
    .option("upperBound", datatype_boundry_mappings['sqlserver'][primary_key_type]['upper']) \
    .option("selectMethod", 'direct') \
    .load()
In the PySpark REPL I can confirm that I have the 4 partitions; however, I don't know how to tell what share of the lower/upper bound range each partition gets.
In [43]: delta_df.rdd.getNumPartitions()
Out[43]: 4
When I run a save, 3 of the executors/tasks complete very quickly and the last executor does all of the work. The same thing happens when I scale to a very large (1 billion+ row) table: 200 tasks, of which 199 complete in < 100 ms and the last one does all of the work.
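A quick way to see that same skew as per-partition row counts (a sketch; it forces a full read, so I only run it against the smaller tables):

from pyspark.sql.functions import spark_partition_id

# Count the rows that land in each partition of the JDBC read.
jdbc_df.groupBy(spark_partition_id().alias("partition_id")) \
    .count() \
    .orderBy("partition_id") \
    .show()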
A few questions I have:
- How can I debug what range of the partition column each partition / task set covers?
- How can I see what each task's query would be (similar to #1)? My rough reconstruction of how the bounds get split is sketched below.
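For context on #2: my understanding from skimming JDBCRelation.columnPartition in the Spark source is that the [lowerBound, upperBound] range is cut into numPartitions strides, with the first partition left unbounded below (plus NULLs) and the last unbounded above. The sketch below just replicates that arithmetic locally to print what I believe the per-partition WHERE clauses look like; it's my reconstruction, not output from Spark, and 'id' stands in for primary_key_column.

def sketch_partition_clauses(column, lower, upper, num_partitions):
    # Mirrors my reading of Spark 2.x JDBCRelation.columnPartition:
    # stride via integer division, first partition open below (plus NULLs),
    # last partition open above. (Python's // floors while Scala's / truncates,
    # so the numbers may differ slightly when the lower bound is negative.)
    stride = upper // num_partitions - lower // num_partitions
    clauses = []
    current = lower
    for i in range(num_partitions):
        lower_clause = "{} >= {}".format(column, current) if i != 0 else None
        current += stride
        upper_clause = "{} < {}".format(column, current) if i != num_partitions - 1 else None
        if lower_clause and upper_clause:
            clauses.append("{} AND {}".format(lower_clause, upper_clause))
        elif upper_clause:
            clauses.append("{} OR {} IS NULL".format(upper_clause, column))
        else:
            clauses.append(lower_clause)
    return clauses

bounds = datatype_boundry_mappings['sqlserver']['bigint']
for clause in sketch_partition_clauses('id', bounds['lower'], bounds['upper'], 4):
    print(clause)

If that reconstruction is right, then with the bigint bounds above nearly all of the real key values would land inside a single stride, which would line up with one task doing all of the work.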
I have tried converting the query to spark.sql, doing createOrReplaceTempView, etc. The optimizer there does the same thing (which makes sense, since it's sharing the same dataframe / connection settings).
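Roughly what that attempt looked like (paraphrased; the view name is illustrative):

# Register the JDBC dataframe and read it back through Spark SQL instead.
jdbc_df.createOrReplaceTempView("source_table")
sql_df = spark.sql("SELECT * FROM source_table")
sql_df.rdd.getNumPartitions()  # still 4, with the same uneven split on save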
Upstream is the MSSQL JDBC driver mssql-jdbc-6.2.2.jre8, JDK 1.8.112 (Oracle), PySpark 2.2.0.2.6.3.0-235 (HDP 2.6.3).