
I've been working on a PySpark-based content pull / sync tool (akin to Sqoop, but applying some transforms as a pipeline). I'm creating a JDBC connection via:

datatype_boundry_mappings = {
    # https://learn.microsoft.com/en-us/sql/t-sql/data-types/int-bigint-smallint-and-tinyint-transact-sql
    # We are not using negative numbers below -10 anywhere, so our lower limit does not match the SQL Server range
    'sqlserver': {
        'tinyint': {
            'lower': 0,
            'upper': 255
        },
        'smallint': {
            'lower': -10,
            'upper': 32767
        },
        'int': {
            'lower': -10,
            'upper': 2147483647
        },
        'bigint': {
            'lower': -10,
            'upper': 9223372036854775807
        }
    }
}

(...)

    jdbc_df = spark.read \
        .format("jdbc") \
        .option("url", jdbc_url) \
        .option("dbtable", table_name) \
        .option("user", db_username) \
        .option("password", db_password) \
        .option("numPartitions", partitions) \
        .option("partitionColumn", primary_key_column) \
        .option("lowerBound", datatype_boundry_mappings['sqlserver'][primary_key_type]['lower']) \
        .option("upperBound", datatype_boundry_mappings['sqlserver'][primary_key_type]['upper']) \
        .option("selectMethod", 'direct') \
        .load()

In the PySpark REPL I can confirm that I have the 4 partitions; however, I do not know how to tell what each partition's upper/lower bound share is.

In [43]: delta_df.rdd.getNumPartitions()
Out[43]: 4

When I run a save, 3 of the executors/tasks complete very quickly and the last executor does all of the work. The same thing happens when I scale to a very large (1 billion+ row) table: 200 tasks, of which 199 complete in < 100 ms and the last one does all of the work.
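
A minimal sketch of the per-partition check I've been running to confirm this (jdbc_df is the dataframe from the read above; diagnostic only, not part of the pipeline):

counts_per_partition = jdbc_df.rdd \
    .mapPartitionsWithIndex(lambda idx, rows: [(idx, sum(1 for _ in rows))]) \
    .collect()
print(counts_per_partition)
# roughly what I see: all but one partition report 0 rows, the last one holds everything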

A few questions I have:

  1. How can I debug what each partition / task's split is?
  2. How can I see what each task's query would be (similar to #1)? The closest I've gotten so far is sketched below.
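
For #2, the nearest thing I've found is the query plan, which shows the JDBCRelation scan and the partition count but (as far as I can tell) not the per-task WHERE predicates that Spark derives from lowerBound/upperBound:

jdbc_df.explain(True)  # extended logical + physical plan; I can't find the per-partition ranges here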

I have tried converting the query to spark.sql and doing create temp view etc. (roughly as sketched below). The optimizer there does the same thing, which makes sense since it's sharing the same dataframe / connection settings.
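
Roughly what that attempt looked like (the view name is just a placeholder):

jdbc_df.createOrReplaceTempView("source_table")   # placeholder view name
sql_df = spark.sql("SELECT * FROM source_table")  # same JDBC options underneath
sql_df.rdd.getNumPartitions()                     # still 4 partitions, same skew on save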

Upstream is the MSSQL JDBC driver mssql-jdbc-6.2.2.jre8, JDK 1.8.112 (Oracle), and PySpark 2.2.0.2.6.3.0-235 (HDP 2.6.3).

  • For grins I ran the code mentioned in http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=foreach#pyspark.RDD.foreachPartition. As I suspected, of the 4 executors only 1 had the data from the print statement in its stderr log; the rest had only the task initialization log statements. – Bill Schwanitz Dec 26 '17 at 17:03
  • So I can at least provide some details as to what is going on, but I don't know why. When I changed the upper bound for the bigint to 2b (same as int), I actually get the workload split the way I was expecting. I'm guessing this is a variable type error somewhere. – Bill Schwanitz Dec 26 '17 at 22:13
  • Err, it would not let me put this in my last comment. SELECT FROM WHERE (("snip" >= '01/01/70 00:00:00')) AND (snip < 92233720368547748 or dim_script_id is null) was the query. It only executed a single SQL statement, where the field was < the upper bound value. Anyone know how/why this is? – Bill Schwanitz Dec 26 '17 at 22:14
  • Hi Bill. Did you figure out the reason yet? I'm trying to find the reason behind this too, as I'm querying from an Oracle database. Do you think this functionality only works with a distributed database architecture but not for DBs like MSSQL or Oracle (just my thought)? – Ray Aug 02 '18 at 19:55
