
I am trying to load a table from a SQL Server database to BigQuery. The table is 27 GB in size, with 218 million rows and 28 columns.

The source table doesn't have any column that contains unique values, which Spark needs in order to partition the incoming data evenly. So I applied row_number() to the data I am reading, as below:

    spark = SparkSession.builder.appName('Read_from_Source').getOrCreate()
    dataframe = (spark.read.format('jdbc')
                 .option('url', URL)
                 .option('driver', 'com.microsoft.sqlserver.jdbc.SQLServerDriver')
                 .option('user', user)
                 .option('password', password)
                 .option('dbtable', f'(select *, row_number() over(ORDER BY (SELECT 1)) as row_num from {tablename}) as temp_load')
                 .option('partitionColumn', 'row_num')
                 .option('numPartitions', 400)
                 .option('lowerBound', 1)
                 .option('upperBound', countOfNumOfRows)
                 .load())
    dataframe.write.format('bigquery').option('table', 'tablename').mode('overwrite').save()
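For reference, with these options Spark turns the row_num range into one range query per partition, each issued as its own concurrent JDBC query against SQL Server. The helper below is my own simplified sketch of that splitting logic (the function name is hypothetical; the real implementation lives inside Spark's JDBC source):

```python
# Simplified sketch of how Spark's JDBC source turns
# partitionColumn/lowerBound/upperBound/numPartitions into one
# WHERE predicate per partition (helper name is hypothetical).
def partition_predicates(column, lower, upper, num_partitions):
    stride = max((upper - lower) // num_partitions, 1)
    preds = []
    current = lower
    for i in range(num_partitions):
        if i == 0:
            # first partition also picks up NULLs and anything below lowerBound
            preds.append(f"{column} < {current + stride} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # last partition is open-ended so rows above upperBound are not lost
            preds.append(f"{column} >= {current}")
        else:
            preds.append(f"{column} >= {current} AND {column} < {current + stride}")
        current += stride
    return preds
```

With numPartitions set to 400, this means up to 400 such range queries can be queued against the source table, bounded by how many tasks run at once.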

I gave the below configuration to the job:

> spark.submit.deployMode=cluster
> spark.executor.instances=4
> spark.executor.memory=2g
> spark.executor.cores=4
> spark.driver.memory=2g
> spark.network.timeout=240

When I submit the job, it starts well, but after some time it fails with a connection reset exception, which can be seen below.

Caused by: java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:210)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at com.microsoft.sqlserver.jdbc.TDSChannel.read(IOBuffer.java:1981)

The executors and the stage can be seen below.

[screenshot: Executors tab]

Stages of my job (There is only one stage since it is just read and load):

[screenshot: Stages tab]

I thought the connection was breaking while the data was being read, hence the connection reset exception. So I tried increasing spark.network.timeout to 100000 and increased numPartitions from 200 to 400. The error is still the same.

I also increased the executor instances to 6 (I have 2 worker nodes), executor memory to 6 GB, and driver memory to 4 GB. I still see the same issue.

I am applying partitioning while reading the data in order to avoid data skew. The memory allocated is also quite high.

Edit 1: There is no data going to the driver here, since the given lines are the only code I am running.

Is there anything wrong with the way I framed the code? Can anyone suggest a fix for this exception?

Metadata
    Did you test with a smaller dataset? You can try add a `limit 1000` in your SQL query. – Dagang Mar 03 '21 at 23:54
  • You have 27 GB of data but only 4*2 GB + 2 GB = 10 GB of memory in your Spark context. And applying row_number to 218 million records on SQL Server may be causing the problem. You could do this partitioning after fetching the data from SQL Server. – M.Selman SEZGİN Mar 04 '21 at 12:13
  • Still facing the same issue without partition while reading. – Metadata Mar 04 '21 at 12:17
  • @Dagang It is working with limit. The job even runs with 54million rows of size 20gb. – Metadata Mar 04 '21 at 12:18
  • @Metadata it might be that the JDBC driver cannot correctly estimate a working RS fetch size. Have you tried to explicitly set it to some low value? Something like `option("fetchsize", "1000")` will do the trick. – egordoe Mar 05 '21 at 09:26
  • Did you take screenshots with your camera? – Atais Mar 06 '21 at 23:24
  • @Atais My laptop is not letting me attach the snap of the screen directly. Had to do it this way. – Metadata Mar 07 '21 at 08:05
  • Sorry, I know it's irrelevant but I had to ask. – Atais Mar 07 '21 at 11:02

1 Answer


It seems like you are reading data from SQL Server using at least 200 partitions. SQL Server probably cannot handle that many concurrent queries of this size and drops/resets the connections.

If this is the case, then to work around the issue you need to either reduce the parallelism of the Spark job (the number of partitions could stay the same, but fewer executors/cluster nodes would reduce concurrency), or beef up SQL Server (hardware or configuration) so it can handle a higher load of concurrent large queries.
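For example, a resubmission with lower concurrency might look like the following sketch (script name and exact values are placeholders; 2 executors x 2 cores caps the number of simultaneous JDBC queries at 4):

```shell
spark-submit \
  --deploy-mode cluster \
  --conf spark.executor.instances=2 \
  --conf spark.executor.cores=2 \
  --conf spark.executor.memory=4g \
  --conf spark.driver.memory=2g \
  your_job.py
```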

Also, you may want to try disabling Conscrypt on the cluster: Google Dataproc to SQL Server (based on CentOS 7) connection error?
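If you go that route, Conscrypt can be disabled at cluster creation time via a Dataproc cluster property (cluster name and region below are placeholders):

```shell
gcloud dataproc clusters create my-cluster \
  --region=us-central1 \
  --properties=dataproc:dataproc.conscrypt.provider.enable=false
```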

Generally speaking, you should probably use something like Apache Sqoop to export data from SQL Server.
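A minimal Sqoop invocation for this kind of export might look like the following sketch (connection details, paths, and mapper count are placeholders; with more than one mapper and no primary key on the table, `--split-by` would also be required):

```shell
sqoop import \
  --connect "jdbc:sqlserver://<host>:1433;databaseName=<db>" \
  --username <user> \
  --password-file <hdfs-path-to-password-file> \
  --table <tablename> \
  --num-mappers 8 \
  --target-dir /staging/<tablename>
```

The staged files could then be loaded into BigQuery as a separate step, decoupling the SQL Server read from the BigQuery write.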

Igor Dvorzhak