SnappyData Spark Scall java.sql.BatchUpdateException

Question

So, I have around 35 GB of zip files, each one contains 15 csv files, I have created a scala script that processes each one of the zip files and each one of the csv files per each zip file.

The problem is that after some amount of files the script lunches this error

ERROR Executor: Exception in task 0.0 in stage 114.0 (TID 3145) java.io.IOException: java.sql.BatchUpdateException: (Server=localhost/127.0.0.1[1528] Thread=pool-3-thread-63) XCL54.T : [0] insert of keys [7243901, 7243902,

And the string continues with all the keys (records) that were not inserted.

So what I have found is that apparently (I said apparently because of my lack of knowledge about scala and snappy and spark) the memory that is been used is full... my question... how do I increment the size of the memory used? or how do I empty the data that is in memory and save it in the disk?

Can I close the session started and that way free the memory? I have had to restart the server, remove the files processed and then I can continue with the importation but after some other files... again... same exception

My csv files are big... the biggest one is around 1 GB but this exception happens not just with the big files but when accumulating multiple files... until some size is reached... so where do I change that memory use size?

I have 12GB RAM...

score 1 · Answer 1 · answered Jul 11 '17 at 22:02

1

You can use RDD persistance and store to disk/memory or a combination : https://spark.apache.org/docs/2.1.0/programming-guide.html#rdd-persistence

Also, try adding a large number of partitions when reading the file(s): sc.textFile(path, 200000)

answered Jul 11 '17 at 22:02

jsdeveloper

3,945
1
15
14

score 1 · Answer 2 · answered Jul 11 '17 at 23:01

I think you are running out of available memory. The exception message is misleading. If you only have 12GB of memory on your machine, I wonder if your data would fit. What I would do is first figure out how memory you need.

 1. Copy conf/servers.template to conf/servers file
 2) Change this file with something like this: localhost -heap-size=3g
    -memory-size=6g //this essentially allocates 3g in your server for computations    (spark, etc) and allocates 6g of off-heap memory for your data    (column tables only).  
3) start your cluster using snappy-start-all.sh 
4) Load some subset of your data  (I doubt you have enough memory) 
5) Check the memory used in the SnappyData Pulse UI (localhost:5050)

if you think you have enough memory, load the full data.

Hopefully that works out.

Hi, thanks, yes I do know my data does not fit in memory, the total amount is around 300+ GB... so the question here is how to increment the memory first to process each csv ( bigger of 1 GB) without any problems and storing the data in disk or combining with memory. I have found information about RDD Persistance and I guess that is part of my solutions but how/where to configure it? And thanks for the tips I am already configuring them... and... testing... — Mauricio Chica Patiño, Jul 11 '17 at 23:31

score 1 · Answer 3 · answered Jul 12 '17 at 04:24

BatchUpdateException tells me that you are creating Snappy tables and inserting data in them. Also, BatchUpdateException in most of the cases means low memory (exception message needs to be better). So, I believe you may be right about the memory. For freeing the memory, you will have to drop the tables that you created. For information about memory size and table sizing, you may want to read these docs:

http://snappydatainc.github.io/snappydata/best_practices/capacity_planning/#memory-management-heap-and-off-heap

http://snappydatainc.github.io/snappydata/best_practices/capacity_planning/#table-memory-requirements

Also if you have lot of data that can't fit in memory, you can overflow it to disk. See the following doc about the overflow configuration:

http://snappydatainc.github.io/snappydata/best_practices/design_schema/#overflow-configuration

Hope it helps.