
I am reading a CSV file from GCS, and I need to go through each row, call an API to get some data back, and append the result to a new DataFrame.

The code goes something like this:

Dataset<Row> df = sparkSession.read().option("header", true).csv("gs://bucketPath");
df = df.map((MapFunction<Row, Row>) row -> callApi(row), RowEncoder.apply(schema)).cache();
df.write().format("bigquery").save();

The main problem I am seeing is that all the logs from callApi are coming from a single container. I tried messing with openCostInBytes and I tried writing a custom UDF, but everything still gets processed in one container.
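
For reference, the kind of tuning I have been trying looks roughly like this (a sketch; the byte values are arbitrary examples, and spark.sql.files.maxPartitionBytes is a related knob I have seen mentioned but have not confirmed helps here):

// sketch only: arbitrary example values, set before the CSV read
sparkSession.conf().set("spark.sql.files.openCostInBytes", 4L * 1024 * 1024);
// related setting that caps how many bytes go into a single input partition (my assumption that it matters here)
sparkSession.conf().set("spark.sql.files.maxPartitionBytes", 32L * 1024 * 1024);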

Ideally, given 10 rows and 2 executors with 5 cores each, I would like each executor to process 5 rows, with each core getting 1 row. Is my understanding of Spark incorrect?
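
For illustration, this is the kind of explicit split I was expecting to be able to force (a sketch reusing the same variables as above; the repartition(10) call and the partition-count check are my guesses at how to spread the rows, not something I have confirmed for this case):

Dataset<Row> df = sparkSession.read().option("header", true).csv("gs://bucketPath");
System.out.println("partitions after read: " + df.rdd().getNumPartitions());

// hoped-for result: 10 rows -> 10 partitions, so each core gets one row's API call
df = df.repartition(10);
df = df.map((MapFunction<Row, Row>) row -> callApi(row), RowEncoder.apply(schema)).cache();
df.write().format("bigquery").save();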

PS: I know I shouldn't call APIs from Spark, and I know Scala is better suited to Spark than Java. Please just humor me.

  • I figured it out by luck, I guess. It appears that when reading CSV, the file only gets read by one container, especially if the data is around 2 GB. When reading from Avro, I believe Spark is able to split the data further by rows across partitions. – Darshan Kothari Jan 13 '21 at 17:20
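
For anyone checking the same thing, a minimal sketch to compare how the two reads get split (the Avro path and the spark-avro format are placeholders/assumptions, not from the original post):

Dataset<Row> csvDf = sparkSession.read().option("header", true).csv("gs://bucketPath");
// the "avro" format requires the spark-avro package on the classpath
Dataset<Row> avroDf = sparkSession.read().format("avro").load("gs://bucketPath-avro");
System.out.println("csv partitions:  " + csvDf.rdd().getNumPartitions());
System.out.println("avro partitions: " + avroDf.rdd().getNumPartitions());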
