I am reading a CSV file from GCS, and for each row I need to call an API to get some data back and append the result to a new DataFrame.
The code goes something like this:
Dataset<Row> df = sparkSession.read().option("header", true).csv("gs://bucketPath"); // read the CSV from GCS
df = df.map((MapFunction<Row, Row>) row -> callApi(row), RowEncoder.apply(schema)).cache(); // call the API once per row
df.write().format("bigquery").save(); // write the enriched rows to BigQuery
The main problem I am seeing is that all the logs from callApi are coming from a single container. I tried tweaking openCostInBytes and I also tried writing a custom UDF, but the rows still get processed in one container.
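For context, this is roughly how I set that option when building the session (the 1 MB value here is just an example, not the exact number I used):

SparkSession sparkSession = SparkSession.builder()
        .appName("csv-to-bigquery") // illustrative app name
        // lower the assumed cost of opening a file so Spark is more willing to split the input
        .config("spark.sql.files.openCostInBytes", "1048576")
        .getOrCreate();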
Ideally, given 10 rows and 2 executors with 5 cores each, I would expect each executor to process 5 rows, with each core handling 1 row. Is my understanding of Spark incorrect?
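In case it helps, this is how I have been sanity-checking that expectation (getNumPartitions and repartition are only here for illustration; repartition(10) is not in the actual job, it just matches the 10-row example):

int numPartitions = df.rdd().getNumPartitions(); // how many partitions the CSV actually produced
System.out.println("partitions = " + numPartitions);

// what I would expect to spread 10 rows across the 10 cores, one row per core
Dataset<Row> spread = df.repartition(10);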
PS. I know I shouldn't call APIs from Spark, and I know Scala is a better fit for Spark than Java. Please just humor me.