
I am reading a pipe-delimited text file with the code below:

# assuming `spark` is the SparkSession (a bare SparkContext has no .sparkContext attribute)
rdd = (spark.sparkContext
       .textFile("./test123.txt")      # textFile already yields one record per line
       .map(lambda x: x.split("|"))    # split each line on the pipe delimiter
      )

On running the above code, Spark creates just one partition (on my local machine), and I want to understand why. The call below displays 1:

rdd.getNumPartitions()

I want to parallelize this operation so it can run on the cluster. For the work to be distributed, the RDD should have more than one partition (that's my understanding), so that tasks can be sent to other nodes. Any light on this?
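For reference, repartition() can reshuffle an already-loaded RDD into more partitions; a minimal sketch, assuming a local SparkSession and the same placeholder file:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()

rdd = (spark.sparkContext
       .textFile("./test123.txt")
       .map(lambda x: x.split("|")))
print(rdd.getNumPartitions())   # 1 for a small local file

# repartition() performs a full shuffle and spreads the data
# evenly across the requested number of partitions
rdd4 = rdd.repartition(4)
print(rdd4.getNumPartitions())  # 4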


1 Answer


"The textFile method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks."

Please try, for example:

.textFile("./test123.txt", 2)
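Putting it together, a minimal sketch (assuming a local SparkSession; the path and the value 8 are placeholders, and the second argument is a minimum, so the actual partition count can be higher):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("min-partitions-demo").getOrCreate()

# ask for at least 8 partitions so tasks can be scheduled
# across several executors instead of a single one
rdd = (spark.sparkContext
       .textFile("./test123.txt", 8)
       .map(lambda x: x.split("|")))

print(rdd.getNumPartitions())  # >= 8 for this file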
