
I have a requirement to save each partition to a separate text file, with a different name per partition. But when I run the code snippet below, only one file is saved, because each partition overwrites the previous one.

import pandas as pd

def chunks(iterator):
    # Intended to give each partition its own file via a counter,
    # but the counter is not shared across executors.
    chunks.counter += 1
    df = pd.DataFrame(list(iterator), index=None)
    df.to_csv(parent_path + "C" + str(chunks.counter + 1) + ".txt",
              header=None, index=None, sep=' ')

chunks.counter = 0
sc.parallelize([1, 2, 3, 4, 5, 6], num_partions).foreachPartition(chunks)

Is there any way to know which partition is currently being processed in PySpark?

shai

1 Answer

def chunks(lst, n):
    # Yield successive n-sized chunks from lst, together with each chunk's start index.
    for i in range(0, len(lst), n):
        yield i, lst[i:i + n]

for index, values in chunks(range(0, 100000), 1000):  # use ints; range() rejects floats
    with open(f"{parent_path}_C_{index}.txt", "w") as output:
        output.write(" ".join(map(str, values)))  # write the values themselves, not the range object's repr

And you can easily wrap this in joblib for parallelism ;) In my opinion, you don't really need PySpark for this.
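
As a rough sketch of that joblib suggestion (assuming parent_path is defined and joblib is installed; write_chunk is just an illustrative helper name), each chunk can be written in parallel with Parallel and delayed:

from joblib import Parallel, delayed

def write_chunk(index, values):
    # Write one chunk to its own file, named by the chunk's start index.
    with open(f"{parent_path}_C_{index}.txt", "w") as output:
        output.write(" ".join(map(str, values)))

# Write all chunks concurrently across 4 worker processes.
Parallel(n_jobs=4)(
    delayed(write_chunk)(index, values)
    for index, values in chunks(range(0, 100000), 1000)
)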

Aditya
  • Thanks, Aditya, for your code. I did the chunking in native Python first and am now trying to do the same with PySpark. In the line sc.parallelize(patient_ids,num_partions).foreachPartition(chunk_patients), foreachPartition() processes one partition per run of the function above, and I am unable to save the different chunks under different names. I also tried the counter approach another way, but the iterator resets the counter value every time. – shai Jun 21 '20 at 05:40
  • I am not a native PySpark user, but this looks promising (a sketch of that approach follows below): https://stackoverflow.com/questions/31631791/how-to-get-id-of-a-map-task-in-spark – Aditya Jun 21 '20 at 05:46
  • Thank you, Aditya. With minimal changes, it worked. :} – shai Jun 21 '20 at 06:09
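
For reference, a minimal sketch of the approach from the linked question, assuming sc, parent_path and num_partions are defined as in the question. mapPartitionsWithIndex passes the partition index to the function, so it can be used in the file name instead of the shared counter:

import pandas as pd

def chunks(index, iterator):
    # 'index' is the partition id supplied by Spark, so each partition
    # writes to its own file instead of overwriting a shared name.
    df = pd.DataFrame(list(iterator), index=None)
    df.to_csv(parent_path + "C" + str(index + 1) + ".txt",
              header=None, index=None, sep=' ')
    return []  # mapPartitionsWithIndex expects an iterable back

sc.parallelize([1, 2, 3, 4, 5, 6], num_partions) \
  .mapPartitionsWithIndex(chunks) \
  .collect()  # an action is needed to trigger the per-partition writes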