
Why, when I run the for-loop in the code below, does dask appear to process 'Four' first, then 'One', and so on, instead of starting with the first element and finishing with the last? Could I get mixed-up (wrong) results, where for example the content of one file/folder ends up in another, or where conditions inside the for-loop are ignored?

Thanks in advance!

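import os
import pandas as pd
from dask import compute, delayed

def compa(filename):
    # Load one input file; each call to compa runs as a separate dask task.
    filex = pd.read_json('folder/{}'.format(filename))
    # Plain sequential Python loop: dask does not reorder or parallelize it.
    for jj in ['Zero', 'One', 'Two', 'Three', 'Four']:
        filexz = filex[filex[jj] == 1].reset_index(drop=True)

        newpath = 'Newfolder/{}'.format(jj)
        # exist_ok avoids a race when several processes create the folder at once.
        os.makedirs(newpath, exist_ok=True)
        filexz.to_json('{}/{}'.format(newpath, filename))

delayed_results = [delayed(compa)(filename) for filename in filelist]
compute(*delayed_results, scheduler='processes')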

Code for replication purposes:

import pandas as pd
sof1=pd.DataFrame({'minus': ['a', 'b', 'c', 'd', 'e'],'Zero': [1, 0, 0, 0, 0],'One': [0, 0, 1, 0, 0],'Two': [0, 1, 0, 0, 0],'Three': [0, 0, 0, 0, 1],'Four': [0, 0, 0, 1, 0]})
sof2=pd.DataFrame({'minus': ['aa', 'bb', 'cc', 'dd', 'ee'],'Zero': [1, 0, 0, 0, 0],'One': [0, 0, 1, 0, 0],'Two': [0, 1, 0, 0, 0],'Three': [0, 0, 0, 0, 1],'Four': [0, 0, 0, 1, 0]})
sof3=pd.DataFrame({'minus': ['az', 'bz', 'cz', 'dz', 'ez'],'Zero': [1, 0, 0, 0, 0],'One': [0, 0, 1, 0, 0],'Two': [0, 1, 0, 0, 0],'Three': [0, 0, 0, 0, 1],'Four': [0, 0, 0, 1, 0]})
sof4=pd.DataFrame({'minus': ['azy', 'bzy', 'czy', 'dzy', 'ezy'],'Zero': [1, 0, 0, 0, 0],'One': [0, 0, 1, 0, 0],'Two': [0, 1, 0, 0, 0],'Three': [0, 0, 0, 0, 1],'Four': [0, 0, 0, 1, 0]})
sof5=pd.DataFrame({'minus': ['azx', 'bzx', 'czx', 'dzx', 'ezx'],'Zero': [1, 0, 0, 0, 0],'One': [0, 1, 0, 0, 0],'Two': [0, 0, 1, 0, 0],'Three': [0, 0, 0, 0, 1],'Four': [0, 0, 0, 1, 0]})
sof6=pd.DataFrame({'minus': ['azw', 'bzw', 'czw', 'dzw', 'ezw'],'Zero': [1, 0, 0, 0, 0],'One': [0, 0, 1, 0, 0],'Two': [0, 1, 0, 0, 0],'Three': [0, 0, 0, 0, 1],'Four': [0, 0, 0, 1, 0]})
sof7=pd.DataFrame({'minus': ['azyq', 'bzyq', 'czyq', 'dzyq', 'ezyq'],'Zero': [1, 0, 0, 0, 0],'One': [0, 0, 1, 0, 0],'Two': [0, 1, 0, 0, 0],'Three': [0, 0, 0, 0, 1],'Four': [0, 0, 0, 1, 0]})
sof8=pd.DataFrame({'minus': ['azxq', 'bzxq', 'czxq', 'dzxq', 'ezxq'],'Zero': [1, 0, 0, 0, 0],'One': [0, 0, 1, 0, 0],'Two': [0, 1, 0, 0, 0],'Three': [0, 0, 0, 0, 1],'Four': [0, 0, 0, 1, 0]})
sof9=pd.DataFrame({'minus': ['azwq', 'bzwq', 'czwq', 'dzwq', 'ezwq'],'Zero': [1, 0, 0, 0, 0],'One': [0, 0, 1, 0, 0],'Two': [0, 1, 0, 0, 0],'Three': [0, 0, 0, 0, 1],'Four': [0, 0, 0, 1, 0]})

filelist=[sof1,
sof2,
sof3,
sof4,
sof5,
sof6,
sof7,
sof8,
sof9]

import os
from dask import compute, delayed

def compa(filename):
    # Here each "filename" is already a DataFrame from filelist.
    filex = filename
    # Plain sequential Python loop: dask does not touch the order of jj.
    for jj in ['Zero', 'One', 'Two', 'Three', 'Four']:
        filexz = filex[filex[jj] == 1].reset_index(drop=True)
        newpath = 'Newfolderstackoverflow/{}'.format(jj)
        # exist_ok avoids a race when several processes create the folder at once.
        os.makedirs(newpath, exist_ok=True)
        # Name the output after the second entry of the 'minus' column.
        filexz.to_json('{}/{}'.format(newpath, filename.loc[1, 'minus']))

delayed_results = [delayed(compa)(filename) for filename in filelist]
compute(*delayed_results, scheduler='processes')

Because the code above finishes almost immediately, I don't know how to record the creation order, but the "Four" and "One" folders appear to be created first, then the rest! (The order in which files are created within each folder does not follow the order in filelist either, which I understand, since those files are supposed to be computed in parallel.)

Thanks to the comments and answers, especially those of @MichaelDelgado, here is how it got solved: I added a 60-second sleep after each write and noticed that every 60 seconds two files are created at a time, with the folders filled in order from Zero up to Four. The reason for my initial problem was that the last couple of files were added to all 5 folders within the same minute, so sorting the folders by time was meaningless and my OS fell back to sorting them alphabetically (hence "Four" then "One").
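To make the resolution above concrete, here is a minimal sketch of one way to record the creation order inside the program instead of reading it off a file manager. It is not the code from the question: `create_and_log` and `LABELS` are hypothetical names, the folders are created in a temporary directory, and no dask is involved, since the loop order does not depend on it.

```python
import os
import tempfile
import time

LABELS = ['Zero', 'One', 'Two', 'Three', 'Four']

def create_and_log(base_dir):
    """Create one folder per label and log the order in which it happened."""
    log = []
    for label in LABELS:
        os.makedirs(os.path.join(base_dir, label), exist_ok=True)
        # Monotonic nanosecond clock: much finer than the minute-level
        # timestamps that made the folders look unordered.
        log.append((label, time.monotonic_ns()))
    return log

with tempfile.TemporaryDirectory() as base_dir:
    log = create_and_log(base_dir)
    created_order = [label for label, _ in log]
    # What a name-sorted listing shows when timestamps cannot break the tie.
    alphabetical = sorted(os.listdir(base_dir))

print(created_order)  # ['Zero', 'One', 'Two', 'Three', 'Four']
print(alphabetical)   # ['Four', 'One', 'Three', 'Two', 'Zero']
```

The logged order follows the loop, while the name-sorted listing starts with "Four" and "One", which matches what was observed.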

sepehr
  • It looks like you will repeatedly overwrite the target file, since `'folder2/{}'.format(filename)` is independent of `jj`. Is this what you want? – mdurant May 03 '22 at 15:06
  • Note also that the for loop is not parallelized (or touched at all) by dask - while each function call may be out of order and executed by different threads, as the code is currently written everything inside the function will be normal single-threaded python. So dask’s task ordering doesn’t really come into play in the issue. It’s just the bug mdurant points out. – Michael Delgado May 03 '22 at 15:36
  • @mdurant Thanks for pointing out my mistake. I had mistakenly removed the related rows. Now it should not overwrite any file (and that's what I expect it to do). – sepehr May 03 '22 at 15:51
  • @MichaelDelgado I fixed the bug which mdurant had mentioned. My problem however still remains. the jj for loop as you mentioned should be done as a regular for-loop however when I run the code, first the folder "Four" is created and then "Two"! So I suppose the files within each folder (filename in filelist) are parallelized- which is my goal- but cannot understand how the order of the folders is messed up – sepehr May 03 '22 at 15:54
  • me neither. as it stands, this isn't fully reproducible. can you create a fully [mre]? See also [this guide](http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports). – Michael Delgado May 03 '22 at 16:41
  • @MichaelDelgado Added it to the main question:) – sepehr May 03 '22 at 17:27
  • sorry - there is no way this is true. dask doesn't modify your function at all. it's possible your filesystem is caching the result of your `ls` operation or something else, or maybe you need to restart your dask scheduler/session. but there's no way Four gets written first using the above function. you could put a `time.sleep(60)` after each write and then see if you can tell which gets written first. also make sure the directory is clear before you start. – Michael Delgado May 03 '22 at 17:38
  • ok sweet! yeah I don't know what was going on there... real time file system responses can be tricky, depending on your system (like if you're working with a networked filesystem??). but who knows. maybe it's chaos monkeys. good news is you understand dask! :D – Michael Delgado May 03 '22 at 18:49
  • @MichaelDelgado I first tried a 2-second sleep and now I am waiting for 60 seconds, but it seems I can already tell what's going on (and it was much simpler than everything we discussed here): every 60 seconds it creates two files at a time and adds them, starting from folder Zero up to Four. The reason for my initial problem was that the last couple of files were added to all 5 folders within the same minute, so sorting the folders by time was meaningless and my OS sorted them alphabetically (hence "four" then "one"). Thanks for your help and for following up :) – sepehr May 03 '22 at 19:06

1 Answer

The order in which tasks are executed is determined by several factors:

  • user-specified priorities;
  • FIFO order;
  • graph structure.

With regard to the possibility of a mix-up: as long as the internal code is correct (so no multiple processes writing to the same file at the same time), this should not be possible. As noted in the comment by @mdurant, it looks like your loop writes to the same file multiple times.

SultanOrazbayev
  • Thanks for your reply :) I had mistakenly removed some related lines of code, and after @mdurant mentioned it, I put them back in the code above. My problem however still remains. When I run the code, first the folder "Four" is created and then "Two"! So I suppose the files within each folder (filename in filelist) are parallelized, which is my goal, but I cannot understand how the order of the folders is messed up – sepehr May 03 '22 at 16:12
  • It's not messed up, because if the tasks are submitted at the same time and are otherwise similar, then their execution order is going to be random. – SultanOrazbayev May 03 '22 at 16:15
  • That's good news that it is not messed up:) I expected it to go through the list (jj for-loop) as a usual for-loop because I cannot see the disadvantage of going in order which necessitates the randomness. – sepehr May 03 '22 at 16:21
  • I'm not that familiar with the internals, but there is probably some hashing involved, so randomness comes out as a by-product. – SultanOrazbayev May 03 '22 at 16:25
  • @SultanOrazbayev this would affect the order in which dask moves through `filelist`, but not through the list `['Zero', 'One', 'Two', 'Three','Four']`, which is not touched by dask, right? – Michael Delgado May 03 '22 at 17:40
  • That's correct, @MichaelDelgado, I was thinking of four from one of the tasks appearing before two from another of the tasks... – SultanOrazbayev May 04 '22 at 02:45
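The distinction drawn in these comments (order across tasks may vary, order inside one task does not) can be sketched as follows. This is only an illustration under stated assumptions: it uses the standard library's `ThreadPoolExecutor` as a stand-in for dask's scheduler, and `visit_labels` is a hypothetical replacement for `compa` that records the labels instead of writing files.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

LABELS = ['Zero', 'One', 'Two', 'Three', 'Four']

def visit_labels(task_id):
    """Stand-in for compa: return the labels in the order the loop saw them."""
    seen = []
    for label in LABELS:  # ordinary sequential loop inside one task
        seen.append(label)
    return task_id, seen

with ThreadPoolExecutor(max_workers=4) as pool:
    # Nine tasks, mirroring the nine DataFrames in filelist.
    futures = [pool.submit(visit_labels, i) for i in range(9)]
    # Completion order across tasks may differ from run to run.
    results = [f.result() for f in as_completed(futures)]

finish_order = [task_id for task_id, _ in results]
per_task_ok = all(seen == LABELS for _, seen in results)

print(finish_order)  # some ordering of 0..8, not guaranteed to be sorted
print(per_task_ok)   # True
```

Whatever order the tasks finish in, every task reports the labels in list order, because the scheduler only decides when each function call runs, not what happens inside it.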