
I am trying to vertically concatenate two Dask DataFrames

I have the following Dask DataFrame:

import pandas as pd
import dask.dataframe as dd

d = [
    ['A','B','C','D','E','F'],
    [1, 4, 8, 1, 3, 5],
    [6, 6, 2, 2, 0, 0],
    [9, 4, 5, 0, 6, 35],
    [0, 1, 7, 10, 9, 4],
    [0, 7, 2, 6, 1, 2]
    ]
df = pd.DataFrame(d[1:], columns=d[0])
ddf = dd.from_pandas(df, npartitions=5)

Here is the data as a Pandas DataFrame

          A         B      C      D      E      F
0         1         4      8      1      3      5
1         6         6      2      2      0      0
2         9         4      5      0      6     35
3         0         1      7     10      9      4
4         0         7      2      6      1      2

Here is the Dask DataFrame

Dask DataFrame Structure:
                   A      B      C      D      E      F
npartitions=4                                          
0              int64  int64  int64  int64  int64  int64
1                ...    ...    ...    ...    ...    ...
2                ...    ...    ...    ...    ...    ...
3                ...    ...    ...    ...    ...    ...
4                ...    ...    ...    ...    ...    ...
Dask Name: from_pandas, 4 tasks

I am trying to concatenate 2 Dask DataFrames vertically:

ddf_i = ddf + 11.5
dd.concat([ddf,ddf_i],axis=0)

but I get this error:

Traceback (most recent call last):
      ...
      File "...", line 572, in concat
        raise ValueError('All inputs have known divisions which cannot '
    ValueError: All inputs have known divisions which cannot be concatenated
    in order. Specify interleave_partitions=True to ignore order

However, if I try:

dd.concat([ddf,ddf_i],axis=0,interleave_partitions=True)

then it appears to be working. Is there a problem with setting this to True (in terms of performance - speed)? Or is there another way to vertically concatenate 2 Dask DataFrames?

edesz
  • Note: in the current version of Dask (`1.2.1`), I can successfully run the original line `dd.concat([ddf,ddf_i],axis=0)` without any errors. However, see [below](https://stackoverflow.com/a/43813709/4057186) for a detailed response to the question. Also, from the Dask docs: (1) by default, `dd.concat()` uses [`interleave_partitions=False`](http://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.multi.concat) - this means that if divisions are known and **ordered**, `dd.concat` will respect the divisions, (2) with `interleave_partitions=True`, the order of the divisions is ignored during concatenation. – edesz May 28 '19 at 15:35

2 Answers


If you inspect the divisions of the dataframe with ddf.divisions, you will find (assuming one partition) that it holds the edges of the index: (0, 4). This is useful to dask: when you perform an operation on the data, it knows it can skip any partition that does not contain the required index values. It is also why some dask operations are much faster when the index is appropriate for the job.

When you concatenate, the second dataframe has the same index as the first, so their divisions overlap. Concatenation would work without interleaving if the index values had non-overlapping ranges in the two dataframes.

mdurant
  • could you please provide an example of how to concatenate a dask.Series of pd.DataFrames into one dask DataFrame without passing everything through memory and concatenating with pandas? – thebeancounter Nov 14 '17 at 13:43
  • @mdurant - I'm having a similar issue. I'm not sure what the correct way to manage this situation is. Maybe to use `concat` and then run `reset_index()` followed by `drop_duplicates()`. Somehow it seems to me to be an awkward solution. I just need to be able to concat the rows that are not within the index. I tried [this solution](https://stackoverflow.com/questions/27097492/identify-records-not-in-another-dataframe) but it is not valid for `dask.dataframe`. I also tried using the `append` method without success – skibee Dec 31 '17 at 11:45
  • I don't see any solution to the problem, just an explanation of why it's not working. The masses, myself included, would love to see what the solution is. – SummerEla Jan 09 '19 at 23:32
  • The solution was already in the question: instruct dask not to attempt to use the divisions information – mdurant Jan 10 '19 at 00:52
  • As in `interleave=False` ? – SummerEla Jan 14 '19 at 04:33
  • Basically, I'm trying my best to learn the Dask API because it's awesome and powerful. But the tutorials and documentation are so wretched that I'm actually having a harder time than I did learning python itself and it's a shame. I'm fluent in several languages; it shouldn't be this difficult. – SummerEla Jan 14 '19 at 04:37
  • Sorry you feel like that, @SummerEla - if you can propose changes to docs or tutorials, glad to consider. One problem is, that Dask is still being rapidly developed, so things go out of date quickly. – mdurant Jan 14 '19 at 14:02
  • @mdurant If I could ever figure out the proper workflow, I'd be happy to help. I've pieced together a workflow from many sources I've found, but I'm pretty sure I'm not doing things properly. If you all could post a workflow for dealing with large datasets on a local cluster, I'd be more than happy to help polish it up. Feel free to contact me here. – SummerEla Jan 14 '19 at 20:13
  • There certainly are some [examples](https://examples.dask.org/dataframes/02-groupby.html) around, but an infinite number of possible workflows... – mdurant Jan 14 '19 at 20:55

mdurant's answer is correct, and this answer elaborates with MCVE code snippets using Dask v2021.08.1. The examples make it easier to understand divisions and interleaving.

Vertically concatenating DataFrames

Create two DataFrames, concatenate them, and view the results.

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame(
    {"nums": [1, 2, 3, 4, 5, 6], "letters": ["a", "b", "c", "d", "e", "f"]}
)
ddf1 = dd.from_pandas(df, npartitions=2)

df = pd.DataFrame({"nums": [88, 99], "letters": ["xx", "yy"]})
ddf2 = dd.from_pandas(df, npartitions=1)

ddf3 = dd.concat([ddf1, ddf2])

print(ddf3.compute())

   nums letters
0     1       a
1     2       b
2     3       c
3     4       d
4     5       e
5     6       f
0    88      xx
1    99      yy

Divisions metadata when vertically concatenating

Create two DataFrames, concatenate them, and illustrate that sometimes this operation will cause divisions metadata to be lost.

def print_partitions(ddf):
    for i in range(ddf.npartitions):
        print(ddf.partitions[i].compute())

df = pd.DataFrame(
    {"nums": [1, 2, 3, 4, 5, 6], "letters": ["a", "b", "c", "d", "e", "f"]}
)
ddf1 = dd.from_pandas(df, npartitions=2)

ddf1.divisions # (0, 3, 5)

df = pd.DataFrame({"nums": [88, 99], "letters": ["xx", "yy"]})
ddf2 = dd.from_pandas(df, npartitions=1)

ddf2.divisions # (0, 1)

ddf3 = dd.concat([ddf1, ddf2])

ddf3.divisions # (None, None, None, None)

Set interleave_partitions=True to avoid losing the divisions metadata.

ddf3_interleave = dd.concat([ddf1, ddf2], interleave_partitions=True)

ddf3_interleave.divisions # (0, 1, 3, 5)

When interleaving isn't necessary

Create two DataFrames without overlapping divisions, concatenate them, and confirm that the divisions metadata is not lost:

df = pd.DataFrame(
    {"nums": [1, 2, 3, 4], "letters": ["a", "b", "c", "d"], "some_index": [4, 5, 6, 7]}
)
ddf1 = dd.from_pandas(df, npartitions=2)

ddf1 = ddf1.set_index("some_index")

df = pd.DataFrame({"nums": [88, 99], "letters": ["xx", "yy"], "some_index": [10, 20]})
ddf2 = dd.from_pandas(df, npartitions=1)

ddf2 = ddf2.set_index("some_index")

ddf3 = dd.concat([ddf1, ddf2])

ddf3.divisions # (4, 6, 10, 20)

I wrote a blog post to explain this in more detail. Let me know if you'd like the link.

Powers