
Action: Reading two CSV files (data.csv and label.csv) into a single dataframe.

# data_files and label_files are lists of whitespace-delimited text files
df = dd.read_csv(data_files, delimiter=' ', header=None, names=['x', 'y', 'z', 'intensity', 'r', 'g', 'b'])
df_label = dd.read_csv(label_files, delimiter=' ', header=None, names=['label'])

Problem: Concatenating columns requires known divisions. However, setting an index sorts the data, which I explicitly do not want, because the order of the two files is what matches them.

df = dd.concat([df, df_label], axis=1)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-e6c2e1bdde55> in <module>()
----> 1 df = dd.concat([df, df_label], axis=1)

/uhome/hemmest/.local/lib/python3.5/site-packages/dask/dataframe/multi.py in concat(dfs, axis, join, interleave_partitions)
    573             return concat_unindexed_dataframes(dfs)
    574         else:
--> 575             raise ValueError('Unable to concatenate DataFrame with unknown '
    576                              'division specifying axis=1')
    577     else:

ValueError: Unable to concatenate DataFrame with unknown division specifying axis=1

Tried: Adding an 'id' column:

df['id'] = pd.Series(range(len(df)))

However, the length of the dataframe makes this Series larger than memory.
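
(For what it's worth, a sketch of a lazy way to build such a row-number column without materializing the full range in memory is a cumulative sum. Note the divisions are still unknown afterwards, so this alone does not make the concat work:)

df['id'] = 1
df['id'] = df['id'].cumsum() - 1  # 0-based row number, computed out of core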

Question: Apparently Dask knows both dataframes have the same length:

In [15]:
df.index.compute()
Out[15]:
Int64Index([      0,       1,       2,       3,       4,       5,       6,
                  7,       8,       9,
            ...
            1120910, 1120911, 1120912, 1120913, 1120914, 1120915, 1120916,
            1120917, 1120918, 1120919],
           dtype='int64', length=280994776)
In [16]:
df_label.index.compute()
Out[16]:
Int64Index([1, 5, 5, 2, 2, 2, 2, 2, 2, 2,
            ...
            3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
           dtype='int64', length=280994776)

How can I exploit this knowledge to simply concatenate?

Tom Hemmes
  • added the concatenation statement for complete overview – Tom Hemmes Oct 24 '17 at 13:36
  • try adding `interleave_partitions=True` to your `dd.concat()` – Primer Oct 24 '17 at 13:41
  • Adding `interleave_partitions=True` works for `axis=0`, which in this case would result in a Dataframe of double the length as it concatenates vertically. However, for `axis=1` it does not solve the problem. – Tom Hemmes Oct 24 '17 at 14:00
  • what does `dask.__version__` show? – Primer Oct 24 '17 at 14:07
  • Currently running `0.15.4` – Tom Hemmes Oct 24 '17 at 14:08
  • Looks like a bug to me. But if the indexes of the data are not meaningful you could just use `.reset_index(drop=True)` on both dask.DataFrames and then call `.assign` on your main `df` to put the labels into a new column in `df`. This shouldn't sort the data and therefore should be fast. – Primer Oct 24 '17 at 14:24
  • I don't understand; I thought the reason `.assign` didn't work was not having an index in the first place. By dropping the index, that problem remains... – Tom Hemmes Oct 24 '17 at 14:41
  • `.assign` doesn't work because the indexes of the two dataframes are not aligned. In dask, besides being aligned, the indexes also have to have aligned divisions. Divisions are set based on `npartitions`. If you compare `.npartitions` for both dataframes you will probably see different output. In that case you might re-partition them first with `df.repartition(npartitions=1)` and then try `reset_index` and `.assign`. – Primer Oct 24 '17 at 15:01
  • @Primer Thank you very much, I'll give it a try! – Tom Hemmes Oct 24 '17 at 15:42
  • Any update on this thread? I'm concatenating but getting a warning `UserWarning: Concatenating dataframes with unknown divisions. We're assuming that the indexes of each dataframes are aligned. This assumption is not generally safe. warn("Concatenating dataframes with unknown divisions.\n"`. – Asif Ali Aug 14 '18 at 08:41
  • @AsifAli did you try the method by @Primer? It works for me – Tom Hemmes Aug 14 '18 at 13:48

4 Answers


The solution (from the comments by @Primer):

  • repartition both dataframes and reset their indexes
  • use `assign` instead of `concat`

The final code:

import dask.dataframe as dd

# Read the point data and the labels; both files share the same row order.
df = dd.read_csv(['data/untermaederbrunnen_station1_xyz_intensity_rgb.txt'], delimiter=' ', header=None, names=['x', 'y', 'z', 'intensity', 'r', 'g', 'b'])
df_label = dd.read_csv(['data/untermaederbrunnen_station1_xyz_intensity_rgb.labels'], header=None, names=['label'])
# len(df), len(df_label), df_label.label.isnull().sum().compute()

# Give both dataframes the same partition structure and reset the
# per-partition indexes so they line up pairwise.
df = df.repartition(npartitions=200)
df = df.reset_index(drop=True)
df_label = df_label.repartition(npartitions=200)
df_label = df_label.reset_index(drop=True)

# With matching partitions and indexes, assign adds the column without sorting.
df = df.assign(label=df_label.label)
df.head()
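
In short, as I understand it: `repartition` gives both dataframes the same number of partitions, and `reset_index(drop=True)` restarts each partition's index at zero, so the two frames end up with matching partition-wise indexes; `assign` can then pair the partitions one-to-one without any sorting or shuffling. (This relies on the matching partitions also having the same lengths, which appears to hold here since both files have the same row count and order.)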
Tom Hemmes
  • Following up on the comment by @AsifAli above: what if the concatenated dataframe has a lot of columns, do I really need to explicitly specify each column by its name in `assign`? Currently `dask.concat` gives a warning (not an error) when concatenating two dataframes with unknown divisions. If we know for sure both dfs are the same length, is this warning safe to ignore? – stav Apr 23 '19 at 19:38

I had the same problem and solved it by making sure that both dataframes have the same number of partitions (since we already know that both have the same length):

# Give both dataframes the same number of partitions before concatenating
# side by side; dask then pairs the partitions one-to-one.
df = df.repartition(npartitions=200)
df_label = df_label.repartition(npartitions=200)
df = dd.concat([df, df_label], axis=1)
architectonic
  • Thanks for this suggestion, however Dask simply returns `ValueError: Concatenated DataFrames of different lengths` – Tom Hemmes Aug 14 '18 at 13:28

I had a similar problem, and the solution was simply to compute the chunk sizes of each dask array that I was going to put into the dataframe using `.compute_chunk_sizes()`. After that, there were no issues concatenating them into a dataframe on `axis=1`.
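
A minimal sketch of that pattern (the array contents and names are made up for illustration): operations like boolean masking leave a dask array with unknown chunk sizes, and `.compute_chunk_sizes()` fills them in so the arrays can be turned into columns and concatenated:

import dask.array as da
import dask.dataframe as dd
import numpy as np

x = da.from_array(np.random.random(10000), chunks=1000)
y = x[x > 0.5]   # boolean masking -> unknown chunk sizes
z = y ** 2       # same (unknown) chunks as y

# Fill in the unknown chunk sizes before building dataframe columns.
y = y.compute_chunk_sizes()
z = z.compute_chunk_sizes()

df = dd.concat([dd.from_dask_array(y, columns='y'),
                dd.from_dask_array(z, columns='z')], axis=1)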

foxof
  • Welcome to Stack Overflow. When answering a question, make an effort to explain how your solution solves the issue, e.g. how does simply calculating the chunk sizes help concatenation? Explain that in your answer. – Serge de Gosson de Varennes Nov 26 '20 at 20:38

I had 5 dataframes and had applied `compute` on one of them. After removing the `compute` call, the error was gone.
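
Presumably the issue was mixing collection types: `.compute()` materializes a dask dataframe into a plain pandas one, so one of the five frames no longer matched the others. A tiny illustration of the type change:

import dask.dataframe as dd
import pandas as pd

ddf = dd.from_pandas(pd.DataFrame({'a': range(10)}), npartitions=2)
print(type(ddf))            # dask DataFrame (lazy collection)
print(type(ddf.compute()))  # pandas DataFrame (materialized result)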

Talha Anwar