I am trying to transpose a very large dataframe. I used Dask due to the size of the file and searched for how to transpose a Dask dataframe.

    import pandas as pd
    import numpy as np
    import dask.dataframe as dd
    genematrix = r"C:\Users\fnafee\Desktop\tobeMerged\GENEMATRIX.csv"
    genematrix_df = dd.read_csv(genematrix)
    new_df = np.transpose(genematrix_df)
    new_df.head()

It returns the following traceback:

---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
Input In [39], in <cell line: 6>()
        4 genematrix = r"C:\Users\fnafee\Desktop\tobeMerged\TSVSMERGED.csv"
        5 genematrix_df = dd.read_csv(genematrix)
  ----> 6 new_df = np.transpose(genematrix_df)
        7 new_df.head()

File <__array_function__ internals>:5, in transpose(*args, **kwargs)

File ~\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py:660, in transpose(a, axes)
      601 @array_function_dispatch(_transpose_dispatcher)
      602 def transpose(a, axes=None):
      603     """
      604     Reverse or permute the axes of an array; returns the modified array.
      605 
     (...)
      658 
      659     """
  --> 660     return _wrapfunc(a, 'transpose', axes)

File ~\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py:54, in _wrapfunc(obj, method, *args, **kwds)
       52 bound = getattr(obj, method, None)
       53 if bound is None:
  ---> 54     return _wrapit(obj, method, *args, **kwds)
       56 try:
       57     return bound(*args, **kwds)

File ~\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py:47, in _wrapit(obj, method, *args, **kwds)
       45     if not isinstance(result, mu.ndarray):
       46         result = asarray(result)
  ---> 47     result = wrap(result)
       48 return result

File ~\Anaconda3\lib\site-packages\dask\dataframe\core.py:4213, in DataFrame.__array_wrap__(self, array, context)
     4210     else:
     4211         index = context[1][0].index
  -> 4213 return pd.DataFrame(array, index=index, columns=self.columns)

UnboundLocalError: local variable 'index' referenced before assignment

The problem seems to come from some internal function that I have no control over. Do I need to change the way my file is formatted, or should I try doing this in small chunks instead of one massive dataframe?

Michael Delgado

2 Answers

It looks like you've uncovered an unrelated bug in dask. This is a known issue (GH#6954), which so far seems to crop up only in situations like this, where you're using dask in a way that doesn't work anyway :)

This bug is just masking the true issue, which is that you cannot transpose a dask.dataframe. A key feature of dask.dataframe is that the number of rows (and the length of each partition) may be unknown, though the columns must be known. Transposing the dataframe would swap those roles and would therefore require computing the entire frame. If this really is a matrix, you should perhaps be using dask.array instead, or xarray with a dask backend if you need the dimensions to be labeled.

For example, given a dask.dataframe:

import dask.dataframe as dd, pandas as pd, numpy as np
df = dd.from_pandas(pd.DataFrame({'A': np.arange(100, 200), 'B': np.random.random(size=100)}), npartitions=4)

This can be converted to a dask.Array using dask.dataframe.to_dask_array, specifying lengths=True to define the chunk sizes:

In [13]: arr = df.to_dask_array(lengths=True)

In [14]: arr
Out[14]: dask.array<values, shape=(100, 2), dtype=float64, chunksize=(25, 2), chunktype=numpy.ndarray>

This array can now be transposed without computing the graph using dask.Array.transpose or the equivalent .T property:

In [15]: arr.T
Out[15]: dask.array<transpose, shape=(2, 100), dtype=float64, chunksize=(2, 25), chunktype=numpy.ndarray>

This could be wrapped in an xarray.DataArray if using coordinate labels is desired:

In [22]: import xarray as xr
    ...: da = xr.DataArray(
    ...:     df.to_dask_array(lengths=True),
    ...:     dims=['index', 'columns'],
    ...:     coords=[df.index.compute(), df.columns],
    ...: )

In [23]: da
Out[23]:
<xarray.DataArray 'values-8d50dbfa8ed951a8ffb2ae5d5cd554bb' (index: 100,
                                                             columns: 2)>
dask.array<values, shape=(100, 2), dtype=float64, chunksize=(25, 2), chunktype=numpy.ndarray>
Coordinates:
  * index    (index) int64 0 1 2 3 4 5 6 7 8 9 ... 90 91 92 93 94 95 96 97 98 99
  * columns  (columns) object 'A' 'B'

In [24]: da.T
Out[24]:
<xarray.DataArray 'values-8d50dbfa8ed951a8ffb2ae5d5cd554bb' (columns: 2,
                                                             index: 100)>
dask.array<transpose, shape=(2, 100), dtype=float64, chunksize=(2, 25), chunktype=numpy.ndarray>
Coordinates:
  * index    (index) int64 0 1 2 3 4 5 6 7 8 9 ... 90 91 92 93 94 95 96 97 98 99
  * columns  (columns) object 'A' 'B'
Michael Delgado
  • Why is it so difficult to implement transpose for a dataframe, if it already works for an array? Are the datastructures that different? – Soerendip Dec 07 '22 at 00:39
  • Yes - the way partitions are managed is very different. Arrays must have known chunk sizes, while dataframes don't: the length of each dataframe partition can be unknown, but the column names must be known. Additionally, dataframe partitions are always row-wise; they cannot be column-wise in dask, whereas chunks can lie along any dimension of a dask.array. Because of this, transposing a dataframe would produce an invalid data structure. – Michael Delgado Dec 07 '22 at 01:16
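
The difference described in this comment is easy to see with a small sketch (the array contents here are arbitrary): a dask.array can be chunked along any axis, so a transpose only has to swap the chunk structure.

```python
import dask.array as da
import numpy as np

# a 4x3 array chunked along BOTH axes - something a dask.dataframe,
# whose partitions are always row-wise, cannot express
arr = da.from_array(np.arange(12).reshape(4, 3), chunks=(2, 2))

# transposing is lazy: it just swaps the chunk structure
t = arr.T
# arr.chunks == ((2, 2), (2, 1)) while t.chunks == ((2, 1), (2, 2))
```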
This seems to be an indentation problem, since the error is saying that the variable index is not assigned before the line

    return pd.DataFrame(array, index=index, columns=self.columns)
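
For reference, this error has nothing to do with indentation in the asker's own code: it is the standard Python behavior when a local variable is assigned only on some branches of a function. A minimal, self-contained illustration of the same pattern (this is not dask's actual code):

```python
def wrap_array(context=None):
    # 'index' is assigned only when context is provided, mirroring
    # the branch structure in dask's __array_wrap__
    if context is not None:
        index = context[0]
    # on the context=None path, 'index' was never assigned
    return index

try:
    wrap_array()  # raises UnboundLocalError
except UnboundLocalError as err:
    print(type(err).__name__)
```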
Guinther Kovalski