I am trying to transpose a very large dataframe. Because of the file size I used Dask, and I looked up how to transpose a Dask dataframe.
import pandas as pd
import numpy as np
import dask.dataframe as dd
genematrix = r"C:\Users\fnafee\Desktop\tobeMerged\GENEMATRIX.csv"
genematrix_df = dd.read_csv(genematrix)
new_df = np.transpose(genematrix_df)  # this is the line that raises the error below
new_df.head()
It raises the following error:
---------------------------------------------------------------------------
UnboundLocalError Traceback (most recent call last)
Input In [39], in <cell line: 6>()
4 genematrix = r"C:\Users\fnafee\Desktop\tobeMerged\TSVSMERGED.csv"
5 genematrix_df = dd.read_csv(genematrix)
----> 6 new_df = np.transpose(genematrix_df)
7 new_df.head()
File <__array_function__ internals>:5, in transpose(*args, **kwargs)
File ~\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py:660, in transpose(a, axes)
601 @array_function_dispatch(_transpose_dispatcher)
602 def transpose(a, axes=None):
603 """
604 Reverse or permute the axes of an array; returns the modified array.
605
(...)
658
659 """
--> 660 return _wrapfunc(a, 'transpose', axes)
File ~\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py:54, in _wrapfunc(obj, method, *args, **kwds)
52 bound = getattr(obj, method, None)
53 if bound is None:
---> 54 return _wrapit(obj, method, *args, **kwds)
56 try:
57 return bound(*args, **kwds)
File ~\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py:47, in _wrapit(obj, method, *args, **kwds)
45 if not isinstance(result, mu.ndarray):
46 result = asarray(result)
---> 47 result = wrap(result)
48 return result
File ~\Anaconda3\lib\site-packages\dask\dataframe\core.py:4213, in DataFrame.__array_wrap__(self, array, context)
4210 else:
4211 index = context[1][0].index
-> 4213 return pd.DataFrame(array, index=index, columns=self.columns)
UnboundLocalError: local variable 'index' referenced before assignment
The problem seems to come from an internal function that I have no control over. Do I need to change the way my file is formatted, or should I try to do this in small chunks at a time instead of one massive dataframe?
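For reference, here is the chunk-by-chunk idea I have in mind, done in plain pandas rather than Dask (a minimal sketch; the 10_000-row chunk size is arbitrary, and the fully transposed result would still have to fit in memory):

import pandas as pd

genematrix = r"C:\Users\fnafee\Desktop\tobeMerged\GENEMATRIX.csv"

# Read the CSV a block of rows at a time, transpose each block, and
# stack the transposed blocks side by side; each block of rows becomes
# a block of columns in the result.
parts = [chunk.T for chunk in pd.read_csv(genematrix, chunksize=10_000)]
new_df = pd.concat(parts, axis=1)
new_df.head()

Is something along these lines the sensible approach, or is there a Dask-native way to make the transpose itself work?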