I'm working on my first correlation analysis. I received the data as an Excel file and imported it as a DataFrame (I had to pivot it), and now I have a set of almost 3,000 rows and 25,000 columns. I can't choose a subset of it, because every column is important for this project. I also don't know what information each column stores, so I can't pick the most interesting ones: everything is encoded as integers (it is a university project). It is like a big questionnaire, where every person has their own row and the answer to each question is stored in a separate column.
I really need to solve this issue because later I'll have to replace the many NaNs with the column medians and then start the correlation analysis. I tried that part first and it didn't work because of the size of the data, which is why I tried downcasting first.
The dataset takes up 600 MB. Downcasting the floats saved 300 MB, but when I try to replace the columns in a copy of my dataset with the new ones, it runs for 30 minutes and doesn't do anything. There is no warning and no error until I interrupt the kernel, and even then it gives me no hint as to why it doesn't work.
I can't drop the NaNs first, because there are so many that it would erase almost everything.
# I got this code from https://www.dataquest.io/blog/pandas-big-data/
import pandas as pd

def mem_usage(pandas_obj):
    if isinstance(pandas_obj, pd.DataFrame):
        usage_b = pandas_obj.memory_usage(deep=True).sum()
    else:  # we assume if not a df it's a series
        usage_b = pandas_obj.memory_usage(deep=True)
    usage_mb = usage_b / 1024 ** 2  # convert bytes to megabytes
    return "{:03.2f} MB".format(usage_mb)

# myset is my pivoted DataFrame (almost 3,000 rows x 25,000 columns)
gl_float = myset.select_dtypes(include=['float'])
converted_float = gl_float.apply(pd.to_numeric, downcast='float')
print(mem_usage(gl_float))  # almost 600 MB
print(mem_usage(converted_float))  # almost 300 MB
optimized_gl = myset.copy()
optimized_gl[converted_float.columns] = converted_float  # this doesn't end
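
For comparison, here is a minimal sketch of the same downcast on a small synthetic frame, done with a single `astype` call instead of assigning the converted columns back one block at a time. The frame `toy` and its size are made up for illustration, and I'm assuming that every selected column is `float64` and that `float32` precision is enough for the encoded answers. I have not confirmed how this behaves on the full 25,000 columns.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (sizes are made up for illustration)
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.random((100, 50)))
toy = toy.mask(toy < 0.3)  # sprinkle in NaNs, as in the questionnaire data

# Select the float64 columns and convert them all in one astype call,
# which returns a new frame rather than modifying toy in place
float_cols = toy.select_dtypes(include=["float64"]).columns
optimized_toy = toy.astype({col: "float32" for col in float_cols})

print(toy.memory_usage(deep=True).sum())            # bytes before
print(optimized_toy.memory_usage(deep=True).sum())  # bytes after
```

Because `astype` returns a new DataFrame, the separate `copy()` step would not be needed in this variant.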
After the replacement works, I want to use the Imputer function to replace the NaNs and print the correlation result for my dataset.
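
To make that last step concrete, this is roughly what I have in mind, sketched on made-up data. Note that it uses pandas' own `fillna` and `median` rather than the Imputer, which is a substitution on my part, and the frame `answers`, its size, and the share of missing values are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical questionnaire: 200 people, 8 questions, answers coded 1-5
rng = np.random.default_rng(1)
answers = pd.DataFrame(rng.integers(1, 6, size=(200, 8)).astype("float32"))
answers = answers.mask(rng.random(answers.shape) < 0.4)  # roughly 40% missing

# Replace each column's NaNs with that column's median
# (median() skips NaNs by default; fillna aligns the result on column labels)
filled = answers.fillna(answers.median())

# Pairwise Pearson correlation between all columns
corr = filled.corr()
print(corr.shape)
```

On the real data the correlation matrix would be 25,000 x 25,000, so its size is something I would also have to plan for.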