I'm working on my first correlation analysis. I received the data as an Excel file, imported it into a DataFrame (I had to pivot it), and now I have almost 3000 rows and 25000 columns. I can't choose a subset, because every column matters for this project, and I also can't pick the most interesting columns because I don't know what information each one stores: everything is encoded as integer numbers (it is a university project). The data is like a big questionnaire, where every person has his or her own row and the answer to each question is stored in a separate column.

I really need to solve this issue because later I'll have to replace the many NaNs with the column medians and then start the correlation analysis. I tried that part first and it failed because of the size, which is why I tried downcasting first.

The dataset is about 600 MB. I used the downcasting instructions for the floats and saved 300 MB, but when I try to write the converted columns back into a copy of my dataset, it runs for 30 minutes without doing anything: no warning, no error until I interrupt the kernel, and still no hint as to why it doesn't work.

I can't simply drop the NaNs first (dropna()), because there are so many of them that it would erase almost everything.
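To illustrate the point (a minimal sketch with a tiny made-up frame, not the real data): dropna() discards every row that contains at least one NaN, while filling with the column medians keeps all rows.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real 3000 x 25000 questionnaire frame.
df = pd.DataFrame({"q1": [1.0, np.nan, 3.0],
                   "q2": [np.nan, 2.0, np.nan]})

# Every row here contains at least one NaN, so dropna() erases everything.
print(len(df.dropna()))           # → 0

# Filling with the per-column medians keeps all rows and removes all NaNs.
filled = df.fillna(df.median())
print(len(filled))                # → 3
print(filled.isna().sum().sum())  # → 0
```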

# I've got this code from https://www.dataquest.io/blog/pandas-big-data/
import pandas as pd

def mem_usage(pandas_obj):
    """Return the deep memory usage of a DataFrame or Series as a string."""
    if isinstance(pandas_obj, pd.DataFrame):
        usage_b = pandas_obj.memory_usage(deep=True).sum()
    else:  # we assume if it's not a DataFrame it's a Series
        usage_b = pandas_obj.memory_usage(deep=True)
    usage_mb = usage_b / 1024 ** 2  # convert bytes to megabytes
    return "{:03.2f} MB".format(usage_mb)

gl_float = myset.select_dtypes(include=['float'])
converted_float = gl_float.apply(pd.to_numeric, downcast='float')

print(mem_usage(gl_float))         # almost 600 MB
print(mem_usage(converted_float))  # almost 300 MB

optimized_gl = myset.copy()
optimized_gl[converted_float.columns] = converted_float  # this never finishes
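One possible alternative (a sketch on toy data, not tested at this size): instead of copying the full frame and then assigning a ~25000-column block back into it, build the converted frame in a single astype() call. Note the difference from the original: downcast='float' picks the smallest safe float type per column, while the fixed 'float32' mapping here is an assumption that float32 precision is acceptable.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real frame.
rng = np.random.default_rng(0)
myset = pd.DataFrame(rng.random((100, 50)))

float_cols = myset.select_dtypes(include=["float"]).columns

# Build the result once, rather than aligning thousands of columns
# into an existing copy of the frame.
optimized = myset.astype({c: "float32" for c in float_cols})

print(optimized.shape == myset.shape)  # → True
```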

After the replacement works, I want to use the Imputer function for the NaN replacement and then print the correlation result for my dataset.
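A minimal sketch of that last step on toy data, using plain pandas median filling (equivalent in effect to a median-strategy imputer such as scikit-learn's SimpleImputer, but staying inside pandas) followed by the correlation matrix:

```python
import numpy as np
import pandas as pd

# Toy frame; column names "a" and "b" are made up for illustration.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                   "b": [4.0, 5.0, np.nan]})

# Replace each NaN with the median of its column.
filled = df.fillna(df.median())

# Pearson correlation by default; the diagonal is always 1.0.
corr = filled.corr()
print(corr.loc["a", "a"])  # → 1.0
```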

cribb

1 Answer


In the end I decided to use this:

column1 = myset.iloc[:, 0]
converted_float.insert(loc=0, column='ids', value=column1)

instead of the lines with optimized_gl, and it solved the problem. This was only possible because every column changed except the first one, so I just had to add the first column back to the others.
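Put together on a toy frame (a sketch; the column names and values are made up, and 'ids' is the name chosen in the answer above): downcast only the float columns, then re-attach the untouched first column with insert() instead of writing the block back into a copy of the full frame.

```python
import pandas as pd

# Toy stand-in: an integer id column plus float answer columns.
myset = pd.DataFrame({"id": [10, 11],
                      "q1": [1.5, 2.5],
                      "q2": [3.5, 4.5]})

# Downcast only the float columns; the integer id column is left out.
converted_float = (myset.select_dtypes(include=["float"])
                        .apply(pd.to_numeric, downcast="float"))

# Re-attach the untouched first column at position 0.
column1 = myset.iloc[:, 0]
converted_float.insert(loc=0, column="ids", value=column1)

print(list(converted_float.columns))  # → ['ids', 'q1', 'q2']
```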
