I am having an issue concatenating two dataframes. The strange part is that it worked, but only once, the very first time; after I made some "clever" changes (which I will describe below), it never worked again and started throwing a MemoryError. Rebooting my machine did not help. Here's what's going on:
- There are two data files: a) `train.parquet` and b) `metadata.csv`.
- `metadata.csv` describes the kind of data `train.parquet` contains.
- There are 4 columns in the `metadata.csv` file, of which I am interested in three.
- Due to the type of problem I am working on, I need to transpose `train.parquet`. After the transpose, the number of rows in `train.parquet` WILL match `metadata.csv`. No problems there.
- So, to begin with, I do `metadata = pd.read_csv("metadata.csv")`.
- Then I do `train = pd.read_parquet("train.parquet", engine='pyarrow').T` (`.T` to transpose).
- Then, when I try `df = pd.concat([train, metadata.col1, metadata.col2, metadata.col3], axis='columns')`, I am thrown a `MemoryError`. (The whole failing sequence is condensed right after this list.)
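For reference, here is the failing path collected into one runnable block (column names as in the snippets above):

```python
import pandas as pd

# Condensed reproduction of the failing path.
metadata = pd.read_csv("metadata.csv")

# Read and transpose in one step, so rows of train line up with rows of metadata.
train = pd.read_parquet("train.parquet", engine="pyarrow").T

# This line raises MemoryError.
df = pd.concat([train, metadata.col1, metadata.col2, metadata.col3], axis="columns")
```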
Here's the 'clever' change I referred to earlier:
- I initially DID NOT transpose the training data directly (i.e., not `train = pd.read_parquet("train.parquet", engine='pyarrow').T`).
- Instead I did `original_data = pd.read_parquet("train.parquet", engine='pyarrow')`, and then `train = original_data.T`. Does pretty much the same thing, at least to me.
- After this, when I did `df = pd.concat([train, metadata.col1, metadata.col2, metadata.col3], axis='columns')`, it worked. This is the only time it worked. (A condensed version of this variant follows this list.)
- But then I realized I actually wanted `metadata.col1` at the end (as it is the target variable), so I thought I'd rerun it with the metadata columns rearranged, something like `df = pd.concat([train, metadata.col2, metadata.col3, metadata.col1], axis='columns')`. Seemed fair.
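Condensed, the variant that worked that one time looks like this:

```python
import pandas as pd

# The variant that worked exactly once: read first, transpose as a
# separate step, then concatenate.
metadata = pd.read_csv("metadata.csv")
original_data = pd.read_parquet("train.parquet", engine="pyarrow")
train = original_data.T  # same transpose, just not chained onto read_parquet

# On that one run, this produced df without error.
df = pd.concat([train, metadata.col1, metadata.col2, metadata.col3], axis="columns")
```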
- But before I could run that, I tried what I would later regret: transposing the dataframe directly again (`train = pd.read_parquet("train.parquet", engine='pyarrow').T`).
- After this, every time I try to concatenate the two dataframes I get the `MemoryError`, even after rebooting the machine.
What could be causing this?
Any help is greatly appreciated.
Thanks in advance.
EDIT - It's a 64 GB Azure VM.
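EDIT 2 - In case it helps, this is the diagnostic I plan to run to compare the two paths. `info(memory_usage="deep")` and `dtypes` are standard pandas calls; I haven't captured the output yet:

```python
import pandas as pd

# Diagnostic sketch: compare dtypes and deep memory usage before and
# after the transpose, to see whether the footprint changes.
original_data = pd.read_parquet("train.parquet", engine="pyarrow")
original_data.info(memory_usage="deep")     # footprint as read from parquet
print(original_data.dtypes.value_counts())  # dtype mix before transposing

train = original_data.T
train.info(memory_usage="deep")             # footprint after transposing
print(train.dtypes.value_counts())          # did the dtypes change after .T?
```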