I am having an issue concatenating two dataframes. The strange part is that it worked, but only once, the very first time; after I made some "clever" changes (which I will describe below), it never worked again and started throwing a MemoryError. Rebooting my machine did not help. Here's what's going on:
- There are two data files: a) `train.parquet` and b) `metadata.csv`.
- `metadata.csv` describes the kind of data `train.parquet` contains.
- There are 4 columns in the `metadata.csv` file, of which I am interested in three.
- Due to the type of problem I am working on, I need to transpose `train.parquet`. After the transpose, the number of rows in `train.parquet` WILL match `metadata.csv`. No problems there.
- So, to begin with, I do `metadata = pd.read_csv("metadata.csv")`.
- Then I do `train = pd.read_parquet("train.parquet", engine='pyarrow').T` (`.T` to transpose).
- Then, when I try `df = pd.concat([train, metadata.col1, metadata.col2, metadata.col3], axis='columns')`, I am thrown a `MemoryError`. (The whole failing sequence is condensed right after this list.)
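For reference, here is the failing path collected into one runnable block (column names as in the snippets above):

```python
import pandas as pd

# Condensed reproduction of the failing path.
metadata = pd.read_csv("metadata.csv")

# Read and transpose in one step, so rows of train line up with rows of metadata.
train = pd.read_parquet("train.parquet", engine="pyarrow").T

# This line raises MemoryError.
df = pd.concat([train, metadata.col1, metadata.col2, metadata.col3], axis="columns")
```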
Here's the 'clever' change I referred to earlier:
- I initially DID NOT transpose the training data directly (i.e., not `train = pd.read_parquet("train.parquet", engine='pyarrow').T`).
- Instead I did `original_data = pd.read_parquet("train.parquet", engine='pyarrow')`, and then `train = original_data.T`. Does pretty much the same thing, at least to me.
- After this, when I did `df = pd.concat([train, metadata.col1, metadata.col2, metadata.col3], axis='columns')`, it worked. This is the only time it worked. (A condensed version of this variant follows this list.)
- But then I realized I actually wanted `metadata.col1` at the end (as it is the target variable), so I thought I'd rerun it with the metadata columns rearranged, something like `df = pd.concat([train, metadata.col2, metadata.col3, metadata.col1], axis='columns')`. Seemed fair.
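Condensed, the variant that worked that one time looks like this:

```python
import pandas as pd

# The variant that worked exactly once: read first, transpose as a
# separate step, then concatenate.
metadata = pd.read_csv("metadata.csv")
original_data = pd.read_parquet("train.parquet", engine="pyarrow")
train = original_data.T  # same transpose, just not chained onto read_parquet

# On that one run, this produced df without error.
df = pd.concat([train, metadata.col1, metadata.col2, metadata.col3], axis="columns")
```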
- But before I could run that, I tried what I would later regret: transposing the dataframe directly again (`train = pd.read_parquet("train.parquet", engine='pyarrow').T`).
- After this, every time I try to concatenate the two dataframes I get the `MemoryError`, even after rebooting the machine.
What could be causing this?
Any help is greatly appreciated.
Thanks in advance.
EDIT - It's a 64 GB Azure VM.
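EDIT 2 - In case it helps, this is the diagnostic I plan to run to compare the two paths. `info(memory_usage="deep")` and `dtypes` are standard pandas calls; I haven't captured the output yet:

```python
import pandas as pd

# Diagnostic sketch: compare dtypes and deep memory usage before and
# after the transpose, to see whether the footprint changes.
original_data = pd.read_parquet("train.parquet", engine="pyarrow")
original_data.info(memory_usage="deep")     # footprint as read from parquet
print(original_data.dtypes.value_counts())  # dtype mix before transposing

train = original_data.T
train.info(memory_usage="deep")             # footprint after transposing
print(train.dtypes.value_counts())          # did the dtypes change after .T?
```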