I am having an issue concatenating two dataframes. The strange part is that it worked, but only once (the very first time); after I made some "clever" changes (which I will describe below), it never worked again and started throwing a MemoryError. I have also rebooted my machine, and the issue persists. So here's what's going on:

  1. There are two data files: a) train.parquet and b) metadata.csv
  2. metadata.csv contains information describing the kind of data that is in train.parquet.
  3. There are 4 columns in the metadata.csv file, out of which I am interested in three.
  4. Due to the type of problem I am working on, I am asked to transpose train.parquet. Once transposed, its number of rows matches the number of rows in metadata.csv. No problems there.
  5. So, to begin with, I do metadata = pd.read_csv("metadata.csv")
  6. Then I do train = pd.read_parquet("train.parquet", engine = 'pyarrow').T (.T to transpose)
  7. Then, when I try df = pd.concat([train, metadata.col1, metadata.col2, metadata.col3], axis = 'columns'), I get a MemoryError (see the consolidated snippet after this list).

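Putting those steps together, this is the consolidated snippet that now fails (these are exactly the calls listed above, nothing else):

    import pandas as pd

    # Step 5: load the metadata (4 columns, of which I need three)
    metadata = pd.read_csv("metadata.csv")

    # Step 6: load the training data and transpose it in one step,
    # so that its row count matches metadata.csv
    train = pd.read_parquet("train.parquet", engine = 'pyarrow').T

    # Step 7: this is the line that raises the MemoryError
    df = pd.concat([train, metadata.col1, metadata.col2, metadata.col3], axis = 'columns')
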
Here's the 'clever' change I referred to earlier:

  1. I initially DID NOT do a .T on the training data directly (train = pd.read_parquet("train.parquet", engine = 'pyarrow').T)
  2. I did original_data = pd.read_parquet("train.parquet", engine = 'pyarrow')
  3. Then train = original_data.T, which does pretty much the same thing, at least as far as I can tell.
  4. After this, when I did df = pd.concat([train, metadata.col1, metadata.col2, metadata.col3], axis = 'columns'), it worked. This is the only time it has worked.
  5. But then I realized that I actually wanted metadata.col1 at the end (as it is the target variable), so I thought I'd rerun the concatenation with the metadata columns rearranged, something like this: df = pd.concat([train, metadata.col2, metadata.col3, metadata.col1], axis = 'columns').
  6. Seemed fair.
  7. But before I could run that, I tried what I would later regret: transposing the dataframe directly (train = pd.read_parquet("train.parquet", engine = 'pyarrow').T).
  8. After this, every time I try to concatenate the two dataframes, I get the MemoryError (see the consolidated snippet after this list). Rebooting the machine did not help.

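For comparison, here is the consolidated snippet of the sequence that worked that one time (again, exactly the calls listed above):

    import pandas as pd

    metadata = pd.read_csv("metadata.csv")

    # Load first, then transpose as a separate step
    original_data = pd.read_parquet("train.parquet", engine = 'pyarrow')
    train = original_data.T

    # This concatenation succeeded (the only time it has)
    df = pd.concat([train, metadata.col1, metadata.col2, metadata.col3], axis = 'columns')
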
What can be causing this?

Any help is greatly appreciated.

Thanks in advance.

EDIT: It's a 64 GB Azure VM.
