
I have a huge data frame: 1,194 rows and 14,000,000 columns. I need the sum of each column, and I only want to save the column name and its sum if the sum is greater than 1. When I try to load the text file (which is over 30 GB) the process gets killed. The text file is tab-delimited and looks something like this:

cell 17472131 17472132 17472133..
cell_0 1 0 1
cell_1 0 0 0
cell_2 0 1 1
cell_3 1 0 0
...

Is there a way I can do this in a column-wise fashion, so I don't use too much memory?

Mark Wekking
Don't use pandas for this; use SQL or Dask. See [this](https://stackoverflow.com/questions/54208323/how-can-i-efficiently-transpose-a-67-gb-file-dask-dataframe-without-loading-it-e) – Umar.H Aug 04 '20 at 18:25
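For reference, a minimal sketch of the Dask route that comment suggests, assuming the file is at `data.txt` (a hypothetical path) and the first column, `cell`, holds the row labels as in the sample above:

```python
import dask.dataframe as dd

# Dask reads the file lazily, block by block, instead of loading it all at once.
df = dd.read_csv("data.txt", sep="\t")

# Drop the row-label column, sum each column, then trigger the computation.
sums = df.drop(columns=["cell"]).sum().compute()
result = sums[sums > 1]  # keep only the columns whose sum exceeds 1
```

Whether Dask copes well with 14 million columns is a separate question; this only shows the shape of the approach.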

1 Answer


pandas.read_csv() has the parameters skiprows and nrows to read a specific block of rows; see the [function documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).

I suggest setting up an array of sums (size 14 million) and then running a loop: read a few rows at a time, update the sums, and load the next few rows.
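A minimal sketch of that loop, assuming the file is at `data.txt` (a hypothetical path), the first column holds row labels, and the remaining values are small integers as in the sample:

```python
import numpy as np
import pandas as pd

PATH = "data.txt"   # assumption: path to the tab-delimited file
CHUNK = 100         # rows per block; tune to the available memory

# Read only the header line to get the 14 million column names
# (the first column holds the row labels).
columns = pd.read_csv(PATH, sep="\t", nrows=0).columns[1:]

sums = np.zeros(len(columns))  # one running sum per column

skip = 1  # always skip the header line
while True:
    try:
        block = pd.read_csv(PATH, sep="\t", header=None, index_col=0,
                            skiprows=skip, nrows=CHUNK)
    except pd.errors.EmptyDataError:
        break  # skiprows ran past the end of the file
    sums += block.to_numpy().sum(axis=0)
    skip += len(block)
    if len(block) < CHUNK:
        break  # last (partial) block

# Keep only the columns whose sum exceeds 1.
result = {name: s for name, s in zip(columns, sums) if s > 1}
```

pandas can also do the block bookkeeping itself: `pd.read_csv(PATH, sep="\t", index_col=0, chunksize=CHUNK)` returns an iterator of blocks, which avoids the manual skiprows arithmetic.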

Poe Dator