
I have a huge data frame: 1,194 rows and 14,000,000 columns. I need the sum of each column, and I only want to save the column name and its sum if the sum is greater than 1. When I try to load the text file (which is over 30 GB) the process gets killed. The text file is tab-delimited and looks something like this:

cell 17472131 17472132 17472133..
cell_0 1 0 1
cell_1 0 0 0
cell_2 0 1 1
cell_3 1 0 0
...

Is there a way I can do this in a column-wise fashion, so I don't use too much memory?

Mark Wekking
Don't use pandas for this; use SQL or Dask. See [this](https://stackoverflow.com/questions/54208323/how-can-i-efficiently-transpose-a-67-gb-file-dask-dataframe-without-loading-it-e) – Umar.H Aug 04 '20 at 18:25
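For reference, a minimal sketch of the Dask route that comment suggests, assuming the file is at `data.txt` (a hypothetical path) and the first column, `cell`, holds the row labels as in the sample above:

```python
import dask.dataframe as dd

# Dask reads the file lazily, block by block, instead of loading it all at once.
df = dd.read_csv("data.txt", sep="\t")

# Drop the row-label column, sum each column, then trigger the computation.
sums = df.drop(columns=["cell"]).sum().compute()
result = sums[sums > 1]  # keep only the columns whose sum exceeds 1
```

Whether Dask copes well with 14 million columns is a separate question; this only shows the shape of the approach.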

1 Answer


pandas.read_csv() has the parameters skiprows and nrows to read a specific block of rows; see the [function documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).

I suggest setting up an array of sums (size 14 million) and then running a loop: read a few rows at a time, update the sums, and load the next few rows.
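A minimal sketch of that loop, assuming the file is at `data.txt` (a hypothetical path), the first column holds row labels, and the remaining values are small integers as in the sample:

```python
import numpy as np
import pandas as pd

PATH = "data.txt"   # assumption: path to the tab-delimited file
CHUNK = 100         # rows per block; tune to the available memory

# Read only the header line to get the 14 million column names
# (the first column holds the row labels).
columns = pd.read_csv(PATH, sep="\t", nrows=0).columns[1:]

sums = np.zeros(len(columns))  # one running sum per column

skip = 1  # always skip the header line
while True:
    try:
        block = pd.read_csv(PATH, sep="\t", header=None, index_col=0,
                            skiprows=skip, nrows=CHUNK)
    except pd.errors.EmptyDataError:
        break  # skiprows ran past the end of the file
    sums += block.to_numpy().sum(axis=0)
    skip += len(block)
    if len(block) < CHUNK:
        break  # last (partial) block

# Keep only the columns whose sum exceeds 1.
result = {name: s for name, s in zip(columns, sums) if s > 1}
```

pandas can also do the block bookkeeping itself: `pd.read_csv(PATH, sep="\t", index_col=0, chunksize=CHUNK)` returns an iterator of blocks, which avoids the manual skiprows arithmetic.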

Poe Dator