
I have 3 rather large files (67 GB, 36 GB, 30 GB) that I need to train models on. However, the features are rows and the samples are columns. Since Dask hasn't implemented transpose and stores DataFrames split by row, I need to write something to do this myself. Is there a way I can efficiently transpose without loading everything into memory?

I've got 16 GB of RAM at my disposal and am using a Jupyter notebook. I have written some rather slow code, but would really appreciate a faster solution. At its current speed, the code below will take about a month to finish all the files. The slowest step, by a few orders of magnitude, is awk.

import os
import subprocess
import pandas as pd
import dask.dataframe as dd
from IPython.display import display, clear_output

df = dd.read_csv('~/VeryLarge.tsv', sep='\t')
with open('output.csv', 'wb') as fout:
    for i in range(1, len(df.columns)+1):
        print('AWKing')
        #read a column from the original data and store it elsewhere
        x = "awk '{print $"+str(i)+"}' ~/VeryLarge.tsv > ~/file.temp"
        subprocess.check_call([x], shell=True)

        print('Reading')
        #load and transpose the column
        col = pd.read_csv('~/file.temp')
        row = col.T
        display(row)

        print('Deleting')
        #remove the temporary file created
        !rm ~/file.temp

        print('Storing')
        #store the row in its own csv just to be safe. not entirely necessary
        row.to_csv('~/columns/col_{:09d}'.format(i), header=False)

        print('Appending')
        #append the row (transposed column) to the new file
        with open(os.path.expanduser('~/columns/col_{:09d}'.format(i)), 'rb') as fin:
            for line in fin:
                fout.write(line)

        clear_output()
        #Just a measure of progress
        print(i/len(df.columns))

The data itself is 10 million rows (features) by 2000 columns (samples). It just needs to be transposed. Currently, it looks like this:
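(The sample originally shown here was an image; the toy snippet below, with made-up feature/sample names and values, just illustrates the current layout versus the desired one.)

import pandas as pd

# made-up miniature version of the data: features as rows, samples as columns
df = pd.DataFrame(
    {'sample_001': [0.12, 5.60], 'sample_002': [3.40, 0.00]},
    index=['feature_0000001', 'feature_0000002'])
print(df)     # current layout: features are the rows
print(df.T)   # desired layout: samples are the rows, features are the columns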

Joe B
  • Can you post a sample of input to desired output? – Mike Jan 16 '19 at 01:09
  • Just edited to show a sample of the data. The data is real, but I changed the names of features and samples because of company privacy. – Joe B Jan 16 '19 at 01:20
  • at risk of wandering off topic, if your data set has a high proportion of zeros you might consider using a sparse matrix representation. Many common matrix operations can be much more efficient that way. – smarchese Jan 16 '19 at 01:59
  • So you want 10m columns and 2000 rows? – Mike Jan 16 '19 at 02:48
  • Try http://docs.dask.org/en/latest/array-api.html#dask.array.transpose – mdurant Jan 16 '19 at 15:58
  • @mdurant this is essentially the same thing I am doing currently ie a column-wise transposition. array.transpose can't be executed over an entire dask dataframe. With the above script the transposition wasn't the slow step...awk was sooo slow. – Joe B Jan 17 '19 at 07:26
  • @smarchese Can you explain more? I read some articles online, but I do not see how it keeps the data in the same representation. Can you use sparse matrix representation to feed into an ML algorithm...say Random Forest? – Joe B Jan 17 '19 at 07:27
  • Sure, if your matrix has say `M` rows and `N` columns then it's got `M*N` elements. You could (usually wastefully) store this as an array of 3-tuples `(i,j,x)` with i,j the row and column indices. If a high percentage of your data is one value (esp. zero) then you can forget about storing those. So it's more compact if less than a third of your matrix is non-zero. Transposing is a very simple operation in this representation - just swap the 'i' and 'j' inside each tuple (you may need to sort afterwards). You'd likely have to 're-hydrate' it into a dense representation to feed into many algorithms, though (see the sketch after these comments). – smarchese Jan 18 '19 at 04:28
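A minimal sketch of the COO idea above, using scipy.sparse; the shapes and values are made up purely for illustration:

import numpy as np
from scipy.sparse import coo_matrix

# made-up, mostly-zero matrix: rows are features, columns are samples
dense = np.zeros((6, 4))
dense[0, 1] = 912.0
dense[3, 2] = 15.0

sparse = coo_matrix(dense)   # stores only the (i, j, value) triples
transposed = sparse.T        # swaps the row and column index arrays; no data copy
print(transposed.shape)      # (4, 6)

# some estimators, e.g. scikit-learn's RandomForestClassifier, accept the CSR form directly
csr = transposed.tocsr()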

2 Answers

I've modified my original script so it can run on any number of CPUs. It worked much faster since I could use multiple processes, and I deployed it on AWS: a 96-core machine completed the task in about 8 hours. I was quite surprised, since that is nearly linear scaling! The idea is to split the work into a repetitive task that can be distributed, then assign those tasks to CPUs; here the parallelization is done with pool.map().

Usage of this script from the command line is quite simple:

python3 transposer.py -i largeFile.tsv

You can specify other arguments as well if required.

import argparse
import subprocess
import pandas as pd
from contextlib import closing
from os import cpu_count
from multiprocessing import Pool

parser = argparse.ArgumentParser(description='Transpose csv')
parser.add_argument('-i', '--infile', help='Path to input file',
                    default=None)
parser.add_argument('-s', '--sep', help='input separator',
                    default='\t')

args = parser.parse_args()
infile = args.infile
sep = args.sep
#only the header is needed, so read just a few rows
df = pd.read_csv(infile, sep=sep, nrows=3)

def READ_COL(item):
    print(item)
    outfile = 'outfile{}.temp'.format(item)
    if item != 0:
        #extract one column with awk and write it to a temp file
        x = "awk '{print $"+str(item)+"}' "+infile+" > "+outfile
        subprocess.check_call([x], shell=True)
        #load the column, transpose it into a row and store it as its own csv
        col = pd.read_csv(outfile)
        row = col.T
        row.to_csv('col_{:09d}.csv'.format(item), header=False)
        #remove the temporary file
        subprocess.check_call(['rm '+outfile], shell=True)
        #just a measure of progress
        print(item/len(df.columns))

if __name__ == '__main__':
    with closing(Pool(processes=cpu_count())) as pool:
        pool.map(READ_COL, list(range(1, len(df.columns)+1)))

After this, you should have a number of files, each containing one transposed column. You just need to join them together with cat or some other command-line tool; because the filenames are zero-padded, the shell glob keeps them in column order. I just ran cat col_* > full_file_transposed.csv
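If you prefer to do the join from Python, a small equivalent sketch (assuming the col_*.csv names produced above):

import glob
import shutil

# the zero-padded names (col_000000001.csv, ...) sort into the original column order
with open('full_file_transposed.csv', 'wb') as fout:
    for part in sorted(glob.glob('col_*.csv')):
        with open(part, 'rb') as fin:
            shutil.copyfileobj(fin, fout)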

Joe B

I would create an intermediate binary file and use fp.seek to write the values at their new, transposed offsets before converting it back into a new CSV. With (row, column) becoming (column, row), each value gets a fixed-size binary encoding (e.g. an 8-byte double), so the position of each element in the intermediate file is (column * old_row_count + row) * element_size.

You then re-combine them into a CSV by converting the values back into text and writing old_row_count values per line.
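This isn't code from the answer itself, just a minimal sketch of the idea, assuming a purely numeric, tab-separated file with no header and known row/column counts (the function name, scratch file name and the 8-byte double encoding are all assumptions):

import struct

ELEM = struct.calcsize('d')   # 8 bytes per double

def transpose_via_seek(infile, outfile, n_rows, n_cols, sep='\t'):
    # Pass 1: write every value into a binary scratch file at its transposed
    # offset, i.e. input element (row, col) lands at (col * n_rows + row) * ELEM.
    with open(infile) as fin, open('scratch.bin', 'wb') as tmp:
        tmp.truncate(n_rows * n_cols * ELEM)   # pre-size the scratch file
        for row, line in enumerate(fin):
            for col, field in enumerate(line.rstrip('\n').split(sep)):
                tmp.seek((col * n_rows + row) * ELEM)
                tmp.write(struct.pack('d', float(field)))
    # Pass 2: read the scratch file sequentially and emit n_rows values per CSV line.
    with open('scratch.bin', 'rb') as tmp, open(outfile, 'w') as fout:
        for _ in range(n_cols):
            vals = struct.unpack('{}d'.format(n_rows), tmp.read(n_rows * ELEM))
            fout.write(','.join(str(v) for v in vals) + '\n')

Note that pass 1 does one random seek per value, so this trades memory for a large amount of random I/O on the scratch file.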

Mike