I would like to perform division with respect to each letter of the alphabet. An example is given below:

The binary file is in CSV format and uses the following one-hot encoding:

A=1000, C=0100, G=0010, T=0001

binary.csv (encoding CAT and GAA, one sequence per row):

0,1,0,0,1,0,0,0,0,0,0,1
0,0,1,0,1,0,0,0,1,0,0,0
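
To make the encoding concrete, here is a minimal sketch of how the two rows above can be produced from the letters. The helper name `encode` is hypothetical and not part of the original code; only the A/C/G/T mapping comes from the question.

```python
# Encoding taken from the question: A=1000, C=0100, G=0010, T=0001
ONE_HOT = {"A": "1000", "C": "0100", "G": "0010", "T": "0001"}


def encode(sequence):
    """Return the one-hot bits of a sequence as a comma-separated CSV row."""
    bits = "".join(ONE_HOT[letter] for letter in sequence)
    return ",".join(bits)


print(encode("CAT"))  # 0,1,0,0,1,0,0,0,0,0,0,1
print(encode("GAA"))  # 0,0,1,0,1,0,0,0,1,0,0,0
```

Each letter occupies four consecutive columns, so position `i` in a row belongs to letter number `i // 4` and symbol number `i % 4` (0=A, 1=C, 2=G, 3=T).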

Each row of binary.csv needs to be multiplied by a single line of values, which is stored in another CSV file.

single.csv:

0.28,0.22,0.23,0.27,0.12,0.29,0.34,0.21,0.44,0.56,0.51,0.65

The code below multiplies the values in both files and outputs:

0.22,0.12,0.65
0.23,0.12,0.44

Code

import csv

# Read the single line of real values (kept as strings)
with open('single.csv', newline='') as csvfile:
    reals = next(csv.reader(csvfile, delimiter=','))

with open('binary.csv', newline='') as csvfile:
    pwreader = csv.reader(csvfile, delimiter=',')

    with open('output.csv', 'w', newline='') as testfile:
        csv_writer = csv.writer(testfile)
        for row in pwreader:
            # Keep the real value at every position where the bit is 1
            result = [reals[i] for i, b in enumerate(row) if b == '1']
            csv_writer.writerow(result)

I have an additional CSV file, and I would like to divide the previous output by its values, choosing each divisor according to the corresponding letter:

 A   C   G   T
 0.4,0.5,0.7,0.1
 0.2,0.8,0.9,0.3

The values for CAT are divided by 0.5,0.4,0.1 and the values for GAA by 0.9,0.2,0.2 respectively, so that I get a new output as follows:

 0.44,0.3,6.5
 0.26,0.6,2.2

Using NumPy arrays could possibly solve this, but it might not be suitable for more than a couple of thousand entries. I got an out-of-memory error when I tried it on more than 60,000 entries.

Can anyone help me?

Eric
Xiong89
  • Also, that's a very strange format to store your ACTG in – Eric Feb 23 '16 at 05:38
  • Removed most of your tags - this is _not_ integer division, and the fact that it's division at all is irrelevant. CSV is also irrelevant here, since you're successfully reading in the data – Eric Feb 23 '16 at 06:09

1 Answer

import numpy as np

Let's assume you can extract these from the files:

actg = np.array([
    [0,1,0,0,1,0,0,0,0,0,0,1],
    [0,0,1,0,1,0,0,0,1,0,0,0]
])

single = np.array([0.28,0.22,0.23,0.27,0.12,0.29,0.34,0.21,0.44,0.56,0.51,0.65])

division = np.array([
    [0.4,0.5,0.7,0.1],
    [0.2,0.8,0.9,0.3]
])

First, let's get actg into a more useful format, with one group of four bits per letter:

>>> actg = actg.reshape((-1, 3, 4))
>>> actg
array([[[0, 1, 0, 0],
        [1, 0, 0, 0],
        [0, 0, 0, 1]],

       [[0, 0, 1, 0],
        [1, 0, 0, 0],
        [1, 0, 0, 0]]])

We do the same for single:

>>> single = single.reshape((-1, 4))
>>> single
array([[ 0.28,  0.22,  0.23,  0.27],
       [ 0.12,  0.29,  0.34,  0.21],
       [ 0.44,  0.56,  0.51,  0.65]])

So now our objects are indexed as:

  • actg[row, col, symbol]
  • single[col, symbol]
  • division[row, symbol]

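At this point, we just multiply and sum over the symbol axis. Because actg is one-hot, the sum picks out exactly one value per column:

>>> res_1 = (single * actg).sum(axis=-1)
>>> res_1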
array([[ 0.22,  0.12,  0.65],
       [ 0.23,  0.12,  0.44]])

For the division, we need to insert a dimension to match col above, using np.newaxis:

>>> divide_by = (division[:,np.newaxis,:] * actg).sum(axis=-1)
>>> divide_by
array([[ 0.5,  0.4,  0.1],
       [ 0.9,  0.2,  0.2]])

Then finally, we just do the division:

>>> res2 = res_1 / divide_by
>>> res2
array([[ 0.44      ,  0.3       ,  6.5       ],
       [ 0.25555556,  0.6       ,  2.2       ]])

Bonus one liner:

res2 = (single[np.newaxis,:,:] / division[:,np.newaxis,:] * actg).sum(axis=-1)
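
Since every group of four bits contains exactly one 1, the multiply-and-sum can also be replaced by index lookups. This is a different technique from the one above, sketched here as an alternative; the names `symbol`, `rows`, `cols` and `res_indexed` are introduced only for illustration. It avoids building the intermediate float arrays of shape (rows, cols, 4), although the input arrays themselves must still fit in memory.

```python
import numpy as np

actg = np.array([
    [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0],
]).reshape((-1, 3, 4))
single = np.array([0.28, 0.22, 0.23, 0.27, 0.12, 0.29,
                   0.34, 0.21, 0.44, 0.56, 0.51, 0.65]).reshape((-1, 4))
division = np.array([
    [0.4, 0.5, 0.7, 0.1],
    [0.2, 0.8, 0.9, 0.3],
])

# Position of the single 1 in each group of four bits (0=A, 1=C, 2=G, 3=T)
symbol = actg.argmax(axis=-1)                    # shape (rows, cols)
rows = np.arange(actg.shape[0])[:, np.newaxis]   # shape (rows, 1)
cols = np.arange(actg.shape[1])                  # shape (cols,)

# Look up single[col, symbol] and division[row, symbol] directly
res_indexed = single[cols, symbol] / division[rows, symbol]

# Same result as the broadcasting one-liner
one_liner = (single[np.newaxis, :, :] / division[:, np.newaxis, :] * actg).sum(axis=-1)
print(np.allclose(res_indexed, one_liner))  # True
```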
Eric
  • Hi, when I tried this on more data, it gave an out-of-memory error. I think it would be better to keep those values in CSV files; I still prefer using the CSV format. – Xiong89 Feb 23 '16 at 08:42
  • To make this more memory efficient, you could do all your operations while reading the file line by line, and then write the results to a new file line by line in "a" mode (append open mode) – bmbigbang Feb 23 '16 at 09:09
  • The file is in CSV format; do you mean reading it in as an array? According to the example answer shown, the array includes all the rows of data. I don't quite get it. – Xiong89 Feb 23 '16 at 09:24
  • from numpy import genfromtxt; my_data = genfromtxt('my_file.csv', delimiter=',') – Xiong89 Feb 23 '16 at 09:28
  • Yes, this code assumes you read all the files in as arrays at the beginning. _'Using csv format'_ is (still) irrelevant here. The distinction you're making is that you can't afford to keep the whole thing in memory, so you need a solution operating on streams/iterators – Eric Feb 23 '16 at 20:34
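
The streaming approach suggested in the comments could look roughly like the sketch below, which processes one row at a time with the standard csv module so that memory use does not grow with the number of rows. It is not the answerer's code. The function name `stream_divide` is hypothetical, and the sketch assumes that the division file has no header line and contains exactly one row of four divisors (A, C, G, T) for each row of binary.csv, as in the example.

```python
import csv


def stream_divide(binary_path, single_path, division_path, output_path):
    """Multiply and divide row by row without loading whole files."""
    # The single line of values is small, so it is read once up front
    with open(single_path, newline="") as f:
        single = [float(x) for x in next(csv.reader(f))]

    with open(binary_path, newline="") as bin_f, \
         open(division_path, newline="") as div_f, \
         open(output_path, "w", newline="") as out_f:
        writer = csv.writer(out_f)
        # zip pulls one row from each reader at a time
        for bits, divisors in zip(csv.reader(bin_f), csv.reader(div_f)):
            divisors = [float(x) for x in divisors]
            result = []
            for i, b in enumerate(bits):
                if b.strip() == "1":
                    # i % 4 is the symbol index (0=A, 1=C, 2=G, 3=T)
                    result.append(single[i] / divisors[i % 4])
            writer.writerow(result)
```

Assuming the divisors are stored in a file called division.csv (a hypothetical name), it would be called as `stream_divide('binary.csv', 'single.csv', 'division.csv', 'output.csv')`. The output file is opened once in write mode rather than reopened in append mode for every row, which has the same effect with less overhead.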