0

I would like to make a file with 3 main columns but my current file has different number of columns per row. an example of my file is like this:

BPIFB3,chr20;ENST00000375494.3      
PXDN,chr2,ENST00000252804.4;ENST00000483018.1   
RP11,chr2,ENST00000607956.1     
RNF19B,chr1,ENST00000373456.7;ENST00000356990.5;ENST00000235150.4

and here is what I want to make:

BPIFB3  chr20   ENST00000375494.3       
PXDN    chr2    ENST00000252804.4
PXDN    chr2    ENST00000483018.1   
RP11    chr2    ENST00000607956.1       
RNF19B  chr1    ENST00000373456.7
RNF19B  chr1    ENST00000356990.5
RNF19B  chr1    ENST00000235150.4

in fact if in the 3rd row we have more than 3 columns, per any extra column, I want to make a new row in which the first two columns are the same but the 3rd column is different(which is the extra column in original file).

I tried the following code in python but did not get what I am looking for:

from collections import defaultdict
with open('data.tbl') as f, open('out.tbl', 'w') as out:
    for line in f.split('\t'):
        if len(line) > 2:
            d[line[0]] = line[3]
        out.write(d.items)
john
  • 263
  • 1
  • 9
  • People have given you the general shape of an answer, but you have changed the sample input (esp the delimiters) several times in you redits. What you split on, or use as a delimitier should match the file format. `\t` means tab. The import thing in all the answers is the list slicing. – doctorlove Dec 22 '16 at 11:23
  • 1
    Please do **not** make significant changes to your question after it's started receiving valid answers. – PM 2Ring Dec 22 '16 at 11:36

4 Answers4

1

If your input is actually well-formed, you can do this:

for row in reader:
    for thing in row[2].split(';'):
        writer.writerow(row[:2]+[thing])

But as it exists, your first row has malformed data that doesn't match the rest of your rows. So if that is an example of your data, then you could try replacing ; with , before you feed it to the csv reader, and then you can do:

for thing in row[3:]: 

instead.


old answer to your prevous question:

You just want list slicing. Also, if it's a tab separated file you can just use the csv module. And you're importing defaultdict that you're not using.

import csv

with open('data.tbl') as f, open('out.tbl', 'w') as out:
    reader = csv.reader(f, delimiter='\t')
    writer = csv.writer(out, delimiter='\t')
    for row in reader:
        writer.writerow(row[:3])
Community
  • 1
  • 1
Wayne Werner
  • 49,299
  • 29
  • 200
  • 290
  • it returns the same file – john Dec 22 '16 at 10:54
  • I also modified the question – john Dec 22 '16 at 10:56
  • The data that you copied into your question wasn't tab separated. Your new stuff is csv, so all you have to do is remove the `delimiter` fields. – Wayne Werner Dec 22 '16 at 11:09
  • 1
    You also should've asked a new question as your edits *entirely* changed your problem. – Wayne Werner Dec 22 '16 at 11:12
  • if I replace semicolons in my file manually, the code from dan works perfectly. but I would like to get what I want without changing the data.tbl. do you have any idea how to do so? – john Dec 22 '16 at 11:15
  • Your sample input is not sane at all, unless you manually changed your input because `BPIFB3,chr20;ENST00000375494.3` *should* be `BPIFB3,chr20,ENST00000375494.3`. Or your other rows should be different. – Wayne Werner Dec 22 '16 at 11:19
1

You don't need dictionary for this problem list is sufficient for this. And delimiter '\t' won't work in your problem coz there are multiple space not tab. so we need to remove multiple space with re. so below program will work for your solution.

import re

with open('data.tbl') as f, open('out.tbl', 'w') as out:
    for line in f:
        line = re.sub('\s+',' ',line)
        line = line.strip().split(' ')
        if len(line) > 3:
            for l in range(2,len(line)):
                out.write(str(line[0])+' '+line[1]+ ' '+line[l]+'\n')
        else:
            out.write(' '.join(line)+'\n')

Hope this will help you.

Chintak
  • 61
  • 5
0

Try this:

import csv

with open('data.tbl') as f, open('out.tbl', 'w') as out:
    reader = csv.reader(f, delimiter='\t')
    writer = csv.writer(out, delimiter='\t')
    for row in reader:
        if len(row) == 3:
            writer.writerow(row)
        else:
            n = len(row)
            writer.writerow(row[:3])
            for j in range(3,n):
                writer.writerow([row[0], row[1], row[j]])
dan
  • 53
  • 1
  • 8
  • if I replace all semicolons in "data.tbl" by comma, then it works perfectly. to do so, first I have to do that manually. how can I edit your code to do that without replacing separately? – john Dec 22 '16 at 11:12
0

Try splitting on either commas or semicolons with re:

import re
import csv



with open('data.tbl') as infile, open('out.tbl','w') as outfile:
    data = [re.split(",|;",x.strip("\n")) for x in infile]
    output = []
    for line in data:
        if len(line) > 3:
            output.append(line[:3])
            for elm in line[3:]:
                output.append(line[:2]+[elm])
        else:
            output.append(line)

    writer = csv.writer(outfile)
    writer.writerows(output)

I hope this helps.

Abdou
  • 12,931
  • 4
  • 39
  • 42