1

I need to replace '|' into tab so that I can analyze my human annotation genomic data (200+mb). I'm a research assistant learning how to analyze/manipulate sequencing data in the easiest/simplest way so that I can replicate this on more data.

Here how my data looks like. There are ~400,000 lines of this type of data in one file.

       ANN=C|downstream_gene_variant|MODIFIER|OR4G4P|ENSG00000268020|transcript|ENST00000606857|unprocessed_pseudogene||n.*1414T>C|||||1414|,C|intron_variant|MODIFIER|OR4G4P|ENSG00000268020|transcript|ENST00000594647|unprocessed_pseudogene|1/1|n.20-104T>C||||||;DP=11;SS=1;VT=SNP

I tried to use this code to replace '|' into '\t' for several lines.

import csv
infile = 'Book2.xlsx'
with open(infile , 'r') as inf: 
    for line in inf:    
        w =csv.writer(inf, delimiter = '\t')
        print w

All I'm getting is this :

<_csv.writer object at 0x7f8beebaafc8>
<_csv.writer object at 0x7f8beebaafc8>
<_csv.writer object at 0x7f8beebaafc8>
<_csv.writer object at 0x7f8beebaafc8>
<_csv.writer object at 0x7f8beebaafc8>
<_csv.writer object at 0x7f8beebaafc8>
<_csv.writer object at 0x7f8beebaafc8>
<_csv.writer object at 0x7f8beebaafc8>
<_csv.writer object at 0x7f8beebaafc8>
<_csv.writer object at 0x7f8beebaafc8>
<_csv.writer object at 0x7f8beebaafc8>
<_csv.writer object at 0x7f8beebaafc8>
<_csv.writer object at 0x7f8beebaafc8>
<_csv.writer object at 0x7f8beebaafc8>
<_csv.writer object at 0x7f8beebaafc8>
<_csv.writer object at 0x7f8beebaafc8>
<_csv.writer object at 0x7f8beebaafc8>
<_csv.writer object at 0x7f8beebaafc8>
<_csv.writer object at 0x7f8beebaafc8>
Jan Shamsani
  • 321
  • 2
  • 5
  • 14
  • The `csv` module cannot read and write Excel format `.xlsx` files, only text file containing Character Separated Value. There are a number of third party modules for Python that allow processing Excel files so you will need to use one (although in theory you could write your own). Secondly, even if `csv` did support the format, your code is creating a new `csv.writer` _class instance_ for every line of the input file it reads, which is why you're getting the output shown. – martineau Oct 20 '15 at 03:45

3 Answers3

1

Try using regular expression. Example for one line:

a = "ANN=C|downstream_gene_variant|MODIFIER|OR4G4P|ENSG00000268020|transcript|ENST00000606857|unprocessed_pseudogene||n.*1414T>C|||||1414|,C|intron_variant|MODIFIER|OR4G4P|ENSG00000268020|transcript|ENST00000594647|unprocessed_pseudogene|1/1|n.20-104T>C||||||;DP=11;SS=1;VT=SNP"


import re
regex= re.compile(r'\|')
regex.sub("\t",a)
flamenco
  • 2,702
  • 5
  • 30
  • 46
0

print w is just printing the brand new csv.writer object you created (and you probably don't mean to be creating a new writer on every loop). If you want to write to it, use the actual writerow method. There is a lot of example code on the csv docs page for reading and writing, you just need to combine them (and use an alternate dialect for each).

You're not actually using a csv.reader, and even if you were, your input file is .xlsx, which is not CSV, pipe delimited, or anything else, it's a much more complex data format (Office XML, and it's a zip file, so it looks like binary gibberish when read directly) that csv can't read.

ShadowRanger
  • 143,180
  • 12
  • 188
  • 271
0

Your filename indicates excel, but for now I will assume you actually have a tsv (Tab Seperated) file and not an Excel file. If so

infile = 'Book2.tsv'
of = 'Book2.csv'
with open(infile , 'r') as inf: 
    for line in inf:
         of.write(line.replace('|',','))
of.close()

Of course if it is Excel that's different try xlrd Udacity has some nice lessons in "Data Wrangling with MongoDB"

MrSteve
  • 499
  • 5
  • 13