How to split tsv file into smaller tsv file based on row values

Question

I have a tsv file in.txt which I would like to split into a smaller tsv file called out.txt.

I would like to import only the rows of in.txt which contain a string value My String Value in column 6 into out.txt.

import csv

# r is textmode
# rb is binary mode
# binary mode is faster

with open('in.txt','rb') as tsvIn, open('out.txt', 'w') as tsvOut:
    tsvIn = csv.reader(tsvIn, delimiter='\t')
    tsvOut = csv.writer(tsvOut)

    for row in tsvIn:
        if "My String Value" in row:
            tsvOut.writerows(row)

My output looks like this.

D,r,a,m,a

1,9,6,1,-,0,4,-,1,3
H,y,u,n, ,M,o,k, ,Y,o,o
B,e,o,m,-,s,e,o,n, ,L,e,e
M,u,-,r,y,o,n,g, ,C,h,o,i,",", ,J,i,n, ,K,y,u, ,K,i,m,",", ,J,e,o,n,g,-,s,u,k, ,M,o,o,n,",", ,A,e,-,j,a, ,S,e,o

A, ,p,u,b,l,i,c, ,a,c,c,o,u,n,t,a,n,t,',s, ,s,a,l,a,r,y, ,i,s, ,f,a,r, ,t,o,o, ,s,m,a,l,l, ,f,o,r, ,h,i,m, ,t,o, ,e,v,e,n, ,g,e,t, ,a, ,c,a,v,i,t,y, ,f,i,x,e,d,",", ,l,e,t, ,a,l,o,n,e, ,s,u,p,p,o,r,t, ,h,i,s, ,f,a,m,i,l,y,., ,H,o,w,e,v,e,r,",", ,h,e, ,m,u,s,t, ,s,o,m,e,h,o,w, ,p,r,o,v,i,d,e, ,f,o,r, ,h,i,s, ,s,e,n,i,l,e,",", ,s,h,e,l,l,-,s,h,o,c,k,e,d, ,m,o,t,h,e,r,",", ,h,i,s, ,.,.,.

K,o,r,e,a,n,",", ,E,n,g,l,i,s,h

S,o,u,t,h, ,K,o,r,e,a

It should look like this with tab separated values

Drama     Hyn Mok Yoo     A public accountant's salary is far to small for him...etc

Awesome. I have an output now. But the cell values are aligned vertically rather than horizontally as in the original file. Additionally, the the individual characters in the cell values are separated by tabs. I was expecting the cell values to be aligned horizontally and separated by tabs. — the_prole, Mar 18 '16 at 19:52
To write a single row, use `writerow()`, not `writerows()`. And if you want tabs as delimiter in the output file, you need to specify that as well. — Sven Marnach, Mar 18 '16 at 19:57

score 2 · Accepted Answer · edited May 23 '17 at 12:23

There are a few things wrong with your code. Let's look at this line by line..

import csv

Import module csv. Ok.

with open('in.txt','rb') as tsvIn, open('out.txt', 'w') as tsvOut:

With auto-closed binary file read handle tsvIn from in.txt, and text write handle tsvOut from out.txt, do... (Note: you probably want to use mode wb instead of mode w; see this post)

    tsvIn = csv.reader(tsvIn, delimiter='\t')

Let tsvIn be the result of the call of function reader in module csv with arguments tsvIn and delimiter='\t'. Ok.

    tsvOut = csv.writer(tsvOut)

Let tsvOut be the result of the call of function writer in module csv with argument tsvOut. You proably want to add another argument, delimiter='\t', too.

    for row in tsvIn:

For each element in tsvIn as row, do...

        if "My String Value" in row:

If string "My String Value" is present in row. You mentioned that you wanted to show only those rows whose sixth element was equal to the string, thus you should use something like this instead...

        if len(row) >= 6 and row[5] == "My String Value":

This means: If the length of row is at least 6, and the sixth element of row is equal to "My String Value", do...

            tsvOut.writerows(row)

Call method writerows of object tsvOut with argument row. Remember that in Python, a string is just a sequence of characters, and a character is a single-element string. Thus, a character is a sequence. Then, we have that row is, according to the docs, a list of strings, each representing a column of the row. Thus, a row is a list of strings. Then, we have the writerows method, that expects a list of rows, that is, a list of lists of strings, that is, a list of lists of sequences of characters. It happens that you can interpret each of row's elements as a row, when it's actually a string, and each element of that string as a string (as characters are strings!). All of this means is that you'll get a messy, character-by-character output. You should try this instead...

            tsvOut.writerow(row)

Method writerow expects a single row as an argument, not a list of rows, thus this will yield the expected result.

You definitely want to use `'rb'` and `'wb`' for the CSV module (see also [this answer](http://stackoverflow.com/a/4250021/279627)). This has nothing to do with what is faster, it's rather a question of what works on all platforms. Also note that a string isn't a "list" of characters. It's a _sequence_ of characters, which is something less specific in Python terminology. — Sven Marnach, Mar 18 '16 at 21:48

score 1 · Answer 2 · answered Mar 18 '16 at 20:12

try this:

import csv

# r is textmode
# rb is binary mode
# binary mode is faster

with open('in.txt','r') as tsvIn, open('out.txt', 'w') as tsvOut:
    reader = csv.reader(tsvIn, delimiter='\t')
    writer = csv.writer(tsvOutm, delimiter='\t')

    [writer.writerow(row) for row in reader if "My String Value" in row]

How to split tsv file into smaller tsv file based on row values

2 Answers2