Some timings on a file with 100000 columns; count seems the fastest but is off by one (it counts separators, not fields):
In [14]: %%timeit
with open("test.csv") as f:
    r = csv.reader(f, delimiter="\t")
    len(next(r))
   ....:
10 loops, best of 3: 88.7 ms per loop
In [15]: %%timeit
with open("test.csv") as f:
    next(f).count("\t")
   ....:
100 loops, best of 3: 11.9 ms per loop
In [16]: %%timeit
with io.open('test.csv', 'r') as fin:
    num_columns = len(next(fin).split('\t'))
   ....:
10 loops, best of 3: 133 ms per loop
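Since count("\t") returns the number of separators rather than the number of fields, the actual column count is the count plus one; a minimal sketch of the corrected form:

with open("test.csv") as f:
    num_columns = next(f).count("\t") + 1  # tabs separate columns, so add 1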
Using str.translate (the Python 2 signature here) is actually the fastest, although again you need to add 1:
In [5]: %%timeit
with open("test.csv") as f:
    n = next(f)
    len(n) - len(n.translate(None, "\t"))
   ...:
100 loops, best of 3: 9.9 ms per loop
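str.translate(None, "\t") only works on Python 2 strings; on Python 3 the equivalent (not timed here) passes a translation table that deletes the tab character:

# Python 3 sketch: a mapping of the tab's code point to None deletes it
with open("test.csv") as f:
    n = next(f)
    num_columns = len(n) - len(n.translate({ord("\t"): None})) + 1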
The pandas solution gives me an error:
in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7977)()
StopIteration:
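For reference, a pandas approach that parses only the first row would look something like the sketch below (an assumption about the intent, not the exact snippet that raised the StopIteration above):

import pandas as pd

# read just one row; the DataFrame's width is the column count
num_columns = pd.read_csv("test.csv", sep="\t", nrows=1).shape[1]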
Using readline adds more overhead:
In [19]: %%timeit
with open("test.csv") as f:
    f.readline().count("\t")
   ....:
10 loops, best of 3: 28.9 ms per loop
In [30]: %%timeit
with io.open('test.csv', 'r') as fin:
    num_columns = len(fin.readline().split('\t'))
   ....:
10 loops, best of 3: 136 ms per loop
Different results using Python 3.4:
In [7]: %%timeit
with io.open('test.csv', 'r') as fin:
    num_columns = len(next(fin).split('\t'))
   ...:
10 loops, best of 3: 102 ms per loop
In [8]: %%timeit
with open("test.csv") as f:
    f.readline().count("\t")
   ...:
100 loops, best of 3: 12.7 ms per loop
In [9]: %%timeit
with open("test.csv") as f:
    next(f).count("\t")
   ...:
100 loops, best of 3: 11.5 ms per loop
In [10]: %%timeit
with io.open('test.csv', 'r') as fin:
    num_columns = len(next(fin).split('\t'))
   ....:
10 loops, best of 3: 89.9 ms per loop
In [11]: %%timeit
with io.open('test.csv', 'r') as fin:
    num_columns = len(fin.readline().split('\t'))
   ....:
10 loops, best of 3: 92.4 ms per loop
In [13]: %%timeit
with open("test.csv") as f:
    r = csv.reader(f, delimiter="\t")
    len(next(r))
   ....:
10 loops, best of 3: 176 ms per loop
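For reuse, the count-based approach (the fastest shown under Python 3.4 and close to the fastest under Python 2) can be wrapped in a small helper; count_columns and the sep default are just illustrative choices:

def count_columns(path, sep="\t"):
    # one line is enough: fields = separators + 1
    with open(path) as f:
        return next(f).count(sep) + 1

num_columns = count_columns("test.csv")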