41

I have a problem reading a CSV (or txt) file with the pandas module. Because numpy's loadtxt function takes too much time, I decided to use pandas's read_csv instead.

I want to make a numpy array from a txt file that has four columns separated by spaces and a very large number of rows (like 256^3; in this example, it is 4^3 = 64).

The problem is that, for some reason, pandas's read_csv always seems to skip the first line (first row) of the csv (txt) file, resulting in one fewer row of data.

Here is the code:

from __future__ import division
import numpy as np
import pandas as pd

# Build a 4x4x4 grid: columns 0-2 hold the x, y, z indices, column 3 holds random data
ngridx = 4
ngridy = 4
ngridz = 4
size = ngridx*ngridy*ngridz
f = np.zeros((size, 4))
a = np.arange(size)
f[:, 0] = np.floor_divide(a, ngridy*ngridz)
f[:, 1] = np.fmod(np.floor_divide(a, ngridz), ngridy)
f[:, 2] = np.fmod(a, ngridz)
f[:, 3] = np.random.rand(size)
print(f[0])

# Write the array to a space-separated text file (no header line is written)
np.savetxt('Testarray.txt', f, fmt='%6.16f')

# Read it back with pandas; no header argument is passed here
g = pd.read_csv('Testarray.txt', delimiter=' ').values
print(g[0])
print(len(g[:, 3]))

The f[0] and g[0] displayed in the output should match, but they don't, indicating that pandas is skipping the first line of Testarray.txt. Also, the loaded array g is shorter than the array f.

starball
Tom
  • Why are you saving in numpy and then reading in pandas? It could be slow. Instead, convert the numpy array to a pandas DataFrame and then write it to csv. It is much, much faster. – pbu Feb 16 '15 at 22:40
  • Oh, it is just an example. I'm interested in reading it, not saving it. Thank you! – Tom Feb 17 '15 at 16:41

2 Answers

87

By default, pd.read_csv uses header=0 (when the names parameter is also not specified) which means the first (i.e. 0th-indexed) line is interpreted as column names.

If your data has no header, then use

pd.read_csv(..., header=None)

For example,

import io
import sys
import pandas as pd
if sys.version_info.major == 3:
    # Python3
    StringIO = io.StringIO 
else:
    # Python2
    StringIO = io.BytesIO

text = '''\
1 2 3
4 5 6
'''

print(pd.read_csv(StringIO(text), sep=' '))

Without header=None, the first line, 1 2 3, sets the column names:

   1  2  3
0  4  5  6

With header=None, the first line is treated as data:

print(pd.read_csv(StringIO(text), sep=' ', header=None))

prints

   0  1  2
0  1  2  3
1  4  5  6
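
Applied to the situation in the question, the same fix can be sketched as follows. This is only an illustration under assumptions: the array is a random 64x4 stand-in rather than the question's grid, and the file is written to a temporary directory instead of the working directory.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Stand-in for the question's data (assumption): 64 rows, 4 columns of random floats.
rng = np.random.default_rng(0)
f = rng.random((64, 4))

# Write a space-separated file with no header line, as np.savetxt does in the question.
path = os.path.join(tempfile.mkdtemp(), "Testarray.txt")
np.savetxt(path, f, fmt="%6.16f")

# Default call: the first line of the file is consumed as column names.
with_default = pd.read_csv(path, delimiter=" ").values

# header=None: every line of the file is treated as data.
with_header_none = pd.read_csv(path, delimiter=" ", header=None).values

print(with_default.shape)               # (63, 4)
print(with_header_none.shape)           # (64, 4)
print(np.allclose(with_header_none, f)) # True
```

The default call comes back one row short, which matches the symptom described in the question, while the header=None call returns all 64 rows.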
unutbu
  • Oh! Yeah! It worked! It was confusing that the pandas documentation on read_csv said that header is None by default, so I was very confused. After all, it was the header. Thank you so much for the help! – Tom Feb 07 '15 at 13:57
  • But then we cannot access the values when iterating over pandas DataFrames via `iterrows` and using `row[column]`. – Krishna Oza Feb 22 '19 at 06:31
  • As of pandas 0.24.2, the documentation suggests that the default value of header is 'infer' rather than 0. This means that when no names are passed you get the behaviour of header=0, but you get header=None behaviour if names are passed. – Ng Oon-Ee May 25 '19 at 23:00
1

If your file doesn't have a header row, you need to tell pandas so by using header=None in your call to pd.read_csv().
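
A related option, following the comment above about the 'infer' default, is to pass names instead. The sketch below assumes hypothetical column names x, y and z, which are not from the original question. With names given and header left unset, the first line should be kept as data.

```python
import io

import pandas as pd

text = "1 2 3\n4 5 6\n"

# The column names are hypothetical labels chosen for this illustration.
# With names passed and header left at its default, the first line stays in the data.
df = pd.read_csv(io.StringIO(text), sep=" ", names=["x", "y", "z"])

print(df)
#    x  y  z
# 0  1  2  3
# 1  4  5  6
```

This also gives the columns string labels, so a value can be looked up by name (for example df["x"]) rather than by integer position.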

RustProof Labs