87

This is an elaboration of a previous question, but as I delve deeper into python, I just get more confused as to how python handles csv files.

I have a csv file, and it must stay that way (e.g., cannot convert it to text file). It is the equivalent of a 5 rows by 11 columns array or matrix, or vector.

I have been attempting to read in the csv using various methods I have found here and other places (e.g. python.org) so that it preserves the relationship between columns and rows, where the first row and the first column = non-numerical values. The rest are float values, and contain a mixture of positive and negative floats.

What I wish to do is import the csv and compile it in python so that if I were to reference a column header, it would return its associated values stored in the rows. For example:

>>> workers, constant, age
>>> workers
    w0
    w1
    w2
    w3
    constant
    7.334
    5.235
    3.225
    0
    age
    -1.406
    -4.936
    -1.478
    0

And so forth...

I am looking for techniques for handling this kind of data structure. I am very new to python.

Unihedron
  • 10,902
  • 13
  • 62
  • 72
myClone
  • 1,609
  • 5
  • 17
  • 28
  • 1
    Python doesn't handle csv files by itself, but there are various libraries. You can look at the standard `csv` module and, if this is not enough, look at `pandas`. Basically, there is no magic: if you want something, just write the code for it, or find a library. – Sergey Orshanskiy Jan 27 '15 at 04:35

4 Answers4

161

For Python 3

Remove the rb argument and use either r or don't pass argument (default read mode).

with open( <path-to-file>, 'r' ) as theFile:
    reader = csv.DictReader(theFile)
    for line in reader:
        # line is { 'workers': 'w0', 'constant': 7.334, 'age': -1.406, ... }
        # e.g. print( line[ 'workers' ] ) yields 'w0'
        print(line)

For Python 2

import csv
with open( <path-to-file>, "rb" ) as theFile:
    reader = csv.DictReader( theFile )
    for line in reader:
        # line is { 'workers': 'w0', 'constant': 7.334, 'age': -1.406, ... }
        # e.g. print( line[ 'workers' ] ) yields 'w0'

Python has a powerful built-in CSV handler. In fact, most things are already built in to the standard library.

gitaarik
  • 42,736
  • 12
  • 98
  • 105
Katriel
  • 120,462
  • 19
  • 136
  • 170
  • Always open csv files in binary mode. – John Machin Aug 07 '10 at 00:24
  • 6
    @JohnMachin I don't know if that's true anymore. I tried this code (Python 3.4.3) with the `.csv` I created in a text editor, and I got an error: `_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)`. It worked when I removed the `rb` argument (default read mode). – NauticalMile May 28 '16 at 18:04
  • 2
    @NauticalMile is right: with python 2, @JohnMachin 's advice to open csv files `rb` applies, but with python 3.5 it causes that `_csv.Error: iterator should return strings, not bytes …` error. – Honore Doktorr Jul 09 '16 at 12:20
  • Be aware if you need to preserve the colums order. Dictionary does not preserve it (as expected). Thanks for very concise solution! – selyunin Jul 21 '16 at 09:31
122

Python's csv module handles data row-wise, which is the usual way of looking at such data. You seem to want a column-wise approach. Here's one way of doing it.

Assuming your file is named myclone.csv and contains

workers,constant,age
w0,7.334,-1.406
w1,5.235,-4.936
w2,3.2225,-1.478
w3,0,0

this code should give you an idea or two:

>>> import csv
>>> f = open('myclone.csv', 'rb')
>>> reader = csv.reader(f)
>>> headers = next(reader, None)
>>> headers
['workers', 'constant', 'age']
>>> column = {}
>>> for h in headers:
...    column[h] = []
...
>>> column
{'workers': [], 'constant': [], 'age': []}
>>> for row in reader:
...   for h, v in zip(headers, row):
...     column[h].append(v)
...
>>> column
{'workers': ['w0', 'w1', 'w2', 'w3'], 'constant': ['7.334', '5.235', '3.2225', '0'], 'age': ['-1.406', '-4.936', '-1.478', '0']}
>>> column['workers']
['w0', 'w1', 'w2', 'w3']
>>> column['constant']
['7.334', '5.235', '3.2225', '0']
>>> column['age']
['-1.406', '-4.936', '-1.478', '0']
>>>

To get your numeric values into floats, add this

converters = [str.strip] + [float] * (len(headers) - 1)

up front, and do this

for h, v, conv in zip(headers, row, converters):
  column[h].append(conv(v))

for each row instead of the similar two lines above.

slhck
  • 36,575
  • 28
  • 148
  • 201
John Machin
  • 81,303
  • 11
  • 141
  • 189
  • Thanks a lot John, this is very helpful. I had tried some techniques using some of the functions you used in the above example, but was unable to "package" the multiple csv functions appropriately. This will help tremendously. How then would I go about "stacking" these columns to generate a table of sorts? Could I use numpy.hstack (or is it vstack?) – myClone Aug 08 '10 at 18:38
  • 1
    I don't understand "stacking". You already have "a table of sorts" whose contents you can access by `column['column_name'][row_index]`. I don't use `numpy`; I'd need to read the manual (hint, hint). Perhaps you could ask another question, specifying what you need to do with the table. – John Machin Aug 08 '10 at 21:44
  • 20
    Note: with Python 3.5, I got `AttributeError: '_csv.reader' object has no attribute 'next'`. This was solved by using `next(reader, None)` instead of `reader.next()`. – Lindsay Ward Sep 11 '16 at 04:24
  • 1
    Thanks @LindsayWard, I was moving some code from Python 2.7 and questioning reality >.> – Joe Jun 21 '18 at 16:51
15

You can use pandas library and reference the rows and columns like this:

import pandas as pd

input = pd.read_csv("path_to_file");

#for accessing ith row:
input.iloc[i]

#for accessing column named X
input.X

#for accessing ith row and column named X
input.iloc[i].X
Ankur
  • 5,613
  • 1
  • 15
  • 14
4

I recently had to write this method for quite a large datafile, and i found using list comprehension worked quite well

      import csv
      with open("file.csv",'r') as f:
        reader = csv.reader(f)
        headers = next(reader)
        data = [{h:x for (h,x) in zip(headers,row)} for row in reader]
        #data now contains a list of the rows, with each row containing a dictionary 
        #  in the shape {header: value}. If a row terminates early (e.g. there are 12 columns, 
        #  it only has 11 values) the dictionary will not contain a header value for that row.
David Colwell
  • 2,450
  • 20
  • 31