35

I have data stored in a CSV where the first row is strings (column names) and the remaining rows are numbers. How do I store this to a numpy array? All I can find is how to set data type for columns but not for rows.

Right now I'm just skipping the headers to do the calculations, but I need to have the headers in the final version. If I leave the headers in, the whole array is read as strings and the calculations fail.

This is what I have:

 data = np.genfromtxt(path_to_csv, dtype=None, delimiter=',', skip_header=1) 
postelrich

3 Answers

53

You can keep the column names by passing the names=True argument to np.genfromtxt:

 data = np.genfromtxt(path_to_csv, dtype=float, delimiter=',', names=True) 

Please note the dtype=float, which converts your data to float. This is more efficient than using dtype=None, which asks np.genfromtxt to guess the datatype for you.

The output will be a structured array, where you can access individual columns by their name. The names will be taken from your first row. Some modifications may occur: spaces in a column name will be changed to _, for example. The documentation should cover most questions you could have.
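As a minimal sketch of the above, here is the same call run on a small in-memory CSV. The CSV text and the column names `length` and `net weight` are made up for illustration; with a real file you would pass the path instead of the `io.StringIO` object.

```python
import io

import numpy as np

# Hypothetical CSV content: one header row, then numeric rows.
csv_text = "length,net weight\n1.5,60\n2.5,80\n"

# names=True takes the field names from the first row.
data = np.genfromtxt(io.StringIO(csv_text), dtype=float,
                     delimiter=',', names=True)

# The space in "net weight" is replaced by an underscore.
print(data.dtype.names)

# Each column is a plain float array, so calculations work as usual.
mean_length = data['length'].mean()
mean_weight = data['net_weight'].mean()
print(mean_length, mean_weight)
```

The header names stay available through `data.dtype.names`, so they can be reused later to label results.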

o12d10
Pierre GM
  • I did it this way but it made an array with no columns. Just stored the whole row in one column – postelrich Sep 10 '12 at 03:58
  • What did you do *exactly*? What's your traceback? – Pierre GM Sep 10 '12 at 08:02
  • I ran exactly your line of code above. I don't know what a traceback is. – postelrich Sep 11 '12 at 14:11
  • Then, could you pastebin a portion of your input file, so that we can try? The "whole row in one column" looks fairly strange to me... The traceback is a copy of the screen you get after executing the code (when it fails). – Pierre GM Sep 11 '12 at 14:34
  • Here's the pastebin from the interpreter, just using genfromtxt, with the first two rows of the resulting matrix. You can see the data is stored only in rows and no columns. I did a .shape at the end. http://bpaste.net/show/45175/ – postelrich Sep 12 '12 at 22:40
  • Well, the shape of your array is `(750,)`, meaning you have 750 different rows, each row consisting of about 20 individual fields. You could access each field by its name: you can find the name in the `dtype` of your array. – Pierre GM Sep 13 '12 at 07:53
  • oh ok didn't know you had to access columns by the dtype. Thanks for your help! – postelrich Sep 14 '12 at 17:24
  • Just one note: Names like `-1` are stripped of the minus (and if there is also a `1`, one of the two becomes `1_1`), see [this question](http://stackoverflow.com/q/29097917/321973) – Tobias Kienzler Mar 25 '15 at 08:31
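To illustrate the point from the comments about the `(750,)` shape, here is a small sketch with made-up data (the columns `x`, `y`, `z` are hypothetical). A structured array is one-dimensional, with one record per row, and the field names live in its `dtype`. If a regular 2-D float array is needed for the calculations, one option (not mentioned in the thread) is `structured_to_unstructured` from `numpy.lib.recfunctions`, available in NumPy 1.16 and later.

```python
import io

import numpy as np
from numpy.lib import recfunctions as rfn

# Hypothetical CSV content: 4 data rows with 3 fields each.
csv_text = "x,y,z\n1,2,3\n4,5,6\n7,8,9\n10,11,12\n"

data = np.genfromtxt(io.StringIO(csv_text), dtype=float,
                     delimiter=',', names=True)

# One-dimensional: one record per data row.
print(data.shape)

# The column names are stored in the dtype, not in the data.
print(data.dtype.names)

# Convert to an ordinary 2-D array (rows x columns) when needed.
plain = rfn.structured_to_unstructured(data)
print(plain.shape)
```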
13

I'm not sure what you mean when you say you need the headers in the final version, but you can generate a structured array where the columns are accessed by strings like this:

data = np.genfromtxt(path_to_csv, dtype=None, delimiter=',', names=True)

and then access columns with data['col1_name'], data['col2_name'], etc.
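A short sketch of how that looks in practice, using invented data (the columns `count` and `price` are placeholders, not from the question). With dtype=None each column gets its own guessed type, so an all-integer column comes back as integers and a column with decimals as floats.

```python
import io

import numpy as np

# Hypothetical CSV content: an integer column and a float column.
csv_text = "count,price\n1,2.5\n2,3.5\n3,4.5\n"

# dtype=None lets genfromtxt guess a type for each column.
data = np.genfromtxt(io.StringIO(csv_text), dtype=None,
                     delimiter=',', names=True)

# Inspect the guessed type of each field.
print(data.dtype)

# Columns are selected by header name and behave like normal arrays.
total = (data['count'] * data['price']).sum()
print(total)
```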

user545424
3

The whole idea of a numpy array is that all elements are the same type. Read the headers into a Python list and manage them separately from the numbers. You can also create a structured array (an array of records) and in this case you can use the headers to name the fields in the records. Storing them in the array would be redundant in that case.
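A rough sketch of the first suggestion, keeping the headers in a Python list next to a plain numeric array. The CSV content is invented for illustration; with a real file, `open(path_to_csv)` would take the place of `io.StringIO(csv_text)`.

```python
import io

import numpy as np

# Hypothetical CSV content: header row followed by numbers.
csv_text = "a,b\n1,2\n3,4\n5,6\n"

with io.StringIO(csv_text) as f:
    # Read the header row yourself and keep it as a list of strings.
    headers = f.readline().strip().split(',')
    # The file position is now past the header, so only numbers remain.
    values = np.genfromtxt(f, delimiter=',')

# values is an ordinary 2-D float array, one column per header.
column_means = values.mean(axis=0)

# Use the separately stored headers to label the results.
labelled = dict(zip(headers, column_means.tolist()))
print(labelled)
```

Because the headers and the numbers share the same column order, the list index of a header is also the column index in the array.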

kindall
  • But the genfromtxt function stores the data in an ndarray and lets you choose the data type for each column. If there were a way to do this by rows, I would be set. My calculating functions extract the numbers into another array. If I could keep the headers, I would be able to label my outputs. – postelrich Sep 09 '12 at 04:57
  • But you *can* keep the headers, you just can't store them directly in the array. So, go ahead and do that. Having them stored in the array will be more of a hindrance than a help. – kindall Sep 09 '12 at 05:09
  • So if I understand your method, you're saying to declare an array of structs where each struct contains the name and a dynamic array to hold the numbers? – postelrich Sep 09 '12 at 05:15
  • user545424's answer is more along the lines of what I was thinking. – kindall Sep 09 '12 at 14:36