
I have an existing two-column numpy array to which I need to add column names. Passing those in via dtype works in the toy example shown in Block 1 below. With my actual array, though, as shown in Block 2, the same approach has the unexpected (to me!) side effect of changing the array's dimensions.

How can I convert my actual array, the one named Y in the second block below, to an array having named columns, like I did for array A in the first block?

Block 1: (Columns of A named without reshaping dimension)

import numpy as np
A = np.array(((1,2),(3,4),(50,100)))
A
# array([[  1,   2],
#        [  3,   4],
#        [ 50, 100]])
dt = {'names':['ID', 'Ring'], 'formats':[np.int32, np.int32]}
A.dtype=dt
A
# array([[(1, 2)],
#        [(3, 4)],
#        [(50, 100)]], 
#       dtype=[('ID', '<i4'), ('Ring', '<i4')])

Block 2: (Naming columns of my actual array, Y, reshapes its dimension)

import numpy as np
## Code to reproduce Y, the array I'm actually dealing with
RING = [1,2,2,3,3,3]
ID = [1,2,3,4,5,6]
X = np.array([ID, RING])
Y = X.T
Y
# array([[1, 3],
#        [2, 2],
#        [3, 2],
#        [4, 1],
#        [5, 1],
#        [6, 1]])

## My unsuccessful attempt to add names to the array's columns    
dt = {'names':['ID', 'Ring'], 'formats':[np.int32, np.int32]}
Y.dtype=dt
Y
# array([[(1, 2), (3, 2)],
#        [(3, 4), (2, 1)],
#        [(5, 6), (1, 1)]], 
#       dtype=[('ID', '<i4'), ('Ring', '<i4')])

## What I'd like instead of the results shown just above
# array([[(1, 3)],
#        [(2, 2)],
#        [(3, 2)],
#        [(4, 1)],
#        [(5, 1)],
#        [(6, 1)]],
#       dtype=[('ID', '<i4'), ('Ring', '<i4')])
Josh O'Brien

5 Answers


First, because your question asks about giving names to arrays, I feel obligated to point out that using "structured arrays" just for the sake of names is probably not the best approach. We often like to give names to rows/columns when we're working with tables; if that's the case, I suggest you try something like pandas, which is awesome. If you simply want to organize some data in your code, a dictionary of arrays is often much better than a structured array; for example, you can do:

Y = {'ID':X[0], 'Ring':X[1]}

With that out of the way, if you want to use a structured array, here is the clearest way to do it in my opinion:

import numpy as np

RING = [1,2,2,3,3,3]
ID = [1,2,3,4,5,6]
X = np.array([ID, RING])

dt = {'names':['ID', 'Ring'], 'formats':[int, int]}
Y = np.zeros(len(RING), dtype=dt)
Y['ID'] = X[0]
Y['Ring'] = X[1]
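To round out the snippet above, here is a minimal sketch of how the resulting structured array behaves (same names and data as above); note that it is 1-D, with each field accessed by name:

```python
import numpy as np

RING = [1, 2, 2, 3, 3, 3]
ID = [1, 2, 3, 4, 5, 6]

dt = {'names': ['ID', 'Ring'], 'formats': [int, int]}
Y = np.zeros(len(RING), dtype=dt)
Y['ID'] = ID
Y['Ring'] = RING

print(Y.shape)    # (6,) -- a 1-D structured array, one record per row
print(Y['ID'])    # [1 2 3 4 5 6]
print(Y[0])       # the first record
```

If you need the (6, 1) two-dimensional layout from the question, `Y.reshape(len(Y), 1)` gets you there.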
Bi Rico
    Great. Especially as an R user, **pandas** does look awesome, but I won't be able to use it for this application. (I'm needing the Python code to perform some manipulations in an arcpy script, to be distributed as an ArcGIS toolbox to users who will only have available to them the Python installation provided to them by the ArcGIS installer. It includes **numpy** but not **pandas**, which is why I've got one hand (at least!) tied behind my back here.) So special thanks for supplying a pure **numpy** solution. As soon as I get a chance, I'll check how it plays with the rest of my code. – Josh O'Brien Jun 11 '14 at 19:35
  • FWIW, the "Code Sample" on the [help page for `arcpy.da.ExtendTable`](http://resources.arcgis.com/en/help/main/10.1/index.html#//018w0000000m000000) shows why I specifically need a **numpy** "structured array". – Josh O'Brien Jun 11 '14 at 19:46
  • Is there any longer a need for the intermediate creation of `X=np.array([ID,RING])`? Seems like doing `Y['ID'] = ID` has the same effect as `Y['ID'] = X[0]`, right? – Josh O'Brien Jun 11 '14 at 22:44
  • @JoshO'Brien, you're right. I just assumed your data started off in an array like `X`, but if you have it as `ID` and `RING` there is no need to create `X`. – Bi Rico Jun 11 '14 at 23:19
  • @BiRico I realize this is an old question, but what is your rationale to avoid using numpy structured arrays to be able to refer to columns by a specific name? – Blitzkoder May 02 '18 at 00:34
  • @Blitzkoder I might have overstated the case in the answer above. That being said, you should pick the right type for what you need. If you need to access data by name or key, use a class or a dictionary. If you need a dataframe, use pandas. If you need an array of C structures, use a structured array. – Bi Rico May 04 '18 at 01:19
  • @BiRico understood, thanks for the insight. I am unsure if it makes sense, but I was trying to understand if both performance and usability were being considered. Vectorization and automatic low-level optimizations may become part of what one may need if working with huge batches of large files. But I imagine that could merit its own separate question. – Blitzkoder May 04 '18 at 11:58
  • @Blitzkoder, numpy doesn't really have "low-level" optimizations for structured arrays. It has some nice optimizations for numeric types, but when your'e talking about mixed typed data you really want something more like pandas. See this [nice writeup](http://wesmckinney.com/blog/apache-arrow-pandas-internals/) about moving pandas away from using numpy to support Apache Arrow. – Bi Rico May 07 '18 at 05:27
  • I like the dictionary approach, because it avoids clutter (format spec, names of unused columns). With `ID = X[0]; Ring = X[1]`, the usage of the data would be even less noisy. Worth mentioning for beginners: neither approach implies copying of data. – Rainald62 Jul 22 '22 at 16:42

store-different-datatypes-in-one-numpy-array is another page that includes a nice solution for adding names to an array, so they can be used as columns. Example:

r = np.core.records.fromarrays([x1,x2,x3],names='a,b,c')
# x1, x2, x3 are flattened arrays
# a, b, c are the field names
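A runnable sketch of that approach, applied to the question's ID/RING data (using `np.rec.fromarrays`, the documented public alias of `np.core.records.fromarrays`):

```python
import numpy as np

ID = np.array([1, 2, 3, 4, 5, 6])
RING = np.array([1, 2, 2, 3, 3, 3])

# Build a record array with named fields from the two flat arrays
r = np.rec.fromarrays([ID, RING], names='ID,Ring')

print(r['ID'])    # access a column by field name
print(r.Ring)     # record arrays also support attribute access
print(r[0])       # indexing a position gives one whole row
```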
lX-Xl
  • Could you please explain this link and why it is appropriate? – Mozahler Mar 31 '18 at 01:17
  • @Mozahler [link](https://docs.scipy.org/doc/numpy/reference/generated/numpy.core.records.fromarrays.html#numpy.core.records.fromarrays) it create record array from flatten numpy arrays...like: r = np.core.records.fromarrays([x1,x2,x3],names='a,b,c')....x1, x2, x3 are flatten array....a,b,c are field name – lX-Xl Mar 31 '18 at 14:16
  • Instead of replying in a comment, you should update the answer you provided with this information. Your answer should make sense without someone having to read all the comments. – Mozahler Mar 31 '18 at 14:50
  • this is definitely the most straightforward and useful answer. We like one-liners. Indexing the result `r` at a position e.g. `r[0]` gives you the first row, while `r['a']` gives you that column. This is fantastic for storing points in N-dimensional space, where each dimension is named. – pretzlstyle Mar 11 '21 at 03:30

This is because Y is not C_CONTIGUOUS; you can check by inspecting Y.flags:

  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False

You can call Y.copy() or Y.ravel() first:

dt = {'names':['ID', 'Ring'], 'formats':[np.int32, np.int32]}
print(Y.ravel().view(dt))  # the result shape is (6, )
print(Y.copy().view(dt))   # the result shape is (6, 1)
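The same fix in a self-contained form. Note the explicit int32 dtype: `view` only collapses the last axis when two fields fit exactly into the bytes of one row, which holds for int32 here but not for the 64-bit default int on most modern platforms. `np.ascontiguousarray` (my substitution, not from the answer above) forces C-contiguous memory the same way `Y.copy()` does:

```python
import numpy as np

RING = [1, 2, 2, 3, 3, 3]
ID = [1, 2, 3, 4, 5, 6]
Y = np.array([ID, RING], dtype=np.int32).T   # F-contiguous, like the question's Y

dt = {'names': ['ID', 'Ring'], 'formats': [np.int32, np.int32]}

# Copy to C-contiguous memory, then reinterpret each row as one record
named = np.ascontiguousarray(Y).view(dt)     # shape (6, 1)
print(named.shape)
print(named['ID'].ravel())                   # [1 2 3 4 5 6]
```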
HYRY
  • Fascinating. Can't say that I fully understand *why* this makes the difference that it does, but I see now *that* it does. Also, for the record, it looks like `np.require(Y, requirements=['C'])` does +/- the same thing as `Y.copy()`, in terms of converting to C_CONTIGUOUS storage without changing the array dimension. – Josh O'Brien Jun 12 '14 at 04:04
  • Am I correct in thinking [this documentation](http://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html#internal-memory-layout-of-an-ndarray) is likely the best place for me to learn about why exactly the F_CONTIGUOUS transposed array `Y` acted the way it did when passed that dtype? – Josh O'Brien Jun 12 '14 at 04:52
  • @JoshO'Brien, try this `a = np.array((1, 2, 3, 4), dtype='>i4'); print(a); dt = {'names':['i1', 'i2'], 'formats':[np.int32, np.int32]}; a.dtype = dt; print(a)`. Changing the dtype of a numpy array is a very low-level operation which breaks all the abstractions that make numpy awesome. In fact I'm surprised the numpy developers allow it to be done so easily. In order to know what you'll get after the change, you have to know quite a bit about how numpy works and keep track of all the array's internals, i.e. memory layout, endian order, input/output data type, strides. – Bi Rico Jun 12 '14 at 19:23
  • @BiRico Excellent, helpful comment. I see now that `ravel()` and `copy()` take care of the `C_CONTIGUOUS` issue, but aren't 'safe' w.r.t. changes of storage format. The strategy you laid out -- initializing a new array with the desired storage type, and then copying data into it -- looks to be much safer, as the copy/insertion operations take care of any necessary type conversion. – Josh O'Brien Jun 12 '14 at 19:52

Are you completely sure about the outputs for A and Y? I get something different using Python 2.7.6 and numpy 1.8.1.

My initial output for A is the same as yours, as it should be. After running the following code for the first example

dt = {'names':['ID', 'Ring'], 'formats':[np.int32, np.int32]}
A.dtype=dt

the contents of array A are actually

array([[(1, 0), (2, 0)],
       [(3, 0), (4, 0)],
       [(50, 0), (100, 0)]],
      dtype=[('ID', '<i4'), ('Ring', '<i4')])

This makes somewhat more sense to me than your output, because dtype determines the data type of every element in the array: the new definition states that every element should contain two fields, so it does, and the value of the second field is set to 0 because there was no pre-existing value for it.

However, if you would like to make numpy group columns of your existing array so that every row contains only one element, but with each element having two fields, you could introduce a small code change.

Since a tuple is needed to make numpy group elements into a more complex data-type, you could make this happen by creating a new array and turning every row of the existing array into a tuple. Here is a simple working example

import numpy as np
A = np.array(((1,2),(3,4),(50,100)))
dt = np.dtype([('ID', np.int32), ('Ring', np.int32)])
B = np.array(list(map(tuple, A)), dtype=dt)

Using this short piece of code, array B becomes

array([(1, 2), (3, 4), (50, 100)],
      dtype=[('ID', '<i4'), ('Ring', '<i4')])

To make B a 2D array, it is enough to write

B.reshape(len(B), 1) # in this case, even B.size would work instead of len(B)

For the second example, the similar thing needs to be done to make Y a structured array:

Y = np.array(list(map(tuple, X.T)), dtype=dt)

After doing this for your second example, array Y looks like this

array([(1, 3), (2, 2), (3, 2), (4, 1), (5, 1), (6, 1)],
      dtype=[('ID', '<i4'), ('Ring', '<i4')])

Note that the output is not the same as the one you expected, but this one is simpler: instead of writing Y[0,0] to get the first element, you can just write Y[0]. To make this array 2D as well, you can use reshape, just as with B.
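To make the indexing difference concrete, here is a small sketch contrasting the 1-D structured array with its reshaped 2-D form:

```python
import numpy as np

dt = np.dtype([('ID', np.int32), ('Ring', np.int32)])
RING = [1, 2, 2, 3, 3, 3]
ID = [1, 2, 3, 4, 5, 6]
X = np.array([ID, RING])

# Turn each row of X.T into a tuple so numpy groups it into one record
Y = np.array(list(map(tuple, X.T)), dtype=dt)
Y2 = Y.reshape(len(Y), 1)      # the 2-D shape from the question

print(Y[0])       # one index suffices in the 1-D version
print(Y2[0, 0])   # the 2-D version needs both indices
print(Y['ID'])    # field access by name works the same either way
```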

hgazibara
  • Yep, I'm sure about that output. I just confirmed it by rerunning in a fresh Python session ('2.7.6 |Anaconda 1.9.2 (32-bit) and numpy 1.8.0 on a Windows 7 64-bit laptop). Thanks for the rest of your answer, which, as a Python newbie, will take me a while to digest. Two questions, though. (1) For `B`, it looks like I'd want to do `B.reshape(len(B),1)` to get a two column array, right? (2) How would I "do[] a similar thing for [my] second example", in a scripted way. (Because the real array will have 1000+ elements, I won't be able to write out the array like you did in that final code block.) – Josh O'Brien Jun 11 '14 at 19:11
  • @JoshO'Brien, that's the key, 32 bit. Your block A only works because you're using a 32-bit version of python/numpy. In general changing the dtype of an array is probably a bad idea and, as we see here, will produce different results on different systems. – Bi Rico Jun 11 '14 at 19:18
  • @BiRico Thanks. It's a bad idea even though `A.dtype` (from before my manual assignment to it) returns `dtype('int32')`? How then should I assign names to the array without specifying **some** value for `format`, or should I be assigning it some different format? (FTR, I'm sure it's me who's missing something here, but I do suspect the bits and formats are key, which is why I had first checked the value of `A.dtype`...) – Josh O'Brien Jun 11 '14 at 19:26
  • @JoshO'Brien You are right about a way of transforming `B` into 2D array. To answer your second question, I have updated my answer. – hgazibara Jun 11 '14 at 21:57

Try re-writing the definition of X:

X = np.array(zip(ID, RING))

and then you don't need to define Y = X.T
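This answer was written for Python 2, where `zip` returns a list; on Python 3 you need to wrap it in `list()` (a minor adjustment, not part of the original answer):

```python
import numpy as np

RING = [1, 2, 2, 3, 3, 3]
ID = [1, 2, 3, 4, 5, 6]

X = np.array(list(zip(ID, RING)))   # list() needed on Python 3
print(X.shape)                      # (6, 2), same layout as Y = X.T in the question
print(X.flags['C_CONTIGUOUS'])      # True -- unlike the transposed Y
```

Because this array is built directly in row-per-record order, it is C-contiguous, which is why the dtype trick behaves differently on it than on the transposed `Y` (see HYRY's answer above).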

jonnybazookatone
  • That seems bizarre. Running `Z = np.array(zip(ID, RING))` and then `numpy.array_equiv(Y,Z)` (with `Y=X.T`) yields `'True'`. So `Y` and `Z` look identical (and display equivalent values for `dtype`). But when I do `Z.dtype=dt`, I get the array I was after, whereas `Y.dtype=dt` still gives me the array I don't. Any idea why that'd be?? – Josh O'Brien Jun 11 '14 at 19:21
  • Re-write the entire `X`? I thought only the column names of `X` should be affected? – Gathide Sep 15 '21 at 08:33