
I have a multiindex pandas dataframe that looks like this (called p_z):

                     p_z
entry subentry
0     0         0.338738
      1         0.636035
      2        -0.307365
      3        -0.167779
      4         0.243284
...                  ...
26692 891      -0.459227
      892       0.055993
      893      -0.469857
      894       0.192554
      895       0.155738

[11742280 rows x 1 columns]

I want to be able to select certain rows based on another dataframe (or numpy array) which is multidimensional. It would look like this as a pandas dataframe (called tofpid):

                tofpid
entry subentry
0     0              0
      1              2
      2              4
      3              5
      4              7
...                ...
26692 193          649
      194          670
      195          690
      196          725
      197          737

[2006548 rows x 1 columns]

I also have it as an awkward array, where it's a (26692,) array (each entry has a variable number of subentries). This is a selection df/array that tells the p_z df which rows to keep. So in entry 0 of p_z, it should keep subentries 0, 2, 4, 5, 7, etc.
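
In plain Python terms, for a single entry the selection I'm after would look something like this (using the first few p_z values shown above):

p_z_entry0 = [0.338738, 0.636035, -0.307365, -0.167779, 0.243284]  # p_z values for entry 0
keep = [0, 2, 4]                                                    # subentries to keep for entry 0
selected = [p_z_entry0[i] for i in keep]                            # [0.338738, -0.307365, 0.243284]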

I can't find a way to get this done in pandas. I'm new to pandas, and even newer to MultiIndex, but I feel there ought to be a way to do this. If it can be broadcast, even better, as I'll be doing this over ~1500 dataframes of similar size. If it helps, these dataframes come from a *.root file imported using uproot (if there's another way to do this without pandas, I'll take it, but I would love to use pandas to keep things organised).

Edit: Here's a reproducible example (courtesy of Jim Pivarski's answer; thanks!).

import awkward as ak
import pandas as pd

p_z = ak.Array([[ 0.338738, 0.636035, -0.307365, -0.167779, 0.243284,
                  0.338738, 0.636035],
                [-0.459227, 0.055993, -0.469857,  0.192554, 0.155738,
                 -0.459227]])
p_z = ak.to_pandas(p_z)

tofpid = ak.Array([[0, 2, 4, 5], [1, 2, 4]])
tofpid = ak.to_pandas(tofpid)

Both of these dataframes are produced natively by uproot, but the snippet above will reproduce the same dataframes that uproot would produce (using the awkward library).

HEP N008

2 Answers


IIUC:

Input data:

>>> p_z
                     p_z
entry subentry
0     0         0.338738
      1         0.636035
      2        -0.307365
      3        -0.167779
      4         0.243284

>>> tofpid
                tofpid
entry subentry
0     0              0
      1              2
      2              4
      3              5
      4              7

Create a new multiindex from the columns (entry, tofpid) of your second dataframe:

mi = pd.MultiIndex.from_frame(tofpid.reset_index(level='subentry', drop=True)
                                    .reset_index())

Output result:

>>> p_z.loc[mi.intersection(p_z.index)]
              p_z
entry
0     0  0.338738
      2 -0.307365
      4  0.243284
Corralien
  • Does this work if `tofpid` has a second entry with values like `1, 2, 4`, meaning "index `1` of the second entry of `p_z`, index `2` of the second entry of `p_z`, index `4` of the second entry of `p_z`"? I think this is what the OP wants, though they could clarify if I'm wrong. Note that the last entry number (26692) times the last subentry number (895) is of the same order as the number of rows (11742280) in their example, suggesting that the subentry numbers are not absolute positions but positions-within-the-entry. – Jim Pivarski Aug 04 '21 at 13:24
  • This works! It's a mite slow, but it does slice the dataframe the way I wanted it to. Thanks! – HEP N008 Aug 04 '21 at 15:39

Here's a reproducible example with enough structure to represent the problem (using the awkward library):

>>> import awkward as ak
>>> 
>>> p_z = ak.Array([
...     [ 0.338738, 0.636035, -0.307365, -0.167779, 0.243284,  0.338738, 0.636035],
...     [-0.459227, 0.055993, -0.469857,  0.192554, 0.155738, -0.459227],
... ])
>>> p_z
<Array [[0.339, 0.636, ... 0.156, -0.459]] type='2 * var * float64'>
>>> 
>>> tofpid = ak.Array([[0, 2, 4, 5], [1, 2, 4]])
>>> tofpid
<Array [[0, 2, 4, 5], [1, 2, 4]] type='2 * var * int64'>

In Pandas form, this is:

>>> df_p_z = ak.to_pandas(p_z)
>>> df_p_z
                  values
entry subentry          
0     0         0.338738
      1         0.636035
      2        -0.307365
      3        -0.167779
      4         0.243284
      5         0.338738
      6         0.636035
1     0        -0.459227
      1         0.055993
      2        -0.469857
      3         0.192554
      4         0.155738
      5        -0.459227
>>> df_tofpid = ak.to_pandas(tofpid)
>>> df_tofpid
                values
entry subentry        
0     0              0
      1              2
      2              4
      3              5
1     0              1
      1              2
      2              4

As an Awkward Array, what you want to do is slice the first array by the second. That is, you want p_z[tofpid]:

>>> p_z[tofpid]
<Array [[0.339, -0.307, ... -0.47, 0.156]] type='2 * var * float64'>
>>> p_z[tofpid].tolist()
[[0.338738, -0.307365, 0.243284, 0.338738], [0.055993, -0.469857, 0.155738]]
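
If a pandas DataFrame of the selected rows is what's ultimately needed, the sliced array can be converted back directly (a small sketch; note that after slicing, the subentry numbers are renumbered 0, 1, 2, ... within each entry):

>>> ak.to_pandas(p_z[tofpid])
                  values
entry subentry
0     0         0.338738
      1        -0.307365
      2         0.243284
      3         0.338738
1     0         0.055993
      1        -0.469857
      2         0.155738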

Using Pandas, I managed to do it with this:

>>> df_p_z.loc[df_tofpid.reset_index(level=0).apply(lambda x: tuple(x.values), axis=1).tolist()]
                  values
entry subentry          
0     0         0.338738
      2        -0.307365
      4         0.243284
      5         0.338738
1     1         0.055993
      2        -0.469857
      4         0.155738

What's happening here is that df_tofpid.reset_index(level=0) turns the "entry" part of the MultiIndex into a column; apply with axis=1 then runs a Python function on each row (each row is passed as x, and x.values are its cell values, so tuple(x.values) is an (entry, subentry-to-keep) pair); finally, tolist() turns the result into a list of tuples like

>>> df_tofpid.reset_index(level=0).apply(lambda x: tuple(x.values), axis=1).tolist()
[(0, 0), (0, 2), (0, 4), (0, 5), (1, 1), (1, 2), (1, 4)]

This is what loc needs to select entry/subentry pairs from its MultiIndex.

My Pandas solution has two disadvantages: it's complicated, and it goes through Python iteration and objects, which doesn't scale as well as arrays. There is a good chance that a Pandas expert would find a better solution than mine. There's a lot I don't know about Pandas.
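
One possibility along those lines (a sketch, not tested beyond this example) is to build the target MultiIndex directly from the "entry" index level and the column of desired subentries, which avoids the per-row apply; it assumes the column is named values, as ak.to_pandas produced above:

>>> import pandas as pd
>>> mi = pd.MultiIndex.from_arrays(
...     [df_tofpid.index.get_level_values("entry"), df_tofpid["values"]],
...     names=["entry", "subentry"],
... )
>>> df_p_z.loc[mi]  # selects the same rows as the apply-based version above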

Jim Pivarski
  • I suppose one could just slice the awkward array and then convert to pandas; I'd like to skip that step if possible as I'm trying to save everything as *.pkl files for later analysis (including later slicing by different criteria), but this will work for now! – HEP N008 Aug 04 '21 at 15:25
  • Awkward Arrays can be pickled, although Parquet would be a more efficient file format (file size and read/write speed) for both Pandas and Awkward (only matters if your data are bigger than, say, megabytes). – Jim Pivarski Aug 04 '21 at 20:05
  • I didn't know you could pickle awkward arrays; I'll have to look into that. Thanks! – HEP N008 Aug 04 '21 at 22:37
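
For reference, a minimal sketch of the two storage options mentioned in the comments above (the file names are placeholders, and the Parquet calls need pyarrow installed):

import pickle
import awkward as ak

# Awkward Arrays can be pickled directly ...
with open("p_z.pkl", "wb") as f:
    pickle.dump(p_z, f)

# ... or written to Parquet, which is typically smaller and faster to read back.
ak.to_parquet(p_z, "p_z.parquet")
p_z_back = ak.from_parquet("p_z.parquet")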