I am loading a data set into a pandas DataFrame from a jagged 3-D array called Waveform. The DataFrame has a three-level MultiIndex: events (entry), the photons generated by each event (subentry), and the data points of each photon (subsubentry).
The number of photons per event and the number of data points per photon both vary, which is why the array is jagged. For each event (entry), I want to extract the photons (subentries) that contain at least 2*n data points, take the average of the first n data points of each selected photon, and save the averages in a new DataFrame that keeps the corresponding event and photon index.
My actual data is too large to post here, so I will use a scaled-down example with the same structure.
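To pin down the rule, here is a minimal pure-Python sketch on a toy nested list (not my real data). The helper name first_n_means and the toy values are made up for illustration; it only describes the behaviour I am after and is not meant as an efficient implementation.

```python
def first_n_means(events, n):
    """Map (entry, subentry) -> mean of the first n points,
    for photons that have at least 2*n data points."""
    return {
        (e, p): sum(points[:n]) / n
        for e, photons in enumerate(events)
        for p, points in enumerate(photons)
        if len(points) >= 2 * n
    }

# Toy data: three events with a varying number of photons and points
toy = [
    [[1, 2, 5, 6, 8, 3, 21, 3], [0]],   # event 0: one long photon, one short
    [[1]],                              # event 1: nothing long enough
    [[12, 12, 12, 12, 125, 34]],        # event 2: one long photon
]

result = first_n_means(toy, n=3)
print(result)
```

With n = 3, only photons (0, 0) and (2, 0) have six or more points, so only those two keys appear in the result.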
import awkward as ak
import pandas as pd
# Generate an example awkward array and convert it to a pandas DataFrame
# (ak.to_pandas is the Awkward 1.x name; in Awkward 2.x it is ak.to_dataframe)
wf = ak.to_pandas(ak.Array([ [[1,2,5,6,8,3,21,3],[5986.472,0,6,1,2,3],[0]],[[1]],[[0.1,23,534,21,53,12],[0]],[[1],[2],[0],[12,12,12,12,125,34]],[[76],[23,23,43],],[[0],[12,12,12,12]] ]))
print(wf)
                             values
entry subentry subsubentry
0     0        0              1.000
               1              2.000
               2              5.000
               3              6.000
               4              8.000
               5              3.000
               6             21.000
               7              3.000
      1        0           5986.472
               1              0.000
               2              6.000
               3              1.000
               4              2.000
               5              3.000
      2        0              0.000
1     0        0              1.000
2     0        0              0.100
               1             23.000
               2            534.000
               3             21.000
               4             53.000
               5             12.000
      1        0              0.000
3     0        0              1.000
      1        0              2.000
      2        0              0.000
      3        0             12.000
               1             12.000
               2             12.000
               3             12.000
               4            125.000
               5             34.000
4     0        0             76.000
      1        0             23.000
               1             23.000
               2             43.000
5     0        0              0.000
      1        0             12.000
               1             12.000
               2             12.000
               3             12.000
# This is what I want the filter/extraction to produce
# (built from a fresh array, so the entry and subentry labels are renumbered here;
# the real result should keep the original labels)
wf_pF = ak.to_pandas(ak.Array([[[1,2,5,6,8,3,21,3],[5986.472,0,6,1,2,3]],[[0.1,23,534,21,53,12]],[[12,12,12,12,125,34]] ]))
print(wf_pF)
                             values
entry subentry subsubentry
0     0        0              1.000
               1              2.000
               2              5.000
               3              6.000
               4              8.000
               5              3.000
               6             21.000
               7              3.000
      1        0           5986.472
               1              0.000
               2              6.000
               3              1.000
               4              2.000
               5              3.000
1     0        0              0.100
               1             23.000
               2            534.000
               3             21.000
               4             53.000
               5             12.000
2     0        0             12.000
               1             12.000
               2             12.000
               3             12.000
               4            125.000
               5             34.000
# I then want to take the average of the first n data points (n = 3 here) and place them into a new dataframe, like this
averages = ak.to_pandas(ak.Array([[2.667,1997.491],[185.7],[12]]))
print(averages)
                 values
entry subentry
0     0           2.667
      1        1997.491
1     0         185.700
2     0          12.000
I first used query to find the rows whose subsubentry equals 2n - 1 (I used n = 3, so 5); such a row exists only for photons with at least 2n data points. I then took the index of the resulting dataframe, nQuery, and converted its entry and subentry levels into NumPy arrays:
nQuery = wf.query('subsubentry == 5')
# Entry and subentry labels of every row that has a data point at index 2n - 1
ind = nQuery.index.get_level_values("entry").to_numpy()
ind2 = nQuery.index.get_level_values("subentry").to_numpy()
Then I used query to extract the Entries with their respective subentries with the following:
wf_AF = wf.query("entry in @ind and subentry in @ind2")
print(wf_AF)
which results in this dataframe, wf_AF:
                             values
entry subentry subsubentry
0     0        0              1.000
               1              2.000
               2              5.000
               3              6.000
               4              8.000
               5              3.000
               6             21.000
               7              3.000
      1        0           5986.472
               1              0.000
               2              6.000
               3              1.000
               4              2.000
               5              3.000
2     0        0              0.100
               1             23.000
               2            534.000
               3             21.000
               4             53.000
               5             12.000
      1        0              0.000
3     0        0              1.000
      1        0              2.000
      3        0             12.000
               1             12.000
               2             12.000
               3             12.000
               4            125.000
               5             34.000
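To make the mismatch concrete, the sketch below compares the (entry, subentry) pairs found by the first query with the pairs that end up in wf_AF. The helper jagged_to_frame is a hypothetical stand-in for ak.to_pandas so that the snippet runs without awkward installed; the rest repeats my code above.

```python
import pandas as pd

def jagged_to_frame(events):
    """Hypothetical stand-in for ak.to_pandas: nested lists -> 3-level frame."""
    rows = [
        (e, p, i, float(v))
        for e, photons in enumerate(events)
        for p, points in enumerate(photons)
        for i, v in enumerate(points)
    ]
    df = pd.DataFrame(rows, columns=["entry", "subentry", "subsubentry", "values"])
    return df.set_index(["entry", "subentry", "subsubentry"])

wf = jagged_to_frame([
    [[1, 2, 5, 6, 8, 3, 21, 3], [5986.472, 0, 6, 1, 2, 3], [0]],
    [[1]],
    [[0.1, 23, 534, 21, 53, 12], [0]],
    [[1], [2], [0], [12, 12, 12, 12, 125, 34]],
    [[76], [23, 23, 43]],
    [[0], [12, 12, 12, 12]],
])

# Same steps as in my attempt
nQuery = wf.query("subsubentry == 5")
ind = nQuery.index.get_level_values("entry").to_numpy()
ind2 = nQuery.index.get_level_values("subentry").to_numpy()
wf_AF = wf.query("entry in @ind and subentry in @ind2")

# Pairs I meant to keep vs. pairs that actually survived the second query
wanted = {(int(e), int(p)) for e, p in zip(ind, ind2)}
kept = {(int(e), int(p)) for e, p in wf_AF.index.droplevel("subsubentry").unique()}
extra = sorted(kept - wanted)
print("wanted:", sorted(wanted))
print("extra: ", extra)
```

On this example the extra pairs are (2, 1), (3, 0) and (3, 1), none of which appear in nQuery, which makes me suspect the two "in" conditions are checked independently rather than as (entry, subentry) pairs.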
It still keeps subentries (photons) that contain fewer than the threshold of 2*n subsubentries (data points). What am I doing wrong? Is there something I am not understanding? How can I achieve this kind of filtering, and can it be implemented in cuDF? Because there is so much data, it would be ideal if I could replicate this in cuDF as well.
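One direction I am considering, as a sketch only: count the data points per (entry, subentry) group with groupby(...).transform("size"), keep the photons with at least 2*n points, then average the rows whose subsubentry is below n. This is plain pandas on the toy data; jagged_to_frame is a hypothetical stand-in for ak.to_pandas so the snippet runs without awkward, and I have not checked whether cuDF supports the same calls.

```python
import pandas as pd

def jagged_to_frame(events):
    """Hypothetical stand-in for ak.to_pandas: nested lists -> 3-level frame."""
    rows = [
        (e, p, i, float(v))
        for e, photons in enumerate(events)
        for p, points in enumerate(photons)
        for i, v in enumerate(points)
    ]
    df = pd.DataFrame(rows, columns=["entry", "subentry", "subsubentry", "values"])
    return df.set_index(["entry", "subentry", "subsubentry"])

wf = jagged_to_frame([
    [[1, 2, 5, 6, 8, 3, 21, 3], [5986.472, 0, 6, 1, 2, 3], [0]],
    [[1]],
    [[0.1, 23, 534, 21, 53, 12], [0]],
    [[1], [2], [0], [12, 12, 12, 12, 125, 34]],
    [[76], [23, 23, 43]],
    [[0], [12, 12, 12, 12]],
])

n = 3

# Number of data points in each photon, broadcast back onto every row
points_per_photon = wf.groupby(level=["entry", "subentry"])["values"].transform("size")

# Keep only photons with at least 2*n data points (original labels preserved)
wf_pF = wf[points_per_photon >= 2 * n]

# Restrict to the first n data points of each kept photon, then average
first_n = wf_pF[wf_pF.index.get_level_values("subsubentry") < n]
averages = first_n.groupby(level=["entry", "subentry"]).mean()
print(averages)
```

On the example this keeps photons (0, 0), (0, 1), (2, 0) and (3, 3), and the averages stay indexed by the original entry and subentry labels.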