
I have a dataframe with X position data for hundreds of participants, and three grouping variables (with each participant's X data being 1000 points in length). Preview of dataframe:

          X    Z  participantNum  obsScenario  startPos  targetPos
16000 -16.0 -5.0         6950203            2         2          3
16001 -16.0 -5.0         6950203            2         2          3
16002 -16.0 -5.0         6950203            2         2          3
16003 -16.0 -5.0         6950203            2         2          3
16004 -16.0 -5.0         6950203            2         2          3
16005 -16.0 -5.0         6950203            2         2          3
16006 -16.0 -5.0         6950203            2         2          3
16007 -16.0 -5.0         6950203            2         2          3
16008 -16.0 -5.0         6950203            2         2          3
16009 -16.0 -5.0         6950203            2         2          3

I need to pass all of the X data into a function, with the X data grouped by the 3 grouping variables and with each X data array in its own column. Right now they are all stacked on top of each other.

These are the functions: (It goes through calc_confidence_interval first)

import numpy as np
import scipy.stats

def mean_confidence_interval(data, confidence=0.95):
    a = 1.0*np.array(data)
    n = len(a)
    m, se = np.mean(a), scipy.stats.sem(a)
    # half-width of the interval from the Student t quantile
    h = se * scipy.stats.t.ppf((1+confidence)/2., n-1)
    return m, m+h, m-h

def calc_confidence_interval(data):
    mean_ci = []
    top_ci = []
    bottom_ci = []
    # each column of data is one timestep across all arrays
    for column in data.T:
        m, t, b = mean_confidence_interval(column)
        mean_ci.append(m)
        top_ci.append(t)
        bottom_ci.append(b)
    return mean_ci, top_ci, bottom_ci

And I'm trying to make something like this work:

calc_CI = df.groupby(['obsScenario', 'startPos', 'targetPos'])['X'].apply(calc_confidence_interval)
calc_CI = calc_CI.join(calc_CI.rename('calc_CI'), 
        on = ['obsScenario', 'startPos', 'targetPos'])

But I'm getting the error: TypeError: object of type 'numpy.float64' has no len(), because it is currently passing the X data as a single array rather than separate columns for each participant, grouped by the three grouping variables.

## Traceback
```python
--------------------------------------------------------------------------    
calc_CI = allDataF.groupby(['obsScenario', 'startPos', 'targetPos'])['X'].apply(calc_confidence_interval)

  File "/opt/anaconda3/lib/python3.8/site-packages/pandas/core/groupby/generic.py", line 226, in apply
    return super().apply(func, *args, **kwargs)

  File "/opt/anaconda3/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 870, in apply
    return self._python_apply_general(f, self._selected_obj)

  File "/opt/anaconda3/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 892, in _python_apply_general
    keys, values, mutated = self.grouper.apply(f, data, self.axis)

  File "/opt/anaconda3/lib/python3.8/site-packages/pandas/core/groupby/ops.py", line 213, in apply
    res = f(group)

  File "/Users/lillyrigoli/Desktop/PhD/PhD_Projects/RouteSelection/Analysis_RS/load_filter_plot_CI_RS.py", line 221, in calc_confidence_interval
    m, t,b=mean_confidence_interval(column)

  File "/Users/lillyrigoli/Desktop/PhD/PhD_Projects/RouteSelection/Analysis_RS/load_filter_plot_CI_RS.py", line 210, in mean_confidence_interval
    n = len(a)

TypeError: object of type 'numpy.float64' has no len()
```

The functions return the confidence intervals (top, mean & bottom) as lists.

The output I should get at the end is like this, with the output (mean_ci, top_ci, bottom_ci arrays) for each grouping combination.

obsScenario  startPos  targetPos  mean_ci                 top_ci                 bottom_ci
0             1          1     [array of length 1000] [array of length 1000] [array of length 1000]  
0             2          2     [array of length 1000] [array of length 1000] [array of length 1000]  
1             1          1     [array of length 1000] [array of length 1000] [array of length 1000] 
1             2          2     [array of length 1000] [array of length 1000] [array of length 1000] 
CentauriAurelius
  • It's unclear what your expected output is. Could you maybe show a dummy output that you are expecting for the sample you have provided above? – Akshay Sehgal Feb 05 '21 at 05:06
  • What is the role of `for column in data.T:` in the second function? Do you want to apply this to multiple columns such as X, Z? – Akshay Sehgal Feb 05 '21 at 05:08
  • Yes, I'll add expected output. The role is to get the confidence interval values (top, middle & bottom) at each timestep. The arrays should each be in their own separate columns (after grouping) so that after transpose, one column is all the X values (across all the arrays) for that timestep. All arrays are 1000 points in length, so after transpose the size would be (N, 1000) where N = the number of arrays for that grouping combination. – CentauriAurelius Feb 05 '21 at 05:11
  • In your function try by adding the column in a list as follows: [column] – Billy Bonaros Feb 05 '21 at 05:59
  • I get the error: AttributeError: 'list' object has no attribute 'T' – CentauriAurelius Feb 05 '21 at 06:02
  • only in this step: m, t,b=mean_confidence_interval([column]) – Billy Bonaros Feb 05 '21 at 06:03
  • please be sure that you are not using the apply(list) – Billy Bonaros Feb 05 '21 at 06:03
  • It did run, but it gave a list of 'nan' for the top_ci and bottom_ci output. I edited the expected output a bit for efficiency and clarity – CentauriAurelius Feb 05 '21 at 06:15
  • Your function returns NULL when you have 1 item only for the bottom and top. -16 in our case – Billy Bonaros Feb 05 '21 at 08:00
  • The top and bottom are just the mean plus a value and the mean minus a value. So if it's returning the mean ci, it should also return the top & bottom ci – CentauriAurelius Feb 05 '21 at 08:03

1 Answer

I think you may have more success explicitly iterating over the groups than trying to use apply, which seems to be adding complexity to what you are trying to do.

results = []
groupby = df.groupby(['obsScenario', 'startPos', 'targetPos'])
# iterating over a groupby yields (group_name, group_dataframe) pairs
for group_name, groupdf in groupby:
    # call your functions here
    # append results to results
    pass
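To make the loop above concrete, here is a minimal sketch of one way to fill it in. It rests on assumptions that are not confirmed by the question: that every participant in a group contributes the same number of X samples, that the rows are already in time order, and that each participant appears once per grouping combination. The helper name `ci_per_group` is hypothetical. The reason `apply` on `['X']` fails is that each group arrives as a single 1-D Series, so iterating over `data.T` yields scalars; the sketch instead stacks one row per participant before calling the CI function.

```python
import numpy as np
import pandas as pd
import scipy.stats


def mean_confidence_interval(data, confidence=0.95):
    a = np.asarray(data, dtype=float)
    n = len(a)
    m, se = np.mean(a), scipy.stats.sem(a)
    h = se * scipy.stats.t.ppf((1 + confidence) / 2.0, n - 1)
    return m, m + h, m - h


def calc_confidence_interval(data):
    # data has shape (n_participants, n_timesteps);
    # iterating over data.T visits one timestep at a time
    cis = [mean_confidence_interval(column) for column in data.T]
    mean_ci, top_ci, bottom_ci = (np.array(v) for v in zip(*cis))
    return mean_ci, top_ci, bottom_ci


def ci_per_group(df, keys=("obsScenario", "startPos", "targetPos")):
    """Hypothetical helper: one row of CI arrays per grouping combination."""
    rows = []
    for group_name, groupdf in df.groupby(list(keys)):
        # Assumption: equal-length, time-ordered X data for every participant
        trials = [p["X"].to_numpy() for _, p in groupdf.groupby("participantNum")]
        matrix = np.vstack(trials)  # shape (n_participants, n_timesteps)
        mean_ci, top_ci, bottom_ci = calc_confidence_interval(matrix)
        rows.append((*group_name, mean_ci, top_ci, bottom_ci))
    return pd.DataFrame(rows, columns=[*keys, "mean_ci", "top_ci", "bottom_ci"])
```

Note that a group containing only one participant will give `nan` for the top and bottom arrays, because the standard error of a single sample is undefined.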

It may also be the case that you just need to pass additional arguments to apply for your functions to work as intended. On a groupby object, apply forwards any extra positional and keyword arguments that follow the function to the applied function, in addition to the array/series. (The args tuple parameter belongs to Series.apply and DataFrame.apply, not to groupby apply.)

calc_CI = df.groupby(['obsScenario', 'startPos', 'targetPos'])['X'].apply(calc_confidence_interval, arg1, arg2, ...)
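As a small illustration of argument forwarding, the sketch below passes a `confidence` keyword through groupby `apply`. The helpers `ci_half_width` and `half_widths` are hypothetical names, and the function pools all X samples in a group into one interval, which is simpler than the per-timestep calculation in the question; it is only meant to show how the extra argument reaches the applied function.

```python
import pandas as pd
import scipy.stats


def ci_half_width(x, confidence=0.95):
    # Hypothetical helper: half-width of the CI for the mean of one group's X values
    a = x.to_numpy(dtype=float)
    se = scipy.stats.sem(a)
    return se * scipy.stats.t.ppf((1 + confidence) / 2.0, len(a) - 1)


def half_widths(df, confidence):
    grouped = df.groupby(["obsScenario", "startPos", "targetPos"])["X"]
    # Keyword arguments after the function are forwarded to ci_half_width
    return grouped.apply(ci_half_width, confidence=confidence)
```

Calling `half_widths(df, 0.99)` should return one value per grouping combination, each wider than the corresponding 0.95 result.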

Eric Truett