I am getting NaN for the p-value when testing the null hypothesis that the mean revenue of the surf phone plan equals that of the ultimate plan, and I don't understand what I am doing wrong. I suspect it has to do with my DataFrame call_plan_merge: there are some NaN values in the monthly_revenue column (not visible in the sample posted here). Could that be the reason? On the other hand, the means passed to the test were calculated correctly (the NaNs in monthly_revenue were ignored), so I don't understand why the p-value would come out as NaN.

Here is my code:


#The average revenue from users of Ultimate and Surf calling plans differs.
average_rev_surf = call_plan_merge.query('tariff == "surf"')
average_rev_surf = average_rev_surf['monthly_revenue'].mean()

average_rev_ultimate = call_plan_merge.query('tariff == "ultimate"')
average_rev_ultimate = average_rev_ultimate['monthly_revenue'].mean()

alpha = 0.05  # critical statistical significance

results = st.ttest_1samp(average_rev_surf, average_rev_ultimate)

print('p-value:', results.pvalue)

if results.pvalue < alpha:
    print('We reject the null hypothesis')
else:
    print("We can't reject the null hypothesis") 
    
print('Average revenue for the surf plan is: {:.2f}$'.format(average_rev_surf))  
print('Average revenue for the ultimate plan is: {:.2f}$'.format(average_rev_ultimate))

Output:

p-value: nan
We can't reject the null hypothesis
Average revenue for the surf plan is: 35.77$
Average revenue for the ultimate plan is: 36.32$

This is what call_plan_merge looks like:

    user_id  call_month  total_calls  duration    tariff  reg_month  churn_month state  monthly_revenue  
0    1000.0        12.0         16.0     124.0  ultimate         12         13.0    GA            70.00  
1    1001.0         8.0         27.0     182.0      surf          8         13.0    WA            20.00  
2    1001.0         9.0         49.0     315.0      surf          8         13.0    WA            20.00  
3    1001.0        10.0         65.0     393.0      surf          8         13.0    WA            90.09  
4    1001.0        11.0         64.0     426.0      surf          8         13.0    WA            60.00  
5    1001.0        12.0         56.0     412.0      surf          8         13.0    WA            60.00  
6    1002.0        10.0         11.0      59.0      surf         10         13.0    NV            20.00  
7    1002.0        11.0         55.0     386.0      surf         10         13.0    NV            60.00  
8    1002.0        12.0         47.0     384.0      surf         10         13.0    NV            20.00  
9    1003.0        12.0        149.0    1104.0      surf          1         13.0    OK           158.12  

Thank you so much for your help!

Arturo Sbr
lm.bertrand

2 Answers

0

Your error is caused by average_rev_surf = average_rev_surf['monthly_revenue'].mean(), which replaces the sample with a single number (its mean). Moreover, you are not dealing with a single group but with two independent groups, so you are using the wrong function.

ttest_1samp() must receive an array-like sample as a and the population mean under the null hypothesis as popmean. By passing a=average_rev_surf, a single value, you are asking the function to compute a t statistic with 1 - 1 = 0 degrees of freedom. The sample standard deviation of one observation is undefined, so both the statistic and the p-value come out as NaN.
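To see the zero-degrees-of-freedom problem in isolation, here is a minimal sketch. The numbers are borrowed from the question's output and sample rows purely for illustration, and the variable names are made up for this example:

```python
import math
import warnings

import numpy as np
from scipy import stats as st

popmean = 36.32  # mean of the ultimate plan, taken from the question's output

with warnings.catch_warnings():
    # SciPy/NumPy warn about the degenerate sample; silence that for the demo
    warnings.simplefilter("ignore")
    # A sample with exactly one observation, like the result of .mean()
    one_obs = st.ttest_1samp(np.array([35.77]), popmean)

# A sample with several observations (surf rows from the posted table)
many_obs = st.ttest_1samp(np.array([20.0, 20.0, 90.09, 60.0, 60.0]), popmean)

print("one observation, p-value:", one_obs.pvalue)      # nan
print("five observations, p-value:", many_obs.pvalue)   # a finite number
```

The first call has nothing to estimate the spread from, so NaN is the only possible answer. The second call works, although a one-sample test is still not the right tool for comparing two plans.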

It seems like you have many different users, and each user has their own tariff. In order to test if their revenues are different, you should be using scipy.stats.ttest_ind() because your samples are independent.

Try something along the lines of:

# Monthly revs of surf users (select from the original DataFrame, not the scalar mean)
surf = call_plan_merge.loc[call_plan_merge['tariff'].eq('surf'), 'monthly_revenue']

# Monthly revs of ultimate users
ulti = call_plan_merge.loc[call_plan_merge['tariff'].eq('ultimate'), 'monthly_revenue']

# t-test for independent samples; nan_policy='omit' skips the NaNs in monthly_revenue
results = st.ttest_ind(a=surf, b=ulti, nan_policy='omit')
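Since the question mentions NaN values in monthly_revenue, it may also help to see how ttest_ind treats them. Below is a small sketch on a toy DataFrame. The values are invented for illustration and are not the asker's data, and the frame name toy is hypothetical:

```python
import numpy as np
import pandas as pd
from scipy import stats as st

# Toy data for illustration only; each plan has one missing revenue value
toy = pd.DataFrame({
    "tariff": ["surf", "surf", "surf", "surf",
               "ultimate", "ultimate", "ultimate", "ultimate"],
    "monthly_revenue": [20.0, 60.0, np.nan, 90.09,
                        70.0, 70.0, 84.0, np.nan],
})

surf = toy.loc[toy["tariff"].eq("surf"), "monthly_revenue"]
ulti = toy.loc[toy["tariff"].eq("ultimate"), "monthly_revenue"]

# Default nan_policy='propagate': any NaN in the input makes the result NaN
propagated = st.ttest_ind(surf, ulti)

# Two equivalent ways to ignore missing values
omitted = st.ttest_ind(surf, ulti, nan_policy="omit")
dropped = st.ttest_ind(surf.dropna(), ulti.dropna())

print("propagate:", propagated.pvalue)
print("omit:     ", omitted.pvalue)
print("dropna:   ", dropped.pvalue)
```

Unlike pandas' .mean(), which skips NaN by default, SciPy propagates it unless told otherwise, so either nan_policy='omit' or an explicit dropna() is needed. If the two plans may have different variances, passing equal_var=False (Welch's t-test) is worth considering.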
Arturo Sbr
-1
import statsmodels.api as sm
import pandas as pd

# Load the data into a pandas DataFrame
data = pd.read_csv('data.csv')

# Fit a GLM with a logit link function
model = sm.formula.glm(formula='y ~ x1 + x2 + x3', data=data, family=sm.families.Binomial()).fit()

# Print the summary of the model
print(model.summary())

# Get the p-values for the coefficients
p_values = model.pvalues

# Print the p-values
print('P-values:')
print(p_values)
Eric Aya
BH_PTL
  • Please read through [how to answer](https://stackoverflow.com/help/how-to-answer) and make sure your solution actually answers the question. The question is not about the `statsmodels` library. – AlexK Mar 31 '23 at 19:32