Edit: All sorted, Imanol Luengo set me straight. Here's the end result, in all its glory.

I don't understand the counts of my variables, maybe someone can explain? I'm filtering two columns of pass/fails for two locations. I want a count of all 4 pass/fails.

Here are the first five rows of the columns. There are 126 values in total:

  WT Result School
0         p  Milan
1         p   Roma
2         p  Milan
3         p  Milan
4         p   Roma

Code so far:

data2 = pd.DataFrame(data[['WT Result', 'School']])
data2.dropna(inplace=True)

# Milan Counts
m_p = (data2['School']=='Milan') & (data2['WT Result']=='p')
milan_p = (m_p==True)
milan_pass = np.count_nonzero(milan_p)   # Count of Trues for Milano

# Rome Counts
r_p = (data2['School']=='Roma') & (data2['WT Result']=='p')
rome_p = (r_p==True)
rome_pass = np.count_nonzero(rome_p)   # Count of Trues for Rome

So what I've done, after stripping the excess columns (data2), is:

  • filter by location and == 'p' (vars m_p and r_p)
  • filter then by ==True (vars milan_p and rome_p)
  • Do a count_nonzero() for a count of 'True' (vars milan_pass and rome_pass)

Here's what I don't understand - these are the lengths of the variables:

data2:  126 
m_p:  126 
r_p:  126 
milan_p:  126 
rome_p:  126 
milan_pass:  55 
rome_pass:  47

Why do the lengths remain 126 once the filtering starts? To me, this shows that neither the filtering by location nor the filtering by 'p' worked. But when I do the final count_nonzero(), the results are suddenly separated by location. What is happening?


1 Answer

You are not filtering, you are masking. Step by step:

m_p = (data2['School']=='Milan') & (data2['WT Result']=='p')

Here m_p is a boolean array (a pandas Series) with the same length as a column of data2, which is why its length is still 126. Each element of m_p is True if that row satisfies both conditions, and False otherwise.

milan_p = (m_p==True)

The above line is completely redundant. m_p is already a boolean array, so comparing it to True just creates a copy of it. Thus, milan_p is another boolean array with the same length as m_p.

milan_pass = np.count_nonzero(milan_p)

This just counts the nonzero (i.e. True) elements of milan_p and returns that count as a single integer, so milan_pass is a number rather than an array with a length. Of course, it matches the number of rows that you want to filter, but you are not filtering anything here.

Exactly the same applies to the Rome condition.
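To make the mask-versus-count distinction concrete, here is a rough sketch on made-up data. The `toy` frame, its six rows, and the 'f' label for a fail are assumptions for illustration only; they are not your 126 rows.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data2 (6 made-up rows, 'f' assumed to mean fail)
toy = pd.DataFrame({
    "WT Result": ["p", "f", "p", "p", "f", "p"],
    "School": ["Milan", "Milan", "Roma", "Milan", "Roma", "Roma"],
})

# The mask holds one True/False per row, so it is as long as the frame
mask = (toy["School"] == "Milan") & (toy["WT Result"] == "p")
mask_len = len(mask)

# Comparing a boolean mask to True gives back the same values
same = (mask == True).equals(mask)

# Counting the Trues yields a single integer, not a shorter array
n_pass = int(np.count_nonzero(mask))
n_pass_sum = int(mask.sum())  # equivalent count, since True counts as 1
```

The mask stays at the full row count, and only the count collapses to one number. That is the same pattern as your 126 versus 55.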


If you want to filter rows in pandas, you have to slice the dataframe with your newly generated mask:

filtered_milan = data2[m_p]

or alternatively

filtered_milan = data2[milan_p] # as m_p == milan_p

The above lines select the rows that have a True value in the mask (or condition), ignoring the False rows in the process.
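As a small sketch of the difference in lengths, again on made-up data (the `toy` frame and the 'f' fail label are assumptions, not your actual data): the mask keeps the full length, while the sliced frame only keeps the matching rows.

```python
import pandas as pd

# Hypothetical stand-in for data2 (6 made-up rows, 'f' assumed to mean fail)
toy = pd.DataFrame({
    "WT Result": ["p", "f", "p", "p", "f", "p"],
    "School": ["Milan", "Milan", "Roma", "Milan", "Roma", "Roma"],
})

mask = (toy["School"] == "Milan") & (toy["WT Result"] == "p")

# Indexing the frame with the mask is the actual filtering step
filtered = toy[mask]

mask_len = len(mask)          # still one entry per original row
filtered_len = len(filtered)  # only the rows where the mask is True
schools_left = filtered["School"].unique().tolist()
```

Here `len(filtered)` is the number you were expecting to see when you checked the lengths of m_p and milan_p.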

The same applies to the second case, Rome.
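Since the goal was a count of all four pass/fail combinations, one possible shortcut is `pd.crosstab`, which tabulates every School and WT Result pair in one call. This is a different technique from the masks above, shown as a sketch on made-up data; the `toy` frame and the 'f' fail label are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for data2 (6 made-up rows, 'f' assumed to mean fail)
toy = pd.DataFrame({
    "WT Result": ["p", "f", "p", "p", "f", "p"],
    "School": ["Milan", "Milan", "Roma", "Milan", "Roma", "Roma"],
})

# Rows are schools, columns are results, cells are row counts
counts = pd.crosstab(toy["School"], toy["WT Result"])

milan_pass = int(counts.loc["Milan", "p"])
roma_fail = int(counts.loc["Roma", "f"])
total = int(counts.to_numpy().sum())  # every row lands in exactly one cell
```

Each cell is the same number you would get by building the corresponding mask and counting its True values.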

Imanol Luengo