So the way I see it is that you do two things when sub-setting your data ready for analysis.
Pandas has a number of ways of doing each of these and some techniques that help get rows and columns. For new Pandas users it can be confusing as there is so much choice.
Do you use iloc, loc, brackets, query, isin, np.where, mask etc...
Method chaining
Now method chaining is a great way to work when data wrangling. In R they have a simple way of doing it, you select()
columns and you filter()
rows.
So if we want to keep things simple in Pandas why not use the filter()
for columns and the query()
for rows. These both return dataframes and so no need to mess-around with boolean indexing, no need to add df[ ]
round the return value.
So what does that look like:-
df.filter(['col1', 'col2', 'col3']).query("col1 == 'sometext'")
You can then chain on any other methods like groupby
, dropna()
, sort_values()
, reset_index()
etc etc.
By being consistent and using filter()
to get your columns and query()
to get your rows it will be easier to read your code when coming back to it after a time.
But filter can select rows?
Yes this is true but by default query()
get rows and filter()
get columns. So if you stick with the default there is no need to use the axis=
parameter.
query()
query() can be used with both and
/or
&
/|
you can also use comparison operators > , < , >= , <=, ==, !=
. You can also use Python in, not in.
You can pass a list to query using @my_list
Some examples of using query to get rows
df.query('A > B')
df.query('a not in b')
df.query("series == '2206'")
df.query("col1 == @mylist")
df.query('Salary_in_1000 >= 100 & Age < 60 & FT_Team.str.startswith("S").values')
filter()
So filter is basicly like using bracket df[]
or df[[]]
in that it uses the labels to select columns. But it does more than the bracket notation.
filter has like=
param so as to help select columns with partial names.
df.filter(like='partial_name',)
filter also has regex to help with selection
df.filter(regex='reg_string')
So to sum up this way of working might not work for ever situation e.g. if you want to use indexing/slicing then iloc is the way to go. But this does seem to be a solid way of working and can simplify your workflow and code.