Questions tagged [pandas]

Pandas is a Python library for data manipulation and analysis, e.g. dataframes, multidimensional time series and cross-sectional datasets commonly found in statistics, experimental science results, econometrics, or finance. Pandas is one of the main data science libraries in Python.

Pandas (the name derives from "panel data") is a Python library for data manipulation and analysis, e.g. of multidimensional time series and cross-sectional data sets commonly found in statistics, experimental science results, econometrics, or finance. Pandas is implemented primarily using NumPy and Cython, and it is designed to integrate easily with NumPy-based scientific libraries such as statsmodels.

Main Features:

  • Data structures: for one- and two-dimensional labeled datasets (respectively Series and DataFrames). Some of their main features include:
    • Automatic alignment of data on labels, and interpolation
    • Handling missing observations in calculations
    • Convenient slicing and reshaping ("reindexing") functions
    • Categorical data types
    • 'Group by' aggregation and transformation functionality
    • Tools for merging and joining together data sets
    • Simple Matplotlib integration for plotting and graphing
    • Multi-Indexing providing structure to indices that allow for representation of an arbitrary number of dimensions.
  • Date tools: objects for expressing date offsets or generating date ranges. Dates can be aligned to a specific time zone and converted or compared at will
  • Statistical models: early releases shipped convenient ordinary least squares and panel OLS implementations for in-sample or rolling time series and cross-sectional regressions. These have since been removed from pandas; regression models of this kind are available in statsmodels.
  • Intelligent Cython offloading; complex computations are performed rapidly due to these optimizations.
  • Static and moving statistical tools: mean, standard deviation, correlation, and covariance
  • Rich User Documentation, using Sphinx
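
A few of the features listed above can be sketched on toy data. This is only an illustration: the column names and values (`region`, `units`, `manager`) are invented and do not come from the tag wiki.

```python
import pandas as pd

# Series align on their index labels; a label missing on one side gives NaN.
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0], index=["b", "c"])
aligned_sum = s1 + s2            # 'a' has no partner in s2, so it becomes NaN

# Reductions skip missing observations by default.
total = aligned_sum.sum()        # 12.0 + 23.0

# 'Group by' aggregation on a small, made-up DataFrame.
sales = pd.DataFrame({"region": ["north", "south", "north"],
                      "units": [3, 5, 4]})
units_per_region = sales.groupby("region")["units"].sum()

# Merging two data sets on a shared key column.
managers = pd.DataFrame({"region": ["north", "south"],
                         "manager": ["Ann", "Bo"]})
merged = sales.merge(managers, on="region", how="left")

print(aligned_sum, total, units_per_region, merged, sep="\n\n")
```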

Asking Questions:

  • Before asking a question, make sure you have gone through the 10 Minutes to pandas introduction. It covers all the basic functionality of Pandas.
  • See this question on asking good questions: How to make good reproducible pandas examples
  • Please provide your Pandas and NumPy versions, and platform details (if appropriate), in your questions.
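
One way to collect the version and platform details mentioned above is sketched below. The report layout is just a suggestion; `pd.show_versions()` prints a more exhaustive listing if that is needed.

```python
import platform

import numpy as np
import pandas as pd

# Build a short report that can be pasted into a question.
version_report = "\n".join([
    f"pandas  : {pd.__version__}",
    f"numpy   : {np.__version__}",
    f"python  : {platform.python_version()}",
    f"platform: {platform.platform()}",
])
print(version_report)
```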

282843 questions
541 votes, 19 answers

How to flatten a hierarchical index in columns

I have a data frame with a hierarchical index in axis 1 (columns) (from a groupby.agg operation): USAF WBAN year month day s_PC s_CL s_CD s_CNT tempf sum sum sum sum amax amin 0 …
Ross R • 8,853 • 7 • 28 • 27
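
For the hierarchical-columns question above, one possible approach is to join the levels of each column label into a single string. This is a minimal sketch: the column names echo the excerpt, but the values and the choice of `"_"` as separator are assumptions.

```python
import pandas as pd

# Toy data; the values are invented for illustration.
df = pd.DataFrame({
    "year": [2020, 2020, 2021],
    "s_PC": [1, 2, 3],
    "tempf": [30.0, 50.0, 40.0],
})

# groupby.agg with lists of functions produces two-level (MultiIndex) columns.
agg = df.groupby("year").agg({"s_PC": ["sum"], "tempf": ["max", "min"]})

# Join the levels of each column label into one flat name.
agg.columns = ["_".join(levels) for levels in agg.columns.to_flat_index()]
print(list(agg.columns))
```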
540 votes, 30 answers

How do I create test and train samples from one dataframe with pandas?

I have a fairly large dataset in the form of a dataframe and I was wondering how I would be able to split the dataframe into two random samples (80% and 20%) for training and testing. Thanks!
tooty44 • 6,829 • 9 • 27 • 39
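
For the train/test question above, one pandas-only approach is to sample 80% of the rows and drop them to obtain the rest. This sketch uses an invented dataframe and assumes the index labels are unique; libraries such as scikit-learn offer dedicated helpers as an alternative.

```python
import numpy as np
import pandas as pd

# Made-up data with a unique default index.
df = pd.DataFrame({"x": np.arange(100), "y": np.arange(100) % 2})

# Draw 80% of the rows at random; random_state makes the split repeatable.
train = df.sample(frac=0.8, random_state=0)

# The remaining 20% are the rows whose index labels were not sampled.
test = df.drop(train.index)

print(len(train), len(test))
```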
536 votes, 9 answers

Improve subplot size/spacing with many subplots

I need to generate a whole bunch of vertically-stacked plots in matplotlib. The result will be saved using savefig and viewed on a webpage, so I don't care how tall the final image is, as long as the subplots are spaced so they don't overlap. No…
mcstrother • 6,867 • 5 • 22 • 18
527 votes, 8 answers

Python Pandas: Get index of rows where column matches certain value

Given a DataFrame with a column "BoolCol", we want to find the indexes of the DataFrame in which the values for "BoolCol" == True I currently have the iterating way to do it, which works perfectly: for i in range(100,3000): if…
I want badges • 6,155 • 5 • 23 • 38
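
For the "BoolCol" question above, a boolean column can be used directly as a mask on the index instead of iterating. A minimal sketch with invented data:

```python
import numpy as np
import pandas as pd

# Made-up frame with non-default index labels.
df = pd.DataFrame({"BoolCol": [True, False, False, True, True]},
                  index=[10, 20, 30, 40, 50])

# Index labels of the rows where BoolCol is True.
matching_labels = df.index[df["BoolCol"]].tolist()

# Integer positions of the same rows, if positions are wanted instead.
matching_positions = np.flatnonzero(df["BoolCol"].to_numpy()).tolist()

print(matching_labels, matching_positions)
```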
513 votes, 10 answers

Get first row value of a given column

This seems like a ridiculously easy question... but I'm not seeing the easy answer I was expecting. So, how do I get the value at an nth row of a given column in Pandas? (I am particularly interested in the first row, but would be interested in a…
Ahmed Haque • 7,174 • 6 • 26 • 33
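
For the question above, positional access with `.iloc` is one way to read the value at the nth row of a column regardless of the index labels. The frame below is invented for illustration.

```python
import pandas as pd

# Made-up frame with string index labels.
df = pd.DataFrame({"A": [5, 6, 7], "B": ["x", "y", "z"]},
                  index=["r1", "r2", "r3"])

# First row of column B, by position.
first_b = df["B"].iloc[0]

# Value at an arbitrary zero-based position n of column A.
n = 2
nth_a = df["A"].iloc[n]

print(first_b, nth_a)
```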
503 votes, 9 answers

Selecting/excluding sets of columns in pandas

I would like to create views or dataframes from an existing dataframe based on column selections. For example, I would like to create a dataframe df2 from a dataframe df1 that holds all columns from it except two of them. I tried doing the…
Amelio Vazquez-Reina • 91,494 • 132 • 359 • 564
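
For the column-selection question above, two common options are dropping the unwanted columns or listing the wanted ones. In this sketch the column names are assumptions, and both results are new dataframes rather than views.

```python
import pandas as pd

# Made-up source frame.
df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6], "D": [7, 8]})

# Everything except two columns; df1 itself is left unchanged.
df2 = df1.drop(columns=["B", "D"])

# Equivalent selection by naming the columns to keep.
subset = df1[["A", "C"]]

print(list(df2.columns), list(subset.columns))
```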
495 votes, 11 answers

Sorting columns in pandas dataframe based on column name

I have a dataframe with over 200 columns. The issue is as they were generated the order is ['Q1.3','Q6.1','Q1.2','Q1.1',......] I need to sort the columns as follows: ['Q1.1','Q1.2','Q1.3',.....'Q6.1',......] Is there some way for me to do this…
pythOnometrist • 6,531 • 6 • 30 • 50
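
For the column-ordering question above, one approach is to reindex the columns with a sorted list of their names. This is a sketch with a single invented row; note that plain `sorted` orders strings lexicographically, so a name like 'Q10.1' would sort before 'Q2.1' unless a custom key is supplied.

```python
import pandas as pd

# One made-up row; the column order mirrors the question.
df = pd.DataFrame([[1, 2, 3, 4]], columns=["Q1.3", "Q6.1", "Q1.2", "Q1.1"])

# Reorder the columns by their (string-sorted) names.
sorted_df = df.reindex(sorted(df.columns), axis=1)

print(list(sorted_df.columns))
```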
490 votes, 16 answers

Count the frequency that a value occurs in a dataframe column

I have a dataset:

    category
    cat a
    cat b
    cat a

I'd like to return something like the following, which shows the unique values and their frequencies:

    category  freq
    cat a     2
    cat b     1
yoshiserry • 20,175 • 35 • 77 • 104
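
For the frequency question above, `Series.value_counts` is one way to get the counts. The sketch below rebuilds the data from the excerpt; naming the output column `freq` follows the question.

```python
import pandas as pd

# Data taken from the question excerpt.
df = pd.DataFrame({"category": ["cat a", "cat b", "cat a"]})

# Count occurrences of each unique value, most frequent first.
counts = df["category"].value_counts()

# Turn the counts into a two-column frame: category, freq.
freq = counts.rename_axis("category").reset_index(name="freq")

print(freq)
```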
490 votes, 13 answers

Pandas conditional creation of a series/dataframe column

How do I add a color column to the following dataframe so that color='green' if Set == 'Z', and color='red' otherwise?

       Type  Set
    1  A     Z
    2  B     Z
    3  B     X
    4  C     Y
user7289 • 32,560 • 28 • 71 • 88
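
For the conditional-column question above, `numpy.where` is one vectorised way to pick between two values. The sketch rebuilds the frame from the excerpt.

```python
import numpy as np
import pandas as pd

# Frame reconstructed from the question excerpt.
df = pd.DataFrame({"Type": ["A", "B", "B", "C"],
                   "Set": ["Z", "Z", "X", "Y"]},
                  index=[1, 2, 3, 4])

# 'green' where Set equals 'Z', 'red' everywhere else.
df["color"] = np.where(df["Set"] == "Z", "green", "red")

print(df)
```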
488 votes, 18 answers

What does `ValueError: cannot reindex from a duplicate axis` mean?

I am getting a ValueError: cannot reindex from a duplicate axis when I am trying to set an index to a certain value. I tried to reproduce this with a simple example, but I could not do it. Here is my session inside of ipdb trace. I have a DataFrame…
Akavall • 82,592 • 51 • 207 • 251
485 votes, 15 answers

How to add an empty column to a dataframe?

What's the easiest way to add an empty column to a pandas DataFrame object? The best I've stumbled upon is something like df['foo'] = df.apply(lambda _: '', axis=1) Is there a less perverse method?
kjo • 33,683 • 52 • 148 • 265
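
For the empty-column question above, assigning a scalar broadcasts it to every row, which avoids `apply`. Whether "empty" should mean an empty string or a missing value depends on the use case, so the sketch shows both; the column name `bar` is invented.

```python
import numpy as np
import pandas as pd

# Made-up frame.
df = pd.DataFrame({"a": [1, 2, 3]})

# A column of empty strings.
df["foo"] = ""

# A column of missing values (float NaN).
df["bar"] = np.nan

print(df)
```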
476 votes, 7 answers

Create Pandas DataFrame from a string

In order to test some functionality I would like to create a DataFrame from a string. Let's say my test data looks like:

    TESTDATA="""col1;col2;col3
    1;4.4;99
    2;4.5;200
    3;4.7;65
    4;3.2;140
    """

What is the simplest way to read that data into a Pandas…
Emil L • 20,219 • 3 • 44 • 65
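
For the string-to-DataFrame question above, one approach is to wrap the string in `io.StringIO` so that `pd.read_csv` can treat it like a file. The test data is taken from the excerpt.

```python
import io

import pandas as pd

TESTDATA = """col1;col2;col3
1;4.4;99
2;4.5;200
3;4.7;65
4;3.2;140
"""

# StringIO gives read_csv a file-like object; the separator is a semicolon.
df = pd.read_csv(io.StringIO(TESTDATA), sep=";")

print(df)
```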
476 votes, 4 answers

How to sort a dataFrame in python pandas by two or more columns?

Suppose I have a dataframe with columns a, b and c, I want to sort the dataframe by column b in ascending order, and by column c in descending order, how do I do this?
Rakesh Adhikesavan • 11,966 • 18 • 51 • 76
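
For the multi-column sort question above, `DataFrame.sort_values` accepts a list of columns together with a matching list of sort directions. A sketch with invented values:

```python
import pandas as pd

# Made-up frame with the columns named in the question.
df = pd.DataFrame({"a": [1, 2, 3, 4],
                   "b": [2, 1, 2, 1],
                   "c": [10, 20, 30, 40]})

# Sort by b ascending, then by c descending within equal b.
sorted_df = df.sort_values(by=["b", "c"], ascending=[True, False])

print(sorted_df)
```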
475 votes, 15 answers

Get the row(s) which have the max value in groups using groupby

How do I find all rows in a pandas DataFrame which have the max value for count column, after grouping by ['Sp','Mt'] columns? Example 1: the following DataFrame: Sp Mt Value count 0 MM1 S1 a **3** 1 MM1 S1 n 2 2 MM1 S3 …
jojo12 • 4,853 • 3 • 14 • 7
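
For the group-maximum question above, one approach is to broadcast each group's maximum back onto the rows with `transform` and keep the rows that match it. The excerpt's table is truncated, so the data below is invented; only the column names follow the question. Ties are kept, so a group can contribute more than one row.

```python
import pandas as pd

# Invented data using the question's column names.
df = pd.DataFrame({
    "Sp": ["MM1", "MM1", "MM1", "MM2", "MM2"],
    "Mt": ["S1", "S1", "S3", "S3", "S3"],
    "Value": ["a", "n", "cb", "mk", "bg"],
    "count": [3, 2, 5, 8, 8],
})

# Per-row copy of the maximum count within each (Sp, Mt) group.
group_max = df.groupby(["Sp", "Mt"])["count"].transform("max")

# Keep every row whose count equals its group's maximum.
rows_with_max = df[df["count"] == group_max]

print(rows_with_max)
```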
471 votes, 6 answers

How to draw vertical lines on a given plot

Given a plot of a signal in time representation, how can I draw lines marking the corresponding time index? Specifically, given a signal plot with a time index ranging from 0 to 2.6 (seconds), I want to draw vertical red lines indicating the…
Francis • 6,416 • 5 • 24 • 32