Questions tagged [pandas]

Pandas is a Python library for data manipulation and analysis, e.g. dataframes, multidimensional time series and cross-sectional datasets commonly found in statistics, experimental science results, econometrics, or finance. Pandas is one of the main data science libraries in Python.

Pandas is a Python library for PAN-el DA-ta manipulation and analysis, e.g. multidimensional time series and cross-sectional data sets commonly found in statistics, experimental science results, econometrics, or finance. pandas is implemented primarily using NumPy and Cython; it is intended to be able to integrate very easily with NumPy-based scientific libraries, such as statsmodels.

To create a reproducible Pandas example:

Main Features:

  • Data structures: for one- and two-dimensional labeled datasets (respectively Series and DataFrames). Some of their main features include:
    • Automatically aligning data and interpolation
    • Handling missing observations in calculations
    • Convenient slicing and reshaping ("reindexing") functions
    • Categorical data types
    • Provide 'group by' aggregation or transformation functionality
    • Tools for merging and joining together data sets
    • Simple Matplotlib integration for plotting and graphing
    • Multi-Indexing providing structure to indices that allow for representation of an arbitrary number of dimensions.
  • Date tools: objects for expressing date offsets or generating date ranges. Dates can be aligned to a specific time zone and converted or compared at will
  • Statistical models: convenient ordinary least squares and panel OLS implementations for in-sample or rolling time series and cross-sectional regressions. These will hopefully be the starting point for implementing models
  • Intelligent Cython offloading; complex computations are performed rapidly due to these optimizations.
  • Static and moving statistical tools: mean, standard deviation, correlation, and covariance
  • Rich User Documentation, using Sphinx

Asking Questions:

  • Before asking the question, make sure you have gone through the 10 Minutes to pandas introduction. It covers all the basic functionality of Pandas.
  • See this question on asking good questions: How to make good reproducible pandas examples
  • Please provide the version of Pandas, NumPy, and platform details (if appropriate) in your questions

Answering Questions:

Useful Canonicals:

More FAQs are at this link.

Resources and Tutorials:

Books:

282843 questions
1162
votes
14 answers

Pretty-print an entire Pandas Series / DataFrame

I work with Series and DataFrames on the terminal a lot. The default __repr__ for a Series returns a reduced sample, with some head and tail values, but the rest missing. Is there a builtin way to pretty-print the entire Series / DataFrame? …
Dun Peal
  • 16,679
  • 11
  • 33
  • 46
1147
votes
7 answers

Convert list of dictionaries to a pandas DataFrame

How can I convert a list of dictionaries into a DataFrame? Given: [{'points': 50, 'time': '5:00', 'year': 2010}, {'points': 25, 'time': '6:00', 'month': "february"}, {'points':90, 'time': '9:00', 'month': 'january'}, {'points_h1':20, 'month':…
appleLover
  • 14,835
  • 9
  • 33
  • 50
1076
votes
10 answers

Writing a pandas DataFrame to CSV file

I have a dataframe in pandas which I would like to write to a CSV file. I am doing this using: df.to_csv('out.csv') And getting the following error: UnicodeEncodeError: 'ascii' codec can't encode character u'\u03b1' in position 20: ordinal not in…
user7289
  • 32,560
  • 28
  • 71
  • 88
984
votes
18 answers

Deleting DataFrame row in Pandas based on column value

I have the following DataFrame: daysago line_race rating rw wrating line_date 2007-03-31 62 11 56 1.000000 56.000000 2007-03-10 83 11 67 …
TravisVOX
  • 20,342
  • 13
  • 37
  • 41
973
votes
22 answers

How do I expand the output display to see more columns of a Pandas DataFrame?

Is there a way to widen the display of output in either interactive or script-execution mode? Specifically, I am using the describe() function on a Pandas DataFrame. When the DataFrame is five columns (labels) wide, I get the descriptive statistics…
beets
  • 10,261
  • 4
  • 18
  • 11
938
votes
21 answers

Combine two columns of text in pandas dataframe

I have a 20 x 4000 dataframe in Python using pandas. Two of these columns are named Year and quarter. I'd like to create a variable called period that makes Year = 2000 and quarter= q2 into 2000q2. Can anyone help with that?
user2866103
  • 9,887
  • 6
  • 15
  • 15
918
votes
7 answers

How are iloc and loc different?

Can someone explain how these two methods of slicing are different? I've seen the docs and I've seen previous similar questions (1, 2), but I still find myself unable to understand how they are different. To me, they seem interchangeable in large…
AZhao
  • 13,617
  • 7
  • 31
  • 54
900
votes
8 answers

Pandas Merging 101

How can I perform a (INNER| (LEFT|RIGHT|FULL) OUTER) JOIN with pandas? How do I add NaNs for missing rows after a merge? How do I get rid of NaNs after merging? Can I merge on the index? How do I merge multiple DataFrames? Cross join with…
cs95
  • 379,657
  • 97
  • 704
  • 746
855
votes
17 answers

Filter pandas DataFrame by substring criteria

I have a pandas DataFrame with a column of string values. I need to select rows based on partial string matches. Something like this idiom: re.search(pattern, cell_in_question) returning a boolean. I am familiar with the syntax of df[df['A'] ==…
euforia
  • 8,635
  • 3
  • 14
  • 5
838
votes
8 answers

Creating an empty Pandas DataFrame, and then filling it

I'm starting from the pandas DataFrame documentation here: Introduction to data structures I'd like to iteratively fill the DataFrame with values in a time series kind of calculation. I'd like to initialize the DataFrame with columns A, B, and…
Matthias Kauer
  • 9,697
  • 5
  • 17
  • 19
836
votes
14 answers

Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()

I want to filter my dataframe with an or condition to keep rows with a particular column's values that are outside the range [-0.25, 0.25]. I tried: df = df[(df['col'] < -0.25) or (df['col'] > 0.25)] But I get the error: ValueError: The truth…
obabs
  • 8,671
  • 4
  • 12
  • 17
827
votes
12 answers

How to filter Pandas dataframe using 'in' and 'not in' like in SQL

How can I achieve the equivalents of SQL's IN and NOT IN? I have a list with the required values. Here's the scenario: df = pd.DataFrame({'country': ['US', 'UK', 'Germany', 'China']}) countries_to_keep = ['UK', 'China'] #…
LondonRob
  • 73,083
  • 37
  • 144
  • 201
821
votes
14 answers

Shuffle DataFrame rows

I have the following DataFrame: Col1 Col2 Col3 Type 0 1 2 3 1 1 4 5 6 1 ... 20 7 8 9 2 21 10 11 12 2 ... 45 13 14 15 3 46 16 17 18 3 ... The DataFrame…
JNevens
  • 11,202
  • 9
  • 46
  • 72
807
votes
10 answers

How to convert index of a pandas dataframe into a column

How to convert an index of a dataframe into a column? For example: gi ptt_loc 0 384444683 593 1 384444684 594 2 384444686 596 to index1 gi ptt_loc 0 0 384444683 593 1 1 …
msakya
  • 9,311
  • 5
  • 23
  • 31
784
votes
32 answers

How do I count the NaN values in a column in pandas DataFrame?

I want to find the number of NaN in each column of my data.
user3799307
  • 7,849
  • 3
  • 12
  • 3