Questions tagged [dataframe]

A data frame is a 2D tabular data structure. Usually, it contains data where rows are observations and columns are variables and are allowed to be of different types (as distinct from an array or matrix). While "data frame" or "dataframe" is the term used for this concept in several languages (R, Apache Spark, deedle, Maple, the pandas library in Python and the DataFrames library in Julia), "table" is the term used in MATLAB and SQL.

The sections below correspond to each language that uses this term and are aimed at the level of an audience only familiar with the given language.

data.frame in R

Data frames (object class data.frame) are one of the basic tabular data structures in the R language, alongside matrices. Unlike matrices, each column can be a different data type. In terms of implementation, a data frame is a list of equal-length column vectors.

Type ?data.frame for help constructing a data frame. An example:

data.frame(
  x = letters[1:5], 
  y = 1:5, 
  z = (1:5) > 3
)
#   x y     z
# 1 a 1 FALSE
# 2 b 2 FALSE
# 3 c 3 FALSE
# 4 d 4  TRUE
# 5 e 5  TRUE

Related functions include is.data.frame, which tests whether an object is a data.frame, and as.data.frame, which coerces many other data structures to data.frame (through S3 dispatch, see ?S3). Several R packages have extended or modified the base data.frame to create new data structures. For further reading, see the paragraph on Data frames in the CRAN manual Intro to R.


DataFrame in Python's pandas library

The pandas library in Python is the canonical tabular data framework on the SciPy stack, and the DataFrame is its two-dimensional data object. It is essentially a rectangular array like a 2D numpy ndarray, but with associated indices on each axis that can be used for alignment. As in R, from an implementation perspective, columns are somewhat prioritized over rows: the DataFrame resembles a dictionary with column names as keys and Series (pandas' one-dimensional data structure) as values.

After importing numpy and pandas under the usual aliases (import numpy as np, import pandas as pd), we can construct a DataFrame in several ways, such as passing a dictionary of column names and values:

>>> pd.DataFrame({"x": list("abcde"), "y": range(1,6), "z": np.arange(1,6) > 3})
   x  y      z
0  a  1  False
1  b  2  False
2  c  3  False
3  d  4   True
4  e  5   True
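The dictionary-like behaviour and index alignment described above can be illustrated with a short sketch. It reuses the example frame; the Series a and b and their values are made up purely for illustration.

```python
import numpy as np
import pandas as pd

# Same example frame as above.
df = pd.DataFrame({"x": list("abcde"), "y": range(1, 6), "z": np.arange(1, 6) > 3})

# Dictionary-like access: a column name maps to a Series.
y = df["y"]
print(type(y).__name__)   # Series
print(list(df.keys()))    # ['x', 'y', 'z']

# Index alignment: arithmetic pairs values by index label, not by position.
# a and b are hypothetical Series with the same labels in a different order.
a = pd.Series([10, 20, 30], index=[0, 1, 2])
b = pd.Series([1, 2, 3], index=[2, 1, 0])
total = a + b
print(total.loc[0], total.loc[1], total.loc[2])   # 13 22 31
```

Note that label 0 pairs 10 with 3 (not with 1), because the addition matches on index labels rather than on element order.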

DataFrame in Apache Spark

A Spark DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources, such as structured data files, tables in Hive, external databases, or existing RDDs. (Source: Spark SQL programming guide.)


DataFrame in Maple

A DataFrame is one of the basic data structures in Maple. A data frame is a list of variables, known as DataSeries, which are displayed in a rectangular grid. Every column (variable) in a DataFrame has the same length, but each variable can have a different type, such as integer, float, string, name, or boolean.

When printed, data frames resemble matrices in that they are displayed as a rectangular grid. A key difference is that the first row holds the column (variable) names and the first column holds the row (individual) names. This header row and column are treated as meta-information and are not part of the data. The data stored in a DataFrame can be accessed using these header names as well as by the standard numbered index. For more details, see the Guide to DataFrames in the online Maple Programming Help.

143674 questions
548
votes
3 answers

How to reset index in a pandas dataframe?

I have a dataframe from which I remove some rows. As a result, I get a dataframe in which index is something like that: [1,5,6,10,11] and I would like to reset it to [0,1,2,3,4]. How can I do it? The following seems to work: df =…
Roman
  • 124,451
  • 167
  • 349
  • 456
541
votes
19 answers

How to flatten a hierarchical index in columns

I have a data frame with a hierarchical index in axis 1 (columns) (from a groupby.agg operation): USAF WBAN year month day s_PC s_CL s_CD s_CNT tempf sum sum sum sum amax amin 0 …
Ross R
  • 8,853
  • 7
  • 28
  • 27
540
votes
30 answers

How do I create test and train samples from one dataframe with pandas?

I have a fairly large dataset in the form of a dataframe and I was wondering how I would be able to split the dataframe into two random samples (80% and 20%) for training and testing. Thanks!
tooty44
  • 6,829
  • 9
  • 27
  • 39
527
votes
8 answers

Python Pandas: Get index of rows where column matches certain value

Given a DataFrame with a column "BoolCol", we want to find the indexes of the DataFrame in which the values for "BoolCol" == True I currently have the iterating way to do it, which works perfectly: for i in range(100,3000): if…
I want badges
  • 6,155
  • 5
  • 23
  • 38
503
votes
9 answers

Selecting/excluding sets of columns in pandas

I would like to create views or dataframes from an existing dataframe based on column selections. For example, I would like to create a dataframe df2 from a dataframe df1 that holds all columns from it except two of them. I tried doing the…
Amelio Vazquez-Reina
  • 91,494
  • 132
  • 359
  • 564
502
votes
18 answers

How to sum a variable by group

I have a data frame with two columns. First column contains categories such as "First", "Second", "Third", and the second column has numbers that represent the number of times I saw the specific groups from "Category". For example: Category …
boo-urns
  • 10,136
  • 26
  • 71
  • 107
495
votes
11 answers

Sorting columns in pandas dataframe based on column name

I have a dataframe with over 200 columns. The issue is as they were generated the order is ['Q1.3','Q6.1','Q1.2','Q1.1',......] I need to sort the columns as follows: ['Q1.1','Q1.2','Q1.3',.....'Q6.1',......] Is there some way for me to do this…
pythOnometrist
  • 6,531
  • 6
  • 30
  • 50
490
votes
16 answers

Count the frequency that a value occurs in a dataframe column

I have a dataset category cat a cat b cat a I'd like to return something like the following which shows the unique values and their frequencies category freq cat a 2 cat b 1
yoshiserry
  • 20,175
  • 35
  • 77
  • 104
490
votes
13 answers

Pandas conditional creation of a series/dataframe column

How do I add a color column to the following dataframe so that color='green' if Set == 'Z', and color='red' otherwise? Type Set 1 A Z 2 B Z 3 B X 4 C Y
user7289
  • 32,560
  • 28
  • 71
  • 88
485
votes
10 answers

Combine a list of data frames into one data frame by row

I have code that at one place ends up with a list of data frames which I really want to convert to a single big data frame. I got some pointers from an earlier question which was trying to do something similar but more complex. Here's an example…
JD Long
  • 59,675
  • 58
  • 202
  • 294
477
votes
19 answers

Changing column names of a data frame

I have a data frame called "newprice" (see below) and I want to change the column names in my program in R. > newprice Chang. Chang. Chang. 1 100 36 136 2 120 -33 87 3 150 14 164 In fact this is…
Son
  • 5,295
  • 5
  • 19
  • 10
475
votes
15 answers

Get the row(s) which have the max value in groups using groupby

How do I find all rows in a pandas DataFrame which have the max value for count column, after grouping by ['Sp','Mt'] columns? Example 1: the following DataFrame: Sp Mt Value count 0 MM1 S1 a **3** 1 MM1 S1 n 2 2 MM1 S3 …
jojo12
  • 4,853
  • 3
  • 14
  • 7
467
votes
6 answers

Convert DataFrame column type from string to datetime

How can I convert a DataFrame column of strings (in dd/mm/yyyy format) to datetime dtype?
perigee
  • 9,438
  • 11
  • 31
  • 35
466
votes
7 answers

Convert Pandas Column to DateTime

I have one field in a pandas DataFrame that was imported as string format. It should be a datetime variable. How do I convert it to a datetime column, and then filter based on date? Example: raw_data = pd.DataFrame({'Mycol':…
Chris
  • 12,900
  • 12
  • 43
  • 65
460
votes
24 answers

Normalize columns of a dataframe

I have a dataframe in pandas where each column has different value range. For example: df: A B C 1000 10 0.5 765 5 0.35 800 7 0.09 Any idea how I can normalize the columns of this dataframe where each value is between 0 and 1? My…
ahajib
  • 12,838
  • 29
  • 79
  • 120