Questions tagged [dataframe]

A data frame is a 2D tabular data structure. Usually, it contains data where rows are observations and columns are variables and are allowed to be of different types (as distinct from an array or matrix). While "data frame" or "dataframe" is the term used for this concept in several languages (R, Apache Spark, deedle, Maple, the pandas library in Python and the DataFrames library in Julia), "table" is the term used in MATLAB and SQL.

A data frame is a tabular data structure. Usually, it contains data where rows are observations and columns are variables of various types. While data frame or dataframe is the term used for this concept in several languages (R, Apache Spark, deedle, Maple, the pandas library in Python and the DataFrames library in Julia), table is the term used in MATLAB and SQL.

The sections below correspond to each language that uses this term and are aimed at the level of an audience only familiar with the given language.

data.frame in R

Data frames (object class data.frame) are one of the basic tabular data structures in the R language, alongside matrices. Unlike matrices, each column can be a different data type. In terms of implementation, a data frame is a list of equal-length column vectors.

Type ?data.frame for help constructing a data frame. An example:

data.frame(
  x = letters[1:5], 
  y = 1:5, 
  z = (1:5) > 3
)
#   x y     z
# 1 a 1 FALSE
# 2 b 2 FALSE
# 3 c 3 FALSE
# 4 d 4  TRUE
# 5 e 5  TRUE

Related functions include is.data.frame, which tests whether an object is a data.frame; and as.data.frame, which coerces many other data structures to data.frame (through S3 dispatch, see ?S3). base data.frames have been extended or modified to create new data structures by several R packages, including and . For further reading, see the paragraph on Data frames in the CRAN manual Intro to R


DataFrame in Python's pandas library

The pandas library in Python is the canonical tabular data framework on the SciPy stack, and the DataFrame is its two-dimensional data object. It is basically a rectangular array like a 2D numpy ndarray, but with associated indices on each axis which can be used for alignment. As in R, from an implementation perspective, columns are somewhat prioritized over rows: the DataFrame resembles a dictionary with column names as keys and Series (pandas' one-dimensional data structure) as values. The DataFrame object in pandas.

After importing numpy and pandas under the usual aliases (import numpy as np, import pandas as pd), we can construct a DataFrame in several ways, such as passing a dictionary of column names and values:

>>> pd.DataFrame({"x": list("abcde"), "y": range(1,6), "z": np.arange(1,6) > 3})
   x  y      z
0  a  1  False
1  b  2  False
2  c  3  False
3  d  4   True
4  e  5   True

DataFrame in Apache Spark

A Spark DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs. (source)


DataFrame in Maple

A DataFrame is one of the basic data structures in Maple. Data frames are a list of variables, known as DataSeries, which are displayed in a rectangular grid. Every column (variable) in a DataFrame has the same length, however, each variable can have a different type, such as integer, float, string, name, boolean, etc.

When printed, Data frames resemble matrices in that they are viewed as a rectangular grid, but a key difference is that the first row corresponds to the column (variable) names, and the first column corresponds to the row (individual) names. These row and columns are treated as header meta-information and are not a part of the data. Moreover, the data stored in a DataFrame can be accessed using these header names, as well as by the standard numbered index. For more details, see the Guide to DataFrames in the online Maple Programming Help.

143674 questions
630
votes
17 answers

How to replace NaN values by Zeroes in a column of a Pandas Dataframe?

I have a Pandas Dataframe as below: itm Date Amount 67 420 2012-09-30 00:00:00 65211 68 421 2012-09-09 00:00:00 29424 69 421 2012-09-16 00:00:00 29877 70 421 2012-09-23 00:00:00 30990 71 421 2012-09-30…
George Thompson
  • 6,627
  • 4
  • 16
  • 16
626
votes
13 answers

how to sort pandas dataframe from one column

I have a data frame like this: print(df) 0 1 2 0 354.7 April 4.0 1 55.4 August 8.0 2 176.5 December 12.0 3 95.5 February 2.0 4 85.6 January 1.0 5 152 July 7.0 6 238.7 …
Sachila Ranawaka
  • 39,756
  • 7
  • 56
  • 80
606
votes
8 answers

Create new column based on values from other columns / apply a function of multiple columns, row-wise in Pandas

I want to apply my custom function (it uses an if-else ladder) to these six columns (ERI_Hispanic, ERI_AmerInd_AKNatv, ERI_Asian, ERI_Black_Afr.Amer, ERI_HI_PacIsl, ERI_White) in each row of my dataframe. I've tried different methods from other…
Dave
  • 6,968
  • 7
  • 26
  • 32
601
votes
16 answers

Drop unused factor levels in a subsetted data frame

I have a data frame containing a factor. When I create a subset of this dataframe using subset or another indexing function, a new data frame is created. However, the factor variable retains all of its original levels, even when/if they do not…
medriscoll
  • 26,995
  • 17
  • 40
  • 36
596
votes
17 answers

Create an empty data.frame

I'm trying to initialize a data.frame without any rows. Basically, I want to specify the data types for each column and name them, but not have any rows created as a result. The best I've been able to do so far is something like: df <-…
Jeff Allen
  • 17,277
  • 8
  • 49
  • 70
591
votes
5 answers

How to check if a column exists in Pandas

How do I check if a column exists in a Pandas DataFrame df? A B C 0 3 40 100 1 6 30 200 How would I check if the column "A" exists in the above DataFrame so that I can compute: df['sum'] = df['A'] + df['C'] And if "A" doesn't…
npires
  • 6,093
  • 2
  • 13
  • 9
584
votes
7 answers

Filter dataframe rows if value in column is in a set list of values

I have a Python pandas DataFrame rpt: rpt MultiIndex: 47518 entries, ('000002', '20120331') to ('603366', '20091231') Data columns: STK_ID 47518 non-null values STK_Name …
bigbug
  • 55,954
  • 42
  • 77
  • 96
573
votes
12 answers

Remap values in pandas column with a dict, preserve NaNs

I have a dictionary which looks like this: di = {1: "A", 2: "B"} I would like to apply it to the col1 column of a dataframe similar to: col1 col2 0 w a 1 1 2 2 2 NaN to get: col1 col2 0 w a 1 …
TheChymera
  • 17,004
  • 14
  • 56
  • 86
572
votes
18 answers

Convert Python dict into a dataframe

I have a Python dictionary like the following: {u'2012-06-08': 388, u'2012-06-09': 388, u'2012-06-10': 388, u'2012-06-11': 389, u'2012-06-12': 389, u'2012-06-13': 389, u'2012-06-14': 389, u'2012-06-15': 389, u'2012-06-16': 389, …
anonuser0428
  • 11,789
  • 22
  • 63
  • 86
563
votes
8 answers

Selecting a row of pandas series/dataframe by integer index

I am curious as to why df[2] is not supported, while df.ix[2] and df[2:3] both work. In [26]: df.ix[2] Out[26]: A 1.027680 B 1.514210 C -1.466963 D -0.162339 Name: 2000-01-03 00:00:00 In [27]: df[2:3] Out[27]: A …
user1642513
561
votes
12 answers

Quickly reading very large tables as dataframes

I have very large tables (30 million rows) that I would like to load as a dataframes in R. read.table() has a lot of convenient features, but it seems like there is a lot of logic in the implementation that would slow things down. In my case, I am…
eytan
  • 5,945
  • 3
  • 20
  • 11
553
votes
11 answers

Get list from pandas dataframe column or row?

I have a dataframe df imported from an Excel document like this: cluster load_date budget actual fixed_price A 1/1/2014 1000 4000 Y A 2/1/2014 12000 10000 Y A 3/1/2014 36000 2000 Y B 4/1/2014 15000 10000 …
yoshiserry
  • 20,175
  • 35
  • 77
  • 104
550
votes
13 answers

Pandas read_csv: low_memory and dtype options

df = pd.read_csv('somefile.csv') ...gives an error: .../site-packages/pandas/io/parsers.py:1130: DtypeWarning: Columns (4,5,7,16) have mixed types. Specify dtype option on import or set low_memory=False. Why is the dtype option related to…
Josh
  • 11,979
  • 17
  • 60
  • 96
549
votes
14 answers

How to select all columns except one in pandas?

I have a dataframe that look like this: a b c d 0 0.418762 0.042369 0.869203 0.972314 1 0.991058 0.510228 0.594784 0.534366 2 0.407472 0.259811 0.396664 0.894202 3 0.726168 0.139531 0.324932 …
markov zain
  • 11,987
  • 13
  • 35
  • 39
548
votes
8 answers

How can I use the apply() function for a single column?

I have a pandas dataframe with multiple columns. I want to change the values of the only the first column without affecting the other columns. How can I do that using apply() in pandas?
Amani
  • 16,245
  • 29
  • 103
  • 153