Questions tagged [dataframe]

A data frame is a 2D tabular data structure. Usually, rows are observations and columns are variables, and the columns are allowed to be of different types (as distinct from an array or matrix). While "data frame" or "dataframe" is the term used for this concept in several languages (R, Apache Spark, Deedle, Maple, the pandas library in Python and the DataFrames library in Julia), "table" is the term used in MATLAB and SQL.


Each section below covers one language that uses this term and is written for readers familiar only with that language.

`data.frame` in R

Data frames (object class data.frame) are one of the basic tabular data structures in the R language, alongside matrices. Unlike matrices, each column can be a different data type. In terms of implementation, a data frame is a list of equal-length column vectors.

Type ?data.frame for help constructing a data frame. An example:

data.frame(
  x = letters[1:5], 
  y = 1:5, 
  z = (1:5) > 3
)
#   x y     z
# 1 a 1 FALSE
# 2 b 2 FALSE
# 3 c 3 FALSE
# 4 d 4  TRUE
# 5 e 5  TRUE

Related functions include is.data.frame, which tests whether an object is a data.frame, and as.data.frame, which coerces many other data structures to data.frame (through S3 dispatch; see ?S3). Base R data.frames have been extended or modified to create new data structures by several R packages, including data.table and tibble. For further reading, see the section on data frames in the CRAN manual An Introduction to R.

DataFrame in Python's pandas library

The pandas library in Python is the canonical tabular data framework in the SciPy stack, and the DataFrame is its two-dimensional data object. It is basically a rectangular array like a 2D NumPy ndarray, but with associated indices on each axis which can be used for alignment. As in R, from an implementation perspective, columns are somewhat prioritized over rows: the DataFrame resembles a dictionary with column names as keys and Series (pandas' one-dimensional data structure) as values.
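To make the dictionary analogy and the role of the index concrete, here is a minimal sketch. The variable names (df, a, b, total) are chosen for illustration only, and the small Series are made-up data.

```python
import pandas as pd

df = pd.DataFrame({"x": list("abcde"), "y": range(1, 6)})

# Dictionary-like behaviour: keys are column names, values are Series.
column_names = list(df.keys())         # same labels as df.columns
y = df["y"]                            # selecting a key returns a Series
is_series = isinstance(y, pd.Series)
has_x = "x" in df                      # membership tests the column labels

# Alignment: arithmetic matches entries by index label, not by position.
a = pd.Series([1, 2, 3], index=["p", "q", "r"])
b = pd.Series([10, 20, 30], index=["r", "q", "p"])
total = a + b                          # p: 1 + 30, q: 2 + 20, r: 3 + 10
```

Because the addition is aligned on labels, total["p"] is 31 even though the two Series list their labels in opposite order.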

After importing numpy and pandas under the usual aliases (import numpy as np, import pandas as pd), we can construct a DataFrame in several ways, such as passing a dictionary of column names and values:

>>> pd.DataFrame({"x": list("abcde"), "y": range(1,6), "z": np.arange(1,6) > 3})
   x  y      z
0  a  1  False
1  b  2  False
2  c  3  False
3  d  4   True
4  e  5   True
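As a sketch of some of the other construction routes, the following builds a DataFrame from a list of row records and from a 2D NumPy array. The data and the column labels (u, v) are invented for illustration.

```python
import numpy as np
import pandas as pd

# From a list of records (one dict per row); the keys become column names.
records = [{"x": "a", "y": 1}, {"x": "b", "y": 2}]
from_records = pd.DataFrame(records)

# From a 2D NumPy array, supplying the column labels explicitly.
from_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["u", "v"])

# Unlike a plain ndarray, each column keeps its own dtype.
mixed = pd.DataFrame({"x": list("abc"), "y": [1, 2, 3], "z": [True, False, True]})
print(mixed.dtypes)
```

Printing mixed.dtypes lists one dtype per column, which is the practical difference from a homogeneous array.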

DataFrame in Apache Spark

A Spark DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs. (source)

DataFrame in Maple

A DataFrame is one of the basic data structures in Maple. A data frame is a list of variables, known as DataSeries, displayed in a rectangular grid. Every column (variable) in a DataFrame has the same length, but each variable can have a different type, such as integer, float, string, name, or boolean.

When printed, data frames resemble matrices in that they are viewed as a rectangular grid, but a key difference is that the first row corresponds to the column (variable) names, and the first column corresponds to the row (individual) names. These rows and columns are treated as header meta-information and are not part of the data. Moreover, the data stored in a DataFrame can be accessed using these header names, as well as by the standard numbered index. For more details, see the Guide to DataFrames in the online Maple Programming Help.

143674 questions

630 votes, 17 answers

How to replace NaN values by Zeroes in a column of a Pandas Dataframe?

I have a Pandas Dataframe as below: itm Date Amount 67 420 2012-09-30 00:00:00 65211 68 421 2012-09-09 00:00:00 29424 69 421 2012-09-16 00:00:00 29877 70 421 2012-09-23 00:00:00 30990 71 421 2012-09-30…

python pandas dataframe nan

asked Nov 08 '12 at 18:50

George Thompson

6,627
4
16
16

626 votes, 13 answers

how to sort pandas dataframe from one column

I have a data frame like this: print(df) 0 1 2 0 354.7 April 4.0 1 55.4 August 8.0 2 176.5 December 12.0 3 95.5 February 2.0 4 85.6 January 1.0 5 152 July 7.0 6 238.7 …

python pandas dataframe sorting time

asked Jun 13 '16 at 10:44

Sachila Ranawaka

39,756
7
56
80

606 votes, 8 answers

Create new column based on values from other columns / apply a function of multiple columns, row-wise in Pandas

I want to apply my custom function (it uses an if-else ladder) to these six columns (ERI_Hispanic, ERI_AmerInd_AKNatv, ERI_Asian, ERI_Black_Afr.Amer, ERI_HI_PacIsl, ERI_White) in each row of my dataframe. I've tried different methods from other…

python pandas dataframe numpy apply

asked Nov 12 '14 at 12:08

Dave

6,968
7
26
32

601 votes, 16 answers

Drop unused factor levels in a subsetted data frame

I have a data frame containing a factor. When I create a subset of this dataframe using subset or another indexing function, a new data frame is created. However, the factor variable retains all of its original levels, even when/if they do not…

r dataframe r-factor r-faq

asked Jul 28 '09 at 18:21

medriscoll

26,995
17
40
36

596 votes, 17 answers

Create an empty data.frame

I'm trying to initialize a data.frame without any rows. Basically, I want to specify the data types for each column and name them, but not have any rows created as a result. The best I've been able to do so far is something like: df <-…

r dataframe r-faq

asked May 21 '12 at 16:35

Jeff Allen

17,277
8
49
70

591 votes, 5 answers

How to check if a column exists in Pandas

How do I check if a column exists in a Pandas DataFrame df? A B C 0 3 40 100 1 6 30 200 How would I check if the column "A" exists in the above DataFrame so that I can compute: df['sum'] = df['A'] + df['C'] And if "A" doesn't…

python pandas dataframe

asked Jul 21 '14 at 16:43

npires

6,093
2
13
9

584 votes, 7 answers

Filter dataframe rows if value in column is in a set list of values

I have a Python pandas DataFrame rpt: rpt MultiIndex: 47518 entries, ('000002', '20120331') to ('603366', '20091231') Data columns: STK_ID 47518 non-null values STK_Name …

python pandas dataframe

asked Aug 22 '12 at 03:16

bigbug

55,954
42
77
96

573 votes, 12 answers

Remap values in pandas column with a dict, preserve NaNs

I have a dictionary which looks like this: di = {1: "A", 2: "B"} I would like to apply it to the col1 column of a dataframe similar to: col1 col2 0 w a 1 1 2 2 2 NaN to get: col1 col2 0 w a 1 …

python pandas dataframe dictionary remap

asked Nov 27 '13 at 18:56

TheChymera

17,004
14
56
86

572 votes, 18 answers

Convert Python dict into a dataframe

I have a Python dictionary like the following: {u'2012-06-08': 388, u'2012-06-09': 388, u'2012-06-10': 388, u'2012-06-11': 389, u'2012-06-12': 389, u'2012-06-13': 389, u'2012-06-14': 389, u'2012-06-15': 389, u'2012-06-16': 389, …

python pandas dataframe

asked Sep 16 '13 at 21:02

anonuser0428

11,789
22
63
86

563 votes, 8 answers

Selecting a row of pandas series/dataframe by integer index

I am curious as to why df[2] is not supported, while df.ix[2] and df[2:3] both work. In [26]: df.ix[2] Out[26]: A 1.027680 B 1.514210 C -1.466963 D -0.162339 Name: 2000-01-03 00:00:00 In [27]: df[2:3] Out[27]: A …

python pandas dataframe indexing

asked Apr 19 '13 at 03:14

user1642513

561 votes, 12 answers

Quickly reading very large tables as dataframes

I have very large tables (30 million rows) that I would like to load as dataframes in R. read.table() has a lot of convenient features, but it seems like there is a lot of logic in the implementation that would slow things down. In my case, I am…

r import dataframe r-faq

asked Nov 13 '09 at 07:53

eytan

5,945
3
20
11

553 votes, 11 answers

Get list from pandas dataframe column or row?

I have a dataframe df imported from an Excel document like this: cluster load_date budget actual fixed_price A 1/1/2014 1000 4000 Y A 2/1/2014 12000 10000 Y A 3/1/2014 36000 2000 Y B 4/1/2014 15000 10000 …

python pandas list dataframe

asked Mar 12 '14 at 03:12

yoshiserry

20,175
35
77
104

550 votes, 13 answers

Pandas read_csv: low_memory and dtype options

df = pd.read_csv('somefile.csv') ...gives an error: .../site-packages/pandas/io/parsers.py:1130: DtypeWarning: Columns (4,5,7,16) have mixed types. Specify dtype option on import or set low_memory=False. Why is the dtype option related to…

python parsing numpy pandas dataframe

asked Jun 16 '14 at 19:56

Josh

11,979
17
60
96

549 votes, 14 answers

How to select all columns except one in pandas?

I have a dataframe that look like this: a b c d 0 0.418762 0.042369 0.869203 0.972314 1 0.991058 0.510228 0.594784 0.534366 2 0.407472 0.259811 0.396664 0.894202 3 0.726168 0.139531 0.324932 …

python pandas dataframe select

asked Apr 21 '15 at 05:24

markov zain

11,987
13
35
39

548 votes, 8 answers

How can I use the apply() function for a single column?

I have a pandas dataframe with multiple columns. I want to change the values of only the first column without affecting the other columns. How can I do that using apply() in pandas?

python pandas dataframe numpy apply

asked Jan 23 '16 at 10:04

Amani

16,245
29
103
153
