Questions tagged [dataframe]

A data frame is a 2D tabular data structure. Usually, rows are observations and columns are variables, and the columns are allowed to be of different types (as distinct from an array or matrix). While "data frame" or "dataframe" is the term used for this concept in several languages and libraries (R, Apache Spark, Deedle, Maple, the pandas library in Python, and the DataFrames library in Julia), "table" is the term used in MATLAB and SQL.

Each section below covers one language that uses this term and is written for readers familiar only with that language.

data.frame in R

Data frames (object class data.frame) are one of the basic tabular data structures in the R language, alongside matrices. Unlike matrices, each column can be a different data type. In terms of implementation, a data frame is a list of equal-length column vectors.

Type ?data.frame for help constructing a data frame. An example:

data.frame(
  x = letters[1:5], 
  y = 1:5, 
  z = (1:5) > 3
)
#   x y     z
# 1 a 1 FALSE
# 2 b 2 FALSE
# 3 c 3 FALSE
# 4 d 4  TRUE
# 5 e 5  TRUE

Related functions include is.data.frame, which tests whether an object is a data.frame, and as.data.frame, which coerces many other data structures to data.frame (through S3 dispatch; see ?S3). Several R packages have extended or modified base data.frames to create new data structures. For further reading, see the paragraph on data frames in the CRAN manual Intro to R.


DataFrame in Python's pandas library

The pandas library in Python is the canonical tabular data framework on the SciPy stack, and the DataFrame is its two-dimensional data object. It is a rectangular array like a 2D NumPy ndarray, but with an associated index on each axis that can be used for alignment. As in R, from an implementation perspective, columns are somewhat prioritized over rows: the DataFrame resembles a dictionary with column names as keys and Series (pandas' one-dimensional data structure) as values.

After importing numpy and pandas under the usual aliases (import numpy as np, import pandas as pd), we can construct a DataFrame in several ways, such as passing a dictionary of column names and values:

>>> pd.DataFrame({"x": list("abcde"), "y": range(1,6), "z": np.arange(1,6) > 3})
   x  y      z
0  a  1  False
1  b  2  False
2  c  3  False
3  d  4   True
4  e  5   True
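
The dictionary-like behavior and the index alignment described above can be seen in a small sketch. The variable names (df, col, s1, s2, total) are made up for illustration, and only a handful of pandas calls are used:

```python
import pandas as pd

df = pd.DataFrame({"x": list("abcde"), "y": range(1, 6)})

# Looking up a column works like a dictionary lookup and returns a Series.
col = df["y"]
print(type(col).__name__)   # Series
print(list(df.keys()))      # ['x', 'y']

# Arithmetic between pandas objects aligns on index labels, not on position.
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["c", "b", "a"])
total = s1 + s2
print(total["a"], total["b"], total["c"])   # 31 22 13
```

Because the labels are matched before adding, the value labelled "a" in s1 is added to the value labelled "a" in s2, even though they sit at opposite ends of the two Series.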

DataFrame in Apache Spark

A Spark DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources, such as structured data files, tables in Hive, external databases, or existing RDDs.


DataFrame in Maple

A DataFrame is one of the basic data structures in Maple. A DataFrame is a list of variables, known as DataSeries, which are displayed in a rectangular grid. Every column (variable) in a DataFrame has the same length, but each variable can have a different type, such as integer, float, string, name, or boolean.

When printed, DataFrames resemble matrices in that they are viewed as a rectangular grid, but a key difference is that the first row corresponds to the column (variable) names, and the first column corresponds to the row (individual) names. This row and this column are treated as header meta-information and are not part of the data. Moreover, the data stored in a DataFrame can be accessed using these header names, as well as by the standard numbered index. For more details, see the Guide to DataFrames in the online Maple Programming Help.

143674 questions
23
votes
3 answers

Pandas filter rows based on multiple conditions

I have some values in the risk column that are neither, Small, Medium or High. I want to delete the rows with the value not being Small, Medium and High. I tried the following: df = df[(df.risk == "Small") | (df.risk == "Medium") | (df.risk ==…
ArtDijk
23
votes
5 answers

Pandas won't fillna() inplace

I'm trying to fill NAs with "" on 4 specific columns in a data frame that are string/object types. I can assign these columns to a new variable as I fillna(), but when I fillna() inplace the underlying data doesn't change. a_n6 = a_n6[["PROV LAST",…
Beau Bristow
23
votes
2 answers

Changing pipe separated data to a pandas Dataframe

I have pipe-separated values like this: https|clients4.google.com|application/octet-stream|2296| https|clients4.google.com|text/html; charset=utf-8|0| .... .... https|clients4.google.com|application/octet-stream|2291| I have to create a Pandas…
itsaruns
23
votes
2 answers

Get size of a group knowing its grouper id in pandas groupby

In the following snippet, data is a pandas.DataFrame and indices is a set of columns of the data. After grouping the data with groupby, I am interested in the ids of the groups, but only those with a size greater than a threshold (say:…
piokuc
23
votes
3 answers

Enumerate each row for each group in a DataFrame

In pandas, how can I add a new column which enumerates rows based on a given grouping? For instance, assume the following DataFrame: import pandas as pd import numpy as np a_list = ['A', 'B', 'C', 'A', 'A', 'C', 'B', 'B', 'A', 'C'] df =…
Greg Reda
23
votes
9 answers

Minus operation of data frames

I have 2 data frames df1 and df2. df1 <- data.frame(c1=c("a","b","c","d"),c2=c(1,2,3,4) ) df2 <- data.frame(c1=c("c","d","e","f"),c2=c(3,4,5,6) ) > df1 c1 c2 1 a 1 2 b 2 3 c 3 4 d 4 > df2 c1 c2 1 c 3 2 d 4 3 e 5 4 f 6 I need…
Dinoop Nair
23
votes
1 answer

Reindexing dataframes

I have a data frame. Then I have a logical condition using which I create another data frame by removing some rows. The new data frame however skips indices for removed rows. How can I get it to reindex sequentially without skipping? Here's a sample…
user2133151
23
votes
3 answers

TimeGrouper, pandas

I use TimeGrouper from pandas.tseries.resample to sum monthly return to 6M as follows: 6m_return = monthly_return.groupby(TimeGrouper(freq='6M')).aggregate(numpy.sum) where monthly_return is like: 2008-07-01 0.003626 2008-08-01 …
user2019264
23
votes
5 answers

Divide each data frame row by vector in R

I'm trying to divide each number within a data frame with 16 columns by a specific number for each column. The numbers are stored as a data frame with 1-16 corresponding to the samples in the larger data frames columns 1-16. There is a single number…
Ramma
23
votes
4 answers

Apply function to pandas DataFrame that can return multiple rows

I am trying to transform DataFrame, such that some of the rows will be replicated a given number of times. For example: df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count':[1,0,2]}) class count 0 A 1 1 B 0 2 C …
btel
23
votes
1 answer

Replace numbers in data frame column in R?

Possible Duplicate: Replace contents of factor column in R dataframe I have the data.frame df1<-data.frame("Sp1"=1:6,"Sp2"=7:12,"Sp3"=13:18) rownames(df1)=c("A","B","C","D","E","F") df1 Sp1 Sp2 Sp3 A 1 7 13 B 2 8 14 C 3 9 15 D …
Elizabeth
23
votes
1 answer

Why does changing a column name take an extremely long time with a large data.frame?

I have a data.frame in R with 19 million rows and 90 columns. I have plenty of spare RAM and CPU cycles. It seems that changing a single column name in this data frame is a very intense operation for R. system.time(colnames(my.df)[1] <- "foo") …
Ina
23
votes
5 answers

pandas: combine two columns in a DataFrame

I have a pandas DataFrame that has multiple columns in it: Index: 239897 entries, 2012-05-11 15:20:00 to 2012-06-02 23:44:51 Data columns: foo 11516 non-null values bar 228381 non-null values Time_UTC …
BFTM
23
votes
2 answers

Prevent automatic conversion of single column to vector

I have a data frame like this: df = data.frame(a=1:3, b=2:4, c=3:5) I am selecting columns from that data frame using something akin to: df[, c(T, F, T)] This works fine as long as there are at least two columns to be returned; but, if I do this…
Nils
22
votes
3 answers

Deleting every n-th row in a dataframe

How can I delete every n-th row from a dataframe in R?
Yktula