Questions tagged [dataframe]

A data frame is a 2D tabular data structure. Usually, its rows are observations and its columns are variables, and the columns are allowed to be of different types (as distinct from an array or matrix). While "data frame" or "dataframe" is the term used for this concept in several languages (R, Apache Spark, Deedle, Maple, the pandas library in Python and the DataFrames library in Julia), "table" is the term used in MATLAB and SQL.

Each section below covers one language that uses this term and is written for readers familiar only with that language.

data.frame in R

Data frames (object class data.frame) are one of the basic tabular data structures in the R language, alongside matrices. Unlike matrices, each column can be a different data type. In terms of implementation, a data frame is a list of equal-length column vectors.

Type ?data.frame for help constructing a data frame. An example:

data.frame(
  x = letters[1:5], 
  y = 1:5, 
  z = (1:5) > 3
)
#   x y     z
# 1 a 1 FALSE
# 2 b 2 FALSE
# 3 c 3 FALSE
# 4 d 4  TRUE
# 5 e 5  TRUE

Related functions include is.data.frame, which tests whether an object is a data.frame, and as.data.frame, which coerces many other data structures to data.frame (through S3 dispatch, see ?S3). Several R packages have extended or modified the base data.frame to create new data structures. For further reading, see the paragraph on data frames in the CRAN manual An Introduction to R.


DataFrame in Python's pandas library

The pandas library in Python is the canonical tabular data framework on the SciPy stack, and the DataFrame is its two-dimensional data object. It is essentially a rectangular array like a 2D NumPy ndarray, but with associated indices on each axis which can be used for alignment. As in R, from an implementation perspective, columns are somewhat prioritized over rows: the DataFrame resembles a dictionary with column names as keys and Series (pandas' one-dimensional data structure) as values.

After importing numpy and pandas under the usual aliases (import numpy as np, import pandas as pd), we can construct a DataFrame in several ways, such as passing a dictionary of column names and values:

>>> pd.DataFrame({"x": list("abcde"), "y": range(1,6), "z": np.arange(1,6) > 3})
   x  y      z
0  a  1  False
1  b  2  False
2  c  3  False
3  d  4   True
4  e  5   True
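
The dictionary-like behaviour and the index alignment described above can be sketched as follows. This is only a minimal illustration: the variable names (df, col, a, b, total) and the index labels are made up for the example and do not come from the pandas documentation.

```python
import pandas as pd

# A small frame with two columns of different types.
df = pd.DataFrame({"x": list("abcde"), "y": range(1, 6)})

# Dictionary-like access: a column name maps to a Series.
col = df["y"]
is_series = isinstance(col, pd.Series)
column_names = list(df.columns)

# Alignment: arithmetic between two Series matches values by index label,
# not by position. Labels present in only one operand produce NaN.
a = pd.Series([1, 2, 3], index=["p", "q", "r"])
b = pd.Series([10, 20, 30], index=["q", "r", "s"])
total = a + b

print(is_series, column_names)
print(total)
```

Here only the labels "q" and "r" occur in both Series, so only those two positions of total hold a sum; "p" and "s" come out as missing values.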

DataFrame in Apache Spark

A Spark DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs. (source)


DataFrame in Maple

A DataFrame is one of the basic data structures in Maple. A data frame is a list of variables, known as DataSeries, displayed in a rectangular grid. Every column (variable) in a DataFrame has the same length; however, each variable can have a different type, such as integer, float, string, name, boolean, etc.

When printed, data frames resemble matrices in that they are viewed as a rectangular grid, but a key difference is that the first row corresponds to the column (variable) names, and the first column corresponds to the row (individual) names. This row and column are treated as header meta-information and are not a part of the data. Moreover, the data stored in a DataFrame can be accessed using these header names, as well as by the standard numbered index. For more details, see the Guide to DataFrames in the online Maple Programming Help.

143674 questions
22
votes
2 answers

How to set values based on a condition on a subset of MultiIndex pandas dataframe

I want to take a subset of a MultiIndex pandas dataframe, test for values less than zero and set them to zero. For example: df = pd.DataFrame({('A','a'): [-1,-1,0,10,12], ('A','b'): [0,1,2,3,-1], ('B','a'):…
pbreach
  • 16,049
  • 27
  • 82
  • 120
22
votes
2 answers

Convert matrix to three column data.frame

I've got matrix: var1 var2 row1 1 2 row2 3 4 Want to convert it to data.frame: rows vars values row1 var1 1 row1 var2 2 row2 var1 3 row2 var2 4 What is the best way to do it?
Aleksandro M Granda
  • 665
  • 1
  • 8
  • 13
22
votes
3 answers

How to analyze all duplicate entries in this Pandas DataFrame?

I'd like to be able to compute descriptive statistics on data in a Pandas DataFrame, but I only care about duplicated entries. For example, let's say I have the DataFrame created by: import pandas as…
gammapoint
  • 1,083
  • 2
  • 15
  • 27
22
votes
7 answers

R: Convert factor column to multiple boolean columns

I am trying to convert a factor column into multiple boolean columns as the image below shows. The data is from weather stations as retrieved using the fine weatherData package. The factor column I want to convert into multiple boolean columns…
Jose R
  • 930
  • 1
  • 11
  • 22
22
votes
4 answers

Same function over multiple data frames in R

I am new to R, and this is a very simple question. I've found a lot of similar things to what I want but not exactly it. Basically I have multiple data frames and I simply want to run the same function across all of them. A for-loop could work but…
user3272284
  • 279
  • 2
  • 3
  • 10
22
votes
1 answer

Better way to filter a data frame with dplyr using OR?

I have a data frame in R with columns subject1 and subject2 (which contain Library of Congress subject headings). I'd like to filter the data frame by testing whether the subjects match an approved list. Say, for example, that I have this data…
Lincoln Mullen
  • 6,257
  • 4
  • 27
  • 30
22
votes
6 answers

Get rows that have the same value across its columns in pandas

In pandas, given a DataFrame D: +-----+--------+--------+--------+ | | 1 | 2 | 3 | +-----+--------+--------+--------+ | 0 | apple | banana | banana | | 1 | orange | orange | orange | | 2 | banana | apple | orange | | 3…
kentwait
  • 1,969
  • 2
  • 21
  • 42
22
votes
5 answers

Creating a summary statistical table from a data frame

I have the following data frame (df) of 29 observations of 5 variables: age height_seca1 height_chad1 height_DL weight_alog1 1 19 1800 1797 180 70 2 19 1682 1670 167 69 3 21…
pkg77x7
  • 925
  • 2
  • 7
  • 10
22
votes
1 answer

Combine two data frames with the same column names

I have two data.frame with this format (this is small part of the data sets): data.frame 1 ID precip lat lon 1 45 115 -122.5 2 42.5 130 -122.5 3 40 155 -122.5 4 37.5 140 -122.5 data.frame 2 precip lat lon 1 …
user3000796
  • 263
  • 1
  • 2
  • 5
22
votes
1 answer

Why does "^" on a data.frame return a matrix instead of a data.frame like "*" does?

This question is motivated by a bug filed here by Abiel Reinhart on data.table. I noticed that the same happens on data.frame as well. Here's an example: DF <- data.frame(x=1:5, y=6:10) > DF*DF x y 1 1 36 2 4 49 3 9 64 4 16 81 5 25…
Arun
  • 116,683
  • 26
  • 284
  • 387
22
votes
4 answers

DT[!(x == .)] and DT[x != .] treat NA in x inconsistently

This is something that I thought I should ask following this question. I'd like to confirm if this is a bug/inconsistency before filing it as a such in the R-forge tracker. Consider this data.table: require(data.table) DT <- data.table(x=c(1,0,NA),…
Arun
  • 116,683
  • 26
  • 284
  • 387
22
votes
3 answers

Turn Pandas DataFrame of strings into histogram

Suppose I have a DataFrame created like this: import pandas as pd s1 = pd.Series(['a', 'b', 'a', 'c', 'a', 'b']) s2 = pd.Series(['a', 'f', 'a', 'd', 'a', 'f', 'f']) d = pd.DataFrame({'s1': s1, 's2': s2}) There is quite a lot of sparsity in the…
amatsukawa
  • 841
  • 2
  • 10
  • 21
22
votes
3 answers

Put multiple data frames into list (smart way)

Is it possible to put a lot of data frames into a list in some easy way? Meaning instead of having to write each name manually like the following way: list_of_df <- list(data_frame1,data_frame2,data_frame3, ....) I have all the data frames loaded…
Martin Petri Bagger
  • 2,187
  • 4
  • 17
  • 20
22
votes
1 answer

Accessing column with df[col] gives: Error 'x' must be atomic for 'sort.list'

I have a very simple array on which I want to run ROC curve analysis. But first, when I try to force data into Factor type using the command table[1]<-factor(table[1]), I get the error Error in sort.list(y) : 'x' must be atomic for 'sort.list' Have you…
Maelstorm
  • 580
  • 2
  • 10
  • 29
22
votes
2 answers

How can I order a dataframe by the second column in R?

Possible Duplicate: How to sort a dataframe by column(s) in R I was just wondering if someone could help me out. I have what I thought should be an easy problem to solve. I have the table below: SampleID Cluster R0132F041p …
sinead
  • 269
  • 1
  • 4
  • 7