Questions tagged [dataframe]

A data frame is a two-dimensional tabular data structure. Typically, rows are observations and columns are variables, and the columns may be of different types (unlike an array or matrix). While "data frame" or "dataframe" is the term used for this concept in several languages and libraries (R, Apache Spark, Deedle, Maple, the pandas library in Python, and the DataFrames library in Julia), "table" is the term used in MATLAB and SQL.

Each section below covers one language that uses this term and is written for readers familiar only with that language.

data.frame in R

Data frames (object class data.frame) are one of the basic tabular data structures in the R language, alongside matrices. Unlike a matrix, a data frame can hold a different data type in each column. In terms of implementation, a data frame is a list of equal-length column vectors.

Type ?data.frame for help constructing a data frame. An example:

data.frame(
  x = letters[1:5], 
  y = 1:5, 
  z = (1:5) > 3
)
#   x y     z
# 1 a 1 FALSE
# 2 b 2 FALSE
# 3 c 3 FALSE
# 4 d 4  TRUE
# 5 e 5  TRUE

Related functions include is.data.frame, which tests whether an object is a data.frame, and as.data.frame, which coerces many other data structures to data.frame (through S3 dispatch; see ?S3). Several R packages have extended or modified the base data.frame to create new data structures. For further reading, see the paragraph on data frames in the CRAN manual Intro to R.


DataFrame in Python's pandas library

The pandas library in Python is the canonical tabular data framework on the SciPy stack, and the DataFrame is its two-dimensional data object. It is a rectangular array like a 2D numpy ndarray, but with associated indices on each axis which can be used for alignment. As in R, from an implementation perspective, columns are somewhat prioritized over rows: the DataFrame resembles a dictionary with column names as keys and Series (pandas' one-dimensional data structure) as values.

After importing numpy and pandas under the usual aliases (import numpy as np, import pandas as pd), we can construct a DataFrame in several ways, such as passing a dictionary of column names and values:

>>> pd.DataFrame({"x": list("abcde"), "y": range(1,6), "z": np.arange(1,6) > 3})
   x  y      z
0  a  1  False
1  b  2  False
2  c  3  False
3  d  4   True
4  e  5   True
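
The dictionary-like view of columns and the index alignment described above can be illustrated with a minimal sketch. It rebuilds the same example frame; the Series a and b and their labels ("p", "q", "r", "s") are made up for illustration only.

```python
import numpy as np
import pandas as pd

# Same example frame as above.
df = pd.DataFrame({"x": list("abcde"), "y": range(1, 6), "z": np.arange(1, 6) > 3})

# Selecting a column by name, as with a dictionary key, returns a Series.
col = df["y"]
print(type(col).__name__)

# Each column keeps its own dtype, unlike a homogeneous 2D ndarray.
print(df.dtypes)

# Arithmetic aligns on index labels, not on position.
# Labels present in only one operand produce NaN in the result.
a = pd.Series([1, 2, 3], index=["p", "q", "r"])
b = pd.Series([10, 20, 30], index=["q", "r", "s"])
total = a + b
print(total)
```

Here only the labels "q" and "r" appear in both Series, so only those two entries of total hold sums; "p" and "s" come out as NaN.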

DataFrame in Apache Spark

A Spark DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs. (source)


DataFrame in Maple

A DataFrame is one of the basic data structures in Maple. A data frame is a list of variables, known as DataSeries, which are displayed in a rectangular grid. Every column (variable) in a DataFrame has the same length, but each variable can have a different type, such as integer, float, string, name, or boolean.

When printed, data frames resemble matrices in that they are viewed as a rectangular grid, but a key difference is that the first row corresponds to the column (variable) names, and the first column corresponds to the row (individual) names. This row and column are treated as header meta-information and are not part of the data. Moreover, the data stored in a DataFrame can be accessed using these header names, as well as by the standard numbered index. For more details, see the Guide to DataFrames in the online Maple Programming Help.

143674 questions
23
votes
2 answers

Skipping range of rows after header through pandas.read_excel

I know the argument usecols in pandas.read_excel() allows you to select specific columns. Say, I read an Excel file in with pandas.read_excel(). My excel spreadsheet has 1161 rows. I want to keep the 1st row (with index 0), and skip rows 2:337.…
florence-y
23
votes
3 answers

Check if pandas dataframe is subset of other dataframe

I have two Python Pandas dataframes A, B, with the same columns (obviously with different data). I want to check A is a subset of B, that is, all rows of A are contained in B. Any idea how to do it?
user4316384
23
votes
9 answers

How to get the common index of two pandas dataframes?

I have two pandas DataFrames df1 and df2 and I want to transform them in order that they keep values only for the index that are common to the 2 dataframes. df1 values 1 0 28/11/2000 …
astudentofmaths
23
votes
4 answers

Pandas groupby mean - into a dataframe?

Say my data looks like…
Craig
23
votes
3 answers

Display rows where a column is False in pandas

I have a dataframe in which one column (dtype=bool) contains True/False values. I want to filter the records where the bool column == False. The script below gives an error, please help. if mFile['CCK'].str.contains(['False']): print(mFile.loc[mFile['CCK'] ==…
Learnings
23
votes
3 answers

How to merge dataframes based on an "OR" condition

Let's say I have two dataframes, and the column names for both are: table 1 columns: [ShipNumber, TrackNumber, ShipDate, Quantity, Weight] table 2 columns: [ShipNumber, TrackNumber, AmountReceived] I want to merge the two tables based on both…
alwaysaskingquestions
23
votes
4 answers

Retaining categorical dtype upon dataframe concatenation

I have two dataframes with identical column names and dtypes, similar to the following: A object B category C category The categories are not identical in each of the dataframes. When normally concatenating,…
tom
23
votes
5 answers

Pandas - Interleave / Zip two DataFrames by row

Suppose I have two dataframes: >> df1 0 1 2 0 a b c 1 d e f >> df2 0 1 2 0 A B C 1 D E F How can I interleave the rows? i.e. get this: >> interleaved_df 0 1 2 0 a b c 1 A B C 2 d e f 3 D E F (Note my…
OmerB
23
votes
2 answers

Extract name of data.frame in R as character

How can I extract the name of a data.frame in R as a character? For example, if I have data.frame named df, I want to get "df" as a character object.
Gaurav Bansal
23
votes
4 answers

Generate a pandas dataframe from ordereddict?

I am trying to create a pandas dataframe from an ordereddict to preserve the order of the values. But for some reason after creating the dataframe the fields are messed up again. Here's the list of ordereddicts: [OrderedDict([ ('key_a', …
E. Muuli
23
votes
1 answer

How to shuffle the rows in a Spark dataframe?

I have a dataframe like this: +---+---+ |_c0|_c1| +---+---+ |1.0|4.0| |1.0|4.0| |2.1|3.0| |2.1|3.0| |2.1|3.0| |2.1|3.0| |3.0|6.0| |4.0|5.0| |4.0|5.0| |4.0|5.0| +---+---+ and I would like to shuffle all the rows using Spark in Scala. How can I do…
Laure D
23
votes
10 answers

How to remove seconds from datetime?

I have the following date and I tried the following code, df['start_date_time'] = ["2016-05-19 08:25:00","2016-05-19 16:00:00","2016-05-20 07:45:00","2016-05-24 12:50:00","2016-05-25 23:00:00","2016-05-26 19:45:00"] df['start_date_time'] =…
user7779326
23
votes
3 answers

How to check if a particular cell in pandas DataFrame isnull?

I have the following df in pandas. 0 A B C 1 2 NaN 8 How can I check if df.iloc[1]['B'] is NaN? I tried using df.isnan() and I get a table like this: 0 A B C 1 false true false but I am not sure how to…
Newskooler
23
votes
1 answer

Create adjacency matrix for two columns in pandas dataframe

I have a dataframe of the form: index Name_A Name_B 0 Adam Ben 1 Chris David 2 Adam Chris 3 Ben Chris And I'd like to obtain the adjacency matrix for Name_A and Name_B, ie: Adam Ben Chris David Adam 0 1 …
The Ref
23
votes
1 answer

How to do a pandas groupby operation on one column but keep the other in the resulting dataframe

My question is about groupby operation with pandas. I have the following DataFrame : In [4]: df = pd.DataFrame({"A": range(4), "B": ["PO", "PO", "PA", "PA"], "C": ["Est", "Est", "West", "West"]}) In [5]: df Out[5]: A B C 0 0 PO Est 1 …
Ger