Questions tagged [dataframe]

A data frame is a 2D tabular data structure. Usually, it contains data where rows are observations and columns are variables and are allowed to be of different types (as distinct from an array or matrix). While "data frame" or "dataframe" is the term used for this concept in several languages (R, Apache Spark, deedle, Maple, the pandas library in Python and the DataFrames library in Julia), "table" is the term used in MATLAB and SQL.

A data frame is a tabular data structure. Usually, it contains data where rows are observations and columns are variables of various types. While data frame or dataframe is the term used for this concept in several languages (R, Apache Spark, deedle, Maple, the pandas library in Python and the DataFrames library in Julia), table is the term used in MATLAB and SQL.

The sections below correspond to each language that uses this term and are aimed at the level of an audience only familiar with the given language.

data.frame in R

Data frames (object class data.frame) are one of the basic tabular data structures in the R language, alongside matrices. Unlike matrices, each column can be a different data type. In terms of implementation, a data frame is a list of equal-length column vectors.

Type ?data.frame for help constructing a data frame. An example:

data.frame(
  x = letters[1:5], 
  y = 1:5, 
  z = (1:5) > 3
)
#   x y     z
# 1 a 1 FALSE
# 2 b 2 FALSE
# 3 c 3 FALSE
# 4 d 4  TRUE
# 5 e 5  TRUE

Related functions include is.data.frame, which tests whether an object is a data.frame; and as.data.frame, which coerces many other data structures to data.frame (through S3 dispatch, see ?S3). base data.frames have been extended or modified to create new data structures by several R packages, including and . For further reading, see the paragraph on Data frames in the CRAN manual Intro to R


DataFrame in Python's pandas library

The pandas library in Python is the canonical tabular data framework on the SciPy stack, and the DataFrame is its two-dimensional data object. It is basically a rectangular array like a 2D numpy ndarray, but with associated indices on each axis which can be used for alignment. As in R, from an implementation perspective, columns are somewhat prioritized over rows: the DataFrame resembles a dictionary with column names as keys and Series (pandas' one-dimensional data structure) as values. The DataFrame object in pandas.

After importing numpy and pandas under the usual aliases (import numpy as np, import pandas as pd), we can construct a DataFrame in several ways, such as passing a dictionary of column names and values:

>>> pd.DataFrame({"x": list("abcde"), "y": range(1,6), "z": np.arange(1,6) > 3})
   x  y      z
0  a  1  False
1  b  2  False
2  c  3  False
3  d  4   True
4  e  5   True

DataFrame in Apache Spark

A Spark DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs. (source)


DataFrame in Maple

A DataFrame is one of the basic data structures in Maple. Data frames are a list of variables, known as DataSeries, which are displayed in a rectangular grid. Every column (variable) in a DataFrame has the same length, however, each variable can have a different type, such as integer, float, string, name, boolean, etc.

When printed, Data frames resemble matrices in that they are viewed as a rectangular grid, but a key difference is that the first row corresponds to the column (variable) names, and the first column corresponds to the row (individual) names. These row and columns are treated as header meta-information and are not a part of the data. Moreover, the data stored in a DataFrame can be accessed using these header names, as well as by the standard numbered index. For more details, see the Guide to DataFrames in the online Maple Programming Help.

143674 questions
451
votes
13 answers

How to reversibly store and load a Pandas dataframe to/from disk

Right now I'm importing a fairly large CSV as a dataframe every time I run the script. Is there a good solution for keeping that dataframe constantly available in between runs so I don't have to spend all that time waiting for the script to run?
jeffstern
  • 4,706
  • 4
  • 16
  • 10
449
votes
7 answers

Remove pandas rows with duplicate indices

How to remove rows with duplicate index values? In the weather DataFrame below, sometimes a scientist goes back and corrects observations -- not by editing the erroneous rows, but by appending a duplicate row to the end of a file. I'm reading some…
Paul H
  • 65,268
  • 20
  • 159
  • 136
447
votes
7 answers

How to add pandas data to an existing csv file?

I want to know if it is possible to use the pandas to_csv() function to add a dataframe to an existing csv file. The csv file has the same structure as the loaded data.
Ayoub Ennassiri
  • 4,606
  • 3
  • 13
  • 9
445
votes
10 answers

Extracting specific columns from a data frame

I have an R data frame with 6 columns, and I want to create a new dataframe that only has three of the columns. Assuming my data frame is df, and I want to extract columns A, B, and E, this is the only command I can figure out: …
Aren Cambre
  • 6,540
  • 9
  • 30
  • 36
431
votes
13 answers

Sample random rows in dataframe

I am struggling to find the appropriate function that would return a specified number of rows picked up randomly without replacement from a data frame in R language? Can anyone help me out?
nikhil
  • 9,023
  • 22
  • 55
  • 81
423
votes
10 answers

Creating a Pandas DataFrame from a Numpy array: How do I specify the index column and column headers?

I have a Numpy array consisting of a list of lists, representing a two-dimensional array with row labels and column names as shown below: data = np.array([['','Col1','Col2'],['Row1',1,2],['Row2',3,4]]) I'd like the resulting DataFrame to have Row1…
user3132783
  • 5,275
  • 3
  • 15
  • 7
417
votes
15 answers

pandas: filter rows of DataFrame with operator chaining

Most operations in pandas can be accomplished with operator chaining (groupby, aggregate, apply, etc), but the only way I've found to filter rows is via normal bracket indexing df_filtered = df[df['column'] == value] This is unappealing as it…
duckworthd
  • 14,679
  • 16
  • 53
  • 68
412
votes
12 answers

Convert a Pandas DataFrame to a dictionary

I have a DataFrame with four columns. I want to convert this DataFrame to a python dictionary. I want the elements of first column be keys and the elements of other columns in the same row be values. DataFrame: ID A B C 0 p 1 3 …
Prince Bhatti
  • 4,671
  • 4
  • 18
  • 24
411
votes
10 answers

How to get/set a pandas index column title or name?

How do I get the index column name in Python's pandas? Here's an example dataframe: Column 1 Index Title Apples 1 Oranges 2 Puppies 3 Ducks 4 What I'm trying to do is…
Radical Edward
  • 5,234
  • 5
  • 21
  • 33
406
votes
17 answers

pandas get rows which are NOT in other dataframe

I've two pandas data frames that have some rows in common. Suppose dataframe2 is a subset of dataframe1. How can I get the rows of dataframe1 which are not in dataframe2? df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5], 'col2' : [10, 11, 12,…
think nice things
  • 4,315
  • 3
  • 14
  • 12
406
votes
27 answers

What does axis in pandas mean?

Here is my code to generate a dataframe: import pandas as pd import numpy as np dff = pd.DataFrame(np.random.randn(1,2),columns=list('AB')) then I got the dataframe: +------------+---------+--------+ | | A | B …
jerry_sjtu
  • 5,216
  • 8
  • 29
  • 42
400
votes
9 answers

Combining two Series into a DataFrame in pandas

I have two Series s1 and s2 with the same (non-consecutive) indices. How do I combine s1 and s2 to being two columns in a DataFrame and keep one of the indices as a third column?
user7289
  • 32,560
  • 28
  • 71
  • 88
398
votes
12 answers

What is the most efficient way to loop through dataframes with pandas?

I want to perform my own complex operations on financial data in dataframes in a sequential manner. For example I am using the following MSFT CSV file taken from Yahoo Finance: Date,Open,High,Low,Close,Volume,Adj…
Muppet
  • 5,767
  • 6
  • 29
  • 39
398
votes
18 answers

Convert data.frame columns from factors to characters

I have a data frame. Let's call him bob: > head(bob) phenotype exclusion GSM399350 3- 4- 8- 25- 44+ 11b- 11c- 19- NK1.1- Gr1- TER119- GSM399351 3- 4- 8- 25- 44+ 11b- 11c- 19- NK1.1- Gr1- TER119- GSM399352 3-…
Mike Dewar
  • 10,945
  • 14
  • 49
  • 65
397
votes
13 answers

Using Pandas to pd.read_excel() for multiple worksheets of the same workbook

I have a large spreadsheet file (.xlsx) that I'm processing using python pandas. It happens that I need data from two tabs (sheets) in that large file. One of the tabs has a ton of data and the other is just a few square cells. When I use…
HaPsantran
  • 5,581
  • 6
  • 24
  • 39