Questions tagged [dataframe]

A data frame is a two-dimensional tabular data structure. Usually, rows are observations and columns are variables, and the columns are allowed to be of different types (as distinct from an array or matrix). While "data frame" or "dataframe" is the term used for this concept in several languages and libraries (R, Apache Spark, deedle, Maple, the pandas library in Python and the DataFrames library in Julia), "table" is the term used in MATLAB and SQL.

The sections below correspond to each language that uses this term and are aimed at the level of an audience only familiar with the given language.

data.frame in R

Data frames (object class data.frame) are one of the basic tabular data structures in the R language, alongside matrices. Unlike matrices, each column can be a different data type. In terms of implementation, a data frame is a list of equal-length column vectors.

Type ?data.frame for help constructing a data frame. An example:

data.frame(
  x = letters[1:5], 
  y = 1:5, 
  z = (1:5) > 3
)
#   x y     z
# 1 a 1 FALSE
# 2 b 2 FALSE
# 3 c 3 FALSE
# 4 d 4  TRUE
# 5 e 5  TRUE

Related functions include is.data.frame, which tests whether an object is a data.frame, and as.data.frame, which coerces many other data structures to data.frame (through S3 dispatch; see ?S3). Several R packages have extended or modified base data.frames to create new data structures. For further reading, see the paragraph on data frames in the CRAN manual An Introduction to R.


DataFrame in Python's pandas library

The pandas library in Python is the canonical tabular data framework on the SciPy stack, and the DataFrame is its two-dimensional data object. It is basically a rectangular array like a 2D numpy ndarray, but with associated indices on each axis that can be used for alignment. As in R, from an implementation perspective, columns are somewhat prioritized over rows: the DataFrame resembles a dictionary with column names as keys and Series (pandas' one-dimensional data structure) as values.
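The dictionary-like behaviour and the index alignment described above can be illustrated with a small sketch. The column names, row labels (r1 to r3) and values here are made up for illustration and are not part of the original text:

```python
import pandas as pd

# Hypothetical toy data with explicit row labels.
df = pd.DataFrame({"x": list("abc"), "y": [1, 2, 3]}, index=["r1", "r2", "r3"])

# Selecting a column by name, as with a dictionary key, returns a Series.
col = df["y"]
print(type(col).__name__)  # Series
print(list(df.keys()))     # ['x', 'y']

# Arithmetic aligns on index labels rather than on position:
# "other" lists the same labels in reverse order.
other = pd.Series([10, 20, 30], index=["r3", "r2", "r1"])
total = df["y"] + other
print(total)               # r1 -> 31, r2 -> 22, r3 -> 13
```

Because the two objects are matched by label, the reversed order of other does not affect the result.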

After importing numpy and pandas under the usual aliases (import numpy as np, import pandas as pd), we can construct a DataFrame in several ways, such as passing a dictionary of column names and values:

>>> pd.DataFrame({"x": list("abcde"), "y": range(1,6), "z": np.arange(1,6) > 3})
   x  y      z
0  a  1  False
1  b  2  False
2  c  3  False
3  d  4   True
4  e  5   True
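As a rough sketch of two of the other construction routes, a DataFrame can also be built from a list of row records or from a 2D NumPy array. The variable names and data below are illustrative assumptions only:

```python
import numpy as np
import pandas as pd

# From a list of row records (one dict per row); keys become column names.
records = [{"x": "a", "y": 1}, {"x": "b", "y": 2}]
from_records = pd.DataFrame(records)

# From a 2D NumPy array, supplying the column names explicitly.
from_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["p", "q"])

print(from_records.shape)        # (2, 2)
print(list(from_array.columns))  # ['p', 'q']
print(from_array["q"].tolist())  # [1, 3, 5]
```

The dictionary form shown earlier is column-oriented, while the list of records is row-oriented; which one is more convenient depends on how the source data arrives.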

DataFrame in Apache Spark

A Spark DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources, such as structured data files, tables in Hive, external databases, or existing RDDs. (Source: the Spark SQL programming guide.)


DataFrame in Maple

A DataFrame is one of the basic data structures in Maple. A data frame is a list of variables, known as DataSeries, which are displayed in a rectangular grid. Every column (variable) in a DataFrame has the same length; however, each variable can have a different type, such as integer, float, string, name, or boolean.

When printed, data frames resemble matrices in that they are viewed as a rectangular grid, but a key difference is that the first row corresponds to the column (variable) names, and the first column corresponds to the row (individual) names. This row and column are treated as header meta-information and are not part of the data. Moreover, the data stored in a DataFrame can be accessed using these header names, as well as by the standard numbered index. For more details, see the Guide to DataFrames in the online Maple Programming Help.

143674 questions
21
votes
3 answers

Dataframe head not shown in PyCharm

I have the following code in PyCharm import pandas as pd import numpy as np import matplotlib as plt df = pd.read_csv("c:/temp/datafile.txt", sep='\t') df.head(10) I get the following output: Process finished with exit code 0 I am supposed to…
user1774127
21
votes
5 answers

What is the meaning of "axis" attribute in a Pandas DataFrame?

Taking the following example: >>> df1 = pd.DataFrame({"x":[1, 2, 3, 4, 5], "y":[3, 4, 5, 6, 7]}, index=['a', 'b', 'c', 'd', 'e']) >>> df2 = pd.DataFrame({"y":[1, 3, 5, 7, 9], …
user4881093
21
votes
2 answers

Pandas replace with default value

I have a pandas dataframe I want to replace a certain column conditionally. eg: col 0 Mr 1 Miss 2 Mr 3 Mrs 4 Col. I want to map them as {'Mr': 0, 'Mrs': 1, 'Miss': 2} If there are other titles now available in the dict then I want them to…
21
votes
4 answers

pandas get_level_values for multiple columns

Is there a way to get the result of get_level_values for more than one column? Given the following DataFrame: d a b c 1 4 10 16 11 17 5 12 18 2 5 13 19 6 14 20 3 7 15 21 I wish to get the values (i.e. list of tuples) of…
danielhadar
21
votes
5 answers

DataFrame sorting based on a function of multiple column values

Based on python, sort descending dataframe with pandas: Given: from pandas import DataFrame import pandas as pd d = {'x':[2,3,1,4,5], 'y':[5,4,3,2,1], 'letter':['a','a','b','b','c']} df = DataFrame(d) df then looks like this: df: …
Ohumeronen
21
votes
3 answers

convert pandas dataframe column from hex string to int

I have a very large dataframe that I would like to avoid iterating through every single row and want to convert the entire column from hex string to int. It doesn't process the string correctly with astype but has no problems with a single entry. Is…
kaminsknator
21
votes
3 answers

Selecting columns with condition on Pandas DataFrame

I have a dataframe looking like this. col1 col2 0 something1 something1 1 something2 something3 2 something1 something1 3 something2 something3 4 something1 something2 I'm trying to filter all rows that have something1…
user3368526
21
votes
1 answer

How to delete a row in a data frame by name in R

I'm trying to delete a row from a data frame in which each row has a name. I cannot use indexes to delete the rows, only it's name. I have this dataframe: DF<- data.frame('2014' = c(30,20,4, 50), '2015' = c(25,40,6, 65), row.names = c("mobile…
Theo Sloot
21
votes
2 answers

Applying sqrt function on a column

I have following data frame data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012], 'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions', 'Lions', 'Lions'], 'wins': [11, 8, 10, 15, 11, 6, 10,…
Night Walker
21
votes
3 answers

How to select a list of rows by name in Pandas dataframe

I am trying to extract rows from a Pandas dataframe using a list of row names, but it can't be done. Here is an example # df alleles chrom pos strand assembly# center protLSID assayLSID rs# TP3 A/C 0 3 + NaN …
upendra
21
votes
2 answers

Numbers as column names of data frames

Is there a reason why R won't allow me to have a number as the column name of my dataframe? Also noticed that if i do data.frame(XX) it adds an X to all the column headers that have numbers at the front.
Nathaniel Saxe
21
votes
5 answers

Applying dplyr's rename to all columns while using pipe operator

I'm working with an imported data set that corresponds to the extract below: set.seed(1) dta <- data.frame("This is Column One" = runif(n = 10), "Another amazing Column name" = runif(n = 10), "!## This…
Konrad
21
votes
2 answers

Reorder levels of MultiIndex in a pandas DataFrame

I have a DataFrame that looks something like this: >>> df = pd.DataFrame(index=pd.MultiIndex.from_tuples([(num,letter,color) for num in range(1,3) for letter in ['a','b','c'] for color in ['Red','Green']], …
AJG519
21
votes
5 answers

What is going wrong with `unionAll` of Spark `DataFrame`?

Using Spark 1.5.0 and given the following code, I expect unionAll to union DataFrames based on their column name. In the code, I'm using some FunSuite for passing in SparkContext sc: object Entities { case class A (a: Int, b: Int) case class B…
Martin Senne
21
votes
2 answers

Convert A Column In Pandas to One Long String (Python 3)

How can I convert a pandas column into one long string? For example, convert the following DF: Keyword James Went To The Market To read as Keyword James went to the market Any help?
user3682157