Questions tagged [dataframe]

A data frame is a 2D tabular data structure. Usually, it contains data where rows are observations and columns are variables, and the columns are allowed to be of different types (as distinct from an array or matrix). While "data frame" or "dataframe" is the term used for this concept in several languages (R, Apache Spark, Deedle, Maple, the pandas library in Python and the DataFrames library in Julia), "table" is the term used in MATLAB and SQL.


The sections below correspond to each language that uses this term and are aimed at the level of an audience only familiar with the given language.

data.frame in R

Data frames (object class data.frame) are one of the basic tabular data structures in the R language, alongside matrices. Unlike matrices, each column can be a different data type. In terms of implementation, a data frame is a list of equal-length column vectors.

Type ?data.frame for help constructing a data frame. An example:

data.frame(
  x = letters[1:5], 
  y = 1:5, 
  z = (1:5) > 3
)
#   x y     z
# 1 a 1 FALSE
# 2 b 2 FALSE
# 3 c 3 FALSE
# 4 d 4  TRUE
# 5 e 5  TRUE

Related functions include is.data.frame, which tests whether an object is a data.frame, and as.data.frame, which coerces many other data structures to data.frame (through S3 dispatch; see ?S3). Base data.frames have been extended or modified to create new data structures by several R packages, including data.table and tibble. For further reading, see the section on data frames in the CRAN manual An Introduction to R.


DataFrame in Python's pandas library

The pandas library in Python is the canonical tabular data framework on the SciPy stack, and the DataFrame is its two-dimensional data object. It is essentially a rectangular array like a 2D numpy ndarray, but with an associated index on each axis that can be used for alignment. As in R, from an implementation perspective, columns are somewhat prioritized over rows: the DataFrame resembles a dictionary with column names as keys and Series (pandas' one-dimensional data structure) as values. See the DataFrame object in the pandas documentation.

After importing numpy and pandas under the usual aliases (import numpy as np, import pandas as pd), we can construct a DataFrame in several ways, such as passing a dictionary of column names and values:

>>> pd.DataFrame({"x": list("abcde"), "y": range(1,6), "z": np.arange(1,6) > 3})
   x  y      z
0  a  1  False
1  b  2  False
2  c  3  False
3  d  4   True
4  e  5   True
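The index-based alignment mentioned above can be sketched with Series, the one-dimensional building block of a DataFrame (a minimal illustration; the labels and values are made up):

```python
import pandas as pd

# Two Series with partially overlapping index labels.
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Arithmetic aligns on index labels, not on position;
# labels present in only one operand yield NaN.
total = s1 + s2
print(total)
# a     NaN
# b    12.0
# c    23.0
# d     NaN
```

The same label alignment applies along both axes of a DataFrame, which is what makes combining differently ordered or differently sized tables safe.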

DataFrame in Apache Spark

A Spark DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs (from the Spark SQL programming guide).


DataFrame in Maple

A DataFrame is one of the basic data structures in Maple. A data frame is a list of variables, known as DataSeries, which are displayed in a rectangular grid. Every column (variable) in a DataFrame has the same length; however, each variable can have a different type, such as integer, float, string, name, or boolean.

When printed, data frames resemble matrices in that they are viewed as a rectangular grid, but a key difference is that the first row corresponds to the column (variable) names and the first column corresponds to the row (individual) names. These row and column names are treated as header meta-information and are not part of the data. Moreover, the data stored in a DataFrame can be accessed using these header names, as well as by the standard numbered indices. For more details, see the Guide to DataFrames in the online Maple Programming Help.
