Questions tagged [dataframe]

A data frame is a 2D tabular data structure. Usually, it contains data where rows are observations and columns are variables and are allowed to be of different types (as distinct from an array or matrix). While "data frame" or "dataframe" is the term used for this concept in several languages (R, Apache Spark, deedle, Maple, the pandas library in Python and the DataFrames library in Julia), "table" is the term used in MATLAB and SQL.

Each section below covers one language that uses this term and is written for readers familiar only with that language.

data.frame in R

Data frames (object class data.frame) are one of the basic tabular data structures in the R language, alongside matrices. Unlike matrices, each column can be a different data type. In terms of implementation, a data frame is a list of equal-length column vectors.

Type ?data.frame for help constructing a data frame. An example:

data.frame(
  x = letters[1:5], 
  y = 1:5, 
  z = (1:5) > 3
)
#   x y     z
# 1 a 1 FALSE
# 2 b 2 FALSE
# 3 c 3 FALSE
# 4 d 4  TRUE
# 5 e 5  TRUE

Related functions include is.data.frame, which tests whether an object is a data.frame, and as.data.frame, which coerces many other data structures to data.frame (through S3 dispatch; see ?S3). Several R packages have extended or modified the base data.frame to create new data structures. For further reading, see the section on data frames in the CRAN manual An Introduction to R.


DataFrame in Python's pandas library

The pandas library in Python is the canonical tabular data framework on the SciPy stack, and the DataFrame is its two-dimensional data object. It is essentially a rectangular array like a 2D numpy ndarray, but with an associated index on each axis that can be used for alignment. As in R, columns take some priority over rows in the implementation: the DataFrame resembles a dictionary with column names as keys and Series (pandas' one-dimensional data structure) as values.

After importing numpy and pandas under the usual aliases (import numpy as np, import pandas as pd), we can construct a DataFrame in several ways, such as passing a dictionary of column names and values:

>>> pd.DataFrame({"x": list("abcde"), "y": range(1,6), "z": np.arange(1,6) > 3})
   x  y      z
0  a  1  False
1  b  2  False
2  c  3  False
3  d  4   True
4  e  5   True
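
As a small sketch of the two points above (the dictionary-like column access and the index-based alignment), the following rebuilds the same DataFrame and then adds two Series whose labels are in different orders. The Series names a and b and their labels p, q, r are made up for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": list("abcde"), "y": range(1, 6), "z": np.arange(1, 6) > 3})

# Selecting a column by name returns a Series, much like a dictionary lookup.
y = df["y"]
print(type(y).__name__)   # Series
print(list(df.keys()))    # ['x', 'y', 'z']

# Arithmetic aligns on index labels, not on position.
# a and b are hypothetical Series with the same labels in different orders.
a = pd.Series([1, 2, 3], index=["p", "q", "r"])
b = pd.Series([10, 20, 30], index=["r", "q", "p"])
total = a + b
print(total.to_dict())    # {'p': 31, 'q': 22, 'r': 13}
```

Because the addition matches values by label, "p" in a (1) is paired with "p" in b (30) even though they sit at opposite ends of the two Series.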

DataFrame in Apache Spark

A Spark DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources, such as structured data files, tables in Hive, external databases, or existing RDDs.


DataFrame in Maple

A DataFrame is one of the basic data structures in Maple. A data frame is a list of variables, known as DataSeries, which are displayed in a rectangular grid. Every column (variable) in a DataFrame has the same length, but each variable can have a different type, such as integer, float, string, name, or boolean.

When printed, data frames resemble matrices in that they are viewed as a rectangular grid, but a key difference is that the first row corresponds to the column (variable) names, and the first column corresponds to the row (individual) names. This header row and header column are treated as meta-information and are not part of the data. Moreover, the data stored in a DataFrame can be accessed using these header names, as well as by the standard numbered index. For more details, see the Guide to DataFrames in the online Maple Programming Help.

143674 questions
784
votes
32 answers

How do I count the NaN values in a column in pandas DataFrame?

I want to find the number of NaN in each column of my data.
user3799307
778
votes
11 answers

Get statistics for each group (such as count, mean, etc) using pandas GroupBy?

I have a dataframe df and I use several columns from it to groupby: df['col1','col2','col3','col4'].groupby(['col1','col2']).mean() In the above way, I almost get the table (dataframe) that I need. What is missing is an additional column that…
Roman
772
votes
24 answers

Set value for particular cell in pandas DataFrame using index

I have created a Pandas DataFrame df = DataFrame(index=['A','B','C'], columns=['x','y']) and have got this x y A NaN NaN B NaN NaN C NaN NaN Now, I would like to assign a value to particular cell, for example to row C and column x. I…
Mitkp
766
votes
23 answers

Constructing pandas DataFrame from values in variables gives "ValueError: If using all scalar values, you must pass an index"

This may be a simple question, but I can not figure out how to do this. Lets say that I have two variables as follows. a = 2 b = 3 I want to construct a DataFrame from this: df2 = pd.DataFrame({'A':a,'B':b}) This generates an error: ValueError:…
Nilani Algiriyage
751
votes
20 answers

Import multiple CSV files into pandas and concatenate into one DataFrame

I would like to read several CSV files from a directory into pandas and concatenate them into one big DataFrame. I have not been able to figure it out though. Here is what I have so far: import glob import pandas as pd # Get data file names path =…
jonas
720
votes
15 answers

How to apply a function to two columns of Pandas dataframe

Suppose I have a df which has columns of 'ID', 'col_1', 'col_2'. And I define a function : f = lambda x, y : my_function_expression. Now I want to apply the f to df's two columns 'col_1', 'col_2' to element-wise calculate a new column 'col_3' ,…
bigbug
697
votes
11 answers

Difference between map, applymap and apply methods in Pandas

Can you tell me when to use these vectorization methods with basic examples? I see that map is a Series method whereas the rest are DataFrame methods. I got confused about apply and applymap methods though. Why do we have two methods for applying a…
marillion
697
votes
19 answers

How can I get a value from a cell of a dataframe?

I have constructed a condition that extracts exactly one row from my dataframe: d2 = df[(df['l_ext']==l_ext) & (df['item']==item) & (df['wn']==wn) & (df['wd']==1)] Now I would like to take a value from a particular column: val = d2['col_name'] But…
Roman
693
votes
28 answers

How to check if any value is NaN in a Pandas DataFrame

In Python Pandas, what's the best way to check whether a DataFrame has one (or more) NaN values? I know about the function pd.isnan, but this returns a DataFrame of booleans for each element. This post right here doesn't exactly answer my question…
hlin117
691
votes
16 answers

Convert pandas dataframe to NumPy array

How do I convert a pandas dataframe into a NumPy array? DataFrame: import numpy as np import pandas as pd index = [1, 2, 3, 4, 5, 6, 7] a = [np.nan, np.nan, np.nan, 0.1, 0.1, 0.1, 0.1] b = [0.2, np.nan, 0.2, 0.2, 0.2, np.nan, np.nan] c = [np.nan,…
Mister Nobody
683
votes
25 answers

UnicodeDecodeError when reading CSV file in Pandas

I'm running a program which is processing 30,000 similar files. A random number of them are stopping and producing this error... File "C:\Importer\src\dfman\importer.py", line 26, in import_chr data = pd.read_csv(filepath, names=fields) File…
TravisVOX
673
votes
12 answers

Converting a Pandas GroupBy output from Series to DataFrame

I'm starting with input data like this df1 = pandas.DataFrame( { "Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] , "City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"] } ) Which when printed…
saveenr
665
votes
11 answers

The difference between bracket [ ] and double bracket [[ ]] for accessing the elements of a list or dataframe

R provides two different methods for accessing the elements of a list or data.frame: [] and [[]]. What is the difference between the two, and when should I use one over the other?
Sharpie
637
votes
5 answers

How to check whether a pandas DataFrame is empty?

How to check whether a pandas DataFrame is empty? In my case I want to print some message in terminal if the DataFrame is empty.
Nilani Algiriyage
634
votes
26 answers

Convert a list to a data frame

I have a nested list of data. Its length is 132 and each item is a list of length 20. Is there a quick way to convert this structure into a data frame that has 132 rows and 20 columns of data? Here is some sample data to work with: l <- replicate( …
Btibert3