Questions tagged [dataframe]

A data frame is a 2D tabular data structure. Usually it contains data where rows are observations and columns are variables, and different columns are allowed to have different types (as distinct from an array or matrix). While "data frame" or "dataframe" is the term used for this concept in several languages (R, Apache Spark, Deedle, Maple, the pandas library in Python and the DataFrames library in Julia), "table" is the term used in MATLAB and SQL.

The sections below correspond to the languages that use this term, and each is written for an audience familiar only with that language.

data.frame in R

Data frames (object class data.frame) are one of the basic tabular data structures in the R language, alongside matrices. Unlike matrices, each column can be a different data type. In terms of implementation, a data frame is a list of equal-length column vectors.

Type ?data.frame for help constructing a data frame. An example:

data.frame(
  x = letters[1:5], 
  y = 1:5, 
  z = (1:5) > 3
)
#   x y     z
# 1 a 1 FALSE
# 2 b 2 FALSE
# 3 c 3 FALSE
# 4 d 4  TRUE
# 5 e 5  TRUE

Related functions include is.data.frame, which tests whether an object is a data.frame, and as.data.frame, which coerces many other data structures to data.frame (through S3 dispatch; see ?S3). Several R packages extend or modify base data.frames to create new data structures. For further reading, see the section on data frames in the CRAN manual An Introduction to R.


DataFrame in Python's pandas library

The pandas library in Python is the canonical tabular data framework on the SciPy stack, and the DataFrame is its two-dimensional data object. A DataFrame is essentially a rectangular array, like a 2D numpy ndarray, but with an index attached to each axis that can be used for alignment. As in R, columns are somewhat prioritized over rows in the implementation: a DataFrame resembles a dictionary with column names as keys and Series (pandas' one-dimensional data structure) as values.

After importing numpy and pandas under the usual aliases (import numpy as np, import pandas as pd), we can construct a DataFrame in several ways, such as passing a dictionary of column names and values:

>>> pd.DataFrame({"x": list("abcde"), "y": range(1,6), "z": np.arange(1,6) > 3})
   x  y      z
0  a  1  False
1  b  2  False
2  c  3  False
3  d  4   True
4  e  5   True
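
Because each axis carries an index, arithmetic between pandas objects aligns on labels rather than positions. A small sketch with toy data:

```python
import pandas as pd

# Two Series with partially overlapping index labels
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Addition aligns on the labels, not on positions; labels present
# in only one operand produce NaN in the result.
total = s1 + s2
print(total)
# a     NaN
# b    12.0
# c    23.0
# d     NaN
# dtype: float64
```

This label alignment is the main behavioral difference from a plain numpy array, where the same operation would pair elements purely by position.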

DataFrame in Apache Spark

A Spark DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources, such as structured data files, tables in Hive, external databases, or existing RDDs (from the Spark documentation).


DataFrame in Maple

A DataFrame is one of the basic data structures in Maple. A data frame is a list of variables, known as DataSeries, displayed in a rectangular grid. Every column (variable) in a DataFrame has the same length, but each variable can have a different type, such as integer, float, string, name, or boolean.

When printed, data frames resemble matrices in that they are displayed as a rectangular grid, but a key difference is that the first row holds the column (variable) names and the first column holds the row (individual) names. These row and column names are treated as header meta-information and are not part of the data. Moreover, the data stored in a DataFrame can be accessed using these header names, as well as by the standard numbered index. For more details, see the Guide to DataFrames in the online Maple Programming Help.

143674 questions
finding the index of a max value in R (22 votes, 4 answers, asked by kimmyjo221)
I have the following data frame called surge: MeshID StormID Rate Surge Wind 1 1412 1.0000E-01 0.01 0.0 2 1412 1.0000E-01 0.03 0.0 3 1412 1.0000E-01 0.09 0.0 4 1412 1.0000E-01 0.12 0.0 5 1412…

Add characters to a numeric column in dataframe (21 votes, 3 answers, asked by Lisann)
I have a dataframe like this: V1 V2 V3 1 1 3423086 3423685 2 1 3467184 3467723 3 1 4115236 4115672 4 1 5202437 5203057 5 2 7132558 7133089 6 2 7448688 7449283 I want to change the V1 column and add chr before the number.…

Importing wikipedia tables in R (21 votes, 6 answers, asked by karlos)
I regularly extract tables from Wikipedia. Excel's web import does not work properly for wikipedia, as it treats the whole page as a table. In google spreadsheet, I can enter…

pandas.Int64Index fix for FutureWarning (21 votes, 5 answers, asked by AAmes)
Just getting this new warning for my dataframes that are loaded from excel. I understand if I were to pd.DataFrame I could set the index, but I am not clear how to set the dataframe index type when I am loading from a…

Pandas error: "IndexError: iloc cannot enlarge its target object" (21 votes, 2 answers, asked by Moradnejad)
I want to replace the value of a dataframe cell using pandas. I'm using this line: submission.iloc[i, coli] = train2.iloc[i2, coli-1] I get the following error line: IndexError: iloc cannot enlarge its target object What is the reason for this?

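
A common reading of this error is that .iloc addresses only positions that already exist, while .loc can enlarge a frame by introducing a new label. A small sketch with toy data (the column name a is arbitrary):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})

# iloc assigns only to existing positions...
df.iloc[0, 0] = 10          # fine: row 0, column 0 already exist

# ...so addressing a position past the end raises IndexError
try:
    df.iloc[5, 0] = 99
except IndexError as e:
    print(e)                # iloc cannot enlarge its target object

# loc, by contrast, can enlarge the frame by adding a new row label
df.loc[5, "a"] = 99
```

So assignments that may create new rows generally go through label-based .loc, not position-based .iloc.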
Feather format for long term storage since the release of apache arrow 1.0.1 (21 votes, 1 answer, asked by Serelia)
As I'm given to understand due to the search of issues in the Feather Github, as well as questions in stackoverflow such as What are the differences between feather and parquet?, the Feather format was not recommended as long term storage due to…

Pandas explode multiple columns (21 votes, 2 answers, asked by Imsa)
I have DF that has multiple columns. Two of the columns are list of the same len.( col2 and col3 are list. the len of the list is the same). My goal is to list each element on it's own row. I can use the df.explode(). but it only accepts one column.…

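
For questions like this one, pandas 1.3 and later let DataFrame.explode take a list of same-length list columns. A sketch with hypothetical columns col1, col2, col3:

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["x", "y"],
    "col2": [[1, 2], [3, 4]],
    "col3": [["a", "b"], ["c", "d"]],
})

# pandas >= 1.3: explode several same-length list columns in one call,
# pairing the i-th element of col2 with the i-th element of col3
out = df.explode(["col2", "col3"])
print(out)
#   col1 col2 col3
# 0    x    1    a
# 0    x    2    b
# 1    y    3    c
# 1    y    4    d
```

On older pandas versions the usual workaround was exploding one column at a time and realigning, which this multi-column form replaces.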
Pandas resample with start date (21 votes, 4 answers, asked by jsignell)
I'd like to resample a pandas object using a specific date (or month) as the edge of the first bin. For instance, in the following snippet I'd like my first index value to be 2020-02-29 and I'd be happy specifying start=2 or start="2020-02-29". >>>…

List of Series to Dataframe (21 votes, 5 answers, asked by Saurabh Verma)
I have a list having Pandas Series objects, which I've created by doing something like this: li = [] li.append(input_df.iloc[0]) li.append(input_df.iloc[4]) where input_df is a Pandas Dataframe I want to convert this list of Series objects back to…

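
One way to approach this: the DataFrame constructor accepts a list of Series and stacks them as rows. A sketch with made-up data, where input_df stands in for the question's frame:

```python
import pandas as pd

input_df = pd.DataFrame({"a": range(5), "b": list("vwxyz")})

# Each .iloc[i] is a Series whose name is the original row label
li = [input_df.iloc[0], input_df.iloc[4]]

# Passing a list of Series to the constructor stacks them as rows;
# row labels come from the Series names, columns from their indices
out = pd.DataFrame(li)
print(out)
#    a  b
# 0  0  v
# 4  4  z
```

pd.concat(li, axis=1).T is an equivalent spelling when the Series share the same index.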
Pyspark: Serialized task exceeds max allowed. Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values (21 votes, 5 answers, asked by Wendy De Wit)
I'm doing calculations on a cluster and at the end when I ask summary statistics on my Spark dataframe with df.describe().show() I get an error: Serialized task 15:0 was 137500581 bytes, which exceeds max allowed: spark.rpc.message.maxSize…

How to remove nulls with array_remove Spark SQL built-in function (21 votes, 6 answers, asked by datapug)
Spark 2.4 introduced new useful Spark SQL functions involving arrays, but I was a little bit puzzled when I found out that the result of select array_remove(array(1, 2, 3, null, 3), null) is null and not [1, 2, 3, 3]. Is this the expected behavior?…

Joining two pandas dataframes based on multiple conditions (21 votes, 2 answers, asked by iprof0214)
df_a and df_b are two dataframes that looks like following df_a A B C D E x1 Apple 0.3 0.9 0.6 x1 Orange 0.1 0.5 0.2 x2 Apple 0.2 0.2 0.1 x2 Orange 0.3 0.4 0.9 x2 Mango 0.1 0.2 0.3 x3 Orange …

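
When the conditions are equalities on shared columns, this typically reduces to a multi-key merge. A sketch with toy frames loosely modeled on the question (the values are invented):

```python
import pandas as pd

df_a = pd.DataFrame({"A": ["x1", "x1", "x2"],
                     "B": ["Apple", "Orange", "Apple"],
                     "C": [0.3, 0.1, 0.2]})
df_b = pd.DataFrame({"A": ["x1", "x2"],
                     "B": ["Apple", "Apple"],
                     "D": [0.9, 0.2]})

# Several equality conditions become one merge on a list of key columns;
# only rows matching on both A and B survive an inner join
merged = pd.merge(df_a, df_b, on=["A", "B"], how="inner")
print(merged)
```

Non-equality conditions (ranges, inequalities) usually need a merge on the equality keys followed by a boolean filter on the rest.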
Pandas, loc vs non loc for boolean indexing (21 votes, 1 answer, asked by Miguel)
All the research I do point to using loc as the way to filter a dataframe by a col(s) value(s), today I was reading this and I discovered by the examples I tested, that loc isn't really needed when filtering cols by it's values: EX: df =…

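
For plain boolean-mask reads the two spellings select the same rows; .loc matters when assigning through the filter. A sketch:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": list("wxyz")})
mask = df["a"] > 2

# For reading, the two spellings select exactly the same rows
assert df[mask].equals(df.loc[mask])

# loc becomes necessary when assigning through the filter:
# chained indexing like df[mask]["b"] = ... may act on a copy
df.loc[mask, "b"] = "hit"
print(df)
```

.loc also lets one filter rows and select columns in a single operation, e.g. df.loc[mask, ["a", "b"]].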
Convert a Pandas DataFrame into a list of objects (21 votes, 3 answers, asked by zola25)
I want to convert a Pandas DataFrame into a list of objects. This is my class: class Reading: def __init__(self): self.HourOfDay: int = 0 self.Percentage: float = 0 I read up on .to_dict, so I tried…

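
One common route, sketched here with a simplified version of the question's class, is DataFrame.to_dict("records") followed by object construction:

```python
import pandas as pd

class Reading:
    def __init__(self, hour_of_day: int = 0, percentage: float = 0.0):
        self.HourOfDay = hour_of_day
        self.Percentage = percentage

df = pd.DataFrame({"HourOfDay": [0, 1], "Percentage": [0.5, 0.7]})

# to_dict("records") yields one plain dict per row, which maps
# cleanly onto constructor arguments
readings = [Reading(r["HourOfDay"], r["Percentage"]) for r in df.to_dict("records")]
print(readings[1].Percentage)  # 0.7
```

itertuples() is a faster alternative for large frames, at the cost of attribute-style rather than dict-style access.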
How can I populate a pandas DataFrame with the result of a Snowflake sql query? (21 votes, 3 answers, asked by RubenLaguna)
Using the Python Connector I can query Snowflake: import snowflake.connector # Gets the version ctx = snowflake.connector.connect( user=USER, password=PASSWORD, account=ACCOUNT, authenticator='https://XXXX.okta.com', …
