Questions tagged [dataframe]

A data frame is a 2D tabular data structure. Usually, its rows are observations and its columns are variables, and the columns are allowed to be of different types (as distinct from an array or matrix). While "data frame" or "dataframe" is the term used for this concept in several languages and libraries (R, Apache Spark, Deedle, Maple, the pandas library in Python and the DataFrames library in Julia), "table" is the term used in MATLAB and SQL.

Each section below covers one language or library that uses this term and is written for readers familiar only with that language.

`data.frame` in R

Data frames (object class data.frame) are one of the basic tabular data structures in the R language, alongside matrices. Unlike matrices, each column can be a different data type. In terms of implementation, a data frame is a list of equal-length column vectors.

Type ?data.frame for help constructing a data frame. An example:

data.frame(
  x = letters[1:5], 
  y = 1:5, 
  z = (1:5) > 3
)
#   x y     z
# 1 a 1 FALSE
# 2 b 2 FALSE
# 3 c 3 FALSE
# 4 d 4  TRUE
# 5 e 5  TRUE

Related functions include is.data.frame, which tests whether an object is a data.frame, and as.data.frame, which coerces many other data structures to data.frame (through S3 dispatch, see ?S3). Base R data frames have been extended or modified to create new data structures by several R packages, including data.table and tibble. For further reading, see the paragraph on data frames in the CRAN manual An Introduction to R.

DataFrame in Python's pandas library

The pandas library in Python is the canonical tabular data framework on the SciPy stack, and the DataFrame is its two-dimensional data object. It is a rectangular array like a 2D NumPy ndarray, but with an associated index on each axis that can be used for alignment. As in R, from an implementation perspective, columns are somewhat prioritized over rows: the DataFrame resembles a dictionary with column names as keys and Series (pandas' one-dimensional data structure) as values.
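The dictionary-like layout and the label-based alignment described above can be seen in a minimal sketch. The column names and the index labels "p", "q", "r" are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"x": list("abcde"), "y": range(1, 6)})

# Dictionary-like behaviour: column labels act as keys,
# and each value is a one-dimensional Series.
column_labels = list(df.keys())
y_column = df["y"]
print(column_labels)
print(type(y_column).__name__)

# Alignment: arithmetic pairs values by index label, not by position.
a = pd.Series([1, 2, 3], index=["p", "q", "r"])
b = pd.Series([10, 20, 30], index=["r", "q", "p"])
total = a + b
print(total["p"], total["q"], total["r"])
```

Because the two Series carry the same labels in a different order, each sum combines the values that share a label rather than the values that share a position.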

After importing numpy and pandas under the usual aliases (import numpy as np, import pandas as pd), we can construct a DataFrame in several ways, such as passing a dictionary of column names and values:

>>> pd.DataFrame({"x": list("abcde"), "y": range(1,6), "z": np.arange(1,6) > 3})
   x  y      z
0  a  1  False
1  b  2  False
2  c  3  False
3  d  4   True
4  e  5   True
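Passing a dictionary is only one of the several ways mentioned above. As a hedged sketch, the same table can also be built from a list of row dictionaries, and a DataFrame can be built from a 2D NumPy array with explicit column names. The variable names and the "a"/"b" column labels here are illustrative only.

```python
import numpy as np
import pandas as pd

# Column-oriented: a dictionary mapping column names to values.
from_dict = pd.DataFrame(
    {"x": list("abcde"), "y": range(1, 6), "z": np.arange(1, 6) > 3}
)

# Row-oriented: a list of dictionaries, one per row.
records = [{"x": c, "y": i, "z": i > 3} for i, c in enumerate("abcde", start=1)]
from_records = pd.DataFrame(records)

# From a 2D NumPy array, supplying the column labels separately.
from_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["a", "b"])

print(from_records)
print(from_array)
```

The row-oriented form is convenient when data arrives record by record, while the dictionary form matches the column-oriented layout of the DataFrame itself.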

DataFrame in Apache Spark

A Spark DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs. (source)

DataFrame in Maple

A DataFrame is one of the basic data structures in Maple. A data frame is a list of variables, known as DataSeries, displayed in a rectangular grid. Every column (variable) in a DataFrame has the same length, but each variable can have a different type, such as integer, float, string, name, or boolean.

When printed, data frames resemble matrices in that they are viewed as a rectangular grid, but a key difference is that the first row holds the column (variable) names and the first column holds the row (individual) names. This header row and header column are treated as meta-information and are not part of the data. Moreover, the data stored in a DataFrame can be accessed using these header names as well as by the standard numbered index. For more details, see the Guide to DataFrames in the online Maple Programming Help.

143674 questions

2 answers

Creating a row number of each row in PySpark DataFrame using row_number() function with Spark version 2.2

I am having a PySpark DataFrame - valuesCol = [('Sweden',31),('Norway',62),('Iceland',13),('Finland',24),('Denmark',52)] df = sqlContext.createDataFrame(valuesCol,['name','id']) +-------+---+ | name| id| +-------+---+ | Sweden| 31| | Norway|…

pandas apache-spark dataframe pyspark row-number

asked Oct 29 '18 at 09:30

cph_sto

2 answers

pandas - AttributeError 'dataframe' object has no attribute

I am trying to filter out the dataframe that contains a list of product. However, I am getting the pandas - 'dataframe' object has no attribute 'str' error whenever I run the code. Here is the line of code: include_clique =…

python pandas dataframe indexing attributeerror

asked Jul 24 '18 at 15:18

David Luong

3 answers

Skip specific set of columns when reading excel frame - pandas

I know beforehand what columns I don't need from an excel file and I'd like to avoid them when reading the file to improve the performance. Something like this: import pandas as pd df = pd.read_excel('large_excel_file.xlsx', skip_cols=['col_a',…

python python-3.x excel pandas dataframe

asked Apr 05 '18 at 16:32

Juan David

2 answers

'<' not supported between instances of 'datetime.date' and 'str'

I get a TypeError: TypeError: '<' not supported between instances of 'datetime.date' and 'str' while running the following piece of code: import requests import re import json import pandas as pd def retrieve_quotes_historical(stock_code): …

pandas dataframe indexing

asked Mar 29 '18 at 11:06

Xiaowu Zhao

4 answers

How to keep original index of a DataFrame after groupby 2 columns?

Is there any way I can retain the original index of my large dataframe after I perform a groupby? The reason I need to this is because I need to do an inner merge back to my original df (after my groupby) to regain those lost columns. And the index…

python pandas dataframe indexing pandas-groupby

asked Mar 11 '18 at 03:31

Hana

3 answers

How to extract the n-th maximum/minimum value in a column of a DataFrame in pandas?

I would like to obtain the n-th minimum or the n-th maximum value from numerical columns in the DataFrame in pandas. Example: df = pd.DataFrame({'a': [3.0, 2.0, 4.0, 1.0],'b': [1.0, 4.0 , 2.0, 3.0]}) a b 0 3.0 1.0 1 2.0 4.0 2 4.0 …

python pandas dataframe max min

asked Dec 29 '17 at 17:48

Krzysztof Słowiński

6 answers

Generate word cloud from single-column Pandas dataframe

I have a Pandas dataframe with one column: Crime type. The column contains 16 different "categories" of crime, which I would like to visualise as a word cloud, with words sized based on their frequency within the dataframe. I have attempted to do…

python pandas dataframe word-cloud

asked Apr 25 '17 at 09:12

the_bonze

5 answers

How to remove a row from pandas dataframe based on the number of elements in a column

In the following pandas.DataFrame: df = alfa beta ceta a,b,c c,d,e g,e,h a,b d,e,f g,h,k j,k c,k,l f,k,n How to drop the rows in which the column values for alfa has more than 2 elements? This can be done using…

python pandas dataframe string-length

asked Mar 20 '17 at 02:47

everestial007

14 answers

pandas.read_csv FileNotFoundError: File b'\xe2\x80\xaa' despite correct path

I'm trying to load a .csv file using the pd.read_csv() function when I get an error despite the file path being correct and using raw strings. import pandas as pd df = pd.read_csv('‪C:\\Users\\user\\Desktop\\datafile.csv') df =…

python csv pandas dataframe file-not-found

asked Feb 10 '17 at 17:48

Impuls3H

3 answers

Python pandas linear regression groupby

I am trying to use a linear regression on a group by pandas python dataframe: This is the dataframe df: group date value A 01-02-2016 16 A 01-03-2016 15 A 01-04-2016 14 A 01-05-2016 17…

python pandas dataframe group-by linear-regression

asked Jan 06 '17 at 18:24

jeangelj

4 answers

data frame to file.txt python

I have this dataframe X Y Z Value 0 18 55 1 70 1 18 55 2 67 2 18 57 2 75 3 18 58 1 35 4 19 54 2 70 I want to save it as a text file with this format X…

python pandas numpy text dataframe

asked Jan 02 '17 at 14:19

Amal Kostali Targhi

1 answer

join two or more data frames in system R

My question is how can I join two or more data frames in system R? For example: I have two data frames: first: x y z 1 3 2 4 2 4 5 7 3 5 6 8 second: x y z 1 1 1 1 2 4 5 7 I need this: x y z 1 3 2 4 2 4 5 7 3 5 …

r join dataframe rbind

asked Nov 10 '10 at 05:47

olga

3 answers

Pandas : TypeError: float() argument must be a string or a number

I have a dataframe that contains user_id date browser conversion test sex age country 1 2015-12-03 IE 1 0 M 32.0 US Here is my code: from sklearn import tree data['date'] =…

python pandas dataframe datetime data-science

asked Dec 21 '16 at 06:41

Gingerbread

6 answers

Python df.to_excel() storing numbers as text in excel. How to store as Value?

I am scraping table data from google finance through pd.read_html and then saving that data to excel through df.to_excel() as seen below: dfs = pd.read_html('https://www.google.com/finance?q=NASDAQ%3AGOOGL&fstype=ii&ei=9YBMWIiaLo29e83Rr9AM',…

python html excel pandas dataframe

asked Dec 10 '16 at 22:30

gluc7

1 answer

unexpected type: when casting to Int on a ApacheSpark Dataframe

I'm having an error when trying to cast a StringType to a IntType on a pyspark dataframe: joint = aggregates.join(df_data_3,aggregates.year==df_data_3.year) joint2 = joint.filter(joint.CountyCode==999).filter(joint.CropName=='WOOL')\ …

python apache-spark dataframe pyspark apache-spark-sql

asked Nov 20 '16 at 05:49

Romeo Kienzler
