Questions tagged [dataframe]

A data frame is a 2D tabular data structure. Typically, rows are observations and columns are variables, and the columns may be of different types (in contrast to an array or matrix, whose elements all share one type). While "data frame" or "dataframe" is the term used for this concept in several languages (R, Apache Spark, deedle, Maple, the pandas library in Python and the DataFrames library in Julia), "table" is the term used in MATLAB and SQL.

The sections below correspond to each language that uses this term and are written for an audience familiar only with that language.

data.frame in R

Data frames (object class data.frame) are one of the basic tabular data structures in the R language, alongside matrices. Unlike matrices, each column can be a different data type. In terms of implementation, a data frame is a list of equal-length column vectors.

Type ?data.frame for help constructing a data frame. An example:

data.frame(
  x = letters[1:5], 
  y = 1:5, 
  z = (1:5) > 3
)
#   x y     z
# 1 a 1 FALSE
# 2 b 2 FALSE
# 3 c 3 FALSE
# 4 d 4  TRUE
# 5 e 5  TRUE

Related functions include is.data.frame, which tests whether an object is a data.frame, and as.data.frame, which coerces many other data structures to data.frame (through S3 dispatch; see ?S3). Several R packages extend or modify base data.frames to create new tabular data structures. For further reading, see the section on data frames in the CRAN manual An Introduction to R.
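As a minimal sketch of those two helpers (the matrix m below is an arbitrary example):

```r
# A matrix is not a data.frame...
m <- matrix(1:6, nrow = 3)
is.data.frame(m)   # FALSE

# ...but as.data.frame coerces it to one, generating column names (V1, V2)
df <- as.data.frame(m)
is.data.frame(df)  # TRUE
```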


DataFrame in Python's pandas library

The pandas library in Python is the canonical tabular data framework on the SciPy stack, and the DataFrame is its two-dimensional data object. It is essentially a rectangular array, like a 2D numpy ndarray, but with labeled indices on each axis that can be used for alignment. As in R, the implementation somewhat prioritizes columns over rows: the DataFrame resembles a dictionary with column names as keys and Series (pandas' one-dimensional data structure) as values. For details, see the pandas documentation for the DataFrame object.

After importing numpy and pandas under the usual aliases (import numpy as np, import pandas as pd), we can construct a DataFrame in several ways, such as passing a dictionary of column names and values:

>>> pd.DataFrame({"x": list("abcde"), "y": range(1,6), "z": np.arange(1,6) > 3})
   x  y      z
0  a  1  False
1  b  2  False
2  c  3  False
3  d  4   True
4  e  5   True
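The index-based alignment mentioned above is what distinguishes pandas data structures from plain 2D arrays. A minimal sketch using two Series (the labels here are arbitrary):

```python
import pandas as pd

# Two Series with partially overlapping index labels
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Arithmetic aligns on index labels, not positions;
# labels present in only one operand yield NaN
total = s1 + s2
print(total)
# a     NaN
# b    12.0
# c    23.0
# d     NaN
```

The same label alignment applies to DataFrame operations on both the row and column axes, which is why two DataFrames with differently ordered rows can still be combined correctly.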

DataFrame in Apache Spark

A Spark DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources, such as structured data files, tables in Hive, external databases, or existing RDDs (as described in the Spark SQL programming guide).


DataFrame in Maple

A DataFrame is one of the basic data structures in Maple. A data frame is a list of variables, known as DataSeries, displayed in a rectangular grid. Every column (variable) in a DataFrame has the same length; however, each variable can have a different type, such as integer, float, string, name, or boolean.

When printed, data frames resemble matrices in that they are viewed as a rectangular grid, but a key difference is that the first row corresponds to the column (variable) names, and the first column corresponds to the row (individual) names. These row and column names are treated as header meta-information and are not part of the data. Moreover, the data stored in a DataFrame can be accessed using these header names, as well as by the standard numeric index. For more details, see the Guide to DataFrames in the online Maple Programming Help.

143674 questions

23 votes, 1 answer: Concatenate (join) a NumPy array with a pandas DataFrame
I have a pandas dataframe with 10 rows and 5 columns and a numpy matrix of zeros np.zeros((10,3)). I want to concat the numpy matrix to the pandas dataframe but I want to delete the last column from the pandas dataframe before concatenating the…
asked by Jamgreen

23 votes, 3 answers: How to get Text from b'Text' in the pandas object type after using read_sas?
I'm trying to read the data from .sas7bdat format of SAS using pandas function read_sas: import pandas as pd df = pd.read_sas('D:/input/houses.sas7bdat', format = 'sas7bdat') df.head() And I have two data types in the df dataframe - float64 and…
asked by doktr

23 votes, 3 answers: Pandas: replace substring in string
I want to replace substring icashier.alipay.com in column in df url icashier.alipay.com/catalog/2758186/detail.aspx icashier.alipay.com/catalog/2758186/detail.aspx icashier.alipay.com/catalog/2758186/detail.aspx vk.com to aliexpress.com. Desire…
asked by NineWasps

23 votes, 1 answer: How to correctly write out a TSV file from a series in Pandas?
I have read the manual here and saw this answer, but it is not working: >>> import pandas as pd >>> import csv >>> pd.Series([my_list]).to_csv('output.tsv',sep='\t',index=False,header=False, quoting=csv.QUOTE_NONE) Traceback (most recent call…
asked by user5359531

23 votes, 3 answers: Combine PySpark DataFrame ArrayType fields into single ArrayType field
I have a PySpark DataFrame with 2 ArrayType fields: >>>df DataFrame[id: string, tokens: array, bigrams: array] >>>df.take(1) [Row(id='ID1', tokens=['one', 'two', 'two'], bigrams=['one two', 'two two'])] I would like to combine them…
asked by zemekeneng

23 votes, 3 answers: PySpark converting a column of type 'map' to multiple columns in a dataframe
Input I have a column Parameters of type map of the form: from pyspark.sql import SQLContext sqlContext = SQLContext(sc) d = [{'Parameters': {'foo': '1', 'bar': '2', 'baz': 'aaa'}}] df = sqlContext.createDataFrame(d) df.collect() #…
asked by Kamil Sindi

23 votes, 2 answers: DataFrame object has no attribute 'sort_values'
dataset = pd.read_csv("dataset.csv").fillna(" ")[:100] dataset['Id']=0 dataset['i']=0 dataset['j']=0 #... entries=dataset[dataset['Id']==0] print type(entries) # Prints
asked by Klausos Klausos

23 votes, 4 answers: Spark dataframe transform multiple rows to column
I am a novice to spark, and I want to transform below source dataframe (load from JSON file): +--+-----+-----+ |A |count|major| +--+-----+-----+ | a| 1| m1| | a| 1| m2| | a| 2| m3| | a| 3| m4| | b| 4| m1| | b| 1| m2| |…
asked by resec

23 votes, 2 answers: python pandas dataframe head() displays nothing
I am new to using pandas and I just don't know what to do with this : I am using python. I have (properly) installed anaconda. In my file I simply create a DataFrame (first by importing it from read_csv, then recreating it by hand to make sure that…
asked by Lauref

23 votes, 4 answers: Add column to the end of Pandas DataFrame containing average of previous data
I have a DataFrame ave_data that contains the following: ave_data Time F7 F8 F9 00:00:00 43.005593 -56.509746 25.271271 01:00:00 55.114918 -59.173852 31.849262 02:00:00 63.990762 -64.699492 …
asked by LinnK

23 votes, 3 answers: How to delete a column in pandas based on a condition?
I have a pandas DataFrame, with many NAN values in it. How can I drop columns such that number_of_na_values > 2000? I tried to do it like that: toRemove = set() naNumbersPerColumn = df.isnull().sum() for i in naNumbersPerColumn.index: …
asked by Fedorenko Kristina

23 votes, 4 answers: Python pandas: exclude rows below a certain frequency count
So I have a pandas DataFrame that looks like this: r vals positions 1.2 1 1.8 2 2.3 1 1.8 1 2.1 3 2.0 3 1.9 1 ... ... I would like the filter out all rows by position that do not appear at least 20…
asked by Wes Field

23 votes, 4 answers: Remove rows where column value type is string Pandas
I have a pandas dataframe. One of my columns should only be floats. When I try to convert that column to floats, I'm alerted that there are strings in there. I'd like to delete all rows where values in this column are strings...
asked by porteclefs

23 votes, 3 answers: R: splitting dataset into quartiles/deciles. What is the right method?
I am very new with R, so hoping I can get some pointers on how to achieve the desired manipulation of my data. I have an array of data with three variables. gene_id fpkm meth_val 1 100629094 0.000 0.0063 2 100628995 0.000 0.0000 3…
asked by user1995839

23 votes, 4 answers: Collapsing data frame by selecting one row per group
I'm trying to collapse a data frame by removing all but one row from each group of rows with identical values in a particular column. In other words, the first row from each group. For example, I'd like to convert this > d =…
asked by jkebinger