Questions tagged [dataframe]

A data frame is a 2D tabular data structure. Typically, rows are observations and columns are variables, and the columns may be of different types (in contrast to an array or matrix, whose elements all share one type). While "data frame" or "dataframe" is the term used for this concept in several languages (R, Apache Spark, deedle, Maple, the pandas library in Python and the DataFrames library in Julia), "table" is the term used in MATLAB and SQL.

The sections below correspond to each language that uses this term and are written for an audience familiar only with that language.

data.frame in R

Data frames (object class data.frame) are one of the basic tabular data structures in the R language, alongside matrices. Unlike matrices, each column can be a different data type. In terms of implementation, a data frame is a list of equal-length column vectors.

Type ?data.frame for help constructing a data frame. An example:

data.frame(
  x = letters[1:5], 
  y = 1:5, 
  z = (1:5) > 3
)
#   x y     z
# 1 a 1 FALSE
# 2 b 2 FALSE
# 3 c 3 FALSE
# 4 d 4  TRUE
# 5 e 5  TRUE

Related functions include is.data.frame, which tests whether an object is a data.frame, and as.data.frame, which coerces many other data structures to data.frame (through S3 dispatch; see ?S3). Several R packages extend or modify base data.frames to create new tabular data structures. For further reading, see the section on data frames in the CRAN manual An Introduction to R.
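As a minimal sketch of those two helpers (the matrix m below is an arbitrary example):

```r
# A matrix is not a data.frame...
m <- matrix(1:6, nrow = 3)
is.data.frame(m)   # FALSE

# ...but as.data.frame coerces it to one, generating column names (V1, V2)
df <- as.data.frame(m)
is.data.frame(df)  # TRUE
```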


DataFrame in Python's pandas library

The pandas library in Python is the canonical tabular data framework on the SciPy stack, and the DataFrame is its two-dimensional data object. It is essentially a rectangular array, like a 2D numpy ndarray, but with labeled indices on each axis that can be used for alignment. As in R, the implementation somewhat prioritizes columns over rows: the DataFrame resembles a dictionary with column names as keys and Series (pandas' one-dimensional data structure) as values. For details, see the pandas documentation for the DataFrame object.

After importing numpy and pandas under the usual aliases (import numpy as np, import pandas as pd), we can construct a DataFrame in several ways, such as passing a dictionary of column names and values:

>>> pd.DataFrame({"x": list("abcde"), "y": range(1,6), "z": np.arange(1,6) > 3})
   x  y      z
0  a  1  False
1  b  2  False
2  c  3  False
3  d  4   True
4  e  5   True
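The index-based alignment mentioned above is what distinguishes pandas data structures from plain 2D arrays. A minimal sketch using two Series (the labels here are arbitrary):

```python
import pandas as pd

# Two Series with partially overlapping index labels
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Arithmetic aligns on index labels, not positions;
# labels present in only one operand yield NaN
total = s1 + s2
print(total)
# a     NaN
# b    12.0
# c    23.0
# d     NaN
```

The same label alignment applies to DataFrame operations on both the row and column axes, which is why two DataFrames with differently ordered rows can still be combined correctly.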

DataFrame in Apache Spark

A Spark DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources, such as structured data files, tables in Hive, external databases, or existing RDDs (as described in the Spark SQL programming guide).


DataFrame in Maple

A DataFrame is one of the basic data structures in Maple. A data frame is a list of variables, known as DataSeries, displayed in a rectangular grid. Every column (variable) in a DataFrame has the same length; however, each variable can have a different type, such as integer, float, string, name, or boolean.

When printed, data frames resemble matrices in that they are viewed as a rectangular grid, but a key difference is that the first row corresponds to the column (variable) names, and the first column corresponds to the row (individual) names. These row and column names are treated as header meta-information and are not part of the data. Moreover, the data stored in a DataFrame can be accessed using these header names, as well as by the standard numeric index. For more details, see the Guide to DataFrames in the online Maple Programming Help.

143674 questions

23 votes, 1 answer: Concatenate (join) a NumPy array with a pandas DataFrame
I have a pandas dataframe with 10 rows and 5 columns and a numpy matrix of zeros np.zeros((10,3)). I want to concat the numpy matrix to the pandas dataframe but I want to delete the last column from the pandas dataframe before concatenating the…
asked by Jamgreen

23 votes, 3 answers: How to get Text from b'Text' in the pandas object type after using read_sas?
I'm trying to read the data from .sas7bdat format of SAS using pandas function read_sas: import pandas as pd df = pd.read_sas('D:/input/houses.sas7bdat', format = 'sas7bdat') df.head() And I have two data types in the df dataframe - float64 and…
asked by doktr

23 votes, 3 answers: Pandas: replace substring in string
I want to replace substring icashier.alipay.com in column in df url icashier.alipay.com/catalog/2758186/detail.aspx icashier.alipay.com/catalog/2758186/detail.aspx icashier.alipay.com/catalog/2758186/detail.aspx vk.com to aliexpress.com. Desire…
asked by NineWasps

23 votes, 1 answer: How to correctly write out a TSV file from a series in Pandas?
I have read the manual here and saw this answer, but it is not working: >>> import pandas as pd >>> import csv >>> pd.Series([my_list]).to_csv('output.tsv',sep='\t',index=False,header=False, quoting=csv.QUOTE_NONE) Traceback (most recent call…
asked by user5359531

23 votes, 3 answers: Combine PySpark DataFrame ArrayType fields into single ArrayType field
I have a PySpark DataFrame with 2 ArrayType fields: >>>df DataFrame[id: string, tokens: array, bigrams: array] >>>df.take(1) [Row(id='ID1', tokens=['one', 'two', 'two'], bigrams=['one two', 'two two'])] I would like to combine them…
asked by zemekeneng

23 votes, 3 answers: PySpark converting a column of type 'map' to multiple columns in a dataframe
Input I have a column Parameters of type map of the form: from pyspark.sql import SQLContext sqlContext = SQLContext(sc) d = [{'Parameters': {'foo': '1', 'bar': '2', 'baz': 'aaa'}}] df = sqlContext.createDataFrame(d) df.collect() #…
asked by Kamil Sindi

23 votes, 2 answers: DataFrame object has no attribute 'sort_values'
dataset = pd.read_csv("dataset.csv").fillna(" ")[:100] dataset['Id']=0 dataset['i']=0 dataset['j']=0 #... entries=dataset[dataset['Id']==0] print type(entries) # Prints
asked by Klausos Klausos

23 votes, 4 answers: Spark dataframe transform multiple rows to column
I am a novice to spark, and I want to transform below source dataframe (load from JSON file): +--+-----+-----+ |A |count|major| +--+-----+-----+ | a| 1| m1| | a| 1| m2| | a| 2| m3| | a| 3| m4| | b| 4| m1| | b| 1| m2| |…
asked by resec

23 votes, 2 answers: python pandas dataframe head() displays nothing
I am new to using pandas and I just don't know what to do with this : I am using python. I have (properly) installed anaconda. In my file I simply create a DataFrame (first by importing it from read_csv, then recreating it by hand to make sure that…
asked by Lauref

23 votes, 4 answers: Add column to the end of Pandas DataFrame containing average of previous data
I have a DataFrame ave_data that contains the following: ave_data Time F7 F8 F9 00:00:00 43.005593 -56.509746 25.271271 01:00:00 55.114918 -59.173852 31.849262 02:00:00 63.990762 -64.699492 …
asked by LinnK

23 votes, 3 answers: How to delete a column in pandas based on a condition?
I have a pandas DataFrame, with many NAN values in it. How can I drop columns such that number_of_na_values > 2000? I tried to do it like that: toRemove = set() naNumbersPerColumn = df.isnull().sum() for i in naNumbersPerColumn.index: …
asked by Fedorenko Kristina

23 votes, 4 answers: Python pandas: exclude rows below a certain frequency count
So I have a pandas DataFrame that looks like this: r vals positions 1.2 1 1.8 2 2.3 1 1.8 1 2.1 3 2.0 3 1.9 1 ... ... I would like the filter out all rows by position that do not appear at least 20…
asked by Wes Field

23 votes, 4 answers: Remove rows where column value type is string Pandas
I have a pandas dataframe. One of my columns should only be floats. When I try to convert that column to floats, I'm alerted that there are strings in there. I'd like to delete all rows where values in this column are strings...
asked by porteclefs

23 votes, 3 answers: R: splitting dataset into quartiles/deciles. What is the right method?
I am very new with R, so hoping I can get some pointers on how to achieve the desired manipulation of my data. I have an array of data with three variables. gene_id fpkm meth_val 1 100629094 0.000 0.0063 2 100628995 0.000 0.0000 3…
asked by user1995839

23 votes, 4 answers: Collapsing data frame by selecting one row per group
I'm trying to collapse a data frame by removing all but one row from each group of rows with identical values in a particular column. In other words, the first row from each group. For example, I'd like to convert this > d =…
asked by jkebinger