Questions tagged [pyspark]

The Spark Python API (PySpark) exposes the Apache Spark programming model to Python.

39,058 questions
49
votes
2 answers

How to calculate the counts of each distinct value in a PySpark dataframe?

I have a column filled with a bunch of states' initials as strings. My goal is to get the count of each state in that list. For example, (("TX":3),("NJ":2)) should be the output when "TX" occurs three times and "NJ" twice. I'm fairly new to…
madsthaks
  • 2,091
  • 6
  • 25
  • 46
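
A minimal sketch of the usual approach, assuming the column is named "state": groupBy plus count gives one row per distinct value.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("TX",), ("TX",), ("TX",), ("NJ",), ("NJ",)], ["state"])

    # One row per distinct value with its number of occurrences,
    # e.g. TX -> 3, NJ -> 2
    df.groupBy("state").count().show()
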
49
votes
4 answers

Spark Equivalent of IF Then ELSE

I have seen this question here earlier and have taken lessons from it. However, I am not sure why I am getting an error when I feel it should work. I want to create a new column in an existing Spark DataFrame based on some rules. Here is what I wrote.…
Baktaawar
  • 7,086
  • 24
  • 81
  • 149
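
The column-level equivalent of IF/THEN/ELSE in Spark SQL is when()/otherwise(); a sketch with an assumed "age" column:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(8,), (15,), (30,)], ["age"])

    # when/otherwise chains like IF ... ELIF ... ELSE over a column
    df = df.withColumn(
        "age_group",
        F.when(df.age < 13, "child")
         .when(df.age < 20, "teen")
         .otherwise("adult"))
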
49
votes
2 answers

Where do you need to use lit() in Pyspark SQL?

I'm trying to make sense of where you need to use a lit value, which is defined as a literal column in the documentation. Take for example this udf, which returns the index of a SQL column array: def find_index(column, index): return…
flybonzai
  • 3,763
  • 11
  • 38
  • 72
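
The rule of thumb: lit() wraps a plain Python value wherever the API expects a Column. A small sketch (the "word" column is made up):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("hello",)], ["word"])

    # concat() expects Column arguments; a raw string like "!" would be
    # interpreted as a column name, so it must be wrapped with lit()
    df = df.withColumn("shout", F.concat(df.word, F.lit("!")))
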
49
votes
3 answers

How to take a random row from a PySpark DataFrame?

How can I get a random row from a PySpark DataFrame? I only see the method sample() which takes a fraction as parameter. Setting this fraction to 1/numberOfRows leads to random results, where sometimes I won't get any row. On RDD there is a method…
DanT
  • 3,960
  • 5
  • 28
  • 33
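
Two common ways to get exactly one random row, sketched below; note that orderBy(rand()) sorts the whole DataFrame, so it can be expensive on large data.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(100)

    # Shuffle by a random key and keep one row (always returns a row,
    # unlike sample() with a tiny fraction)
    row = df.orderBy(F.rand()).limit(1).collect()

    # Alternative: drop down to the RDD API, which has takeSample()
    row2 = df.rdd.takeSample(withReplacement=False, num=1)
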
49
votes
3 answers

Viewing the content of a Spark Dataframe Column

I'm using Spark 1.3.1. I am trying to view the values of a Spark dataframe column in Python. With a Spark dataframe, I can do df.collect() to view the contents of the dataframe, but there is no such method for a Spark dataframe column as best as I…
John Lin
  • 493
  • 1
  • 4
  • 5
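
A column object is just an expression; to see its values, select it into a one-column DataFrame first. A sketch with made-up data:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a", 1), ("b", 2)], ["letter", "num"])

    # Print the column's contents
    df.select("letter").show()

    # Or pull them back to the driver as plain Python values
    values = [row.letter for row in df.select("letter").collect()]
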
48
votes
2 answers

PySpark: convert a standard list to a data frame

The case is really simple: I need to convert a Python list into a data frame with the following code: from pyspark.sql.types import StructType from pyspark.sql.types import StructField from pyspark.sql.types import StringType, IntegerType schema =…
seiya
  • 1,477
  • 3
  • 17
  • 26
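
A sketch of the standard pattern: build a StructType schema and pass the plain Python list straight to createDataFrame (the field names here are assumptions).

    from pyspark.sql import SparkSession
    from pyspark.sql.types import (StructType, StructField,
                                   StringType, IntegerType)

    spark = SparkSession.builder.getOrCreate()

    data = [("Alice", 1), ("Bob", 2)]
    schema = StructType([
        StructField("name", StringType(), True),
        StructField("id", IntegerType(), True),
    ])

    # createDataFrame accepts a list of tuples plus the schema
    df = spark.createDataFrame(data, schema)
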
48
votes
2 answers

Filter Spark DataFrame based on another DataFrame that specifies denylist criteria

I have a largeDataFrame (multiple columns and billions of rows) and a smallDataFrame (single column and 10,000 rows). I'd like to filter all the rows from the largeDataFrame whenever the some_identifier column in the largeDataFrame matches one of…
Powers
  • 18,150
  • 10
  • 103
  • 108
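
The usual answer is a left_anti join, which keeps only the rows of the large DataFrame whose key does not appear in the small one; a sketch:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    large_df = spark.createDataFrame([(1,), (2,), (3,)], ["some_identifier"])
    small_df = spark.createDataFrame([(2,), (3,)], ["some_identifier"])

    # left_anti = "rows of large_df with no match in small_df",
    # i.e. small_df acts as a denylist
    filtered = large_df.join(small_df, on="some_identifier", how="left_anti")
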
48
votes
5 answers

How to build a SparkSession in Spark 2.0 using PySpark?

I just got access to Spark 2.0; I have been using Spark 1.6.1 up until this point. Can someone please help me set up a SparkSession using PySpark (Python)? I know that the Scala examples available online are similar (here), but I was hoping for a…
haileyeve
  • 481
  • 1
  • 4
  • 4
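
In Spark 2.0 the builder pattern replaces separate SparkContext/SQLContext setup; a sketch (the app name and config key are placeholders):

    from pyspark.sql import SparkSession

    # getOrCreate() reuses an existing session if one is already running
    spark = (SparkSession.builder
             .appName("my_app")
             .config("spark.some.config.option", "some-value")
             .getOrCreate())
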
48
votes
4 answers

PySpark row-wise function composition

As a simplified example, I have a dataframe "df" with columns "col1,col2" and I want to compute a row-wise maximum after applying a function to each column: def f(x): return (x+1) max_udf=udf(lambda x,y: max(x,y), IntegerType()) f_udf=udf(f,…
Alex R.
  • 1,397
  • 3
  • 18
  • 33
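
Python UDFs do not compose lazily inside Spark, so one workaround is to express the whole row-wise computation with built-in column functions; a sketch for f(x) = x + 1 followed by a row-wise max:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 5), (7, 2)], ["col1", "col2"])

    # greatest() is the built-in row-wise maximum, applied here to the
    # already-transformed columns, avoiding nested UDF calls
    df = df.withColumn("max_col", F.greatest(df.col1 + 1, df.col2 + 1))
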
47
votes
2 answers

Pivot String column on Pyspark Dataframe

I have a simple dataframe like this: rdd = sc.parallelize( [ (0, "A", 223,"201603", "PORT"), (0, "A", 22,"201602", "PORT"), (0, "A", 422,"201601", "DOCK"), (1,"B", 3213,"201602", "DOCK"), (1,"B",…
Ivan
  • 19,560
  • 31
  • 97
  • 141
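
Spark's pivot() turns the distinct values of one column into new columns, combined with an aggregate; a sketch with column names guessed from the excerpt:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(0, "A", 223, "201603", "PORT"),
         (0, "A", 22, "201602", "PORT"),
         (0, "A", 422, "201601", "DOCK"),
         (1, "B", 3213, "201602", "DOCK")],
        ["id", "type", "cost", "date", "ship"])

    # One column per distinct "ship" value, cells filled by sum("cost")
    pivoted = df.groupBy("id", "type").pivot("ship").sum("cost")
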
47
votes
7 answers

I can't seem to get --py-files on Spark to work

I'm having a problem with using Python on Spark. My application has some dependencies, such as numpy, pandas, astropy, etc. I cannot use virtualenv to create an environment with all dependencies, since the nodes on the cluster do not have any common…
Andrej Palicka
  • 971
  • 1
  • 11
  • 26
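
--py-files (or the equivalent addPyFile call) ships pure-Python code to the executors, but it cannot ship packages with compiled C extensions such as numpy or pandas; those must be installed on every node. A sketch:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # Runtime equivalent of spark-submit --py-files: distribute a zip
    # of pure-Python modules to every executor (path is an assumption)
    sc.addPyFile("deps.zip")

    # Command-line form:
    #   spark-submit --py-files deps.zip my_script.py
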
47
votes
2 answers

Overwriting Spark output using PySpark

I am trying to overwrite a Spark dataframe using the following option in PySpark, but I am not successful: spark_df.write.format('com.databricks.spark.csv').option("header", "true",mode='overwrite').save(self.output_file_path) The mode=overwrite…
Devesh
  • 719
  • 1
  • 7
  • 13
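
The failure here is that mode belongs on the writer, not inside option(); option() takes exactly one key/value pair. A sketch of the corrected call chain:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark_df = spark.createDataFrame([(1, "a")], ["id", "val"])

    # mode("overwrite") is its own method on DataFrameWriter
    (spark_df.write
        .format("com.databricks.spark.csv")  # built-in "csv" in Spark 2.0+
        .option("header", "true")
        .mode("overwrite")
        .save("/tmp/output"))  # output path is a placeholder
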
47
votes
10 answers

Trim string column in PySpark dataframe

After creating a Spark DataFrame from a CSV file, I would like to trim a column. I've tried: df = df.withColumn("Product", df.Product.strip()) df is my data frame, Product is a column in my table. But I get the error: Column object is not…
minh-hieu.pham
  • 1,029
  • 2
  • 12
  • 21
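
Column objects don't have Python string methods like .strip(); the SQL function trim() is the equivalent. A sketch:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("  widget  ",)], ["Product"])

    # trim() removes leading and trailing whitespace from the column
    df = df.withColumn("Product", F.trim(df.Product))
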
47
votes
8 answers

Add column sum as new column in PySpark dataframe

I'm using PySpark and I have a Spark dataframe with a bunch of numeric columns. I want to add a column that is the sum of all the other columns. Suppose my dataframe had columns "a", "b", and "c". I know I can do this: df.withColumn('total_col',…
plam
  • 1,305
  • 3
  • 15
  • 24
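
Since Column supports +, the sum can be folded over all columns without a UDF; a sketch:

    from functools import reduce
    from operator import add

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 2, 3)], ["a", "b", "c"])

    # Fold + over the Column objects to build a single sum expression
    df = df.withColumn("total_col", reduce(add, [df[c] for c in df.columns]))
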
46
votes
1 answer

Pyspark filter dataframe by columns of another dataframe

Not sure why I'm having a difficult time with this; it seems so simple, considering it's fairly easy to do in R or pandas. I wanted to avoid using pandas, though, since I'm dealing with a lot of data, and I believe toPandas() loads all the data into…
drewyupdrew
  • 1,549
  • 1
  • 11
  • 16
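
A left_semi join does this without collecting anything to the driver: it keeps the rows of the first DataFrame whose key also appears in the second. A sketch with assumed column names:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
    df2 = spark.createDataFrame([(1,), (3,)], ["id"])

    # Keep only df1 rows whose "id" exists in df2 (no toPandas() needed)
    kept = df1.join(df2, on="id", how="left_semi")
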