Questions tagged [pyspark]

The Spark Python API (PySpark) exposes the Apache Spark programming model to Python.

39,058 questions
49
votes
2 answers

How to calculate the counts of each distinct value in a PySpark dataframe?

I have a column filled with a bunch of states' initials as strings. My goal is to get the count of each state in that list. For example, (("TX":3),("NJ":2)) should be the output when "TX" occurs three times and "NJ" twice. I'm fairly new to…
madsthaks
  • 2,091
  • 6
  • 25
  • 46
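
A minimal sketch of the usual approach, assuming the column is named "state": groupBy plus count gives one row per distinct value.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("TX",), ("TX",), ("TX",), ("NJ",), ("NJ",)], ["state"])

    # One row per distinct value with its number of occurrences,
    # e.g. TX -> 3, NJ -> 2
    df.groupBy("state").count().show()
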
49
votes
4 answers

Spark Equivalent of IF Then ELSE

I have seen this question here earlier and have taken lessons from it. However, I am not sure why I am getting an error when I feel it should work. I want to create a new column in an existing Spark DataFrame based on some rules. Here is what I wrote.…
Baktaawar
  • 7,086
  • 24
  • 81
  • 149
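
The column-level equivalent of IF/THEN/ELSE in Spark SQL is when()/otherwise(); a sketch with an assumed "age" column:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(8,), (15,), (30,)], ["age"])

    # when/otherwise chains like IF ... ELIF ... ELSE over a column
    df = df.withColumn(
        "age_group",
        F.when(df.age < 13, "child")
         .when(df.age < 20, "teen")
         .otherwise("adult"))
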
49
votes
2 answers

Where do you need to use lit() in Pyspark SQL?

I'm trying to make sense of where you need to use a lit value, which is defined as a literal column in the documentation. Take for example this udf, which returns the index of a SQL column array: def find_index(column, index): return…
flybonzai
  • 3,763
  • 11
  • 38
  • 72
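
The rule of thumb: lit() wraps a plain Python value wherever the API expects a Column. A small sketch (the "word" column is made up):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("hello",)], ["word"])

    # concat() expects Column arguments; a raw string like "!" would be
    # interpreted as a column name, so it must be wrapped with lit()
    df = df.withColumn("shout", F.concat(df.word, F.lit("!")))
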
49
votes
3 answers

How to take a random row from a PySpark DataFrame?

How can I get a random row from a PySpark DataFrame? I only see the method sample() which takes a fraction as parameter. Setting this fraction to 1/numberOfRows leads to random results, where sometimes I won't get any row. On RDD there is a method…
DanT
  • 3,960
  • 5
  • 28
  • 33
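
Two common ways to get exactly one random row, sketched below; note that orderBy(rand()) sorts the whole DataFrame, so it can be expensive on large data.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(100)

    # Shuffle by a random key and keep one row (always returns a row,
    # unlike sample() with a tiny fraction)
    row = df.orderBy(F.rand()).limit(1).collect()

    # Alternative: drop down to the RDD API, which has takeSample()
    row2 = df.rdd.takeSample(withReplacement=False, num=1)
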
49
votes
3 answers

Viewing the content of a Spark Dataframe Column

I'm using Spark 1.3.1. I am trying to view the values of a Spark dataframe column in Python. With a Spark dataframe, I can do df.collect() to view the contents of the dataframe, but there is no such method for a Spark dataframe column as best as I…
John Lin
  • 493
  • 1
  • 4
  • 5
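
A column object is just an expression; to see its values, select it into a one-column DataFrame first. A sketch with made-up data:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a", 1), ("b", 2)], ["letter", "num"])

    # Print the column's contents
    df.select("letter").show()

    # Or pull them back to the driver as plain Python values
    values = [row.letter for row in df.select("letter").collect()]
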
48
votes
2 answers

PySpark: convert a standard list to a data frame

The case is really simple: I need to convert a Python list into a data frame with the following code: from pyspark.sql.types import StructType from pyspark.sql.types import StructField from pyspark.sql.types import StringType, IntegerType schema =…
seiya
  • 1,477
  • 3
  • 17
  • 26
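
A sketch of the standard pattern: build a StructType schema and pass the plain Python list straight to createDataFrame (the field names here are assumptions).

    from pyspark.sql import SparkSession
    from pyspark.sql.types import (StructType, StructField,
                                   StringType, IntegerType)

    spark = SparkSession.builder.getOrCreate()

    data = [("Alice", 1), ("Bob", 2)]
    schema = StructType([
        StructField("name", StringType(), True),
        StructField("id", IntegerType(), True),
    ])

    # createDataFrame accepts a list of tuples plus the schema
    df = spark.createDataFrame(data, schema)
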
48
votes
2 answers

Filter Spark DataFrame based on another DataFrame that specifies denylist criteria

I have a largeDataFrame (multiple columns and billions of rows) and a smallDataFrame (single column and 10,000 rows). I'd like to filter all the rows from the largeDataFrame whenever the some_identifier column in the largeDataFrame matches one of…
Powers
  • 18,150
  • 10
  • 103
  • 108
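
The usual answer is a left_anti join, which keeps only the rows of the large DataFrame whose key does not appear in the small one; a sketch:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    large_df = spark.createDataFrame([(1,), (2,), (3,)], ["some_identifier"])
    small_df = spark.createDataFrame([(2,), (3,)], ["some_identifier"])

    # left_anti = "rows of large_df with no match in small_df",
    # i.e. small_df acts as a denylist
    filtered = large_df.join(small_df, on="some_identifier", how="left_anti")
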
48
votes
5 answers

How to build a SparkSession in Spark 2.0 using PySpark?

I just got access to Spark 2.0; I have been using Spark 1.6.1 up until this point. Can someone please help me set up a SparkSession using PySpark (Python)? I know that the Scala examples available online are similar (here), but I was hoping for a…
haileyeve
  • 481
  • 1
  • 4
  • 4
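
In Spark 2.0 the builder pattern replaces separate SparkContext/SQLContext setup; a sketch (the app name and config key are placeholders):

    from pyspark.sql import SparkSession

    # getOrCreate() reuses an existing session if one is already running
    spark = (SparkSession.builder
             .appName("my_app")
             .config("spark.some.config.option", "some-value")
             .getOrCreate())
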
48
votes
4 answers

PySpark row-wise function composition

As a simplified example, I have a dataframe "df" with columns "col1,col2" and I want to compute a row-wise maximum after applying a function to each column: def f(x): return (x+1) max_udf=udf(lambda x,y: max(x,y), IntegerType()) f_udf=udf(f,…
Alex R.
  • 1,397
  • 3
  • 18
  • 33
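
Python UDFs do not compose lazily inside Spark, so one workaround is to express the whole row-wise computation with built-in column functions; a sketch for f(x) = x + 1 followed by a row-wise max:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 5), (7, 2)], ["col1", "col2"])

    # greatest() is the built-in row-wise maximum, applied here to the
    # already-transformed columns, avoiding nested UDF calls
    df = df.withColumn("max_col", F.greatest(df.col1 + 1, df.col2 + 1))
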
47
votes
2 answers

Pivot String column on Pyspark Dataframe

I have a simple dataframe like this: rdd = sc.parallelize( [ (0, "A", 223,"201603", "PORT"), (0, "A", 22,"201602", "PORT"), (0, "A", 422,"201601", "DOCK"), (1,"B", 3213,"201602", "DOCK"), (1,"B",…
Ivan
  • 19,560
  • 31
  • 97
  • 141
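
Spark's pivot() turns the distinct values of one column into new columns, combined with an aggregate; a sketch with column names guessed from the excerpt:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(0, "A", 223, "201603", "PORT"),
         (0, "A", 22, "201602", "PORT"),
         (0, "A", 422, "201601", "DOCK"),
         (1, "B", 3213, "201602", "DOCK")],
        ["id", "type", "cost", "date", "ship"])

    # One column per distinct "ship" value, cells filled by sum("cost")
    pivoted = df.groupBy("id", "type").pivot("ship").sum("cost")
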
47
votes
7 answers

I can't seem to get --py-files on Spark to work

I'm having a problem with using Python on Spark. My application has some dependencies, such as numpy, pandas, astropy, etc. I cannot use virtualenv to create an environment with all dependencies, since the nodes on the cluster do not have any common…
Andrej Palicka
  • 971
  • 1
  • 11
  • 26
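
--py-files (or the equivalent addPyFile call) ships pure-Python code to the executors, but it cannot ship packages with compiled C extensions such as numpy or pandas; those must be installed on every node. A sketch:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # Runtime equivalent of spark-submit --py-files: distribute a zip
    # of pure-Python modules to every executor (path is an assumption)
    sc.addPyFile("deps.zip")

    # Command-line form:
    #   spark-submit --py-files deps.zip my_script.py
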
47
votes
2 answers

Overwriting Spark output using PySpark

I am trying to overwrite a Spark dataframe using the following option in PySpark, but I am not successful: spark_df.write.format('com.databricks.spark.csv').option("header", "true",mode='overwrite').save(self.output_file_path) The mode=overwrite…
Devesh
  • 719
  • 1
  • 7
  • 13
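
The failure here is that mode belongs on the writer, not inside option(); option() takes exactly one key/value pair. A sketch of the corrected call chain:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark_df = spark.createDataFrame([(1, "a")], ["id", "val"])

    # mode("overwrite") is its own method on DataFrameWriter
    (spark_df.write
        .format("com.databricks.spark.csv")  # built-in "csv" in Spark 2.0+
        .option("header", "true")
        .mode("overwrite")
        .save("/tmp/output"))  # output path is a placeholder
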
47
votes
10 answers

Trim string column in PySpark dataframe

After creating a Spark DataFrame from a CSV file, I would like to trim a column. I've tried: df = df.withColumn("Product", df.Product.strip()) df is my data frame, Product is a column in my table. But I get the error: Column object is not…
minh-hieu.pham
  • 1,029
  • 2
  • 12
  • 21
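
Column objects don't have Python string methods like .strip(); the SQL function trim() is the equivalent. A sketch:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("  widget  ",)], ["Product"])

    # trim() removes leading and trailing whitespace from the column
    df = df.withColumn("Product", F.trim(df.Product))
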
47
votes
8 answers

Add column sum as new column in PySpark dataframe

I'm using PySpark and I have a Spark dataframe with a bunch of numeric columns. I want to add a column that is the sum of all the other columns. Suppose my dataframe had columns "a", "b", and "c". I know I can do this: df.withColumn('total_col',…
plam
  • 1,305
  • 3
  • 15
  • 24
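
Since Column supports +, the sum can be folded over all columns without a UDF; a sketch:

    from functools import reduce
    from operator import add

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 2, 3)], ["a", "b", "c"])

    # Fold + over the Column objects to build a single sum expression
    df = df.withColumn("total_col", reduce(add, [df[c] for c in df.columns]))
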
46
votes
1 answer

Pyspark filter dataframe by columns of another dataframe

Not sure why I'm having a difficult time with this; it seems so simple, considering it's fairly easy to do in R or pandas. I wanted to avoid using pandas, though, since I'm dealing with a lot of data, and I believe toPandas() loads all the data into…
drewyupdrew
  • 1,549
  • 1
  • 11
  • 16
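
A left_semi join does this without collecting anything to the driver: it keeps the rows of the first DataFrame whose key also appears in the second. A sketch with assumed column names:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
    df2 = spark.createDataFrame([(1,), (3,)], ["id"])

    # Keep only df1 rows whose "id" exists in df2 (no toPandas() needed)
    kept = df1.join(df2, on="id", how="left_semi")
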