Questions tagged [pyspark]

The Spark Python API (PySpark) exposes the Apache Spark programming model to Python.

39058 questions
64 votes • 4 answers

Reading csv files with quoted fields containing embedded commas

I am reading a csv file in Pyspark as follows: df_raw=spark.read.option("header","true").csv(csv_path) However, the data file has quoted fields with embedded commas, which should not be treated as delimiters. How can I handle this in Pyspark?…
femibyte • 3,317 • 7 • 34 • 59
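
A minimal sketch of the usual remedy for the question above: Spark's CSV reader already keeps a quoted field as one value, so the common fix is to set the quote and escape options explicitly. csv_path is a placeholder, and the sketch assumes fields are wrapped in standard double quotes.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    csv_path = "/path/to/file.csv"  # placeholder path

    # Quoted fields with embedded commas are kept intact; escape='"' additionally
    # lets a doubled "" inside a quoted field stand for a literal quote character.
    df_raw = (spark.read
              .option("header", "true")
              .option("quote", '"')
              .option("escape", '"')
              .csv(csv_path))
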
64 votes • 2 answers

Apache Spark -- Assign the result of UDF to multiple dataframe columns

I'm using pyspark, loading a large csv file into a dataframe with spark-csv, and as a pre-processing step I need to apply a variety of operations to the data available in one of the columns (that contains a json string). That will return X values,…
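
For the UDF question above, one commonly used pattern is to have the UDF return a struct and then expand it with select("parsed.*"). This is only a sketch against the Spark 2.x API; the field names, the json_col column, and df itself stand in for the questioner's data.

    import json
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType

    # Hypothetical schema describing the values extracted from the JSON column
    result_schema = StructType([
        StructField("field_a", StringType()),
        StructField("field_b", StringType()),
    ])

    @F.udf(returnType=result_schema)
    def parse_json(s):
        d = json.loads(s)
        return (d.get("field_a"), d.get("field_b"))

    # One UDF call, then the struct is unpacked into separate columns
    df = (df.withColumn("parsed", parse_json(F.col("json_col")))
            .select("*", "parsed.*")
            .drop("parsed"))
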
62 votes • 3 answers

How to replace all Null values of a dataframe in Pyspark

I have a data frame in pyspark with more than 300 columns. Some of these columns contain null values. For example:
Column_1 column_2
null     null
null     null
234      null
125      124
365      187
and so on. When I want to do a…
user7543621
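
A minimal sketch for the null-replacement question above: DataFrame.fillna (alias df.na.fill) handles every column of a matching type at once, or specific columns via a dict. The column names below are placeholders.

    # Replace nulls in every numeric column with 0
    df_filled = df.fillna(0)

    # Or pick a replacement per column (hypothetical column names)
    df_filled = df.na.fill({"column_1": 0, "column_2": 0})
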
62 votes • 6 answers

PySpark groupByKey returning pyspark.resultiterable.ResultIterable

I am trying to figure out why my groupByKey is returning the following: [(0, <pyspark.resultiterable.ResultIterable object ...>), (1, <pyspark.resultiterable.ResultIterable object ...>), (2,…
theMadKing • 2,064 • 7 • 32 • 59
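
The ResultIterable seen above is simply a lazy iterable over each key's values; materializing it with mapValues(list) makes the grouped output readable. A small self-contained sketch:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize([(0, "a"), (0, "b"), (1, "c")])

    # groupByKey wraps each group in a ResultIterable; turn it into a list to inspect it
    print(rdd.groupByKey().mapValues(list).collect())
    # e.g. [(0, ['a', 'b']), (1, ['c'])]  (ordering may vary)
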
61 votes • 3 answers

Python/pyspark data frame rearrange columns

I have a data frame in python/pyspark with columns id, time, city, zip and so on… Now I added a new column name to this data frame. I have to arrange the columns in such a way that the name column comes right after id. I have done it like…
User12345 • 5,180 • 14 • 58 • 105
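
One way to answer the column-ordering question above is to build the desired order as a plain Python list and pass it to select; this sketch assumes the column names mentioned in the question.

    cols = df.columns                          # e.g. ['id', 'time', 'city', 'zip', ..., 'name']
    cols.remove("name")
    cols.insert(cols.index("id") + 1, "name")  # put 'name' right after 'id'

    df_reordered = df.select(cols)
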
61 votes • 1 answer

aggregate function Count usage with groupBy in Spark

I'm trying to perform multiple operations in one line of code in pySpark, and I'm not sure whether that's possible in my case. My intention is to avoid having to save the output as a new dataframe. My current code is rather simple: encodeUDF = udf(encode_time,…
Adiel • 1,203 • 3 • 18 • 31
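
For the groupBy/count question above, the transformation and the aggregation can be chained in one expression. encodeUDF and the time column belong to the questioner, so a stub stands in for them here; this is a sketch, not their code.

    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    encodeUDF = F.udf(lambda t: str(t), StringType())  # stub for the questioner's encode_time UDF

    result = (df.groupBy(encodeUDF(F.col("time")).alias("time_bucket"))
                .agg(F.count("*").alias("count")))
    result.show()
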
60 votes • 4 answers

GroupBy column and filter rows with maximum value in Pyspark

I am almost certain this has been asked before, but a search through stackoverflow did not answer my question. Not a duplicate of [2] since I want the maximum value, not the most frequent item. I am new to pyspark and trying to do something really…
Thomas • 4,696 • 5 • 36 • 71
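
A window-function sketch for keeping only the row with the maximum value in each group; group_col and value_col are hypothetical names.

    from pyspark.sql import Window, functions as F

    w = Window.partitionBy("group_col").orderBy(F.col("value_col").desc())

    result = (df.withColumn("rn", F.row_number().over(w))
                .filter(F.col("rn") == 1)   # keep the top row of each group
                .drop("rn"))
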
60 votes • 4 answers

Pyspark: Convert column to lowercase

I want to convert the values inside a column to lowercase. Currently if I use the lower() method, it complains that column objects are not callable. Since there's a function called lower() in SQL, I assume there's a native Spark solution that…
wlad • 2,073 • 2 • 18 • 29
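
The native counterpart to SQL's lower() is pyspark.sql.functions.lower; a one-line sketch with a placeholder column name:

    from pyspark.sql import functions as F

    df = df.withColumn("my_col", F.lower(F.col("my_col")))  # 'my_col' is a placeholder
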
60 votes • 6 answers

PySpark create new column with mapping from a dict

Using Spark 1.6, I have a Spark DataFrame column (named, let's say, col1) with values A, B, C, DS, DNS, E, F, G and H. I want to create a new column (say col2) with the values from the dict below. How do I map this? (e.g. 'A' needs to be mapped…
ad_s • 1,560 • 4 • 15 • 16
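
A frequently used approach for the dict-mapping question above builds a literal map expression with create_map (Spark 2.0+; on 1.6 as in the question, a plain Python UDF calling mapping.get is the fallback). The dict values below are invented for illustration.

    from itertools import chain
    from pyspark.sql import functions as F

    mapping = {"A": 1, "B": 2, "C": 3}  # illustrative values only

    # Flatten the dict into alternating key/value literals and index the map by col1
    mapping_expr = F.create_map([F.lit(x) for x in chain(*mapping.items())])
    df = df.withColumn("col2", mapping_expr[F.col("col1")])
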
60 votes • 6 answers

Spark SQL Row_number() PartitionBy Sort Desc

I've successfully created a row_number() partitionBy in Spark using Window, but would like to sort it descending instead of the default ascending. Here is my working code: from pyspark import HiveContext from pyspark.sql.types import * from…
jKraut • 2,325 • 6 • 35 • 48
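
Descending order goes into the window specification itself; a short sketch with hypothetical column names:

    from pyspark.sql import Window, functions as F

    w = Window.partitionBy("group_col").orderBy(F.col("sort_col").desc())
    df = df.withColumn("row_num", F.row_number().over(w))
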
59 votes • 9 answers

PySpark - Sum a column in dataframe and return results as int

I have a pyspark dataframe with a column of numbers. I need to sum that column and then have the result return as an int in a python variable. df = spark.createDataFrame([("A", 20), ("B", 30), ("D", 80)],["Letter", "Number"]) I do the following to…
Bryce Ramgovind • 3,127 • 10 • 41 • 72
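
A minimal sketch using the question's own example frame: aggregate, then pull the single value back to the driver as a plain Python int.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("A", 20), ("B", 30), ("D", 80)], ["Letter", "Number"])

    total = df.agg(F.sum("Number")).collect()[0][0]   # .first()[0] works the same way
    print(total)  # 130
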
59 votes • 7 answers

Pyspark: Parse a column of json strings

I have a pyspark dataframe consisting of one column, called json, where each row is a unicode string of json. I'd like to parse each row and return a new dataframe where each row is the parsed json. # Sample Data Frame jstr1 =…
Steve • 2,401 • 3 • 24 • 28
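
One route for the JSON-parsing question above, sketched with from_json and an explicit schema; the schema fields are hypothetical. Alternatively, the string column can be handed to spark.read.json for schema inference.

    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, LongType

    # Hypothetical schema for the JSON documents in the 'json' column
    json_schema = StructType([
        StructField("name", StringType()),
        StructField("value", LongType()),
    ])

    parsed = (df.withColumn("parsed", F.from_json(F.col("json"), json_schema))
                .select("parsed.*"))

    # Alternative: infer the schema directly from the strings
    parsed = spark.read.json(df.rdd.map(lambda row: row.json))
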
59 votes • 6 answers

Spark Error: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct)

I have a dataframe in Spark in which one of the columns contains an array. Now, I have written a separate UDF which converts the array to another array containing only its distinct values. See example below: Ex: [24,23,27,23] should get converted to [24,…
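
This ClassDict error typically means the UDF is handing back numpy scalar types; casting them to built-in Python types before returning is the usual remedy. A sketch with a hypothetical column name:

    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, IntegerType

    @F.udf(returnType=ArrayType(IntegerType()))
    def distinct_values(arr):
        # Spark cannot convert numpy integers to its SQL types, so cast to plain int
        return [int(x) for x in set(arr)]

    df = df.withColumn("distinct_arr", distinct_values(F.col("arr_col")))  # 'arr_col' is hypothetical
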
59 votes • 3 answers

Find maximum row per group in Spark DataFrame

I'm trying to use Spark dataframes instead of RDDs since they appear to be higher-level than RDDs and tend to produce more readable code. In a 14-node Google Dataproc cluster, I have about 6 million names that are translated to ids by two…
Quentin Pradet • 4,691 • 2 • 29 • 41
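
Besides the window-function pattern shown for the earlier max-per-group question, a join against the per-group maxima also stays within the DataFrame API; 'name' and 'score' are hypothetical column names.

    from pyspark.sql import functions as F

    # Compute the maximum score per name, then keep only the matching original rows
    max_per_group = df.groupBy("name").agg(F.max("score").alias("score"))
    result = df.join(max_per_group, on=["name", "score"], how="inner")
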
58 votes • 4 answers

Python Spark Cumulative Sum by Group Using DataFrame

How do I compute the cumulative sum per group, specifically using the DataFrame abstraction in PySpark? With an example dataset as follows: df = sqlContext.createDataFrame( [(1,2,"a"),(3,2,"a"),(1,3,"b"),(2,2,"a"),(2,3,"b")], …
mr kw • 1,977 • 2 • 14 • 12
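
A sketch of the running-sum window for the question above, reusing its example rows; the column names are an assumption since the excerpt is cut off.

    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, 2, "a"), (3, 2, "a"), (1, 3, "b"), (2, 2, "a"), (2, 3, "b")],
        ["time", "value", "class"])  # column names assumed

    # Cumulative sum of 'value' within each 'class', ordered by 'time'
    w = (Window.partitionBy("class")
               .orderBy("time")
               .rowsBetween(Window.unboundedPreceding, Window.currentRow))

    df = df.withColumn("cum_sum", F.sum("value").over(w))
    df.show()
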