Questions tagged [pyspark]

The Spark Python API (PySpark) exposes the Apache Spark programming model to Python.

39058 questions
64 votes • 4 answers

Reading csv files with quoted fields containing embedded commas

I am reading a csv file in Pyspark as follows: df_raw=spark.read.option("header","true").csv(csv_path) However, the data file has quoted fields with embedded commas, which should not be treated as delimiters. How can I handle this in Pyspark?…
femibyte • 3,317 • 7 • 34 • 59
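
A minimal sketch of the usual remedy for the question above: Spark's CSV reader already keeps a quoted field as one value, so the common fix is to set the quote and escape options explicitly. csv_path is a placeholder, and the sketch assumes fields are wrapped in standard double quotes.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    csv_path = "/path/to/file.csv"  # placeholder path

    # Quoted fields with embedded commas are kept intact; escape='"' additionally
    # lets a doubled "" inside a quoted field stand for a literal quote character.
    df_raw = (spark.read
              .option("header", "true")
              .option("quote", '"')
              .option("escape", '"')
              .csv(csv_path))
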
64 votes • 2 answers

Apache Spark -- Assign the result of UDF to multiple dataframe columns

I'm using pyspark, loading a large csv file into a dataframe with spark-csv, and as a pre-processing step I need to apply a variety of operations to the data available in one of the columns (that contains a json string). That will return X values,…
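
For the UDF question above, one commonly used pattern is to have the UDF return a struct and then expand it with select("parsed.*"). This is only a sketch against the Spark 2.x API; the field names, the json_col column, and df itself stand in for the questioner's data.

    import json
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType

    # Hypothetical schema describing the values extracted from the JSON column
    result_schema = StructType([
        StructField("field_a", StringType()),
        StructField("field_b", StringType()),
    ])

    @F.udf(returnType=result_schema)
    def parse_json(s):
        d = json.loads(s)
        return (d.get("field_a"), d.get("field_b"))

    # One UDF call, then the struct is unpacked into separate columns
    df = (df.withColumn("parsed", parse_json(F.col("json_col")))
            .select("*", "parsed.*")
            .drop("parsed"))
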
62 votes • 3 answers

How to replace all Null values of a dataframe in Pyspark

I have a data frame in pyspark with more than 300 columns. Some of these columns contain null values. For example:
Column_1 column_2
null     null
null     null
234      null
125      124
365      187
and so on. When I want to do a…
user7543621
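
A minimal sketch for the null-replacement question above: DataFrame.fillna (alias df.na.fill) handles every column of a matching type at once, or specific columns via a dict. The column names below are placeholders.

    # Replace nulls in every numeric column with 0
    df_filled = df.fillna(0)

    # Or pick a replacement per column (hypothetical column names)
    df_filled = df.na.fill({"column_1": 0, "column_2": 0})
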
62 votes • 6 answers

PySpark groupByKey returning pyspark.resultiterable.ResultIterable

I am trying to figure out why my groupByKey is returning the following: [(0, <pyspark.resultiterable.ResultIterable object ...>), (1, <pyspark.resultiterable.ResultIterable object ...>), (2,…
theMadKing • 2,064 • 7 • 32 • 59
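
The ResultIterable seen above is simply a lazy iterable over each key's values; materializing it with mapValues(list) makes the grouped output readable. A small self-contained sketch:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize([(0, "a"), (0, "b"), (1, "c")])

    # groupByKey wraps each group in a ResultIterable; turn it into a list to inspect it
    print(rdd.groupByKey().mapValues(list).collect())
    # e.g. [(0, ['a', 'b']), (1, ['c'])]  (ordering may vary)
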
61 votes • 3 answers

Python/pyspark data frame rearrange columns

I have a data frame in python/pyspark with columns id, time, city, zip and so on… Now I added a new column name to this data frame. I have to arrange the columns in such a way that the name column comes right after id. I have done it like…
User12345 • 5,180 • 14 • 58 • 105
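
One way to answer the column-ordering question above is to build the desired order as a plain Python list and pass it to select; this sketch assumes the column names mentioned in the question.

    cols = df.columns                          # e.g. ['id', 'time', 'city', 'zip', ..., 'name']
    cols.remove("name")
    cols.insert(cols.index("id") + 1, "name")  # put 'name' right after 'id'

    df_reordered = df.select(cols)
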
61 votes • 1 answer

aggregate function Count usage with groupBy in Spark

I'm trying to perform multiple operations in one line of code in pySpark, and I'm not sure whether that's possible in my case. My intention is to avoid having to save the output as a new dataframe. My current code is rather simple: encodeUDF = udf(encode_time,…
Adiel • 1,203 • 3 • 18 • 31
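
For the groupBy/count question above, the transformation and the aggregation can be chained in one expression. encodeUDF and the time column belong to the questioner, so a stub stands in for them here; this is a sketch, not their code.

    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    encodeUDF = F.udf(lambda t: str(t), StringType())  # stub for the questioner's encode_time UDF

    result = (df.groupBy(encodeUDF(F.col("time")).alias("time_bucket"))
                .agg(F.count("*").alias("count")))
    result.show()
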
60 votes • 4 answers

GroupBy column and filter rows with maximum value in Pyspark

I am almost certain this has been asked before, but a search through stackoverflow did not answer my question. Not a duplicate of [2] since I want the maximum value, not the most frequent item. I am new to pyspark and trying to do something really…
Thomas • 4,696 • 5 • 36 • 71
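
A window-function sketch for keeping only the row with the maximum value in each group; group_col and value_col are hypothetical names.

    from pyspark.sql import Window, functions as F

    w = Window.partitionBy("group_col").orderBy(F.col("value_col").desc())

    result = (df.withColumn("rn", F.row_number().over(w))
                .filter(F.col("rn") == 1)   # keep the top row of each group
                .drop("rn"))
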
60 votes • 4 answers

Pyspark: Convert column to lowercase

I want to convert the values inside a column to lowercase. Currently if I use the lower() method, it complains that column objects are not callable. Since there's a function called lower() in SQL, I assume there's a native Spark solution that…
wlad • 2,073 • 2 • 18 • 29
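
The native counterpart to SQL's lower() is pyspark.sql.functions.lower; a one-line sketch with a placeholder column name:

    from pyspark.sql import functions as F

    df = df.withColumn("my_col", F.lower(F.col("my_col")))  # 'my_col' is a placeholder
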
60 votes • 6 answers

PySpark create new column with mapping from a dict

Using Spark 1.6, I have a Spark DataFrame column (named, let's say, col1) with values A, B, C, DS, DNS, E, F, G and H. I want to create a new column (say col2) with the values from the dict below. How do I map this? (e.g. 'A' needs to be mapped…
ad_s • 1,560 • 4 • 15 • 16
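
A frequently used approach for the dict-mapping question above builds a literal map expression with create_map (Spark 2.0+; on 1.6 as in the question, a plain Python UDF calling mapping.get is the fallback). The dict values below are invented for illustration.

    from itertools import chain
    from pyspark.sql import functions as F

    mapping = {"A": 1, "B": 2, "C": 3}  # illustrative values only

    # Flatten the dict into alternating key/value literals and index the map by col1
    mapping_expr = F.create_map([F.lit(x) for x in chain(*mapping.items())])
    df = df.withColumn("col2", mapping_expr[F.col("col1")])
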
60 votes • 6 answers

Spark SQL Row_number() PartitionBy Sort Desc

I've successfully created a row_number() partitionBy in Spark using Window, but would like to sort it descending instead of the default ascending. Here is my working code: from pyspark import HiveContext from pyspark.sql.types import * from…
jKraut • 2,325 • 6 • 35 • 48
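
Descending order goes into the window specification itself; a short sketch with hypothetical column names:

    from pyspark.sql import Window, functions as F

    w = Window.partitionBy("group_col").orderBy(F.col("sort_col").desc())
    df = df.withColumn("row_num", F.row_number().over(w))
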
59 votes • 9 answers

PySpark - Sum a column in dataframe and return results as int

I have a pyspark dataframe with a column of numbers. I need to sum that column and then have the result return as an int in a python variable. df = spark.createDataFrame([("A", 20), ("B", 30), ("D", 80)],["Letter", "Number"]) I do the following to…
Bryce Ramgovind • 3,127 • 10 • 41 • 72
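
A minimal sketch using the question's own example frame: aggregate, then pull the single value back to the driver as a plain Python int.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("A", 20), ("B", 30), ("D", 80)], ["Letter", "Number"])

    total = df.agg(F.sum("Number")).collect()[0][0]   # .first()[0] works the same way
    print(total)  # 130
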
59 votes • 7 answers

Pyspark: Parse a column of json strings

I have a pyspark dataframe consisting of one column, called json, where each row is a unicode string of json. I'd like to parse each row and return a new dataframe where each row is the parsed json. # Sample Data Frame jstr1 =…
Steve • 2,401 • 3 • 24 • 28
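
One route for the JSON-parsing question above, sketched with from_json and an explicit schema; the schema fields are hypothetical. Alternatively, the string column can be handed to spark.read.json for schema inference.

    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, LongType

    # Hypothetical schema for the JSON documents in the 'json' column
    json_schema = StructType([
        StructField("name", StringType()),
        StructField("value", LongType()),
    ])

    parsed = (df.withColumn("parsed", F.from_json(F.col("json"), json_schema))
                .select("parsed.*"))

    # Alternative: infer the schema directly from the strings
    parsed = spark.read.json(df.rdd.map(lambda row: row.json))
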
59 votes • 6 answers

Spark Error: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct)

I have a dataframe in Spark in which one of the columns contains an array. Now, I have written a separate UDF which converts the array to another array containing only its distinct values. See example below: Ex: [24,23,27,23] should get converted to [24,…
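
This ClassDict error typically means the UDF is handing back numpy scalar types; casting them to built-in Python types before returning is the usual remedy. A sketch with a hypothetical column name:

    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, IntegerType

    @F.udf(returnType=ArrayType(IntegerType()))
    def distinct_values(arr):
        # Spark cannot convert numpy integers to its SQL types, so cast to plain int
        return [int(x) for x in set(arr)]

    df = df.withColumn("distinct_arr", distinct_values(F.col("arr_col")))  # 'arr_col' is hypothetical
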
59 votes • 3 answers

Find maximum row per group in Spark DataFrame

I'm trying to use Spark dataframes instead of RDDs since they appear to be higher-level than RDDs and tend to produce more readable code. In a 14-node Google Dataproc cluster, I have about 6 million names that are translated to ids by two…
Quentin Pradet • 4,691 • 2 • 29 • 41
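
Besides the window-function pattern shown for the earlier max-per-group question, a join against the per-group maxima also stays within the DataFrame API; 'name' and 'score' are hypothetical column names.

    from pyspark.sql import functions as F

    # Compute the maximum score per name, then keep only the matching original rows
    max_per_group = df.groupBy("name").agg(F.max("score").alias("score"))
    result = df.join(max_per_group, on=["name", "score"], how="inner")
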
58 votes • 4 answers

Python Spark Cumulative Sum by Group Using DataFrame

How do I compute the cumulative sum per group, specifically using the DataFrame abstraction in PySpark? With an example dataset as follows: df = sqlContext.createDataFrame( [(1,2,"a"),(3,2,"a"),(1,3,"b"),(2,2,"a"),(2,3,"b")], …
mr kw • 1,977 • 2 • 14 • 12
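
A sketch of the running-sum window for the question above, reusing its example rows; the column names are an assumption since the excerpt is cut off.

    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, 2, "a"), (3, 2, "a"), (1, 3, "b"), (2, 2, "a"), (2, 3, "b")],
        ["time", "value", "class"])  # column names assumed

    # Cumulative sum of 'value' within each 'class', ordered by 'time'
    w = (Window.partitionBy("class")
               .orderBy("time")
               .rowsBetween(Window.unboundedPreceding, Window.currentRow))

    df = df.withColumn("cum_sum", F.sum("value").over(w))
    df.show()
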