Questions tagged [pyspark]

The Spark Python API (PySpark) exposes the Apache Spark programming model to Python.

39058 questions
51 votes, 6 answers

pyspark: ValueError: Some of types cannot be determined after inferring

I have a pandas data frame my_df, and my_df.dtypes gives us: ts int64 fieldA object fieldB object fieldC object fieldD object fieldE object dtype: object Then I am trying to convert the pandas…
Edamame • 23,718 • 73 • 186 • 320
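This error usually means one or more columns contain only None/NaN values, so Spark cannot infer their types when converting the pandas frame. A minimal sketch of one workaround, passing an explicit schema (the column names and sample frame below are hypothetical):

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, StringType

spark = SparkSession.builder.getOrCreate()

# A pandas frame with an all-None column, which normally breaks type inference
my_df = pd.DataFrame({"ts": [1, 2], "fieldA": ["a", "b"], "fieldB": [None, None]})

# With an explicit schema Spark never has to guess the type of fieldB
schema = StructType([
    StructField("ts", LongType(), True),
    StructField("fieldA", StringType(), True),
    StructField("fieldB", StringType(), True),
])

spark_df = spark.createDataFrame(my_df, schema=schema)
spark_df.show()
```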
51 votes, 6 answers

How to run a script in PySpark

I'm trying to run a script in the pyspark environment but so far I haven't been able to. How can I run a script like python script.py but in pyspark?
Daniel Rodríguez • 684 • 3 • 7 • 15
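Two common options are submitting the file with spark-submit or building a SparkSession inside an ordinary Python script; a minimal sketch (the script and app name are hypothetical):

```python
# script.py - run with `spark-submit script.py`, or with plain `python script.py`
# when the pyspark package is importable in that interpreter
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("my_script").getOrCreate()
print(spark.range(5).count())
spark.stop()
```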
51 votes, 2 answers

AttributeError: 'DataFrame' object has no attribute 'map'

I wanted to convert the Spark data frame to an RDD of dense vectors using the code below: from pyspark.mllib.clustering import KMeans spark_df = sqlContext.createDataFrame(pandas_df) rdd = spark_df.map(lambda data: Vectors.dense([float(c) for c in data])) model =…
Edamame • 23,718 • 73 • 186 • 320
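In Spark 2.x the Python DataFrame no longer exposes .map() directly; the usual fix is to go through the underlying RDD. A sketch under that assumption, with made-up sample data:

```python
from pyspark.sql import SparkSession
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.clustering import KMeans

spark = SparkSession.builder.getOrCreate()
spark_df = spark.createDataFrame([(1.0, 2.0), (8.0, 9.0)], ["x", "y"])

# DataFrame has no .map in Spark 2+, but its .rdd still does
rdd = spark_df.rdd.map(lambda row: Vectors.dense([float(c) for c in row]))
model = KMeans.train(rdd, k=2, maxIterations=10)
```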
51 votes, 3 answers

Number of partitions in RDD and performance in Spark

In PySpark, I can create an RDD from a list and decide how many partitions to have: sc = SparkContext() sc.parallelize(xrange(0, 10), 4) How does the number of partitions I choose for my RDD influence performance? And how does this…
mar tin • 9,266 • 23 • 72 • 97
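The partition count bounds how many tasks can work on the RDD in parallel, so too few partitions underuses the cluster while too many adds scheduling overhead. A quick way to inspect and change it (the numbers are arbitrary):

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Four partitions means at most four tasks process this RDD at once
rdd = sc.parallelize(range(0, 10), 4)
print(rdd.getNumPartitions())                 # 4

# repartition() shuffles the data into a new number of partitions
print(rdd.repartition(8).getNumPartitions())  # 8
```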
50 votes, 1 answer

How to conditionally replace value in a column based on evaluation of expression based on another column in Pyspark?

import numpy as np df = spark.createDataFrame( [(1, 1, None), (1, 2, float(5)), (1, 3, np.nan), (1, 4, None), (0, 5, float(10)), (1, 6, float('nan')), (0, 6, float('nan'))], ('session', "timestamp1",…
GeorgeOfTheRF • 8,244 • 23 • 57 • 80
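One common way to express a conditional replacement is pyspark.sql.functions.when/otherwise; a sketch based on the question's sample data (the truncated third column name is assumed here to be id2, and the condition is illustrative):

```python
import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 1, None), (1, 2, float(5)), (1, 3, np.nan), (1, 4, None),
     (0, 5, float(10)), (1, 6, float("nan")), (0, 6, float("nan"))],
    ("session", "timestamp1", "id2"))

# Set id2 to 0 whenever session is 0, otherwise keep the original value
df = df.withColumn("id2", F.when(F.col("session") == 0, 0).otherwise(F.col("id2")))
df.show()
```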
50 votes, 2 answers

How can I write a parquet file using Spark (pyspark)?

I'm pretty new to Spark and I've been trying to convert a DataFrame to a parquet file, but I haven't had success yet. The documentation says that I can use the write.parquet function to create the file. However, when I run the script it shows…
ultraInstinct • 4,063 • 10 • 36 • 53
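A minimal sketch of writing a DataFrame out as Parquet and reading it back (the output path is hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Produces a directory of Parquet part files; overwrite replaces any existing output
df.write.mode("overwrite").parquet("/tmp/example_parquet")

spark.read.parquet("/tmp/example_parquet").show()
```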
50 votes, 1 answer

Specifying the filename when saving a DataFrame as a CSV

Say I have a Spark DataFrame that I want to save to disk as a CSV file. In Spark 2.0.0+, one can convert a DataFrame (Dataset[Row]) to a DataFrameWriter and use the .csv method to write the file. The function is defined as def csv(path: String): Unit path…
Spandan Brahmbhatt • 3,774 • 6 • 24 • 36
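Spark always writes a directory of part files rather than one named file, so getting a single named CSV usually means coalescing to one partition and renaming the part file afterwards. A sketch of that workaround for a local filesystem (paths and names are hypothetical):

```python
import glob
import shutil
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# One partition -> exactly one part-*.csv inside the output directory
df.coalesce(1).write.mode("overwrite").option("header", True).csv("/tmp/csv_out")

# Rename that single part file to the name we actually want
part_file = glob.glob("/tmp/csv_out/part-*.csv")[0]
shutil.move(part_file, "/tmp/my_data.csv")
```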
50 votes, 6 answers

spark 2.1.0 session config settings (pyspark)

I am trying to overwrite the Spark session/Spark context default configs, but it is picking up the entire node/cluster resources. spark = SparkSession.builder .master("ip") .enableHiveSupport() …
Harish • 969 • 2 • 10 • 15
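Resource limits can be passed through the builder's config calls before the session is created; a sketch with hypothetical values (note that getOrCreate() reuses an existing session, so these settings only take effect on a fresh one):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[2]")                     # hypothetical master URL
         .config("spark.executor.memory", "2g")  # memory per executor
         .config("spark.executor.cores", "2")    # cores per executor
         .config("spark.cores.max", "4")         # cap on total cores for the app
         .enableHiveSupport()
         .getOrCreate())
```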
50 votes, 4 answers

Filtering a pyspark dataframe using isin by exclusion

I am trying to get all rows within a dataframe where a column's value is not within a list (so filtering by exclusion). As an example: df = sqlContext.createDataFrame([('1','a'),('2','b'),('3','b'),('4','c'),('5','d')] ,schema=('id','bar')) I get…
gabrown86 • 1,719 • 3 • 12 • 18
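Negating isin with ~ filters by exclusion; a sketch on the question's example data (the excluded values are arbitrary):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [('1', 'a'), ('2', 'b'), ('3', 'b'), ('4', 'c'), ('5', 'd')],
    schema=('id', 'bar'))

# Keep only rows whose bar value is NOT in the list
df.filter(~F.col('bar').isin(['a', 'b'])).show()
```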
50 votes, 4 answers

Applying UDFs on GroupedData in PySpark (with functioning python example)

I have this python code that runs locally in a pandas dataframe: df_result = pd.DataFrame(df .groupby('A') .apply(lambda x: myFunction(zip(x.B, x.C), x.name)) I would like to run this in PySpark,…
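On Spark 3.x one way to run arbitrary per-group pandas logic is a grouped map via applyInPandas; a sketch under that assumption, where per_group stands in for the question's myFunction:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("g1", 1.0, 2.0), ("g1", 3.0, 4.0), ("g2", 5.0, 6.0)], ["A", "B", "C"])

def per_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # Ordinary pandas code runs here, once per group
    return pd.DataFrame({"A": [pdf["A"].iloc[0]], "B_sum": [pdf["B"].sum()]})

result = df.groupBy("A").applyInPandas(per_group, schema="A string, B_sum double")
result.show()
```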
50 votes, 3 answers

Spark RDD to DataFrame python

I am trying to convert a Spark RDD to a DataFrame. I have seen the documentation and an example where the schema is passed to the sqlContext.createDataFrame(rdd, schema) function. But I have 38 columns or fields and this will increase further. If I…
Jack Daniel • 2,527 • 3 • 31 • 52
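With many columns it is often easier to let toDF infer the types and pass only the column names; a sketch with hypothetical names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize([(1, "a", 3.0), (2, "b", 4.0)])

# toDF needs only the column names; the types are inferred from the data
df = rdd.toDF(["id", "label", "score"])
df.printSchema()
```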
50 votes, 7 answers

Spark 1.4 increase maxResultSize memory

I am using Spark 1.4 for my research and struggling with the memory settings. My machine has 16GB of memory, so there should be no problem there since my file is only 300MB. However, when I try to convert the Spark RDD to a pandas dataframe using toPandas()…
ahajib • 12,838 • 29 • 79 • 120
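The collect that toPandas() performs is capped by spark.driver.maxResultSize, so raising that setting is the usual fix; a sketch of setting it before the context is created (the 2g value is arbitrary):

```python
from pyspark import SparkConf, SparkContext

# maxResultSize limits the serialized size of results pulled back to the driver,
# which is what collect() and toPandas() are bounded by
conf = SparkConf().set("spark.driver.maxResultSize", "2g")
sc = SparkContext(conf=conf)
```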
50 votes, 2 answers

What is the Spark DataFrame method `toPandas` actually doing?

I'm a beginner with the Spark DataFrame API. I use this code to load a tab-separated csv into a Spark DataFrame: lines = sc.textFile('tail5.csv') parts = lines.map(lambda l : l.strip().split('\t')) fnames = *some name list* schemaData =…
Napitupulu Jon • 7,713 • 3 • 22 • 23
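toPandas() collects every row of the DataFrame to the driver and builds a local pandas object there, so it is only safe when the result fits in driver memory; a small illustration (the limit is arbitrary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000)

# Everything after .limit() is pulled onto the driver as a plain pandas DataFrame
small_pdf = df.limit(100).toPandas()
print(type(small_pdf), len(small_pdf))
```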
49 votes, 5 answers

What's the equivalent of pandas' value_counts() in PySpark?

I have the following python/pandas command: df.groupby('Column_Name').agg(lambda x: x.value_counts().max()) with which I get the value counts for ALL columns in a DataFrameGroupBy object. How do I do the same in PySpark?
TSAR • 683 • 1 • 6 • 8
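The usual PySpark equivalent is groupBy plus count, ordered by the count descending; a sketch:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("a",), ("b",)], ["Column_Name"])

# One row per distinct value with its frequency, like pandas value_counts()
df.groupBy("Column_Name").count().orderBy(F.desc("count")).show()
```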
49 votes, 4 answers

pyspark: rolling average using timeseries data

I have a dataset consisting of a timestamp column and a dollars column. I would like to find the average number of dollars per week ending at the timestamp of each row. I was initially looking at the pyspark.sql.functions.window function, but that…
Bob Swain • 3,052 • 3 • 17 • 28
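A time-based rolling average can be expressed with a window frame over the timestamp cast to seconds; a sketch assuming a 7-day trailing window and hypothetical column names:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = (spark.createDataFrame(
        [("2017-03-10", 25.0), ("2017-03-15", 35.0), ("2017-03-18", 45.0)],
        ["dt", "dollars"])
      .withColumn("ts", F.col("dt").cast("timestamp")))

# rangeBetween operates on the ordering value, so order by the timestamp in seconds
days = lambda n: n * 86400
w = Window.orderBy(F.col("ts").cast("long")).rangeBetween(-days(7), 0)

df.withColumn("rolling_avg", F.avg("dollars").over(w)).show()
```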