Questions tagged [pyspark]

The Spark Python API (PySpark) exposes the Apache Spark programming model to Python.

39058 questions
51 votes, 6 answers

pyspark: ValueError: Some of types cannot be determined after inferring

I have a pandas data frame my_df, and my_df.dtypes gives us: ts int64 fieldA object fieldB object fieldC object fieldD object fieldE object dtype: object Then I am trying to convert the pandas…
Edamame • 23,718 • 73 • 186 • 320
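This error usually means one or more columns contain only None/NaN values, so Spark cannot infer their types when converting the pandas frame. A minimal sketch of one workaround, passing an explicit schema (the column names and sample frame below are hypothetical):

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, StringType

spark = SparkSession.builder.getOrCreate()

# A pandas frame with an all-None column, which normally breaks type inference
my_df = pd.DataFrame({"ts": [1, 2], "fieldA": ["a", "b"], "fieldB": [None, None]})

# With an explicit schema Spark never has to guess the type of fieldB
schema = StructType([
    StructField("ts", LongType(), True),
    StructField("fieldA", StringType(), True),
    StructField("fieldB", StringType(), True),
])

spark_df = spark.createDataFrame(my_df, schema=schema)
spark_df.show()
```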
51 votes, 6 answers

How to run a script in PySpark

I'm trying to run a script in the pyspark environment but so far I haven't been able to. How can I run a script like python script.py but in pyspark?
Daniel Rodríguez • 684 • 3 • 7 • 15
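Two common options are submitting the file with spark-submit or building a SparkSession inside an ordinary Python script; a minimal sketch (the script and app name are hypothetical):

```python
# script.py - run with `spark-submit script.py`, or with plain `python script.py`
# when the pyspark package is importable in that interpreter
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("my_script").getOrCreate()
print(spark.range(5).count())
spark.stop()
```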
51 votes, 2 answers

AttributeError: 'DataFrame' object has no attribute 'map'

I wanted to convert the Spark data frame to an RDD of dense vectors using the code below: from pyspark.mllib.clustering import KMeans spark_df = sqlContext.createDataFrame(pandas_df) rdd = spark_df.map(lambda data: Vectors.dense([float(c) for c in data])) model =…
Edamame • 23,718 • 73 • 186 • 320
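In Spark 2.x the Python DataFrame no longer exposes .map() directly; the usual fix is to go through the underlying RDD. A sketch under that assumption, with made-up sample data:

```python
from pyspark.sql import SparkSession
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.clustering import KMeans

spark = SparkSession.builder.getOrCreate()
spark_df = spark.createDataFrame([(1.0, 2.0), (8.0, 9.0)], ["x", "y"])

# DataFrame has no .map in Spark 2+, but its .rdd still does
rdd = spark_df.rdd.map(lambda row: Vectors.dense([float(c) for c in row]))
model = KMeans.train(rdd, k=2, maxIterations=10)
```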
51 votes, 3 answers

Number of partitions in RDD and performance in Spark

In PySpark, I can create an RDD from a list and decide how many partitions to have: sc = SparkContext() sc.parallelize(xrange(0, 10), 4) How does the number of partitions I choose for my RDD influence performance? And how does this…
mar tin • 9,266 • 23 • 72 • 97
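The partition count bounds how many tasks can work on the RDD in parallel, so too few partitions underuses the cluster while too many adds scheduling overhead. A quick way to inspect and change it (the numbers are arbitrary):

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Four partitions means at most four tasks process this RDD at once
rdd = sc.parallelize(range(0, 10), 4)
print(rdd.getNumPartitions())                 # 4

# repartition() shuffles the data into a new number of partitions
print(rdd.repartition(8).getNumPartitions())  # 8
```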
50 votes, 1 answer

How to conditionally replace value in a column based on evaluation of expression based on another column in Pyspark?

import numpy as np df = spark.createDataFrame( [(1, 1, None), (1, 2, float(5)), (1, 3, np.nan), (1, 4, None), (0, 5, float(10)), (1, 6, float('nan')), (0, 6, float('nan'))], ('session', "timestamp1",…
GeorgeOfTheRF • 8,244 • 23 • 57 • 80
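One common way to express a conditional replacement is pyspark.sql.functions.when/otherwise; a sketch based on the question's sample data (the truncated third column name is assumed here to be id2, and the condition is illustrative):

```python
import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 1, None), (1, 2, float(5)), (1, 3, np.nan), (1, 4, None),
     (0, 5, float(10)), (1, 6, float("nan")), (0, 6, float("nan"))],
    ("session", "timestamp1", "id2"))

# Set id2 to 0 whenever session is 0, otherwise keep the original value
df = df.withColumn("id2", F.when(F.col("session") == 0, 0).otherwise(F.col("id2")))
df.show()
```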
50 votes, 2 answers

How can I write a parquet file using Spark (pyspark)?

I'm pretty new to Spark and I've been trying to convert a DataFrame to a parquet file, but I haven't had success yet. The documentation says that I can use the write.parquet function to create the file. However, when I run the script it shows…
ultraInstinct • 4,063 • 10 • 36 • 53
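A minimal sketch of writing a DataFrame out as Parquet and reading it back (the output path is hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Produces a directory of Parquet part files; overwrite replaces any existing output
df.write.mode("overwrite").parquet("/tmp/example_parquet")

spark.read.parquet("/tmp/example_parquet").show()
```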
50 votes, 1 answer

Specifying the filename when saving a DataFrame as a CSV

Say I have a Spark DataFrame that I want to save to disk as a CSV file. In Spark 2.0.0+, one can convert a DataFrame (Dataset[Row]) to a DataFrameWriter and use the .csv method to write the file. The function is defined as def csv(path: String): Unit path…
Spandan Brahmbhatt • 3,774 • 6 • 24 • 36
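Spark always writes a directory of part files rather than one named file, so getting a single named CSV usually means coalescing to one partition and renaming the part file afterwards. A sketch of that workaround for a local filesystem (paths and names are hypothetical):

```python
import glob
import shutil
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# One partition -> exactly one part-*.csv inside the output directory
df.coalesce(1).write.mode("overwrite").option("header", True).csv("/tmp/csv_out")

# Rename that single part file to the name we actually want
part_file = glob.glob("/tmp/csv_out/part-*.csv")[0]
shutil.move(part_file, "/tmp/my_data.csv")
```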
50 votes, 6 answers

spark 2.1.0 session config settings (pyspark)

I am trying to overwrite the Spark session/Spark context default configs, but it is picking up the entire node/cluster resources. spark = SparkSession.builder .master("ip") .enableHiveSupport() …
Harish • 969 • 2 • 10 • 15
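Resource limits can be passed through the builder's config calls before the session is created; a sketch with hypothetical values (note that getOrCreate() reuses an existing session, so these settings only take effect on a fresh one):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[2]")                     # hypothetical master URL
         .config("spark.executor.memory", "2g")  # memory per executor
         .config("spark.executor.cores", "2")    # cores per executor
         .config("spark.cores.max", "4")         # cap on total cores for the app
         .enableHiveSupport()
         .getOrCreate())
```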
50 votes, 4 answers

Filtering a pyspark dataframe using isin by exclusion

I am trying to get all rows within a dataframe where a column's value is not within a list (so filtering by exclusion). As an example: df = sqlContext.createDataFrame([('1','a'),('2','b'),('3','b'),('4','c'),('5','d')] ,schema=('id','bar')) I get…
gabrown86 • 1,719 • 3 • 12 • 18
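Negating isin with ~ filters by exclusion; a sketch on the question's example data (the excluded values are arbitrary):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [('1', 'a'), ('2', 'b'), ('3', 'b'), ('4', 'c'), ('5', 'd')],
    schema=('id', 'bar'))

# Keep only rows whose bar value is NOT in the list
df.filter(~F.col('bar').isin(['a', 'b'])).show()
```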
50 votes, 4 answers

Applying UDFs on GroupedData in PySpark (with functioning python example)

I have this python code that runs locally in a pandas dataframe: df_result = pd.DataFrame(df .groupby('A') .apply(lambda x: myFunction(zip(x.B, x.C), x.name)) I would like to run this in PySpark,…
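On Spark 3.x one way to run arbitrary per-group pandas logic is a grouped map via applyInPandas; a sketch under that assumption, where per_group stands in for the question's myFunction:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("g1", 1.0, 2.0), ("g1", 3.0, 4.0), ("g2", 5.0, 6.0)], ["A", "B", "C"])

def per_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # Ordinary pandas code runs here, once per group
    return pd.DataFrame({"A": [pdf["A"].iloc[0]], "B_sum": [pdf["B"].sum()]})

result = df.groupBy("A").applyInPandas(per_group, schema="A string, B_sum double")
result.show()
```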
50 votes, 3 answers

Spark RDD to DataFrame python

I am trying to convert a Spark RDD to a DataFrame. I have seen the documentation and an example where the schema is passed to the sqlContext.createDataFrame(rdd, schema) function. But I have 38 columns or fields and this will increase further. If I…
Jack Daniel • 2,527 • 3 • 31 • 52
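With many columns it is often easier to let toDF infer the types and pass only the column names; a sketch with hypothetical names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize([(1, "a", 3.0), (2, "b", 4.0)])

# toDF needs only the column names; the types are inferred from the data
df = rdd.toDF(["id", "label", "score"])
df.printSchema()
```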
50 votes, 7 answers

Spark 1.4 increase maxResultSize memory

I am using Spark 1.4 for my research and struggling with the memory settings. My machine has 16GB of memory, so there should be no problem there since my file is only 300MB. However, when I try to convert the Spark RDD to a pandas dataframe using toPandas()…
ahajib • 12,838 • 29 • 79 • 120
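The collect that toPandas() performs is capped by spark.driver.maxResultSize, so raising that setting is the usual fix; a sketch of setting it before the context is created (the 2g value is arbitrary):

```python
from pyspark import SparkConf, SparkContext

# maxResultSize limits the serialized size of results pulled back to the driver,
# which is what collect() and toPandas() are bounded by
conf = SparkConf().set("spark.driver.maxResultSize", "2g")
sc = SparkContext(conf=conf)
```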
50 votes, 2 answers

What is the Spark DataFrame method `toPandas` actually doing?

I'm a beginner with the Spark DataFrame API. I use this code to load a tab-separated csv into a Spark DataFrame: lines = sc.textFile('tail5.csv') parts = lines.map(lambda l : l.strip().split('\t')) fnames = *some name list* schemaData =…
Napitupulu Jon • 7,713 • 3 • 22 • 23
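toPandas() collects every row of the DataFrame to the driver and builds a local pandas object there, so it is only safe when the result fits in driver memory; a small illustration (the limit is arbitrary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000)

# Everything after .limit() is pulled onto the driver as a plain pandas DataFrame
small_pdf = df.limit(100).toPandas()
print(type(small_pdf), len(small_pdf))
```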
49 votes, 5 answers

What's the equivalent of pandas' value_counts() in PySpark?

I have the following python/pandas command: df.groupby('Column_Name').agg(lambda x: x.value_counts().max()) with which I get the value counts for ALL columns in a DataFrameGroupBy object. How do I do the same in PySpark?
TSAR • 683 • 1 • 6 • 8
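The usual PySpark equivalent is groupBy plus count, ordered by the count descending; a sketch:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("a",), ("b",)], ["Column_Name"])

# One row per distinct value with its frequency, like pandas value_counts()
df.groupBy("Column_Name").count().orderBy(F.desc("count")).show()
```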
49 votes, 4 answers

pyspark: rolling average using timeseries data

I have a dataset consisting of a timestamp column and a dollars column. I would like to find the average number of dollars per week ending at the timestamp of each row. I was initially looking at the pyspark.sql.functions.window function, but that…
Bob Swain • 3,052 • 3 • 17 • 28
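A time-based rolling average can be expressed with a window frame over the timestamp cast to seconds; a sketch assuming a 7-day trailing window and hypothetical column names:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = (spark.createDataFrame(
        [("2017-03-10", 25.0), ("2017-03-15", 35.0), ("2017-03-18", 45.0)],
        ["dt", "dollars"])
      .withColumn("ts", F.col("dt").cast("timestamp")))

# rangeBetween operates on the ordering value, so order by the timestamp in seconds
days = lambda n: n * 86400
w = Window.orderBy(F.col("ts").cast("long")).rangeBetween(-days(7), 0)

df.withColumn("rolling_avg", F.avg("dollars").over(w)).show()
```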