Questions tagged [pyspark]

The Spark Python API (PySpark) exposes the Apache Spark programming model to Python.

39058 questions
8 votes, 1 answer

How to print an RDD in Python in Spark

I have two files on HDFS and I just want to join these two files on a column, say employee id. I am simply trying to print the files to make sure we are reading them correctly from HDFS. lines = sc.textFile("hdfs://ip:8020/emp.txt") print…
yguw
  • 856
  • 6
  • 12
  • 32
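
A minimal sketch for the question above, assuming sc is an existing SparkContext and that emp.txt is comma-separated with the employee id in the first field (both assumptions, not stated in the question):

    lines = sc.textFile("hdfs://ip:8020/emp.txt")

    # take(n) brings only a few records to the driver, which is safer than collect()
    for line in lines.take(10):
        print(line)

    # key each record by its first comma-separated field (assumed employee id),
    # so the two files can later be combined with join()
    emp_by_id = lines.map(lambda line: (line.split(",")[0], line))
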
8 votes, 3 answers

pyspark: how to check if a file exists in hdfs

I want to check whether several files exist in HDFS before loading them with SparkContext. I use pyspark. I tried os.system("hadoop fs -test -e %s" %path), but as I have a lot of paths to check, the job crashed. I also tried sc.wholeTextFiles(parent_path) and…
A7med
  • 451
  • 2
  • 5
  • 6
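
One possible approach, sketched below, is to call the Hadoop FileSystem API through Spark's JVM gateway instead of shelling out once per path; sc is an existing SparkContext and the paths list is a placeholder:

    hadoop_conf = sc._jsc.hadoopConfiguration()
    Path = sc._jvm.org.apache.hadoop.fs.Path
    fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)

    paths = ["hdfs://ip:8020/data/a.txt", "hdfs://ip:8020/data/b.txt"]

    # keep only the paths that actually exist before handing them to textFile()
    existing = [p for p in paths if fs.exists(Path(p))]
    rdd = sc.textFile(",".join(existing))
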
8 votes, 2 answers

Pyspark: shuffle RDD

I'm trying to randomise the order of elements in an RDD. My current approach is to zip the elements with an RDD of shuffled integers, then later join by those integers. However, pyspark falls over with only 100000000 integers. I'm using the code…
Marcin
  • 48,559
  • 18
  • 128
  • 201
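
A sketch of an alternative that avoids zipping and joining against a huge RDD of integers: attach a random sort key to each element and sort by it (rdd stands for the RDD to be shuffled):

    import random

    shuffled = (rdd.map(lambda x: (random.random(), x))   # random sort key per element
                   .sortByKey()                           # distributed sort on that key
                   .values())                             # drop the keys again
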
8 votes, 1 answer

A list as a key for PySpark's reduceByKey

I am attempting to call the reduceByKey function of pyspark on data of the format (([a,b,c], 1), ([a,b,c], 1), ([a,d,b,e], 1), ... It seems pyspark will not accept an array as the key in a normal key/value reduction by simply applying…
Peter Doro
  • 255
  • 4
  • 10
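
Python lists are unhashable, so they cannot be used as shuffle keys; a sketch of the usual workaround is to convert each key to a tuple first (rdd stands for the (list, count) pairs from the question):

    counts = (rdd.map(lambda kv: (tuple(kv[0]), kv[1]))   # tuples are hashable
                 .reduceByKey(lambda a, b: a + b))
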
8 votes, 1 answer

how to print out snippets of an RDD in the spark-shell / pyspark?

When working in the spark-shell, I frequently want to inspect RDDs (similar to using head in unix). For example: scala> val readmeFile = sc.textFile("input/tmp/README.md") scala> // how to inspect the readmeFile? and ... scala> val…
Chris Snow
  • 23,813
  • 35
  • 144
  • 309
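
A pyspark sketch of head-like inspection (the README path is taken from the question):

    readme_file = sc.textFile("input/tmp/README.md")

    print(readme_file.first())          # just the first line
    for line in readme_file.take(5):    # first five lines, like `head -n 5`
        print(line)
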
8 votes, 1 answer

How to join two RDDs in spark with python?

Suppose rdd1 = ( (a, 1), (a, 2), (b, 1) ), rdd2 = ( (a, ?), (a, *), (c, .) ). Want to generate ( (a, (1, ?)), (a, (1, *)), (a, (2, ?)), (a, (2, *)) ). Any easy methods? I think it is different from the cross join but can't find a good…
Peng Sun
  • 130
  • 1
  • 1
  • 8
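
A minimal sketch using join(), with hypothetical string values standing in for the ?, * and . placeholders:

    rdd1 = sc.parallelize([("a", 1), ("a", 2), ("b", 1)])
    rdd2 = sc.parallelize([("a", "?"), ("a", "*"), ("c", ".")])

    # join() is an inner join on the key and yields (key, (left_value, right_value))
    joined = rdd1.join(rdd2)
    # collect() returns the four ('a', ...) pairs; 'b' and 'c' drop out of the inner join
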
8 votes, 2 answers

Joining two spark dataframes on time (TimestampType) in python

I have two dataframes and I would like to join them based on one column, with a caveat that this column is a timestamp, and that timestamp has to be within a certain offset (5 seconds) in order to join records. More specifically, a record in…
Oleksiy
  • 6,337
  • 5
  • 41
  • 58
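
A sketch of a range join on the timestamp column; df1/df2 and the column names ts and id are placeholders, not taken from the question:

    import pyspark.sql.functions as F

    # cast TimestampType to long (seconds since epoch) and keep pairs within 5 seconds
    joined = df1.alias("a").join(
        df2.alias("b"),
        F.abs(F.col("a.ts").cast("long") - F.col("b.ts").cast("long")) <= 5,
    )
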
8 votes, 2 answers

pySpark Create DataFrame from RDD with Key/Value

If I have an RDD of Key/Value pairs (the key being the column index), is it possible to load it into a dataframe? For example: (0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18) And have the dataframe look like: 1,2,18 1,10,18 2,20,18
theMadKing
  • 2,064
  • 7
  • 32
  • 59
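
A sketch of one way to pivot such an RDD into rows, assuming every column index carries the same number of values and that they arrive in row order (not guaranteed in general); spark stands for an existing SparkSession:

    rdd = sc.parallelize([(0, 1), (0, 1), (0, 2),
                          (1, 2), (1, 10), (1, 20),
                          (3, 18), (3, 18), (3, 18)])

    # one list of values per column index, ordered by the index
    columns = (rdd.groupByKey()
                  .sortByKey()
                  .map(lambda kv: list(kv[1]))
                  .collect())              # [[1, 1, 2], [2, 10, 20], [18, 18, 18]]

    rows = list(zip(*columns))             # [(1, 2, 18), (1, 10, 18), (2, 20, 18)]
    df = spark.createDataFrame(rows, ["c0", "c1", "c3"])
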
8 votes, 1 answer

Create Spark DataFrame from nested dictionary

I have a list of nested dictionaries, e.g. ds = [{'a': {'b': {'c': 1}}}] and want to create a spark DataFrame from it while inferring the schema of the nested dictionaries. Using sqlContext.createDataFrame(ds).printSchema() gives me the following schema: root …
Marigold
  • 1,619
  • 1
  • 15
  • 17
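
One sketch that preserves the nesting is to serialise each dictionary to JSON and let the JSON reader infer a struct schema (spark and sc stand for an existing SparkSession and SparkContext):

    import json

    ds = [{'a': {'b': {'c': 1}}}]

    df = spark.read.json(sc.parallelize([json.dumps(d) for d in ds]))
    df.printSchema()   # a is inferred as a struct containing b, which in turn contains c
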
8 votes, 2 answers

Not able to connect to postgres using jdbc in pyspark shell

I am using a standalone cluster on my local Windows machine and trying to load data from one of our servers using the following code - from pyspark.sql import SQLContext sqlContext = SQLContext(sc) df = sqlContext.load(source="jdbc",…
Soni Shashank
  • 221
  • 1
  • 3
  • 9
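
A sketch of the equivalent read with the DataFrameReader API; host, database, table and credentials are placeholders, and the PostgreSQL JDBC driver jar must be on the classpath (for example by starting the shell with pyspark --jars /path/to/postgresql-driver.jar):

    df = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://dbhost:5432/dbname")
          .option("dbtable", "schema.tablename")
          .option("user", "username")
          .option("password", "password")
          .option("driver", "org.postgresql.Driver")
          .load())
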
8 votes, 1 answer

Functions from Python packages for udf() of Spark dataframe

For Spark dataframe via pyspark, we can use pyspark.sql.functions.udf to create a user defined function (UDF). I wonder if I can use any function from Python packages in udf(), e.g., np.random.normal from numpy?
Jie Chen
  • 151
  • 1
  • 2
  • 4
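
A sketch showing that an imported package function can be called inside a UDF, as long as the result is converted to a plain Python type matching the declared return type (the column names are placeholders):

    import numpy as np
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    @F.udf(returnType=DoubleType())
    def add_noise(x):
        # float() turns the numpy scalar into a value Spark can serialise
        return float(np.random.normal(loc=x, scale=1.0))

    df = spark.range(5).withColumn("noisy", add_noise(F.col("id").cast("double")))
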
8 votes, 1 answer

Save Apache Spark mllib model in python

I am trying to save a fitted model to a file in Spark. I have a Spark cluster which trains a RandomForest model. I would like to save and reuse the fitted model on another machine. I read some posts on the web which recommend doing java…
poiuytrez
  • 21,330
  • 35
  • 113
  • 172
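
A sketch using the save/load pair on the RDD-based mllib model (available from Spark 1.3 onwards); model stands for the fitted RandomForestModel and the HDFS path is a placeholder:

    from pyspark.mllib.tree import RandomForestModel

    model.save(sc, "hdfs://ip:8020/models/rf_model")

    # later, possibly on a different machine running the same Spark version
    loaded = RandomForestModel.load(sc, "hdfs://ip:8020/models/rf_model")
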
7 votes, 2 answers

Does spark read the same file twice, if two stages are using the same DataFrame?

The following code reads the same csv twice even though only one action is called. End-to-end runnable example: import pandas as pd import numpy as np df1= pd.DataFrame(np.arange(1_000).reshape(-1,1)) df1.index =…
figs_and_nuts
  • 4,870
  • 2
  • 31
  • 56
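
Each action re-evaluates the lineage unless the shared DataFrame is persisted; a sketch of the usual fix (the csv path is a placeholder):

    df = spark.read.csv("data.csv", header=True).cache()

    # both actions below reuse the cached data instead of scanning the csv twice
    print(df.count())
    print(df.distinct().count())

    df.unpersist()
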
7 votes, 1 answer

pyspark.sql.utils.AnalysisException: Parquet data source does not support void data type

I am trying to add a column to my dataframe df1 in PySpark. The code I tried: import pyspark.sql.functions as F df1 = df1.withColumn("empty_column", F.lit(None)) But I get this error: pyspark.sql.utils.AnalysisException: Parquet data source does…
ar_mm18
  • 415
  • 2
  • 8
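
F.lit(None) produces a column of void (null) type, which Parquet cannot store; a sketch of the usual fix is to give the literal an explicit type:

    import pyspark.sql.functions as F
    from pyspark.sql.types import StringType

    df1 = df1.withColumn("empty_column", F.lit(None).cast(StringType()))
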
7 votes, 4 answers

Join dataframes and rename resulting columns with same names

Shortened example: vals1 = [(1, "a"), (2, "b"), ] columns1 = ["id","name"] df1 = spark.createDataFrame(data=vals1, schema=columns1) vals2 = [(1, "k"), ] columns2 = ["id","name"] df2 = spark.createDataFrame(data=vals2,…
user626528
  • 13,999
  • 30
  • 78
  • 146
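
A sketch that keeps both name columns by aliasing the frames and renaming in the select; it reuses df1 and df2 from the question:

    import pyspark.sql.functions as F

    joined = (df1.alias("l")
                 .join(df2.alias("r"), on="id", how="inner")
                 .select(
                     F.col("id"),
                     F.col("l.name").alias("name_1"),
                     F.col("r.name").alias("name_2"),
                 ))
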