Questions tagged [pyspark]

The Spark Python API (PySpark) exposes the Apache Spark programming model to Python.

39058 questions
8
votes
1 answer

PySpark count values by condition

I have a DataFrame; a snippet here: [['u1', 1], ['u2', 0]]. Basically there is a string field named f and either a 1 or a 0 as the second element (is_fav). What I need to do is group on the first field and count the occurrences of 1s and 0s. I was hoping…
mar tin
  • 9,266
  • 23
  • 72
  • 97
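
A minimal sketch of one way to do this count, assuming columns named f and is_fav as in the excerpt (the aggregation shown is an illustration, not necessarily the accepted answer):

    from pyspark.sql import functions as F

    df = sqlContext.createDataFrame([('u1', 1), ('u2', 0), ('u1', 0)], ['f', 'is_fav'])

    # The sum of is_fav counts the 1s; subtracting it from the row count gives the 0s.
    counts = df.groupBy('f').agg(
        F.sum('is_fav').alias('ones'),
        (F.count('is_fav') - F.sum('is_fav')).alias('zeros'))
    counts.show()
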
8
votes
2 answers

pyspark : Convert DataFrame to RDD[string]

I'd like to convert a pyspark.sql.dataframe.DataFrame to pyspark.rdd.RDD[String]. I converted a DataFrame df to an RDD with data = df.rdd; type(data) ## pyspark.rdd.RDD. The new RDD data contains Row objects: first = data.first(); type(first) ##…
Toren
  • 6,648
  • 12
  • 41
  • 62
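
A minimal sketch of one way to get an RDD of plain strings instead of Row objects, using a hypothetical two-column DataFrame:

    df = sqlContext.createDataFrame([('u1', 1), ('u2', 0)], ['f', 'is_fav'])

    # df.rdd yields Row objects; mapping each Row to a string gives an RDD of strings.
    string_rdd = df.rdd.map(lambda row: ','.join(str(c) for c in row))
    print(string_rdd.first())   # 'u1,1'
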
8
votes
5 answers

Spark Python error "FileNotFoundError: [WinError 2] The system cannot find the file specified"

I am new to Spark and Python. I have installed Python 3.5.1 and Spark-1.6.0-bin-hadoop2.4 on Windows. I am getting the error below when I execute sc = SparkContext("local", "Simple App") from the Python shell: >>> from pyspark import SparkConf,…
sam
  • 101
  • 1
  • 1
  • 6
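
WinError 2 at SparkContext creation often means the Python shell cannot find the Spark launch scripts. A sketch of one common workaround, assuming the environment variables are the culprit; the paths below are placeholders:

    import os

    # Hypothetical install locations; point these at your actual Spark and winutils folders.
    os.environ['SPARK_HOME'] = r'C:\spark-1.6.0-bin-hadoop2.4'
    os.environ['HADOOP_HOME'] = r'C:\hadoop'   # must contain bin\winutils.exe

    # Set the variables before pyspark tries to launch the JVM.
    from pyspark import SparkConf, SparkContext
    sc = SparkContext("local", "Simple App")
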
8
votes
2 answers

Connecting DynamoDB from Spark program to load all items from one table using Python?

I have written a program to write items into a DynamoDB table. Now I would like to read all items from that DynamoDB table using PySpark. Are there any libraries available to do this in Spark?
sms_1190
  • 1,267
  • 2
  • 12
  • 24
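
PySpark has no built-in DynamoDB source; one workable approach is to scan the table with boto3 and parallelize the items. A sketch, with the region and table name as placeholder assumptions:

    import boto3

    dynamodb = boto3.resource('dynamodb', region_name='us-east-1')
    table = dynamodb.Table('my_table')   # hypothetical table name

    # Scan the whole table, following pagination via LastEvaluatedKey.
    response = table.scan()
    items = response['Items']
    while 'LastEvaluatedKey' in response:
        response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'])
        items.extend(response['Items'])

    rdd = sc.parallelize(items)
    print(rdd.count())
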
8
votes
3 answers

How to write data in Elasticsearch from Pyspark?

I have integrated ELK with PySpark and saved an RDD as ELK data on the local file system: rdd.saveAsTextFile("/tmp/ELKdata"); logData = sc.textFile('/tmp/ELKdata/*'); errors = logData.filter(lambda line: "raw1-VirtualBox" in line); errors.count(). The value I got…
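
One commonly documented route is the elasticsearch-hadoop connector, whose EsOutputFormat PySpark can write through. A sketch assuming the elasticsearch-hadoop jar is on the classpath and a local node; the index name is a placeholder:

    # Each element must be a (key, dict) pair; the key is ignored by Elasticsearch.
    docs = sc.parallelize([('k1', {'host': 'raw1-VirtualBox', 'errors': 3})])

    docs.saveAsNewAPIHadoopFile(
        path='-',
        outputFormatClass='org.elasticsearch.hadoop.mr.EsOutputFormat',
        keyClass='org.apache.hadoop.io.NullWritable',
        valueClass='org.elasticsearch.hadoop.mr.LinkedMapWritable',
        conf={'es.nodes': 'localhost:9200', 'es.resource': 'logs/errors'})
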
8
votes
1 answer

How to replace infinity in PySpark DataFrame

It seems like there is no support for replacing infinity values. I tried the code below and it doesn't work. Or am I missing something? a=sqlContext.createDataFrame([(None, None), (1, np.inf), (None, 2)]) a.replace(np.inf, 10) Or do I have to…
Michael
  • 1,398
  • 5
  • 24
  • 40
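
A sketch of one workaround using when/otherwise, since DataFrame.replace is picky about infinity on older versions; the column names are assumptions for illustration:

    from pyspark.sql import functions as F

    a = sqlContext.createDataFrame(
        [(1.0, float('inf')), (2.0, 2.0)], ['c0', 'c1'])

    # Replace +inf in c1 with 10.0; all other values pass through unchanged.
    fixed = a.withColumn(
        'c1', F.when(F.col('c1') == float('inf'), 10.0).otherwise(F.col('c1')))
    fixed.show()
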
8
votes
1 answer

Numpy and static linking

I am running Spark programs on a large cluster (for which I do not have administrative privileges). numpy is not installed on the worker nodes. Hence, I bundled numpy with my program, but I get the following error: Traceback (most recent call…
abhinavkulkarni
  • 2,284
  • 4
  • 36
  • 54
8
votes
1 answer

Spark: More Efficient Aggregation to join strings from different rows

I'm currently working with DNA sequence data and I have run into a bit of a performance roadblock. I have two lookup dictionaries/hashes (as RDDs) with DNA "words" (short sequences) as keys and a list of index positions as the value. One is for a…
Chris Chambers
  • 1,367
  • 21
  • 39
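
A minimal sketch of merging the per-word position lists with a single reduceByKey instead of collecting to the driver; the data and merge strategy are assumptions for illustration:

    # Two lookup RDDs keyed by DNA "word"; values are lists of index positions.
    lookup_a = sc.parallelize([('ACGT', [1, 7]), ('TTAG', [3])])
    lookup_b = sc.parallelize([('ACGT', [2]), ('GGCC', [9])])

    # Union plus reduceByKey merges the lists per word in one shuffle.
    merged = lookup_a.union(lookup_b).reduceByKey(lambda x, y: x + y)
    print(merged.collect())
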
8
votes
5 answers

How to prevent logging of pyspark 'answer received' and 'command to send' messages

I am using python logging with pyspark and pyspark DEBUG level messages are flooding my log file with the example shown. How do I prevent this from happening? A simple solution is to set log level to INFO, but I need to log my own python DEBUG level…
Michael
  • 1,398
  • 5
  • 24
  • 40
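
The flood usually comes from the py4j logger rather than from Spark itself, so one fix is to raise only that logger's level while keeping your own at DEBUG. A minimal sketch:

    import logging

    logging.basicConfig(level=logging.DEBUG)

    # Silence py4j's 'Command to send' / 'Answer received' chatter only.
    logging.getLogger('py4j').setLevel(logging.ERROR)

    logging.getLogger(__name__).debug('my own DEBUG messages still appear')
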
8
votes
2 answers

How to load jar dependencies in IPython Notebook

This page inspired me to try out spark-csv for reading a .csv file in PySpark. I found a couple of posts, such as this one, describing how to use spark-csv, but I am not able to initialize the IPython instance by including either the .jar file or…
KarthikS
  • 883
  • 1
  • 11
  • 17
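
One approach from that era is to pass the package through PYSPARK_SUBMIT_ARGS before the notebook creates its SparkContext; the spark-csv coordinates and file name below are assumptions:

    import os

    # Must be set before the JVM is launched, i.e. before SparkContext exists.
    os.environ['PYSPARK_SUBMIT_ARGS'] = (
        '--packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell')

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext('local[*]', 'csv-demo')
    sqlContext = SQLContext(sc)

    df = (sqlContext.read.format('com.databricks.spark.csv')
          .option('header', 'true').load('some_file.csv'))
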
8
votes
1 answer

collect RDD with buffer in pyspark

I would like a way to return rows from my RDD one at a time (or in small batches) so that I can collect the rows locally as I need them. My RDD is large enough that it cannot fit into memory on the name node, so running collect() would cause an…
mgoldwasser
  • 14,558
  • 15
  • 79
  • 103
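
RDD.toLocalIterator() streams one partition at a time to the driver instead of materializing everything at once, which is often what this asks for. A minimal sketch:

    rdd = sc.parallelize(range(1000000), numSlices=100)

    # Only one partition at a time has to fit in driver memory.
    for i, row in enumerate(rdd.toLocalIterator()):
        print(row)
        if i >= 9:   # illustrative: stop after the first few rows
            break
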
8
votes
2 answers

Geoip2's python library doesn't work in pySpark's map function

I'm using geoip2's python library and pySpark to get the geographical address of some IPs. My code is like: geoDBpath = 'somePath/geoDB/GeoLite2-City.mmdb' geoPath = os.path.join(geoDBpath) sc.addFile(geoPath) reader =…
Dong
  • 125
  • 1
  • 7
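
The usual problem is that the geoip2 Reader cannot be pickled to the workers; a common workaround is to open it inside mapPartitions via SparkFiles. A sketch, assuming the .mmdb file is shipped with sc.addFile and geoip2 is installed on every worker:

    import geoip2.database
    from pyspark import SparkFiles

    sc.addFile('somePath/geoDB/GeoLite2-City.mmdb')

    def lookup_partition(ips):
        # Open the reader on the worker, once per partition, not on the driver.
        reader = geoip2.database.Reader(SparkFiles.get('GeoLite2-City.mmdb'))
        for ip in ips:
            yield (ip, reader.city(ip).city.name)
        reader.close()

    ips = sc.parallelize(['128.101.101.101', '8.8.8.8'])
    print(ips.mapPartitions(lookup_partition).collect())
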
8
votes
3 answers

Write and run pyspark in IntelliJ IDEA

I am trying to work with PySpark in IntelliJ but I cannot figure out how to correctly install it / set up the project. I can work with Python in IntelliJ and I can use the pyspark shell, but I cannot tell IntelliJ how to find the Spark files (import…
tandy
  • 93
  • 1
  • 1
  • 6
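
One way to let a plain IntelliJ Python run configuration find Spark is findspark, which adds SPARK_HOME's python and py4j directories to sys.path at runtime; the path below is a placeholder:

    import findspark
    findspark.init('/opt/spark-1.6.0-bin-hadoop2.6')   # or set SPARK_HOME and call findspark.init()

    from pyspark import SparkContext

    sc = SparkContext('local[*]', 'intellij-demo')
    print(sc.parallelize([1, 2, 3]).sum())
    sc.stop()
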
8
votes
2 answers

What is the equivalent to scala.util.Try in pyspark?

I've got a lousy HTTPD access_log and just want to skip the "lousy" lines. In Scala this is straightforward: import scala.util.Try val log = sc.textFile("access_log") log.map(_.split(' ')).map(a =>…
Romeo Kienzler
  • 3,373
  • 3
  • 36
  • 58
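
Python has no Try type, but the same effect falls out of a small try/except helper used with flatMap, so bad lines simply vanish. A minimal sketch; the parsing logic is an assumption for illustration:

    log = sc.textFile('access_log')

    def parse(line):
        # One-element list on success, empty list on failure,
        # roughly what Try(...).toOption.toList gives you in Scala.
        try:
            fields = line.split(' ')
            return [(fields[0], int(fields[-1]))]
        except (IndexError, ValueError):
            return []

    parsed = log.flatMap(parse)
    print(parsed.take(5))
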
8
votes
2 answers

How can we JOIN two Spark SQL dataframes using a SQL-esque "LIKE" criterion?

We are using the PySpark libraries interfacing with Spark 1.3.1. We have two dataframes, documents_df := {document_id, document_text} and keywords_df := {keyword}. We would like to JOIN the two dataframes and return a resulting dataframe with…
Will Hardman
  • 193
  • 1
  • 2
  • 8
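
One option is a non-equi join whose condition uses Column.contains (or a LIKE expression), yielding a row for every (document, keyword) match; note it behaves like a cartesian product with a filter, so it can be expensive. A sketch with hypothetical data:

    from pyspark.sql import functions as F

    documents_df = sqlContext.createDataFrame(
        [(1, 'spark joins are fun'), (2, 'python rules')],
        ['document_id', 'document_text'])
    keywords_df = sqlContext.createDataFrame([('spark',), ('python',)], ['keyword'])

    # Join on a substring-match condition rather than on equality.
    matches = documents_df.join(
        keywords_df, F.col('document_text').contains(F.col('keyword')))
    matches.show()
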