Questions tagged [pyspark]

The Spark Python API (PySpark) exposes the Apache Spark programming model to Python.

39058 questions
8
votes
1 answer

PySpark count values by condition

I have a DataFrame; a snippet here: [['u1', 1], ['u2', 0]]. Basically there is a string field named f and either a 1 or a 0 as the second element (is_fav). What I need to do is group on the first field and count the occurrences of 1s and 0s. I was hoping…
mar tin
  • 9,266
  • 23
  • 72
  • 97
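
A minimal sketch of one way to do this count, assuming columns named f and is_fav as in the excerpt (the aggregation shown is an illustration, not necessarily the accepted answer):

    from pyspark.sql import functions as F

    df = sqlContext.createDataFrame([('u1', 1), ('u2', 0), ('u1', 0)], ['f', 'is_fav'])

    # The sum of is_fav counts the 1s; subtracting it from the row count gives the 0s.
    counts = df.groupBy('f').agg(
        F.sum('is_fav').alias('ones'),
        (F.count('is_fav') - F.sum('is_fav')).alias('zeros'))
    counts.show()
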
8
votes
2 answers

pyspark : Convert DataFrame to RDD[string]

I'd like to convert a pyspark.sql.dataframe.DataFrame to pyspark.rdd.RDD[String]. I converted a DataFrame df to an RDD with data = df.rdd; type(data) ## pyspark.rdd.RDD. The new RDD data contains Row objects: first = data.first(); type(first) ##…
Toren
  • 6,648
  • 12
  • 41
  • 62
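
A minimal sketch of one way to get an RDD of plain strings instead of Row objects, using a hypothetical two-column DataFrame:

    df = sqlContext.createDataFrame([('u1', 1), ('u2', 0)], ['f', 'is_fav'])

    # df.rdd yields Row objects; mapping each Row to a string gives an RDD of strings.
    string_rdd = df.rdd.map(lambda row: ','.join(str(c) for c in row))
    print(string_rdd.first())   # 'u1,1'
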
8
votes
5 answers

Spark Python error "FileNotFoundError: [WinError 2] The system cannot find the file specified"

I am new to Spark and Python. I have installed Python 3.5.1 and Spark-1.6.0-bin-hadoop2.4 on Windows. I am getting the error below when I execute sc = SparkContext("local", "Simple App") from the Python shell: >>> from pyspark import SparkConf,…
sam
  • 101
  • 1
  • 1
  • 6
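
WinError 2 at SparkContext creation often means the Python shell cannot find the Spark launch scripts. A sketch of one common workaround, assuming the environment variables are the culprit; the paths below are placeholders:

    import os

    # Hypothetical install locations; point these at your actual Spark and winutils folders.
    os.environ['SPARK_HOME'] = r'C:\spark-1.6.0-bin-hadoop2.4'
    os.environ['HADOOP_HOME'] = r'C:\hadoop'   # must contain bin\winutils.exe

    # Set the variables before pyspark tries to launch the JVM.
    from pyspark import SparkConf, SparkContext
    sc = SparkContext("local", "Simple App")
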
8
votes
2 answers

Connecting DynamoDB from Spark program to load all items from one table using Python?

I have written a program to write items into a DynamoDB table. Now I would like to read all items from that DynamoDB table using PySpark. Are there any libraries available to do this in Spark?
sms_1190
  • 1,267
  • 2
  • 12
  • 24
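
PySpark has no built-in DynamoDB source; one workable approach is to scan the table with boto3 and parallelize the items. A sketch, with the region and table name as placeholder assumptions:

    import boto3

    dynamodb = boto3.resource('dynamodb', region_name='us-east-1')
    table = dynamodb.Table('my_table')   # hypothetical table name

    # Scan the whole table, following pagination via LastEvaluatedKey.
    response = table.scan()
    items = response['Items']
    while 'LastEvaluatedKey' in response:
        response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'])
        items.extend(response['Items'])

    rdd = sc.parallelize(items)
    print(rdd.count())
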
8
votes
3 answers

How to write data in Elasticsearch from Pyspark?

I have integrated ELK with PySpark and saved an RDD as ELK data on the local file system: rdd.saveAsTextFile("/tmp/ELKdata"); logData = sc.textFile('/tmp/ELKdata/*'); errors = logData.filter(lambda line: "raw1-VirtualBox" in line); errors.count(). The value I got…
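
One commonly documented route is the elasticsearch-hadoop connector, whose EsOutputFormat PySpark can write through. A sketch assuming the elasticsearch-hadoop jar is on the classpath and a local node; the index name is a placeholder:

    # Each element must be a (key, dict) pair; the key is ignored by Elasticsearch.
    docs = sc.parallelize([('k1', {'host': 'raw1-VirtualBox', 'errors': 3})])

    docs.saveAsNewAPIHadoopFile(
        path='-',
        outputFormatClass='org.elasticsearch.hadoop.mr.EsOutputFormat',
        keyClass='org.apache.hadoop.io.NullWritable',
        valueClass='org.elasticsearch.hadoop.mr.LinkedMapWritable',
        conf={'es.nodes': 'localhost:9200', 'es.resource': 'logs/errors'})
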
8
votes
1 answer

How to replace infinity in PySpark DataFrame

It seems like there is no support for replacing infinity values. I tried the code below and it doesn't work. Or am I missing something? a=sqlContext.createDataFrame([(None, None), (1, np.inf), (None, 2)]) a.replace(np.inf, 10) Or do I have to…
Michael
  • 1,398
  • 5
  • 24
  • 40
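
A sketch of one workaround using when/otherwise, since DataFrame.replace is picky about infinity on older versions; the column names are assumptions for illustration:

    from pyspark.sql import functions as F

    a = sqlContext.createDataFrame(
        [(1.0, float('inf')), (2.0, 2.0)], ['c0', 'c1'])

    # Replace +inf in c1 with 10.0; all other values pass through unchanged.
    fixed = a.withColumn(
        'c1', F.when(F.col('c1') == float('inf'), 10.0).otherwise(F.col('c1')))
    fixed.show()
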
8
votes
1 answer

Numpy and static linking

I am running Spark programs on a large cluster (for which I do not have administrative privileges). numpy is not installed on the worker nodes. Hence, I bundled numpy with my program, but I get the following error: Traceback (most recent call…
abhinavkulkarni
  • 2,284
  • 4
  • 36
  • 54
8
votes
1 answer

Spark: More Efficient Aggregation to join strings from different rows

I'm currently working with DNA sequence data and I have run into a bit of a performance roadblock. I have two lookup dictionaries/hashes (as RDDs) with DNA "words" (short sequences) as keys and a list of index positions as the value. One is for a…
Chris Chambers
  • 1,367
  • 21
  • 39
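
A minimal sketch of merging the per-word position lists with a single reduceByKey instead of collecting to the driver; the data and merge strategy are assumptions for illustration:

    # Two lookup RDDs keyed by DNA "word"; values are lists of index positions.
    lookup_a = sc.parallelize([('ACGT', [1, 7]), ('TTAG', [3])])
    lookup_b = sc.parallelize([('ACGT', [2]), ('GGCC', [9])])

    # Union plus reduceByKey merges the lists per word in one shuffle.
    merged = lookup_a.union(lookup_b).reduceByKey(lambda x, y: x + y)
    print(merged.collect())
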
8
votes
5 answers

How to prevent logging of pyspark 'answer received' and 'command to send' messages

I am using python logging with pyspark and pyspark DEBUG level messages are flooding my log file with the example shown. How do I prevent this from happening? A simple solution is to set log level to INFO, but I need to log my own python DEBUG level…
Michael
  • 1,398
  • 5
  • 24
  • 40
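
The flood usually comes from the py4j logger rather than from Spark itself, so one fix is to raise only that logger's level while keeping your own at DEBUG. A minimal sketch:

    import logging

    logging.basicConfig(level=logging.DEBUG)

    # Silence py4j's 'Command to send' / 'Answer received' chatter only.
    logging.getLogger('py4j').setLevel(logging.ERROR)

    logging.getLogger(__name__).debug('my own DEBUG messages still appear')
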
8
votes
2 answers

How to load jar dependencies in IPython Notebook

This page inspired me to try out spark-csv for reading a .csv file in PySpark. I found a couple of posts, such as this one, describing how to use spark-csv, but I am not able to initialize the IPython instance by including either the .jar file or…
KarthikS
  • 883
  • 1
  • 11
  • 17
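
One approach from that era is to pass the package through PYSPARK_SUBMIT_ARGS before the notebook creates its SparkContext; the spark-csv coordinates and file name below are assumptions:

    import os

    # Must be set before the JVM is launched, i.e. before SparkContext exists.
    os.environ['PYSPARK_SUBMIT_ARGS'] = (
        '--packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell')

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext('local[*]', 'csv-demo')
    sqlContext = SQLContext(sc)

    df = (sqlContext.read.format('com.databricks.spark.csv')
          .option('header', 'true').load('some_file.csv'))
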
8
votes
1 answer

collect RDD with buffer in pyspark

I would like a way to return rows from my RDD one at a time (or in small batches) so that I can collect the rows locally as I need them. My RDD is large enough that it cannot fit into memory on the name node, so running collect() would cause an…
mgoldwasser
  • 14,558
  • 15
  • 79
  • 103
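
RDD.toLocalIterator() streams one partition at a time to the driver instead of materializing everything at once, which is often what this asks for. A minimal sketch:

    rdd = sc.parallelize(range(1000000), numSlices=100)

    # Only one partition at a time has to fit in driver memory.
    for i, row in enumerate(rdd.toLocalIterator()):
        print(row)
        if i >= 9:   # illustrative: stop after the first few rows
            break
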
8
votes
2 answers

Geoip2's python library doesn't work in pySpark's map function

I'm using geoip2's python library and pySpark to get the geographical address of some IPs. My code is like: geoDBpath = 'somePath/geoDB/GeoLite2-City.mmdb' geoPath = os.path.join(geoDBpath) sc.addFile(geoPath) reader =…
Dong
  • 125
  • 1
  • 7
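
The usual problem is that the geoip2 Reader cannot be pickled to the workers; a common workaround is to open it inside mapPartitions via SparkFiles. A sketch, assuming the .mmdb file is shipped with sc.addFile and geoip2 is installed on every worker:

    import geoip2.database
    from pyspark import SparkFiles

    sc.addFile('somePath/geoDB/GeoLite2-City.mmdb')

    def lookup_partition(ips):
        # Open the reader on the worker, once per partition, not on the driver.
        reader = geoip2.database.Reader(SparkFiles.get('GeoLite2-City.mmdb'))
        for ip in ips:
            yield (ip, reader.city(ip).city.name)
        reader.close()

    ips = sc.parallelize(['128.101.101.101', '8.8.8.8'])
    print(ips.mapPartitions(lookup_partition).collect())
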
8
votes
3 answers

Write and run pyspark in IntelliJ IDEA

I am trying to work with PySpark in IntelliJ but I cannot figure out how to correctly install it / set up the project. I can work with Python in IntelliJ and I can use the pyspark shell, but I cannot tell IntelliJ how to find the Spark files (import…
tandy
  • 93
  • 1
  • 1
  • 6
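
One way to let a plain IntelliJ Python run configuration find Spark is findspark, which adds SPARK_HOME's python and py4j directories to sys.path at runtime; the path below is a placeholder:

    import findspark
    findspark.init('/opt/spark-1.6.0-bin-hadoop2.6')   # or set SPARK_HOME and call findspark.init()

    from pyspark import SparkContext

    sc = SparkContext('local[*]', 'intellij-demo')
    print(sc.parallelize([1, 2, 3]).sum())
    sc.stop()
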
8
votes
2 answers

What is the equivalent to scala.util.Try in pyspark?

I've got a lousy HTTPD access_log and just want to skip the "lousy" lines. In Scala this is straightforward: import scala.util.Try val log = sc.textFile("access_log") log.map(_.split(' ')).map(a =>…
Romeo Kienzler
  • 3,373
  • 3
  • 36
  • 58
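
Python has no Try type, but the same effect falls out of a small try/except helper used with flatMap, so bad lines simply vanish. A minimal sketch; the parsing logic is an assumption for illustration:

    log = sc.textFile('access_log')

    def parse(line):
        # One-element list on success, empty list on failure,
        # roughly what Try(...).toOption.toList gives you in Scala.
        try:
            fields = line.split(' ')
            return [(fields[0], int(fields[-1]))]
        except (IndexError, ValueError):
            return []

    parsed = log.flatMap(parse)
    print(parsed.take(5))
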
8
votes
2 answers

How can we JOIN two Spark SQL dataframes using a SQL-esque "LIKE" criterion?

We are using the PySpark libraries interfacing with Spark 1.3.1. We have two dataframes, documents_df := {document_id, document_text} and keywords_df := {keyword}. We would like to JOIN the two dataframes and return a resulting dataframe with…
Will Hardman
  • 193
  • 1
  • 2
  • 8
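
One option is a non-equi join whose condition uses Column.contains (or a LIKE expression), yielding a row for every (document, keyword) match; note it behaves like a cartesian product with a filter, so it can be expensive. A sketch with hypothetical data:

    from pyspark.sql import functions as F

    documents_df = sqlContext.createDataFrame(
        [(1, 'spark joins are fun'), (2, 'python rules')],
        ['document_id', 'document_text'])
    keywords_df = sqlContext.createDataFrame([('spark',), ('python',)], ['keyword'])

    # Join on a substring-match condition rather than on equality.
    matches = documents_df.join(
        keywords_df, F.col('document_text').contains(F.col('keyword')))
    matches.show()
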