Questions tagged [pyspark]

The Spark Python API (PySpark) exposes the Apache Spark programming model to Python.

39058 questions
8 votes, 1 answer

How to print an RDD in Python in Spark

I have two files on HDFS and I just want to join these two files on a column, say employee id. I am simply trying to print the files to make sure we are reading them correctly from HDFS. lines = sc.textFile("hdfs://ip:8020/emp.txt") print…
yguw
  • 856
  • 6
  • 12
  • 32
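
A minimal sketch for the question above, assuming sc is an existing SparkContext and that emp.txt is comma-separated with the employee id in the first field (both assumptions, not stated in the question):

    lines = sc.textFile("hdfs://ip:8020/emp.txt")

    # take(n) brings only a few records to the driver, which is safer than collect()
    for line in lines.take(10):
        print(line)

    # key each record by its first comma-separated field (assumed employee id),
    # so the two files can later be combined with join()
    emp_by_id = lines.map(lambda line: (line.split(",")[0], line))
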
8 votes, 3 answers

pyspark: how to check if a file exists in hdfs

I want to check whether several files exist in HDFS before loading them with SparkContext. I use pyspark. I tried os.system("hadoop fs -test -e %s" %path), but as I have a lot of paths to check, the job crashed. I also tried sc.wholeTextFiles(parent_path) and…
A7med
  • 451
  • 2
  • 5
  • 6
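
One possible approach, sketched below, is to call the Hadoop FileSystem API through Spark's JVM gateway instead of shelling out once per path; sc is an existing SparkContext and the paths list is a placeholder:

    hadoop_conf = sc._jsc.hadoopConfiguration()
    Path = sc._jvm.org.apache.hadoop.fs.Path
    fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)

    paths = ["hdfs://ip:8020/data/a.txt", "hdfs://ip:8020/data/b.txt"]

    # keep only the paths that actually exist before handing them to textFile()
    existing = [p for p in paths if fs.exists(Path(p))]
    rdd = sc.textFile(",".join(existing))
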
8 votes, 2 answers

Pyspark: shuffle RDD

I'm trying to randomise the order of elements in an RDD. My current approach is to zip the elements with an RDD of shuffled integers, then later join by those integers. However, pyspark falls over with only 100000000 integers. I'm using the code…
Marcin
  • 48,559
  • 18
  • 128
  • 201
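
A sketch of an alternative that avoids zipping and joining against a huge RDD of integers: attach a random sort key to each element and sort by it (rdd stands for the RDD to be shuffled):

    import random

    shuffled = (rdd.map(lambda x: (random.random(), x))   # random sort key per element
                   .sortByKey()                           # distributed sort on that key
                   .values())                             # drop the keys again
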
8 votes, 1 answer

A list as a key for PySpark's reduceByKey

I am attempting to call the reduceByKey function of pyspark on data of the format (([a,b,c], 1), ([a,b,c], 1), ([a,d,b,e], 1), ... It seems pyspark will not accept an array as the key in a normal key/value reduction by simply applying…
Peter Doro
  • 255
  • 4
  • 10
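
Python lists are unhashable, so they cannot be used as shuffle keys; a sketch of the usual workaround is to convert each key to a tuple first (rdd stands for the (list, count) pairs from the question):

    counts = (rdd.map(lambda kv: (tuple(kv[0]), kv[1]))   # tuples are hashable
                 .reduceByKey(lambda a, b: a + b))
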
8 votes, 1 answer

how to print out snippets of an RDD in the spark-shell / pyspark?

When working in the spark-shell, I frequently want to inspect RDDs (similar to using head in unix). For example: scala> val readmeFile = sc.textFile("input/tmp/README.md") scala> // how to inspect the readmeFile? and ... scala> val…
Chris Snow
  • 23,813
  • 35
  • 144
  • 309
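
A pyspark sketch of head-like inspection (the README path is taken from the question):

    readme_file = sc.textFile("input/tmp/README.md")

    print(readme_file.first())          # just the first line
    for line in readme_file.take(5):    # first five lines, like `head -n 5`
        print(line)
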
8 votes, 1 answer

How to join two RDDs in spark with python?

Suppose rdd1 = ( (a, 1), (a, 2), (b, 1) ), rdd2 = ( (a, ?), (a, *), (c, .) ). Want to generate ( (a, (1, ?)), (a, (1, *)), (a, (2, ?)), (a, (2, *)) ). Any easy methods? I think it is different from the cross join but can't find a good…
Peng Sun
  • 130
  • 1
  • 1
  • 8
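
A minimal sketch using join(), with hypothetical string values standing in for the ?, * and . placeholders:

    rdd1 = sc.parallelize([("a", 1), ("a", 2), ("b", 1)])
    rdd2 = sc.parallelize([("a", "?"), ("a", "*"), ("c", ".")])

    # join() is an inner join on the key and yields (key, (left_value, right_value))
    joined = rdd1.join(rdd2)
    # collect() returns the four ('a', ...) pairs; 'b' and 'c' drop out of the inner join
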
8 votes, 2 answers

Joining two spark dataframes on time (TimestampType) in python

I have two dataframes and I would like to join them based on one column, with a caveat that this column is a timestamp, and that timestamp has to be within a certain offset (5 seconds) in order to join records. More specifically, a record in…
Oleksiy
  • 6,337
  • 5
  • 41
  • 58
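
A sketch of a range join on the timestamp column; df1/df2 and the column names ts and id are placeholders, not taken from the question:

    import pyspark.sql.functions as F

    # cast TimestampType to long (seconds since epoch) and keep pairs within 5 seconds
    joined = df1.alias("a").join(
        df2.alias("b"),
        F.abs(F.col("a.ts").cast("long") - F.col("b.ts").cast("long")) <= 5,
    )
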
8 votes, 2 answers

pySpark Create DataFrame from RDD with Key/Value

If I have an RDD of Key/Value pairs (the key being the column index), is it possible to load it into a dataframe? For example: (0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18) And have the dataframe look like: 1,2,18 1,10,18 2,20,18
theMadKing
  • 2,064
  • 7
  • 32
  • 59
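
A sketch of one way to pivot such an RDD into rows, assuming every column index carries the same number of values and that they arrive in row order (not guaranteed in general); spark stands for an existing SparkSession:

    rdd = sc.parallelize([(0, 1), (0, 1), (0, 2),
                          (1, 2), (1, 10), (1, 20),
                          (3, 18), (3, 18), (3, 18)])

    # one list of values per column index, ordered by the index
    columns = (rdd.groupByKey()
                  .sortByKey()
                  .map(lambda kv: list(kv[1]))
                  .collect())              # [[1, 1, 2], [2, 10, 20], [18, 18, 18]]

    rows = list(zip(*columns))             # [(1, 2, 18), (1, 10, 18), (2, 20, 18)]
    df = spark.createDataFrame(rows, ["c0", "c1", "c3"])
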
8 votes, 1 answer

Create Spark DataFrame from nested dictionary

I have a list of nested dictionaries, e.g. ds = [{'a': {'b': {'c': 1}}}] and want to create a spark DataFrame from it while inferring the schema of the nested dictionaries. Using sqlContext.createDataFrame(ds).printSchema() gives me the following schema: root …
Marigold
  • 1,619
  • 1
  • 15
  • 17
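
One sketch that preserves the nesting is to serialise each dictionary to JSON and let the JSON reader infer a struct schema (spark and sc stand for an existing SparkSession and SparkContext):

    import json

    ds = [{'a': {'b': {'c': 1}}}]

    df = spark.read.json(sc.parallelize([json.dumps(d) for d in ds]))
    df.printSchema()   # a is inferred as a struct containing b, which in turn contains c
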
8 votes, 2 answers

Not able to connect to postgres using jdbc in pyspark shell

I am using a standalone cluster on my local Windows machine and trying to load data from one of our servers using the following code - from pyspark.sql import SQLContext sqlContext = SQLContext(sc) df = sqlContext.load(source="jdbc",…
Soni Shashank
  • 221
  • 1
  • 3
  • 9
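
A sketch of the equivalent read with the DataFrameReader API; host, database, table and credentials are placeholders, and the PostgreSQL JDBC driver jar must be on the classpath (for example by starting the shell with pyspark --jars /path/to/postgresql-driver.jar):

    df = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://dbhost:5432/dbname")
          .option("dbtable", "schema.tablename")
          .option("user", "username")
          .option("password", "password")
          .option("driver", "org.postgresql.Driver")
          .load())
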
8 votes, 1 answer

Functions from Python packages for udf() of Spark dataframe

For Spark dataframe via pyspark, we can use pyspark.sql.functions.udf to create a user defined function (UDF). I wonder if I can use any function from Python packages in udf(), e.g., np.random.normal from numpy?
Jie Chen
  • 151
  • 1
  • 2
  • 4
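
A sketch showing that an imported package function can be called inside a UDF, as long as the result is converted to a plain Python type matching the declared return type (the column names are placeholders):

    import numpy as np
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    @F.udf(returnType=DoubleType())
    def add_noise(x):
        # float() turns the numpy scalar into a value Spark can serialise
        return float(np.random.normal(loc=x, scale=1.0))

    df = spark.range(5).withColumn("noisy", add_noise(F.col("id").cast("double")))
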
8 votes, 1 answer

Save Apache Spark mllib model in python

I am trying to save a fitted model to a file in Spark. I have a Spark cluster which trains a RandomForest model. I would like to save and reuse the fitted model on another machine. I read some posts on the web which recommend doing java…
poiuytrez
  • 21,330
  • 35
  • 113
  • 172
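
A sketch using the save/load pair on the RDD-based mllib model (available from Spark 1.3 onwards); model stands for the fitted RandomForestModel and the HDFS path is a placeholder:

    from pyspark.mllib.tree import RandomForestModel

    model.save(sc, "hdfs://ip:8020/models/rf_model")

    # later, possibly on a different machine running the same Spark version
    loaded = RandomForestModel.load(sc, "hdfs://ip:8020/models/rf_model")
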
7 votes, 2 answers

Does spark read the same file twice, if two stages are using the same DataFrame?

The following code reads the same csv twice even though only one action is called. End-to-end runnable example: import pandas as pd import numpy as np df1= pd.DataFrame(np.arange(1_000).reshape(-1,1)) df1.index =…
figs_and_nuts
  • 4,870
  • 2
  • 31
  • 56
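
Each action re-evaluates the lineage unless the shared DataFrame is persisted; a sketch of the usual fix (the csv path is a placeholder):

    df = spark.read.csv("data.csv", header=True).cache()

    # both actions below reuse the cached data instead of scanning the csv twice
    print(df.count())
    print(df.distinct().count())

    df.unpersist()
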
7 votes, 1 answer

pyspark.sql.utils.AnalysisException: Parquet data source does not support void data type

I am trying to add a column to my dataframe df1 in PySpark. The code I tried: import pyspark.sql.functions as F df1 = df1.withColumn("empty_column", F.lit(None)) But I get this error: pyspark.sql.utils.AnalysisException: Parquet data source does…
ar_mm18
  • 415
  • 2
  • 8
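
F.lit(None) produces a column of void (null) type, which Parquet cannot store; a sketch of the usual fix is to give the literal an explicit type:

    import pyspark.sql.functions as F
    from pyspark.sql.types import StringType

    df1 = df1.withColumn("empty_column", F.lit(None).cast(StringType()))
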
7 votes, 4 answers

Join dataframes and rename resulting columns with same names

Shortened example: vals1 = [(1, "a"), (2, "b"), ] columns1 = ["id","name"] df1 = spark.createDataFrame(data=vals1, schema=columns1) vals2 = [(1, "k"), ] columns2 = ["id","name"] df2 = spark.createDataFrame(data=vals2,…
user626528
  • 13,999
  • 30
  • 78
  • 146
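
A sketch that keeps both name columns by aliasing the frames and renaming in the select; it reuses df1 and df2 from the question:

    import pyspark.sql.functions as F

    joined = (df1.alias("l")
                 .join(df2.alias("r"), on="id", how="inner")
                 .select(
                     F.col("id"),
                     F.col("l.name").alias("name_1"),
                     F.col("r.name").alias("name_2"),
                 ))
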