I'd like to perform some basic stemming on a Spark DataFrame column by replacing substrings. What's the quickest way to do this?
In my current use case, I have a list of addresses that I want to normalize. For example, this DataFrame:
id …
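The excerpt is cut off, but for simple substring-based normalization a chain of regexp_replace calls is a common approach. The sketch below is only illustrative: the address column name, the sample rows, and the replacement pairs are all assumptions.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical data; column names and replacement rules are assumptions.
df = spark.createDataFrame(
    [(1, "123 Main Street"), (2, "456 Oak Avenue Apt 7")],
    ["id", "address"],
)

# Chain regexp_replace calls to rewrite each substring in turn.
df_norm = (
    df.withColumn("address", F.regexp_replace(F.col("address"), r"\bStreet\b", "St"))
      .withColumn("address", F.regexp_replace(F.col("address"), r"\bAvenue\b", "Ave"))
)
df_norm.show(truncate=False)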
Context: I have a DataFrame with 2 columns, word and vector, where the column type of "vector" is VectorUDT.
An Example:
word | vector
assert | [435,323,324,212...]
And I want to get this:
word | v1 | v2 | v3 | v4 | v5 | v6 ......
assert |…
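The rest of the question is cut off, but a common way to turn a VectorUDT column into one column per element is vector_to_array (pyspark.ml.functions, Spark 3.0+) followed by indexing. The sketch below assumes a dense vector and that every row has the same vector length.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.ml.functions import vector_to_array  # Spark 3.0+
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# Hypothetical data mirroring the word/vector layout above.
df = spark.createDataFrame(
    [("assert", Vectors.dense([435.0, 323.0, 324.0, 212.0]))],
    ["word", "vector"],
)

# Convert the VectorUDT column to an array, then index it into v1, v2, ...
arr = df.withColumn("arr", vector_to_array("vector"))
n = len(arr.first()["arr"])  # assumes all vectors have the same length
out = arr.select(
    "word", *[F.col("arr")[i].alias(f"v{i + 1}") for i in range(n)]
)
out.show()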
I am loading some data into Spark with a wrapper function:
def load_data(filename):
    df = sqlContext.read.format("com.databricks.spark.csv")\
        .option("delimiter", "\t")\
        .option("header", "false")\
        .option("mode",…
There's a DataFrame in pyspark with data as below:
user_id object_id score
user_1 object_1 3
user_1 object_1 1
user_1 object_2 2
user_2 object_1 5
user_2 object_2 2
user_2 object_2 6
What I expect is returning 2 records in each group…
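The expected output is cut off, but "top n records per group" is usually handled with a window function. The sketch below assumes the groups are defined by user_id and that the two records wanted per group are the ones with the highest scores.

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("user_1", "object_1", 3), ("user_1", "object_1", 1),
     ("user_1", "object_2", 2), ("user_2", "object_1", 5),
     ("user_2", "object_2", 2), ("user_2", "object_2", 6)],
    ["user_id", "object_id", "score"],
)

# Rank rows within each user_id by score (highest first), then keep the top 2.
w = Window.partitionBy("user_id").orderBy(F.col("score").desc())
top2 = (
    df.withColumn("rn", F.row_number().over(w))
      .filter(F.col("rn") <= 2)
      .drop("rn")
)
top2.show()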
I would like to modify the cell values of a DataFrame column (Age) where it is currently blank, and I would only do it if another column (Survived) has the value 0 for the corresponding row. If it is 1 in the Survived…
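The condition is cut off, but a conditional update like this is typically written with when/otherwise. The sketch assumes "blank" means NULL, and the 0.0 fill value and sample rows are placeholders.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical rows: Age is None where it is "blank".
df = spark.createDataFrame(
    [(22.0, 1), (None, 0), (None, 1), (35.0, 0)],
    ["Age", "Survived"],
)

# Fill Age only where it is null AND Survived == 0; rows with Survived == 1
# keep their original (possibly null) Age. The fill value 0.0 is an assumption.
df = df.withColumn(
    "Age",
    F.when(F.col("Age").isNull() & (F.col("Survived") == 0), F.lit(0.0))
     .otherwise(F.col("Age")),
)
df.show()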
I am writing a user-defined function that will take all the columns except the first one in a DataFrame and compute their sum (or any other operation). The DataFrame can sometimes have 3 columns, 4 columns, or more; it will vary.
I know I can hard code…
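The excerpt is truncated, but rather than hard-coding column names (or writing a UDF), the usual approach is to build the expression from df.columns at runtime. A minimal sketch with hypothetical column names:

from functools import reduce
from operator import add

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical frame whose column count may vary.
df = spark.createDataFrame(
    [("a", 1, 2, 3), ("b", 4, 5, 6)],
    ["key", "c1", "c2", "c3"],
)

# Sum every column except the first, whatever their number or names.
to_sum = [F.col(c) for c in df.columns[1:]]
df = df.withColumn("total", reduce(add, to_sum))
df.show()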
I am working with Spark and PySpark. I am trying to achieve the result equivalent to the following pseudocode:
df = df.withColumn('new_column',
    IF fruit1 == fruit2 THEN 1 ELSE 0; IF fruit1 IS NULL OR fruit2 IS NULL THEN 3)
I am trying to do this…
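A direct translation of that pseudocode uses when/otherwise, with the NULL check evaluated first so that null rows get 3 rather than falling into the equality branch. A minimal sketch with hypothetical data:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical rows covering the three cases: equal, different, null.
df = spark.createDataFrame(
    [("apple", "apple"), ("apple", "pear"), (None, "pear")],
    ["fruit1", "fruit2"],
)

# Null check first, then equality; mirrors the pseudocode above.
df = df.withColumn(
    "new_column",
    F.when(F.col("fruit1").isNull() | F.col("fruit2").isNull(), 3)
     .when(F.col("fruit1") == F.col("fruit2"), 1)
     .otherwise(0),
)
df.show()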
Question: in pandas, when dropping duplicates you can specify which columns to keep. Is there an equivalent in Spark DataFrames?
Pandas:
df.sort_values('actual_datetime', ascending=False).drop_duplicates(subset=['scheduled_datetime',…
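The subset list is cut off. Note that Spark's dropDuplicates(subset=[...]) keeps an arbitrary row per key, so to reproduce the pandas sort-then-drop pattern (keep the latest actual_datetime) a window with row_number is the usual workaround. The partition column and sample data below are assumed from the truncated snippet.

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical data with the columns from the pandas snippet.
df = spark.createDataFrame(
    [("2020-01-01 10:00", "2020-01-01 10:05"),
     ("2020-01-01 10:00", "2020-01-01 10:20")],
    ["scheduled_datetime", "actual_datetime"],
)

# Rank rows per key by actual_datetime (latest first) and keep only the first,
# mimicking sort_values(...).drop_duplicates(subset=[...]).
w = Window.partitionBy("scheduled_datetime").orderBy(F.col("actual_datetime").desc())
df_dedup = (
    df.withColumn("rn", F.row_number().over(w))
      .filter(F.col("rn") == 1)
      .drop("rn")
)
df_dedup.show(truncate=False)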
I need to use the
(rdd.)partitionBy(npartitions, custom_partitioner)
method, which is not available on the DataFrame. All of the DataFrame methods refer only to DataFrame results. So how do I create an RDD from the DataFrame data?
Note: this is…
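The note is cut off, but DataFrame.rdd exposes the underlying RDD[Row]; since partitionBy works on key-value pairs, the rows have to be keyed first. The key choice and the partitioner below are illustrative assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "value"])

def custom_partitioner(key):
    # Hypothetical partitioning rule; any int-returning function of the key works.
    return key % 4

# df.rdd gives an RDD[Row]; map to (key, row) pairs before partitionBy.
pair_rdd = df.rdd.map(lambda row: (row["id"], row))
partitioned = pair_rdd.partitionBy(4, custom_partitioner)

# If a DataFrame is needed afterwards, rebuild it from the row values.
df2 = spark.createDataFrame(partitioned.values(), df.schema)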
I'm using the following code to aggregate students per year. The purpose is to know the total number of students for each year.
from pyspark.sql.functions import col
import pyspark.sql.functions as fn
gr = Df2.groupby(['Year'])
df_grouped =…
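The aggregation call is cut off; a self-contained version of the count-per-year aggregation, reusing the fn alias from the snippet and hypothetical data, might look like this.

from pyspark.sql import SparkSession
import pyspark.sql.functions as fn

spark = SparkSession.builder.getOrCreate()

# Hypothetical student records; only Year matters for the count.
Df2 = spark.createDataFrame(
    [("Alice", 2015), ("Bob", 2015), ("Carol", 2016)],
    ["student", "Year"],
)

# Count the number of students per year.
df_grouped = Df2.groupby("Year").agg(fn.count("*").alias("nb_students"))
df_grouped.show()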
We are reading data from a MongoDB collection. A collection column has two different value types (e.g. (bson.Int64, int) and (int, float)).
I am trying to get the datatype using PySpark.
My problem is that some columns have a different datatype.
Assume quantity and…
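The excerpt is truncated, but on the Spark side each column has a single declared type regardless of what the MongoDB documents contained, and it can be inspected with dtypes or printSchema. A small sketch with hypothetical columns:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical frame standing in for the MongoDB load.
df = spark.createDataFrame([(3, 9.5), (7, 2.0)], ["quantity", "price"])

# Inspect the declared column types.
print(df.dtypes)             # e.g. [('quantity', 'bigint'), ('price', 'double')]
type_of = dict(df.dtypes)
print(type_of["quantity"])   # look up a single column's type
df.printSchema()             # full schema, including nullability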
Is there an equivalent of the pandas melt function in Apache Spark, in PySpark or at least in Scala?
I was running a sample dataset till now in Python and now I want to use Spark for the entire dataset.
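Recent Spark releases (around 3.4+, if available) ship a built-in DataFrame unpivot/melt, but on older versions a helper built from explode over an array of structs is the usual equivalent. A sketch with hypothetical wide data:

from typing import Iterable

from pyspark.sql import DataFrame, SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

def melt(df: DataFrame, id_vars: Iterable[str], value_vars: Iterable[str],
         var_name: str = "variable", value_name: str = "value") -> DataFrame:
    # Build one (variable, value) struct per value column, then explode to long format.
    pairs = F.explode(F.array(*[
        F.struct(F.lit(c).alias(var_name), F.col(c).alias(value_name))
        for c in value_vars
    ])).alias("_pair")
    return (
        df.select(*id_vars, pairs)
          .select(*id_vars, F.col(f"_pair.{var_name}"), F.col(f"_pair.{value_name}"))
    )

# Hypothetical wide data to melt.
wide = spark.createDataFrame([("a", 1, 2), ("b", 3, 4)], ["key", "x", "y"])
melt(wide, ["key"], ["x", "y"]).show()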
I get the following error when I add --conf spark.driver.maxResultSize=2050 to my spark-submit command.
17/12/27 18:33:19 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from /XXX.XX.XXX.XX:36245 is closed
17/12/27…