I am reading a CSV file in PySpark as follows:
df_raw=spark.read.option("header","true").csv(csv_path)
However, the data file has quoted fields with embedded commas, which should not be treated as field delimiters. How can I handle this in PySpark?…
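A minimal sketch of one way to handle this: Spark's CSV reader can respect quoted fields through its quote and escape options (this assumes standard double-quote quoting; spark and csv_path are from the question).

df_raw = (spark.read
          .option("header", "true")
          .option("quote", '"')
          .option("escape", '"')
          .csv(csv_path))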
I'm using pyspark, loading a large csv file into a dataframe with spark-csv, and as a pre-processing step I need to apply a variety of operations to the data available in one of the columns (that contains a json string). That will return X values,…
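The question is truncated, but as a sketch of pre-processing a JSON-string column, a plain UDF with json.loads works on any Spark version (the column name json_col and key some_key are placeholders):

import json
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Extract one value from the JSON string in each row (names are placeholders).
extract_key = udf(lambda s: json.loads(s).get("some_key") if s else None, StringType())
df = df.withColumn("some_key", extract_key(df["json_col"]))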
I have a data frame in pyspark with more than 300 columns. Some of these columns contain null values.
For example:
Column_1  column_2
null      null
null      null
234       null
125       124
365       187
and so on
When I want to do a…
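The question is cut off here, but as a sketch, one common operation on a frame like this is counting the nulls in every column in a single pass (df is the frame above):

from pyspark.sql import functions as F

# F.count skips nulls, so counting a when(...) that is null for non-null
# cells yields the number of nulls in each column.
null_counts = df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns])
null_counts.show()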
I have a data frame in python/pyspark with columns id, time, city, zip, and so on.
I have now added a new column, name, to this data frame.
I need to rearrange the columns so that the name column comes right after id.
I have done like…
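A minimal sketch of one way to do this (the other column names come from the question):

# Rebuild the column list with name placed right after id, then select in that order.
cols = [c for c in df.columns if c not in ("id", "name")]
df = df.select(["id", "name"] + cols)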
I'm trying to perform multiple operations in one line of code in PySpark,
and I'm not sure whether that's possible in my case.
My intention is not having to save the output as a new dataframe.
My current code is rather simple:
encodeUDF = udf(encode_time,…
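Chaining is possible, as in the sketch below; encode_time comes from the question, while the column names and the StringType return type are assumptions:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

encodeUDF = udf(encode_time, StringType())  # return type assumed

# Apply the UDF, aggregate, and display in one chained expression,
# without binding an intermediate DataFrame.
df.withColumn("encoded_time", encodeUDF("time")).groupBy("encoded_time").count().show()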
I am almost certain this has been asked before, but a search through Stack Overflow did not answer my question. This is not a duplicate of [2], since I want the maximum value, not the most frequent item. I am new to pyspark and trying to do something really…
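A minimal sketch for taking the maximum of a column (the column name is a placeholder):

from pyspark.sql import functions as F

# Aggregate the max and pull it back to the driver as a plain Python value.
max_value = df.agg(F.max("value")).collect()[0][0]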
I want to convert the values inside a column to lowercase. Currently, if I use the lower() method, it complains that Column objects are not callable. Since there's a lower() function in SQL, I assume there's a native Spark solution that…
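A minimal sketch using the built-in lower function from pyspark.sql.functions (the column name is a placeholder):

from pyspark.sql import functions as F

# lower() from pyspark.sql.functions operates on a Column expression.
df = df.withColumn("city", F.lower(F.col("city")))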
Using Spark 1.6, I have a Spark DataFrame column (named, let's say, col1) with values A, B, C, DS, DNS, E, F, G and H. I want to create a new column (say col2) with the values from the dict below. How do I map this? (e.g. 'A' needs to be mapped…
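A sketch that works on Spark 1.6 is a simple UDF lookup; the mapped values below are placeholders, since the actual dict is not shown:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

mapping = {"A": "x", "B": "x", "C": "y", "DS": "y", "DNS": "y",
           "E": "z", "F": "z", "G": "z", "H": "z"}  # placeholder values

# Look each col1 value up in the dict; unknown keys become null.
map_udf = udf(lambda v: mapping.get(v), StringType())
df = df.withColumn("col2", map_udf(df["col1"]))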
I've successfully created a row_number() over a partitionBy window in Spark, but I would like to sort it in descending order instead of the default ascending. Here is my working code:
from pyspark.sql import HiveContext
from pyspark.sql.types import *
from…
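A minimal sketch of the descending sort (the partition and order column names are assumptions):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Order the window by the sort column in descending order before row_number().
w = Window.partitionBy("group_col").orderBy(F.col("sort_col").desc())
df = df.withColumn("rn", F.row_number().over(w))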
I have a pyspark dataframe with a column of numbers. I need to sum that column and then have the result returned as an int in a Python variable.
df = spark.createDataFrame([("A", 20), ("B", 30), ("D", 80)],["Letter", "Number"])
I do the following to…
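A minimal sketch using the example frame above: aggregate, collect the single result row, and take its first field as a Python int.

from pyspark.sql import functions as F

total = df.agg(F.sum("Number")).collect()[0][0]
print(total)  # 130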
I have a pyspark dataframe consisting of one column, called json, where each row is a Unicode string of JSON. I'd like to parse each row and return a new dataframe where each row is the parsed JSON.
# Sample Data Frame
jstr1 =…
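One common sketch is to feed the JSON strings back into the JSON reader so it infers the schema; the column is named json per the question, and spark is assumed to be a SparkSession:

# Re-read the json column so Spark infers a schema for the parsed rows.
parsed = spark.read.json(df.rdd.map(lambda row: row.json))
parsed.show()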
I have a dataframe in Spark in which one of the columns contains an array. Now, I have written a separate UDF that converts the array to another array containing only distinct values. See the example below:
Ex: [24,23,27,23] should get converted to [24,…
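A minimal sketch of such a UDF (the column name and integer element type are assumptions):

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType

# Deduplicate the array while keeping the order of first appearance.
dedup = udf(lambda arr: list(dict.fromkeys(arr)) if arr is not None else None,
            ArrayType(IntegerType()))
df = df.withColumn("distinct_values", dedup(df["array_col"]))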
I'm trying to use Spark dataframes instead of RDDs, since they are higher-level than RDDs and tend to produce more readable code.
In a 14-node Google Dataproc cluster, I have about 6 million names that are translated to ids by two…
How do I compute the cumulative sum per group, specifically using the DataFrame abstraction, in PySpark?
With an example dataset as follows:
df = sqlContext.createDataFrame( [(1,2,"a"),(3,2,"a"),(1,3,"b"),(2,2,"a"),(2,3,"b")],
…
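A sketch of a windowed running sum (the schema line above is truncated, so the column names below are assumptions; this form needs a Spark version with Window.unboundedPreceding):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Running sum of "value" within each "group", ordered by "ord".
w = (Window.partitionBy("group")
           .orderBy("ord")
           .rowsBetween(Window.unboundedPreceding, Window.currentRow))
df = df.withColumn("cumsum", F.sum("value").over(w))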