My database contains a column of strings, and I want to create a new column based on part of the string in the existing column. For example:
"content" "other column"
The father has two dogs father
One cat stay at home of my mother mother
etc. etc.
I thought of creating an array with the words that interest me, for example: people = ["mother", "father", ...].
Then I iterate over the "content" column and extract the matching word to insert into the new column:
```python
import pandas as pd

def extract_people(df):
    column = []
    people = ["mother", "father"]  # etc.
    for row in df.select("content").collect():
        for word in people:
            if word in str(row):
                column.append(word)
                break
    return pd.Series(column)

df_pyspark = df_pyspark.withColumn('people', extract_people(df_pyspark))
```
This code doesn't work and gives me this error on the collect():
22/01/26 11:34:04 ERROR Executor: Exception in task 2.0 in stage 3.0 (TID 36)
java.lang.OutOfMemoryError: Java heap space
Maybe that's because my file is too large: it has 15 million rows. How can I create the new column in a different way?
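One way I am considering (a sketch, not yet tested at full scale): instead of pulling all 15 million rows to the driver with collect(), let Spark apply the matching function to each row in parallel with a UDF. The pure-Python matching function is shown runnable below; the Spark wiring in the comments assumes `pyspark.sql.functions.udf` and the `df_pyspark` DataFrame from above.

```python
def first_match(text, words):
    """Return the first word from `words` found in `text`, else None."""
    for word in words:
        if word in text:
            return word
    return None

# Spark wiring (assumes an active SparkSession and the DataFrame
# `df_pyspark` with a string column "content"):
#
# from pyspark.sql import functions as F
# from pyspark.sql.types import StringType
#
# people = ["mother", "father"]  # etc.
# people_udf = F.udf(lambda s: first_match(s or "", people), StringType())
# df_pyspark = df_pyspark.withColumn("people", people_udf(F.col("content")))
```

This way each executor only processes its own partition and nothing is materialized on the driver, so the heap-space error should not occur.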