
My database contains one column of strings. I want to create a new column based on part of the strings in that column. For example:

         "content"                             "other column"
The father has two dogs                            father
One cat stay at home of my mother                  mother
etc.                                               etc.

I thought I would create an array with the words that interest me. For example: people=[mother, father, etc.]

Then I iterate over the "content" column and extract the word to insert into the new column:



import pandas as pd

def extract_people(df):
    column = []
    people = ["mother", "father", "etc."]
    for row in df.select("content").collect():  # pulls every row onto the driver
        for word in people:
            if word in row["content"]:
                column.append(word)
                break
    return pd.Series(column)


df_pyspark = df_pyspark.withColumn('people', extract_people(df_pyspark))

This code doesn't work and gives me this error on the collect():

22/01/26 11:34:04 ERROR Executor: Exception in task 2.0 in stage 3.0 (TID 36)
java.lang.OutOfMemoryError: Java heap space

Maybe it is because my file is too large: it has 15 million rows. How can I create the new column in a different way?

domenico
  • Do not use `.collect()` unless your data is small. It fetches all the data onto the driver node and thus does not work in any distributed way. Do you really need Spark for this? It can be done in pure Python. – Gaarv Jan 26 '22 at 13:46
  • If you really want to use Spark, there are other threads about this (a sketch of that approach follows below): https://stackoverflow.com/questions/61636254/pyspark-udf-that-checks-if-a-column-contains-one-of-elements-of-a-list – Gaarv Jan 26 '22 at 13:54
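
For reference, a minimal sketch of the UDF approach from the thread linked above might look like this. It is an illustration, not code from the question or the linked thread: the helper name first_person is made up, the word list and the DataFrame name df_pyspark follow the question, and a recent PySpark version is assumed. The matching then runs on the executors instead of collecting everything to the driver.

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

people = ["mother", "father", "etc."]

# Hypothetical helper: return the first word of `people` contained in the text, else None.
@F.udf(returnType=StringType())
def first_person(content):
    if content is None:
        return None
    for word in people:
        if word in content:
            return word
    return None

df_pyspark = df_pyspark.withColumn("people", first_person("content"))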

1 Answer


Using the following dataframe as an example

+---------------------------------+
|content                          |
+---------------------------------+
|Thefatherhas two dogs            |
|The fatherhas two dogs           |
|Thefather has two dogs           |
|Thefatherhastwodogs              |
|One cat stay at home of my mother|
|One cat stay at home of mymother |
|Onecatstayathomeofmymother       |
|etc.                             |
|my feet smell                    |
+---------------------------------+
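
For completeness, this example dataframe could be built with something like the sketch below (the SparkSession variable name spark is an assumption):

df = spark.createDataFrame(
    [
        ("Thefatherhas two dogs",),
        ("The fatherhas two dogs",),
        ("Thefather has two dogs",),
        ("Thefatherhastwodogs",),
        ("One cat stay at home of my mother",),
        ("One cat stay at home of mymother",),
        ("Onecatstayathomeofmymother",),
        ("etc.",),
        ("my feet smell",),
    ],
    ["content"],
)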

You can do the following

from pyspark.sql import functions

arr = ["father", "mother", "etc."]

# Build a SQL CASE expression: the first word of `arr` found in `content` wins,
# otherwise the new column gets the string 'None'.
expression = (
    "CASE " +
    "".join(["WHEN content LIKE '%{}%' THEN '{}' ".format(val, val) for val in arr]) +
    "ELSE 'None' END")

df = df.withColumn("other_column", functions.expr(expression))
df.show(truncate=False)
+---------------------------------+------------+
|content                          |other_column|
+---------------------------------+------------+
|Thefatherhas two dogs            |father      |
|The fatherhas two dogs           |father      |
|Thefather has two dogs           |father      |
|Thefatherhastwodogs              |father      |
|One cat stay at home of my mother|mother      |
|One cat stay at home of mymother |mother      |
|Onecatstayathomeofmymother       |mother      |
|etc.                             |etc.        |
|my feet smell                    |None        |
+---------------------------------+------------+
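
As a side note, the same first-match logic can also be sketched with the DataFrame API instead of a raw SQL string (same arr and column names as above; functions.when clauses are chained so that the first matching word wins):

from pyspark.sql import functions

arr = ["father", "mother", "etc."]

# Chain when() clauses: the first matching word wins, else fall back to 'None'.
col = functions.when(functions.col("content").contains(arr[0]), arr[0])
for val in arr[1:]:
    col = col.when(functions.col("content").contains(val), val)

df = df.withColumn("other_column", col.otherwise("None"))
df.show(truncate=False)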
BoomBoxBoy
  • Great answer, but I didn't explain my problem correctly. In particular, I need to search for a word in the text even if it is attached to another word. For example: – domenico Jan 26 '22 at 17:08
  • This should work for that case as well! I altered the example to show that it works when the word is not surrounded by spaces – BoomBoxBoy Jan 26 '22 at 17:13
  • Great answer, but I didn't explain my problem correctly. In particular, I need to write the word in the new column even if it is attached to another word. FOR EXAMPLE: if the text is -> The fatherhas two dogs | the text to put in the new column should be: father | But in this case the new column will contain None. | How do I need to modify the query? – domenico Jan 26 '22 at 17:14
  • Maybe I have to replace LIKE with another keyword. – domenico Jan 26 '22 at 17:18
  • I edited the above example again to show more cases. If it still isn't what you are looking for, could you provide a more specific example in the question? – BoomBoxBoy Jan 26 '22 at 17:22
  • And another thing: if the array is arr=["FATHER","MOTHER",...], with upper case, and the text is: "The BigFather has two dogs", then the word to attach to the new column should be "FATHER". How do I modify your answer to handle this? – domenico Jan 26 '22 at 17:24
  • Change `WHEN content LIKE '%{}%' ` to `WHEN UPPER(content) LIKE '%{}%' `. Note that this would then only match search words that are all uppercase (see the sketch after these comments). – BoomBoxBoy Jan 26 '22 at 17:26
  • Thank you very much. It's working!!! – domenico Jan 26 '22 at 17:49
  • Hi @BrendanA, I would like to ask you another question. If I have the texts: "My father have ", "My mother have ", etc., and I want to concatenate the phrase "two dogs" to each text, how can I do it? – domenico Jan 27 '22 at 14:41
  • Hey @domenico, it's a bit tough to understand what you are looking for! If you could open up a new question with an example I would be happy to help! – BoomBoxBoy Jan 27 '22 at 15:12
  • Go to this question: https://stackoverflow.com/questions/70881912/ho-use-pyspark-to-concatenate-a-string-to-column – domenico Jan 27 '22 at 16:11
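
Following up on the upper-case exchange in the comments above, a hedged sketch of that case-insensitive variant (same df as in the answer; the arr values here are just the examples from the comments) could upper-case both sides of the comparison:

from pyspark.sql import functions

arr = ["FATHER", "MOTHER"]

# Upper-case the content and the search word so the match ignores case,
# while keeping the original array value as the label written to the new column.
expression = (
    "CASE " +
    "".join("WHEN UPPER(content) LIKE '%{}%' THEN '{}' ".format(val.upper(), val) for val in arr) +
    "ELSE 'None' END")

df = df.withColumn("other_column", functions.expr(expression))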