
I have a PySpark block of code that collects data from a dataframe column (Body). I am able to use BeautifulSoup to parse the <p> tags in each row and join the paragraph text into one long string.

from bs4 import BeautifulSoup

text_list = []
for row in df_java.select("Body").collect():
    soup = BeautifulSoup(row[0], "html.parser")
    for p in soup.find_all("p"):
        text = p.get_text()
        text_list.append(text)
string = ''.join(text_list)
print(string)

My result is:

I am basically trying to sort an input file of Students and Marks into alphabetic and numeric order. I have 4 classes, however I cannot manage to get it to print the student with the mark in any order. Let alone in a alphabetic and numeric order. Any help in how I can get the results printing as a total or any help at all is greatly appreciated. Below is the code I have used for the 4 classes and the input file.Input File:Code:I am trying to get this program to get the passwords from an array list.The output is or just for the 2nd thing I tried.The specific number / letter combination seems to change each time the program is run. Is there a way to specify which string to display from the array list? Is it possible to create reentrant aspects with Spring AOP (or AspectJ)?Here is an example:And Aspect:}Now I'd like to know how many times calcFibonacci was called (counting in recurrent calls)........

This is the joined text of all the paragraphs, and it is the result I hope for when I call the function below.

I am trying to create a UDF in PySpark (which I am new to) so I can call the same function on different dataframes. I defined a function that takes the dataframe as an argument, like so:

@udf
def collect_textual_content(data_set):
    list1 = []
    for row in data_set:
        soup = BeautifulSoup(row[0], "html.parser")
        for p in soup.find_all("p"):
            text = p.get_text()
            return text
            list1.append(text)
    string = ''.join(list1)
    return string

When I call `collect_textual_content(df_java.select("Body").collect())` I get an error:

Invalid argument, not a string or column. [Row(Body='I am basically trying to sort an input file of Students and Marks into alphabetic and numeric order. I have 4 classes, however I cannot manage to get it to print the student with the mark in any order. Let alone in a alphabetic and numeric order. Any help in how I can get the results printing as a total or any help at all is greatly appreciated.\nBelow is the code I have used for the 4 classes and the input file.\n\nInput File:

so the content is not parsed whatsoever.

And the argument passed to the function is a list of strings.
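For reference, `df.select("Body").collect()` actually returns a list of `Row` objects rather than plain strings, which is why the error message above shows `Row(Body='…')`. A `Row` supports both positional and field access, much like a named tuple; the sketch below simulates one with `collections.namedtuple` so it runs without Spark:

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row, which also supports index and field access
Row = namedtuple("Row", ["Body"])

rows = [Row(Body="<p>first</p>"), Row(Body="<p>second</p>")]

# Each element of the collected list is a Row, not a str;
# the actual string sits inside it
print(type(rows[0]).__name__)  # Row
print(rows[0][0])              # <p>first</p>  (positional access)
print(rows[0].Body)            # <p>first</p>  (field access)
```

So the function receives a list of `Row` objects, and the string must be pulled out of each one before parsing.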

I hope someone experienced with PySpark knows a solution.

  • You are running the udf over the dataframe, but a udf should be run over a column (or a string referencing the column). So something like `df_java.select(collect_textual_content("Body")).collect()`. Also, I see that you have a line `return text` in the udf which is not in the original parsing code. – Thijs Oct 18 '21 at 19:53
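Following the comment above, a minimal sketch of a per-row function that a UDF could wrap. It uses the standard library's `html.parser` so the sketch is self-contained and runnable without Spark or bs4; `BeautifulSoup(...).find_all("p")` with `get_text()` would do the same job. The Spark wiring is shown in comments and assumes an active SparkSession and the `df_java` dataframe from the question:

```python
from html.parser import HTMLParser

class PTextExtractor(HTMLParser):
    """Collects the text inside every <p> tag."""
    def __init__(self):
        super().__init__()
        self._in_p = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self.chunks.append(data)

def collect_textual_content(body):
    # body is the string value of one row's Body column, not a Row object
    parser = PTextExtractor()
    parser.feed(body or "")
    return "".join(parser.chunks)

# In Spark, wrap the per-row function as a UDF and apply it to the column,
# instead of passing the result of .collect() into the function:
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import StringType
#   collect_udf = udf(collect_textual_content, StringType())
#   df_java.select(collect_udf("Body")).collect()
```

The key difference from the question's code: the UDF runs once per row and receives the column value as a plain string, so there is no loop over rows (and no early `return text`) inside it.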

0 Answers