
I'm reading a large collection of text files into a DataFrame. Initially it will just have one column, `value`. The text files use HTML encoding (i.e., they have `&lt;` instead of `<`, etc.). I want to decode all of them back to normal characters.

Obviously, I could do it with a UDF, but it would be super slow.
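
For concreteness, here is a minimal sketch of the UDF approach being ruled out, assuming a PySpark DataFrame `df` with a `value` column and Python's stdlib `html.unescape` as the decoder:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
import html

# Every value is serialized to a Python worker, decoded, and shipped back.
unescape_udf = udf(html.unescape, StringType())
df = df.withColumn("value", unescape_udf("value"))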

I could try `regexp_replace`, but it would be even slower, since there are over 200 named entities, and each would require its own `regexp_replace` call. Each call would need to parse the entire line of text, searching for one specific encoded entity at a time.
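
To make the scaling problem concrete, here is a sketch of the chained `regexp_replace` approach, covering just three of the 200+ entities against the same hypothetical `df`; each call rescans the whole column:

from pyspark.sql.functions import regexp_replace

df = (df
    .withColumn("value", regexp_replace("value", "&lt;", "<"))
    .withColumn("value", regexp_replace("value", "&gt;", ">"))
    .withColumn("value", regexp_replace("value", "&amp;", "&")))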

What is a good approach?

    I would take the `regexp_replace` approach – Alberto Bonsanto Jul 02 '16 at 21:58
  • @AlbertoBonsanto it won't work, I'm afraid. I updated the question to talk about this. There are [200+ named entities](https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references). That many `regexp_replace` calls will be so slow that it's probably faster to just use a UDF with a dictionary that stores the conversion map. – max Jul 02 '16 at 23:31
  • A language tag would be useful here. In general UDFs are not the first choice, but overall they are not "super" slow(er), although they have other associated costs depending on the language and Spark version. – zero323 Jul 03 '16 at 11:47
  • @zero323 ahh, I didn't realize that. I thought all UDFs were slow. So I guess if I were writing in Scala, UDFs would be almost as fast as built-in functions (apart from not being part of optimization)? Unfortunately, I'm using Python. – max Jul 03 '16 at 17:54
  • Python UDFs use a completely different evaluation strategy. So the cost is not the cost of the UDF as such, but of passing data between the JVM and the Python interpreter. – zero323 Jul 03 '16 at 18:01
  • @zero323 got it. So the answer is: "use a UDF, but try to switch to Scala or Java", right? If you want, maybe you can post it as an answer? – max Jul 03 '16 at 19:52
  • If I were sure how to answer I would, but I am not. The impact of moving data is significant but shouldn't be prohibitive, and personally I find Python significantly more convenient when it comes to string and XML processing. Which version of Python do you use? Is the input just plain text? – zero323 Jul 03 '16 at 20:06
  • @zero323 Python 3.5. Plain text. – max Jul 04 '16 at 09:27

1 Answer


Since you read plain text input, I would simply skip the UDF part and pass the data to the JVM after the initial processing. With Python 3.4+:

import html
from pyspark.sql.types import StringType, StructField, StructType

def clean(s):
    # html.unescape decodes all named and numeric entities in a single pass;
    # the trailing comma wraps the result in a 1-tuple so each record
    # becomes a single-column row.
    return (html.unescape(s),)

(sc.textFile("README.md")
    .map(clean)
    .toDF(StructType([StructField("value", StringType(), False)])))
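
For reference, `html.unescape` covers all the named and numeric entities in one pass over the string:

import html

html.unescape("The &lt;b&gt;bold&lt;/b&gt; tag &amp; more")
# -> 'The <b>bold</b> tag & more'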
  • So it still requires moving data from Python to the JVM, but at least not in both directions, as in the case of a UDF, right? – max Jul 04 '16 at 16:22
  • Yes, it involves SerDe activity when data is passed to the DataFrame. On the other hand, it doesn't require additional dependencies, so it can be a win after all :) And truth be told, if you don't plan to use anything beyond the strict DataFrame API, then PySpark doesn't make much sense. – zero323 Jul 04 '16 at 16:29