
I'm reading a large collection of text files into a DataFrame. Initially it will just have one column, `value`. The text files use HTML encoding (i.e., they have `&lt;` instead of `<`, etc.). I want to decode all of them back to normal characters.

Obviously, I could do it with a UDF, but it would be super slow.
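
For concreteness, here is a minimal sketch of the UDF approach being ruled out, assuming a PySpark DataFrame `df` with a `value` column and Python's stdlib `html.unescape` as the decoder:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
import html

# Every value is serialized to a Python worker, decoded, and shipped back.
unescape_udf = udf(html.unescape, StringType())
df = df.withColumn("value", unescape_udf("value"))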

I could try `regexp_replace`, but it would be even slower, since there are over 200 named entities, and each would require its own `regexp_replace` call. Each call would need to parse the entire line of text, searching for one specific encoded entity at a time.
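
To make the scaling problem concrete, here is a sketch of the chained `regexp_replace` approach, covering just three of the 200+ entities against the same hypothetical `df`; each call rescans the whole column:

from pyspark.sql.functions import regexp_replace

df = (df
    .withColumn("value", regexp_replace("value", "&lt;", "<"))
    .withColumn("value", regexp_replace("value", "&gt;", ">"))
    .withColumn("value", regexp_replace("value", "&amp;", "&")))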

What is a good approach?

    I would take the `regexp_replace` approach – Alberto Bonsanto Jul 02 '16 at 21:58
  • @AlbertoBonsanto it won't work, I'm afraid. I updated the question to talk about this. There are [200+ named entities](https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references). That many `regexp_replace` calls will be so slow that it's probably faster to just use a UDF with a dictionary that stores the conversion map. – max Jul 02 '16 at 23:31
  • A language tag would be useful here. In general UDFs are not the first choice, but overall they are not "super" slow(er), although they have other associated costs depending on the language and Spark version. – zero323 Jul 03 '16 at 11:47
  • @zero323 ahh, I didn't realize that. I thought all UDFs were slow. So I guess if I were writing in Scala, UDFs would be almost as fast as built-in functions (apart from not being part of optimization)? Unfortunately, I'm using Python. – max Jul 03 '16 at 17:54
  • Python UDFs use a completely different evaluation strategy. So the cost is not the cost of the UDF as such, but of passing data between the JVM and the Python interpreter. – zero323 Jul 03 '16 at 18:01
  • @zero323 got it. So the answer is: "use a UDF, but try to switch to Scala or Java", right? If you want, maybe you can post it as an answer? – max Jul 03 '16 at 19:52
  • If I were sure how to answer I would, but I am not. The impact of moving data is significant but shouldn't be prohibitive, and personally I find Python significantly more convenient when it comes to string and XML processing. Which version of Python do you use? Is the input just plain text? – zero323 Jul 03 '16 at 20:06
  • @zero323 Python 3.5. Plain text. – max Jul 04 '16 at 09:27

1 Answer


Since you read plain text input, I would simply skip the UDF part and pass the data to the JVM after the initial processing. With Python 3.4+:

import html
from pyspark.sql.types import StringType, StructField, StructType

def clean(s):
    # html.unescape decodes all named and numeric entities in a single pass;
    # the trailing comma wraps the result in a 1-tuple so each record
    # becomes a single-column row.
    return (html.unescape(s),)

(sc.textFile("README.md")
    .map(clean)
    .toDF(StructType([StructField("value", StringType(), False)])))
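
For reference, `html.unescape` covers all the named and numeric entities in one pass over the string:

import html

html.unescape("The &lt;b&gt;bold&lt;/b&gt; tag &amp; more")
# -> 'The <b>bold</b> tag & more'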
  • So it still requires moving data from Python to the JVM, but at least not in both directions, as in the case of a UDF, right? – max Jul 04 '16 at 16:22
  • Yes, it involves SerDe activity when data is passed to the DataFrame. On the other hand, it doesn't require additional dependencies, so it can be a win after all :) And truth be told, if you don't plan to use anything beyond the strict DataFrame API, then PySpark doesn't make much sense. – zero323 Jul 04 '16 at 16:29