
I am using Spark NLP from John Snow Labs to extract embeddings from my textual data; the pipeline is below. The model is 1.8 GB after saving it to HDFS.

embeddings = BertSentenceEmbeddings.pretrained("labse", "xx") \
      .setInputCols("sentence") \
      .setOutputCol("sentence_embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
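
The document_assembler and sentence_detector stages are defined earlier in the script; roughly, they look like this (a minimal sketch, the exact parameters may differ):

from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector

# standard Spark NLP stages producing the "sentence" column consumed by the embeddings stage
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")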

I saved the pipeline_model into HDFS using pipeline_model.save("hdfs:///<path>").

The above was executed only once.

In another script, I load the stored pipeline from HDFS using pipeline_model = PretrainedPipeline.from_disk("hdfs:///<path>").
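
Applying the loaded pipeline then looks roughly like this (a sketch; input_df is a placeholder for any DataFrame with a text column, not part of my actual code):

from sparknlp.pretrained import PretrainedPipeline

# load the saved pipeline once, then reuse it for every transform
pipeline_model = PretrainedPipeline.from_disk("hdfs:///<path>")
result_df = pipeline_model.transform(input_df)  # input_df must have a "text" column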

Loading works, but it takes too much time. I tested it in Spark local mode (no cluster), even though the machine has high resources: 94 GB RAM and 32 cores.

Later, I deployed the script on YARN with 12 executors, each with 3 cores and 7 GB RAM, and assigned 10 GB of driver memory.

The script again takes too much time just to load the saved model from HDFS.

When Spark reaches this point, it takes a very long time.

I thought of the following approach.

Preloading

The idea is to pre-load the model into memory once, and whenever the script needs to transform a DataFrame, reuse the in-memory reference to the pretrained pipeline without any further disk I/O. I searched for a way to do this but it led nowhere; something like the sketch below is what I have in mind.
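
A minimal sketch of what I mean (the module and function names are made up):

# preload_pipeline.py -- hypothetical helper, illustrating the idea only
from sparknlp.pretrained import PretrainedPipeline

_PIPELINE = None  # cached once per driver process

def get_pipeline(path="hdfs:///<path>"):
    # load from HDFS only on the first call, then reuse the in-memory reference
    global _PIPELINE
    if _PIPELINE is None:
        _PIPELINE = PretrainedPipeline.from_disk(path)
    return _PIPELINE

This would only help within a single Spark application, though; every new script would still pay the full load cost.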

Please let me know what you think of this approach and what would be the best way to achieve it.

Resources on YARN

NodeName        Count   RAM (each)   Cores (each)
Master Node     1       38 GB        8
Secondary Node  1       38 GB        8
Worker Nodes    4       24 GB        4
Total           6       172 GB       32

Thanks

Danial Shabbir
  • I also experienced extremely poor performance with Spark NLP LaBSE on a Hadoop CPU cluster. Ended up using the Hugging Face PyTorch port, up to 100x faster. – shay__ Jun 09 '21 at 05:14
  • Also, make sure you're using kryo serialization. – shay__ Jun 09 '21 at 05:51
  • sure :) with pytorch I just use `df.rdd.mapPartitions` and use the model manually... if you still want to use sparknlp, you might want to check issue #2846 on github, about output not equal to the original model – shay__ Jun 09 '21 at 07:11
  • Okay, thanks. I used the Hugging Face transformer. Thanks for the tips, it works fine now. – Danial Shabbir Jun 09 '21 at 07:26
  • @shay__ what do you think about using the Hugging Face transformer inside a UDF? I am using it in a UDF but the Spark job gets killed automatically. Edit: I checked using the system monitor; the RAM gets completely full. I have 100 GB of RAM and it fills up, whereas with SparkNLP it only uses 28 GB of RAM. – Danial Shabbir Jun 10 '21 at 07:46
  • I set `spark.yarn.executor.memoryOverhead=16gb` and make sure every partition has no more than 10MB of data – shay__ Jun 10 '21 at 08:38
  • and in general I wouldn't use udf – shay__ Jun 10 '21 at 08:57
  • Shay, I set spark.yarn.executor.memoryOverhead=16gb. I am using Sparkling Water H2O AutoML, but the problem now is that when I feed the feature matrix into the AutoML function to start training, it gets killed automatically. It's like the memory spills over or something, because my whole RAM is being utilized. – Danial Shabbir Jun 14 '21 at 03:53
  • try to train on a single data point, just to make sure the issue is not related to scale (but rather to loading the model, etc.) – shay__ Jun 14 '21 at 06:27
  • Okay, I will let you know. Thanks! – Danial Shabbir Jun 14 '21 at 06:51
  • @shay__ I did try loading 5 records and the script works fine. I think the issue here is the large data; I have 0.5 million records and they will keep growing, maybe exponentially. What should be the optimal method for this? – Danial Shabbir Jun 14 '21 at 07:04
  • I mentioned earlier that I repartition the data - in your case try `df.repartition(500)` just to make sure there are ~1k rows per partition. Then (inside `mapPartitions`) I load the entire partition to memory - and pass it to the pytorch model. Also - how many cores you set per executor? mind that every core will use its own python process. – shay__ Jun 14 '21 at 07:17
  • @shay__ I have 32 cores and am using 30 of them; I am running locally on my PC with 94 GB RAM. – Danial Shabbir Jun 14 '21 at 07:22
  • After partitioning, the script seems to work partially; when mapPartitions() is called it gets `ERROR:root:Exception while sending command. Traceback (most recent call last): File "/hadoop/yarn/local/usercache/livy/appcache/application_1623058160826_4782/container_e199_1623058160826_4782_01_000001/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1062, in send_command raise Py4JNetworkError("Answer from Java side is empty") py4j.protocol.Py4JNetworkError: Answer from Java side is empty` – Danial Shabbir Jun 14 '21 at 11:29
  • I'm not sure you can utilize 30 cores in this case... try 5, again just to make sure this is the issue. – shay__ Jun 14 '21 at 12:35
  • Okay, I tried with 5 rows on my YARN cluster and the error is the same. I am broadcasting the model (1.9 GB) and then using the broadcast model inside mapPartitions(). I think that is creating the issue. – Danial Shabbir Jun 14 '21 at 12:43
  • I don't use broadcast - I use `sc.addFile(hdfs_dir_path, recursive=True)` on the driver and `SparkFiles.get()` on the executors - much better in this case :) – shay__ Jun 14 '21 at 13:22
  • @shay__ thank you for your awesome support so far, but please could you share sample code for loading a hugging_face model from `HDFS` using `sc.addFile()` and performing `word_embeddings`? That would be very helpful and much appreciated. – Danial Shabbir Jun 15 '21 at 05:26
  • @shay__ I am using `sentence-transformer`, but if my code works with `hugging-face` I can also switch to it. – Danial Shabbir Jun 15 '21 at 05:27
  • Please see the example as an answer. – shay__ Jun 15 '21 at 06:00
  • @DanialShabbir Setting `.config("spark.kryoserializer.buffer.max", "2000M")` is very important for loading large models, especially in PySpark (serialization via Kryo) - `labse` is very large. – Maziyar Mar 23 '22 at 13:15
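
For reference, the Kryo settings mentioned in the comments are applied when building the Spark session, roughly like this (a sketch; the app name and the rest of the builder are placeholders):

from pyspark.sql import SparkSession

# Kryo serializer plus a large serialization buffer, as suggested in the comments above
spark = SparkSession.builder \
    .appName("labse-embeddings") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .getOrCreate()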

1 Answer


As discussed in the comments, this is a solution based on PyTorch, not SparkNLP. Simplified code:

# labse_spark.py
import pandas as pd
import torch
from pyspark import SparkFiles
from pyspark.sql import types as T
from transformers import AutoModel, AutoTokenizer

# loaded lazily on each executor, once per Python worker process
LABSE_MODEL, LABSE_TOKENIZER = None, None


def transform(spark, df, input_col='text', output_col='output'):
    # ship the saved model directory to every executor (recursive because it is a directory)
    spark.sparkContext.addFile('hdfs:///path/to/labse_model', recursive=True)
    output_schema = T.StructType(df.schema.fields + [T.StructField(output_col, T.ArrayType(T.FloatType()))])

    rdd = df.rdd.mapPartitions(_map_partitions_func(input_col, output_col))
    res = spark.createDataFrame(data=rdd, schema=output_schema)
    return res


def _map_partitions_func(input_col, output_col):
    def executor_func(rows):
        # load everything to memory (partitions should be small, ~1k rows per partition):
        pandas_df = pd.DataFrame([r.asDict() for r in rows])
        global LABSE_MODEL, LABSE_TOKENIZER
        if not (LABSE_TOKENIZER or LABSE_MODEL):  # should happen once per executor core
            LABSE_TOKENIZER = AutoTokenizer.from_pretrained(SparkFiles.get('labse_model'))
            LABSE_MODEL = AutoModel.from_pretrained(SparkFiles.get('labse_model'))
        
        # copied from HF model card:
        encoded_input = LABSE_TOKENIZER(
            pandas_df[input_col].tolist(), padding=True, truncation=True, max_length=64, return_tensors='pt')
        with torch.no_grad():
            model_output = LABSE_MODEL(**encoded_input)
        embeddings = model_output.pooler_output
        embeddings = torch.nn.functional.normalize(embeddings)

        pandas_df[output_col] = pd.Series(embeddings.tolist())
        return pandas_df.to_dict('records')

    return executor_func
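
A usage sketch (the DataFrame, output column name, and repartition count are examples, not part of the original code):

# driver side
import labse_spark  # note: labse_spark.py must also be available on the executors (e.g. via --py-files)

df = df.repartition(500)  # keep partitions small (~1k rows each), as discussed in the comments
result = labse_spark.transform(spark, df, input_col='text', output_col='labse_embedding')
result.show(5)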
shay__