
I've already installed sparknlp and its assembly JARs, but when I try to use one of the models I still get a `TypeError: 'JavaPackage' object is not callable`.

I cannot install the model and load it from disk because it is considered too big (>100 MB) for my project, so it was suggested that I use HDFS to load the pretrained model. Is there a way to do that?

My code:

    import sparknlp
    from sparknlp.pretrained import PretrainedPipeline

    spark = sparknlp.start()
    pipeline = PretrainedPipeline('analyze_sentimentdl_glove_imdb', lang='en')
    annotations = pipeline.fullAnnotate("Hello from John Snow Labs!")[0]

What would be the equivalent for loading with HDFS?
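For reference, this is roughly what I imagine the HDFS variant would look like. This is an untested sketch: the `hdfs://namenode:8020/models/...` path, the helper function, and the assumption that the pipeline was already saved to HDFS (e.g. with `pipeline.model.save(...)`) are all mine, not from the docs. The actual load would go through Spark ML's `PipelineModel.load`, which resolves `hdfs://` URIs via the cluster's Hadoop configuration.

```python
# Untested sketch of loading a previously saved Spark NLP pipeline from HDFS.
# Assumptions (mine): the saved pipeline directory lives under /models on the
# cluster, and the namenode address below is hypothetical.

def hdfs_pipeline_uri(namenode, model_name):
    """Build the fully qualified HDFS URI for a saved pipeline directory."""
    return f"hdfs://{namenode}/models/{model_name}"

uri = hdfs_pipeline_uri("namenode:8020", "analyze_sentimentdl_glove_imdb")
print(uri)

# On the cluster, the actual load would presumably be:
# from pyspark.ml import PipelineModel
# pipeline_model = PipelineModel.load(uri)
```

But I don't know if this is the supported way to point Spark NLP at a model stored on HDFS, hence the question.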

EDIT: Full traceback:

Using NLU code:

Traceback (most recent call last):

  File "/usr/local/juicer/juicer/spark/spark_minion.py", line 490, in _perform_execute
    raise ex from None

  File "/usr/local/juicer/juicer/spark/spark_minion.py", line 486, in _perform_execute
    self._emit_event(room=job_id, namespace='/stand'))

  File "/tmp/juicer_app_10_10_60.py", line 230, in main
    task_futures['a6d45e1d-4322-443e-b7e9-ed78b504a8b0'].result()

  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 435, in result
    return self.__get_result()

  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception

  File "/usr/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)

  File "/tmp/juicer_app_10_10_60.py", line 228, in <lambda>
    lambda: sentimentanalysis_1(spark_session, cached_state, cached_emit_event))

  File "/tmp/juicer_app_10_10_60.py", line 150, in sentimentanalysis_1
    result_df = nlu.load('emotion').predict("I am so happy")

  File "/usr/local/lib/python3.7/dist-packages/nlu/__init__.py", line 153, in load
    f"Something went wrong during creating the Spark NLP model for your request =  {request} Did you use a NLU Spell?")

Exception: Something went wrong during creating the Spark NLP model for your request =  emotion Did you use a NLU Spell?

Using my original code (Spark NLP):

Traceback (most recent call last):

  File "/usr/local/juicer/juicer/spark/spark_minion.py", line 490, in _perform_execute
    raise ex from None

  File "/usr/local/juicer/juicer/spark/spark_minion.py", line 486, in _perform_execute
    self._emit_event(room=job_id, namespace='/stand'))

  File "/tmp/juicer_app_10_10_51.py", line 225, in main
    task_futures['a6d45e1d-4322-443e-b7e9-ed78b504a8b0'].result()

  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 435, in result
    return self.__get_result()

  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception

  File "/usr/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)

  File "/tmp/juicer_app_10_10_51.py", line 223, in <lambda>
    lambda: sentimentanalysis_1(spark_session, cached_state, cached_emit_event))

  File "/tmp/juicer_app_10_10_51.py", line 145, in sentimentanalysis_1
    pipeline = PretrainedPipeline('analyze_sentimentdl_glove_imdb', lang = 'en')

  File "/usr/local/lib/python3.7/dist-packages/sparknlp/pretrained.py", line 141, in __init__
    self.model = ResourceDownloader().downloadPipeline(name, lang, remote_loc)

  File "/usr/local/lib/python3.7/dist-packages/sparknlp/pretrained.py", line 72, in downloadPipeline
    file_size = _internal._GetResourceSize(name, language, remote_loc).apply()

  File "/usr/local/lib/python3.7/dist-packages/sparknlp/internal.py", line 232, in __init__
    "com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize", name, language, remote_loc)

  File "/usr/local/lib/python3.7/dist-packages/sparknlp/internal.py", line 165, in __init__
    self._java_obj = self.new_java_obj(java_obj, *args)

  File "/usr/local/lib/python3.7/dist-packages/sparknlp/internal.py", line 175, in new_java_obj
    return self._new_java_obj(java_class, *args)

  File "/usr/local/spark/python/pyspark/ml/wrapper.py", line 67, in _new_java_obj
    return java_obj(*java_args)

TypeError: 'JavaPackage' object is not callable
  • I added the traceback to the original post – LucasA Mar 10 '22 at 01:37
  • So, this error has nothing to do with the file size. Besides, even at 101 MB, loading from the local filesystem should be fine. I'm not familiar with Spark NLP, but the error seems to be internal to that library, so perhaps you should file an issue with them, or first verify that it supports Python 3.7 – OneCricketeer Mar 10 '22 at 14:05

0 Answers