
Can anyone help me install and use Spark NLP and PySpark in a Kaggle notebook when the internet is disabled? I have attempted this quite a number of times, but unfortunately I am still not able to get it to work. Any guidance would be much appreciated.

'''
The fat JAR was downloaded from https://github.com/JohnSnowLabs/spark-nlp/releases/tag/5.0.0
under `FAT JARs`, option "CPU on Apache Spark 3.x/3.1.x/3.2.x/3.3.x/3.4.x":
https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-5.0.0.jar
Once it is uploaded as a Kaggle dataset, its path within the Kaggle environment can be determined:
'''
import os
# `TypeError: 'JavaPackage' object is not callable` is thrown if this JAR is not provided to the Spark session
jar_path = '/kaggle/input/spark-nlp-assembly-500jar/spark-nlp-assembly-5.0.0.jar'
print(os.path.exists(jar_path)) # True
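
For reference, once pyspark becomes importable, my understanding is that the local JAR has to be passed to the Spark session explicitly (otherwise the `TypeError` mentioned above is thrown). A minimal sketch of how I plan to start the session, assuming the standard `spark.jars` config key plus the Kryo serializer settings I have seen recommended for Spark NLP:

from pyspark.sql import SparkSession

# Point Spark at the uploaded fat JAR instead of letting it fetch
# packages from the network, which is impossible with the internet disabled.
spark = SparkSession.builder \
    .appName('Spark NLP offline') \
    .master('local[*]') \
    .config('spark.jars', jar_path) \
    .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
    .config('spark.kryoserializer.buffer.max', '2000M') \
    .getOrCreate()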

# https://pypi.org/project/spark-nlp/#files: spark_nlp-5.0.2-py2.py3-none-any.whl
!pip install /kaggle/input/spark-nlp-502-py2py3-none-anywhl/spark_nlp-5.0.2-py2.py3-none-any.whl
'''
Processing /kaggle/input/spark-nlp-502-py2py3-none-anywhl/spark_nlp-5.0.2-py2.py3-none-any.whl
Installing collected packages: spark-nlp
Successfully installed spark-nlp-5.0.2
'''

I believe Spark NLP is installed at this point, given the code and log messages above. I therefore proceeded to the next step, which unfortunately led to an error related to PySpark.
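
To double-check the installation without importing the package (which, as shown below, pulls in pyspark), the standard pip metadata query can be run first:

!pip show spark-nlp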

import sparknlp

'''
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[13], line 1
----> 1 import sparknlp

File /opt/conda/lib/python3.10/site-packages/sparknlp/__init__.py:18
     16 import subprocess
     17 import threading
---> 18 from pyspark.sql import SparkSession
     19 from sparknlp import annotator
     20 # Must be declared here one by one or else PretrainedPipeline will fail with AttributeError

ModuleNotFoundError: No module named 'pyspark'
'''

So I downloaded pyspark-3.4.1.tar.gz from https://pypi.org/project/pyspark/#files and uploaded it to my Kaggle notebook (Kaggle appears to have extracted the archive on upload, hence the directory path below). Unfortunately, it still does not work.

!pip install /kaggle/input/pyspark-341targz/pyspark-3.4.1

'''
Processing /kaggle/input/pyspark-341targz/pyspark-3.4.1
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [8 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/kaggle/input/pyspark-341targz/pyspark-3.4.1/setup.py", line 183, in <module>
          copyfile("pyspark/shell.py", "pyspark/python/pyspark/shell.py")
        File "/opt/conda/lib/python3.10/shutil.py", line 256, in copyfile
          with open(dst, 'wb') as fdst:
      OSError: [Errno 30] Read-only file system: 'pyspark/python/pyspark/shell.py'
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
'''
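
My reading of the `OSError` is that `/kaggle/input` is mounted read-only, while pyspark's setup.py tries to copy files inside its own source tree during the build. A workaround I am considering (a sketch, not yet verified): copy the extracted source to the writable `/kaggle/working` directory and install from the copy.

import shutil
# /kaggle/input is read-only; setup.py writes inside the source tree,
# so build from a writable copy under /kaggle/working instead.
shutil.copytree('/kaggle/input/pyspark-341targz/pyspark-3.4.1',
                '/kaggle/working/pyspark-3.4.1')

!pip install /kaggle/working/pyspark-3.4.1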

I checked by running the following two pip commands with the internet enabled and noticed two wheel files: py4j-0.10.9.7-py2.py3-none-any.whl and pyspark-3.4.1-py2.py3-none-any.whl. I was able to locate py4j-0.10.9.7-py2.py3-none-any.whl at https://pypi.org/project/py4j/#files, but not pyspark-3.4.1-py2.py3-none-any.whl. As far as I can tell from the log messages, pyspark-3.4.1-py2.py3-none-any.whl is only created when pip builds it from the source tarball, as in the second command below (see also the sketch after these logs).

!pip download pyspark
'''
Collecting pyspark
  Using cached pyspark-3.4.1.tar.gz (310.8 MB)
  Preparing metadata (setup.py) ... done
Collecting py4j==0.10.9.7 (from pyspark)
  Downloading py4j-0.10.9.7-py2.py3-none-any.whl (200 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 200.5/200.5 kB 7.6 MB/s eta 0:00:00
Saved ./pyspark-3.4.1.tar.gz
Saved ./py4j-0.10.9.7-py2.py3-none-any.whl
Successfully downloaded pyspark py4j
'''

!pip install ./pyspark-3.4.1.tar.gz
'''
Processing ./pyspark-3.4.1.tar.gz
  Preparing metadata (setup.py) ... done
Requirement already satisfied: py4j==0.10.9.7 in /opt/conda/lib/python3.10/site-packages (from pyspark==3.4.1) (0.10.9.7)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.4.1-py2.py3-none-any.whl size=311285413 sha256=a0ee7978f25af3b6b09b05ad35c2e76570fa95b44fcefc54fe31e3883242638e
  Stored in directory: /root/.cache/pip/wheels/6c/07/fb/6d94088fb2a66b99f7632f394832b13e8b982fb8dd3d606c20
Successfully built pyspark
Installing collected packages: pyspark
  Attempting uninstall: pyspark
    Found existing installation: pyspark 3.4.1
    Uninstalling pyspark-3.4.1:
      Successfully uninstalled pyspark-3.4.1
Successfully installed pyspark-3.4.1
'''
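
Given the logs above, my current plan is to build the missing pyspark wheel in a separate internet-enabled session, upload the resulting files as a Kaggle dataset, and install from that dataset offline. A sketch of what I have in mind (the dataset name `pyspark-wheels` is hypothetical):

# With internet enabled: collect pyspark and its py4j dependency as wheels.
!pip wheel pyspark==3.4.1 -w /kaggle/working/wheels

# With internet disabled, after uploading the contents of ./wheels as a dataset:
!pip install --no-index --find-links=/kaggle/input/pyspark-wheels pyspark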