I get the following error in my notebook after setting up an EMR 6.3.0 cluster:
An error was encountered:
Install s3fs to access S3
Traceback (most recent call last):
File "/usr/local/lib64/python3.7/site-packages/pandas/io/parquet.py", line 460, in read_parquet
path, columns=columns, use_nullable_dtypes=use_nullable_dtypes, **kwargs
File "/usr/local/lib64/python3.7/site-packages/pandas/io/parquet.py", line 218, in read
mode="rb",
File "/usr/local/lib64/python3.7/site-packages/pandas/io/parquet.py", line 67, in _get_path_or_handle
path_or_handle, **(storage_options or {})
File "/usr/local/lib/python3.7/site-packages/fsspec/core.py", line 353, in url_to_fs
chain = _un_chain(url, kwargs)
File "/usr/local/lib/python3.7/site-packages/fsspec/core.py", line 315, in _un_chain
cls = get_filesystem_class(protocol)
File "/usr/local/lib/python3.7/site-packages/fsspec/registry.py", line 213, in get_filesystem_class
raise ImportError(bit["err"]) from e
ImportError: Install s3fs to access S3
The EMR cluster is set up with JupyterHub 1.2.0, TensorFlow 2.4.1 and Spark 3.1.1, and I ran the following bootstrap script:
#!/bin/bash
sudo python3 -m pip install -U setuptools
sudo python3 -m pip install -U pip
sudo python3 -m pip install wheel
sudo python3 -m pip install pillow
sudo python3 -m pip install pandas==1.2.5
sudo python3 -m pip install pyarrow
sudo python3 -m pip install boto3
sudo python3 -m pip install s3fs
sudo python3 -m pip install fsspec
The notebook is stored in an S3 bucket: https://p8-data-001.s3.eu-west-3.amazonaws.com/jupyter/jovyan/P8_Notebook_Linux_EMR_PySpark_V1.0.ipynb
The error happens after #4.10.6. From what I have read, it seems I need to downgrade boto3, but if I do so, I get a botocore version compatibility error. Does anyone know how I should set up my bootstrap script?
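To make the boto3/botocore compatibility question concrete, it may help to list the versions that actually ended up installed. The following is only a minimal sketch: the helper name installed_version and the package list are my own choices for illustration, and aiobotocore is included on the assumption that s3fs pulled it in as a dependency.

```python
def installed_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        # Available in the standard library from Python 3.8 onwards.
        from importlib.metadata import PackageNotFoundError, version
    except ImportError:
        # Fallback for Python 3.7 (the version in the traceback); relies on setuptools.
        import pkg_resources
        try:
            return pkg_resources.get_distribution(name).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(name)
    except PackageNotFoundError:
        return None


# Hypothetical list of packages relevant to reading from S3 with pandas.
PACKAGES = ["pandas", "pyarrow", "fsspec", "s3fs", "aiobotocore", "boto3", "botocore"]

for package in PACKAGES:
    print(package, installed_version(package))
```

Running this in a notebook cell and again with python3 on the master node would show whether both environments report the same versions.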
I would expect read_parquet to work, since s3fs is installed according to my bootstrap log file.
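One possibility I have not ruled out is that the interpreter running the notebook code is not the one the bootstrap script installed into. A minimal diagnostic sketch, meant to be run in a notebook cell, is shown below; the helper name is_importable is hypothetical, and the module list is simply the set of packages from my bootstrap that matter for S3 access.

```python
import importlib.util
import sys


def is_importable(module_name):
    """Return True if a top-level module can be found by the current interpreter."""
    return importlib.util.find_spec(module_name) is not None


# Which Python executable and version is executing this cell?
print("interpreter:", sys.executable)
print("version:", sys.version.split()[0])

# Can that interpreter see the packages the bootstrap installed?
for module_name in ("s3fs", "fsspec", "pyarrow", "boto3"):
    print(module_name, "importable:", is_importable(module_name))
```

If s3fs shows as not importable here while the bootstrap log reports a successful install, the two environments differ, which would explain the ImportError without any boto3 downgrade being involved.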