
I'm experimenting with NLTK in an Azure Synapse notebook. When I try and run nltk.download('stopwords') I get the following error:

ValueError: I/O operation on closed file
Traceback (most recent call last):

  File "/home/trusted-service-user/cluster-env/env/lib/python3.6/site-packages/nltk/downloader.py", line 782, in download
    show(msg.message)

  File "/home/trusted-service-user/cluster-env/env/lib/python3.6/site-packages/nltk/downloader.py", line 775, in show
    subsequent_indent=prefix + prefix2 + " " * 4,

  File "/mnt/var/hadoop/tmp/nm-local-dir/usercache/trusted-service-user/appcache/application_1616860588116_0001/container_1616860588116_0001_01_000001/tmp/9026485902214290372", line 536, in write
    super(UnicodeDecodingStringIO, self).write(s)

ValueError: I/O operation on closed file

If I try and just run nltk.download() I get the following error:

EOFError: EOF when reading a line
Traceback (most recent call last):

  File "/home/trusted-service-user/cluster-env/env/lib/python3.6/site-packages/nltk/downloader.py", line 765, in download
    self._interactive_download()

  File "/home/trusted-service-user/cluster-env/env/lib/python3.6/site-packages/nltk/downloader.py", line 1117, in _interactive_download
    DownloaderShell(self).run()

  File "/home/trusted-service-user/cluster-env/env/lib/python3.6/site-packages/nltk/downloader.py", line 1143, in run
    user_input = input("Downloader> ").strip()

EOFError: EOF when reading a line

I'm hoping someone could give me some help on what may be causing this and how to get around it. I haven't been able to find much information on where to go from here.

Edit: The code I am using to generate the error is the following:

import nltk
nltk.download('stopwords')

Update: I ended up opening a support request with Microsoft and this was their response:

Synapse does not support arbitrary shell scripts which is where you would download the related model corpus for NLTK

They recommended I use sc.addFile, which I ended up getting to work. So if anyone else finds this, here's what I did.

  1. Download the NLTK stopwords here: http://nltk.org/nltk_data/
  2. Upload the stopwords to the following folder in storage: abfss://<file_system>@<account_name>.dfs.core.windows.net/synapse/workspaces/<workspace_name>/nltk_data/corpora/stopwords/
  3. Run the below code to import them


import os
import sys
import nltk
from pyspark import SparkFiles

#add stopwords from storage
sc.addFile('abfss://<file_system>@<account_name>.dfs.core.windows.net/synapse/workspaces/<workspace_name>/nltk_data/',True)

#append path to NLTK
nltk.data.path.append(SparkFiles.getRootDirectory() + '/nltk_data')

nltk.corpus.stopwords.words('english')
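
Once the corpus resolves, `stopwords.words('english')` returns a plain list of words, typically used to filter tokens. A minimal sketch of that usage (with a small hardcoded stand-in set, since the real list requires the downloaded corpus above):

```python
# Hypothetical stand-in for nltk.corpus.stopwords.words('english');
# the real call returns the full list once the corpus is on nltk.data.path.
stopwords = {"the", "is", "on", "a", "of"}

tokens = ["the", "model", "is", "trained", "on", "a", "large", "corpus"]

# Keep only tokens that are not stopwords
filtered = [t for t in tokens if t not in stopwords]
print(filtered)  # ['model', 'trained', 'large', 'corpus']
```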

Thanks!

User181
  • Just to rule out the obvious, have you got at least Storage Blob Data Contributor on any storage associated with your Synapse workspace, and are Contributor or Owner of the workspace? Just for fun I'd try the same code in an Azure Databricks notebook just to see if you get a different behaviour. Otherwise you're looking at raising a support ticket I think. – wBob Mar 27 '21 at 17:33
  • This is a test environment I setup, so I do have owner/contributor on all of the services. I did actually try it on Databricks and the code runs fine. This is my first time using Synapse, so I wasn't sure if I was missing something or if there's an extra step when downloading data like this. – User181 Mar 27 '21 at 18:33
  • Did you set up the storage as part of the workspace or did it already exist? If you set it up then you probably do have the new RBAC role Storage Blob Data Contributor. If the storage already existed you probably don’t. Please check, just to rule it out. Otherwise consider posting a more complete code example which would allow someone to reproduce the error. – wBob Mar 27 '21 at 18:41
  • Thanks - I updated the post with the exact code I'm running. The storage account already existed, however, I also created the storage account. I currently am a service admin for this subscription and do have read/write access to the storage account Synapse is using. – User181 Mar 27 '21 at 18:56
  • I tried a few things like the `quiet` switch of download: `nltk.download('stopwords', quiet=True)` and manually downloading the files and loading unzipped copies up to `.../sparkpools//libraries/python/nltk_data/corpora` as per [here](https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-manage-python-packages#storage-account) but couldn't get it to work. Post back the solution if you raise that ticket. – wBob Mar 31 '21 at 23:09
  • @wBob Thanks for your help, you were getting pretty close with loading those unzipped copies. I updated the post with Microsoft's answer and what I got to work. – User181 Apr 03 '21 at 02:21
  • Great! Good solution. The best thing to do is post that as an answer and mark it as answered. This gives a clear indication the solution was found. – wBob Apr 03 '21 at 10:54

2 Answers


I ended up opening a support request with Microsoft and this was their response:

Synapse does not support arbitrary shell scripts which is where you would download the related model corpus for NLTK

They recommended I use sc.addFile, which I ended up getting to work. So if anyone else finds this, here's what I did.

  1. Download the NLTK stopwords here: http://nltk.org/nltk_data/
  2. Upload the stopwords to the following folder in storage: abfss://<file_system>@<account_name>.dfs.core.windows.net/synapse/workspaces/<workspace_name>/nltk_data/corpora/stopwords/
  3. Run the below code to import them


import os
import sys
import nltk
from pyspark import SparkFiles
    
#add stopwords from storage
sc.addFile('abfss://<file_system>@<account_name>.dfs.core.windows.net/synapse/workspaces/<workspace_name>/nltk_data/',True)
    
#append path to NLTK
nltk.data.path.append(SparkFiles.getRootDirectory() + '/nltk_data')
    
nltk.corpus.stopwords.words('english')

Thanks!

User181

I recently had this same issue on Synapse Analytics and ended up opening a support request ticket with Microsoft.

Note that Synapse does not natively support NLTK stopwords, so you will have to download the stopwords and put them in an Azure Storage directory.

The Microsoft team recommended I use sc.addFile, which worked.

Just like the others above, here's what I did.

  1. Download the NLTK stopwords here: http://nltk.org/nltk_data/
  2. Upload the stopwords to the following folder in storage: abfss://<file_system>@<account_name>.dfs.core.windows.net/synapse/workspaces/<workspace_name>/nltk-data/corpora/stopwords/
  3. Run the below code to import them

Note that I used 'nltk-data' in the directory names, rather than 'nltk_data', for my implementation.

import os
import sys
import nltk
from pyspark import SparkFiles
    
#add stopwords from storage
sc.addFile('abfss://<file_system>@<account_name>.dfs.core.windows.net/synapse/workspaces/<workspace_name>/nltk-data/',True)
    
#append path to NLTK
nltk.data.path.append(SparkFiles.getRootDirectory() + '/nltk-data')
    
nltk.corpus.stopwords.words('english')


  • This feels suspiciously similar to @User181's answer... Even the whitespace before sc... I think you should just comment under his answer adding any additional information you believe you have, instead of copying it and pasting it as yours. – Ioannis Koumarelas Aug 17 '23 at 11:51