
I started playing with streaming on Databricks Community Edition, but after a few minutes of producing test events I ran into a problem. I believe it's connected to the small temporary files produced during the streaming process. I would like to find and remove them, but I can't figure out where they are stored. The exception is:

com.databricks.api.base.DatabricksServiceException: QUOTA_EXCEEDED: You have exceeded the maximum number of allowed files on Databricks Community Edition. To ensure free access, you are limited to 10000 files and 10 GB of storage in DBFS. Please use dbutils.fs to list and clean up files to restore service. You may have to wait a few minutes after cleaning up the files for the quota to be refreshed. (Files found: 11492);

I have tried running a shell script to count the files in each folder, but unfortunately I can't find anything suspicious: mostly lib, usr, and other folders containing system or Python files, nothing that could have been produced by my streaming job. This is the script I use:

find / -maxdepth 2 -mindepth 1 -type d | while read dir; do
  printf "%-25.25s : " "$dir"
  find "$dir" -type f | wc -l
done
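For reference, the same per-folder count can be sketched in Python (this helper is my own illustration, not from the original post; it only walks top-level directories rather than honoring the script's maxdepth of 2):

```python
import os

def count_files_per_dir(root):
    """For each directory directly under `root`, recursively
    count the regular files it contains (like the shell loop above)."""
    counts = {}
    for entry in sorted(os.listdir(root)):
        path = os.path.join(root, entry)
        if not os.path.isdir(path):
            continue
        total = 0
        for _dirpath, _dirnames, files in os.walk(path):
            total += len(files)
        counts[path] = total
    return counts
```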

Where can I find the cause of the too-many-files problem? Maybe it's not connected to streaming at all?

To be clear, I have not uploaded many custom files to /FileStore.

1 Answer


It looks like you have only checked for files on the local filesystem and not on DBFS itself. You can inspect DBFS by running the following cell in a Databricks notebook:

%fs
ls /

or:

%python
dbutils.fs.ls("/")

You could check for files there and remove them with dbutils.fs.rm or %fs rm. Also take a look at the /tmp folder on DBFS and delete any files there.
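To track down which DBFS folder holds the excess files, a recursive count can be sketched on top of dbutils.fs.ls (the helper count_files is mine; dbutils.fs.ls, FileInfo.isDir(), and dbutils.fs.rm are standard Databricks utilities, but the example path at the end is hypothetical):

```python
def count_files(ls, path):
    """Recursively count files under `path`. `ls` is a listing
    function such as dbutils.fs.ls, returning entries with a
    .path attribute and an .isDir() method (as FileInfo does)."""
    total = 0
    for info in ls(path):
        if info.isDir():
            total += count_files(ls, info.path)
        else:
            total += 1
    return total
```

In a Databricks notebook you would then run something like:

```python
for info in dbutils.fs.ls("/"):
    if info.isDir():
        print(info.path, count_files(dbutils.fs.ls, info.path))

# Once a culprit folder is found, remove it recursively
# (hypothetical path, adjust to what the count reveals):
dbutils.fs.rm("/tmp/my-streaming-checkpoint", recurse=True)
```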

Bram
  • I am using `%sh` here so the search starts from the real root `/`. There are some files in the `/tmp` folder; I already checked there. I am just more curious what exactly *streaming* is producing and why :) – luk Aug 24 '20 at 13:39
  • Update, these are the biggest folders in my filesystem, according to `find`: ```sh /databricks : 43126 /usr : 79378 ``` Are these amounts OK for Community Databricks? – luk Aug 24 '20 at 14:10
  • It seems like you are querying the local file system and not DBFS. It's not necessarily the case that files in DBFS are automatically mounted to the local file system (e.g. `%sh ls /tmp` and `%fs ls /tmp` will most likely produce different results). I would suggest exploring DBFS with `%fs ls /`; most likely you will find some folder there that contains too many small files or too high a volume of data. – Bram Aug 24 '20 at 19:50