
I am relatively new to the Databricks environment. My company has set up a Databricks account for me, and I am pulling data from an S3 bucket. I have a background in traditional relational databases, so it's a bit difficult for me to understand Databricks.

I have following questions:

- Is a mount just a connection (a link to S3/external storage) with nothing stored on DBFS, or does it actually store the data on DBFS?

- I read somewhere that DBFS is also a mount? My understanding is that DBFS is Databricks storage. How can I see the total storage available for DBFS?

- We have different clusters for different teams within the company, and I don't have access to all the clusters. While reading the data from S3, do I have to set something up in my code to ensure that the DataFrames and tables I create in Databricks are not accessible to other users who are not part of the cluster I am using?

- Where are the database tables stored? Is it on DBFS? In terms of storage options, is there any other storage apart from databases, DBFS, and external storage (S3, Azure, JDBC/ODBC, etc.)?

- Are tables/DataFrames always stored in memory when we load them? Is there a way to see the limit on file size in memory?

Thanks!



Good questions! I'll do my best to answer these for you.

Is a mount just a connection (a link to S3/external storage) with nothing stored on DBFS, or does it actually store the data on DBFS? I read somewhere that DBFS is also a mount? My understanding is that DBFS is Databricks storage. How can I see the total storage available for DBFS?

DBFS is an abstraction layer on top of S3 that lets you access data as if it were a local file system. By default, when you deploy Databricks, a bucket is created that is used for storage and can be accessed via DBFS. When you mount to DBFS, you are essentially mounting an S3 bucket to a path on DBFS. More details here.
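For illustration, here is a minimal sketch of mounting a bucket from a notebook. The bucket name and mount point are placeholders, and it assumes the cluster already has credentials for the bucket (e.g. an instance profile); the data itself stays in S3, the mount is just a pointer to it:

```python
# Sketch: mount an S3 bucket to a DBFS path (names are hypothetical).
# Nothing is copied onto DBFS; the mount point just forwards to the bucket.
dbutils.fs.mount(
    source="s3a://my-example-bucket",       # hypothetical bucket
    mount_point="/mnt/my-example-bucket"    # hypothetical DBFS path
)

# Files in the bucket now appear under the mount point.
display(dbutils.fs.ls("/mnt/my-example-bucket"))
```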

We have different clusters for different teams within the company, and I don't have access to all the clusters. While reading the data from S3, do I have to set something up in my code to ensure that the DataFrames and tables I create in Databricks are not accessible to other users who are not part of the cluster I am using?

Mounting an S3 bucket to a path on DBFS will make that data available to others in your Databricks workspace. If you want to make sure no one else can access the data, you will have to take two steps. First, use IAM roles instead of mounts and attach the IAM role that grants access to the S3 bucket to the cluster you plan on using. Second, restrict access to the cluster to only those who can access the data. This way you lock down which clusters can access the data, and which users can access those clusters.
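As a sketch of the IAM-role approach (assuming the cluster has an instance profile attached that grants read access to the bucket; the bucket path and view name are placeholders), you read directly from S3 without creating a mount, so nothing is exposed through a shared DBFS path:

```python
# Sketch: read directly from S3 using the cluster's instance profile.
df = spark.read.parquet("s3a://my-example-bucket/path/to/data")  # hypothetical path

# Temp views are scoped to the current Spark session, not shared across the workspace.
df.createOrReplaceTempView("my_temp_view")
```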

Where are the database tables stored? Is it on DBFS? In terms of storage options, is there any other storage apart from databases, DBFS, and external storage (S3, Azure, JDBC/ODBC, etc.)?

Database tables are stored on DBFS, typically under the /FileStore/tables path. Read more here.
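If you want to confirm where a particular table lives, a quick check (the table name here is a placeholder) is to ask Spark for its metadata and look at the Location row:

```python
# Sketch: inspect where a table's files actually live (table name is hypothetical).
spark.sql("DESCRIBE EXTENDED my_database.my_table").show(truncate=False)
# The "Location" row shows the DBFS or S3 path backing the table.
```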

Are tables/DataFrames always stored in memory when we load them? Is there a way to see the limit on file size in memory?

This depends on your query. If your query is SELECT count(*) FROM table, then yes, the entire table is loaded into memory. If you are filtering, Spark will try to be efficient and only read those portions of the table that are necessary to execute the query. The limit on file size is proportional to the size of your cluster: Spark partitions data in memory across the cluster. If you are still running out of memory, it's usually time to increase the size of your cluster or refine your query. Autoscaling on Databricks helps with the former.
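As a small sketch of that behaviour (the path and column names are placeholders): Spark evaluates lazily, so only the columns and partitions needed by the filter are pulled into memory when an action runs.

```python
# Sketch: Spark reads lazily; only the columns/partitions needed by the
# filter and projection are read when the action (count) is triggered.
df = spark.read.parquet("/mnt/my-example-bucket/events")          # hypothetical path
filtered = df.filter(df.event_date == "2019-08-01").select("user_id")
filtered.count()   # the action triggers the (pruned) read
```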

Thanks!

You're welcome!

Edit: In-memory refers to RAM; DBFS does no processing. To see the available space, you have to log into your AWS/Azure account and check the S3/ADLS storage associated with Databricks.
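There is no built-in "free space" figure for DBFS since it is backed by object storage, but as a rough sketch you can total the size of the files under a path (the path here is a placeholder, and directories are listed with a trailing slash):

```python
# Sketch: rough estimate of space used under a DBFS path by summing file sizes.
def dbfs_usage_bytes(path):
    total = 0
    for entry in dbutils.fs.ls(path):
        if entry.name.endswith("/"):          # directory: recurse into it
            total += dbfs_usage_bytes(entry.path)
        else:                                 # file: add its size in bytes
            total += entry.size
    return total

print(dbfs_usage_bytes("/FileStore"))  # hypothetical path to inspect
```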

If you save tables through Spark APIs they will be on the FileStore/tables path as well. The UI leverages the same path.

Clusters are made up of a driver node and worker nodes. You can use a different EC2 instance type for the driver if you want. It's all one system, that system being the cluster.

Spark supports partitioning the parquet files associated with tables. In fact, this is a key strategy to improving the performance of your queries. Partition by a predicate that you commonly use in your queries. This is separate from partitioning in memory.
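A short sketch of that disk-level partitioning when writing a table (the column, database, and table names are placeholders); queries that filter on the partition column can then skip whole directories:

```python
# Sketch: partition the table's parquet files on disk by a commonly filtered column.
(df.write
   .partitionBy("event_date")                        # hypothetical partition column
   .mode("overwrite")
   .saveAsTable("my_database.events_partitioned"))   # hypothetical table name

# A query filtering on event_date now only scans the matching partitions.
spark.sql("SELECT count(*) FROM my_database.events_partitioned "
          "WHERE event_date = '2019-08-01'").show()
```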

Raphael K
  • Thanks Raphael for the detailed explanation! Really appreciate your help. A few follow-up questions: Q1. Does in-memory refer to the RAM of the cluster? The cluster I am using has an r5.4xlarge, 128.0 GB memory, 16 cores, 3.6 DBU configuration. So in this case my in-memory can handle data up to 128 GB? If the data is, say, 200 GB, would the remaining 72 GB be processed on DBFS with 128 GB in memory? Q2. Is there a way to see DBFS available vs. used space? –  Aug 23 '19 at 02:09
  • Q3. /FileStore/tables - from the link - stores the files that you upload via the Create table UI. I am using a script to create tables and can't see the table names under that path. Are these table names encrypted in that folder? I can see a lot of encrypted-looking names. Q4. The cluster I am using has an r5.4xlarge, 128.0 GB memory, 16 cores, 3.6 DBU configuration for both the 1 driver and 20 workers. Does that mean the driver and workers are technically running on the same system? –  Aug 23 '19 at 02:12
  • Q5. Are partitions created only in memory, or can they be done on DBFS files as well? –  Aug 23 '19 at 02:28
  • @user11704694 there are two ways to think about partitioning: one is writing the data or processed data out from Databricks in a directory hierarchy, and the other is partitioning to improve the performance of Databricks tables – sai saran Nov 28 '22 at 08:42