
I am trying to do a cost comparison between AWS Glue and Databricks hosted on an AWS environment. For the comparison, I have chosen m4.xlarge, which is the equivalent of 1 DPU in AWS Glue (4 vCPUs / 16 GB memory).

Assume I have a PySpark job that is expected to run for 1 hour daily for 30 days with 5 DPUs. My cost estimate as per AWS is as follows:

Glue cost estimate: 5 DPUs x 30.00 hours x 0.44 USD per DPU-hour = 66.00 USD (Apache Spark ETL job cost)
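
For reference, the same arithmetic in Python (a rough sketch; 0.44 USD per DPU-hour is the Glue Spark ETL rate I used above, which may differ by region):

dpus = 5
hours_per_day = 1
days = 30
rate_per_dpu_hour = 0.44  # USD per DPU-hour (Glue Apache Spark ETL rate used above)

glue_cost = dpus * hours_per_day * days * rate_per_dpu_hour
print(f"Glue ETL job cost: {glue_cost:.2f} USD")  # 66.00 USD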

Databricks cost estimate: this gives a monthly estimate of 74 USD (see the screenshot from the Databricks pricing calculator).

I am concerned about whether we have to pay AWS any EC2 cost for the 6 nodes in addition to this 73 USD, because of the note added to the estimate: "This Pricing Calculator provides only an estimate of your Databricks cost. Your actual cost depends on your actual usage. Also, the estimated cost doesn't include cost for any required AWS services (e.g. EC2 instances)."

That would be approximately an additional 36 USD for this instance type and count, on top of the Databricks cost. Can someone please clarify so we can decide between AWS Glue and Databricks? I know that in Databricks we can choose any instance type, but the question is whether I pay the EC2 cost separately. Thanks.

Yuva

3 Answers


The answer is yes.

You have to pay for all the infrastructure that Databricks uses on your behalf.

As mentioned in the footnote you added: This Pricing Calculator provides only an estimate of your Databricks cost. Your actual cost depends on your actual usage. Also, the estimated cost doesn't include the cost for any required AWS services (e.g. EC2 instances).

Think of it as a software license on top of hardware costs that you would pay anyway, whether you use the software or not.

This point was verified with the Databricks Solutions Architect who is working with our company on implementing the Databricks solution.

YevgenyM

As others have said, the answer is yes and you should think of Databricks as a good "machine manager" and AWS as providing the actual machines. The general formula for determining how much you pay for a job run is:

cost = total_worker_hours * worker_dbu_per_hour * cost_of_dbu
     + total_worker_hours * worker_ec2_instance_cost
     + total_driver_hours * driver_dbu_per_hour * cost_of_dbu
     + total_driver_hours * driver_ec2_instance_cost

You can simplify this to:

total_cost = total_worker_hours * (worker_dbu_per_hour * cost_of_dbu + worker_ec2_instance_cost)
           + total_driver_hours * (driver_dbu_per_hour * cost_of_dbu + driver_ec2_instance_cost)

So what does this formula mean? Well, total_worker_hours is the total number of instance hours you had for your workers. If 8 workers were used for a 1-hour job, you'd have 8 worker hours. This calculation gets a bit trickier with things like auto-scaling, of course. Similarly, total_driver_hours is the total number of driver instance hours; but since there's only one driver, it's just the number of hours your job ran.

The expressions in parentheses, like (driver_dbu_per_hour * cost_of_dbu + driver_ec2_instance_cost), just tell you your hourly rate for a driver (and similarly for your workers). Once you have these values, you know how much you'd pay; see the sketch below.
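
Here is a minimal Python sketch of that formula. The DBU-per-hour figures and prices below are illustrative placeholders, not actual Databricks or AWS quotes; plug in the rates for your instance type, region, and plan:

def job_cost(total_worker_hours, total_driver_hours,
             worker_dbu_per_hour, driver_dbu_per_hour,
             cost_of_dbu, worker_ec2_instance_cost, driver_ec2_instance_cost):
    # Databricks (DBU) charge plus AWS EC2 charge, for workers and driver
    worker_cost = total_worker_hours * (worker_dbu_per_hour * cost_of_dbu + worker_ec2_instance_cost)
    driver_cost = total_driver_hours * (driver_dbu_per_hour * cost_of_dbu + driver_ec2_instance_cost)
    return worker_cost + driver_cost

# Example: 5 workers + 1 driver for 1 hour/day over 30 days.
# The DBU-per-hour values and the 0.20 USD/hour EC2 price are placeholders.
print(job_cost(total_worker_hours=5 * 30, total_driver_hours=1 * 30,
               worker_dbu_per_hour=1.0, driver_dbu_per_hour=1.0,
               cost_of_dbu=0.20, worker_ec2_instance_cost=0.20,
               driver_ec2_instance_cost=0.20))  # -> 72.0 USD with these placeholder rates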

hima

In the screenshot from the Databricks cost calculator you have chosen All-Purpose Compute, which is a more expensive type intended for ad hoc development. You will most likely use Jobs Compute when running your Spark jobs on a schedule.

Jobs Compute: $0.20/DBU
All-Purpose Compute: $0.65/DBU

Using Jobs Compute in the Databricks cost calculator:

$22.50 (Databricks DBUs) + $36 (AWS EC2 cost, which will differ a bit depending on spot prices etc.) = $58.50
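
To make the comparison concrete, here is a small Python sketch that recomputes the monthly Databricks charge under both plan types. The dbu_hours_per_month value is a placeholder chosen to line up with the figures above (the real number depends on your instance type and cluster size), and the EC2 figure is the rough on-demand estimate from the question:

# Illustrative monthly comparison; substitute your own DBU-hours and EC2 cost
dbu_hours_per_month = 112.5   # placeholder; depends on instance type and cluster size
ec2_cost = 36.0               # approximate on-demand EC2 cost from the question, USD

rates = {"Jobs Compute": 0.20, "All-Purpose Compute": 0.65}  # USD per DBU, as listed above
for plan, rate in rates.items():
    dbu_cost = dbu_hours_per_month * rate
    print(f"{plan}: {dbu_cost:.2f} USD DBU + {ec2_cost:.2f} USD EC2 = {dbu_cost + ec2_cost:.2f} USD")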