
In my mrjob.conf I configure the additional volume:

  Instances.InstanceGroups.member.2.EbsConfiguration.EbsBlockDeviceConfigs.member.1.VolumeSpecification.SizeInGB: 250
  Instances.InstanceGroups.member.2.EbsConfiguration.EbsBlockDeviceConfigs.member.1.VolumeSpecification.VolumeType: gp2
  Instances.InstanceGroups.member.2.EbsConfiguration.EbsBlockDeviceConfigs.member.1.VolumesPerInstance: 1

When I run the cluster, I see that each instance has both a 10 GB and a 250 GB volume. But does EMR actually use the 250 GB volume to store data? If not, how do I make it work?
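For context, here is a minimal sketch of how these keys might sit inside mrjob.conf, assuming they are forwarded as raw EMR API parameters through the EMR runner's `emr_api_params` option (the option name and the surrounding structure are assumptions for illustration, not taken from the question):

```yaml
# Hypothetical mrjob.conf layout; only the three dotted keys come from the question.
runners:
  emr:
    # Raw EMR API parameters forwarded at cluster launch (assumed placement).
    emr_api_params:
      Instances.InstanceGroups.member.2.EbsConfiguration.EbsBlockDeviceConfigs.member.1.VolumeSpecification.SizeInGB: 250
      Instances.InstanceGroups.member.2.EbsConfiguration.EbsBlockDeviceConfigs.member.1.VolumeSpecification.VolumeType: gp2
      Instances.InstanceGroups.member.2.EbsConfiguration.EbsBlockDeviceConfigs.member.1.VolumesPerInstance: 1
```

Here `member.2` addresses the second instance group (typically the core nodes, which host HDFS), so the EBS volume is attached to each core instance.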

mirt
  • I would caution that the best advice for EMR is to actually use EMRFS where possible, i.e. read and write S3 directly instead of HDFS. In your case, yes, I believe it will automatically pick up the 250 GB volumes as space for HDFS. – Henry Apr 19 '17 at 20:37
  • "does EMR use 250gb storage to keep data?": which kind of data, and who is generating the data? – jc mannem Apr 20 '17 at 16:38

1 Answer

Yes, EMR mounts, formats, and uses the EBS volumes for HDFS if you provision them through the EMR API at cluster launch.

You can see the volumes mounted at points like /mnt1/, /mnt2/, etc., and those mount points are included in hdfs-site.xml. All writes to HDFS are automatically load-balanced among these mounts based on the policy set in hdfs-site.xml. The current policy is: all writes go to the biggest volume until its remaining space is roughly equal to that of the other volumes, then writes proceed round-robin.
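The balancing behavior described above matches Hadoop's AvailableSpaceVolumeChoosingPolicy. An illustrative hdfs-site.xml fragment might look like this (the property names are standard Hadoop; the exact values EMR generates on your release are an assumption):

```xml
<!-- Illustrative fragment only; EMR writes its own hdfs-site.xml. -->
<property>
  <!-- DataNode storage directories, one per attached volume. -->
  <name>dfs.datanode.data.dir</name>
  <value>/mnt/hdfs,/mnt1/hdfs</value>
</property>
<property>
  <!-- Prefer the volume with the most free space, then round-robin. -->
  <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
  <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
</property>
```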

Note that these mount points may not be used by everything; for example, EMR might not use them to store YARN container logs on local disk (which you can configure later).
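To confirm which volumes HDFS is actually using, you can SSH to the master node and inspect the configuration and mounts. The paths below are typical EMR defaults; this is a sketch to run on a live cluster, not something guaranteed for every EMR release:

```shell
# Show the DataNode storage directories EMR configured (typical config path).
grep -A1 'dfs.datanode.data.dir' /etc/hadoop/conf/hdfs-site.xml

# List mounted volumes and their sizes; the 250 GB EBS volume should appear under /mnt*.
df -h | grep '/mnt'

# Ask HDFS itself how much capacity it sees across DataNodes.
hdfs dfsadmin -report | head -n 20
```

If the 250 GB volume shows up both in `df -h` and in the `dfs.datanode.data.dir` list, HDFS is using it.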

jc mannem