
We are running Cascading with a Sink Tap configured to store into Amazon S3 and were facing an occasional FileAlreadyExistsException (see [1]). It happened only from time to time (roughly 1 run in 100) and was not reproducible.

Digging into the Cascading code, we discovered that Hfs.deleteResource() is called (among others) by BaseFlow.deleteSinksIfNotUpdate(). Incidentally, we were quite intrigued by the silently swallowed NPE (with the comment "hack to get around npe thrown when fs reaches root directory").

From there, we extended the Hfs tap with our own Tap and overrode the deleteResource() method (see [2]) to add a retry mechanism that calls getFileSystem(conf).delete directly.

The retry mechanism seemed to bring some improvement, but we still occasionally face failures (see the example in [3]): HDFS apparently returns isDeleted=true, yet when we ask immediately afterwards whether the folder exists, we get exists=true, which should not happen. The logs also show isDeleted randomly true or false on runs where the flow succeeds, which suggests the returned value is irrelevant or cannot be trusted.

Can anybody share their own S3 experience with this kind of behavior, i.e. "the folder should be deleted, but it is not"? We suspect an S3 issue, but could it also be in Cascading or HDFS?

We run on Cloudera Hadoop (cdh3u5) and Cascading 2.0.1-wip-dev.

[1]

org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3n://... already exists
    at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
    at com.twitter.elephantbird.mapred.output.DeprecatedOutputFormatWrapper.checkOutputSpecs(DeprecatedOutputFormatWrapper.java:75)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:923)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:882)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:882)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:856)
    at cascading.flow.hadoop.planner.HadoopFlowStepJob.internalNonBlockingStart(HadoopFlowStepJob.java:104)
    at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:174)
    at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:137)
    at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:122)
    at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:42)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:619)

[2]

  @Override
  public boolean deleteResource(JobConf conf) throws IOException {
    LOGGER.info("Deleting resource {}", getIdentifier());

    // Delegate to Hfs first, then double-check that the path is really gone.
    boolean isDeleted = super.deleteResource(conf);
    LOGGER.info("Hfs Sink Tap isDeleted is {} for {}", isDeleted, getIdentifier());

    Path path = new Path(getIdentifier());

    int retryCount = 0;
    int cumulativeSleepTime = 0;
    int sleepTime = 1000;

    // Poll with exponential back-off while the path still appears to exist.
    while (getFileSystem(conf).exists(path)) {
      LOGGER.info("Resource {} still exists, it should not... - I will continue to wait patiently...",
          getIdentifier());
      try {
        LOGGER.info("Now I will sleep " + sleepTime / 1000
            + " seconds while trying to delete {} - attempt: {}",
            getIdentifier(), retryCount + 1);
        Thread.sleep(sleepTime);
        cumulativeSleepTime += sleepTime;
        sleepTime *= 2;
      } catch (InterruptedException e) {
        LOGGER.error("Interrupted while sleeping trying to delete {} with message {}...",
            getIdentifier(), e.getMessage());
        throw new RuntimeException(e);
      }

      // The delete is re-issued only once, on the first retry.
      if (retryCount == 0) {
        getFileSystem(conf).delete(getPath(), true);
      }

      retryCount++;

      if (cumulativeSleepTime > MAXIMUM_TIME_TO_WAIT_TO_DELETE_MS) {
        break;
      }
    }

    // Give up if the path is still visible after the back-off loop.
    if (getFileSystem(conf).exists(path)) {
      LOGGER.error("We didn't succeed to delete the resource {}. Throwing now a runtime exception.",
          getIdentifier());
      throw new RuntimeException(
          "Although we waited to delete the resource for "
              + getIdentifier()
              + ' '
              + retryCount
              + " iterations, it still exists - This must be an issue in the underlying storage system.");
    }

    return isDeleted;
  }

[3]

INFO [pool-2-thread-15] (BaseFlow.java:1287) - [...] at least one sink is marked for delete
 INFO [pool-2-thread-15] (BaseFlow.java:1287) - [...] sink oldest modified date: Wed Dec 31 23:59:59 UTC 1969
 INFO [pool-2-thread-15] (HiveSinkTap.java:148) - Now I will sleep 1 seconds while trying to delete s3n://... - attempt: 1
 INFO [pool-2-thread-15] (HiveSinkTap.java:130) - Deleting resource s3n://...
 INFO [pool-2-thread-15] (HiveSinkTap.java:133) - Hfs Sink Tap isDeleted is true for s3n://...
 ERROR [pool-2-thread-15] (HiveSinkTap.java:175) - We didn't succeed to delete the resource s3n://... Throwing now a runtime exception.
 WARN [pool-2-thread-15] (Cascade.java:706) - [...] flow failed: ...
 java.lang.RuntimeException: Although we waited to delete the resource for s3n://... 0 iterations, it still exists - This must be an issue in the underlying storage system.
    at com.qubit.hive.tap.HiveSinkTap.deleteResource(HiveSinkTap.java:179)
    at com.qubit.hive.tap.HiveSinkTap.deleteResource(HiveSinkTap.java:40)
    at cascading.flow.BaseFlow.deleteSinksIfNotUpdate(BaseFlow.java:971)
    at cascading.flow.BaseFlow.prepare(BaseFlow.java:733)
    at cascading.cascade.Cascade$CascadeJob.call(Cascade.java:761)
    at cascading.cascade.Cascade$CascadeJob.call(Cascade.java:710)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:619)

2 Answers


First, double check the Cascading compatibility page for supported distributions.

http://www.cascading.org/support/compatibility/

Note that Amazon EMR is listed because they periodically run the compatibility tests and report the results back.

Second, S3 is an eventually consistent filesystem; HDFS is not. So assumptions about the behavior of HDFS don't carry over to storing data in S3. For example, a rename is really a copy and delete, and the copy can take hours. Amazon has patched their internal distribution to accommodate many of the differences.
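
To make that concrete, here is a minimal sketch (using the stock Hadoop FileSystem API against a hypothetical s3n:// path) of why a delete followed immediately by an exists check can disagree on an eventually consistent store; the two calls are simply not guaranteed to observe the same listing state:

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class S3ConsistencyProbe {
      public static void main(String[] args) throws Exception {
        // Hypothetical sink path; replace with your own bucket/prefix.
        Path sink = new Path("s3n://my-bucket/output/run-001");
        FileSystem fs = FileSystem.get(URI.create(sink.toString()), new Configuration());

        // delete() can legitimately return true while a subsequent exists()
        // still sees the old listing: S3 listings are eventually consistent,
        // so there is no guarantee the two calls observe the same state.
        boolean deleted = fs.delete(sink, true);
        boolean stillThere = fs.exists(sink);
        System.out.println("deleted=" + deleted + " exists=" + stillThere);
      }
    }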

Third, there are no directories in S3. Directory support is a hack and is implemented differently by different S3 interfaces (jets3t vs s3cmd vs ...), which is bound to be problematic considering the prior point.
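
As an illustration of the "no directories" point, here is a short hypothetical sketch (using the AWS SDK for Java, which is not part of the stack in the question, with placeholder bucket and prefix names) that lists the raw keys behind what Hadoop presents as a directory:

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.ObjectListing;
    import com.amazonaws.services.s3.model.S3ObjectSummary;

    public class ListRawKeys {
      public static void main(String[] args) {
        // Credentials come from the environment; bucket and prefix are placeholders.
        AmazonS3 s3 = new AmazonS3Client();
        ObjectListing listing = s3.listObjects("my-bucket", "output/run-001/");
        for (S3ObjectSummary summary : listing.getObjectSummaries()) {
          // Only keys exist. The "directory" is nothing more than a shared
          // prefix, plus whatever zero-byte marker object the client library
          // (jets3t, s3cmd, ...) decided to write for it.
          System.out.println(summary.getKey() + " (" + summary.getSize() + " bytes)");
        }
      }
    }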

Fourth, network latency and reliability are critical, especially when communicating with S3. Historically I've found the Amazon network to be better behaved when manipulating massive datasets on S3 from EMR rather than from standard EC2 instances. I also believe there is a patch in EMR that improves matters here as well.

So I'd suggest trying the EMR Apache Hadoop distribution to see if your issues clear up.

cwensel

When running any jobs on Hadoop that use files in S3, the nuances of eventual consistency must be kept in mind.

I've helped troubleshoot many apps which turned out to have similar race conditions for delete as their root issue -- whether they were in Cascading or Hadoop streaming or written directly in Java.

There was discussion at one point of having notifications from S3 after a given key/value pair had been fully deleted. I haven't kept up on where that feature stands. Otherwise, it's probably best to design systems -- again, whether in Cascading or any other app that uses S3 -- such that data consumed or produced by a batch workflow gets managed in HDFS, HBase, or a key/value framework (e.g., I have used Redis for this). Then S3 gets used for durable storage, but not for intermediate data.
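
As a rough illustration of that layout, the following Cascading 2.x sketch (all paths and names are placeholders) keeps the flow's output on HDFS and leaves S3 for a separate publish step once the batch has succeeded:

    import java.util.Properties;

    import cascading.flow.FlowDef;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.pipe.Pipe;
    import cascading.scheme.hadoop.TextLine;
    import cascading.tap.SinkMode;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;

    public class StagedFlow {
      public static void main(String[] args) {
        // Source and intermediate sink both live on HDFS, so the flow never
        // depends on S3 listing consistency while it is running.
        Tap source = new Hfs(new TextLine(), "hdfs:///data/input");
        Tap intermediate = new Hfs(new TextLine(), "hdfs:///data/intermediate", SinkMode.REPLACE);

        Pipe copy = new Pipe("copy");

        FlowDef flowDef = FlowDef.flowDef()
            .addSource(copy, source)
            .addTailSink(copy, intermediate);

        new HadoopFlowConnector(new Properties()).connect(flowDef).complete();

        // Only after the flow completes is hdfs:///data/intermediate published
        // to s3n://my-bucket/final (e.g., with distcp or a dedicated copy flow),
        // so S3 only ever holds durable, fully written output.
      }
    }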

Paco