
I have a simple example running on a Dataproc master node where Tachyon, Spark, and Hadoop are installed.

I have a replication error writing to Tachyon from Spark. Is there any way to specify it needs no replication?

15/10/17 08:45:21 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /tmp/tachyon/workers/1445071000001/3/8 could only be replicated to 0 nodes instead of minReplication (=1).  There are 0 datanode(s) running and no node(s) are excluded in this operation.
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1550)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3110)

The portion of the log I printed is just a warning, but a Spark error follows immediately.

I checked the Tachyon configuration docs and found a setting that might be causing this:

tachyon.underfs.hdfs.impl   "org.apache.hadoop.hdfs.DistributedFileSystem"

Given that this is all on a Dataproc master node, with Hadoop preinstalled and HDFS already working with Spark, I would expect this to be a problem solvable from within Tachyon.

    On that same cluster, have you verified the base HDFS setup is indeed healthy? If you run `hdfs dfsadmin -report` on the master node does it report a nonzero number of live datanodes? – Dennis Huo Oct 18 '15 at 01:23
  • @DennisHuo This is probably it, as the workers are shut down. Will try. – BAR Oct 18 '15 at 01:48
  • @DennisHuo That solved it, thank you. Now I am wondering why I cannot submit Spark jobs after the first restart (post Tachyon install). – BAR Oct 18 '15 at 02:30
  • @BAR - What's the error or symptom when you try to submit Spark jobs after the first restart? – James Oct 27 '15 at 16:00
  • @James That is the error. I have to write an answer of sorts... the problem is solved. – BAR Oct 27 '15 at 18:53

2 Answers


You can adjust the default replication by manually setting dfs.replication inside /etc/hadoop/conf/hdfs-site.xml to some value other than Dataproc's default of 2. Setting it just on your master should at least cover driver calls and hadoop fs calls, and it appears to propagate correctly into hadoop distcp calls as well, so most likely you don't need to worry about also setting it on every worker, as long as the workers get their FileSystem configs from job-scoped configurations.
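
For reference, a minimal sketch of what that property might look like inside /etc/hadoop/conf/hdfs-site.xml, within the <configuration> element (the value 1 here is only an illustration; pick whatever replication suits your cluster):

    <property>
      <!-- client-side default replication applied when new files are created -->
      <name>dfs.replication</name>
      <value>1</value>
    </property>

Since dfs.replication is a client-side default, the new value only applies to files written after clients pick up the updated config; existing files keep their current replication unless you change it explicitly, e.g. with `hadoop fs -setrep`.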

Note that a replication of 1 already means a single copy of the data in total, rather than "one replica in addition to the main copy", so replication can't really go lower than 1. The minimum replication is controlled by dfs.namenode.replication.min in the same hdfs-site.xml; you can see it referenced in BlockManager.java.
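
If you want to confirm which values are actually in effect on a given node, something like the following should work (hdfs getconf reports values from the configuration files visible to that node):

    # print the effective value of each replication-related key
    hdfs getconf -confKey dfs.replication
    hdfs getconf -confKey dfs.namenode.replication.min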

Dennis Huo

This being a replication issue, one would naturally look at the status of worker nodes.

Turns out they were down for another reason. After fixing that, this error disappeared.

What I would like to know, and will accept as an answer, is how to change the replication factor manually.

BAR