
Recently, our Hadoop cluster running MapR 5.2 has started throwing an error that seems to affect only large files in HDFS. When the primary account (the one running all the MapR services) reads a large file, it works fine. When any other account tries, we get an error like the following:

$ hdfs dfs -get <large file>
2020-04-16 12:09:20,5178 ERROR Client fs/client/fileclient/cc/client.cc:5062 Thread: 602 Read failed for file <large file>, error Invalid argument(22), off 65536 len 65536 for fid 4212.137.132072
20/04/16 12:09:20 ERROR fs.Inode: Marking failure for: <large file>, error: Input/output error
20/04/16 12:09:20 ERROR fs.Inode: Throwing exception for: <large file>, error: Input/output error
get: 2070.1218926.8072252 <large file> (Input/output error)

Note that this happens for pretty much any large file. Smaller files don't seem to be affected, even if they have been replicated through HDFS.

Every time this error pops up, I also see a corresponding error in mfs.log-3:

2020-04-16 11:55:53,0963 ERROR MapServerFile clishm.cc:157 shmat failed for shmid 1769111589, errno 13, trying again
2020-04-16 11:55:53,0963 ERROR MapServerFile clishm.cc:163 shmat failed 2nd time for shmid 1769111589, errno 13
2020-04-16 11:55:53,0963 ERROR FileServer fileserver.cc:3837 192.168.30.21:38254 shmat failed on shmid: 1769111589
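
The numeric codes in these two logs can be translated into their symbolic names, which makes them easier to search for. The following is a minimal sketch, assuming the numbers are standard Linux errno values; `decode_errnos` is a hypothetical helper name, not part of any MapR tooling.

```python
import errno
import os
import re

# Matches "errno 13" (mfs.log style) and "Invalid argument(22)," (client log style).
ERRNO_PATTERN = re.compile(r"errno (\d+)|\((\d+)\),")


def decode_errnos(line):
    """Return (number, symbolic name, message) for each errno found in a log line."""
    results = []
    for match in ERRNO_PATTERN.finditer(line):
        number = int(match.group(1) or match.group(2))
        name = errno.errorcode.get(number, "UNKNOWN")
        results.append((number, name, os.strerror(number)))
    return results


if __name__ == "__main__":
    samples = [
        "shmat failed 2nd time for shmid 1769111589, errno 13",
        "error Invalid argument(22), off 65536 len 65536",
    ]
    for sample in samples:
        print(decode_errnos(sample))
```

On Linux, 13 maps to `EACCES` (permission denied) and 22 maps to `EINVAL` (invalid argument), so the `shmat` failure in mfs.log is a permission failure on the shared memory segment.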

Interestingly, if I copy the same file to local disk on a data node instead of the name node, it works without any issue. My original thought was that this was a permissions issue, since it works for the one account and not the others, but I checked the uid and gid on all nodes and they are the same. The fact that this works on a data node but not on the name node suggests to me that the problem is specific to the name node, but I'm unsure how to proceed in debugging it.
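
One way to test the permissions theory beyond comparing uid and gid is to look at the owner and mode of the shared memory segment named in mfs.log, because errno 13 (`EACCES` on Linux) from `shmat` depends on the segment's permissions rather than on file permissions. The sketch below assumes a Linux node that exposes `/proc/sysvipc/shm` with the usual column header and an octal `perms` column. `find_segment` and `can_attach` are hypothetical helpers, and the check only approximates the kernel's logic (it ignores root and capabilities).

```python
def find_segment(shm_table, shmid):
    """Look up one segment in text formatted like /proc/sysvipc/shm."""
    lines = shm_table.strip().splitlines()
    header = lines[0].split()
    for row in lines[1:]:
        fields = dict(zip(header, row.split()))
        if int(fields["shmid"]) == shmid:
            return {
                "uid": int(fields["uid"]),
                "gid": int(fields["gid"]),
                "cuid": int(fields["cuid"]),
                "cgid": int(fields["cgid"]),
                # Assumption: the perms column is printed in octal.
                "mode": int(fields["perms"], 8),
            }
    return None


def can_attach(segment, uid, gids):
    """Approximate the read/write permission check done for shmat.

    Ignores root and CAP_IPC_OWNER, which bypass this check.
    """
    mode = segment["mode"]
    if uid in (segment["uid"], segment["cuid"]):
        bits = (mode >> 6) & 0o6   # owner/creator bits
    elif segment["gid"] in gids or segment["cgid"] in gids:
        bits = (mode >> 3) & 0o6   # group bits
    else:
        bits = mode & 0o6          # "other" bits
    return bits == 0o6             # read and write both required
```

On the node that logged the failure, the table text could be read with `open("/proc/sysvipc/shm").read()` and searched for the shmid from the log (1769111589 here), then checked against the uid and groups of the process that calls `shmat`. The segment only exists while the client is running, so this would have to be done while the failing command is in progress.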

user268859
  • Can you check whether any Quota is set on the directory you are trying to copy into? – franklinsijo Apr 17 '20 at 17:54
  • You could also try to download over the REST-API or the POSIX client, instead of the HDFS client. If the behavior is the same, then issues with the local file system are indeed more likely to be the culprit. – Rick Moritz Jun 19 '20 at 13:12
  • Concerningly, this problem ended up resolving itself after a few weeks. I have no theories as to why, because I switched to working with the primary user and hadn't been troubleshooting this issue. – user268859 Oct 23 '20 at 15:36
