Recently, our Hadoop cluster running MapR 5.2 has been throwing an error that only seems to affect large files in HDFS. When the primary account (the one running all the MapR services) interacts with a large file, everything works fine. When any other account tries, though, we get an error like the following:
$ hdfs dfs -get <large file>
2020-04-16 12:09:20,5178 ERROR Client fs/client/fileclient/cc/client.cc:5062 Thread: 602 Read failed for file <large file>, error Invalid argument(22), off 65536 len 65536 for fid 4212.137.132072
20/04/16 12:09:20 ERROR fs.Inode: Marking failure for: <large file>, error: Input/output error
20/04/16 12:09:20 ERROR fs.Inode: Throwing exception for: <large file>, error: Input/output error
get: 2070.1218926.8072252 <large file> (Input/output error)
Note that this happens for virtually any large file, while small files seem unaffected, even ones that have been replicated through HDFS.
Every time this error pops up, I also see a corresponding error in mfs.log-3:
2020-04-16 11:55:53,0963 ERROR MapServerFile clishm.cc:157 shmat failed for shmid 1769111589, errno 13, trying again
2020-04-16 11:55:53,0963 ERROR MapServerFile clishm.cc:163 shmat failed 2nd time for shmid 1769111589, errno 13
2020-04-16 11:55:53,0963 ERROR FileServer fileserver.cc:3837 192.168.30.21:38254 shmat failed on shmid: 1769111589
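For what it's worth, the errno values in these logs decode to standard POSIX errors: the 22 in the client read failure is EINVAL ("Invalid argument"), and the 13 from shmat is EACCES, i.e. the process was denied permission to attach the shared-memory segment. A quick sanity check with Python's stdlib:

```python
import errno
import os

# errno 22 from the client-side read failure ("Invalid argument(22)")
print(errno.errorcode[22], "->", os.strerror(22))  # EINVAL -> Invalid argument

# errno 13 from the shmat failures in mfs.log-3
print(errno.errorcode[13], "->", os.strerror(13))  # EACCES -> Permission denied
```

So despite uid/gid looking right at the filesystem level, something about attaching that shared-memory segment is being denied.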
Interestingly, if I copy the same file to local disk on a data node instead of the name node, it works without any issue. My original thought was that this was a permissions issue (since it works for the one account and not the others), but I checked uid and gid for the affected accounts on all nodes and they are identical. The fact that the copy succeeds on a data node but fails on the name node suggests the problem is specific to the name node, but I'm unsure how to proceed in debugging this.
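Since shmat's errno 13 is EACCES, my next step (a rough sketch below; the commands are generic diagnostics, not MapR-specific tooling) is to check whether non-service accounts can actually attach the MFS shared-memory segments on the name node:

```shell
# List System V shared-memory segments on the name node. The MFS segments
# should show up owned by the MapR service account; check the "perms"
# column -- if a segment is mode 600, only the owner's uid can shmat() it,
# and any other account gets EACCES (errno 13).
ipcs -m

# Compare uid/gid for the affected account on each node (run per node;
# using the current user here as a placeholder).
id "$(whoami)"

# Kernel shared-memory limits on this node, for completeness.
cat /proc/sys/kernel/shmmax /proc/sys/kernel/shmall
```

If the segment permissions on the name node differ from those on the data nodes, that would explain why the same account succeeds on one and fails on the other.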