Although the general question of Hadoop/HDFS on Windows has been posed before, I haven't seen anyone present the use case I think is most important for Windows support: how can Windows end stations participate in an HDFS environment and consume files stored in HDFS?
In particular, let's say we have a nice Linux-based HDFS environment with lots of nodes, analysis jobs running, and so on, and all is happy. How can Windows desktops also consume the files? Suppose our analytics flag a few interesting files out of millions of mostly uninteresting ones. Now we want to bring them into a desktop application to visualize them, etc. The most natural way for the desktop to consume these is via a Windows share, ideally served by a Windows server.
Windows' implementation of CIFS is orders of magnitude better than Samba's -- I'm stating that as a fact, not a point of debate. That isn't to say that Samba cannot be made to work, only that there are good reasons to have a very strong preference for essentially exporting this HDFS file system as CIFS from a Windows server.
It's possible to do this via some workflow in which a back-end process takes the interesting files and copies them out, as sketched below. But in many cases this is cumbersome, and it doesn't give the Windows-shackled analyst the freedom to explore the files on their own as easily.
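For concreteness, that copy-out workflow would look something like this minimal sketch. The hostnames, paths, and the staging directory (assumed to be exported by some existing Windows/Samba file server) are all hypothetical:

```
# Hypothetical back-end copy job: pull flagged files out of HDFS into a
# staging directory that a file server already exports as a Windows share.

# With a Hadoop client installed on the staging box:
hdfs dfs -get /analytics/output/interesting/* /srv/staging/

# Or, without any Hadoop install, stream a single file over WebHDFS
# (assuming WebHDFS is enabled; 50070 is the default NameNode HTTP port
# in Hadoop 2.x, and -L follows the redirect to the serving DataNode):
curl -L "http://namenode.example.com:50070/webhdfs/v1/analytics/output/interesting/file1.bin?op=OPEN" \
  -o /srv/staging/file1.bin
```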
Hence, what I'm looking for really is:
- Windows server
- HDFS as a "mounted" file system, with Windows acting as an HDFS "client"
- Export this file system from Windows as a CIFS server
- Consume files on Windows desktop
- Have all the usual Windows group permissions work correctly (e.g. by mapping through to NFSv4 ACLs).
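The closest building block I've found in stock Hadoop is the HDFS NFS Gateway, which only gets partway down this list. A sketch of that route follows; everything beyond the gateway commands themselves is an assumption, and note that the gateway speaks NFSv3 with AUTH_UNIX only, so the ACL-mapping bullet above is exactly where it falls short:

```
# On a gateway node (Hadoop 2.2+), after the usual proxyuser settings in
# core-site.xml, start the HDFS NFS Gateway; it exposes the whole HDFS
# namespace as an NFSv3 export rooted at "/":
hadoop-daemon.sh start portmap
hadoop-daemon.sh start nfs3

# On a Windows server with the "Client for NFS" feature, mount the export
# (exact mount syntax varies by Windows version; nolock is needed because
# the gateway does not implement NLM):
mount -o nolock gateway.example.com:/ Z:

# The missing last hop: Windows will not re-share a mapped network drive
# over SMB, so turning Z: into a CIFS share still seems to need a copy job
# or a third-party layer -- which is really what this question is asking.
```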
Btw, if we replace "HDFS" with "GPFS" in this question, it all does work. At the moment, this is a key differentiator between HDFS and GPFS in my environment. Yes, there are many, many more points of comparison, but I'd rather not focus on GPFS vs. HDFS in general right now.
Could someone please add the #GPFS tag?