15

I'm thinking about using Hadoop to process large text files on my existing Windows 2003 servers (about 10 quad-core machines with 16 GB of RAM).

The questions are:

  1. Is there any good tutorial on how to configure a Hadoop cluster on Windows?

  2. What are the requirements? Java + Cygwin + sshd? Anything else?

  3. Does HDFS play nicely on Windows?

  4. I'd like to use Hadoop in streaming mode. Any advice, tools or tricks for developing my own mappers / reducers in C#?

  5. What do you use for submitting and monitoring the jobs?

Thanks

Has QUIT--Anony-Mousse
Luca Martinetti
  • Something like VMware instances of Linux running on Windows might be less painful than trying to use Windows directly. – James Moore Aug 23 '11 at 20:50

3 Answers

9

From the Hadoop documentation:

Win32 is supported as a development platform. Distributed operation has not been well tested on Win32, so it is not supported as a production platform.

Which I think translates to: "You're on your own."

That said, there might be hope if you're not queasy about installing Cygwin and a Java shim, according to the Getting Started page of the Hadoop wiki:

It is also possible to run the Hadoop daemons as Windows Services using the Java Service Wrapper (download this separately). This still requires Cygwin to be installed as Hadoop requires its df command.

I guess the bottom line is that it doesn't sound impossible, but you'd be swimming upstream all the way. I've done a few Hadoop installs (on Linux for production, Mac for dev) now, and I wouldn't bother with Windows when it's so straightforward on other platforms.

bradheintz
  • Tend to agree. I've installed Hadoop on Windows and it's not so straightforward; I had to trawl through some nasty Java errors to resolve some node deployment issues, which I wouldn't recommend to anyone. You can follow this guide: [link](http://v-lad.org/Tutorials/Hadoop/14%20-%20start%20up%20the%20cluster.html) for a good Cygwin installation process; if you are starting clean it might be simpler. I did find a guide for installing Hadoop without using Cygwin (you just need to change a few references) but can't seem to dig it out, and that's **really** uncharted territory. – ToOsIK Mar 28 '12 at 17:44
9

While not the answer you may want to hear, I would highly recommend repurposing the machines as, say, Linux servers, and running Hadoop there. You will benefit from tutorials and experience and testing performed on that platform, and spend your time solving business problems rather than operational issues.

However, you can still write your jobs in C#. Since Hadoop supports the "streaming" implementation, you can write your jobs in any language. With the Mono framework, you should be able to take pretty much any .NET code written on the Windows platform and just run the same binary on Linux.
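To make the streaming contract concrete, here is a minimal word-count sketch, written in Python only for brevity. It assumes the usual streaming convention: the mapper and reducer read lines from stdin and write tab-separated `key<TAB>value` lines to stdout, and the reducer receives its input sorted by key. The function names and the `map` / `reduce` command-line argument are made up for this illustration.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record per word of input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts per key. Relies on input being sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Hypothetical convention: run as "script.py map" or "script.py reduce".
if __name__ == "__main__" and sys.argv[1:] in (["map"], ["reduce"]):
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in step(sys.stdin):
        print(record)
```

With Hadoop streaming, the two commands would be supplied through the `-mapper` and `-reducer` options. A C# console application that reads `Console.In` and writes `Console.Out` should be able to follow the same contract.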

You can also access HDFS from Windows fairly easily -- while I don't recommend running the Hadoop services on Windows, you can certainly run the DFS client from the Windows platform to copy files in and out of the distributed file system.
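As a rough sketch of what "copy files in and out" can look like from a script, the snippet below wraps the `hadoop fs -put` and `hadoop fs -get` shell commands. It assumes a `hadoop` launcher is on the PATH of the client machine and already configured to point at the cluster; the helper names and the paths in the usage note are hypothetical.

```python
import subprocess


def put_cmd(local_path, hdfs_path):
    """Command line that copies a local file into HDFS."""
    return ["hadoop", "fs", "-put", local_path, hdfs_path]


def get_cmd(hdfs_path, local_path):
    """Command line that copies a file out of HDFS to local disk."""
    return ["hadoop", "fs", "-get", hdfs_path, local_path]


def run(cmd):
    """Run a command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(cmd, check=True)
```

For example, `run(put_cmd("access.log", "/input/access.log"))` would upload one file, assuming the client can reach the cluster.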

For submitting and monitoring jobs, I think that you're mainly on your own... I don't think that there are any good general-purpose systems developed for Hadoop job management yet.

Ilya Haykinson
  • Thanks for your answer. Unfortunately I cannot reimage the servers; maybe I'll just use some Linux EC2 instances. Porting to Mono is a bit tricky but could work. Luca – Luca Martinetti Apr 24 '09 at 13:17
  • Good luck! The EC2 portion should be pretty easy, and in my experience most .NET code runs on Mono without even recompiling -- so hopefully there won't really be a need to "port" – Ilya Haykinson Apr 24 '09 at 17:51
  • I think Cloudera has some Hadoop management tools... based on what I saw on YouTube – makerofthings7 Nov 20 '10 at 20:50
2

If you're looking for map/reduce, you can try MySpace's new map/reduce framework that runs on Windows: http://qizmt.myspace.com/

  • +1 for qizmt ref. A great option to start with, which has been production tested, utilizes his existing infrastructure and requires minimal modification. – Ralph Willgoss Jul 20 '10 at 06:49