28

I am not a professional programmer (my area is medical research), but I am quite capable in C/C++, and various scripting languages. A while back I got intrigued by Lisp, but I never got the time to seriously learn it. After a brief exposure to R I decided to invest more time in a functional programming language.

I would like the practicality of a JVM language and have thus narrowed my choice to Clojure and Scala. From what I understand, both can use existing Java libraries and, given that performance-critical code can be delegated to Java, have the potential to perform about equally well.

How do these languages compare in the application space I need them for? Are there any real-life projects in bioinformatics using either?

Already existing code would be a serious plus, as would be good documentation and a fairly gentle learning curve. Also, how does the concurrency model of the two compare with each other?

Does either one have any significant advantages or disadvantages?

Peter Mortensen
kliron
  • 1
    related: http://stackoverflow.com/q/1528766/203968 – oluies Mar 09 '11 at 20:39
  • 2
    Thanks for all your useful answers. I think I'm gonna give Clojure a try. Time to learn a lisp at last! Thanks for the tips on Incanter too, looks very promising. – kliron Mar 10 '11 at 17:38
  • 1
    Following up on this, for anyone considering using Clojure for data science, here are some thoughts. Clojure is the only language I learned in less than a week, using the REPL documentation, Stack Overflow and this: http://clojure.org/cheatsheet. It is easily my go-to language for building anything more complicated than a trivial script. Java interop is fantastic, with virtually no syntactic overhead, which makes up for the relatively small (but growing) number of libraries. I've used it to make web crawlers, build websites, and do all kinds of data processing. – kliron May 07 '14 at 07:47
  • Most of the analysis and visualisation (<10% of the actual work) I do in R which is still adequate for the tasks I use it for. – kliron May 07 '14 at 07:49

9 Answers

32

I can personally vouch for Clojure as a great tool for this kind of work. (I believe Scala would be great too; I just have less experience with it.)

My personal research is in the field of predictive modelling / machine learning and is very computationally intensive - so I think it has many parallels with bioinformatics or biostatistics.

My personal approach / setup includes:

  • Incanter used primarily as a data visualisation tool. Great for producing quick visualisations which are usually just 1-liners at the REPL. There are also lots of statistical and numerical processing tools which I believe use the Colt library under the hood. I'm not an expert in R but I understand that Incanter is roughly "R translated to Clojure/Lisp".

  • Exploiting quite a few Java libraries as needed. Some of these are my own, for example algorithms that I have written in Java in order to get the best possible fine-tuned performance out of the JVM. But you could equally easily use any of the other great Java libraries available, as calling Java from Clojure is very simple: (.methodName object param1 param2).

  • Quite a lot of higher order functions to automate my workflow. For example I have a higher order function that will run an optimisation algorithm of any kind in a loop for a specified amount of time and then produce an Incanter graph of the improvement on each iteration. Not rocket science, but really easy to code up in a few lines of Clojure.

  • Never really having to worry about performance. You can make Clojure go pretty fast if you want to (e.g. with type hints, primitive arithmetic support etc.) but normally it's irrelevant as you're going to spend 99%+ of your cycles in well-optimised library code anyway. Hence a bit of overhead in the "glue" code is negligible - I feel I gain much more in terms of personal productivity by having a dynamic, high-level, functional language to work in.

  • Major use of Clojure's concurrency features - this has to be one of Clojure's strongest features. I tend to use the STM to code concurrent processes with transactions that can't interfere with each other, then kick off long-running calculations in a future so that I can get on with other tasks and wait for notification of the result.

  • A slowly growing collection of macros to "extend the language" when needed. I actually use macros less than I thought I would (higher order functions are often a better choice). But when you need them they are invaluable - this is where you really appreciate the value of a homoiconic language. Since they effectively allow you to add new syntax to the language itself, they are very powerful when used correctly to build the DSL that you need.
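As a rough illustration of three of the points above (Java interop, a higher-order workflow helper, and running a calculation in a future), here is a minimal sketch. It is not the code described in this answer: `run-for` and the halving step are hypothetical stand-ins, and the Incanter plotting step is left out.

```clojure
;; Java interop: (.method object args...) calls an instance method,
;; (Class/method args...) calls a static one.
(def upper (.toUpperCase "acgt"))   ; java.lang.String instance method
(def root  (Math/sqrt 16.0))        ; java.lang.Math static method

;; Hypothetical higher-order helper: applies `step` to a state repeatedly
;; until roughly `millis` milliseconds have passed.
(defn run-for [step init millis]
  (let [deadline (+ (System/currentTimeMillis) millis)]
    (loop [state init, iterations 0]
      (if (< (System/currentTimeMillis) deadline)
        (recur (step state) (inc iterations))
        {:state state :iterations iterations}))))

;; A future runs its body on another thread and returns immediately;
;; dereferencing it with @ blocks until the value is ready.
(def job (future (run-for #(/ % 2.0) 1024.0 50)))

(println upper root)   ; other work can go on while the job runs
(println (:iterations @job) "iterations completed")
```

When this is run as a script rather than at the REPL, calling `(shutdown-agents)` at the end lets the JVM exit promptly, because futures run on a thread pool that otherwise keeps the process alive for a while.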

In short - I don't think you can go wrong with Clojure as a researcher.

The one thing I probably wouldn't use it for (yet) is actually writing a new numerical library - this would probably be better done in Scala or pure Java as you would probably want to adopt a more imperative / OOP style.

mikera
  • I am curious: why not use a more vetted data science language like R or Python? I mean I would love to use cool stuff with Scala or Clojure but do your last few points really make up for the lack of data science-y tools that R or Python would provide? – evan.oman May 27 '16 at 21:07
  • You have to make a trade off I guess - are you happy to put up with some rough edges / living at the bleeding edge in exchange for a more powerful language? I'm personally happy with that but YMMV. Since I wrote this there have been quite a lot of cool developments for data science in Clojure BTW – mikera Jun 03 '16 at 06:07
22

I am not sure about bioinformatics and biostatistics per se, but I do scientific data analysis frequently and I appreciate that Scala allows me to write as-fast-as-Java code with relative ease. I believe that it is often possible in Clojure now, but I haven't seen the benchmarks to back that up. For the time being, I think the prudent thing to assume is that they do not perform equally well. See, for example, the Computer Language Benchmarks Game, where Scala is faster than Clojure in every single test. (Ignore the horrible "pidigits" result for Clojure: Scala and Java call the GMP library, which is written in C. Clojure could do the same, but a technical detail requiring a different wrapping for the library means that isn't presently allowed in the game.) Looking at multicore comparisons doesn't improve Clojure's showing, and note that the Clojure code is no shorter for these sorts of lowish-level algorithmic tasks.

Clojure is ahead for the time being with parallel collections, though the upcoming 2.9 release of Scala should make up much of the difference. Neither has a gentle learning curve when coming from C++; Scala is maybe a little easier given that the syntax outwardly looks a little more familiar. I believe there are good materials for learning each of them.


Edit: P.S. You can call R from Java (and therefore from either Clojure or Scala) using rJava (specifically the JRI interface). Edit to edit: and, these days, rScala.

Edit #2: Scala was faster than Clojure in everything at the time of writing; as of this edit, Clojure's a little ahead in one (at the cost of a huge amount of code)--but anyway, the overall point stands. (And the Scala implementation on that one test could be sped up.)

Rex Kerr
  • 3
    Beware of using the benchmark game to compare code size/ease, or even to compare performance. The rules require each language to use the same algorithm - reasonable, since we don't want Clojure winning just because the guy who writes the C program is dumb (for example). But this means that the code will often be very unidiomatic for Clojure, which frowns on mutability; it will often even perform worse because the language is optimized for different kinds of solutions. – amalloy Mar 09 '11 at 20:18
  • 4
    @amalloy - I agree, except that even a flawed benchmark is better than going based on _handwaving_ and _warm fuzzy feelings_. Want another example? Okay, how about http://wikis.sun.com/display/WideFinder/Results vs. the best Clojure result (specifically lauded by Tim Bray, the creator of the task): http://meshy.org/2009/12/13/widefinder-2-with-clojure.html Again, Clojure is decent at 8m4s but slower than Scala at 5m32s. But who knows if this means anything, because the algorithms that were used are different. – Rex Kerr Mar 09 '11 at 20:28
17

If you like R, give Incanter a try! It's R for Clojure.

Scala is geared toward being syntactically easy for people coming from Java, which was intended to be syntactically easy for people coming from C, though with two levels of indirection like this the advantage may be lost.

Clojure is getting a lot of traction in the Big Data space and maps very well onto Hadoop jobs for Huge Data. I think this would be a big advantage in the bioinformatics world.

Really, these things are largely personal taste, so try both and see what makes you happy :)

If you are looking to get a feel for Clojure without a lot of "intellectual overhead", may I suggest using Leiningen to get a test project started quickly?

Peter Mortensen
Arthur Ulfeldt
  • 1
    massive +1 for Incanter - it's an amazing tool. most of the graphs I need to produce are literally 1-liners at the REPL..... – mikera Mar 09 '11 at 20:40
12

To build on Rex's answer I would like to add some Scala libraries/products that may be of interest to you:

oluies
10

I don't know Scala, so I can't offer a comparison, but I am actively using Clojure in bioinformatics projects.

The Java integration is excellent, and I have had no problem making use of the BioJava libraries.

Where Clojure's concurrency model shines is in the immutable default data types and functional programming with the seq abstraction.

In my bioinformatics work I very often find myself with a lot of input data (say gene sequences) which needs to be subjected to the same analysis. Once I have my analysis function I can map it over a sequence of inputs (with the results lazily generated). I have gotten full utilization of a large 48-core server simply by changing that map to a pmap.

Large scale parallelization with a single character change is hard to beat!

Of course pmap isn't a magic bullet and only helps when the analysis function computationally dominates, but the fact that map and pmap can just be plugged in and out shows the elegance and simplicity enabled by Clojure's design.
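To make the map/pmap swap concrete, here is a minimal sketch. The analysis function `gc-content` and the sample sequences are hypothetical, chosen only for illustration; a real analysis would be far more expensive per input.

```clojure
;; Hypothetical analysis function: fraction of G and C bases in a sequence.
(defn gc-content [dna]
  (/ (double (count (filter #{\G \C} dna)))
     (count dna)))

(def sequences ["ATGCGC" "GGGCCC" "ATATAT" "ACGTACGT"])

;; Serial version: map is lazy, so doall forces every result.
(def serial (doall (map gc-content sequences)))

;; Parallel version: the only change is map -> pmap.
(def parallel (doall (pmap gc-content sequences)))

(println (= serial parallel))   ; prints true
```

With a function this cheap, the coordination overhead of pmap would probably outweigh any gain, which matches the caveat above: the swap pays off only when each call does substantial work.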

Peter Mortensen
Alex Stoddard
8

I am only passingly familiar with Scala, so the best I can do is evangelize a bit for Clojure. It's a great language, but take all this advice with a grain of salt as it's coming from an enthusiast.

If you are looking for concurrency, Clojure is fantastic both for ease of programming and for performance. The immutable data structures mean that it's trivial to work with a coherent snapshot of the world without any manual and error-prone locking; the STM makes it fairly simple to change data in a thread-sensitive way without breaking anyone else's snapshots.
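A small sketch of both ideas, under assumed names (`pending`, `done` and `complete-one!` are hypothetical): immutable values give you snapshots for free, and refs changed inside `dosync` stay consistent even when many threads update them at once.

```clojure
;; Immutable data: "changing" a map returns a new map and leaves the
;; original untouched, so anyone holding the old value keeps a valid snapshot.
(def original {:a 1})
(def updated  (assoc original :b 2))

;; STM: refs may only be changed inside a transaction (dosync).
(def pending (ref 10))   ; hypothetical count of work items still to do
(def done    (ref 0))    ; hypothetical count of work items finished

(defn complete-one!
  "Moves one item from pending to done, if any are left. Both refs change
  together or not at all."
  []
  (dosync
    (when (pos? @pending)
      (alter pending dec)
      (alter done inc))))

;; Twenty threads compete for ten items; no explicit locks are needed.
(def workers (doall (repeatedly 20 #(future (complete-one!)))))
(run! deref workers)     ; wait for every worker to finish

(println original updated @pending @done)
```

If two transactions conflict, the STM retries one of them, so the body of a `dosync` should be free of side effects such as I/O.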

My understanding is that Scala has a lot of the nice functional tools that Clojure does, but Clojure will always win syntactically by virtue of being a Lisp. If you're looking to do some specialized bioinformatics stuff, Clojure is able to hide the bits of Lisp that you don't want, and raise your own constructs to the same level as the built-in language constructs. I can't find the reference right now, but there's some well-known quote about Lisp that goes like:

Lisp is not the perfect language for any program. But it is the perfect language for building the perfect language for every program.

That's horribly paraphrased, but in my experience it has been true. It looks like you'll want a fairly specialized set of tools, and no language will make those feel as natural as a Lisp.
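To give a flavour of what "raising your own constructs to the level of built-in ones" means, here is a minimal sketch of a macro. The `unless` construct and the `reads` data are hypothetical examples, not part of Clojure's core or of any bioinformatics library.

```clojure
;; Hypothetical construct: evaluate body only when test is false.
;; It must be a macro, because a function would evaluate its arguments
;; before being called.
(defmacro unless [test & body]
  `(if ~test nil (do ~@body)))

(def reads [])   ; hypothetical input data, empty here

;; Reads like a built-in form; the body is skipped because reads is empty.
(unless (empty? reads)
  (println "processing" (count reads) "reads"))

;; Because code is data, the expansion can be inspected like any other
;; list; this shows the plain `if` form the macro turns into.
(println (macroexpand-1 '(unless false :ok)))
```

For simple cases like this a higher-order function is often enough; macros earn their keep when you need to control evaluation or introduce genuinely new syntax.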

amalloy
5

You have to ask yourself how important functional programming is for you. You know C++, so you probably know OO. I would say it's easier to do FP in Clojure (because you can't really drop back to OO style); in Scala you will probably end up dropping FP and writing in a more OO style.

I can't really say anything about your application space.

Since you mentioned R, there is an R-like Clojure library for statistics called Incanter. I don't know about other existing projects in your application space.

There is a lot of information about both languages, so that should not be a problem. The learning curve is kind of steep with both languages. Clojure is a much smaller language, and since you already know some Lisp it should not be too hard to learn the important stuff. Scala has a type system that will be hard to pick up, especially since your main experience is with C/C++.

Both languages have great concurrency models and you will probably be happy with both.

amalloy
nickik
2

I have some experience with Scala and only a little knowledge of Clojure, but I programmed in Lisp many years ago.

Lisp is a beautiful language, but it never made it to the world, because it was too limited. I believe you need a statically-typed language to develop robust systems. It is not difficult to learn enough of Scala's type system to benefit from it. If you want to do very advanced things with it to make your libraries idiot-proof, you can, but then you will need to study the type system a little more.

Scala favours immutable types, but you can use mutable ones without any problem, which you sometimes do need. Concurrency in Scala is very well implemented, and frameworks like Akka extend and enhance these possibilities.

Scala stands a better chance to become a mainstream language since it's a fuller language. I'm afraid that Clojure is too much like Lisp (but reimplemented on the JVM). I liked Lisp a lot, but it had too many disadvantages for real-life programs. With Scala I think we have the best of both worlds (OO and functional) in a clean marriage. On top of that, Scala seems to really catch on in the market.

Peter Mortensen
  • 1
    I'll definitely consider Scala too. I actually find it an advantage that Clojure is dynamic and can eventually, for my needs, be a "write everything in" kind of language. – kliron Mar 10 '11 at 18:44
1

We have been working on some experimental code in the Rudolf/BioClojure project on GitHub. Also, look at Jan Aerts' BioClojure project, which is more structured.

Additionally, there is a BioCaml project in the works...

Peter Mortensen
jayunit100