One of F#'s claims is that it allows for interactive scripting and data manipulation / exploration. I've been playing around with F#, trying to get a sense for how it compares with Matlab and R for data analysis work. Obviously F# does not have all the practical functionality of these ecosystems, but I am more interested in the general advantages / disadvantages of the underlying language.

For me the biggest change, even over the functional style, is that F# is statically typed. This has some appeal, but also often feels like a straitjacket. For instance, I have not found a convenient way to deal with heterogeneous rectangular data -- think dataframe in R. Assume I'm reading a CSV file with names (string) and weights (float). Typically I load data in, perform some transformations, add variables, etc., and then run analysis. In R, the first part might look like:

df <- read.csv('weights.csv')
df$logweight <- log(df$weight)

In F#, it's not clear what structure I should use to do this. As far as I can tell I have two options: 1) I can define a class first that is strongly typed (Expert F# 9.10) or 2) I can use a heterogeneous container such as ArrayList. A statically typed class doesn't seem feasible because I need to add variables in an ad-hoc manner (logweight) after loading the data. A heterogeneous container is also inconvenient because every time I access a variable I will need to unbox it. In F#:

let df = readCsv("weights.csv")
df.["logweight"] <- box (log (unbox<float> df.["weight"]))

If this were once or twice, it might be okay, but specifying a type every time I use a variable doesn't seem reasonable. I often deal with surveys with 100s of variables that are added/dropped, split into new subsets and merged with other dataframes.

Am I missing some obvious third choice? Is there some fun and light way to interact and manipulate heterogeneous data? If I need to do data analysis on .Net, my current sense is that I should use IronPython for all the data exploration / transformation / interaction work, and only use F#/C# for numerically intensive parts. Is F# inherently the wrong tool for quick and dirty heterogeneous data work?

kvb
Tristan

3 Answers

Is F# inherently the wrong tool for quick and dirty heterogeneous data work?

For completely ad hoc, exploratory data mining, I wouldn't recommend F# since the types would get in your way.

However, if your data is very well defined, then you can hold disparate data types in the same container by mapping all of your types to a common F# union:

> #r "FSharp.PowerPack";;

--> Referenced 'C:\Program Files\FSharp-1.9.6.16\bin\FSharp.PowerPack.dll'

> let rawData =
    "Name: Juliet
     Age: 23
     Sex: F
     Awesome: True"

type csv =
    | Name of string
    | Age of int
    | Sex of char
    | Awesome of bool

let parseData (data : string) =
    // Standard .NET String.Split is used here; Seq.to_list has since been renamed Seq.toList
    data.Split('\n')
    |> Seq.map (fun s ->
        let parts = s.Split(':')
        match parts.[0].Trim(), parts.[1].Trim() with
        | "Name", x -> Name(x)
        | "Age", x -> Age(int x)
        | "Sex", x -> Sex(x.[0])
        | "Awesome", x -> Awesome(System.Convert.ToBoolean(x))
        | data, _ -> failwithf "Unknown %s" data)
    |> Seq.toList;;

val rawData : string =
  "Name: Juliet
     Age: 23
     Sex: F
     Awesome: True"
type csv =
  | Name of string
  | Age of int
  | Sex of char
  | Awesome of bool
val parseData : string -> csv list

> parseData rawData;;
val it : csv list = [Name "Juliet"; Age 23; Sex 'F'; Awesome true]

csv list is strongly typed and you can pattern match over it, but you have to define all of your union constructors up front.
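
To make the pattern-matching point concrete, here is a minimal sketch that consumes a csv list. The helpers describe and tryAge are hypothetical names made up for illustration, and the union is repeated from the session above only so the snippet stands on its own.

```fsharp
// Same union as in the session above, repeated so the snippet is self-contained.
type csv =
    | Name of string
    | Age of int
    | Sex of char
    | Awesome of bool

// Hypothetical helper: render any field as a display string.
// The compiler warns if a union case is left unhandled here.
let describe field =
    match field with
    | Name n -> sprintf "name = %s" n
    | Age a -> sprintf "age = %d" a
    | Sex s -> sprintf "sex = %c" s
    | Awesome b -> sprintf "awesome = %b" b

// Hypothetical helper: pull out the first Age field, if there is one.
let tryAge fields =
    fields |> List.tryPick (function Age a -> Some a | _ -> None)

let row = [Name "Juliet"; Age 23; Sex 'F'; Awesome true]
let summary = row |> List.map describe |> String.concat "; "
let age = tryAge row
```

The trade-off is the one already stated: every kind of field has to exist as a union case before any of this compiles.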

I personally prefer this approach, since it is orders of magnitude better than working with an untyped ArrayList. However, I'm not really sure what your requirements are, and I don't know a good way to represent ad-hoc variables (except maybe as a Map<string, obj>), so YMMV.
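
For what it's worth, the Map idea might look roughly like the sketch below. This is only an illustration under my own assumptions: Row, getVar and addVar are hypothetical names, not part of any library, and every value is boxed, so a wrong type at a read fails at runtime rather than at compile time.

```fsharp
// A row is an immutable map from variable name to a boxed value.
type Row = Map<string, obj>

// Hypothetical helpers: read a variable at a given type, and add a variable.
let getVar<'T> (name : string) (row : Row) : 'T = unbox<'T> row.[name]
let addVar (name : string) (value : 'T) (row : Row) : Row = row.Add(name, box value)

let row : Row =
    Map.empty
    |> addVar "name" "Juliet"
    |> addVar "weight" 50.0

// Adding a variable after loading: the type is named once, at the read.
let row2 = row |> addVar "logweight" (log (getVar<float> "weight" row))
let logWeight : float = getVar "logweight" row2
```

Because Map is immutable, addVar returns a new row and leaves the original untouched, which suits the load, transform, add-variables workflow but still needs a type at each read.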

Juliet

I think that there are a few other options.

(?) operator

As Brian mentioned, you can use the (?) operator:

type dict<'a,'b> = System.Collections.Generic.Dictionary<'a,'b>

let (?) (d:dict<_,_>) key = unbox d.[key]
let (?<-) (d:dict<_,_>) key value = d.[key] <- box value

let df = new dict<string,obj>()
df?weight <- 50.
df?logWeight <- log(df?weight)

This does use boxing/unboxing on each access, and at times you may need to add type annotations:

(* need annotation here, since we could try to unbox at any type *)
let fltVal = (df?logWeight : float)

Top level identifiers

Another possibility is that rather than dynamically defining properties on existing objects (which F# doesn't support particularly well), you can just use top level identifiers.

let dfLogWeight = log(dfWeight)

This has the advantage that you will almost never need to specify types, though it may clutter your top-level namespace.

Property objects

A final option which requires a bit more typing and uglier syntax is to create strongly typed "property objects":

type 'a property = System.Collections.Generic.Dictionary<obj,'a>

let createProp() : property<'a> = new property<'a>()
let getProp o (prop:property<'a>) : 'a = prop.[o]
let setProp o (prop:property<'a>) (value:'a) = prop.[o] <- value

let df = new obj()
let (weight : property<double>) = createProp()
let (logWeight : property<double>) = createProp()

setProp df weight 50.
setProp df logWeight (log (getProp df weight))
let fltVal = getProp df logWeight

This requires each property to be explicitly created (and requires a type annotation at that point), but no type annotations would be required after that. I find this much less readable than the other options, although perhaps defining an operator to replace getProp would alleviate that somewhat.
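
As a rough sketch of that last idea, an infix operator can stand in for getProp. The operator name (^.) is an arbitrary choice of mine, not an established convention for this pattern; the other definitions are repeated from above so the snippet runs on its own.

```fsharp
type 'a property = System.Collections.Generic.Dictionary<obj, 'a>

let createProp () : property<'a> = new property<'a>()
let setProp o (prop : property<'a>) (value : 'a) = prop.[o] <- value

// Hypothetical operator standing in for getProp: `o ^. prop` reads prop for o.
let (^.) (o : obj) (prop : property<'a>) : 'a = prop.[o]

let df = new obj()
let (weight : property<double>) = createProp ()
let (logWeight : property<double>) = createProp ()

setProp df weight 50.
setProp df logWeight (log (df ^. weight))
let fltVal = df ^. logWeight
```

Reads now look a little closer to property access, and fltVal is still inferred as double without an annotation at the point of use.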

kvb

I am not sure if F# is a great tool here or not. But there is a third option - the question mark operator. I've been meaning to blog about this for a while now; Luca's recent PDC talk demo'd a CSV reader with C# 'dynamic', and I wanted to code a similar thing with F# using the (?) operator. See

F# operator "?"

for a short description. You can try to blaze ahead and play around with this on your own, or wait for me to blog about it. I have not tried it myself in earnest so I'm not sure exactly how well it will work out.

EDIT

I should add that Luca's talk shows how 'dynamic' in C# addresses at least a portion of this question for that language.

EDIT

See also

http://cs.hubfs.net/forums/thread/12622.aspx

where I post some basic starter CSV code.

Brian