4

Being a programmer, I occasionally need to analyze large amounts of data such as performance logs or memory usage data, and I am always frustrated by how much time it takes me to do something I expect to be easier.

To put the question in context, let me quickly show you a snippet from a CSV file I received today (heavily filtered for brevity):

date,time,PS Eden Space used,PS Old Gen Used, PS Perm Gen Used
2011-06-28,00:00:03,45004472,184177208,94048296
2011-06-28,00:00:18,45292232,184177208,94048296

I have about 100,000 data points like this, with different variables, that I want to plot in a scatter plot to look for correlations. Usually the data needs to be processed in some way for presentation purposes (such as converting nanoseconds to milliseconds and rounding fractional values), and some columns may need to be added, inverted, or combined (like the date/time columns).

The usual recommendation for this kind of work is R, and I have recently made a serious effort to use it, but after a few days my experience has been that most tasks I expect to be simple require many steps and involve special cases; solutions are often non-generic (for example, adding a data set to an existing plot). It just seems to be one of those languages that people love for all the powerful libraries that have accumulated over the years rather than for the quality and usefulness of the core language.

Don't get me wrong, I understand the value of R to the people who use it; it's just that, given how rarely I spend time on this kind of thing, I will never become an expert in it, and to a non-expert every single task just becomes too cumbersome.

Microsoft Excel is great in terms of usability, but it just isn't powerful enough to handle large data sets. Also, both R and Excel tend to freeze completely (!), with no way out other than waiting or killing the process, if you accidentally make the wrong kind of plot over too much data.

So, Stack Overflow, can you recommend something that is better suited to me? I'd hate to have to give up and develop my own tool; I have enough projects already. I'd love something interactive that could use hardware acceleration for the plot and/or culling to avoid spending too much time on rendering.

flodin

4 Answers

9

@flodin It would have been useful for you to provide an example of the code you use to read such a file into R. I regularly work with data sets of the size you mention and do not have the problems you describe. One thing that might be biting you if you don't use R often is that if you don't tell R what the column types are, it has to do some snooping on the file first, and that all takes time. Look at the colClasses argument in ?read.table.

For your example file, I would do:

dat <- read.csv("foo.csv", colClasses = c(rep("character",2), rep("integer", 3)))

then post-process the date and time variables into an R date-time class such as POSIXct, with something like:

dat <- transform(dat, dateTime = as.POSIXct(paste(date, time)))

As an example, let's read in your example data set, replicate it 50,000 times and write it out, then time different ways of reading it in, with foo containing your data:

> foo <- read.csv("log.csv")
> foo
        date     time PS.Eden.Space.used PS.Old.Gen.Used
1 2011-06-28 00:00:03           45004472       184177208
2 2011-06-28 00:00:18           45292232       184177208
  PS.Perm.Gen.Used
1         94048296
2         94048296

Replicate that 50,000 times:

out <- data.frame(matrix(nrow = nrow(foo) * 50000, ncol = ncol(foo))) 
out[, 1] <- rep(foo[,1], times = 50000) 
out[, 2] <- rep(foo[,2], times = 50000) 
out[, 3] <- rep(foo[,3], times = 50000) 
out[, 4] <- rep(foo[,4], times = 50000) 
out[, 5] <- rep(foo[,5], times = 50000)
names(out) <- names(foo)

Write it out:

write.csv(out, file = "bigLog.csv", row.names = FALSE)

Time loading the naive way and the proper way:

system.time(in1 <- read.csv("bigLog.csv"))
system.time(in2 <- read.csv("bigLog.csv",
                            colClasses = c(rep("character",2), 
                                           rep("integer", 3))))

Which is very quick on my modest laptop:

> system.time(in1 <- read.csv("bigLog.csv"))
   user  system elapsed 
  0.355   0.008   0.366 
> system.time(in2 <- read.csv("bigLog.csv",
                              colClasses = c(rep("character",2), 
                                             rep("integer", 3))))
   user  system elapsed 
  0.282   0.003   0.287

for both ways of reading the data in.

As for plotting, the graphics can be a bit slow, but depending on your OS this can be sped up a bit by altering the device you plot to. On Linux, for example, don't use the default X11() device, which uses Cairo; instead try the old X window without anti-aliasing. Also, what are you hoping to see in a data set as large as 100,000 observations on a graphics device with not many pixels? Perhaps try to rethink your strategy for data analysis; no stats software will be able to save you from doing something ill-advised.
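A rough sketch of that device switch (assuming a Linux box with an X11 display, and the dat object built above):

X11(type = "Xlib")  # the old Xlib device, no anti-aliasing; often faster than the cairo-based default

# scatterplot of old-generation usage against the combined date-time variable
plot(PS.Old.Gen.Used ~ dateTime, data = dat, pch = ".")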

It sounds as if you are developing code/analysis as you go along, on the full data set. It would be far more sensible to work with a small subset of the data when developing new code or new ways of looking at your data, say a random sample of 1,000 rows, and to use that object instead of the whole data object. That way you guard against accidentally doing something slow:

working <- out[sample(nrow(out), 1000), ]

for example. Then use working instead of out. Alternatively, whilst testing and writing a script, set the nrows argument to, say, 1000 in the call that loads the data into R (see ?read.csv). That way, whilst testing, you only read in a subset of the data, but one simple change will let you run the script against the full data set.
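Something like this (a quick sketch reusing the colClasses from above; testDat is just a throwaway name) reads only the first 1,000 rows whilst you develop:

# read just the first 1000 rows during development
testDat <- read.csv("bigLog.csv", nrows = 1000,
                    colClasses = c(rep("character", 2), rep("integer", 3)))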

For data sets of the size you are talking about, I see no problem whatsoever in using R. Your point about not becoming expert enough to use R will more than likely apply to other scripting languages that might be suggested, such as Python. There is a barrier to entry, but that is to be expected if you want the power of a language such as Python or R. If you write scripts that are well commented (instead of just plugging away at the command line), and focus on a few key data imports/manipulations, a bit of plotting and some simple analysis, it shouldn't take long to master that small subset of the language.

Gavin Simpson
  • Haven't read your entire response yet, but the problems with freezing are certainly not in reading the CSV file; rather, some plot methods (e.g. plotting a transparent blob for each data point) tend to hang R. – flodin Jul 06 '11 at 08:59
  • @flodin Yes, I realised that later, but the import of big data often catches people out so I left it in. See later in my Answer where I address the graphics issue. Basically, your data aren't *large* by most standards, so if you already know a bit of R, perhaps getting advice on the workflow there will serve you better than learning something new. My Answer started as a comment and then got too long, so I expanded it, which is why it might be a little bit disjointed. – Gavin Simpson Jul 06 '11 at 09:01
  • @flodin As an example of rethinking the workflow/approach, this blog post from R-bloggers today shows one avenue for plotting large data sets in a scatterplot: http://sas-and-r.blogspot.com/2011/07/example-91-scatterplots-with-binning.html This isn't motivated by speeding R up, but by how best to visualise and present the data. Plotting points with an element of transparency is also often used when data are huge; the overplotting of partially transparent symbols shows where the data are densest. – Gavin Simpson Jul 06 '11 at 12:56
  • If I'm using software intended to handle large data sets and the GUI hangs when I give it large data sets, I don't think the response should be "change your workflow to use smaller datasets". IMO this is a flaw in R, although it alone would not be enough to make me look for something else. But when you put it together with other issues it is. – flodin Jul 07 '11 at 11:20
  • @flodin What are the other issues? Plenty of big-name companies use R for large data analysis tasks. The key thing is to do the right, targeted analysis. You don't plot a million observations in a scatterplot, for example, but you can bin the data first and then plot, as per the hexagonal binning idea I mentioned in a comment. The issue there is doing the *right* analysis/plot. And I can hang any software that allows me to programme scripts in it by doing something silly, which seems to be the basis of your complaint about the graphics. – Gavin Simpson Jul 07 '11 at 12:39
  • Most software that lets you run scripts also has the ability to abort the script if it's taking too long (your browser, for example, won't hang because some JavaScript code is taking too long to run). The other issues are those I mentioned in my question, i.e. that it takes too much research and time to do simple things. – flodin Jul 07 '11 at 13:48
  • I don't know about the R GUI on Windows, but I am quite capable of killing a process on Linux without losing any work. You aren't asking about a browser, you are asking about a stats/data analysis package. Your two options thus far seem to be Python and R. Each has its pros and cons, but both certainly come with a learning curve. I know I can do what @Fredrik shows in Python in one line of R code. But you seem unwilling to learn...? – Gavin Simpson Jul 07 '11 at 17:55
5

R is a great tool, but I have never had to resort to using it. Instead, I find Python more than adequate for my needs when I need to pull data out of huge logs. Python really comes with "batteries included", with built-in support for working with CSV files.

The simplest example of reading a CSV file:

import csv
with open('some.csv', 'rb') as f:
    reader = csv.reader(f)
    for row in reader:
        print row

To use another separator, e.g. a tab, and extract the n-th column, use:

spamReader = csv.reader(open('spam.csv', 'rb'), delimiter='\t')
for row in spamReader:
   print row[n]

To operate on columns, use the built-in list data type; it's extremely versatile!

To create beautiful plots I use the matplotlib scatter plot code.

The Python tutorial is a great way to get started! If you get stuck, there is always Stack Overflow ;-)

Fredrik Pihl
  • You can create flashy, fancy moving plots using the `googleVIS` package in R: http://code.google.com/p/google-motion-charts-with-r/. – Roman Luštrik Jul 06 '11 at 10:42
  • I just tried matplotlib on about 50k observations and I love that the default plot method is interactive (with zooming and displaying the value as you hover with the mouse) and still performs well. Plus, given that Python is a more generic language I think it is more efficient use of my time, as I can use the knowledge for other things too. – flodin Jul 11 '11 at 08:50
  • @flodin - I'm so glad for you, python is a great friend to have; and as you say, it's a generic language that can be used for almost anything (I wish I could use it to make my daughter go to sleep earlier, but I don't hold my breath for that feature to be implemented...)! – Fredrik Pihl Jul 11 '11 at 19:40
2

There seem to be several questions mixed together:

  1. Can you draw plots quicker and more easily?

  2. Can you do things in R with less learning effort?

  3. Are there other tools which require less learning effort than R?

I'll answer these in turn.

There are three plotting systems in R, namely base, lattice and ggplot2 graphics. Base graphics will render quickest, but making them look pretty can involve pathological coding. ggplot2 is the opposite, and lattice is somewhere in between.
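As a rough illustration of the first two, assuming the CSV has already been read into a data frame dat with a combined dateTime column (as in Gavin's answer):

# base graphics: terse and quick to render
plot(PS.Eden.Space.used ~ dateTime, data = dat, pch = ".")

# ggplot2: prettier defaults, but slower on large data
library(ggplot2)
ggplot(dat, aes(x = dateTime, y = PS.Eden.Space.used)) +
    geom_point(alpha = 0.2)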

Reading in CSV data, cleaning it and drawing a scatterplot sounds like a pretty straightforward task, and the tools are definitely there in R for solving such problems. Try asking a question here about specific bits of code that feel clunky, and we'll see if we can fix it for you. If your datasets all look similar, then you can probably reuse most of your code over and over. You could also give the ggplot2 web app a try.

The two obvious alternative languages for data processing are MATLAB (and its derivatives: Octave, Scilab, AcslX) and Python. Either of these will be suitable for your needs, and MATLAB in particular has a pretty shallow learning curve. Finally, you could pick a graph-specific tool like gnuplot or Prism.

Richie Cotton
  • I love your breakdown. :) It's really point 3 that is my question, the rest is just context to describe what I'm looking for in the alternative. And regarding point 1, it's not the speed itself that is the problem, but rather the fact that when I hit a wall of bad performance the only way out is to kill the process. – flodin Jul 07 '11 at 11:26
0

SAS can handle larger data sets than R or Excel; however, many (if not most) people, myself included, find it a lot harder to learn. Depending on exactly what you need to do, it might be worthwhile to load the CSV into an RDBMS and do some of the computations (e.g. correlations, rounding) there, then export only what you need to R to generate graphics.
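A rough sketch of that route from within R, assuming the DBI and RSQLite packages are installed (the package choice, file names and table name here are mine, purely for illustration):

library(DBI)

# push the CSV into an on-disk SQLite database
con <- dbConnect(RSQLite::SQLite(), "logs.sqlite")
dbWriteTable(con, "gc_log", read.csv("log.csv"), overwrite = TRUE)

# do the aggregation in SQL and pull back only the small summary
daily <- dbGetQuery(con, "SELECT date, AVG([PS.Eden.Space.used]) AS mean_eden
                          FROM gc_log GROUP BY date")
dbDisconnect(con)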

ETA: There's also SPSS and Revolution; the former might not be able to handle the size of data you've got, and the latter is, from what I've heard, a distributed version of R (which, unlike R, is not free).

  • You are right about SAS in that it can handle more data than R, but your comment seems to have ignored the stated data set size: ~100,000 observations. That isn't large in terms of the RAM installed in today's desktop computers. I don't work with very big data sets that often, but I have regularly worked with data sets of approximately that size. R only has problems when data sets start getting into the multi-GB range, because even with a good workstation and plenty of RAM, R will need to hold several copies in memory if you want to do anything useful. But even that is being worked around now. – Gavin Simpson Jul 06 '11 at 08:04
  • I think you missed the main point, that I was looking for something simpler to use, not harder. Thanks for the effort though. – flodin Jul 06 '11 at 10:02