
I have some historical trade data in a CSV file in the format: unixtime, price, volume, and I want to analyze that data.

I managed to do it in Python, but it's painfully slow (it takes about 2 days to run the algorithm on a 30-day test dataset).

I'm trying to do it in C/C++, or even Java or Scala, but my main issue is that I have no way to resample the data. I need to resample it to the format: date time, open, high, low, close, volume at 15-minute intervals, but I can't find any way to do that in C/C++.

In Python this does what I want (it uses a pandas DataFrame):

def resample_data(raw_data, time_frame):
    # resamples the tick data into OHLC bars for the given time frame
    resampledData = raw_data.copy()
    ohlc_dict = {
        'open': 'first',
        'high': 'max',
        'low': 'min',
        'close': 'last'
        }

    # aggregate the price column into OHLC and sum the volume ('amount')
    resampledData = resampledData.resample(time_frame, how={'price': ohlc_dict, 'amount': 'sum'})
    # empty intervals: zero volume, carry the last close forward,
    # then fill open/high/low from that carried-forward close
    resampledData.amount = resampledData['amount']['sum'].fillna(0.0)
    resampledData['price']['close'] = resampledData['price']['close'].fillna(method='pad')
    resampledData = resampledData.apply(lambda x: x.fillna(resampledData['price']['close']))

    return resampledData

Any ideas (or a library) that does this in C/C++, Java, or Scala?

jbssm
  • Why don't you sort the data by date if needed, then split it into chunks of 15 minutes and aggregate each? This can be done iteratively, so you don't have to hold too much data in memory, and you can use standard Scala functions. Maybe you can also run this in parallel. (A sketch of this approach follows these comments.) – yǝsʞǝla Jan 04 '14 at 02:32
  • The problem is that doing this "manually" seems too error-prone to me. Taking into account changing hours, days, months, and years while checking for 15-minute (or other) intervals is just something that will need extensive testing to verify everything is calculated correctly. It would be much better to use some library (like pandas for Python or zoo for R) that has already gone through the needed quality control. – jbssm Jan 04 '14 at 02:41
  • Unfortunately, this is not a library recommendation site. If you start work on something, we can help you with trouble spots. Personally, when dealing with datasets, I tend to go for SQL, especially when it looks like simple aggregates should suffice; with a calendar/time-of-day table, different time periods become trivial. Otherwise, preparing "buckets" that you put data into should work, and _may_ be parallelizable. Any testing needed is in two parts: creating buckets based on time difference, and putting data into those buckets. – Clockwork-Muse Jan 04 '14 at 02:54
  • I think you could define a function that produces intervals out of streamed data. If you deal with anything expressed in minutes, seconds, or hours, you can convert the date to those units and then split. If you deal with days, weeks, etc., then you would have to extract those units using something like JodaTime. Leap seconds can probably be ignored. – yǝsʞǝla Jan 04 '14 at 02:57
  • pandas has a resample function. Since you're using a pandas DataFrame, your solution is more complex than just using the provided resample function. – Brandon Bradley Jan 23 '15 at 13:44
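
The bucketing approach suggested in these comments can be sketched with the standard Scala library. The following is only an illustration, not a tested library: Tick, Bar, and toBars are hypothetical names, and it assumes one CSV line per tick in the question's unixtime, price, volume format. Because the bucket key is the integer division of the Unix timestamp by the interval length, buckets align to fixed clock boundaries (in UTC), so hour/day/month rollovers take care of themselves:

// Sketch of the comment-thread suggestion: bucket ticks by
// floor(unixtime / intervalSec), then aggregate each bucket to OHLCV.
// Tick, Bar and toBars are hypothetical names, not from any library.
case class Tick(time: Long, price: Double, volume: Double) // time in Unix seconds
case class Bar(start: Long, open: Double, high: Double, low: Double,
  close: Double, volume: Double)

def toBars(ticks: Seq[Tick], intervalSec: Long = 15 * 60): Seq[Bar] =
  ticks
    .groupBy(_.time / intervalSec)      // interval index since the epoch
    .toSeq.sortBy(_._1)
    .map { case (bucket, ts) =>
      val sorted = ts.sortBy(_.time)    // chronological order inside the bucket
      Bar(bucket * intervalSec,
        open   = sorted.head.price,
        high   = sorted.map(_.price).max,
        low    = sorted.map(_.price).min,
        close  = sorted.last.price,
        volume = sorted.map(_.volume).sum)
    }

// parsing the CSV (assuming a header-less unixtime,price,volume file):
// val ticks = scala.io.Source.fromFile("trades.csv").getLines().map { line =>
//   val Array(t, p, v) = line.split(",")
//   Tick(t.toLong, p.toDouble, v.toDouble)
// }.toSeq
// val bars = toBars(ticks)

Unlike the streaming idea above, this holds the whole file in memory; since the input is already time-ordered, the same grouping could also be done incrementally, one bucket at a time.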

2 Answers


Just a quick example of what you can do with the standard Scala library. This code can be run in the Scala REPL:

// no external libraries (such as Joda-Time and its Scala wrappers) needed
import scala.annotation.tailrec

case class Sample(value: Double, timeMillis: Long)
case class SampleAggregate(startTimeMillis: Long, endTimeMillis: Long,
  min: Sample, max: Sample)

val currentMillis = System.currentTimeMillis
val inSec15min = 15 * 60
val inMillis15min = inSec15min * 1000
// sample each second:
val data = (1 to inSec15min * 100).map { i =>
  Sample(i, currentMillis + i*1000) }.toList

@tailrec
def aggregate(xs: List[Sample], intervalDurationMillis: Long,
  accu: List[SampleAggregate]): List[SampleAggregate] =
  xs match {
    case h :: _ =>
      // the interval starts at the first remaining sample;
      // take every sample within intervalDurationMillis of it
      val start = h.timeMillis
      val (slice, rest) = xs.span(_.timeMillis < (start + intervalDurationMillis))
      val end = slice.last.timeMillis
      val aggr = SampleAggregate(start, end, slice.minBy(_.value),
        slice.maxBy(_.value))
      aggregate(rest, intervalDurationMillis, aggr :: accu)
    case Nil =>
      accu.reverse
  }

val result = aggregate(data, inMillis15min, Nil)

Fake data:

data.take(10).foreach(println)
Sample(1.0,1388809630677)
Sample(2.0,1388809631677)
Sample(3.0,1388809632677)
Sample(4.0,1388809633677)
Sample(5.0,1388809634677)
Sample(6.0,1388809635677)
Sample(7.0,1388809636677)
Sample(8.0,1388809637677)
Sample(9.0,1388809638677)
Sample(10.0,1388809639677)

Results:

result.foreach(println)
SampleAggregate(1388809630677,1388810529677,Sample(1.0,1388809630677),Sample(900.0,1388810529677))
SampleAggregate(1388810530677,1388811429677,Sample(901.0,1388810530677),Sample(1800.0,1388811429677))
SampleAggregate(1388811430677,1388812329677,Sample(1801.0,1388811430677),Sample(2700.0,1388812329677))
SampleAggregate(1388812330677,1388813229677,Sample(2701.0,1388812330677),Sample(3600.0,1388813229677))
SampleAggregate(1388813230677,1388814129677,Sample(3601.0,1388813230677),Sample(4500.0,1388814129677))
SampleAggregate(1388814130677,1388815029677,Sample(4501.0,1388814130677),Sample(5400.0,1388815029677))
SampleAggregate(1388815030677,1388815929677,Sample(5401.0,1388815030677),Sample(6300.0,1388815929677))

We can pass a function into span that defines the intervals (hours or days). The input can also be a Stream, so the data is aggregated as it's being read from a file.
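
For the question's actual format, the same recursion can carry open, close, and volume as well. Here is a sketch along those lines (not part of the original answer; Trade, Ohlcv, and toOhlcv are made-up names). It differs from the code above in one detail: each bucket starts at a fixed multiple of the interval rather than at the first sample, which matches how pandas aligns its bars:

import scala.annotation.tailrec

case class Trade(timeMillis: Long, price: Double, volume: Double)
case class Ohlcv(startMillis: Long, open: Double, high: Double,
  low: Double, close: Double, volume: Double)

@tailrec
def toOhlcv(xs: List[Trade], intervalMillis: Long,
  accu: List[Ohlcv] = Nil): List[Ohlcv] =
  xs match {
    case h :: _ =>
      // align the bucket to a fixed interval boundary
      val start = (h.timeMillis / intervalMillis) * intervalMillis
      val (slice, rest) = xs.span(_.timeMillis < start + intervalMillis)
      val bar = Ohlcv(start,
        open   = slice.head.price,
        high   = slice.map(_.price).max,
        low    = slice.map(_.price).min,
        close  = slice.last.price,
        volume = slice.map(_.volume).sum)
      toOhlcv(rest, intervalMillis, bar :: accu)
    case Nil => accu.reverse
  }

One caveat: intervals with no trades produce no bar at all here, whereas the pandas version pads the close forward; emitting empty bars would need an extra pass over the expected interval boundaries.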

yǝsʞǝla

Try looking at Saddle for data manipulation. I have only just found it myself, so I'm not sure of its full capabilities, but it is inspired by pandas.

Gavin
  • Hello, Saddle was actually my first try. But although it implements a pandas-like DataFrame, it doesn't have the resampling functions to deal with time series. – jbssm Jan 04 '14 at 14:47