I'm using RedisTimeSeries to read time-series data that was previously stored in a CSV file.
The problem: reading the same data set from the Redis server is far slower than reading the CSV file with pandas.
Here is an MWE that shows the issue. I generate random samples, each consisting of a Unix timestamp and a number, then fill both the CSV file and Redis with the same data so I can measure only the READING time (I'm not concerned about writing in this scenario).
import csv
import random
import time
from datetime import datetime, timedelta

import pandas as pd
import redis


def random_date(start, end):
    """Return a random datetime between start and end."""
    delta = end - start
    int_delta = (delta.days * 24 * 60 * 60) + delta.seconds
    random_second = random.randrange(int_delta)
    return start + timedelta(seconds=random_second)


with open('justcsv.csv', mode='w', newline='') as file:
    file_writer = csv.writer(
        file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)

    # Init Redis
    r = redis.Redis(host="localhost", port=6379)
    r.flushall()

    # Create the time series key with a label
    r_tsname = "TESTKEY"
    label = {"label": r_tsname}
    key_name = "TESTKEY1"
    r.ts().create(key_name, labels=label)

    # Bounds for the random timestamps
    d1 = datetime.strptime('1/1/2008 1:30 PM', '%m/%d/%Y %I:%M %p')
    d2 = datetime.strptime('1/1/2022 4:50 AM', '%m/%d/%Y %I:%M %p')

    for x in range(30000):
        dt = random_date(d1, d2)
        timestamp = int(dt.timestamp())
        random_number = round(random.uniform(1.5, 1000.9), 2)
        # write current row in CSV
        file_writer.writerow([timestamp, random_number])
        # write current row in REDIS
        r.ts().add(key_name, timestamp, random_number)

# READ data from CSV with pandas and benchmark it
start_csv = time.time()
df = pd.read_csv('justcsv.csv')  # benchmark
end_csv = time.time()
print("CSV READING TIME IS: " + str(end_csv - start_csv))

# READ data from Redis and benchmark it
thelabel = "label=" + "TESTKEY"
mrange_filters = [thelabel]  # unused here; kept for an mrange() variant
start_redis = time.time()
full_range = r.ts().range("TESTKEY1", "-", "+")  # benchmark
end_redis = time.time()
print("REDIS READING TIME IS: " + str(end_redis - start_redis))
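As an aside, for intervals this short `time.perf_counter()` is generally preferred over `time.time()`, since it uses a higher-resolution monotonic clock. A minimal, self-contained timing helper (the `timed` function and the `sum` workload are just illustrative, not part of the benchmark above):

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Example with a cheap stand-in workload:
result, elapsed = timed(sum, range(1_000_000))
print(result, elapsed)
```

The same pattern applies unchanged to `pd.read_csv(...)` or `r.ts().range(...)`.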
Benchmark result:
10000 iterations - slower x2
CSV READING TIME IS: 0.0124
REDIS READING TIME IS: 0.052
20000 iterations - slower x4
CSV READING TIME IS: 0.025
REDIS READING TIME IS: 0.102
30000 iterations - slower x10
CSV READING TIME IS: 0.0139
REDIS READING TIME IS: 0.153
I used the latest Docker image from: https://hub.docker.com/r/redislabs/redistimeseries
My observations:
- From what I understand, Redis should be much faster at this task, especially since RedisTimeSeries provides a built-in data structure for timestamped data;
- Redis should also be faster because it is in-memory, whereas pandas reads the CSV file from disk;
- the gap with CSV grows rapidly as the data size increases;
- even querying the data via redis-cli doesn't change the elapsed time.
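For reference, the redis-cli equivalent of the benchmarked call is roughly the following (`TESTKEY1` is the key created by the script above; `time` is the shell built-in, not a Redis command):

```shell
# Time the raw TS.RANGE over the full range of the key from the CLI.
time redis-cli TS.RANGE TESTKEY1 - +
```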
My questions:
- Why is Redis slower (and this slow)?
- Am I missing something?
- Is there a way to fix this?