As part of a distributed crawler, we store all URLs in Redis: a sorted set serves as the crawl queue, and a hash is used to de-duplicate and mark visited URLs.

We have a file of about 11M URLs from various domains that we wish to visit; it occupies 506 MB on disk.

However, when the same URLs are put in a Redis sorted set, with decreasing integer priorities from 11M down to 0, the set takes 1.759 GB of RAM, and the Redis hash (key: URL -> value: same URL) takes 2.048 GB of RAM.
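
To put those figures in per-URL terms, here is a rough back-of-the-envelope calculation in Python. It only divides the sizes quoted above by the URL count; it assumes MB and GB are decimal units and that the count is exactly 11M, so treat the results as approximate.

```python
# Figures quoted in the question. Assumption: MB/GB are decimal units
# and the URL count is exactly 11 million.
N_URLS = 11_000_000
FILE_BYTES = 506e6    # URL file on disk
ZSET_BYTES = 1.759e9  # sorted set (crawl queue)
HASH_BYTES = 2.048e9  # hash (URL -> same URL)


def per_entry(total_bytes, n=N_URLS):
    """Average number of bytes consumed per URL."""
    return total_bytes / n


raw = per_entry(FILE_BYTES)
zset = per_entry(ZSET_BYTES)
hash_ = per_entry(HASH_BYTES)

print(f"file:       {raw:6.1f} bytes per URL")
print(f"sorted set: {zset:6.1f} bytes per URL ({zset / raw:.1f}x the file)")
print(f"hash:       {hash_:6.1f} bytes per URL ({hash_ / raw:.1f}x the file)")
```

This works out to roughly 46 bytes per URL on disk, about 160 bytes per sorted set entry, and about 186 bytes per hash entry. The hash stores each URL twice (as field and as value), so around 92 of its bytes per entry are URL text alone.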

The Redis server is hosted on a High-Memory Extra Large EC2 instance (17 GB) in AWS.

I want to figure out what is causing the space bloat in Redis. Is it because we are storing the URLs inefficiently, or should we optimize for memory in some particular way? Any suggestions for reducing memory usage would be great. Thanks in advance for any help!

This is the redis info dump:

redis_version:2.4.14
redis_git_sha1:00000000
redis_git_dirty:0
arch_bits:64
multiplexing_api:epoll
gcc_version:4.6.3
process_id:739
uptime_in_seconds:329647
uptime_in_days:3
lru_clock:1603627
used_cpu_sys:9521.58
used_cpu_user:3165.06
used_cpu_sys_children:19535.11
used_cpu_user_children:126500.32
connected_clients:76
connected_slaves:0
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0
used_memory:12794713864
used_memory_human:11.92G
used_memory_rss:13586632704
used_memory_peak:16575849280
used_memory_peak_human:15.44G
mem_fragmentation_ratio:1.06
mem_allocator:jemalloc-2.2.5
loading:0
aof_enabled:0
changes_since_last_save:46321
bgsave_in_progress:1
last_save_time:1358213403
bgrewriteaof_in_progress:0
total_connections_received:1702
total_commands_processed:95112145
expired_keys:3488037
evicted_keys:0
keyspace_hits:43443780
keyspace_misses:38945
pubsub_channels:2
pubsub_patterns:0
latest_fork_usec:3820832
vm_enabled:0
role:master
db0:keys=116,expires=25
alpha_cod
  • you say that you use ordered set as a queue, this implies that after you visit a URL you remove it from the set, correct? – akonsu Jan 15 '13 at 02:52
  • yes, once a URL is visited, it's removed from the sorted set and marked as completed (integer value 1) in the hash – alpha_cod Jan 15 '13 at 06:23
  • Use list instead of sorted set if you need a queue. Use set instead of a dictionary url->url. See http://stackoverflow.com/questions/10004565/redis-10x-more-memory-usage-than-data/10008222#10008222 – Didier Spezia Jan 15 '13 at 10:07
  • Totally agree with @DidierSpezia in this case. In fact I wrote a toolbox of Python-Queue like implementations backed by Redis. I implemented `Queue` in Redis List, but used Sorted Set for queues that always maintain unique items. See here for details https://github.com/woozyking/techies – woozyking Jan 16 '14 at 14:56
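
The layout suggested in the comments (a list as the queue, a set for de-duplication) could be sketched in Python roughly as follows. This is only an illustration: the key names `crawl:queue` and `crawl:seen` and the helper functions are hypothetical, `r` is assumed to be a redis-py style client, and the sketch drops both the priorities and the "completed" marker that the sorted set and hash in the question provide.

```python
# Hypothetical key names, chosen for this sketch only.
QUEUE_KEY = "crawl:queue"  # Redis list used as a FIFO queue
SEEN_KEY = "crawl:seen"    # Redis set of every URL ever enqueued


def enqueue(r, url):
    """Queue `url` unless it has been seen before. Returns True if queued.

    `r` is assumed to be a redis-py style client exposing sadd/rpush/lpop.
    """
    # SADD returns 1 when the member is newly added and 0 when it already
    # exists, so it doubles as the de-duplication check.
    if r.sadd(SEEN_KEY, url):
        r.rpush(QUEUE_KEY, url)
        return True
    return False


def next_url(r):
    """Pop the oldest queued URL, or return None if the queue is empty."""
    return r.lpop(QUEUE_KEY)


# Possible usage against a local server (assumes the `redis` package):
#   import redis
#   r = redis.Redis(decode_responses=True)
#   enqueue(r, "http://example.com/")
#   url = next_url(r)
```

Note that the SADD and RPUSH calls are separate commands, so this sketch is not atomic; a crash between the two would leave a URL marked as seen but never queued.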
