
I have ~300 text files that contain data on trackers, torrents and peers. Each file is organised like this:

tracker.txt

time torrent
    time peer
    time peer
    ...
time torrent
...

I have several files per tracker and much of the information is repeated (same information, different time).

I'd like to be able to analyse what I have and report statistics on things like

  • How many torrents are at each tracker
  • How many trackers are torrents listed on
  • How many peers do torrents have
  • How many torrents do peers have

The sheer quantity of data is making this hard for me to handle. Here's what I've tried.

MySQL

I put everything into a database; one table per entity type and tables to hold the relationships (e.g. this torrent is on this tracker).

Adding the information to the database was slow (and I didn't have 13GB of it when I tried this), but analysing the relationships afterwards was a no-go. Every mildly complex query took over 24 hours to complete (if it completed at all).

An example query would be:

SELECT COUNT(DISTINCT torrent) 
    FROM TorrentAtPeer, Peer 
    WHERE TorrentAtPeer.peer = Peer.id 
    GROUP BY Peer.ip;

I tried bumping up the memory allocations in my my.cnf file but it didn't seem to help. I used the my-innodb-heavy-4G.cnf settings file.

EDIT: Adding table details

Here's what I was using:

Peer         Torrent                  Tracker        
-----------  -----------------------  ------------------  
id (bigint)  id (bigint)              id (bigint)
ip* (int)    infohash* (varchar(40))  url (varchar(255))
port (int)

TorrentAtPeer      TorrentAtTracker
-----------------  ----------------
id (bigint)        id (bigint)
torrent* (bigint)  torrent* (bigint)
peer* (bigint)     tracker* (bigint)
time (int)         time (int)

*indexed field. Navicat reports them as being of normal type and Btree method.
id - Always the primary key

There are no foreign keys. I was confident in my ability to only use IDs that corresponded to existing entities; adding a foreign key check seemed like a needless delay. Is this naive?

Matlab

This seemed like an application designed for heavy lifting, but I wasn't able to allocate enough memory to hold all of the data in one go.

I didn't have numerical data, so I was using cell arrays; I moved from these to tries in an effort to reduce the footprint, but I couldn't get it to work.

Java

My most successful attempt so far. I found an implementation of Patricia Tries provided by the people at Limewire. Using this I was able to read in the data and count how many unique entities I had:

  • 13 trackers
  • 1.7mil torrents
  • 32mil peers

I'm still finding it too hard to work out the frequencies of the number of torrents at peers. I'm attempting to do so by building tries like this:

Trie<String, Trie<String, Object>> peers = new Trie<String, Trie<String, Object>>(...);
String infohash = null;
for (String line : file) {
    if (containsTorrent(line)) {
        infohash = getInfohash(line);
    }
    else if (containsPeer(line)) {
        String peer = getPeer(line);
        Trie<String, Object> torrents = peers.get(peer);
        if (torrents == null) {  // first sighting of this peer
            torrents = new Trie<String, Object>(...);
            peers.put(peer, torrents);
        }
        torrents.put(infohash, null);
    }
}
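
Once that peers structure is built, I think the frequency count itself is just one pass over it. A sketch of what I have in mind, using a plain HashMap in place of the trie (the class name and sample data here are illustrative only, not my actual code):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class PeerFrequencies {
    // peers maps "ip:port" -> set of infohashes seen at that peer.
    // Returns a histogram: torrent count -> how many peers have that many torrents.
    public static Map<Integer, Integer> torrentCountHistogram(Map<String, Set<String>> peers) {
        Map<Integer, Integer> histogram = new HashMap<Integer, Integer>();
        for (Set<String> torrents : peers.values()) {
            int size = torrents.size();
            Integer count = histogram.get(size);
            histogram.put(size, count == null ? 1 : count + 1);
        }
        return histogram;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> peers = new HashMap<String, Set<String>>();
        peers.put("1.2.3.4:6881", new HashSet<String>(Arrays.asList("aaa", "bbb")));
        peers.put("5.6.7.8:6881", new HashSet<String>(Arrays.asList("aaa")));
        peers.put("9.9.9.9:6881", new HashSet<String>(Arrays.asList("ccc", "ddd")));
        System.out.println(torrentCountHistogram(peers)); // prints {1=1, 2=2}
    }
}
```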

From what I've been able to do so far, if I can get this peers trie built then I can easily find out how many torrents are at each peer. I ran it all yesterday and when I came back I noticed that the log file wasn't being written to. I ^Z'd the application and time reported the following:

real 565m41.479s
user 0m0.001s
sys  0m0.019s

This doesn't look right to me, should user and sys be so low? I should mention that I've also increased the JVM's heap size to 7GB (max and start), without that I rather quickly get an out of memory error.

I don't mind waiting for several hours/days but it looks like the thing grinds to a halt after about 10 hours.

I guess my question is, how can I go about analysing this data? Are the things I've tried the right things? Are there things I'm missing? The Java solution seems to be the best so far; is there anything I can do to get it to work?

WilliamMayor
  • What's the database server running? What is the spec of the database server? Is it on your local machine, where it could suffer from bottlenecks? – Dean Jul 12 '12 at 11:19
  • You might consider using [Hadoop](http://hadoop.apache.org/) and [HBase](http://hbase.apache.org/) for analysing your data. – uckelman Jul 12 '12 at 11:11
  • I am aware of these technologies although I've never used them. I don't have access to any kind of cluster of machines; I really only have the one. Can I still benefit from using something like this? – WilliamMayor Jul 12 '12 at 11:15
  • With 64-bit versions of Matlab, the memory problem should be solved. You may want to stick in some extra RAM though. – Dennis Jaheruddin Jul 23 '13 at 08:36

4 Answers


You state that your MySQL queries took too long. Have you ensured that proper indices are in place to support the kind of queries you submit? In your example, that would be an index on Peer.ip (or even a composite index on (Peer.ip, Peer.id)) and an index on TorrentAtPeer.peer.

As I understand your Java results, you have a lot of data but not that many distinct strings. So you could perhaps save some time by assigning a unique number to each tracker, torrent and peer: one table for each, with an indexed column holding the string and a numeric primary key as the id. That way, all tables relating these entities would only have to deal with those numbers, which could save a lot of space and make your operations a lot faster.
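
A rough sketch of that interning step in Java (the class and method names are mine, purely illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class Interner {
    private final Map<String, Integer> ids = new HashMap<String, Integer>();

    // Returns a stable small integer for each distinct string;
    // the first string seen gets 0, the next new one gets 1, and so on.
    public int idFor(String s) {
        Integer id = ids.get(s);
        if (id == null) {
            id = ids.size();
            ids.put(s, id);
        }
        return id;
    }
}
```

With 32 million peers, storing a 4-byte int per occurrence instead of a full "ip:port" string adds up quickly, both in the database and in memory.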

MvG
  • Yes, indexing the correct columns will help a lot. It uses more DB space and RAM, but reduces query times many-fold. – Piotr Kula Jul 12 '12 at 10:39
  • I'll add the table details to the question; I did have indexes on what I thought would be the best fields. Take a look. – WilliamMayor Jul 12 '12 at 10:49

I would give MySQL another try but with a different schema:

  • do not use id-columns here
  • use natural primary keys here:

    Peer: ip, port
    Torrent: infohash
    Tracker: url
    TorrentPeer: peer_ip, torrent_infohash, peer_port, time
    TorrentTracker: tracker_url, torrent_infohash, time

  • use the InnoDB engine for all tables

This has several advantages:

  • InnoDB uses a clustered index for the primary key. This means that all data can be retrieved directly from the index, without an additional lookup, when you only request data from primary-key columns. InnoDB tables are effectively index-organized tables.
  • Smaller size, since you do not have to store the surrogate keys; that means more speed, because there is less I/O for the same results.
  • You may be able to do some queries now without using (expensive) joins, because you use natural primary and foreign keys. For example, the linking table TorrentAtPeer directly contains the peer ip as a foreign key to the peer table. If you need to query the torrents used by peers in a subnetwork, you can now do this without a join, because all the relevant data is in the linking table.

If you want the torrent count per peer, and you want the peer's ip in the results too, then using natural primary/foreign keys is again an advantage.

With your schema you have to join to retrieve the ip:

SELECT Peer.ip, COUNT(DISTINCT torrent) 
    FROM TorrentAtPeer, Peer 
    WHERE TorrentAtPeer.peer = Peer.id 
    GROUP BY Peer.ip;

With natural primary/foreign keys:

SELECT peer_ip, COUNT(DISTINCT torrent) 
    FROM TorrentAtPeer 
    GROUP BY peer_ip;

EDIT: The originally posted schema was not the real one; the Peer table also has a port field. I would suggest using the primary key (ip, port) here and still dropping the id column. This also means that the linking table needs multicolumn foreign keys. I have adjusted the answer accordingly.

Fabian Barney
  • I'll give MySQL another try. Using natural indexes seems odd to me, am I not duplicating information and therefore increasing the space required? A torrent's infohash is a 40 character string, storing this once in the Torrent table and then once more for every instance in TorrentAtPeer would seem to take up more space than using an ID? I removed some fields from the tables for brevity, what I shouldn't have removed was the Peer's port. I'm generalising to say that if Peers share IPs but not ports then they should be considered the same Peer. – WilliamMayor Jul 12 '12 at 12:11
  • Your normal non-linking tables become one-column tables. Here you save space and make queries faster. For the other tables, what matters far more than saving a few bytes is that you can often avoid joins entirely. And I think you can greatly benefit from InnoDB's clustered index here; that is why the relevant data has to be in the primary key. – Fabian Barney Jul 12 '12 at 12:19
  • OK, I'll do some reading on clustered indexing. Do you have any good recommendations? If not I'll hit up Wikipedia and the MySQL site. Thanks for your help! – WilliamMayor Jul 12 '12 at 12:25
  • The MySQL site should be OK afaik. The point is that an InnoDB table uses the primary key as its clustered index when one exists, and when all the relevant data is part of this index there is no additional lookup. When you have a surrogate id-column as the primary key, the clustered index does not contain the relevant data for most queries, so there are additional lookups to retrieve the values of the other columns. – Fabian Barney Jul 12 '12 at 12:31
  • Btw I am with you that surrogate id-columns are a good idea in most cases but not here. – Fabian Barney Jul 12 '12 at 12:48
  • Hey Fabian, thanks for all your help, sorry for messing you around with the wrong schema. I've added the data to a database and now I'm querying over it; a definite step in the right direction. I've asked another question about query performance here: http://stackoverflow.com/questions/11469818/improving-mysql-performance-on-a-run-once-query-with-a-large-dataset. Any chance you could help me out? – WilliamMayor Jul 13 '12 at 11:45
  • Glad to help you. Do not worry about the wrong schema. I was just a bit amped, because I spent some time on it. I'll have a look on your other question. – Fabian Barney Jul 13 '12 at 14:27

If you can use C++, you should take a look at Boost.Flyweight.

Using flyweight, you can write your code as if you had strings, but each instance of a string (your tracker name, etc.) uses only the size of a pointer.

Regardless of the language, you should convert the IP address to an int (take a look at this question) to save some more memory.
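
For example, a dotted-quad IPv4 address fits in a single 32-bit int; a sketch in Java (the class name is made up):

```java
public class IpToInt {
    // Packs a dotted-quad IPv4 string (e.g. "127.0.0.1") into one 32-bit int.
    public static int ipToInt(String ip) {
        String[] octets = ip.split("\\.");
        int packed = 0;
        for (String octet : octets) {
            packed = (packed << 8) | Integer.parseInt(octet);
        }
        return packed; // ipToInt("127.0.0.1") == 2130706433
    }
}
```

That is 4 bytes per address instead of a String object of up to 15 characters, which matters with 32 million peers.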

tstenner
  • Unfortunately I don't know any C++. I had thought that being able to use pointers would make my storage problems a little easier, but I decided it would be too inefficient to determine whether or not I had already processed a torrent; I would have to follow millions of pointers. Am I right? My knowledge of these things is sketchy. – WilliamMayor Jul 13 '12 at 11:51
  • @WilliamMayor I suppose a solution in C++ shouldn't look much different from a solution in Java. The beauty of Boost.Flyweight is that you don't have to care about pointers at all. – tstenner Jul 15 '12 at 11:24

You most likely have a problem that can be solved by NOSQL and distributed technologies.

i) I would write a distributed system using Hadoop/HBase.

ii) Rent several tens or hundreds of AWS machines, but only for a few seconds (it'll still cost you less than $0.50).

iii) Profit!!!

roshan
  • This strikes me as being a very interesting (and nerd-cool) idea. I have no experience with writing distributed systems, do you have any recommended readings? Can I trial run and debug such systems on my personal machine? – WilliamMayor Jul 13 '12 at 11:52