
I have an idea to implement a real-time keyword-based torrent search mechanism using the existing BitTorrent DHT, and I would like to know if it is feasible and realistic.


We have a torrent, and we would like to be able to find it from a keyword using the DHT only. The following notation is used (a Python sketch of these helpers is given after the list):

  • H is a hash function with a 20 bytes output
  • infohash is the info_hash of the torrent (20 bytes)
  • sub(hash, i) returns 2 bytes of hash starting at byte i (for example, sub(0x62616463666568676a696c6b6e6d706f72717473, 2) = 0x6463)
  • announce_peer(hash, port) publishes a fake peer associated with a fake info_hash hash. The IP of the fake peer is irrelevant and we use the port number to store data (2 bytes).
  • get_peers(hash) retrieves fake peers associated with a fake info_hash hash. Assume this function returns only a list of port numbers.
  • a ++ b means concatenate a and b (for example, 0x01 ++ 0x0203 = 0x010203)
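
For concreteness, here is a minimal Python sketch of the notation above; SHA-1 is only an assumption for H (it happens to produce 20 bytes), and announce_peer/get_peers are left to whatever DHT client is used:

import hashlib

def H(data: bytes) -> bytes:
    # 20-byte hash; SHA-1 is assumed here because it matches the required output size
    return hashlib.sha1(data).digest()

def sub(data: bytes, i: int) -> bytes:
    # 2 bytes of data starting at byte i
    return data[i:i + 2]

# '++' is plain byte concatenation; the examples above hold:
assert sub(bytes.fromhex("62616463666568676a696c6b6e6d706f72717473"), 2) == bytes.fromhex("6463")
assert b"\x01" + b"\x02\x03" == bytes.fromhex("010203")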

Publication

id <- sub(infohash, 0)
announce_peer( H( 0x0000 ++ 0x00 ++ keyword ), id               )
announce_peer( H( id     ++ 0x01 ++ keyword ), sub(infohash, 2 ))
announce_peer( H( id     ++ 0x02 ++ keyword ), sub(infohash, 4 ))
announce_peer( H( id     ++ 0x03 ++ keyword ), sub(infohash, 6 ))
announce_peer( H( id     ++ 0x04 ++ keyword ), sub(infohash, 8 ))
announce_peer( H( id     ++ 0x05 ++ keyword ), sub(infohash, 10))
announce_peer( H( id     ++ 0x06 ++ keyword ), sub(infohash, 12))
announce_peer( H( id     ++ 0x07 ++ keyword ), sub(infohash, 14))
announce_peer( H( id     ++ 0x08 ++ keyword ), sub(infohash, 16))
announce_peer( H( id     ++ 0x09 ++ keyword ), sub(infohash, 18))
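
A minimal Python sketch of the publication step, assuming announce_peer(target, value) is a thin wrapper around a real DHT client (value ends up in the fake peer's port field) and reusing the H/sub helpers sketched above:

def publish(announce_peer, infohash: bytes, keyword: bytes) -> None:
    # Publish a 20-byte infohash under a keyword, two bytes per announce.
    id_ = sub(infohash, 0)
    # index record: lets a searcher discover which ids exist for this keyword
    announce_peer(H(b"\x00\x00" + b"\x00" + keyword), id_)
    # data records 1..9: the remaining 18 bytes of the infohash, two at a time
    for i in range(1, 10):
        announce_peer(H(id_ + bytes([i]) + keyword), sub(infohash, 2 * i))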

Search

ids <- get_peers(H( 0x0000 ++ 0x00 ++ keyword ))
foreach (id : ids)
{
    part1 <- get_peers(H( id ++ 0x01 ++ keyword ))[0]
    part2 <- get_peers(H( id ++ 0x02 ++ keyword ))[0]
    part3 <- get_peers(H( id ++ 0x03 ++ keyword ))[0]
    part4 <- get_peers(H( id ++ 0x04 ++ keyword ))[0]
    part5 <- get_peers(H( id ++ 0x05 ++ keyword ))[0]
    part6 <- get_peers(H( id ++ 0x06 ++ keyword ))[0]
    part7 <- get_peers(H( id ++ 0x07 ++ keyword ))[0]
    part8 <- get_peers(H( id ++ 0x08 ++ keyword ))[0]
    part9 <- get_peers(H( id ++ 0x09 ++ keyword ))[0]

    result_infohash <- id ++ part1 ++ part2 ++ ... ++ part9
    print("search result:" ++ result_infohash)
}
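
The search step, under the same assumptions (get_peers(target) wraps the DHT client and returns only the stored 2-byte port values), could look like this:

def search(get_peers, keyword: bytes) -> list:
    # Reassemble candidate infohashes published under a keyword.
    results = []
    for id_ in get_peers(H(b"\x00\x00" + b"\x00" + keyword)):
        parts = []
        for i in range(1, 10):
            found = get_peers(H(id_ + bytes([i]) + keyword))
            if not found:          # any missing part breaks the reassembly
                break
            parts.append(found[0])
        else:
            results.append(id_ + b"".join(parts))  # 2 + 9*2 = 20 bytes
    return results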

I know there would be collisions with id (2 bytes only), but with relatively specific keywords it should work...

We could also build more specific keywords by concatenating several words in alphanumeric order. For example, if we have words A, B and C associated with a torrent, we could publish keywords A, B, C, A ++ B, A ++ C, B ++ C and A ++ B ++ C.
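
For illustration, generating those combined keywords from a word list could look like this (a hypothetical helper, with words treated as bytes and concatenated in sorted order):

from itertools import combinations

def keyword_combinations(words):
    # all non-empty combinations of the sorted words, concatenated in order
    words = sorted(words)
    return [b"".join(combo)
            for r in range(1, len(words) + 1)
            for combo in combinations(words, r)]

# keyword_combinations([b"B", b"A", b"C"])
# -> [b'A', b'B', b'C', b'AB', b'AC', b'BC', b'ABC']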


So, is this awful hack feasible :D ? I know that Retroshare is using BitTorrent's DHT.

1 Answer


It is unlikely to be practical because it does not even try to be efficient (number of lookups) or reliable (failure rate multiplied by number of lookups). And that is for a single keyword, not boolean queries, which would blow up the lookup complexity even further.

Not to mention that it doesn't even solve the hard problems of distributed searching, such as resisting spam and censorship.

Additional problems are that each node could only publish one torrent under a given keyword, and that multiple nodes would somehow have to coordinate what they publish under which keyword before running into the collision problem.

Of course you might be able to make it work in a handful of instances, but that is irrelevant, because uses of p2p protocols should be designed such that they still work in the case that all nodes use the feature in a similar fashion. Clearly an (m * n * 10)-fold blowup of network traffic [m = torrents per keyword, n = number of search terms] is not acceptable; three search terms each matching 50 torrents would already mean 1,500 lookups.

If you are seriously interested in distributed keyword search, I recommend hitting Google Scholar and arXiv and looking for existing research; it is a non-trivial topic.

For BitTorrent specifically you should also look beyond BEP 5. BEP 44 provides arbitrary data storage, and BEPs 46, 49 and 51 describe additional building blocks and abstractions. But I would consider none of them sufficient for a real-time distributed multi-keyword search of the kind one would expect from a local database or an indexing website.

the8472