How could p2p search engines prevent corruption of a distributed index by malicious peers?

Question

As a hobby, I'm writing a simple, primitive distributed web search engine, and it occurred to me that it currently has no protection against malicious peers trying to skew search results.

The current architecture stores the inverted index and ranking factors in a Kad DHT, with peers updating the inverted index as they crawl the web.

I searched Google Scholar for a solution, but most authors of proposed p2p web search engines seem to ignore this problem.

I think I need some kind of reputation system or trust metric, but my knowledge in this domain is lacking, and I would very much appreciate a few pointers.

score 3 · Accepted Answer · answered Jul 23 '14 at 14:47

One way you could avoid this is to only use reliable nodes for storing and retrieving values. The reliability of a node will have to be computed by known-good nodes, and it could be something like the similarity of a node's last few computed ranking factors compared to the same ranking factors computed by known-good nodes (i.e. compare the node's scores for google.com to known-good scores for google.com). Using this approach, you'll need to avoid the "rogue reliable node" problem (for example, by using random checks or reducing all reliability scores randomly).
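As a rough illustration of the reliability idea above, here is a minimal Python sketch. It assumes ranking factors are plain floats keyed by URL and that a known-good node has already computed reference values for some of the same URLs. The function name `reliability` and the relative-error scoring rule are hypothetical choices made for this example, not part of any existing system.

```python
from statistics import mean


def reliability(node_scores, trusted_scores):
    """Return a reliability value in [0, 1] for one node.

    node_scores / trusted_scores: dicts mapping URL -> ranking factor.
    Only URLs present in both dicts are compared (assumption: the trusted
    node spot-checks a sample of the URLs the node has scored).
    """
    shared = node_scores.keys() & trusted_scores.keys()
    if not shared:
        return 0.0  # nothing to compare, so the node earns no trust yet

    agreements = []
    for url in shared:
        expected = trusted_scores[url]
        # Relative error against the known-good value; guard against zero.
        error = abs(node_scores[url] - expected) / max(abs(expected), 1e-9)
        agreements.append(max(0.0, 1.0 - error))
    return mean(agreements)


trusted = {"google.com": 0.9, "example.org": 0.4}
honest = {"google.com": 0.9, "example.org": 0.4}
rogue = {"google.com": 0.09, "example.org": 0.4}

print(reliability(honest, trusted))
print(reliability(rogue, trusted))
```

A node would then be used for storing and retrieving values only while its score stays above some threshold. Sampling the compared URLs at random is one way to make it harder for a "rogue reliable node" to behave well only on the checked entries.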

Another way you could approach this is to duplicate computation of ranking factors across multiple nodes, fetch all of the values at search time, and rank them on the client side (using variance, for example). You could also limit searches to sites that have more than 10 duplicate values computed, so that some time passes before new sites are ranked. Additionally, the client could report in the background any nodes whose values fall outside the normal range, and their reliability scores could be computed this way. This approach is time-consuming for the end user (unless you replicate known-good results to known-good nodes for faster lookups).
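The client-side aggregation described above could look roughly like the following Python sketch. It assumes each replica returns a single float per site. Instead of variance it uses the median and the median absolute deviation, which is a substitution made here because a single extreme value distorts those less. The name `aggregate` and the parameters `min_replicas` and `max_deviations` are hypothetical.

```python
from statistics import median


def aggregate(values_by_node, min_replicas=11, max_deviations=3.0):
    """Combine replicated ranking values fetched from several nodes.

    values_by_node: dict mapping node id -> ranking value for one site.
    Returns (consensus_value, suspects). The consensus is None when fewer
    than min_replicas values exist (i.e. the ">10 duplicates" rule).
    """
    if len(values_by_node) < min_replicas:
        return None, []

    values = list(values_by_node.values())
    center = median(values)
    # Median absolute deviation: a robust estimate of the normal spread.
    mad = median([abs(v - center) for v in values])
    threshold = max_deviations * mad

    # Nodes far from the consensus are candidates for a background report.
    suspects = sorted(
        node for node, v in values_by_node.items() if abs(v - center) > threshold
    )
    return center, suspects


replies = {f"node{i}": 0.5 + 0.01 * (i % 3 - 1) for i in range(10)}
replies["mallory"] = 9.9

print(aggregate(replies))
```

The suspect list is what the client would report in the background so that reliability scores can be adjusted.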

Also, take a look at this paper which describes a sybil-proof weak-trust system (which, as the authors explain, is more robust than the impossible sybil-proof strong-trust system): http://www.eecs.harvard.edu/econcs/pubs/Seuken_aamas14.pdf

Thank you for the link, the paper is very interesting. — Moonwalker, Jul 24 '14 at 14:58

score 1 · Answer 2 · edited Jul 16 '14 at 20:39

The problem you are describing is the Byzantine Generals’ Problem (Byzantine fault tolerance). You can read more about it on Wikipedia, and there should be plenty of papers written about it.

I don’t remember the exact algorithm, but it is mathematically proven that to tolerate t traitors (malicious peers) you need at least 3*t + 1 peers in total, so that the honest peers can still reach agreement.
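To make the 3*t + 1 bound concrete, here is a small Python sketch of the arithmetic only (it is not an agreement protocol). The function names are hypothetical.

```python
def min_peers(traitors: int) -> int:
    """Smallest group size that satisfies the 3*t + 1 bound for t traitors."""
    return 3 * traitors + 1


def max_traitors(peers: int) -> int:
    """Largest number of traitors a group of the given size can tolerate."""
    return max(0, (peers - 1) // 3)


print(min_peers(1))      # group size needed for one traitor
print(max_traitors(10))  # traitors tolerated by ten peers
```

For example, a replica group of 10 peers satisfies the bound for at most 3 malicious members, and a group of 3 cannot tolerate even one.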

My general thought is that this is a huge implementation overhead and a waste of resources on the indexing side. While there is plenty of research to be done in distributed indexing and distributed search, not many people are tackling it yet. Also, the problem has basically been solved by the work on the Byzantine Generals’ Problem; it “just” needs to be implemented on top of an existing (and working) distributed search engine.

Thanks for the links, but I need something practical, and considering that running a Sybil node is even easier than running an honest one, I need a different strategy. — Moonwalker, Jul 16 '14 at 17:38

score 0 · Answer 3 · answered Jul 23 '14 at 15:14

If you don't mind having a time delay on index updates, you could opt for a block-chain algorithm similar to what bitcoin uses to secure funds.

Changes to the index (deltas only!) can be represented in a text or binary file format, and crunched by peers who accept a given block of deltas. A malicious peer would have to out-compute the rest of the network for a period of time in order to skew the index in their favor.

I believe the Bitcoin hashing algorithm (SHA-256) to be flawed in that custom hardware renders common users' hardware useless. A block chain using Litecoin's algorithm (scrypt) would work well, because CPUs and GPUs are effective tools for the computation.

You would tune the difficulty so that new blocks are produced on a fairly regular schedule, maybe every 2-5 minutes. A user of the search engine could possibly choose to use an index that is at least 30 minutes old, to guarantee that enough users in the network vouch for its contents.
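A minimal Python sketch of the idea, under stated assumptions: index deltas are JSON-serializable dicts, a block is valid when its hash starts with a fixed number of zero hex digits, and SHA-256 from `hashlib` stands in for the hash function purely to keep the example short (the answer above prefers scrypt). All names and the delta format are hypothetical, and real difficulty adjustment and networking are left out.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first block


def block_hash(prev_hash, deltas, nonce):
    """Hash the block contents deterministically."""
    payload = json.dumps(
        {"prev": prev_hash, "deltas": deltas, "nonce": nonce}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def mine_block(prev_hash, deltas, difficulty=3):
    """Search for a nonce whose hash has `difficulty` leading zero hex digits."""
    prefix = "0" * difficulty
    nonce = 0
    while True:
        digest = block_hash(prev_hash, deltas, nonce)
        if digest.startswith(prefix):
            return {"prev": prev_hash, "deltas": deltas, "nonce": nonce, "hash": digest}
        nonce += 1


def verify_chain(chain, difficulty=3):
    """Check the links between blocks and the proof of work of every block."""
    prefix = "0" * difficulty
    prev = GENESIS
    for block in chain:
        digest = block_hash(block["prev"], block["deltas"], block["nonce"])
        if block["prev"] != prev or digest != block["hash"]:
            return False
        if not digest.startswith(prefix):
            return False
        prev = digest
    return True


first = mine_block(GENESIS, [{"op": "add", "term": "kademlia", "url": "http://example.org/dht"}])
second = mine_block(first["hash"], [{"op": "remove", "term": "spam", "url": "http://example.org/old"}])
chain = [first, second]
print(verify_chain(chain))
```

Because every block commits to the hash of the previous one, changing an old delta invalidates that block and all later ones, so an attacker has to redo the work for the whole tail of the chain faster than the rest of the network extends it. Waiting for several blocks before trusting a delta corresponds to the "index at least 30 minutes old" suggestion.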

More info:
https://en.bitcoin.it/wiki/Block_chain
https://en.bitcoin.it/wiki/Block_hashing_algorithm
https://litecoin.info/block_hashing_algorithm
https://www.coinpursuit.com/pages/bitcoin-altcoin-SHA-256-scrypt-mining-algorithms/
