
I'm looking to distribute some information to different machines for efficient and extremely fast access without any network overhead. The data exists in a relational schema, and it is a requirement to "join" on relations between entities, but it is not a requirement to write to the database at all (it will be generated offline).

I had a lot of confidence that SQLite would deliver on performance, but an RDBMS seems to be unsuitable at a fundamental level: joins are very expensive due to the cost of index lookups, and in my read-only context they are an unnecessary overhead, since entities could store direct references to each other in the form of file offsets. In this way, an index lookup is replaced by a file seek.
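
To make that concrete, here is a rough PHP sketch of the offset-based lookup I have in mind (the fixed-width record layout, field sizes, and file name are all hypothetical):

    <?php
    // Rough sketch only: assumes a hypothetical flat file of fixed-width
    // 64-byte records, where the first 4 bytes of a product record hold
    // the file offset of its category record (unsigned 32-bit LE).
    const RECORD_SIZE = 64;

    function readRecord($fp, $offset) {
        fseek($fp, $offset);          // one seek instead of an index lookup
        return fread($fp, RECORD_SIZE);
    }

    $fp = fopen('products.dat', 'rb');
    $product = readRecord($fp, 3 * RECORD_SIZE);   // e.g. the fourth record
    $parts = unpack('V', substr($product, 0, 4));  // category's file offset
    $category = readRecord($fp, $parts[1]);
    fclose($fp);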

What are my options here? "Database" doesn't really seem to describe what I'm looking for. I'm aware of Neo4j, but I can't embed Java in my app.

TIA!

Edit, to answer the comments:

  • The data will be up to 1 GB in size, and I'm using PHP, so keeping the data in memory is not really an option. I will rely on the OS buffer cache to avoid continually going to disk.
  • Example would be a Product table with 15 fields of mixed types, and a query to list products with a certain make, joining on a Category table (roughly sketched in the example after this list).
  • The solution will have to be some kind of flat file. I'm wondering if there already exists some software that meets my needs.
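
To illustrate the second bullet, here is roughly what that query would look like from PHP (names like products.make and category_id are placeholders, not my real schema):

    <?php
    // Illustrative only: table/column names (products.make, category_id,
    // categories.name) are assumptions about the schema described above.
    $db = new SQLite3('catalog.db', SQLITE3_OPEN_READONLY);
    $stmt = $db->prepare(
        'SELECT p.*, c.name AS category_name
           FROM products p
           JOIN categories c ON c.id = p.category_id
          WHERE p.make = :make'
    );
    $stmt->bindValue(':make', 'Acme', SQLITE3_TEXT);
    $result = $stmt->execute();
    while ($row = $result->fetchArray(SQLITE3_ASSOC)) {
        // one product per row, with its category name joined in
    }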

@Mark Wilkins:

The performance problem is measured. Essentially, it is unacceptable in my situation to replace a 2ms IO-bound query to Memcache with a 5ms CPU-bound call to SQLite. For example, the categories table has 500 records, containing parent and child categories. The following query takes ~8ms, with no disk IO: SELECT 1 FROM categories a INNER JOIN categories b ON b.id = a.parent_id. Some simpler, join-less queries are very fast.

Scott
  • You are saying that this is a relational schema and that you need to join on relations, but you don't want to use a relational database? Why not use a flat file? – Oded Dec 19 '11 at 21:38
  • Give examples and approximate sizing of your "DB". – CAFxX Dec 19 '11 at 22:26

2 Answers


I may not be completely clear on your goals as to the types of queries you need. But the part about storing file offsets to other data seems like it would be a very brittle solution that is hard to maintain and debug. There might be some tool that would help with it, but my suspicion is that you would end up writing most of it yourself, and anyone who had to come along later to debug and figure out a homegrown file format would have even more work.

However, my first thought is to wonder whether the described performance problem is estimated at this point or actually measured. Have you run tests with the data in a relational format to see how fast it actually is? It is true that a join will almost always involve more file reads (do the binary search on the index as you mentioned, then get the associated record information, then look up that record). This could take four or five or more disk operations ... at first. But the categories table (from the OP) could end up cached if it is commonly hit. This is a complete guess on my part, but in many situations the number of categories is relatively small. If that is the case here, the entire categories table and its index may stay cached in memory by the OS and thus result in very fast joins.

If the performance is indeed a real problem, another possibility might be to denormalize the data. In the categories example, just duplicate the category value/name and store it with each product record. The database size will grow as a result, but you could still use an embedded database (there are a number of possibilities). If done judiciously, it could still be maintained reasonably well and would provide the ability to read a full object with one lookup/seek and one read.
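
As a rough illustration of the idea (run offline when the database is generated; table and column names here are assumptions, not from the question):

    <?php
    // Sketch only: copy the category name into each product row offline,
    // so reads need no join at all. Column names are hypothetical.
    $db = new SQLite3('catalog.db');
    $db->exec('ALTER TABLE products ADD COLUMN category_name TEXT');
    $db->exec(
        'UPDATE products
            SET category_name = (SELECT name
                                   FROM categories
                                  WHERE categories.id = products.category_id)'
    );
    // Reads then become a single-table lookup:
    //   SELECT * FROM products WHERE make = :make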

Mark Wilkins

In general, probably the fastest thing you can do at first is to denormalize your data, thus avoiding JOINs and other multi-table lookups.

Using SQLite you can certainly customize all sorts of things and tailor them to your needs. For example:

  • Disable all mutexing if you're only accessing via one thread.
  • Increase the memory cache size.
  • Customize indexes (including getting rid of many of them).
  • Do a custom build to disable unnecessary metadata, debugging, etc.
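
For example, a rough sketch of that kind of setup from PHP (the values are illustrative only, and mmap_size requires SQLite 3.7.17 or later):

    <?php
    // Illustrative values only; tune against your own data.
    $db = new SQLite3('catalog.db', SQLITE3_OPEN_READONLY); // no write locks
    $db->exec('PRAGMA cache_size = -262144');   // ~256 MB page cache (KiB when negative)
    $db->exec('PRAGMA temp_store = MEMORY');    // temp structures in RAM, not on disk
    $db->exec('PRAGMA mmap_size = 1073741824'); // memory-map up to 1 GB of the file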

This is all, of course, assuming a database is what you need.

NuSkooler
  • Denormalization isn't an option due to how much it would increase the size of the data. – Scott Dec 20 '11 at 20:00
  • I'll reiterate: look at the bullet-point items listed even if denormalization isn't an option (still, I wouldn't completely discount it, as you may be able to denormalize minor pieces of data and get major speed boosts at little extra space cost). Custom tailor a SQLite build + pragma setup, etc. for your needs. Use EXPLAIN to dissect all of your queries and tailor them for speed + make use of the query planner (see the sketch below). – NuSkooler Dec 20 '11 at 20:44
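
As a quick sketch of that last suggestion, EXPLAIN QUERY PLAN can be run from PHP against the query timed in the question, to see whether the join uses an index or a full table scan:

    <?php
    // The query is the one measured in the question; EXPLAIN QUERY PLAN is
    // standard SQLite and reports how each table in the join is accessed.
    $db = new SQLite3('catalog.db', SQLITE3_OPEN_READONLY);
    $plan = $db->query(
        'EXPLAIN QUERY PLAN
         SELECT 1 FROM categories a INNER JOIN categories b ON b.id = a.parent_id'
    );
    while ($row = $plan->fetchArray(SQLITE3_ASSOC)) {
        echo $row['detail'], "\n"; // e.g. "SEARCH TABLE b USING INTEGER PRIMARY KEY (rowid=?)"
    }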