I am having a problem with log search speed and disk size. The log is extremely big: it has about 220 million rows, takes up 25 gigabytes of disk space, and some selects take several minutes to complete.
How does it work? The log is saved in the database using SQL Anywhere, currently version 9, and will soon be migrated to 11 (we tried 12, but due to driver and other problems we went back to 11).
The log consists of two tables (names translated to English so people here can understand):
LogTable
Id, DateTime, User, Url, Action and TableName. Action is what the user did (insert/delete/update) and TableName is which table in the database was affected.
LogTableFields
Id, LogTable_Id, FieldName, NewValue, OldValue. LogTable_Id is a foreign key to LogTable. FieldName is the field of the affected table.
It is important to note that NewValue and OldValue are varchar, because they have to record every kind of field from the other tables (datetime, int, etc.).
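For clarity, here is a rough sketch of what the schema looks like (the exact column types and sizes in our database may differ; this is just to illustrate the structure):

```sql
-- Approximate structure of the two log tables (illustrative, not the real DDL)
CREATE TABLE LogTable (
    Id        INTEGER NOT NULL PRIMARY KEY,
    DateTime  TIMESTAMP NOT NULL,
    "User"    VARCHAR(50),
    Url       VARCHAR(255),
    Action    VARCHAR(10),    -- 'insert' / 'delete' / 'update'
    TableName VARCHAR(50)     -- which table in the database was affected
);

CREATE TABLE LogTableFields (
    Id          INTEGER NOT NULL PRIMARY KEY,
    LogTable_Id INTEGER NOT NULL REFERENCES LogTable (Id),
    FieldName   VARCHAR(50),
    NewValue    VARCHAR(255),  -- values of any original type stored as text
    OldValue    VARCHAR(255)
);
```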
Why was it made this way? Because we must record everything important. The system is made for a state Department of Traffic (I'm not sure that is the proper English name, but you get the idea), and sometimes they demand some kind of random report.
Until now, we have made these reports simply by running an SQL select. However, it takes several minutes to complete, even when filtered by datetime. That isn't a big issue when reports aren't requested often.
But they are demanding more and more reports, to the point that we need to build a feature in the software that produces a nice, polished report. Since we never know their needs in advance, we must go back to the log and unbury the data.
Some of the information requested exists only in the log (e.g. which user improperly gave someone access to a vehicle).
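A typical report query looks roughly like the sketch below (the filter values and the 'VehicleAccess' table name are made up for illustration); even with the DateTime filter, it has to touch a huge number of rows:

```sql
-- Illustrative example of the kind of ad hoc report query we run today;
-- the real queries vary, but they all join the two log tables like this
SELECT l.DateTime, l."User", l.Action, l.TableName,
       f.FieldName, f.OldValue, f.NewValue
FROM LogTable l
JOIN LogTableFields f ON f.LogTable_Id = l.Id
WHERE l.DateTime BETWEEN '2011-01-01' AND '2011-03-31'
  AND l.TableName = 'VehicleAccess'   -- hypothetical affected table
ORDER BY l.DateTime;
```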
Some ideas suggested so far:
Idea 1: I did some research and was told to try NoSQL, using CouchDB. But from the little I have read, I feel NoSQL isn't a solution for my problem. I can't argue why, since I have no experience with it.
Idea 2: Separate the log tables physically from the main database, or move them to another machine.
Idea 3: Create a mirror of every table with a version field to keep history (see the sketch below).
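For Idea 3, what I have in mind is something like the following (the Vehicle table and its columns are purely a hypothetical example):

```sql
-- Hypothetical example: a history/mirror table for a Vehicle table,
-- where every insert/update writes a new row with an incremented version
CREATE TABLE Vehicle_History (
    Id        INTEGER NOT NULL,   -- same Id as in the Vehicle table
    Version   INTEGER NOT NULL,   -- 1, 2, 3, ... per Id
    Plate     VARCHAR(10),
    OwnerId   INTEGER,
    ChangedAt TIMESTAMP NOT NULL,
    ChangedBy VARCHAR(50),
    PRIMARY KEY (Id, Version)
);
```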
I'd welcome a macro-level optimization, or an architecture change if that is what it takes.