
At my office we have a legacy accounting system that stores all of its data in plain-text files (TXT extension) with fixed-width records. Each data file is named something like FILESALE.TXT. My goal is to bring this data into our MySQL server for read-only use by the many other programs that can't interface with the legacy software. Each file is essentially one table.

There are about 20 files in total that I need to access, roughly 1 GB of total data. Each line might be 350-400 characters wide and have 30-40 columns. After pulling the data in, no MySQL table is much bigger than 100 MB.

The legacy accounting system can modify any row in the text file, delete old rows (it has a deleted record marker -- 0x7F), and add new rows at any time.

For several years now I have been running a cron job every 5 minutes that:

  1. Check each data file's last modification time.
  2. If the file has not been modified, skip it. Otherwise:
  3. Parse the data file, clean up any issues (very simple checks only), and spit out a tab-delimited file of the columns I need (some columns I just ignore); a rough sketch of this step is below.
  4. TRUNCATE the table and import the new data into our MySQL server like this:

    START TRANSACTION;
    TRUNCATE legacy_sales;
    LOAD DATA INFILE '/tmp/filesale.data' INTO TABLE legacy_sales;
    COMMIT;
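
For reference, step 3 boils down to something like the sketch below (shown here in Python for illustration; the column offsets are made up, the real layout has 30-40 fields, and exactly how the 0x7F marker appears in a deleted row is glossed over):

    # Placeholder layout: the real records have 30-40 fixed-width fields,
    # and where the 0x7F deleted-record marker sits in a row is an assumption.
    FIELDS = [(0, 10), (10, 30), (30, 42)]    # (start, end) offsets per column
    DELETED = b'\x7f'

    def parse(in_path, out_path):
        with open(in_path, 'rb') as src, open(out_path, 'w') as dst:
            for raw in src:
                if DELETED in raw:
                    continue                  # skip rows the legacy system marked deleted
                line = raw.decode('latin-1')
                cols = [line[a:b].strip() for a, b in FIELDS]
                dst.write('\t'.join(cols) + '\n')

    parse('FILESALE.TXT', '/tmp/filesale.data')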
    

The cron script runs each file check and parse in parallel, so the whole updating process doesn't really take very long. The biggest table (changed infrequently) takes ~30 seconds to update, but most of the tables take less than 5 seconds.

This has been working OK, but there are some issues. I suspect it messes with database caching, so each time I TRUNCATE and LOAD a table, other programs that use the MySQL database are slow at first. Additionally, since I switched to running the updates in parallel, the database can be in a slightly inconsistent state for a few seconds.

This whole process seems horribly inefficient! Is there a better way to approach this problem? Any thoughts on optimizations or procedures that might be worth investigating? Any neat tricks from anyone who faced a similar situation?

Thanks!

Raolin

3 Answers


Couple of ideas:

  • If the rows in the text files have a modification timestamp, you could update your script to keep track of when it runs, and then only process the records that have been modified since the last run.

  • If the rows in the text files have a field that can act as a primary key, you could maintain a fingerprint cache for each row, keyed by that id, and use it to detect when a row changes and skip unchanged rows. That is, in the loop that reads the text file, calculate the SHA1 (or whatever) hash of the whole row and compare it to the hash from your cache. If they match, the row hasn't changed, so skip it. Otherwise, update/insert the MySQL record and store the new hash value in the cache. The cache could be a GDBM file, a memcached server, a fingerprint field in your MySQL tables, whatever. This leaves unchanged rows untouched (and thus still cached) on MySQL. There is a rough sketch after this list.

  • Perform updates inside a transaction to avoid inconsistencies.
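
To make the fingerprint idea concrete, here is a rough sketch (Python, using a dbm file for the cache and MySQL Connector/Python for the writes; the table name, column names, paths, and credentials are all invented for illustration):

    import dbm
    import hashlib

    import mysql.connector

    # Sketch only: table name, columns, paths, and credentials are placeholders.
    db = mysql.connector.connect(host='localhost', user='etl',
                                 password='secret', database='legacy')
    cur = db.cursor()
    cache = dbm.open('/var/cache/filesale.fingerprints', 'c')   # e.g. a GDBM file

    with open('/tmp/filesale.data') as src:
        for line in src:
            fields = line.rstrip('\n').split('\t')
            key = fields[0]                       # the column acting as primary key
            digest = hashlib.sha1(line.encode()).hexdigest()
            if cache.get(key) == digest.encode():
                continue                          # unchanged row: leave MySQL alone
            cur.execute(
                "INSERT INTO legacy_sales (id, col_a, col_b) VALUES (%s, %s, %s) "
                "ON DUPLICATE KEY UPDATE col_a = VALUES(col_a), col_b = VALUES(col_b)",
                fields[:3])
            cache[key] = digest                   # remember the new fingerprint

    db.commit()
    cache.close()

Note that rows which disappear from the text file still need a separate pass (for example, delete ids that are in the cache but no longer in the file).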

Alex Howansky

Two things come to mind. I won't go into too much detail, but feel free to ask questions:

  1. A service that offloads the processing of the file to an application server and then just populates the MySQL table. You can even build in intelligence by checking for duplicate records rather than truncating the entire table (a sketch of that follows this list).

  2. Offload the processing to another MySQL server and replicate/transfer the data over.
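
For the duplicate-record idea in point 1, one rough sketch (Python with MySQL Connector/Python; the staging table, columns, path, and credentials are all invented) is to load the parsed file into a staging table and merge it into the live table instead of truncating it:

    import mysql.connector

    # Sketch only: tables, columns, path, and credentials are placeholders.
    db = mysql.connector.connect(host='localhost', user='etl',
                                 password='secret', database='legacy')
    cur = db.cursor()

    # Load the freshly parsed file into an empty staging table...
    cur.execute("TRUNCATE legacy_sales_staging")
    cur.execute("LOAD DATA INFILE '/tmp/filesale.data' "
                "INTO TABLE legacy_sales_staging")

    # ...then merge: existing rows are updated in place, new rows are inserted,
    # and the live table is never emptied out from under its readers.
    cur.execute(
        "INSERT INTO legacy_sales (id, col_a, col_b) "
        "SELECT id, col_a, col_b FROM legacy_sales_staging "
        "ON DUPLICATE KEY UPDATE col_a = VALUES(col_a), col_b = VALUES(col_b)")
    db.commit()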

RandomUs1r

I agree with Alex's tips. If you can, update only the modified fields, and batch the changes using transactions and grouped multi-row inserts. An additional benefit of transactions is faster updates.

If you are concerned about downtime, insert into a new table instead of truncating the existing one, then rename it.
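
Roughly like this (a sketch in Python with MySQL Connector/Python; the table names, file path, and connection details are placeholders):

    import mysql.connector

    # Sketch only: table names, path, and credentials are placeholders.
    db = mysql.connector.connect(host='localhost', user='etl',
                                 password='secret', database='legacy')
    cur = db.cursor()

    # Build the replacement table off to the side...
    cur.execute("CREATE TABLE legacy_sales_new LIKE legacy_sales")
    cur.execute("LOAD DATA INFILE '/tmp/filesale.data' INTO TABLE legacy_sales_new")

    # ...then swap it in and drop the old copy.
    cur.execute("RENAME TABLE legacy_sales TO legacy_sales_old, "
                "legacy_sales_new TO legacy_sales")
    cur.execute("DROP TABLE legacy_sales_old")
    db.commit()

RENAME TABLE swaps both names in a single atomic operation, so other clients never see a missing or half-loaded table.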

For improved performance, make sure you have proper indexes on the fields you query.

Also look at database-specific performance tips, such as:

  • delayed inserts in MySQL improve performance
  • caches can be optimized
  • even if you do not have unique rows, you may (or may not) be able to MD5 the rows

Youn Elan