14

Does anyone know of any tools to provide simple, fast queries of flat files using a SQL-like declarative query language? I'd rather not pay the overhead of loading the file into a DB since the input data is typically thrown out almost immediately after the query is run.

Consider the data file, "animals.txt":

dog 15
cat 20
dog 10
cat 30
dog 5
cat 40

Suppose I want to extract the highest value for each unique animal. I would like to write something like:

cat animals.txt | foo "select $1, max(convert($2 using decimal)) group by $1"

I can get nearly the same result using sort:

cat animals.txt | sort -t " " -k1,1 -k2,2nr

And I can always drop into awk from there, but this all feels a bit awkward (couldn't resist) when a SQL-like language would seem to solve the problem so cleanly.

I've considered writing a wrapper for SQLite that would automatically create a table based on the input data, and I've looked into using Hive in single-processor mode, but I can't help but feel this problem has been solved before. Am I missing something? Is this functionality already implemented by another standard tool?

Halp!

Jonas
  • 121,568
  • 97
  • 310
  • 388
plinehan
  • 700
  • 7
  • 17

8 Answers8

4

I wrote TxtSushi mostly to do SQL selects on flat files. Here is the command chain for your example (all of these commands are from TxtSushi):

tabtocsv animals.txt | namecolumns - | tssql -table animals - \
'select col1, max(as_int(col2)) from animals group by col1'

namecolumns is only required because animals.txt doesn't have a header row. You can get a quick sense of what is possible by looking through the example scripts. There are also links to similar tools on the bottom of the main page.

Keith
  • 2,820
  • 5
  • 28
  • 39
  • Very nice. How well does it scale? I'm hoping to deal with muti-gigabyte files which exceed the available RAM on my machine. – plinehan Apr 01 '10 at 17:26
  • 2
    It does any kind of row filtering or column selection using a streaming approach but as soon as you ask it to do anything requiring a sort (group by, join on, order by all require sorts) it wants to read the full table into memory. In this case you can give the -external-sort option which tells TxtSushi to sort on disk instead, but my current implementation of external sort is very inefficient and needs some work. – Keith Apr 01 '10 at 22:51
1

Perl DBI using DBD::AnyData

harschware
  • 13,006
  • 17
  • 55
  • 87
1

you can use sqlite. Here's an example using Python.

import sqlite3
conn = sqlite3.connect('/tmp/test.db')
cursor = conn.cursor()
try:
    cursor.execute("""create table table1 (word varchar not null, number varchar not null)""")
except: pass
cursor.execute("insert into table1 values ('dog', '15')")
cursor.execute("insert into table1 values ('cat', '20')")
cursor.execute("insert into table1 values ('dog', '10')")
cursor.execute("select max(number) , word from table1 group by word")
print cursor.fetchall()

output

$ ./python.py
[(u'20', u'cat'), (u'15', u'dog')]
ghostdog74
  • 327,991
  • 56
  • 259
  • 343
  • I'd like to be able to do everything right from the shell, which this solution doesn't seem to support. – plinehan Apr 01 '10 at 18:19
1

I just stumbled across this Python script which does something like what you want, although it only supports very basic queries.

David Johnstone
  • 24,300
  • 14
  • 68
  • 71
0

I made a tool that might help. http://www.mccoyonlinestore.com/index.php?txtSearch=mccoy_rdbms your sql could be "Select Max(value) from animals" or it could by "Select * from animals order by value desc"

0

We'll I have a lightweight ORM for sqlite that would simplify this task without requiring any configuration files, etc.

If you can using PowerShell has a lot of powerful capabilities for parsing and querying text files (example here). Otherwise using .NET/Mono you can cut that up in and use LINQ in no time.

mythz
  • 141,670
  • 29
  • 246
  • 390
  • I'd like to be able to do everything right from the shell, which this solution doesn't seem to support. – plinehan Apr 01 '10 at 18:19
  • Which one? Powershell is bash on steroids which lets you do everything from the shell. As for OrmLite (which is what I would use), you write a few lines program that imports all the data to your db of choice then you can use the sqlite3.exe to query from the command prompt. – mythz Apr 05 '10 at 08:50
0

I never managed to find a satisfying answer to my question, but I did at least find a solution to my toy problem using uniqs "-f" option, which I had been unaware of:

cat animals.txt | sort -t " " -k1,1 -k2,2nr \
| awk -F' ' '{print $2, " ", $1}' | uniq -f 1

The awk portion above could, obviously, be skipped entirely if the input file were created with columns in the opposite order.

I'm still holding out hope for a SQL-like tool, though.

plinehan
  • 700
  • 7
  • 17
0

You can look for HXTT JDBC Drivers. They provide JDBC drivers for most type of flat files, excel etc .

You can execute simple SQL queries on it.

They have trial versions available as well

Harshil
  • 883
  • 1
  • 8
  • 25