
Background:

SQLite is great for doing SQL operations on data that has been loaded into a database, but in my line of work I often need to do selects, joins, and where clauses on files that aren't in a database and aren't worth the time to load and initialize into one. Also, SQLite's random-access characteristics often make operations that touch every row in a table slower.

Question:

Is there a suite of SQL-style commands/functions (preferably Python or bash) that doesn't need SQLite and works directly on raw tab-separated files? For instance, instead of using table and column names to select rows, just use column numbers.

Example

select col1,col2,col3 from fileName.tsv where col1[int] < 3

Note: I realize a lot of this can be accomplished with awk, cut, bash-join, etc; I was wondering if there was something more SQLesque?
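
For concreteness, a hand-rolled Python version of that example query might look something like this (just a sketch; fileName.tsv and the column positions are placeholders):

import csv

# roughly: select col1,col2,col3 from fileName.tsv where col1[int] < 3
with open('fileName.tsv') as f:
    for row in csv.reader(f, delimiter='\t'):
        if int(row[0]) < 3:
            print(row[0], row[1], row[2])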

sequenceGeek
  • I'm not sure about Python, but if you wanted to join us over here on the Dark Side (we have cookies! :P), you can use the Perl module DBD::CSV (http://search.cpan.org/~hmbrand/DBD-CSV-0.33/lib/DBD/CSV.pm). –  Sep 20 '11 at 23:42
  • 1
    @JackManey NEVER! :) Yeah DBD::CSV looks good, I showed it to my perl co-worker so he can use it. Thanks for the tip. – sequenceGeek Sep 21 '11 at 01:48

3 Answers


You can hack something together using the csv module and list comprehensions:

import csv

reader = csv.reader(open('data.csv', 'r'))
rows = [row for row in reader]

# select * from data where first column < 4
# this won't actually work as-is! see the edit below
[row for row in rows if row[0] < 4]

# select * from data where second column >= third column
[row for row in rows if row[1] >= row[2]]

# select columns 1 and 3 from data where first column is "mykey"
[[row[0], row[2]] for row in rows if row[0] == "mykey"]

You can probably do some even more impressive things with Python’s functional programming tools, although if you’re not already familiar with FP it’s probably too big of a topic to learn just for this ;-)
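
For instance, a filter/itemgetter version of a couple of the queries above might look something like this (just a sketch, and it assumes the relevant columns have already been converted to numbers):

from operator import itemgetter

# select * from data where first column < 4 (numeric columns assumed)
list(filter(lambda row: row[0] < 4, rows))

# select columns 1 and 3 from data where first column is "mykey"
list(map(itemgetter(0, 2), filter(lambda row: row[0] == "mykey", rows)))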


Edit: A couple more tips:

  • If you’re only going to perform one “query” per script, you can cut out the intermediate storage of the data (rows in my example):

    import csv
    reader = csv.reader(open('data.csv', 'r'))
    result = [row for row in reader if row[0] == "banana"]
    
  • The csv reader generates all of its output as text, so if you want to treat one column as e.g. an integer you’ll have to do that yourself. For example, if your second and third columns are integers,

    import csv
    reader = csv.reader(open('data.csv', 'r'))
    rows = [[row[0], int(row[1]), int(row[2])] for row in reader]
    # perform a "select" on rows now
    

    (This means that my first example above won’t actually work as-is.) If all of your columns are integers then you can use the built-in map function:

    import csv
    reader = csv.reader(open('data.csv', 'r'))
    # wrap map() in list() so this also works on Python 3, where map() is lazy
    rows = [list(map(int, row)) for row in reader]
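
If you find yourself doing this a lot, you could wrap the pattern up in a small helper, something like this (a rough sketch; the file name, delimiter, and converters are all assumptions about your data):

import csv

def query(path, where, select, delimiter='\t', converters=None):
    """Poor man's SELECT over a delimited text file."""
    with open(path, newline='') as f:
        for row in csv.reader(f, delimiter=delimiter):
            if converters:
                row = [convert(value) for convert, value in zip(converters, row)]
            if where(row):
                yield [row[i] for i in select]

# select columns 1 and 3 where the second column is at least the third
# (assumes the last two columns are integers)
hits = list(query('data.tsv', converters=[str, int, int],
                  where=lambda r: r[1] >= r[2], select=[0, 2]))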
    
bdesham

After googling "python equivalent of DBD::CSV", I found KirbyBase. That looks as though it'll fit the bill.

Since I generally don't use Python, however, I've never tried it.

Edited to add: Okay, after taking a glance at the documentation, the query commands aren't exactly SQL, but they're a lot more SQLesque than using awk.


I would strongly recommend Microsoft's Log Parser 2.2, except I figure you're using Linux, and I'm pretty sure it won't work there. But I'll put the links here in case anyone reading this isn't on Linux.

http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=24659
http://www.codinghorror.com/blog/2005/08/microsoft-logparser.html

Jody