
I have been provided with some data in an SQL table which needs to be transformed according to some rules. However, the transformation requires dynamically addressing column names: the name of the column in which a value should be stored is itself stored as a value in another column. Also, depending on which of a range of columns contain values, a single input row should produce multiple output rows, so this is a kind of transposing, I guess.

Let me illustrate it with an example:

id | targetColumnName | col1 | col2 | col3 | col4
-------------------------------------------------
 1 | foo              |      |    5 |      |
 2 | foo              |    7 |      |   42 |
 3 | bar              |      |    3 |      |    6
 4 | foo              |      |      |    5 |
 5 | bar              |      |      |      |
 6 | bar              |      |      |    2 |   12

This should produce the following:

id | foo | bar
--------------
 1 |   5 |
 2 |   7 |
 2 |  42 |
 3 |     |   3
 3 |     |   6
 4 |   5 |
 6 |     |   2
 6 |     |  12

In other words, for each value present in any of the colX columns, produce an output row with that value in the column designated by targetColumnName.
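
Roughly, the rule could be sketched like this (just to pin down the requirement, not a solution I have; variable names are placeholders and the tuples mirror the example table above):

# Example input rows as (id, targetColumnName, col1, col2, col3, col4),
# with None for empty cells.
rows = [
    (1, 'foo', None, 5, None, None),
    (2, 'foo', 7, None, 42, None),
    (3, 'bar', None, 3, None, 6),
    (4, 'foo', None, None, 5, None),
    (5, 'bar', None, None, None, None),
    (6, 'bar', None, None, 2, 12),
]

output = []
for id_, target, *cols in rows:
    # One output row per non-empty colX value, stored under the target column.
    for value in cols:
        if value is not None:
            output.append({'id': id_, target: value})
# output == [{'id': 1, 'foo': 5}, {'id': 2, 'foo': 7}, {'id': 2, 'foo': 42}, ...]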

This leads me to believe that using SQL at all is not a good idea, at least not with a simple query. It's probably possible with a compound statement / stored procedure, but that still seems like the wrong tool for the job to me.

I also have the data in CSV format, and I have a feeling that a powerful scripting language with text manipulation capabilities would suit the job better, but I have only scratched the surface of Python, don't know any Perl, and know only a little awk. I know I could pull it off in Java, but I still have a feeling this can be done with pretty much a magic one-liner in Python. :)

So the questions are essentially 1) is it reasonable to do this with SQL(ite) and 2) what other tool would be better?

JHH
  • This is not a good question. Everyone can answer it differently: I would prefer a stored procedure approach, someone else would write a Java app. You cannot get a definite answer for such questions. – tafa May 31 '16 at 07:23
  • I am not asking for a Java solution; I can do that myself. I realize that asking which language does a job best *may* become opinionated, but I am not simply asking "which is best", I am asking which language has the kind of built-in functionality suited for this kind of processing. – JHH May 31 '16 at 07:32

1 Answer


While not exactly a one-liner, this would be relatively easy in Python (*). The csv module in the standard library is your friend.

Assuming the target column names are not known in advance and the file is too big to be read into memory, you have to read the CSV once to collect all possible column names and then a second time to do the actual transformation.

Here is some pseudo code:

import csv

# First pass: collect all target column names that occur in the data.
fieldnames = set()
with open('in.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    next(reader)  # skip the header row (assumes the CSV has one)
    for row in reader:
        fieldnames.add(row[1])
fieldnames = ['id'] + sorted(fieldnames)  # id first, then e.g. bar, foo

# Second pass: emit one output row per non-empty colX value.
with open('in.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    next(reader)  # skip the header row again
    with open('out.csv', 'w', newline='') as outfile:
        writer = csv.DictWriter(outfile, fieldnames=fieldnames)
        writer.writeheader()
        for row in reader:
            id_ = row[0]
            fieldname = row[1]
            for field in row[2:]:
                if field:
                    writer.writerow({'id': id_, fieldname: field})

You might have to tweak some parameters to csv.reader depending on the dialect of the CSV.
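
For example, if the file turned out to be semicolon-delimited (an assumption, adjust to whatever your export actually uses), the reader could be set up like this; csv.Sniffer can also guess the dialect from a sample:

import csv

with open('in.csv', newline='') as csvfile:
    # delimiter and quotechar here are assumptions about the export format.
    reader = csv.reader(csvfile, delimiter=';', quotechar='"')
    for row in reader:
        print(row)

# Alternatively, let the csv module guess the dialect from the first chunk:
with open('in.csv', newline='') as csvfile:
    dialect = csv.Sniffer().sniff(csvfile.read(1024))
    csvfile.seek(0)
    reader = csv.reader(csvfile, dialect)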

(*) just over 20 non-empty lines :)

Daniel Hepper