1

I have many csv files which are "column" oriented and that I need to pre-process to finally index them.

This is time oriented data, with a very large number of columns for each "device" (up to 128 columns) like:

LDEV_XXXXXX.csv             
Serial number : XXXXX(VSP)              
From : 2014/06/04 05:58             
To   : 2014/06/05 05:58             
sampling rate : 1               

"No.","time","00:30:00X(X2497-1)","00:30:01X(X2498-1)","00:30:02X(X2499-1)"
"242","2014/06/04 10:00",0,0,0
"243","2014/06/04 10:01",0,0,0
"244","2014/06/04 10:02",9,0,0
"245","2014/06/04 10:03",0,0,0
"246","2014/06/04 10:04",0,0,0
"247","2014/06/04 10:05",0,0,0

My goal is to transpose (if it the term is the right one) data into rows, such that i will be able to manipulate the data much more efficiently, such as:

"time",device,value
"2014/06/04 10:00","00:30:00X(X2497-1)",0
"2014/06/04 10:00","00:30:01X(X2498-1)",0
"2014/06/04 10:00","00:30:02X(X2499-1)",0
"2014/06/04 10:01","00:30:00X(X2497-1)",0
"2014/06/04 10:01","00:30:01X(X2498-1)",0
"2014/06/04 10:01","00:30:02X(X2499-1)",0
"2014/06/04 10:02","00:30:00X(X2497-1)",9
"2014/06/04 10:02","00:30:01X(X2498-1)",0
"2014/06/04 10:02","00:30:02X(X2499-1)",0

And so on...

Note: I have let the raw data (which is uses "," as a separator), you would note that I need to delete the 6 first lines the "No" column which has no interest, but this is not the main goal and difficulty)

I have a python starting code to transpose csv data, but it doesn't exactly what i need...

import csv
import sys
infile = sys.argv[1]
outfile = sys.argv[2]

with open(infile) as f:
    reader = csv.reader(f)
    cols = []
    for row in reader:
        cols.append(row)

with open(outfile, 'wb') as f:
    writer = csv.writer(f)
    for i in range(len(max(cols, key=len))):
        writer.writerow([(c[i] if i<len(c) else '') for c in cols])

Note the number of columns are arbitrary, something a few, and up to 128 depending on files.

I'm pretty sure this is a common need but I couldn't yet find the exact python code that does this, or I couldn't get...

Edit:

More precision:

Each timestamp row will be repeated by the number of devices, so that the file will have much more lines (multiplied by the number of devices) but only a few rows (timestamp,device,value) The final desired result has been updated :-)

Edit:

I would like to be able to use the script using argument1 for infile and argument2 for outfile :-)

Guilmxm
  • 47
  • 1
  • 9
  • given this `244,2014/06/04 10:02,9,0,0` shouldn't the transformed result look like this: `2014/06/04 10:02,00:30:02X(X2499-1),9` <---THIS – colinro Jun 18 '14 at 23:12
  • and you probably want a row for each device at each time, so there would be multiple rows with time value `2014/06/04 10:02`, one for each device with data recorded from that time slice? – colinro Jun 19 '14 at 08:48
  • Yes absolutely, the number of lines will be multiplicated by the number of device, the file will have much more lines but only few columns, so the timestamp will be repeated for each device – Guilmxm Jun 19 '14 at 09:37
  • Awesome, I feel comfortable answering now. you should update your question to clarify that last point. – colinro Jun 19 '14 at 09:40
  • 1
    You are absolutely right, i've corrected the final desired result and improved the question, thank you and sorry for that imprecision :-) – Guilmxm Jun 19 '14 at 09:56

2 Answers2

2

First you should get the data into the structure that you want, then you can write it out easily. Also, for csv's with complicated structure it's frequently more useful to open it with a DictReader.

from csv import DictReader, DictWriter

with open(csv_path) as f:
  table = list(DictReader(f, restval=''))

transformed = []
for row in table:
  devices = [d for d in row.viewkeys() - {'time', 'No.'}]
  time_rows = [{'time': row['time']} for i in range(len(devices))]
  for i, d in enumerate(devices):
    time_rows[i].update({'device': d, 'value': row[d]})
  transformed += time_rows

this produces a list like

[{'device': '00:30:00X(X2497-1)', 'value': '0', 'time': '2014/06/04 10:00'},  
 {'device': '00:30:02X(X2499-1)', 'value': '0', 'time': '2014/06/04 10:00'},  
 {'device': '00:30:01X(X2498-1)', 'value': '0', 'time': '2014/06/04 10:00'},  
 {'device': '00:30:00X(X2497-1)', 'value': '0', 'time': '2014/06/04 10:01'},  
 {'device': '00:30:02X(X2499-1)', 'value': '0', 'time': '2014/06/04 10:01'},  
 {'device': '00:30:01X(X2498-1)', 'value': '0', 'time': '2014/06/04 10:01'},  
 {'device': '00:30:00X(X2497-1)', 'value': '9', 'time': '2014/06/04 10:02'},  
 {'device': '00:30:02X(X2499-1)', 'value': '0', 'time': '2014/06/04 10:02'},  
 {'device': '00:30:01X(X2498-1)', 'value': '0', 'time': '2014/06/04 10:02'},  
 {'device': '00:30:00X(X2497-1)', 'value': '0', 'time': '2014/06/04 10:03'},  
 {'device': '00:30:02X(X2499-1)', 'value': '0', 'time': '2014/06/04 10:03'},  
 {'device': '00:30:01X(X2498-1)', 'value': '0', 'time': '2014/06/04 10:03'},  
 {'device': '00:30:00X(X2497-1)', 'value': '0', 'time': '2014/06/04 10:04'},  
 {'device': '00:30:02X(X2499-1)', 'value': '0', 'time': '2014/06/04 10:04'},  
 {'device': '00:30:01X(X2498-1)', 'value': '0', 'time': '2014/06/04 10:04'},  
 {'device': '00:30:00X(X2497-1)', 'value': '0', 'time': '2014/06/04 10:05'},  
 {'device': '00:30:02X(X2499-1)', 'value': '0', 'time': '2014/06/04 10:05'},  
 {'device': '00:30:01X(X2498-1)', 'value': '0', 'time': '2014/06/04 10:05'}]

which is exactly what we wanted. Then to write it back out you can use a DictWriter.

# you might sort transformed here so that it gets written out in whatever order you like

column_names = ['time', 'device', 'value']
with open(out_path, 'w') as f:
  writer = DictWriter(f, column_names)
  writer.writeheader()
  writer.writerows(transformed)    
colinro
  • 417
  • 4
  • 10
  • `devices = [d for d in row.keys() if d not in ('time', 'No.') and row[d]]` is a bit of a mental stumbling block. I would suggest `devices = [d[k] for k in d.viewkeys() - {'time', 'No.'}]` – Jon Clements Jun 19 '14 at 11:14
  • @colinro, Many thanks for answering, i'm trying to test your code, will it work with python 2x ? (python 3 won't be available basely in my server) Also, and please accept all my apologies for that, i realized that double quotes (but not for values) went out while pasting the source data, i hope this does not change the way your code works ? I'm currently learning python and so i'm very very new...and quickly lost ! – Guilmxm Jun 19 '14 at 12:39
  • I'm trying to use the following python script (i have python 2) and produces some error: Traceback (most recent call last): File "soluce1.py", line 22, in writer.writeHeader() AttributeError: DictWriter instance has no attribute 'writeHeader' – Guilmxm Jun 19 '14 at 12:54
  • @JonClements Thanks, I edited that part. Your way is much cleaner. @Guilmxm That was a typo on my part. Many apologies! the method is `writeheader` with no caps, and similarly `writerows` for the following line. – colinro Jun 19 '14 at 21:25
1

EDIT : Expect quotes (") around No., port code to python 2 with indication for python 3 and remove debugging print

EDIT2 : fixed stupid bug not incrementing indexes

EDIT3 : new version allowing the input file to contain multiple headers each followed by data

I am not sure it is worth to use csv module, because you separator is fixed, you have no quotes, and no field containing newline or separator character : line.strip.split(',') is enough.

Here is what I tried :

  • skip lines until one begins with No. and read fields after 2 firsts to get identifiers
  • proceed line by line
    • take date on second field
    • print on line for each field after 2 firsts using identifier

Code for python 2 (remove first line from __future__ import print_function for python 3)

from __future__ import print_function

class transposer(object):
    def _skip_preamble(self):
        for line in self.fin:
            if line.strip().startswith('"No."'):
                self.keys = line.strip().split(',')[2:]
                return
        raise Exception('Initial line not found')
    def _do_loop(self):
        for line in self.fin:
            elts = line.strip().split(',')
            dat = elts[1]
            ix = 0
            for val in elts[2:]:
                print(dat, self.keys[ix], val, sep=',', file = self.out)
                ix += 1

    def transpose(self, ficin, ficout):
        with open(ficin) as fin:
            with open(ficout, 'w') as fout:
                self.do_transpose(fin, fout)
    def do_transpose(self, fin, fout):
        self.fin = fin
        self.out = fout
        self._skip_preamble()
        self._do_loop()

Usage :

t = transposer()
t.transpose('in', 'out')

If input file contains multiple headers, it is necessary to reset the list of keys on each header :

from __future__ import print_function

class transposer(object):
    def _do_loop(self):
        line_number = 0
        for line in self.fin:
            line_number += 1
            line = line.strip();
            if line.strip().startswith('"No."'):
                self.keys = line.strip().split(',')[2:]
            elif line.startswith('"'):
                elts = line.strip().split(',')
                if len(elts) == (len(self.keys) + 2):
                    dat = elts[1]
                    ix = 0
                    for val in elts[2:]:
                        print(dat, self.keys[ix], val, sep=',', file = self.out)
                        ix += 1
                else:
                    raise Exception("Syntax error line %d expected %d values found %d"
                                    % (line_number, len(self.keys), len(elts) - 2))

    def transpose(self, ficin, ficout):
        with open(ficin) as fin:
            with open(ficout, 'w') as fout:
                self.do_transpose(fin, fout)
    def do_transpose(self, fin, fout):
        self.fin = fin
        self.out = fout
        self.keys = []
        self._do_loop()
Serge Ballesta
  • 143,923
  • 11
  • 122
  • 252
  • Hi Serge ! Thanks you very much for your help and your answer, i'm checking this right now and will return :-) – Guilmxm Jun 19 '14 at 10:17
  • Serge, I will need to use Python 2, verson 3x won't be available in my server installation. I'm sorry for that but i'm very novice in Python, could detail me a little more how i can use the class in a python script ? – Guilmxm Jun 19 '14 at 12:03
  • Serge, I've seen that that double quotes i have in data source (but not for values) went out while pasting the text, i'm definitively sorry for that ! I guess this may change the way your code should work as you mentioned it ? I don't really need to have the same double quotes in desired output, with or without this is the same for me... I'm trying to use your code, and probably i don't get it, as i have a syntax error: File "./transpose.py", line 22 print(dat, self.keys[ix], val, sep=',', file = self.out) ^ SyntaxError: invalid syntax – Guilmxm Jun 19 '14 at 12:33
  • Ok great it almost works, but i only have 1 device repeated each line, but values will seem to be transposed with success. Do you any idea ? – Guilmxm Jun 19 '14 at 14:02
  • @Guilmxm Post edited - I only looked at the general aspect and the '9' in values ... grmmph :-( – Serge Ballesta Jun 19 '14 at 14:08
  • Cool !!! This works great now, thank you very very much :-) Very cool! Very last thing, i wanted add a check of in file if exists and is not empty, strangely this does not seem to be that simple in python... – Guilmxm Jun 19 '14 at 14:58
  • These are methods of `os.path` module. File existence : `os.path.exists(filename)` returns boolean, file size : `os.path.getsize(filename)` returs long or raise exception. I let it to you as an exercise ... :-) – Serge Ballesta Jun 19 '14 at 15:26
  • You're right Serge ;-) Many thanks for your great help ! – Guilmxm Jun 19 '14 at 15:30
  • Hi Serge, i hope you're fine and again thank you for your great help. I discovered that cvs files had multi header, in the middle of the file the last 126 devices (up to 126) are listed with a new header and data after. With the following code, the script takes the header from the first header and not the second (but devices values are managed with sucess), would you have any idea how to manage this ? The structure is exactly the same (new header with new devices and data after it) Thanks – Guilmxm Jun 23 '14 at 15:59
  • I updated my answer with a slight variation to allow the occurence of multiple headers and raising errors if the number of values is incoherent with the last header. – Serge Ballesta Jun 23 '14 at 17:55
  • Works perfect !!! I think i've said it many times...but that's very nice of you to have updated your answer, very impressed by python's power, thank for you so much for your clever help ! – Guilmxm Jun 23 '14 at 18:55