Data munging with python: transforming string into rows

Question

I'm fairly new to python, and I need to perform some data munging. I want some advice on the best practice for this: libraries, modules, better code to implementment, or just direction.

So I have text file with data organised in the following format:

A:100 B:200 C:300

A:150 B:350 C:370

I.e. each line represents related data, 3 variables, named A/B/C. Each new line shows data regarding the same variables but related to another object.

So my data is in a text file, and I want to output it in the following format:

100 '\t' 150

200 '\t' 350

300 '\t' 370

I.e. rows of data for variables A, B and C, tabulated, such that I can export to a graphing toolkit (probably Origin Pro).

This is the code I've come up with thus far:

with open("example.txt", 'r') as file:
    for line in file.readlines():
        line = line.replace(' ', '\n')
        line = line.split(':',  1)[-1]
        print line

This transforms the data into the following:

100
B:200
C:300

150
B:350
C:370

As the .split() is obviously only performed on each line prior to using .replace() to add new lines between the data. I feel like after I use .replace(), I need to start looping over the lines again to perform .split() or even just line[2:] iteratively, to remove the leading variable names - but then I can't think how I'm going to tabulate the data also for each line to create columns?

Any ideas? Thanks!

The indentation in your for-loop is incorrect. Please edit the question. This code would cause a SyntaxError. — Håken Lid, Jul 12 '17 at 11:06
`str.replace` and `str.split` do not modify the string in place. You have to assign the output for this to make any sense. e.g. `line = line.replace(' ', '\n')` — Håken Lid, Jul 12 '17 at 11:09
Correct on all those, have edited to make the code more syntactically correct, thanks! — tyrfingnir, Jul 12 '17 at 12:57

score 0 · Answer 1 · answered Jul 12 '17 at 11:13

Obviously, you need some data structure to hold values for your variables. The most suitable one is a dictionary of lists:

d = {'A': [], 'B': [], 'C': []}

Then you go through the file, splitting each line into 'Name:value' items, then splitting these items on the : sign and storing values appriopriately:

with open("example.txt", 'r') as file:
    for line in file:
        elements = line.rstrip().split()
        for e in elements:
            (name, value) = e.split(':')
            d[name].append(value)

and finally you output your data, one variable per row:

for k in sorted(d):
    print('\t'.join(d[k]))

zipa · Answer 2 · 2017-07-12T11:23:15.103

I suggest using pandas library:

import pandas as pd

df = pd.read_csv(path_to_infile, sep=' ', header=None, names=['A', 'B', 'C'])
df = df.applymap(lambda x: int(x[2:]))
df.to_csv(path_to_outfile, sep='\t', header=False, index=False)

After you import pandas you can use read_csv function to load file into the dataframe with some additional arguments like:

sep that is used to declare separator
header that can be used to indicate that you don't have column names
names that is used to assign names to columns

After that you can use applymap function on whole dataframe to remove variable names and colons.

Finally, you can use to_csv method to save your file to a desired location, again without header but this time without index as well because it will add index column by default.

As you can see, again you can use sep argument to declare new separator.

score 0 · Answer 3 · answered Jul 12 '17 at 11:21

You could do it using pandas which makes it really easy:

import pandas as pd

df = pd.read_csv(file_path,sep=' ',header=None,names = ['A','B','C'])
output = df.apply(lambda x:x[.str.replace('^([A-Z]:)','')],axis=1).T
output.to_csv(file_path,sep='\t',header = False, index = False)

documentation:

pandas.apply

pandas.read_csv

pandas.to_csv

Data munging with python: transforming string into rows

3 Answers3