0

I have to work with a dataset that contains multiple lines per observation. The number of rows per observation can vary. The file is structured so that information is not repeated.

The file contains a segment ID, which relates the output to a specific piece of information. Here is a sample of the file layout.

Segment Id definitions

SegementID  Table                   Number of Occurrences
1           Customer Information    1
3           Items bought            1-10

Table layout - Customer Information

ID  Name

Table layout - Items Bought

Item    Cost    Date

Here is a sample of how the output file would look.

SegementID          
1   100     matt
3   ball    3.25    1/16/2016
3   cat     5.55    1/17/2016
1   200     lucy
3   doll    500.35  2/1/2016
3   ball    3.25    2/2/2016
3   dog     5.55    2/3/2016

Notice that segment ID = 1, relates to customer information. Segment ID 3 then shows all the transactions that customer has made.

I would like to make the structure that has customer ID available on each transaction line. What is the best way to do this?

ID  Item    Cost    Date
100 ball    3.25    1/16/2016
100 cat     5.55    1/17/2016
200 doll    500.35  2/1/2016
200 cat     3.25    2/2/2016
200 dog     5.55    2/3/2016
CompChemist
  • 903
  • 2
  • 8
  • 15
  • By "make the structure" do you mean writing another file with that structure? What are your criteria to judge "the best way"? – Rory Daulton Aug 09 '16 at 20:42
  • I was referring to a data structure (pandas), but a object or dictionary would work as well. By the best, I am referring to the cleanest code. – CompChemist Aug 09 '16 at 21:11

1 Answers1

1

Here is a quick solution using Pandas

import pandas as pd

df = pd.DataFrame()

with open("file.txt", "r") as f:
    f.next()
    for row,line in enumerate(f):
        info =  line.split()
        if info[0] == '1':
            client = info[1]
        else:
            df[row] = [client, info[1],info[2],info[3]]   

df = df.transpose()
df.columns = ["ID","Item","Cost","Date"]
Daniel
  • 5,095
  • 5
  • 35
  • 48