I have some large data files to parse. In each file, some tags are repeated while others appear only once. For example, each block of data has parent tags for name and date, which appear only once per block, but these have many children such as patent citations, non-patent citations, and classifications.

So I parse each file, find all occurrences of these three kinds of children, and on every parent iteration store them in separate lists. The problem is that the child lists always have different lengths, and I want to write them all on one row of a CSV file.

For example, for one iteration in a file, my input lists look like this:

Name = ['Jon']
Date = [1985]
Patcit = [1, 2, 3]
Npatcit = [4, 5, 6, 7, 8]
Class = [9, 10]

These are the incoming lists for my second iteration:

Name = ['Nikhil']
Date = [1988]
Patcit = [1, 2, 3]
Npatcit = [4, 5, 6, 7]
Class = [9, 10, 11, 12, 13]

These are the incoming lists for my third iteration:

Name = ['Neetha']
Date = [1986]
Patcit = [1, 2]
Npatcit = [4, 5]
Class = [9, 10, 11, 12]

And I want an output written to a CSV file to look like:

Name     Date     Patcit   Npatcit      Class
Jon      1985     1,2,3    4,5,6,7,8    9,10
Nikhil   1988     1,2,3    4,5,6,7      9,10,11,12,13
Neetha   1986     1,2      4,5          9,10,11,12

(Each further name and date iteration goes on the next row.)

2 Answers

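If you want to make a string out of a list, you can try this:

patcit = [1, 2, 3]
x = ",".join(str(n) for n in patcit)  # join() only accepts strings, so convert each item first
# the string that join() is called on is used as the separator
# x is now "1,2,3"
# the type of x is str

Later you can use x.split(",") to turn it back into a list of strings.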
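The round trip can be sketched as below. This is only a minimal illustration: `list_to_field` and `field_to_ints` are hypothetical helper names, and the conversion back with `int()` assumes the values are integers as in the question's sample data.

```python
def list_to_field(values, sep=","):
    """Join a list of values into one string, converting each item with str()."""
    return sep.join(str(v) for v in values)


def field_to_ints(field, sep=","):
    """Split a joined string back into a list of ints; an empty string gives []."""
    return [int(part) for part in field.split(sep)] if field else []


patcit = [1, 2, 3]
field = list_to_field(patcit)
print(field)                 # 1,2,3
print(field_to_ints(field))  # [1, 2, 3]
```

The empty-string check matters because `"".split(",")` returns `[""]`, not an empty list.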

huck dupr

You can convert each block of data to a dictionary and add it as a row to a DataFrame.

The lists such as [1, 2, 3] first have to be converted to strings such as "1,2,3".

import pandas as pd

# DataFrame.append() was removed in pandas 2.0, so collect the rows
# in a plain list and build the DataFrame once at the end.
rows = []

# -------------------------------

Name = ['Jon']
Date = [1985]
Patcit = [1, 2, 3]
Npatcit = [4, 5, 6, 7, 8]
Class = [9, 10]

# one dictionary per row; every list becomes a single comma-joined string
row = {
    'Name': Name[0],
    'Date': Date[0],
    'Patcit':  ','.join(str(x) for x in Patcit),
    'Npatcit': ','.join(str(x) for x in Npatcit),
    'Class':   ','.join(str(x) for x in Class),
}

rows.append(row)

# -------------------------------

Name = ['Nikhil']
Date = [1988]
Patcit = [1, 2, 3]
Npatcit = [4, 5, 6, 7]
Class = [9, 10, 11, 12, 13]

row = {
    'Name': Name[0],
    'Date': Date[0],
    'Patcit':  ','.join(str(x) for x in Patcit),
    'Npatcit': ','.join(str(x) for x in Npatcit),
    'Class':   ','.join(str(x) for x in Class),
}

rows.append(row)

# -------------------------------

df = pd.DataFrame(rows, columns=['Name', 'Date', 'Patcit', 'Npatcit', 'Class'])

print(df)

Result

     Name  Date Patcit    Npatcit          Class
0     Jon  1985  1,2,3  4,5,6,7,8           9,10
1  Nikhil  1988  1,2,3    4,5,6,7  9,10,11,12,13
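Since every iteration builds its row the same way, the repeated part could be moved into a function and called in a loop. The sketch below uses only plain Python; `make_row` and the `iterations` list are hypothetical names standing in for whatever the parser produces on each pass.

```python
def make_row(name, date, patcit, npatcit, classes):
    """Build one output row from the lists collected for one iteration."""
    def join(values):
        return ",".join(str(v) for v in values)

    return {
        "Name": name[0],
        "Date": date[0],
        "Patcit": join(patcit),
        "Npatcit": join(npatcit),
        "Class": join(classes),
    }


# Hypothetical stand-in for the parser output: one tuple of lists per iteration
iterations = [
    (["Jon"], [1985], [1, 2, 3], [4, 5, 6, 7, 8], [9, 10]),
    (["Nikhil"], [1988], [1, 2, 3], [4, 5, 6, 7], [9, 10, 11, 12, 13]),
    (["Neetha"], [1986], [1, 2], [4, 5], [9, 10, 11, 12]),
]

rows = [make_row(*lists) for lists in iterations]
for row in rows:
    print(row)
```

The resulting list of dictionaries can be passed straight to `pd.DataFrame(rows)`.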

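Later you can write the DataFrame to a CSV file using the standard separator (a comma) or another separator. Pass index=False if you do not want the row index (0, 1, ...) written as an extra first column.

df.to_csv('output.csv', sep=';', index=False)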
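If the comma is kept as the file separator, the comma-joined fields have to be quoted so they stay in one column. As a sketch without pandas, the standard-library `csv` module does this quoting by default. The sample rows are taken from the question, and `io.StringIO` is used here only as a stand-in for a real file.

```python
import csv
import io

fieldnames = ["Name", "Date", "Patcit", "Npatcit", "Class"]
rows = [
    {"Name": "Jon", "Date": 1985, "Patcit": "1,2,3", "Npatcit": "4,5,6,7,8", "Class": "9,10"},
    {"Name": "Neetha", "Date": 1986, "Patcit": "1,2", "Npatcit": "4,5", "Class": "9,10,11,12"},
]

# For a real file use: open("output.csv", "w", newline="")
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)  # default delimiter is ","
writer.writeheader()
writer.writerows(rows)

text = buffer.getvalue()
print(text)  # fields that contain commas are wrapped in double quotes

# Reading the data back yields one string per column again
buffer.seek(0)
read_back = list(csv.DictReader(buffer))
print(read_back[0]["Npatcit"])
```

Note that everything read back from a CSV file is a string, so the date comes back as "1985" rather than 1985.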

Or see the other question that describes how to write a fixed-width file.

furas