Performing ETL on what is almost a free format data interface is always ugly and unstable
Approach
- consider column definitions as meta data
- find the meta data, get start and end row of this
- clean up meta data - there are column definitions that don't exist !
- now have column names
- consider rest of rows after last meta data row to be actual data
reshape()
this into an array cols X max possible rows in remaining data
- do post transform validation logic to filter to what appears to be interesting data set
import pandas as pd
with open("data1.txt") as f:
data = f.read()
# pd.read_csv(io.StringIO(data))
dfraw = pd.DataFrame(data.split("\n"))
# find first and last row that contain column meta information
colsi = dfraw.loc[dfraw[0].str.contains("Column")].index
assert len(colsi)==2, "failed to find column labels"
# first stab... based on finding meta information
cols = dfraw.loc[colsi[0]:colsi[1]]
# exclude meta rows that are empty or labeled "Grade"
cols = cols[(cols[0]!="") & (cols[0]!="Grade") & (cols[0]!="(Column Names)--------------------")]
# shift dfraw start row to start of data after column meta data
dfraw = dfraw.loc[colsi[1]+1:].reset_index(drop=True)
# create dataframe with all remaining rows, truncating so reshape does not fail
# using numpy reshape() to transform sets of rows into columns
df = pd.DataFrame(dfraw.loc[:(len(dfraw)-(len(dfraw)%len(cols)))-1].values.reshape(int(len(dfraw)/len(cols)), len(cols)),
columns=cols[0].values)
# filter to valid "Egg #" and Day0 is defined
df = df[(df["Egg #"].str.match("^[0-9]+[ ]+[A-Z][0-9]")) & (df["Day0"].str.strip()!="")]
print(df.to_string(index=False))
output
Egg # Day0 Maturity IVF/ICSI Day1 Day2 Grade Day 3 Grade- Frag% Day4 Day5 Day6 Day7 Comment Fate Freezing ID (End of Column names)-------------------------
1 A1 (egg #)------------ MII ICSI 2PN/2PB (Day 1)---------- 2BC (Day 2 grade)-------- 6BC (Day 3)----------- 15 (Grade- Frag%)-------- - (Day 4) -------- - (Day 5) -------- - (Day 6) -------- - (Day 7) -------- (Comments) ------- F (Fate) A102519-01EV (Freezing ID)-------------
2 A2 MII ICSI 2PN/2PB 5B 12B 10 - - - - F B102519-01EV
3 A3 MII ICSI 2PN/2PB 5B 8B 10 - - - - ET
4 A4 MII ICSI 2PN/2PB 5B 10A 0 - - - - ET
5 B1 MII ICSI 2PN/2PB 5BC 8BC 10 - - - - F A102519-01EV
6 B2 MI MII/ICSI 2PN/2PB SAME - SAME - - - D