-2

Link to txt file example Additional File Link

I want to be able to use python to read a txt file and select the column and row information. The txt in the file is in one single column with the data for the column further down the file like the example below. I have put some additional info in (). I have several different txt files with similar formatting, and so I would love to be able to just replace each file and run them.

My goal is to be able to put the data into a csv or excel file.

The format of the data is like the image below

Here is a link to the txt file. About 1/3 down you will see in () where the column names are and right below that I have marked where the data for those columns is. There is additional data further down the sheet but I just need to get a good start. I have several similar txt files that I would need to be able to run the code on.

1 Answers1

0

Performing ETL on what is almost a free format data interface is always ugly and unstable Approach

  1. consider column definitions as meta data
  2. find the meta data, get start and end row of this
  3. clean up meta data - there are column definitions that don't exist !
  4. now have column names
  5. consider rest of rows after last meta data row to be actual data
  6. reshape() this into an array cols X max possible rows in remaining data
  7. do post transform validation logic to filter to what appears to be interesting data set
import pandas as pd
with open("data1.txt") as f:
    data = f.read()
# pd.read_csv(io.StringIO(data))
dfraw = pd.DataFrame(data.split("\n"))

# find first and last row that contain column meta information
colsi = dfraw.loc[dfraw[0].str.contains("Column")].index
assert len(colsi)==2, "failed to find column labels"

# first stab... based on finding meta information
cols = dfraw.loc[colsi[0]:colsi[1]]
# exclude meta rows that are empty or labeled "Grade"
cols = cols[(cols[0]!="") & (cols[0]!="Grade") & (cols[0]!="(Column Names)--------------------")]

# shift dfraw start row to start of data after column meta data
dfraw = dfraw.loc[colsi[1]+1:].reset_index(drop=True)
# create dataframe with all remaining rows, truncating so reshape does not fail
# using numpy reshape() to transform sets of rows into columns
df = pd.DataFrame(dfraw.loc[:(len(dfraw)-(len(dfraw)%len(cols)))-1].values.reshape(int(len(dfraw)/len(cols)), len(cols)),
            columns=cols[0].values)

# filter to valid "Egg #" and Day0 is defined
df = df[(df["Egg #"].str.match("^[0-9]+[ ]+[A-Z][0-9]")) & (df["Day0"].str.strip()!="")]

print(df.to_string(index=False))

output

                      Egg #  Day0 Maturity    IVF/ICSI                        Day1                   Day2 Grade                     Day 3                 Grade- Frag%                   Day4                   Day5                   Day6                   Day7                  Comment         Fate Freezing ID    (End of Column names)-------------------------
 1   A1 (egg #)------------  MII                  ICSI  2PN/2PB  (Day 1)----------  2BC   (Day 2 grade)--------  6BC   (Day 3)-----------  15   (Grade- Frag%)--------  -    (Day 4) --------  -    (Day 5) --------  -    (Day 6) --------  -    (Day 7) --------       (Comments) -------  F    (Fate)                       A102519-01EV (Freezing ID)-------------
                     2   A2   MII                 ICSI                     2PN/2PB                           5B                       12B                           10                      -                      -                      -                      -                                     F                                                  B102519-01EV
                     3   A3   MII                 ICSI                     2PN/2PB                           5B                        8B                           10                      -                      -                      -                      -                                    ET                                                              
                     4   A4   MII                 ICSI                     2PN/2PB                           5B                       10A                            0                      -                      -                      -                      -                                    ET                                                              
                     5   B1   MII                 ICSI                     2PN/2PB                          5BC                       8BC                           10                      -                      -                      -                      -                                     F                                                  A102519-01EV
                     6   B2    MI                                         MII/ICSI                      2PN/2PB                      SAME                            -                   SAME                      -                      -                      -                                     D                                            
Rob Raymond
  • 29,118
  • 3
  • 14
  • 30
  • Thank you so much Rob. I was able to get it to run great just had to specify the codec to use for the file. I definitely appreciate your help. I may have some other questions in the future regarding these data sets. – Nick Bowman Jan 01 '21 at 00:09
  • @NickBowman you are welcome. Can you accept the answer - happy to provide ideas / approaches if you have any further questions – Rob Raymond Jan 01 '21 at 09:19
  • couple of questions: On the txt doc I uploaded I added the items in (column names) and the labels for the data for those columns. Would there be a way to find those items without referring to the added labels? Im working with quite a few txt docs that have the same formatting and I would love to just be able to run the code on them without having to make edits to them. I tried editing your code with no luck. I would also like to be able to export the finished dataset to the same excel as the other data sets once I run them. Is that possible? – Nick Bowman Jan 02 '21 at 20:13
  • the approach is find the rows that are the column names. this line `colsi = dfraw.loc[dfraw[0].str.contains("Column")].index` needs to be changed to find the rows that are the column names. Possibly look for first occurrence of "Egg #" and first occurrence of "Freezing ID". There are multiple ways to modify Excel workbooks with python / pandas. I don't do this as I'm not a windows user and don't have Excel... – Rob Raymond Jan 03 '21 at 15:20
  • Hey @Rob Raymond - sorry to bring this one back up, as I'm getting a little deeper into your code I'm having some difficulty adding a few things. I was able to get it to export super easily to a csv. The problem I'm having is with this line (colsi = dfraw.loc[dfraw[0].str.contains("Column")].index) I can't seem to get the code working for a start of the df at Egg #, and end on Freezing ID. Would there also be a way to add additional columns from a few items that are delimited with a (:) in the txt file? If so where would those lines of code go? I would like to add their information at the end – Nick Bowman Jan 21 '21 at 18:52
  • Provide a different source data example and I'll take a look – Rob Raymond Jan 21 '21 at 20:21
  • I have added an additional file link at the top. If you recall that column headers start at Egg # and end with Freezing ID. There are a few other items I would like to create columns for that are (:) delimited in the sheet. Such as Retrieval Date(which would be a column): and the info after in the rows. My ultimate goal is to be able to create these columns with the appropriate info and run the code on several sheets with similar formatting and export each to a CSV with one sheet below the other with the same headers if that makes sense. – Nick Bowman Jan 22 '21 at 21:09