0

I am processing log files with pandas with the following structure, all log files have the same structure and contain data about one machine that should be reducable to 1 row:

Column1     Value1
Column2     Value2
Column3     Value3
Column4     Value4
Column5     Value5

I am using the following code:

import pandas as pd
import glob

log_files = [i for i in glob.glob(inputdir+'\**\*.log', recursive=True)]

appended_data = []

for logfile in log_files:
    fileobject = open(logfile)
    df = pd.read_csv(fileobject, sep='\t',  lineterminator='\n', names=['Column','Value'])
    df = df.pivot(columns = 'Column', values = 'Value')
    appended_data.append(df)

logdf = pd.concat(appended_data)
logdf = logdf.reset_index(drop=True)
logdf = logdf.rename_axis(columns=None)

This however creates 1 row for each column, instead of reducing all rows to one:

Column  Column1 Column2 Column3 Column4 Column5 
0   1   NaN NaN NaN NaN 
1   NaN 2   NaN NaN NaN 
2   NaN NaN 3   NaN NaN 
3   NaN NaN NaN 4   NaN 
4   NaN NaN NaN NaN 5   

The df should have the following format:

    Column1 Column2 Column3 Column4 Column5 
0   1       2       3       4       5   

Is there an effective way to solve this issue by changing the read CSV settings or by transforming the df?

The following solution works but I don't think it's particulary great.

df.sort_values(by='A',inplace=True)
df = df.fillna(method='ffill')
df.drop_duplicates(["A"],keep='last',inplace=True)
Stijn
  • 121
  • 9
  • HI, I don't see in your code what happens to `appended_data` after you are done iterating on log_files. – Laurent Apr 09 '22 at 15:39
  • @Laurent, I use pd.concat on the processed log files. I added the code in my original post. – Stijn Apr 11 '22 at 08:42

1 Answers1

0

I think you are concatenating along rows, like this:

import pandas as pd

appended_data = [
    pd.DataFrame({"Column1": ["1"],}),
    pd.DataFrame({"Column2": ["2"],}),
    pd.DataFrame({"Column3": ["3"],}),
    pd.DataFrame({"Column4": ["4"],}),
    pd.DataFrame({"Column5": ["5"],}),
]

logdf = pd.concat(appended_data)
logdf = logdf.reset_index(drop=True)
logdf = logdf.rename_axis(columns=None)

print(logdf)
# Output
  Column1 Column2 Column3 Column4 Column5
0       1     NaN     NaN     NaN     NaN
1     NaN       2     NaN     NaN     NaN
2     NaN     NaN       3     NaN     NaN
3     NaN     NaN     NaN       4     NaN
4     NaN     NaN     NaN     NaN       5

As per Pandas documentation, you can concatenate along columns by specifying axis=1, like this:

logdf = (
    pd
    .concat(appended_data, axis=1)
    .reset_index(drop=True)
    .rename_axis(columns=None)
)

print(logdf)
# Output
  Column1 Column2 Column3 Column4 Column5
0       1       2       3       4       5
Laurent
  • 12,287
  • 7
  • 21
  • 37