I have a txt file which is formatted in this way:
thi is a junk data line to be ignored abc xyz dsfgsrj
AFKSDNG-RBI 20200706 MARS stu base-1
AFKSDNG-UBI 20200706 JUPITER uyt base-2
AFKSDNG-ABI 20200706 MARS stu base-1
AFKSDNG-XBI 20200706 JUPITER uyt base-2
AFKSDNG-XBI 20200706 MARS stx base-1
Please note that I have only raw data in the txt file without any column name in the header indicating the context of each column.
Each column is separated from the other by one or more whitespaces.
So for example if I wanted to count the occurances of 'MARS' it would be 2 and not 3 because the last record has the 4th column different ('stx') from the previous ones.
I need to count all the unique occurences and produce an excel file like the following:
Column 1 Column 2 Column 3 Column 4 Column 5 Column 6 (occurences)
AFKSDNG-RBI 20200706 MARS stu base-1 2
AFKSDNG-UBI 20200706 JUPITER uyt base-2 2
AFKSDNG-ABI 20200706 MARS stu base-1 2
AFKSDNG-XBI 20200706 JUPITER uyt base-2 2
AFKSDNG-XBI 20200706 MARS stx base-1 1
EVEN BETTER OUTPUT WOULD BE TO REMOVE THE DUPLICATED RECORDS AFTER COUNTING THEM SO:
Column 1 Column 2 Column 3 Column 4 Column 5 Column 6 (occurences)
AFKSDNG-RBI 20200706 MARS stu base-1 2
AFKSDNG-UBI 20200706 JUPITER uyt base-2 2
AFKSDNG-XBI 20200706 MARS stx base-1 1
I tried writing this code in python for reading and producing an Excel:
import pandas as pd
df = pd.read_csv('CD202205.txt', engine='python', sep='\s{3,}', header=None, skiprows=1)
df.to_excel('export.xlsx', index=False, sheet_name='SHEET1')
But I cannot figure out how to count the occurences. I'm new to python and pandas so any help would be highly appreciated.
-------------------------------------UPDATE---------------------------------------
I noticed a little issue if we slightly change the source txt file. As I stated before the last 'MARS' is different from the previous ones because the 4th column 'stx' is different. In order to be unique it only takes one column from the 3rd, 4th or the 5th one to be different.
EXAMPLE
thi is a junk data line to be ignored abc xyz dsfgsrj
AFKSDNG-RBI 20200706 MARS stu base-1
AFKSDNG-UBI 20200706 JUPITER uyt base-2
AFKSDNG-ABI 20200706 MARS stu base-1
AFKSDNG-XBI 20200706 JUPITER uyt base-2
AFKSDNG-XBI 20200706 MARS stx base-1 // different cuz stx is different
AFKSDNG-XBI 20200706 PLUTO stu base-1 // even though here stu and base-1 is like 'MARS' we have 'PLUTO' so this is a new row
In the accepted answer of @jezrael 'PLUTO' is counted with 'MARS'