Read unique values from txt file with pandas

Question

I have a txt file which is formatted in this way:

 thi is    a junk data    line to be ignored abc xyz dsfgsrj
AFKSDNG-RBI 20200706    MARS        stu    base-1
AFKSDNG-UBI 20200706    JUPITER     uyt    base-2
AFKSDNG-ABI 20200706    MARS        stu    base-1
AFKSDNG-XBI 20200706    JUPITER     uyt    base-2
AFKSDNG-XBI 20200706    MARS        stx    base-1

Please note that I have only raw data in the txt file without any column name in the header indicating the context of each column.

Each column is separated from the other by one or more whitespaces.

So for example if I wanted to count the occurances of 'MARS' it would be 2 and not 3 because the last record has the 4th column different ('stx') from the previous ones.

I need to count all the unique occurences and produce an excel file like the following:

Column 1     Column 2     Column 3   Column 4   Column 5    Column 6 (occurences)
AFKSDNG-RBI   20200706      MARS        stu        base-1     2
AFKSDNG-UBI   20200706      JUPITER     uyt        base-2     2
AFKSDNG-ABI   20200706      MARS        stu        base-1     2
AFKSDNG-XBI   20200706      JUPITER     uyt        base-2     2
AFKSDNG-XBI   20200706      MARS        stx        base-1     1

EVEN BETTER OUTPUT WOULD BE TO REMOVE THE DUPLICATED RECORDS AFTER COUNTING THEM SO:

Column 1     Column 2     Column 3   Column 4   Column 5    Column 6 (occurences)
AFKSDNG-RBI   20200706      MARS        stu        base-1     2
AFKSDNG-UBI   20200706      JUPITER     uyt        base-2     2
AFKSDNG-XBI   20200706      MARS        stx        base-1     1

I tried writing this code in python for reading and producing an Excel:

import pandas as pd

df = pd.read_csv('CD202205.txt', engine='python', sep='\s{3,}', header=None, skiprows=1)
df.to_excel('export.xlsx', index=False, sheet_name='SHEET1')

But I cannot figure out how to count the occurences. I'm new to python and pandas so any help would be highly appreciated.

-------------------------------------UPDATE---------------------------------------

I noticed a little issue if we slightly change the source txt file. As I stated before the last 'MARS' is different from the previous ones because the 4th column 'stx' is different. In order to be unique it only takes one column from the 3rd, 4th or the 5th one to be different.

EXAMPLE

thi is    a junk data    line to be ignored abc xyz dsfgsrj
AFKSDNG-RBI 20200706    MARS        stu    base-1
AFKSDNG-UBI 20200706    JUPITER     uyt    base-2
AFKSDNG-ABI 20200706    MARS        stu    base-1
AFKSDNG-XBI 20200706    JUPITER     uyt    base-2
AFKSDNG-XBI 20200706    MARS        stx    base-1 // different cuz stx is different
AFKSDNG-XBI 20200706    PLUTO       stu    base-1 // even though here stu and base-1 is like 'MARS' we have 'PLUTO' so this is a new row

In the accepted answer of @jezrael 'PLUTO' is counted with 'MARS'

jezrael · Accepted Answer · 2023-05-25T09:58:02.800

1

For count use GroupBy.transform with DataFrame.drop_duplicates:

df = pd.read_csv('CD202205.txt', engine='python', sep='\s{3,}', header=None, skiprows=1)
print (df)
             0         1        2    3       4
0  AFKSDNG-RBI  20200706     MARS  stu  base-1
1  AFKSDNG-UBI  20200706  JUPITER  uyt  base-2
2  AFKSDNG-ABI  20200706     MARS  stu  base-1
3  AFKSDNG-XBI  20200706  JUPITER  uyt  base-2
4  AFKSDNG-XBI  20200706     MARS  stx  base-1

df['new'] = df.groupby([2,3,4])[2].transform('size')

df = df.drop_duplicates([2,3,4])
print (df)
             0         1        2    3       4  new
0  AFKSDNG-RBI  20200706     MARS  stu  base-1    2
1  AFKSDNG-UBI  20200706  JUPITER  uyt  base-2    2
4  AFKSDNG-XBI  20200706     MARS  stx  base-1    1

df.to_excel('export.xlsx', index=False, sheet_name='SHEET1')

If need set columns names:

df = pd.read_csv('CD202205.txt', engine='python', sep='\s{3,}', header=None, skiprows=1)

f = lambda x: f'Column {x+1}'
df = df.rename(columns=f)
print (df)
      Column 1  Column 2 Column 3 Column 4 Column 5
0  AFKSDNG-RBI  20200706     MARS      stu   base-1
1  AFKSDNG-UBI  20200706  JUPITER      uyt   base-2
2  AFKSDNG-ABI  20200706     MARS      stu   base-1
3  AFKSDNG-XBI  20200706  JUPITER      uyt   base-2
4  AFKSDNG-XBI  20200706     MARS      stx   base-1

df['Column 6']=df.groupby(['Column 3','Column 4','Column 5'])['Column 3'].transform('size')

df = df.drop_duplicates(['Column 3','Column 4','Column 5'])
print (df)
      Column 1  Column 2 Column 3 Column 4 Column 5  Column 6
0  AFKSDNG-RBI  20200706     MARS      stu   base-1         2
1  AFKSDNG-UBI  20200706  JUPITER      uyt   base-2         2
4  AFKSDNG-XBI  20200706     MARS      stx   base-1         1

df.to_excel('export.xlsx', index=False, sheet_name='SHEET1')

EDIT: Test with new data:

df['new'] = df.groupby([2,3,4])[2].transform('size')

df = df.drop_duplicates([2,3,4])
print (df)
             0         1        2    3       4  new
0  AFKSDNG-RBI  20200706     MARS  stu  base-1    2
1  AFKSDNG-UBI  20200706  JUPITER  uyt  base-2    2
4  AFKSDNG-XBI  20200706     MARS  stx  base-1    1
5  AFKSDNG-XBI  20200706    PLUTO  stu  base-1    1

edited May 25 '23 at 09:58

answered May 25 '23 at 09:22

jezrael

822,522
95
1,334
1,252

In this case where I should specify the name of the input txt file? And how would I export it in excel format? – twww123 May 25 '23 at 09:25
@twww123 - added your code to answer. – jezrael May 25 '23 at 09:26
actually I inserted skiprows=1 because in that raw data txt file the very first line is junk data that is intented to be ingored...I reviewd my question with the updated txt file format check it – twww123 May 25 '23 at 09:29
@twww123 - OK, then super. Removing from my answer. – jezrael May 25 '23 at 09:29
I tested it with various txt files I think there is a little issue if we change one thing, I updated my question please check it – twww123 May 25 '23 at 09:52
@twww123 - what is expected ouput? – jezrael May 25 '23 at 09:56
@twww123 - Added EDIT part to my answer, it is testing columns 2,3,4 for 3rd, 4th or the 5th columns (python count from `0`) – jezrael May 25 '23 at 10:00
If you notice in the 'UPDATE' section of my answer I inserted a new record in the source txt file 'PLUTO' basically pluto is counted as mars because it has the same 4th and 5th row (stx and base-1). The thing is that from the 3rd, 4th and 5th column if I only have even one of them that is different for me that record is a new one – twww123 May 25 '23 at 10:01
@twww123 - hmmm, so need count by `4th and 5th columns`, not by `3rd, 4th and 5th column` ? – jezrael May 25 '23 at 10:03
perfect bro, but beside the indexing fix, would in this way each record counted as unique/new if even one from the 3rd 4th and 5th (starting from 1) is different? – twww123 May 25 '23 at 10:03
thi should be the differentiation algorithm: MARS stu base-1 if i have PLUTO stu base-1 they are 2 different records. If i have MARS xyt base-1 still new record. If I have MARS stu base-5 of course new record, only if I have MARS stu base-1 again its 2 occurences – twww123 May 25 '23 at 10:07
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/253821/discussion-between-twww123-and-jezrael). – twww123 May 25 '23 at 10:08

Read unique values from txt file with pandas

1 Answers1