3

I am new to python language. I have a table in csv format with n columns where the header is Tax_id and every column contains species names like this

9606                  9606.1               508771  
0                root                root                  root    
1  cellular organisms  cellular organisms    cellular organisms
2           Eukaryota           Eukaryota             Eukaryota
3        Opisthokonta        Opisthokonta                   Sar
4             Metazoa             Metazoa             Alveolata
5           Eumetazoa           Eumetazoa           Apicomplexa
6           Bilateria           Bilateria           Conoidasida
7       Deuterostomia       Deuterostomia              Coccidia
8            Chordata            Chordata        Eucoccidiorida
9            Craniata            Craniata           Eimeriorina

What I want to do is to be able to write a python code that counts each species with their occurrence only for columns that contains species named "Metazoa".

#to return something like

    Eumetazoa 2
    Bilateria 2
    Craniata  2
Mumdooh
  • 37
  • 4
  • 1
    Welcome to StackOverflow! You need to be more specific what kind of output you are looking for. Just a random count of all strings that occur in matching columns? Also, each column represents the taxonomic ranks of a species, right? – xjcl Apr 27 '21 at 00:06
  • Thanks for your reply, yes every column has taxonomic ranks. therefore the output I am looking for is a count for all groups in every column that contains "Metazoa" as my table has over 3000 columns and I just want the count for all groups in columns that only contain "Metazoa" – Mumdooh Apr 27 '21 at 01:03

1 Answers1

2

I'm still not entirely sure what you want. Does something like this work? Note you need to install the package pandas for this (How to install pandas).

What I assume your CSV file looks like:

s = '''9606,9606.1,508771
0,root,root,root
1,cellular organisms,cellular organisms,cellular organisms
2,Eukaryota,Eukaryota,Eukaryota
3,Opisthokonta,Opisthokonta,Sar
4,Metazoa,Metazoa,Alveolata
5,Eumetazoa,Eumetazoa,Apicomplexa
6,Bilateria,Bilateria,Conoidasida
7,Deuterostomia,Deuterostomia,Coccidia
8,Chordata,Chordata,Eucoccidiorida
9,Craniata,Craniata,Eimeriorina'''

Algorithm:

import pandas as pd
from io import StringIO

def filter_and_count(df, search_string):
    df_filtered = df.loc[:, (df == search_string).any(axis=0)]
    return pd.melt(df_filtered)['value'].value_counts()

df = pd.read_csv(StringIO(s))  # replace this with read_csv("filename")
print(filter_and_count(df, 'Metazoa'))

Output:

root                  2
Metazoa               2
Deuterostomia         2
Craniata              2
cellular organisms    2
Chordata              2
Eumetazoa             2
Opisthokonta          2
Eukaryota             2
Bilateria             2
Name: value, dtype: int64
xjcl
  • 12,848
  • 6
  • 67
  • 89
  • what about if I want to start counting the phyla from the 5th element in each column? How could it be tweaked ? – Mumdooh May 04 '21 at 12:34
  • 1
    `filter_and_count(df[4:], 'Metazoa')` should work – xjcl May 04 '21 at 13:02
  • Thanks for your quick answer, what I meant is to start counting the phyla from the 5th element in each column whether there is 'Metazoa' or not. In other words, to start the counting from the 5th element also for columns that don’t contain 'Metazoa' – Mumdooh May 04 '21 at 13:20
  • 1
    `pd.melt(df[4:])['value'].value_counts()` then – xjcl May 04 '21 at 13:34
  • I appreciate you being able to answer my questions, what about if want to count all the phyla from the 5th element in each Row instead of counting columns. can you please tell me how the code is modified? I tried to axis=0 to axis=1 in df_filtered = df.loc[:, (df == search_string).any(axis=0)] but it didn’t work – Mumdooh May 09 '21 at 23:04
  • 1
    So your input is different? You could try transposing it before the function call: `filter_and_count(df.transpose(), search_string)` – xjcl May 10 '21 at 07:55