0

I am trying to write a code where it takes a Metadata.txt as an input, then it identifies the common genes across the different input files whose names are extracted from the Metadata.txt file.

Example of Metadata.txt

SIG1    SIG2
File1   File3
File2   File4
File3   File5
File4

The files in my directory are File1.xls, File2.xls, File3.xls...File6.xls. For simplicity, i have same inputs for File1 and File 3, as well as for File 2 and 4.

File1.xls or File3.xls

TargetID    FoldChange  p-value Adjusted-p
A   0.543528215 0.000518847 0.000518847
B   0.638469898 0.00204759  0.00204759
C   1.936595724 0.00250229  0.00250229
D   0.657322154 0.012840013 0.012840013
E   1.728842021 0.00251105  0.00251105
F   2.024842641 0.000719261 0.000719261
G   4.049059413 2.25E-05    2.25E-05
H   0.478660942 0.000352179 0.000352179
I   0.449304016 0.000489521 0.000489521

File2.xls or File4.xls

TargetID    FoldChange  p-value Adjusted-p
JJ  0.453537892 4.22E-06    4.22E-06
A   0.558325503 0.001697851 0.001697851
B   0.637336564 7.64E-05    7.64E-05
D   1.804853034 0.000492439 0.000492439
E   0.378445825 1.72E-05    1.72E-05
JJJJ    1.601997491 0.019618883 0.019618883

File5.xls

TargetID    FoldChange  p-value Adjusted-p
A   3.140223972 0.013347275 0.013347275
B   1.5205222   0.032318774 0.032318774
C   1.532760451 0.043763101 0.043763101
D   1.522865896 0.001791471 0.001791471

The goal is to output two files "SIG1.txt" and "SIG2.txt" which has the common genes between File1/File2 and File3/File4/File5, respectively. So the metadata is providing a platform to iterate over things. Here is what I had so far:

md_input = pd.read_table("Metadata.txt", sep="\t")  #opens the metadata file

for c in range(0, len(md_input.columns)):
        first_file=md_input.ix[0,c]+".xls"
        print first_file #this will print "File1.xls" for column1 and File3.xls for column#2
        first_sig=pd.read_table(first_file, sep="\t", usecols=["TargetID", 'FoldChange']) #opens the first file
        list1=list(first_file.iloc[:,0]) #takes column of first file and converts to list
        #Then, I aim to iterate over the remaining files in each column of the metadata and find the intersection/common with each other. I tried the following:
        for i in range(1, md_input.count()[c]):
            list2=[]
            df=pd.read_table("{}.xls".format(md_input.ix[i,c]), sep="\t", usecols=["TargetID", 'FoldChange'])
            list2=list(df.iloc[:,0]) #assign the LIST
            common=list(set(list_up_0).intersection(set(list2))) #find intersection

print common

When i print the 'common', i only get the common with the LAST file. Which is expected given how i wrote the loop/code. I am unable to find a way to iterate over all the files in the column, keep it open and then identify an intersection.

Please advise if i need to clarify the above further. I know it sounds complicated but it shouldn't be. t tried to simplify it and i hope that worked

BioProgram
  • 684
  • 2
  • 13
  • 28
  • Maybe you need some space in front of `print common`??? try to align with `common=list(set(list_up_0).intersection(set(list2))) #find intersection`... – su79eu7k May 03 '16 at 02:35
  • that wouldn't solve it. it will just give you the common at each step then. – BioProgram May 03 '16 at 02:42

1 Answers1

1

I was finally able to get it to work. I am not sure if this is the simplest way but it works. I think the confusing part in the script below is the utilization of the globals key to allow opening multiple files and assigning file names based on the # in the for loop. Anyway, the script works and it also takes into consideration the fold changes. I hope this will be useful to others.

md_input = pd.read_table('Metadata.txt', sep="\t")
list_cols=list(md_input.columns)
df=pd.DataFrame(columns=list_cols)
for c in range(0, len(md_input.columns)):
    sets_up=[]
    sets_down=[]
    for i in range(0, md_input.count()[c]):
        globals()["file_"+str(i)]=md_input.ix[i,c]+".xls"
        globals()["sig_"+str(i)]=pd.read_table(globals()["file_"+str(i)], sep="\t", usecols=["TargetID", 'FoldChange'])
        globals()["List_up"+str(i)]=[]
        globals()["List_down"+str(i)]=[]
        for z in range(0, len(globals()["sig_"+str(i)].index)):
            if globals()["sig_"+str(i)].ix[z,'FoldChange']>=1.5:
                globals()["List_up"+str(i)].append(globals()["sig_"+str(i)].iloc[z,0])
            elif globals()["sig_"+str(i)].ix[z,'FoldChange']<=1.5:
                globals()["List_down"+str(i)].append(globals()["sig_"+str(i)].iloc[z,0])
        sets_up.append(set(globals()["List_up"+str(i)]))
        sets_down.append(set(globals()["List_down"+str(i)]))

    common_up=list(set.intersection(*sets_up))
    common_down=list(set.intersection(*sets_down))
    common=common_up + common_down

    for x in range(0, len(common)):
        df.loc[x,md_input.columns[c]]=common[x]

df.to_csv("Output.xls",sep="\t", index=False)
BioProgram
  • 684
  • 2
  • 13
  • 28