I am trying to write a code where it takes a Metadata.txt as an input, then it identifies the common genes across the different input files whose names are extracted from the Metadata.txt file.
Example of Metadata.txt
SIG1 SIG2 File1 File3 File2 File4 File3 File5 File4
The files in my directory are File1.xls, File2.xls, File3.xls...File6.xls. For simplicity, i have same inputs for File1 and File 3, as well as for File 2 and 4.
File1.xls or File3.xls
TargetID FoldChange p-value Adjusted-p A 0.543528215 0.000518847 0.000518847 B 0.638469898 0.00204759 0.00204759 C 1.936595724 0.00250229 0.00250229 D 0.657322154 0.012840013 0.012840013 E 1.728842021 0.00251105 0.00251105 F 2.024842641 0.000719261 0.000719261 G 4.049059413 2.25E-05 2.25E-05 H 0.478660942 0.000352179 0.000352179 I 0.449304016 0.000489521 0.000489521
File2.xls or File4.xls
TargetID FoldChange p-value Adjusted-p JJ 0.453537892 4.22E-06 4.22E-06 A 0.558325503 0.001697851 0.001697851 B 0.637336564 7.64E-05 7.64E-05 D 1.804853034 0.000492439 0.000492439 E 0.378445825 1.72E-05 1.72E-05 JJJJ 1.601997491 0.019618883 0.019618883
File5.xls
TargetID FoldChange p-value Adjusted-p A 3.140223972 0.013347275 0.013347275 B 1.5205222 0.032318774 0.032318774 C 1.532760451 0.043763101 0.043763101 D 1.522865896 0.001791471 0.001791471
The goal is to output two files "SIG1.txt" and "SIG2.txt" which has the common genes between File1/File2 and File3/File4/File5, respectively. So the metadata is providing a platform to iterate over things. Here is what I had so far:
md_input = pd.read_table("Metadata.txt", sep="\t") #opens the metadata file
for c in range(0, len(md_input.columns)):
first_file=md_input.ix[0,c]+".xls"
print first_file #this will print "File1.xls" for column1 and File3.xls for column#2
first_sig=pd.read_table(first_file, sep="\t", usecols=["TargetID", 'FoldChange']) #opens the first file
list1=list(first_file.iloc[:,0]) #takes column of first file and converts to list
#Then, I aim to iterate over the remaining files in each column of the metadata and find the intersection/common with each other. I tried the following:
for i in range(1, md_input.count()[c]):
list2=[]
df=pd.read_table("{}.xls".format(md_input.ix[i,c]), sep="\t", usecols=["TargetID", 'FoldChange'])
list2=list(df.iloc[:,0]) #assign the LIST
common=list(set(list_up_0).intersection(set(list2))) #find intersection
print common
When i print the 'common', i only get the common with the LAST file. Which is expected given how i wrote the loop/code. I am unable to find a way to iterate over all the files in the column, keep it open and then identify an intersection.
Please advise if i need to clarify the above further. I know it sounds complicated but it shouldn't be. t tried to simplify it and i hope that worked