1

I have multiple tab files with same name in different folders like this

F:/RNASEQ2019/ballgown/abundance_est/RBRN02.sorted.bam\t_data.ctab
F:/RNASEQ2019/ballgown/abundance_est/RBRN151.sorted.bam\t_data.ctab

Each file have 5-6 common columns and I want to pick up two columns- Gene and FPKM. Gene column is same for all only FPKM value differ. I want to pickup Gene and FPKM column form each file and make a master file like this

Gene RBRN02 RBRN03 RBRN151
gene1   67  699     88
gene2   66  77      89

I did this

import os

path ="F:/RNASEQ2019/ballgown/abundance_est/"

files =[]

## r=root, d=directory , f=file

for r, d, f in os.walk(path):
    for file in f:
        if 't_data.ctab' in file:
            files.append(os.path.join(r, file))

df=[]

for f in files:
    df.append(pd.read_csv(f, sep="\t"))

But this is not doing side wise merge. How do I get that above format? please help

Pasha
  • 6,298
  • 2
  • 22
  • 34
jit c
  • 33
  • 1
  • 7

3 Answers3

2

IIUC, you can get your desired result with a simple list comprehension :

dfs = [pd.read_csv(f,sep='\t') for f in files]
df = pd.concat(dfs)
print(df)

or as a one liner

df = pd.concat([pd.read_csv(f,sep='\t') for f in files])
Umar.H
  • 22,559
  • 7
  • 39
  • 74
2

Using datatable, you can read multiple files at once by specifying the pattern:

import datatable as dt
dfs = dt.fread("F:/RNASEQ2019/ballgown/abundance_est/**/t_data.ctab",
               columns={"Gene", "FPKM"})

If there are multiple files, this will produce a dictionary where each key is the name of the file, and the corresponding value is the content of that file, parsed into a frame. The optional columns parameter limits which columns you want to read.

In your case it seems like you want to rename the columns based on the name of the file where it came from, so you may do something like this:

frames = []
for filename, frame in dfs.items():
    mm = re.search(r"(\w+)\.sorted\.bam", filename)
    frame.names = {"FPKM": mm.group(1)}
    frames.append(frame)

In the end, you can cbind the list of frames:

df = dt.cbind(frames)

If you need to work with a pandas dataframe, you can convert easily: df.to_pandas().

Pasha
  • 6,298
  • 2
  • 22
  • 34
  • I was using this on dummy data and it was o.k. but when I am using it on real data set the second part of script is showing error "AttributeError: 'NoneType' object has no attribute 'group'. I search and understand that re.search is getting null value which is showing this error. Is this means my search in re.search step not o.k.? My present dict key is like this `/home/RNASeqBC_2019/TSampleStringtieOUT/sorted_RBRT02.bam/t_data.ctab'` and I change the re.search expression `r"(\w+)\sorted_RBRT\.bam"`, is this correct? please help – jit c Feb 04 '20 at 04:25
  • 1
    @jitc you need to come up with a regular expression that will match all the files that you have. If some of your files look like `{FPKM}.sorted.bam` while others are `sorted_{FPKM}.bam`, then either use a regular expression that matches both styles, or use 2 regular expressions. A single regex could be `r"(?:sorted\W)?(\w+)(?:\Wsorted)?\.bam"`. – Pasha Feb 04 '20 at 07:15
0

How about reading each file in a separate data frame and then merging them?

farheen
  • 111
  • 2
  • 3
  • 13