I have a bunch of tab-separated text files that I need to process. For each unique value in the 'study' column, I need to get the column headers.

For example, if my data looks like this:

csv1:

name   study   id   race
aaa   cb10   123   asian
bbb   cb10   128 
ccc   vj97   864   

csv2:

name study vaccine
aaa cb10
bbb cb10 abc
ccc vj97 poi

Across multiple files, my output should list the column headers for each study in the 'study' column:

cb10- name,study,id,race,vaccine
vj97- name,study,id,vaccine

This is the code I have so far:

import os
import sys
import glob, ntpath, csv

def get_header_for_tsv_file(tsv_data):
    if not os.path.exists("Results"):
        os.makedirs("Results")

    #output_path = os.path.join ("Results",study + ".csv")

    result = []
    search_for = study
    header = tsv_data.next()
    #output_file = open (output_path, "ab")
    #for row in tsv_data:
    if data["study"] in search_for:
        print data

def path_leaf(path):
    head, tail = ntpath.split(path)
    return tail or ntpath.basename(head)

def get_tsv_list():
    tsv_list = glob.glob(os.getcwd()+"\*.txt")
    return tsv_list

def get_tsv_data(tsv_name):
    file_name = os.path.join(tsv_name + ".txt")
    if not os.path.exists(file_name):
        print "Error: Couldn't find file:", file_name
        sys.exit(-1)

    input_data = open (file_name)
    input_data = csv.DictReader(input_data, delimiter = "\t")
    return input_data

def run(tsv_name):
    tsv_data = get_tsv_data(tsv_name)
    header_data = get_header_for_tsv_file(tsv_data)

if __name__ == "__main__":
    tsv_list = get_tsv_list()
    filename = [path_leaf(path) for path in tsv_list]
    for index in range(0, len(filename)):
        tsv_name_list = filename[index]
        tsv_name = os.path.splitext(os.path.basename(tsv_name_list))[0]
        tsv_data = get_tsv_data(tsv_name)
        for data in tsv_data:
            study = data["study"]
            run(tsv_name)

I'd like to do this with the standard csv module instead of pandas, if possible. Is there a way to do it?

pam

1 Answer

In pseudocode:

load all files via pandas
take the unique values from the 'study' series
make a set from the values above
output them
Christian Sauer
  • Thank you for your answer. I'm looking to do it using default csv package instead of pandas, if possible. Is there a way I can do it? – pam Oct 12 '15 at 17:13
  • Yes, I think so: just iterate over each row (as per https://docs.python.org/2/library/csv.html), append the content of the second "cell" to a list, and convert that list to a set at the end. Warning: this assumes that study is always in position 2; if not, some work is needed to get the position from the header row. – Christian Sauer Oct 12 '15 at 18:45
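
The csv-only approach from the comment above could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: it is written for Python 3 (the question's code is Python 2), the function names `add_headers` and `headers_by_study` are hypothetical, and it assumes, based on the expected output in the question, that a column belongs to a study only when at least one row of that study has a non-empty value in it. Using `csv.DictReader` looks up 'study' by name, so the column does not have to be in position 2.

```python
import csv
import glob
from collections import OrderedDict


def add_headers(result, lines, key="study"):
    """Update result (study -> ordered list of column names) from one file.

    `lines` is any iterable of tab-separated text lines, e.g. an open file.
    Assumption: a column counts for a study only if some row of that study
    has a non-empty value in that column.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    for row in reader:
        study = (row.get(key) or "").strip()
        if not study:
            continue  # file has no 'study' column, or the cell is empty
        columns = result.setdefault(study, [])
        for name in reader.fieldnames:
            # Short rows yield None for missing cells, hence the `or ""`.
            value = (row.get(name) or "").strip()
            if value and name not in columns:
                columns.append(name)
    return result


def headers_by_study(paths, key="study"):
    """Collect the headers per study across all files in `paths`."""
    result = OrderedDict()
    for path in paths:
        with open(path, newline="") as handle:
            add_headers(result, handle, key)
    return result


if __name__ == "__main__":
    # Assumes the .txt files sit in the current working directory.
    found = headers_by_study(sorted(glob.glob("*.txt")))
    for study, columns in found.items():
        print("%s- %s" % (study, ",".join(columns)))
```

Lists are used instead of sets so that the headers come out in the order they were first seen. If every header of a file should count for each study that appears in it, regardless of empty cells, the `value` check can be dropped.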