I have the following CSV file, which I extracted from an Excel spreadsheet. For background: it lists AGI numbers (think of them as protein identifiers), the unmodified peptide sequences for those proteins, the modified peptide sequences, the index or indices of the modifications, and the combined spectral count for repeated peptides. The text file is called MASP.GlycoModReader.txt and is in the following format:

AGI,UnMd Peptide (M) = x,Mod Peptide (oM) = Ox,Index/Indeces of Modification,counts,Combined Spectral count for repeated Peptides

AT1G56070.1,NMSVIAHVDHGKSTLTDSLVAAAGIIAQEVAGDVR,NoMSVIAHVDHGKSTLTDSLVAAAGIIAQEVAGDVR,2,17
AT1G56070.1,LYMEARPMEEGLAEAIDDGR,LYoMEARPoMEEGLAEAIDDGR,"3, 9",1
AT1G56070.1,EAMTPLSEFEDKL,EAoMTPLSEFEDKL,3,7
AT1G56070.1,LYMEARPMEEGLAEAIDDGR,LYoMEARPoMEEGLAEAIDDGR,"3, 9",2
AT1G56070.1,EGPLAEENMR,EGPLAEENoMR,9,2
AT1G56070.1,DLQDDFMGGAEIIK,DLQDDFoMGGAEIIK,7,1

The output file produced from the data above needs to be in the following format:

AT1G56070.1,{"peptides": [{"sequence": "NMSVIAHVDHGKSTLTDSLVAAAGIIAQEVAGDVR", "mod_sequence":    
"NoMSVIAHVDHGKSTLTDSLVAAAGIIAQEVAGDVR" , "mod_indeces": 2, "spectral_count": 17}, {"sequence": 
"LYMEARPMEEGLAEAIDDGR" , "mod_sequence": "LYoMEARPoMEEGLAEAIDDGR", "mod_indeces": [3, 9], 
"spectral_count": 3}, {"sequence": "EAMTPLSEFEDKL" , "mod_sequence": "EAoMTPLSEFEDKL", 
"mod_indeces": [3,9], "spectral_count": 7}, {"sequence": "EGPLAEENMR", "mod_sequence": 
"EGPLAEENoMR", "mod_indeces": 9, "spectral_count": 2}, {"sequence": "DLQDDFMGGAEIIK", 
"mod_sequence": "DLQDDFoMGGAEIIK", "mod_indeces": [7], "spectral_count": 1}]}

My attempt is below. If anyone has a better solution in another language, or can look at mine and suggest a more efficient way of doing this, please comment. Thank you.

    #!/usr/bin/env node

    var fs = require('fs');
    var csv = require('csv');
    var data ="proteins.csv";

    /* Uses csv nodejs module to parse the proteins.csv file.
    * Parses the csv file row by row and updates the peptide_arr.
    * For new entries creates a peptide object, for similar entries it updates the
    * counts in the peptide object with the same AGI#.
    * Uses a peptide object to store protein ID AGI#, and the associated data.
    * Writes all formatted peptide objects to a txt file - output.txt.
    */

    // Tracks current row
    var x = 0;
    // An array of peptide objects stores the information from the csv file
    var peptide_arr = [];

    // csv module reads row by row from data 
    csv()
    .from(data)
    .to('debug.csv')
    .transform(function(row, index) {
        // For the first entry push a new peptide object with the AGI# (row[0]) 
        if(x == 0) {
        // cur is the current peptide read into row by csv module
        Peptide cur = new Peptide( row[0] );

        // Add the assoicated data from row (1-5) to cur
        cur.data.peptides.push({
            "sequence" : row[1];
            "mod_sequence" : row[2];
            if(row[5]){
            "mod_indeces" : "[" + row[3] + ", " + row[4] + "]";
            "spectral_count" : row[5];  
            } else {
            "mod_indeces" : row[3];
            "spectral_count" : row[4];  
            }
        });

        // Add the current peptide to the array
        peptide_arr.push(cur);
        }

        // Move to the next row
        x++;
    });

    // Loop through peptide_arr and append output with each peptide's AGI# and its data
    String output = "";
    for(var peptide in peptide_arr) 
    {
        output = output + peptide.toString()
    }
    // Write the output to output.txt
    fs.writeFile("output.txt", output);

    /* Peptide Object :
     *  - id:AGI#
     *  - data: JSON Array associated
     */
    // This is the actual function that does the ID retrieving and data storage
    function Peptide(id)
    {
        this.id = id;
        this.data = {
            peptides: []
        };
    }

    /* Peptide methods :
     *  - toJson : Returns the properly formatted string
     */
    Peptide.prototype = {
        toString: function(){
            return this.id + "," + JSON.stringify(this.data, null, " ") + "/n"
        }
    };

Edit: When I run the solution I posted, it appears to leak memory: it runs indefinitely without producing any readable output. If anyone can help work out why this happens, that would be great.

zsyed92

1 Answer

Does your version work? It looks like you only ever create one Peptide object. Also, what is the "if(row[5])" statement doing? In your example data every row has exactly 5 elements. And is mod_indeces always supposed to be a list? In your example output it isn't a list for the first peptide. Anyway, here is what I came up with in Python:

    import csv
    import json

    data = {}
    # Python 3: the csv module expects a text-mode file opened with newline=''
    with open('proteins.csv', newline='') as f:
        reader = csv.reader(f)
        for row in reader:
            name = row[0]
            sequence = row[1]
            mod_sequence = row[2]
            # "3, 9" -> [3, 9]; list() is needed because map() is lazy in Python 3
            mod_indeces = list(map(int, row[3].split(', ')))
            spectral_count = int(row[4])
            peptide = {'sequence': sequence, 'mod_sequence': mod_sequence,
                       'mod_indeces': mod_indeces, 'spectral_count': spectral_count}
            # Group peptides under their protein ID (AGI)
            if name in data:
                data[name]['peptides'].append(peptide)
            else:
                data[name] = {'peptides': [peptide]}

    # newline='' writes '\n' exactly as given, with no platform translation
    with open('output.txt', 'w', newline='') as f:
        for protein in data:
            f.write(protein)
            f.write(',')
            f.write(json.dumps(data[protein]))
            f.write('\n')

If you are on Windows and want to view the file as plain text, you may want to replace '\n' with '\r\n' or os.linesep.
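One thing the script above does not do is combine repeated peptides: in the question's expected output, the two LYMEARPMEEGLAEAIDDGR rows (counts 1 and 2) appear as a single entry with a spectral_count of 3. Here is a minimal sketch of that merge. It assumes two rows count as the same peptide when AGI, sequence and modified sequence all match, and the helper name `group_peptides` is made up for illustration. The sample rows are taken from the question and read from a string so the snippet runs without a file.

```python
import csv
import io
import json
from collections import OrderedDict

SAMPLE = '''AT1G56070.1,LYMEARPMEEGLAEAIDDGR,LYoMEARPoMEEGLAEAIDDGR,"3, 9",1
AT1G56070.1,EAMTPLSEFEDKL,EAoMTPLSEFEDKL,3,7
AT1G56070.1,LYMEARPMEEGLAEAIDDGR,LYoMEARPoMEEGLAEAIDDGR,"3, 9",2
'''


def group_peptides(rows):
    """Group rows by AGI and sum the spectral counts of repeated peptides."""
    merged = OrderedDict()  # (agi, sequence, mod_sequence) -> peptide dict
    data = OrderedDict()    # agi -> {'peptides': [...]}
    for row in rows:
        agi, sequence, mod_sequence, indeces, count = row[:5]
        key = (agi, sequence, mod_sequence)
        if key in merged:
            # Assumption: a repeated peptide only adds to the existing count
            merged[key]['spectral_count'] += int(count)
            continue
        peptide = {
            'sequence': sequence,
            'mod_sequence': mod_sequence,
            'mod_indeces': [int(i) for i in indeces.split(',')],
            'spectral_count': int(count),
        }
        merged[key] = peptide
        data.setdefault(agi, {'peptides': []})['peptides'].append(peptide)
    return data


result = group_peptides(csv.reader(io.StringIO(SAMPLE)))
for agi, entry in result.items():
    print(agi + ',' + json.dumps(entry))
```

To use it on the real file, pass `csv.reader(f)` for an open file `f` instead of the `io.StringIO` sample.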

If you want to skip some rows (if there is a header or something), you can do something like this:

    import csv
    import json

    data = {}
    rows_to_skip = 1  # number of leading rows (e.g. a header) to ignore
    rows_read = 0
    # Python 3: the csv module expects a text-mode file opened with newline=''
    with open('proteins.csv', newline='') as f:
        reader = csv.reader(f)
        for row in reader:
            if rows_read >= rows_to_skip:
                name = row[0]
                sequence = row[1]
                mod_sequence = row[2]
                # "3, 9" -> [3, 9]; list() is needed because map() is lazy in Python 3
                mod_indeces = list(map(int, row[3].split(', ')))
                spectral_count = int(row[4])
                peptide = {'sequence': sequence, 'mod_sequence': mod_sequence,
                           'mod_indeces': mod_indeces, 'spectral_count': spectral_count}
                if name in data:
                    data[name]['peptides'].append(peptide)
                else:
                    data[name] = {'peptides': [peptide]}
            rows_read += 1

    # newline='' writes '\n' exactly as given, with no platform translation
    with open('output.txt', 'w', newline='') as f:
        for protein in data:
            f.write(protein)
            f.write(',')
            f.write(json.dumps(data[protein]))
            f.write('\n')
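As an alternative to counting rows by hand, the skipping can be sketched with `itertools.islice` from the standard library. The sample below uses the header and two data rows from the question, read from a string rather than from proteins.csv; `data_rows` is just an illustrative name.

```python
import csv
import io
from itertools import islice

SAMPLE = '''AGI,UnMd Peptide (M) = x,Mod Peptide (oM) = Ox,Index/Indeces of Modification,counts,Combined Spectral count for repeated Peptides
AT1G56070.1,EGPLAEENMR,EGPLAEENoMR,9,2
AT1G56070.1,DLQDDFMGGAEIIK,DLQDDFoMGGAEIIK,7,1
'''

rows_to_skip = 1
reader = csv.reader(io.StringIO(SAMPLE))
# islice(reader, n, None) drops the first n rows and yields the rest
data_rows = list(islice(reader, rows_to_skip, None))
for row in data_rows:
    print(row[0], row[1], int(row[4]))
```

For a single header row, calling `next(reader, None)` once before the loop has the same effect.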

If you want the keys of each peptide dictionary to appear in a particular order, you can use an OrderedDict instead of the default dict. Just replace the peptide line with the following:

    peptide = OrderedDict([('sequence', sequence),
                           ('mod_sequence', mod_sequence),
                           ('mod_indeces', mod_indeces),
                           ('spectral_count', spectral_count)])

Now the order is preserved. That is, sequence is followed by mod_sequence followed by mod_indeces followed by spectral_count. To change the order, just change the order of elements in the OrderedDict.

Note that you will also have to add from collections import OrderedDict in order to be able to use OrderedDict.
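A small self-contained check of the ordering, using one peptide from the question as literal values. `json.dumps` writes keys in the order the mapping yields them, so the OrderedDict order carries through to the output. (On Python 3.7 and later a plain dict also keeps insertion order, so there the OrderedDict mainly makes the intent explicit.)

```python
import json
from collections import OrderedDict

# Keys are emitted in exactly the order they are listed here
peptide = OrderedDict([('sequence', 'EGPLAEENMR'),
                       ('mod_sequence', 'EGPLAEENoMR'),
                       ('mod_indeces', [9]),
                       ('spectral_count', 2)])

line = json.dumps(peptide)
print(line)
# {"sequence": "EGPLAEENMR", "mod_sequence": "EGPLAEENoMR", "mod_indeces": [9], "spectral_count": 2}
```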

Matthew Wesly
  • Thank you Matthew! I saved your script in Python format and ran it from a terminal on Mac OS X. I received the following error, which could be on my part through running it, but I'll post it anyways: – zsyed92 Jul 23 '13 at 22:18
  • Traceback (most recent call last): File "/Users/zsyed/PythonPeptideJSON.py", line 8, in for row in reader: _csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode? – zsyed92 Jul 23 '13 at 22:18
  • And thank you for your feedback on my program. I will take into consideration what you said and try going through it again – zsyed92 Jul 23 '13 at 22:19
  • Strange, I didn't have that problem. People [here](http://stackoverflow.com/questions/2930673/python-and-csv-help) said opening the file in 'rU' mode seems to fix that problem, so maybe give that a shot. – Matthew Wesly Jul 23 '13 at 22:33
  • Awesome, that works, but I am getting another error unfortunately :\. Sorry for bugging you about this: – zsyed92 Jul 23 '13 at 22:39
  • Traceback (most recent call last): File "PythonPeptideJSON.py", line 12, in mod_indeces = map(int,row[3].split(', ')) ValueError: invalid literal for int() with base 10: 'Index/Indeces of Modification' – zsyed92 Jul 23 '13 at 22:40
  • Oh, it looks like "Index/Indeces of Modification" is the name of the mod_indeces column of the csv? You'll probably want to skip the first row, seeing as it doesn't contain any data. – Matthew Wesly Jul 23 '13 at 22:57
  • it seems when I run my solution I posted, I am getting a memory leak error; it is infinitely running while not producing any substantial, readable output. If anyone could be willing to assist in assessing why this is occurring, that would be great. – zsyed92 Jul 24 '13 at 03:03
  • Are you saying I should just remove the 'name = row [0]' line? – zsyed92 Jul 24 '13 at 20:44
  • That would be skipping the first column. I edited my post to allow you to skip the first x number of rows – Matthew Wesly Jul 24 '13 at 21:48
  • awesome because I need to remove the headers from the final output text anyways. Specifically, "AGI,UnMd Peptide (M) = x,Mod Peptide (oM) = Ox,Index/Indeces of Modification,counts,Combined Spectral count for repeated Peptides" should not be there. – zsyed92 Jul 24 '13 at 21:56
  • Oh, I didn't realize that was part of the file. I thought that was you explaining what each column represented. Did the new solution work for you? – Matthew Wesly Jul 24 '13 at 22:01
  • Awesome, @MatthewWesly, this works! Thank you very much. Once I build reputation, I will vote up and accept this answer. Quick question again. The script that you have generates the data in the following order: "mod_sequence, spectral_count, mod_indeces, sequence." The output file that I need to replicate is formatted: "sequence, mod_sequence, mod_indeces, spectral_count." So my question is, is it possible to switch the order such that "sequence" is first, instead of last, right before "mod_sequence," and "mod_indeces" comes before "spectral_count" instead of after? Thank you so much! – zsyed92 Jul 24 '13 at 22:12
  • Why does the order matter? Dictionaries are naturally unsorted. Regardless, you can use an OrderedDict instead of the default dict. I've updated my answer to have that option. – Matthew Wesly Jul 25 '13 at 21:05
  • Thank you so much for your assistance my friend! Yes, it worked perfectly. If I were to add another column of information, how would I go about doing so? Would I need to change the number of rows read or anything as such? I tried doing so, but I am getting another compilation error unfortunately – zsyed92 Aug 02 '13 at 22:20
  • If it worked, feel free to mark the answer as correct. And do you mean another column was added to the csv and you just need to do something with the value and add it to the dictionary? E.g. `peak_absorbance = int(row[5])`. As you loop through the file, `row` is an array of data containing the values in the columns of that row. Thus `row[0]` is the first column and `row[5]` is the 6th column. – Matthew Wesly Aug 03 '13 at 02:37
  • Sorry for not clarifying. I meant adding on to the original csv file. Interestingly enough, I added another column containing information for each cell, denoting it as 'bar_id.' And as I followed what you said above, I get the error we got last time: Traceback (most recent call last): File "PythonPeptideJSON.py", line 13, in mod_indeces = map(int,row[3].split(', ')) ValueError: invalid literal for int() with base 10: '' – zsyed92 Aug 05 '13 at 17:33
  • That means one of the rows has a mod index that cannot be cast to an integer. `map(int,row[3].split(","))` is essentially the same as `[int(w) for w in row[3].split(",")]`. Figure out why the value in that row is not an integer and handle that case. – Matthew Wesly Aug 05 '13 at 18:02
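
One way to handle the case from the last comment, sketched under the assumption that an empty cell should simply become an empty list. The helper name `parse_indeces` is hypothetical and not part of the answer's script; a cell containing non-numeric text (such as a header) would still raise ValueError.

```python
def parse_indeces(cell):
    """Turn a cell such as '3, 9' into [3, 9]; an empty cell gives []."""
    parts = [p.strip() for p in cell.split(',')]
    # Skip empty pieces so '' or a trailing comma does not reach int()
    return [int(p) for p in parts if p]


print(parse_indeces('3, 9'))
print(parse_indeces('7'))
print(parse_indeces(''))
```

In the script, `mod_indeces = parse_indeces(row[3])` would take the place of the `map(int, ...)` line.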