
My text file data looks like this (protein-protein interaction data):

transcription_factor protein

Myc Rilpl1

Mycn Rilpl1

Mycn "Wdhd1,Socs4"

Sox2 Rilpl1

Sox2 "Wdhd1,Socs4"

Nanog "Wdhd1,Socs4"

I want it to look like this (to see how many transcription factors each protein interacts with):

protein transcription_factor

Rilpl1 Myc, Mycn, Sox2

Wdhd1 Mycn, Sox2, Nanog

Socs4 Mycn, Sox2, Nanog

After running my code, this is what I get (how can I get rid of the "" and put the two proteins on separate lines?):

protein transcription_factor

Rilpl1 Myc, Mycn, Sox2

"Wdhd1,Socs4" Mycn, Nanog, Sox2

Here is my code:

input_file = ARGV[0]
hash = {}
File.readlines(input_file, "\r").each do |line|
  transcription_factor, protein = line.chomp.split("\t")

  if hash.has_key? protein
    hash[protein] << transcription_factor
  else
    hash[protein] = [transcription_factor]
  end
end

hash.each do |key, value|
  if value.count > 2
    string = value.join(', ')
    puts "#{key}\t#{string}"
  end
end
Michael
  • Where do `transcription_factor prtoein [sic]` and `protein transcription_factor` go? – sawa Feb 05 '14 at 14:14
  • sorry, what do you mean? – Michael Feb 05 '14 at 14:17
  • Do your text files really have blank lines between each line of text? If not, please fix your examples so they're accurate. As in real life, GIGO, so we need good input samples. – the Tin Man Feb 05 '14 at 14:21
  • Also, tab characters are not separating the fields in the lines, but you did not even explain that there should be tab characters. -1 and close vote. – sawa Feb 05 '14 at 14:25
  • By the way, what is "protin"? I am sure the OP's question is not meant to be a question from a professional, but still, that remains as a question to me. – sawa Feb 05 '14 at 14:29
  • My bad, there are no blank lines between the lines of my text file; I just can't figure out how to get the lines together. – Michael Feb 05 '14 at 14:30
  • Thanks for the warning. I am new to this field, but I am willing to learn. – Michael Feb 05 '14 at 14:35

1 Answer


Here is a quick way to fix your problem:

...
transcription_factor, proteins = line.chomp.split("\t")
proteins.to_s.gsub(/"/,'').split(',').each do |protein|
  if hash.has_key? protein
    hash[protein] << transcription_factor
  else
    hash[protein] = [transcription_factor]
  end
end
...

The snippet above strips any double quotes from the protein field, splits it on commas, and then does for each resulting protein what your original code already did.
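As a rough illustration of that first step, here is what the quote stripping and splitting does to a few sample fields from the question's data. The helper name `split_proteins` is hypothetical and exists only for this sketch.

```ruby
# Hypothetical helper (not part of the original code): strips double
# quotes from a field and splits it on commas.
def split_proteins(field)
  # to_s guards against nil, e.g. when a line has no second column
  field.to_s.gsub(/"/, '').split(',')
end

p split_proteins('"Wdhd1,Socs4"')  # => ["Wdhd1", "Socs4"]
p split_proteins('Rilpl1')         # => ["Rilpl1"]
p split_proteins(nil)              # => []
```

A field without quotes or commas passes through as a one-element array, so the same loop handles both cases.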

Also, if you would like to eliminate the `if`, you can define the hash like this:

hash = Hash.new { |h, key| h[key] = [] }

which means that the first time a missing key is accessed, the hash stores and returns a new empty array for it. So now you can skip the `if` and write

hash[protein] << transcription_factor
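
Putting the two suggestions together, a minimal sketch might look like the following. It assumes tab-separated fields, one record per line, and no header row in the input. The method name `group_by_protein` and the inline `sample` array are hypothetical stand-ins for reading your file.

```ruby
# Sketch: invert "transcription_factor -> proteins" records into
# "protein -> [transcription_factors]". Assumes tab-separated fields.
def group_by_protein(lines)
  hash = Hash.new { |h, key| h[key] = [] }
  lines.each do |line|
    transcription_factor, proteins = line.chomp.split("\t")
    # Skip blank or malformed lines that lack either column
    next if transcription_factor.nil? || proteins.nil?

    proteins.gsub(/"/, '').split(',').each do |protein|
      hash[protein.strip] << transcription_factor
    end
  end
  hash
end

# Hypothetical sample standing in for the lines of the input file
sample = [
  "Myc\tRilpl1\n",
  "Mycn\tRilpl1\n",
  "Mycn\t\"Wdhd1,Socs4\"\n",
  "Sox2\tRilpl1\n",
  "Sox2\t\"Wdhd1,Socs4\"\n",
  "Nanog\t\"Wdhd1,Socs4\"\n"
]

group_by_protein(sample).each do |protein, factors|
  puts "#{protein}\t#{factors.join(', ')}"
end
```

With this sample the proteins come out in first-seen order (Rilpl1, Wdhd1, Socs4), each followed by its transcription factors. If your real file starts with a header line, it would be grouped like any other record, so you may want to drop the first line before processing.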
Nikos
  • Yet another way to do what the `if...else...end` does is: `(hash[protein] ||= []) << transcription_factor`. This obviates the need for a Hash default. – Wayne Conrad Feb 26 '14 at 23:47
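
For reference, the `||=` idiom from the last comment can be sketched with a plain hash and no default block. The `pairs` array is hypothetical sample data, not taken from a real file.

```ruby
hash = {} # plain hash, no default block

# Hypothetical [protein, transcription_factor] pairs for illustration
pairs = [["Rilpl1", "Myc"], ["Rilpl1", "Mycn"], ["Wdhd1", "Mycn"]]

pairs.each do |protein, transcription_factor|
  # Create the array on first use, then append to it
  (hash[protein] ||= []) << transcription_factor
end

p hash
```

Unlike the default-block version, merely reading a missing key here returns `nil` and does not add an entry to the hash.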