
I have a set of files (MEDLINE records from NCBI). Each file is associated with one journal title and contains zero or more GenBank identification numbers (GBIDs). I can already count the GBIDs in each file and pair that count with the journal name. My problem is that several files may belong to the same journal, and I don't know how to sum the per-file GBID counts into a total per journal.

My current code: jt stands for journal title, pulled out properly from the file. GBIDs are added to the count as encountered.

... up to this point, the first search is performed, each "pmid" you can think of as a single file, so each "fetch" goes through all the files one at a time...

  pmid_list.each do |pmid|

   ncbi_fetch.pubmed(pmid, "medline").each do |pmid_line|

    if pmid_line =~ /JT.+- (.+)\n/
        jt = $1
        jt_count = 0
        jt_hash[jt] = jt_count

        ncbi_fetch.pubmed(pmid, "medline").each do |pmid_line_2|

            if pmid_line_2 =~ /SI.+- GENBANK\/(.+)\n/
                gbid = $1
                jt_count += 1
                gbid_hash["#{gbid}\n"] = nil
            end 
        end 

        if jt_count > 0
            puts "#{jt} = #{jt_count}"

        end
    end
  end
end

My result:

 Your search returned 192 results.
 Virology journal = 8
 Archives of virology = 9
 Virus research = 1
 Archives of virology = 6
 Virology = 1

Basically, how do I get it to say Archives of virology = 15, but for any journal title? I tried a hash, but the second archives of virology just overwrote the first... is there a way to make two keys add their values in a hash?

Full code:

 #!/usr/local/bin/ruby

 require 'rubygems'
 require 'bio'


Bio::NCBI.default_email = 'kepresto@uvm.edu'

ncbi_search = Bio::NCBI::REST::ESearch.new
ncbi_fetch = Bio::NCBI::REST::EFetch.new


print "\nQuery?\s" 

query_phrase = gets.chomp

"\nYou said \"#{query_phrase}\". Searching, please wait..."

pmid_list = ncbi_search.search("pubmed", "#{query_phrase}", 0)

puts "\nYour search returned #{pmid_list.count} results."

if pmid_list.count > 200
puts "\nToo big."
exit
end

gbid_hash = Hash.new
jt_hash = Hash.new(0)


pmid_list.each do |pmid|

ncbi_fetch.pubmed(pmid, "medline").each do |pmid_line|

    if pmid_line =~ /JT.+- (.+)\n/
        jt = $1
        jt_count = 0
        jt_hash[jt] = jt_count

        ncbi_fetch.pubmed(pmid, "medline").each do |pmid_line_2|

            if pmid_line_2 =~ /SI.+- GENBANK\/(.+)\n/
                gbid = $1
                jt_count += 1
                gbid_hash["#{gbid}\n"] = nil
            end 
        end 

        if jt_count > 0
            puts "#{jt} = #{jt_count}"

        end
        jt_hash[jt] += jt_count
    end
end
end


jt_hash.each do |key,value|
# if value > 0
    puts "Journal: #{key} has #{value} entries associated with it."
# end
end

# gbid_file = File.open("temp_*.txt","r").each do |gbid_count|
#   puts gbid_count
# end
kbearski
  • sorry, using ruby, with bioruby gems – kbearski Apr 13 '12 at 04:24
  • OK after my answer and your edit the above code should now work. You say it doesn't. What does the output look like? And only the lines with `Journal: ... has ... entries associated it` since that's the only `puts` that's being done after all the searching is completed. – yamen Apr 15 '12 at 21:38

1 Answer


Somewhere near the top, declare jt_hash with a default value of zero:

jt_hash = Hash.new(0)

Then, after:

puts "#{jt} = #{jt_count}"

Put:

jt_hash[jt] += jt_count

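This adds jt_count to the running total in the hash instead of overwriting it. It only accumulates if you also remove the earlier `jt_hash[jt] = jt_count` line, which resets the entry to 0 for every file. The `puts` above still prints each file's own count as it is processed, so that output looks the same as before:

Your search returned 192 results.
Virology journal = 8
Archives of virology = 9
Virus research = 1
Archives of virology = 6
Virology = 1

The combined figure (15 for Archives of virology) is what ends up in jt_hash.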

If you then want everything to just print once just put something right at the end that goes through jt_hash and prints everything:

# Each pair is yielded as (key, value), i.e. (journal title, total count).
jt_hash.each { |journal, count|
  puts "#{journal} = #{count}"
}
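
For reference, here is a minimal sketch of the whole accumulation, run against made-up data so it works without a network call. The `SAMPLE_RECORDS` strings and the `count_gbids_by_journal` helper are hypothetical stand-ins for whatever `ncbi_fetch.pubmed(pmid, "medline")` returns, and the accession numbers are invented. It also reads each record once rather than fetching it a second time, which is an assumption about what you want, not something your code has to do.

```ruby
# Hypothetical MEDLINE-style records, one string per PMID (invented data).
SAMPLE_RECORDS = [
  "PMID- 1\nJT  - Archives of virology\nSI  - GENBANK/AB000001\nSI  - GENBANK/AB000002\n",
  "PMID- 2\nJT  - Virology\nSI  - GENBANK/AB000003\n",
  "PMID- 3\nJT  - Archives of virology\nSI  - GENBANK/AB000004\n",
  "PMID- 4\nJT  - Virus research\n"
]

# Hypothetical helper: returns a hash of journal title => total GBID count.
def count_gbids_by_journal(records)
  totals = Hash.new(0) # missing keys read as 0, so += works on first sight
  records.each do |record|
    # Pull the journal title out of the JT line; skip records without one.
    journal = record[/^JT\s+- (.+)$/, 1]
    next unless journal

    # Collect every GenBank ID listed on an SI line of this record.
    gbids = record.scan(/^SI\s+- GENBANK\/(.+)$/).flatten

    # Add this record's count to the journal's running total.
    totals[journal] += gbids.size
  end
  totals
end

totals = count_gbids_by_journal(SAMPLE_RECORDS)
totals.each { |journal, count| puts "#{journal} = #{count}" }
```

With the sample data above, the two "Archives of virology" records are reported as a single line with their counts added together, and "Virus research" shows up with 0 because nothing filters out empty journals.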
yamen
  • Actually, how's this for irony, I was really only missing the (0) when I look back at all the different ways I tried. I love it when my prof doesn't explain anything! – kbearski Apr 13 '12 at 04:41
  • The `Hash.new(0)` trick is actually way more useful than most people realize. When using objects that can be modified, like strings, make sure to use the block method. `Hash.new('')` or `Hash.new([ ])` can lead to surprises. – tadman Apr 13 '12 at 04:57
  • Can you post the full code as an edit in your original question and we'll take another look? – yamen Apr 13 '12 at 05:46
  • Remove this line: `jt_hash[jt] = jt_count` - that resets your count to 0 every time, so you'll never accumulate. – yamen Sep 26 '12 at 22:10
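
To illustrate the comment above about mutable defaults, here is a small sketch of the surprise and of the block form that avoids it. The variable names and the accession numbers are made up for the example; it imagines collecting a list of GBIDs per journal instead of a count.

```ruby
# Pitfall: every missing key shares the SAME default array object.
shared = Hash.new([])
shared["Archives of virology"] << "AB000001" # mutates the shared default
shared["Virology"] << "AB000003"             # mutates it again

puts shared.size                  # no key was ever stored in the hash
puts shared["Virus research"].inspect # an unrelated key sees both IDs

# Block form: the block runs per missing key and stores a fresh array.
fresh = Hash.new { |hash, key| hash[key] = [] }
fresh["Archives of virology"] << "AB000001"
fresh["Virology"] << "AB000003"

puts fresh.size
puts fresh["Archives of virology"].inspect
```

`Hash.new(0)` is safe for counting because `+=` assigns a new integer to the key rather than modifying the default object in place.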