You can get the list of PMIDs by web scraping with the mechanize gem in Ruby. Run gem install mechanize
and then run the Ruby script below to get the required result:
require 'mechanize'
agent = Mechanize.new
# Scrape the PMIDs from the first page of the PubMed search results
elements = agent.get('http://www.ncbi.nlm.nih.gov/pubmed/?term=Cancer+TFF').search(".rprtid").to_a
pmids = elements.map { |x| x.elements.last.text }
puts "List of pmids:"
puts pmids
# Fetch each abstract from TogoWS and write it to the output file
File.open("output_pmid_abstracts.txt", "w") do |file|
  pmids.each do |pmid|
    puts "Getting Abstract for PMID: #{pmid}"
    abstract = agent.get("http://togows.dbcls.jp/entry/ncbi-pubmed/#{pmid}/abstract").body
    file.puts "pmid:#{pmid}"
    file.puts abstract
    file.puts ""
  end
end
puts "Done"
This will create the file output_pmid_abstracts.txt in your current directory, which will look something like this:
pmid:27220894
BACKGROUND & AIMS: Gastric cancer has familial clustering in incidence, and the familial relatives of gastric ...
...
pmid:26479350
Trefoil factor family (TFF) peptides are a group of molecules bearing a characteristic three-loop trefoil domain ...
...
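If you want to work with the abstracts afterwards, the output file can be read back into a hash. The following is only a minimal sketch: parse_pmid_abstracts is a hypothetical helper name (not part of the script above), and it assumes every record starts with a line of the form pmid:<digits>, as in the sample output.

```ruby
# Hypothetical helper: turn the text of output_pmid_abstracts.txt
# into a hash of { "pmid" => "abstract text" }.
# Assumes each record begins with a line that is exactly "pmid:<digits>".
def parse_pmid_abstracts(text)
  records = {}
  current = nil
  text.each_line do |line|
    line = line.chomp
    if (m = line.match(/\Apmid:(\d+)\z/))
      # A new record starts here
      current = m[1]
      records[current] = []
    elsif current
      # Everything else belongs to the abstract of the current record
      records[current] << line
    end
  end
  # Join the collected lines and drop the blank separator lines
  records.transform_values { |lines| lines.join("\n").strip }
end
```

You could then call it as parse_pmid_abstracts(File.read("output_pmid_abstracts.txt")) and look up a single abstract by its PMID string.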
PS: Make sure you install the mechanize gem first! Otherwise you will get the error: require': cannot load such file -- mechanize (LoadError)
, because Ruby cannot find the required library/gem. If you still get the require error after gem install mechanize
, run sudo gem install mechanize
and try again.
Update 1:
As nik mentioned in a comment, this code only loads the first page (20 entries) of the search results, even when there are more. I have updated the code to fix this problem. Some of the URLs are different now.
I first get the list of all PMIDs through an API and then look up each PMID's abstract by web scraping.
require 'mechanize'
agent = Mechanize.new
search_terms = "Cancer+TFF"
# Ask the esearch API for up to 10000 matching PMIDs in a single request
url = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=#{search_terms}&retmax=10000"
all_pmids = agent.get(url).search("IdList").text.strip.split("\n").map { |x| x.strip.to_i }
puts "List of pmids:"
puts all_pmids
File.open("output_pmid_abstracts.txt", "w") do |file|
  all_pmids.each do |pmid|
    puts "Extracting Abstract for pmid: #{pmid}"
    abstract_url = "http://www.ncbi.nlm.nih.gov/pubmed/#{pmid}"
    # Fall back to a blank entry when the page has no abstract
    abstract = agent.get(abstract_url).search(".abstr").children[1].text rescue " "
    file.puts "pmid:#{pmid}"
    file.puts abstract
    file.puts ""
  end
end
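The search_terms string in the script is already URL-encoded by hand (Cancer+TFF). If you want to pass arbitrary search terms, one option is to let Ruby's standard library do the encoding. This is just a sketch: esearch_url is a hypothetical helper name, and it only builds the same esearch URL that the script uses.

```ruby
require "uri"

# Same esearch endpoint as in the script above
ESEARCH_BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Hypothetical helper: build the esearch URL for a plain-text search term.
# URI.encode_www_form encodes spaces as "+" and escapes characters like "&".
def esearch_url(term, retmax: 10_000)
  params = { "db" => "pubmed", "term" => term, "retmax" => retmax }
  "#{ESEARCH_BASE}?#{URI.encode_www_form(params)}"
end
```

With this, url = esearch_url("Cancer TFF") gives the same URL as the hard-coded one in the script.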
PS: Some papers have no abstract at all, e.g. 16376814. The script writes a blank entry for those.
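For papers without an abstract, the updated script writes a single space, which is easy to miss in the output file. If you would rather see an explicit marker, something like the sketch below could be used. format_entry and the NO_ABSTRACT text are hypothetical names chosen for illustration, not part of the script.

```ruby
# Hypothetical placeholder text for papers without an abstract
NO_ABSTRACT = "(no abstract available)"

# Hypothetical helper: build one output record in the same
# "pmid:<id>" / abstract / blank line layout the script writes.
def format_entry(pmid, abstract)
  text = abstract.to_s.strip
  # nil, "" and whitespace-only abstracts all get the placeholder
  text = NO_ABSTRACT if text.empty?
  "pmid:#{pmid}\n#{text}\n\n"
end
```

Inside the loop you would then replace the three file.puts lines with file.write format_entry(pmid, abstract).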
Hope it helps : )