2

I have inherited a ruby app which connects to a mongodb. I have no idea about mongo or ruby unfortunately so im on a rapid googling and learning curve.

The app stores placenames as well as their lat longs, alternative name, peoples memories, and comments. It also counts how many times a place has been discussed.

The following rake file when run, grabs all the locations from the mongodb and creates a csv,spitting out one line for each location with the user, number of times mentioned, the memories etc etc.

task :data_dump => :environment do 
 File.open("results.csv","w") do |file|
  Location.all.each_with_index do |l,index|
    puts "done #{index}"
     file.puts [l.id, l.classification_count, l.position, l.created_at, l.classifications.collect{|c| c.text}, l.classifications.collect{|c| c.alternative_names }.flatten.join(";"), l.classifications.collect{|c| c.comment }.flatten.join(";"), l.memories.collect{|m| m.text}.flatten.join(";") ].join(",")
   end
  end
 end

It works great and generates a CSV I can then pull into other programmes. The problem is that the content contains plain text fields which breaks the validity of the csv with line breaks etc and I want to make sure all plain text fields are properly enclosed within the CSV.

So if I can understand the above query better, I can then input the correct field enclosures to ensure the csv is valid when loaded into GIS software.

Also the above takes about an hour 45 to run on my laptop so I want to find out if it is the most efficient way to do the query. To date we have around 300000 placename listed and this is going to rise to a few million so will only get slower.

Arvind
  • 2,671
  • 1
  • 18
  • 32
Paul M
  • 3,937
  • 9
  • 45
  • 53
  • Well your actual problem seems to have nothing to do with the query at all. Since you say "no idea", the the enlightenment here is that you don't do things like "modifiy" the data in any way with a standard MongoDB query. And that is all that is happening here. What your question "really" seems to be here is "how to remove the newlines from a field with ruby". Which is really just a matter of stripping them out with a regular expression. So maybe some reasearch or re-wording of question could be applied here, now you have that piece of knowledge. – Blakes Seven Aug 20 '15 at 14:52
  • Thanks, maybe my question is not clear. Are you saying there is some ruby code elsewhere which creates the csv not the code above? As far as I know, the above rake file when run, connects to mongodb and spits out a csv file of all the placenames. The csv file does not have plain text fields wrapped in " " so I want to modify the above query to properly enclose those fields, so its a valid csv. The csv is loaded into a GIS platform so no post processing is done with ruby. So not sure where regular expressions would help. – Paul M Aug 20 '15 at 14:58
  • What I am saying here is that the "query" to MongoDB and operation here does nothing consequential at all. In fact there is no query, other than "fetch everything". So this is just dumping the whole content of the collection. It is just that some of or one of the fields returned contains newline characters you now want to remove. – Blakes Seven Aug 20 '15 at 15:17
  • You also should maybe to listening to advice rather than spouting opinion when you have just taken a job using a language and storage platform you confess to know nothing about. – Blakes Seven Aug 20 '15 at 15:19
  • Thanks for the advice, I wasnt trying to sprout opinion or offend I just didnt fully understand your comment. Im trying to help organisation that has no developer/money to fix project app they have been left with. I want to learn thats why Im here, but rubys different to what I am used too. The query contains ";" and "," which appear in the CSV so I thought I would be able to add double quotes to each field via the query so we could keep the newlines, text formatting. I just wanted an idea of what the query said so I could add them. Sorry if I offended. – Paul M Aug 20 '15 at 15:44

1 Answers1

4

You can generate the CSV with Ruby's 'csv' module:

require 'csv'

task :data_dump => :environment do 
  CSV.open("results.csv","w") do |csv|
    Location.all.each_with_index do |l,index|
      puts "done #{index}"
      csv << [l.id, l.classification_count, ...]
    end
  end
end

This will ensure that the CSV is generated properly. As for the speed, I've only used ActiveRecord with relational databases, but I imaging the problem is the same - The 1 + N Problem. Basically it says that each time you are using l.classifications.collect or l.memories.collect it needs to do a query to get all the classifications/memories from the database. The solution is eager loading:

require 'csv'

task :data_dump => :environment do 
  CSV.open("results.csv","w") do |csv|
    Location.all.includes(:classifications, :memories).each_with_index do |l,index|
      puts "done #{index}"
      csv << [l.id, l.classification_count, l.position, l.created_at, l.classifications.collect{|c| c.text}, l.classifications.collect{|c| c.alternative_names }.flatten.join(";"), l.classifications.collect{|c| c.comment }.flatten.join(";"), l.memories.collect{|m| m.text}.flatten.join(";") ]
    end
  end
end

(and you might need to do so for alternative_names - I don't remember the syntax for nested eager loading). This will make a single query to the database, which should be much faster.

Idan Arye
  • 12,402
  • 5
  • 49
  • 68
  • OK Thanks I will give this a go. Just need to work out how to run a standalone ruby script now lol. – Paul M Aug 20 '15 at 16:04
  • I swopped out my rake file for your first csv changes just to see how it went and it fails with error 'undefined method `map' for # /usr/lib/ruby/1.9.1/csv.rb:1729:in `<<' I have been doing some googling and maybe I need to put my results into an array first? – Paul M Aug 21 '15 at 08:23
  • ahh sorted it, as it creates a csv anyway I didnt need the '.join(",")' on the end of the query after the array ended ] (at least I think thats the array end!) Thank you so much, will try the quicker query later too. – Paul M Aug 21 '15 at 10:13
  • Yea, forgot to remove the `join(",")`. I'll edit my answer – Idan Arye Aug 21 '15 at 23:34