How to Manage Large Amounts of Data in Ruby on Rails

The other day I got a task for which I needed to generate 2,000,000 records in the database. The author of the ticket was kind enough to include a script to generate the data. I didn’t pay much attention to it and just left it running in a CLI while I completed other tasks. Those tasks took me the rest of the day, and when I went back to the script I realized it wasn’t even half done.

Obviously the script was not efficient at all, so I took a second look at it. The first thing I noticed was that it was generating instances one by one, which was unavoidable given the nature of the records we needed to generate. The real, avoidable cause was HOW it was saving the records: with the ActiveRecord save method.

I know firsthand that ActiveRecord’s save method is not efficient when working with big chunks of data, but I didn’t know just how slow it was. So after changing the script to use the activerecord-import library, I was curious how big a difference it actually makes.

Preparing the tests

First we need a model. We will keep it very simple because we don’t need to overcomplicate things.

create_table :dummy_models do |t|
  t.string :dummy_text1
  t.string :dummy_text2
  t.string :dummy_text3

  t.timestamps
end

Next we are going to need a method to generate the records. For this we will use the Faker gem to generate the text. To mimic the original problem, we are going to create the records one by one before saving them.

def self.generate_record
  DummyModel.new(dummy_text1: Faker::Lorem.word, dummy_text2: Faker::Lorem.word, dummy_text3: Faker::Lorem.word)
end

Now let’s write the methods to compare, starting with ActiveRecord’s save:

def self.with_active_record_create(size)
  size.times do
    record = generate_record
    record.save!
  end
end

We continue with activerecord-import:

def self.with_active_record_import(size)
  records = []
  size.times do
    records << generate_record
  end
  DummyModel.import(records)
end

We can go further and use a couple of options: one that skips validations, and one that passes the columns and values directly instead of ActiveRecord objects. For the latter we are still going to create the ActiveRecord object first, to see whether the difference in performance justifies it.

def self.with_active_record_import_without_validations(size)
  records = []
  size.times do
    records << generate_record
  end
  DummyModel.import(records, validate: false)
end

def self.with_active_record_import_columns_without_validations(size)
  values = []
  columns = ['dummy_text1', 'dummy_text2', 'dummy_text3']
  size.times do
    record = generate_record
    values << [record.dummy_text1, record.dummy_text2, record.dummy_text3]
  end
  DummyModel.import(columns, values, validate: false)
end

Comparing running times

And that’s it; the only thing left is to benchmark the methods to see the differences in running time. Let’s start with just 1,000 records.
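The harness is a minimal sketch using Ruby’s built-in Benchmark module. The `sleep` stand-ins are only there so the snippet runs outside a Rails app; in the real script each block calls the corresponding class method instead, as the comments indicate.

```ruby
require "benchmark"

size = 1_000

# Stand-in workloads so this sketch runs anywhere; in the real script each
# block calls the matching method, e.g. DummyModel.with_active_record_create(size).
Benchmark.bm do |bm|
  bm.report("With ActiveRecord save")                                     { sleep 0.01 }
  bm.report("With activerecord-import")                                   { sleep 0.01 }
  bm.report("With activerecord-import without validations")               { sleep 0.01 }
  bm.report("With activerecord-import using columns without validations") { sleep 0.01 }
end
```

Each report line shows user, system, and total CPU time, with wall-clock time in parentheses; those are the four columns in the results below.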

With ActiveRecord save
  2.444752   0.297301   2.742053 (  5.692380)
With activerecord-import
  0.260530   0.028553   0.289083 (  0.325896)
With activerecord-import without validations
  0.250810   0.024476   0.275286 (  0.289907)
With activerecord-import using columns without validations
  0.223519   0.028434   0.251953 (  0.267757)

With only this amount of data we can already see a noticeable difference of more than 5 seconds. But I think ActiveRecord’s save is still usable here, since 5.6 seconds is not that bad.

Between the activerecord-import methods there is almost no difference. Now let’s try with 10,000 records.

With ActiveRecord save
 19.409276   1.963945  21.373221 ( 47.560994)
With activerecord-import
  2.865762   0.203885   3.069647 (  3.313518)
With activerecord-import without validations
  2.657060   0.195326   2.852386 (  2.983243)
With activerecord-import using columns without validations
  2.423606   0.217493   2.641099 (  2.779699)

With this execution time, I think ActiveRecord’s save is no longer an option when dealing with more than 10,000 records. But the difference between the activerecord-import methods is still not significant.

Let’s put the computer to work and generate 100,000 records and see what happens.

With ActiveRecord save
207.721426  20.519692 228.241118 (503.432937)
With activerecord-import
 27.889573   2.115005  30.004578 ( 32.025789)
With activerecord-import without validations
 27.754645   2.084204  29.838849 ( 31.793397)
With activerecord-import using columns without validations
 24.807959   1.963377  26.771336 ( 28.741473)

Definitely, ActiveRecord’s save should not even be considered for 100,000+ records. Now I understand why the script didn’t finish executing after hours. Among the activerecord-import methods, using columns instead of ActiveRecord objects was at least 2 seconds faster than the others. That is nothing outstanding, but it is worth considering, especially if you are working with more records than that.
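For a dataset like the original 2,000,000 records, one more idea worth sketching (an assumption on my part, not something benchmarked above): importing in slices keeps the records array from growing unbounded in memory. The gem also accepts a batch_size option on import for the same purpose. Here I use plain each_slice with hash stand-ins so the sketch runs outside a Rails app; the commented line marks where the real import call would go.

```ruby
BATCH_SIZE = 10_000

# Hash stand-ins for the DummyModel instances from generate_record,
# so this sketch runs anywhere.
records = Array.new(25_000) { |i| { dummy_text1: "word#{i}" } }

batches = 0
records.each_slice(BATCH_SIZE) do |batch|
  # DummyModel.import(batch, validate: false)   # real call in the Rails app
  batches += 1
end

puts batches # 25,000 records in slices of 10,000 => 3
```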

These comparisons showed me that when developing with Ruby on Rails I have to seriously take into consideration the amount of data I’m going to work with, lest one of those scripts end up in production one day.

You can find the complete code used for this post here.
