![How to Manage Large Amounts of Data in Ruby on Rails How to Manage Large Amounts of Data in Ruby on Rails](https://gerardomiranda.dev/blog/wp-content/uploads/2021/11/Ruby_On_Rails_Logo-1024x387.png)
How to Manage Large Amounts of Data in Ruby on Rails
The other day I got a task for which I needed to generate 2,000,000 records in the database. The author of the ticket was kind enough to write down a script to generate the data. I didn’t pay much attention to it and just left it running in a CLI while I completed other tasks. Those tasks took me the rest of the day, and when I went back to the script I realized it wasn’t even half done.
Obviously the script was not efficient at all, so I took a second look at it. The first thing I noticed was that it was generating instances one by one, which was inevitable because of the nature of the records we needed to generate. The actual, avoidable cause was HOW it was saving the records: with the ActiveRecord `save` method.
I know first hand that the ActiveRecord `save` method is not efficient when working with big chunks of data, but I didn’t know just how slow it was. So after changing the script to use the activerecord-import library, I was curious about how big a difference it actually makes.
Preparing the tests
First we need a model. We will keep it very simple, because we don’t need to overcomplicate things.
create_table :dummy_models do |t|
  t.string :dummy_text1
  t.string :dummy_text2
  t.string :dummy_text3
  t.timestamps
end
Next we are going to need a method to generate the records. For this we will use the Faker gem to generate the text. To mimic the original problem, we are going to create the records one by one before saving them.
def self.generate_record
  DummyModel.new(dummy_text1: Faker::Lorem.word, dummy_text2: Faker::Lorem.word, dummy_text3: Faker::Lorem.word)
end
Now let’s write the methods to compare, starting with the ActiveRecord `save` method:
def self.with_active_record_create(size)
  size.times do
    record = generate_record
    record.save!
  end
end
We continue with activerecord-import:
def self.with_active_record_import(size)
  records = []
  size.times do
    records << generate_record
  end
  DummyModel.import(records)
end
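To see where the speedup comes from, it helps to look at the shape of the SQL each approach sends. The sketch below is plain Ruby (no database required); the table and column names come from the example above, and the string-building is purely illustrative — activerecord-import builds and escapes its SQL properly, this just shows the one-statement-per-record vs. single-multi-row-statement difference.

```ruby
# Three example rows for the dummy_models table.
rows = [%w[a b c], %w[d e f], %w[g h i]]
cols = 'dummy_text1, dummy_text2, dummy_text3'

# Roughly what `save` does: one INSERT (one round trip) per record.
per_row = rows.map do |r|
  "INSERT INTO dummy_models (#{cols}) VALUES (#{r.map { |v| "'#{v}'" }.join(', ')})"
end

# Roughly what `import` does: every row packed into a single INSERT.
tuples = rows.map { |r| "(#{r.map { |v| "'#{v}'" }.join(', ')})" }.join(', ')
bulk = "INSERT INTO dummy_models (#{cols}) VALUES #{tuples}"

puts per_row.length # statements issued by the save approach: 3
puts bulk           # the single statement issued by the import approach
```

Three rows means three round trips (each with its own transaction overhead) with `save`, versus one with `import` — and that ratio is what blows up at 2,000,000 records.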
We can go further and use a couple of options that allow us to skip validations and, additionally, pass the columns directly without having to pass an ActiveRecord object. For this last option we are going to create the ActiveRecord object first anyway, to see if the difference in performance justifies it.
def self.with_active_record_import_without_validations(size)
  records = []
  size.times do
    records << generate_record
  end
  DummyModel.import(records, validate: false)
end

def self.with_active_record_import_columns_without_validations(size)
  values = []
  columns = %w[dummy_text1 dummy_text2 dummy_text3]
  size.times do
    record = generate_record
    values << [record.dummy_text1, record.dummy_text2, record.dummy_text3]
  end
  DummyModel.import(columns, values, validate: false)
end
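One caveat at the 2,000,000-record scale: building every object in memory before a single `import` call can eat a lot of RAM. activerecord-import has a `batch_size` option for this, and you can get the same effect yourself with `each_slice`. The sketch below is pure Ruby and only demonstrates the slicing; the `import` call it would wrap is shown as a comment because it needs a database.

```ruby
# Splitting 2,500 ids into batches of at most 1,000 with each_slice.
ids = (1..2_500).to_a

batch_sizes = ids.each_slice(1_000).map do |slice|
  # In the real script this is where the batched import would go, e.g.:
  #   DummyModel.import(slice.map { generate_record }, validate: false)
  slice.size
end

puts batch_sizes.inspect # [1000, 1000, 500]
```

This keeps only one batch of objects alive at a time instead of the whole data set.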
Comparing running times
And that’s it; the only thing left is to benchmark the differences in running time. Let’s start with just 1,000 records.
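The timings below follow the format of Ruby’s built-in Benchmark module: user CPU time, system CPU time, their sum, and real elapsed time in parentheses. A minimal harness of that shape looks like this — the lambdas are stand-ins so the sketch runs without a database; in the real script they would be calls like `DummyModel.with_active_record_create(1_000)`.

```ruby
require 'benchmark'

# Stand-in workloads; replace with the DummyModel methods defined above.
one_by_one = -> { 50_000.times { |i| i.to_s } }
batched    = -> { 5_000.times { |i| i.to_s } }

# 40 is the label column width.
Benchmark.bm(40) do |x|
  x.report('With ActiveRecord save')   { one_by_one.call }
  x.report('With activerecord-import') { batched.call }
end
```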
With ActiveRecord save
2.444752 0.297301 2.742053 ( 5.692380)
With activerecord-import
0.260530 0.028553 0.289083 ( 0.325896)
With activerecord-import without validations
0.250810 0.024476 0.275286 ( 0.289907)
With activerecord-import using columns without validations
0.223519 0.028434 0.251953 ( 0.267757)
With only this amount of data we can already see a noticeable difference of more than 5 seconds. But I think ActiveRecord `save` is still usable here, since 5.6 seconds is not that bad.
Between the methods of activerecord-import there is almost no difference. Now we will try with 10,000 records.
With ActiveRecord save
19.409276 1.963945 21.373221 ( 47.560994)
With activerecord-import
2.865762 0.203885 3.069647 ( 3.313518)
With activerecord-import without validations
2.657060 0.195326 2.852386 ( 2.983243)
With activerecord-import using columns without validations
2.423606 0.217493 2.641099 ( 2.779699)
With this execution time, I think ActiveRecord `save` is no longer an option when dealing with more than 10,000 records. But the difference between the activerecord-import methods is still not significant.
Let’s put the computer to work and generate 100,000 records and see what happens.
With ActiveRecord save
207.721426 20.519692 228.241118 (503.432937)
With activerecord-import
27.889573 2.115005 30.004578 ( 32.025789)
With activerecord-import without validations
27.754645 2.084204 29.838849 ( 31.793397)
With activerecord-import using columns without validations
24.807959 1.963377 26.771336 ( 28.741473)
Definitely, ActiveRecord `save` should not even be considered for 100,000+ records. Now I understand why the script didn’t finish executing after hours. Among the activerecord-import methods, passing columns instead of ActiveRecord objects was at least 2 seconds faster than the others. That’s nothing outstanding, but it’s worth considering, especially if you are working with even more data.
These comparisons showed me that when I’m developing with Ruby on Rails I have to seriously take into consideration the amount of data I’m going to work with, lest one of those scripts end up in production one day.
You can find the complete code used for this post here.