{"id":168,"date":"2021-11-25T23:00:00","date_gmt":"2021-11-26T03:00:00","guid":{"rendered":"https:\/\/gerardomiranda.dev\/blog\/?p=168"},"modified":"2024-04-05T01:06:12","modified_gmt":"2024-04-05T05:06:12","slug":"how-to-manage-large-amounts-of-data-in-ruby-on-rails","status":"publish","type":"post","link":"https:\/\/gerardomiranda.dev\/blog\/how-to-manage-large-amounts-of-data-in-ruby-on-rails\/","title":{"rendered":"How to Manage Large Amounts of Data in Ruby on Rails"},"content":{"rendered":"\n<p>The other day I got a task for which I needed to generate 2,000,000 records in the database. The author of the ticket was kind enough to write down a script to generate the data. I didn&#8217;t pay attention to it and just leave it running in a CLI while I was completing other tasks. These other tasks took me the rest of the day to complete and when I went back to the script I realized it wasn&#8217;t even half completed.<\/p>\n\n\n\n<p>Obviously the script was not efficient at all, so I took a second look at it. First thing I noticed is that it was generating instances one by one, and that it was inevitable because of the nature of the records we needed to generate. The actual cause and the one that was avoidable was HOW it was saving the records, using ActiveRecord.save method.<\/p>\n\n\n\n<p>I know first hand that ActiveRecord save method is not efficient when working with big chunks of data, but I didn&#8217;t know how slow it was. So after changing the script to use the  <a href=\"https:\/\/rubygems.org\/gems\/activerecord-import\" target=\"_blank\" rel=\"noreferrer noopener\">activerecord-import<\/a> library I was curious about how big of a difference it actually makes.<\/p>\n\n\n\n<div style=\"height:40px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h4 class=\"wp-block-heading\">Preparing the tests<\/h4>\n\n\n\n<p>First we need a model, we will keep it very simple because we don&#8217;t need to overcomplicate things.<\/p>\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-1\" data-shcb-language-name=\"Ruby\" data-shcb-language-slug=\"ruby\"><span><code class=\"hljs language-ruby\">create_table <span class=\"hljs-symbol\">:dummy_models<\/span> <span class=\"hljs-keyword\">do<\/span> <span class=\"hljs-params\">|t|<\/span>\n   t.string <span class=\"hljs-symbol\">:dummy_text1<\/span>\n   t.string <span class=\"hljs-symbol\">:dummy_text2<\/span>\n   t.string <span class=\"hljs-symbol\">:dummy_text3<\/span>\n  \n   t.timestamps\n <span class=\"hljs-keyword\">end<\/span><\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-1\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">Ruby<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">ruby<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<p>Next we are going to need the method to generate the records. For this we will use the <a href=\"https:\/\/rubygems.org\/gems\/faker\/versions\/1.6.6\" target=\"_blank\" rel=\"noreferrer noopener\">Faker<\/a> gem to generate the text. To mimic the original problem we are going to create the record one by one before saving them.<\/p>\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-2\" data-shcb-language-name=\"Ruby\" data-shcb-language-slug=\"ruby\"><span><code class=\"hljs language-ruby\"><span class=\"hljs-function\"><span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title\">self<\/span>.<span class=\"hljs-title\">generate_record<\/span><\/span>\n    DummyModel.new(<span class=\"hljs-symbol\">dummy_text1:<\/span> Faker::Lorem.word, <span class=\"hljs-symbol\">dummy_text2:<\/span> Faker::Lorem.word, <span class=\"hljs-symbol\">dummy_text3:<\/span> Faker::Lorem.word)\n<span class=\"hljs-keyword\">end<\/span><\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-2\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">Ruby<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">ruby<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<p>Now let&#8217;s start with the method to compare, first using ActiveRecord.save<\/p>\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-3\" data-shcb-language-name=\"Ruby\" data-shcb-language-slug=\"ruby\"><span><code class=\"hljs language-ruby\"><span class=\"hljs-function\"><span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title\">self<\/span>.<span class=\"hljs-title\">with_active_record_create<\/span><span class=\"hljs-params\">(size)<\/span><\/span>\n    (<span class=\"hljs-number\">1<\/span>..size).each <span class=\"hljs-keyword\">do<\/span> <span class=\"hljs-params\">|i|<\/span>\n        record = generate_record\n        record.save!\n    <span class=\"hljs-keyword\">end<\/span> \n<span class=\"hljs-keyword\">end<\/span><\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-3\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">Ruby<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">ruby<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<p>We continue with activerecord-import<\/p>\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-4\" data-shcb-language-name=\"Ruby\" data-shcb-language-slug=\"ruby\"><span><code class=\"hljs language-ruby\"><span class=\"hljs-function\"><span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title\">self<\/span>.<span class=\"hljs-title\">with_active_record_import<\/span><span class=\"hljs-params\">(size)<\/span><\/span>\n    records = &#91;]\n    (<span class=\"hljs-number\">1<\/span>..size).each <span class=\"hljs-keyword\">do<\/span> <span class=\"hljs-params\">|i|<\/span>\n        records &lt;&lt; generate_record\n    <span class=\"hljs-keyword\">end<\/span>\n    DummyModel.import(records)\n<span class=\"hljs-keyword\">end<\/span><\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-4\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">Ruby<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">ruby<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<p>We can go further, and use a couple of options that allow us to skip validations and additionally pass the columns directly without having to pass an ActiveRecord object. For this last option we are going to create the ActiveRecord object first to see if the difference in performance justify it.<\/p>\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-5\" data-shcb-language-name=\"Ruby\" data-shcb-language-slug=\"ruby\"><span><code class=\"hljs language-ruby\"><span class=\"hljs-function\"><span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title\">self<\/span>.<span class=\"hljs-title\">with_active_record_import_without_validations<\/span><span class=\"hljs-params\">(size)<\/span><\/span>\n    records = &#91;]\n    (<span class=\"hljs-number\">1<\/span>..size).each <span class=\"hljs-keyword\">do<\/span> <span class=\"hljs-params\">|i|<\/span>\n        records &lt;&lt; generate_record\n    <span class=\"hljs-keyword\">end<\/span>\n    DummyModel.import(records, <span class=\"hljs-symbol\">validate:<\/span> <span class=\"hljs-literal\">false<\/span>)\n<span class=\"hljs-keyword\">end<\/span>\n    \n<span class=\"hljs-function\"><span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title\">self<\/span>.<span class=\"hljs-title\">with_active_record_import_columns_without_validations<\/span><span class=\"hljs-params\">(size)<\/span><\/span>\n    values = &#91;]\n    columns = &#91;<span class=\"hljs-string\">'dummy_text1'<\/span>, <span class=\"hljs-string\">'dummy_text2'<\/span>, <span class=\"hljs-string\">'dummy_text3'<\/span>]\n    (<span class=\"hljs-number\">1<\/span>..size).each <span class=\"hljs-keyword\">do<\/span> <span class=\"hljs-params\">|i|<\/span>\n        record = generate_record\n        values &lt;&lt; &#91;record.dummy_text1, record.dummy_text2, record.dummy_text3]\n    <span class=\"hljs-keyword\">end<\/span>\n    DummyModel.import(columns, values, <span class=\"hljs-symbol\">validate:<\/span> <span class=\"hljs-literal\">false<\/span>)\n<span class=\"hljs-keyword\">end<\/span><\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-5\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">Ruby<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">ruby<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<div style=\"height:40px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h4 class=\"wp-block-heading\">Comparing running times<\/h4>\n\n\n\n<p>And that&#8217;s it , the only thing left is to use benchmarking to see the differences in running time. Let&#8217;s start with just 1,000 records.<\/p>\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-6\" data-shcb-language-name=\"CSS\" data-shcb-language-slug=\"css\"><span><code class=\"hljs language-css\"><span class=\"hljs-selector-tag\">With<\/span> <span class=\"hljs-selector-tag\">ActiveRecord<\/span> <span class=\"hljs-selector-tag\">save<\/span>\n  2<span class=\"hljs-selector-class\">.444752<\/span>   0<span class=\"hljs-selector-class\">.297301<\/span>   2<span class=\"hljs-selector-class\">.742053<\/span> (  5<span class=\"hljs-selector-class\">.692380<\/span>)\n<span class=\"hljs-selector-tag\">With<\/span> <span class=\"hljs-selector-tag\">activerecord-import<\/span>\n  0<span class=\"hljs-selector-class\">.260530<\/span>   0<span class=\"hljs-selector-class\">.028553<\/span>   0<span class=\"hljs-selector-class\">.289083<\/span> (  0<span class=\"hljs-selector-class\">.325896<\/span>)\n<span class=\"hljs-selector-tag\">With<\/span> <span class=\"hljs-selector-tag\">activerecord-import<\/span> <span class=\"hljs-selector-tag\">without<\/span> <span class=\"hljs-selector-tag\">validations<\/span>\n  0<span class=\"hljs-selector-class\">.250810<\/span>   0<span class=\"hljs-selector-class\">.024476<\/span>   0<span class=\"hljs-selector-class\">.275286<\/span> (  0<span class=\"hljs-selector-class\">.289907<\/span>)\n<span class=\"hljs-selector-tag\">With<\/span> <span class=\"hljs-selector-tag\">activerecord-import<\/span> <span class=\"hljs-selector-tag\">using<\/span> <span class=\"hljs-selector-tag\">columns<\/span> <span class=\"hljs-selector-tag\">without<\/span> <span class=\"hljs-selector-tag\">validations<\/span>\n  0<span class=\"hljs-selector-class\">.223519<\/span>   0<span class=\"hljs-selector-class\">.028434<\/span>   0<span class=\"hljs-selector-class\">.251953<\/span> (  0<span class=\"hljs-selector-class\">.267757<\/span>)<\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-6\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">CSS<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">css<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<p>With only this amount of data, we can see a noticeable difference of more than 5 seconds. But I think the ActiveRecord.save is still usable, since 5.6 seconds is not that bad. <\/p>\n\n\n\n<p> activerecord-import <\/p>\n\n\n\n<p>Between the methods of activerecord-import there is almost no difference. Now we will try with 10,000 records.<\/p>\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-7\" data-shcb-language-name=\"CSS\" data-shcb-language-slug=\"css\"><span><code class=\"hljs language-css\"><span class=\"hljs-selector-tag\">With<\/span> <span class=\"hljs-selector-tag\">ActiveRecord<\/span> <span class=\"hljs-selector-tag\">save<\/span>\n 19<span class=\"hljs-selector-class\">.409276<\/span>   1<span class=\"hljs-selector-class\">.963945<\/span>  21<span class=\"hljs-selector-class\">.373221<\/span> ( 47<span class=\"hljs-selector-class\">.560994<\/span>)\n<span class=\"hljs-selector-tag\">With<\/span> <span class=\"hljs-selector-tag\">activerecord-import<\/span>\n  2<span class=\"hljs-selector-class\">.865762<\/span>   0<span class=\"hljs-selector-class\">.203885<\/span>   3<span class=\"hljs-selector-class\">.069647<\/span> (  3<span class=\"hljs-selector-class\">.313518<\/span>)\n<span class=\"hljs-selector-tag\">With<\/span> <span class=\"hljs-selector-tag\">activerecord-import<\/span> <span class=\"hljs-selector-tag\">without<\/span> <span class=\"hljs-selector-tag\">validations<\/span>\n  2<span class=\"hljs-selector-class\">.657060<\/span>   0<span class=\"hljs-selector-class\">.195326<\/span>   2<span class=\"hljs-selector-class\">.852386<\/span> (  2<span class=\"hljs-selector-class\">.983243<\/span>)\n<span class=\"hljs-selector-tag\">With<\/span> <span class=\"hljs-selector-tag\">activerecord-import<\/span> <span class=\"hljs-selector-tag\">using<\/span> <span class=\"hljs-selector-tag\">columns<\/span> <span class=\"hljs-selector-tag\">without<\/span> <span class=\"hljs-selector-tag\">validations<\/span>\n  2<span class=\"hljs-selector-class\">.423606<\/span>   0<span class=\"hljs-selector-class\">.217493<\/span>   2<span class=\"hljs-selector-class\">.641099<\/span> (  2<span class=\"hljs-selector-class\">.779699<\/span>)<\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-7\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">CSS<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">css<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<p>With this execution time I think ActiveRecord is not an option anymore when dealing with more than 10,000 records. But the difference between the methods of activerecord-import is still not significative.<\/p>\n\n\n\n<p>Let&#8217;s put the computer to work and generate 100,000 records and see what happens.<\/p>\n\n\n<pre class=\"wp-block-code\" aria-describedby=\"shcb-language-8\" data-shcb-language-name=\"CSS\" data-shcb-language-slug=\"css\"><span><code class=\"hljs language-css\"><span class=\"hljs-selector-tag\">With<\/span> <span class=\"hljs-selector-tag\">ActiveRecord<\/span> <span class=\"hljs-selector-tag\">save<\/span>\n207<span class=\"hljs-selector-class\">.721426<\/span>  20<span class=\"hljs-selector-class\">.519692<\/span> 228<span class=\"hljs-selector-class\">.241118<\/span> (503<span class=\"hljs-selector-class\">.432937<\/span>)\n<span class=\"hljs-selector-tag\">With<\/span> <span class=\"hljs-selector-tag\">activerecord-import<\/span>\n 27<span class=\"hljs-selector-class\">.889573<\/span>   2<span class=\"hljs-selector-class\">.115005<\/span>  30<span class=\"hljs-selector-class\">.004578<\/span> ( 32<span class=\"hljs-selector-class\">.025789<\/span>)\n<span class=\"hljs-selector-tag\">With<\/span> <span class=\"hljs-selector-tag\">activerecord-import<\/span> <span class=\"hljs-selector-tag\">without<\/span> <span class=\"hljs-selector-tag\">validations<\/span>\n 27<span class=\"hljs-selector-class\">.754645<\/span>   2<span class=\"hljs-selector-class\">.084204<\/span>  29<span class=\"hljs-selector-class\">.838849<\/span> ( 31<span class=\"hljs-selector-class\">.793397<\/span>)\n<span class=\"hljs-selector-tag\">With<\/span> <span class=\"hljs-selector-tag\">activerecord-import<\/span> <span class=\"hljs-selector-tag\">using<\/span> <span class=\"hljs-selector-tag\">columns<\/span> <span class=\"hljs-selector-tag\">without<\/span> <span class=\"hljs-selector-tag\">validations<\/span>\n 24<span class=\"hljs-selector-class\">.807959<\/span>   1<span class=\"hljs-selector-class\">.963377<\/span>  26<span class=\"hljs-selector-class\">.771336<\/span> ( 28<span class=\"hljs-selector-class\">.741473<\/span>)<\/code><\/span><small class=\"shcb-language\" id=\"shcb-language-8\"><span class=\"shcb-language__label\">Code language:<\/span> <span class=\"shcb-language__name\">CSS<\/span> <span class=\"shcb-language__paren\">(<\/span><span class=\"shcb-language__slug\">css<\/span><span class=\"shcb-language__paren\">)<\/span><\/small><\/pre>\n\n\n<p>Definitively ActiveRecord.save should not even be considered for 100,000+ records. Now I understand why the script didn&#8217;t finish executing after hours. Regarding the activerecord-import methods, using the columns instead of the ActiveRecord object was at least 2 seconds faster than the other ones. This is nothing outstanding but it is worth considering especially if you are working with more than that.<\/p>\n\n\n\n<p>The comparisons showed me that when I&#8217;m developing with Ruby on Rails I have to seriously take in consideration the amount of data I&#8217;m gonna work with. Lest one of those scripts end up in production one day.<\/p>\n\n\n\n<p>You can find the complete code used for this post <a href=\"https:\/\/github.com\/gerardo-m\/rails_test_lab\/blob\/master\/lib\/generating_data.rb\" target=\"_blank\" rel=\"noreferrer noopener\">here<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I know first hand that ActiveRecord save method is not efficient when working with big chunks of data, but I didn&#8217;t know how slow it was.<\/p>\n","protected":false},"author":1,"featured_media":165,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[24],"tags":[92,94,96],"class_list":["post-168","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-learning","tag-data-en","tag-ruby-on-rails-en","tag-web-en"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/gerardomiranda.dev\/blog\/wp-json\/wp\/v2\/posts\/168","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/gerardomiranda.dev\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/gerardomiranda.dev\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/gerardomiranda.dev\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/gerardomiranda.dev\/blog\/wp-json\/wp\/v2\/comments?post=168"}],"version-history":[{"count":12,"href":"https:\/\/gerardomiranda.dev\/blog\/wp-json\/wp\/v2\/posts\/168\/revisions"}],"predecessor-version":[{"id":240,"href":"https:\/\/gerardomiranda.dev\/blog\/wp-json\/wp\/v2\/posts\/168\/revisions\/240"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/gerardomiranda.dev\/blog\/wp-json\/wp\/v2\/media\/165"}],"wp:attachment":[{"href":"https:\/\/gerardomiranda.dev\/blog\/wp-json\/wp\/v2\/media?parent=168"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/gerardomiranda.dev\/blog\/wp-json\/wp\/v2\/categories?post=168"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/gerardomiranda.dev\/blog\/wp-json\/wp\/v2\/tags?post=168"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}