Chasing performance degradations with Ruby benchmarks and memory_profiler

February 11, 2018

While working on some of the performance improvements for version 1.0.0 of Rambling Trie late in 2016 and early in 2017, I upgraded from Ruby 2.3.3 to Ruby 2.4.0 imagining that the newer version would help me on my mission to make all operations execute a bit faster. To my surprise, I ran into a significant performance degradation after upgrading - specifically, there was a ~25% increase in the time it took to execute a script using a trie with a big-ish word dictionary. This intrigued me enough that I decided to take a closer look.

Benchmarking with Ruby

After some investigation with the help of Ruby's own Benchmark module, I realized that while most operations were a bit slower, the problem was most visible during intensive operations like creating a brand new trie or compressing an existing one. For example, a performance benchmark of Rambling::Trie.create went from 8.97 seconds with Ruby 2.3.3 to 11.19 seconds with Ruby 2.4.0.
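As a sketch of that kind of timing harness (with a toy word list standing in for the real dictionary file, since Rambling::Trie and the dictionary path are specific to my setup):

```ruby
require 'benchmark'

# Toy stand-in for the real work being timed; in my script this
# was Rambling::Trie.create with a big-ish dictionary file.
def build_index words
  words.each_with_object({}) { |word, index| index[word] = true }
end

words = %w(apple apply banana band bandana)

# Benchmark.bm prints user, system and real (wall-clock) times
# for each labeled block.
Benchmark.bm 12 do |bm|
  bm.report 'create' do
    10_000.times { build_index words }
  end
end
```

Running the same script under each Ruby version and comparing the real column is enough to spot a regression like this one.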

Bringing out the big guns

So, I pulled in the memory_profiler gem to do some memory allocation monitoring:

require 'memory_profiler'

# Profile trie creation
result = MemoryProfiler.report allow_files: 'lib/rambling/trie', ignore_files: 'lib/rambling/trie/tasks' do
  1.upto(5) do |i|
    Rambling::Trie.create '/path/to/dictionary'
  end
end

result.pretty_print to_file: "reports/#{version}/#{date}/creation"

# Profile trie compression
tries = []
1.upto(5) do |i|
  tries << Rambling::Trie.create '/path/to/dictionary'
end

result = MemoryProfiler.report allow_files: 'lib/rambling/trie', ignore_files: 'lib/rambling/trie/tasks' do
  tries.each { |t| t.compress! }
end

result.pretty_print to_file: "reports/#{version}/#{date}/compression"

# ... etc.

The report output gives you the amount of memory (allocated and retained) and the number of objects (allocated and retained), all broken down by gem, by file, by location, and by class. It also lists which strings were allocated and retained, and how many times each one was.

The results were really illuminating. After running a profile for all operations, three things were immediately apparent:

  1. The memory footprint in 2.4.0 is a bit better, thanks to some sweet Hash optimizations that came with that version (#win!)
  2. Both the creation and compression operations create a lot of new object instances, unlike the rest of the operations, which just run through a code path with existing instances
  3. The rest of the memory footprint looked exactly the same, so the issue had to be elsewhere

Forwardable is the culprit

After further benchmarking and some tweaking of classes and methods, I realized that forwardable was the culprit. Code like this was causing the slowdown:

module Rambling
  module Trie
    # A representation of a node in the Trie data structure.
    class Node
      # ...

      extend ::Forwardable

      delegate [
        :[],
        :[]=,
        :delete,
        :has_key?
      ] => :children_tree

      # ...
    end
  end
end

Specifically, the benchmark-ips gem helped me analyze this issue further. I ran this benchmark:

require 'benchmark/ips'
require 'forwardable'

class TestDelegate
  extend ::Forwardable
  delegate [:[]] => :hash

  attr_reader :hash

  def initialize hash
    @hash = hash
  end
end

class TestDirect
  attr_reader :hash

  def initialize hash
    @hash = hash
  end

  def [] key
    hash[key]
  end
end

Benchmark.ips do |bm|
  hash = { h: 'hello' }

  delegate = TestDelegate.new hash
  bm.report 'delegate access' do
    delegate[:h]
  end

  direct = TestDirect.new hash
  bm.report 'direct access' do
    direct[:h]
  end

  bm.compare!
end

… and got these results:

Warming up --------------------------------------
     delegate access    62.485k i/100ms
       direct access   223.511k i/100ms
Calculating -------------------------------------
     delegate access    810.085k (± 6.2%) i/s -      4.062M in   5.035670s
       direct access      6.531M (±11.1%) i/s -     32.409M in   5.040199s

Comparison:
       direct access:  6531016.6 i/s
     delegate access:   810084.8 i/s - 8.06x  slower

So… access via a delegated method in Ruby 2.4.0 is about 8 TIMES SLOWER!!! than directly accessing the underlying field 😨. That needed to be addressed, but what do I do in the meantime? Define all those methods by hand?! NO WAY

Creating an interim solution

I reached out to Yuki Nishijima and explained the issue. He promptly responded saying he was going to take a look. In the meantime though, I wanted to release a new version of the gem with all the performance improvements I had been working on but without the degradation from 2.4.0, so I set out to create my own version of Forwardable that preserved 2.3.3’s speed and memory in 2.4.0. Here’s what I ended up with (ahem, after several iterations):

# rambling/trie/forwardable
module Rambling
  module Trie
    module Forwardable
      # Defines each delegated method as a plain instance method that
      # sends the call along to the target, avoiding the per-call
      # allocations that made 2.4.0's Forwardable slow.
      def delegate methods_to_target
        methods_to_target.each do |methods, target|
          methods.each do |method|
            define_method method do |*args|
              send(target).send method, *args
            end
          end
        end
      end
    end
  end
end

Then, by including these lines in the benchmark-ips script:

# other requires
require 'rambling/trie/forwardable'

# other test classes

class TestCustomForwardable
  extend Rambling::Trie::Forwardable
  delegate [:[]] => :hash

  attr_reader :hash

  def initialize hash
    @hash = hash
  end
end

Benchmark.ips do |bm|
  hash = { h: 'hello' }

  # other bm.report calls

  custom = TestCustomForwardable.new hash
  bm.report 'custom delegate access' do
    custom[:h]
  end

  bm.compare!
end

… I got these results (the custom delegate access is the benchmark with Rambling::Trie::Forwardable):

Warming up --------------------------------------
     delegate access    57.977k i/100ms
custom delegate access
                       132.492k i/100ms
       direct access   222.442k i/100ms
Calculating -------------------------------------
     delegate access    835.005k (± 8.0%) i/s -      4.174M in   5.034664s
custom delegate access
                          2.301M (± 6.5%) i/s -     11.527M in   5.031194s
       direct access      6.642M (± 9.9%) i/s -     32.921M in   5.013763s

Comparison:
       direct access:  6642032.3 i/s
custom delegate access:  2301340.9 i/s - 2.89x  slower
     delegate access:   835004.6 i/s - 7.95x  slower

And BAM! - back to (basically) the same numbers, improvements that then carried over nicely into 1.0.0.

Contributing back to the community

I submitted an issue to Ruby’s Redmine and soon after got an answer back from Yuki: he pointed out that an UnboundMethod instance was being created on every delegated call. We also noticed that Nobuyoshi Nakada had fixed the issue on trunk revisions #57255, #57256 & #57257.
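To illustrate the kind of allocation he was describing: Module#instance_method returns a freshly allocated UnboundMethod on every call, and the pre-fix Forwardable was effectively paying that cost on every delegated call. (This snippet only illustrates the allocation pattern; it is not Forwardable's actual internals.)

```ruby
# Every call to instance_method allocates a new UnboundMethod,
# even when both wrap the very same underlying method.
first = Hash.instance_method :[]
second = Hash.instance_method :[]

puts first.equal? second  # object identity: two distinct objects
puts first == second      # but the same underlying method definition
```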

Finally, these revisions were merged into the ruby_2_4 branch in revision #57913 in February 2017. Ruby 2.4.1 came out soon after, and running the same benchmark on 2.4.1 gave me these results:

Warming up --------------------------------------
     delegate access   129.649k i/100ms
custom delegate access
                       126.123k i/100ms
       direct access   178.758k i/100ms
Calculating -------------------------------------
     delegate access      2.824M (± 9.1%) i/s -     14.002M in   5.007163s
custom delegate access
                          2.288M (± 6.7%) i/s -     11.477M in   5.039665s
       direct access      6.273M (±10.0%) i/s -     31.104M in   5.013482s

Comparison:
       direct access:  6272727.2 i/s
     delegate access:  2823600.3 i/s - 2.22x  slower
custom delegate access:  2288089.7 i/s - 2.74x  slower

… which means I could go back to using Ruby's own Forwardable! By May 2017 I was able to do so and delete my own implementation. That's one of my favorite parts of writing code: deleting code.


Update on 2018-02-15

Ruby 2.5.0’s Forwardable performs a bit slower now, but it’s still faster than my own implementation:

Warming up --------------------------------------
     delegate access   140.154k i/100ms
custom delegate access
                       122.634k i/100ms
       direct access   215.791k i/100ms
Calculating -------------------------------------
     delegate access      2.583M (± 7.2%) i/s -     12.894M in   5.019182s
custom delegate access
                          2.108M (± 5.9%) i/s -     10.547M in   5.022423s
       direct access      6.231M (± 9.1%) i/s -     30.858M in   5.003747s

Comparison:
       direct access:  6231242.7 i/s
     delegate access:  2582759.7 i/s - 2.41x  slower
custom delegate access:  2107914.3 i/s - 2.96x  slower

🤔… Looking at this more closely, maybe I should drop the Forwardable module entirely and define the methods directly, to avoid this 2x-slower business. I guess that'll be a topic for a future blog post.
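For the record, hand-defining the delegating methods would look roughly like this (a self-contained sketch modeled on the Node class from earlier; the real class has a lot more going on):

```ruby
# Sketch: delegation written out by hand, one plain method per
# operation, each going straight to the underlying children_tree Hash.
class Node
  def initialize
    @children_tree = {}
  end

  def [] key
    children_tree[key]
  end

  def []= key, value
    children_tree[key] = value
  end

  def delete key
    children_tree.delete key
  end

  def has_key? key
    children_tree.has_key? key
  end

  private

  attr_reader :children_tree
end
```

Tedious to write, but there is no dispatch layer at all between the caller and the Hash.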

