Ruby performance: Using benchmarks and memory_profiler to chase performance degradations

February 11, 2018

While working on some of the performance improvements for version 1.0.0 of Rambling Trie late in 2016 and early in 2017, I upgraded from Ruby 2.3.3 to Ruby 2.4.0 imagining that the newer version would help me on my mission to make all operations execute a bit faster. To my surprise, I ran into a significant performance degradation after upgrading - specifically, there was a ~25% increase in the time it took to execute a script using a trie with a big-ish word dictionary. This intrigued me enough that I decided to take a closer look.

Benchmarking with Ruby

After some investigation with the help of Ruby’s own Benchmark, I realized that while most operations were a bit slower, the main problem was a lot more visible during intensive operations like creating a brand new trie or compressing an existing one. You can see how a performance benchmark of Rambling::Trie.create went from 8.97 seconds with Ruby 2.3.3, to 11.19 seconds with Ruby 2.4.0.

Bringing out the big guns

So, I pulled in the memory_profiler gem to do some memory allocation monitoring:

require 'memory_profiler'

# Profile trie creation
result = MemoryProfiler.report allow_files: 'lib/rambling/trie', ignore_files: 'lib/rambling/trie/tasks' do
  1.upto(5) do |i|
    Rambling::Trie.create '/path/to/dictionary'
  end
end

result.pretty_print to_file: 'reports/#{version}/#{date}/creation'

# Profile trie compression
tries = []
1.upto(5) do |i|
  tries << Rambling::Trie.create '/path/to/dictionary'
end

result = MemoryProfiler.report allow_files: 'lib/rambling/trie', ignore_files: 'lib/rambling/trie/tasks' do
  tries.each { |t| t.compress! }
end

result.pretty_print to_file: 'reports/#{version}/#{date}/compression'

# ... etc.

The report output gives you the amount of memory (allocated and retained) and the number of objects (allocated and retained) all broken down by gem, by file, by location, and by class. It also gives you which strings were allocated and retained and how many times it happened.

The results were really illuminating. After running a profile for all operations, three things were immediately apparent:

The memory footprint in 2.4.0 is a bit better due to some sweet Hash optimizations that came with Ruby 2.4.0 (#win!)
Both the creation and compression operations create a lot of new object instances, as opposed to the rest of the operations that just run through a code path with existing instances
The rest of the memory footprint looked exactly the same, so the issue had to be elsewhere

Forwardable is the culprit

After further investigation with benchmarks and tweaking classes and methods, I realized that forwardable was the culprit. Code like this was causing issues:

module Rambling
  module Trie
    # A representation of a node in the Trie data structure.
    class Node
      # ...

      extend ::Forwardable

      delegate [
        :[],
        :[]=,
        :delete,
        :has_key?
      ] => :children_tree

      # ...
    end
  end
end

Specifically, the benchmark-ips gem helped me analyze this issue further. I ran this benchmark:

require 'benchmark/ips'
require 'forwardable'

class TestDelegate
  extend ::Forwardable
  delegate [:[]] => :hash

  attr_reader :hash

  def initialize hash
    @hash = hash
  end
end

class TestDirect
  attr_reader :hash

  def initialize hash
    @hash = hash
  end

  def [] key
    hash[key]
  end
end

Benchmark.ips do |bm|
  hash = { h: 'hello' }

  delegate = TestDelegate.new hash
  bm.report 'delegate access' do
    delegate[:h]
  end

  direct = TestDirect.new hash
  bm.report 'direct access' do
    direct[:h]
  end

  bm.compare!
end

… and got these results:

...
Comparison:
       direct access:  6531016.6 i/s
     delegate access:   810084.8 i/s - 8.06x  slower

So… access via a delegated method in Ruby 2.4.0 is about 8 TIMES SLOWER (!!!) than directly accessing the underlying field 😨. That needed to be addressed, but what do I do in the meantime? Define all those methods by hand?! NO WAY…

Creating an interim solution

I reached out to Yuki Nishijima and explained the issue. He promptly responded saying he was going to take a look. In the meantime though, I wanted to release a new version of the gem with all the performance improvements I had been working on but without the degradation from 2.4.0, so I set out to create my own version of Forwardable that preserved 2.3.3’s speed and memory in 2.4.0. Here’s what I ended up with (ahem, after several iterations):

# rambling/trie/forwardable
module Rambling
  module Trie
    module Forwardable
      def delegate methods_to_target
        methods_to_target.each do |methods, target|
          methods.each do |method|
            define_method method do |*args|
              send(target).send method, *args
            end
          end
        end
      end
    end
  end
end

Then, by including these lines in the benchmark-ips script:

# other requires
require 'rambling/trie/forwardable'

# other test classes

class TestCustomForwardable
  extend Rambling::Trie::Forwardable
  delegate [:[]] => :hash

  attr_reader :hash

  def initialize hash
    @hash = hash
  end
end

Benchmark.ips do |bm|
  hash = { h: 'hello' }

  # other bm.report calls

  custom = TestCustomForwardable.new hash
  bm.report 'custom delegate access' do
    custom[:h]
  end

  bm.compare!
end

… I got these results (the custom delegate access is the benchmark with Rambling::Trie::Forwardable):

...
Comparison:
       direct access:  6642032.3 i/s
custom delegate access:  2301340.9 i/s - 2.89x  slower
     delegate access:   835004.6 i/s - 7.95x  slower

And BAM! - back to (basically) the same numbers, which then pretty much transferred nicely to 1.0.0.

Contributing back to the community

I submitted an issue to Ruby’s Redmine and soon after got an answer back from Yuki: he pointed out that an UnboundMethod instance was being created on every delegated call. We also noticed that Nobuyoshi Nakada had fixed the issue on trunk revisions #57255, #57256 & #57257.

Finally, these revisions were included into the ruby_2_4 branch in revision #57913 in February 2017. Ruby 2.4.1 came out soon after and running the same benchmark in 2.4.1 gave me these results:

...
Comparison:
       direct access:  6272727.2 i/s
     delegate access:  2823600.3 i/s - 2.22x  slower
custom delegate access:  2288089.7 i/s - 2.74x  slower

… which means I could go back to using Ruby’s own Forwardable again! By May 2017 I was able to do so and delete my own implementation. This is one of my favorite parts about writing code: deleting code.

Update on 2018-02-15

Ruby 2.5.0’s Forwardable performs a bit slower now, but it’s still faster than my own implementation:

...
Comparison:
       direct access:  6231242.7 i/s
     delegate access:  2582759.7 i/s - 2.41x  slower
custom delegate access:  2107914.3 i/s - 2.96x  slower

🤔… Looking at this more closely, maybe I should drop the Forwardable module and just define the methods directly to avoid this 2x slower business. I guess that’ll be a topic for a future blog post.

Notes

You can see the full documented version of Rambling::Trie::Forwardable on GitHub.
The Redmine issue has been addressed and the fix was backported to Ruby 2.2.