Ruby performance: Using benchmarks and memory_profiler to chase performance degradations
February 11, 2018
While working on some of the performance improvements for version 1.0.0 of Rambling Trie late in 2016 and early in 2017, I upgraded from Ruby 2.3.3 to Ruby 2.4.0 imagining that the newer version would help me on my mission to make all operations execute a bit faster. To my surprise, I ran into a significant performance degradation after upgrading - specifically, there was a ~25% increase in the time it took to execute a script using a trie with a big-ish word dictionary. This intrigued me enough that I decided to take a closer look.
Benchmarking with Ruby
After some investigation with the help of Ruby’s own Benchmark
, I realized that while most operations were a bit slower, the main problem was a lot more visible during intensive operations like creating a brand new trie or compressing an existing one. You can see how a performance benchmark of Rambling::Trie.create
went from 8.97 seconds with Ruby 2.3.3, to 11.19 seconds with Ruby 2.4.0.
Bringing out the big guns
So, I pulled in the memory_profiler gem to do some memory allocation monitoring:
require 'memory_profiler'
# Profile trie creation
result = MemoryProfiler.report allow_files: 'lib/rambling/trie', ignore_files: 'lib/rambling/trie/tasks' do
1.upto(5) do |i|
Rambling::Trie.create '/path/to/dictionary'
end
end
result.pretty_print to_file: 'reports/#{version}/#{date}/creation'
# Profile trie compression
tries = []
1.upto(5) do |i|
tries << Rambling::Trie.create '/path/to/dictionary'
end
result = MemoryProfiler.report allow_files: 'lib/rambling/trie', ignore_files: 'lib/rambling/trie/tasks' do
tries.each { |t| t.compress! }
end
result.pretty_print to_file: 'reports/#{version}/#{date}/compression'
# ... etc.
The report output gives you the amount of memory (allocated and retained) and the number of objects (allocated and retained) all broken down by gem, by file, by location, and by class. It also gives you which strings were allocated and retained and how many times it happened.
The results were really illuminating. After running a profile for all operations, three things were immediately apparent:
- The memory footprint in 2.4.0 is a bit better due to some sweet
Hash
optimizations that came with Ruby 2.4.0 (#win!) - Both the creation and compression operations create a lot of new object instances, as opposed to the rest of the operations that just run through a code path with existing instances
- The rest of the memory footprint looked exactly the same, so the issue had to be elsewhere
Forwardable is the culprit
After further investigation with benchmarks and tweaking classes and methods, I realized that forwardable was the culprit. Code like this was causing issues:
module Rambling
module Trie
# A representation of a node in the Trie data structure.
class Node
# ...
extend ::Forwardable
delegate [
:[],
:[]=,
:delete,
:has_key?
] => :children_tree
# ...
end
end
end
Specifically, the benchmark-ips gem helped me analyze this issue further. I ran this benchmark:
require 'benchmark/ips'
require 'forwardable'
class TestDelegate
extend ::Forwardable
delegate [:[]] => :hash
attr_reader :hash
def initialize hash
@hash = hash
end
end
class TestDirect
attr_reader :hash
def initialize hash
@hash = hash
end
def [] key
hash[key]
end
end
Benchmark.ips do |bm|
hash = { h: 'hello' }
delegate = TestDelegate.new hash
bm.report 'delegate access' do
delegate[:h]
end
direct = TestDirect.new hash
bm.report 'direct access' do
direct[:h]
end
bm.compare!
end
… and got these results:
...
Comparison:
direct access: 6531016.6 i/s
delegate access: 810084.8 i/s - 8.06x slower
So… access via a delegated method in Ruby 2.4.0 is about 8 TIMES SLOWER (!!!) than directly accessing the underlying field 😨. That needed to be addressed, but what do I do in the meantime? Define all those methods by hand?! NO WAY…
Creating an interim solution
I reached out to Yuki Nishijima and explained the issue. He promptly responded saying he was going to take a look. In the meantime though, I wanted to release a new version of the gem with all the performance improvements I had been working on but without the degradation from 2.4.0, so I set out to create my own version of Forwardable
that preserved 2.3.3’s speed and memory in 2.4.0. Here’s what I ended up with (ahem, after several iterations):
# rambling/trie/forwardable
module Rambling
module Trie
module Forwardable
def delegate methods_to_target
methods_to_target.each do |methods, target|
methods.each do |method|
define_method method do |*args|
send(target).send method, *args
end
end
end
end
end
end
end
Then, by including these lines in the benchmark-ips script:
# other requires
require 'rambling/trie/forwardable'
# other test classes
class TestCustomForwardable
extend Rambling::Trie::Forwardable
delegate [:[]] => :hash
attr_reader :hash
def initialize hash
@hash = hash
end
end
Benchmark.ips do |bm|
hash = { h: 'hello' }
# other bm.report calls
custom = TestCustomForwardable.new hash
bm.report 'custom delegate access' do
custom[:h]
end
bm.compare!
end
… I got these results (the custom delegate access
is the benchmark with Rambling::Trie::Forwardable
):
...
Comparison:
direct access: 6642032.3 i/s
custom delegate access: 2301340.9 i/s - 2.89x slower
delegate access: 835004.6 i/s - 7.95x slower
And BAM! - back to (basically) the same numbers, which then pretty much transferred nicely to 1.0.0.
Contributing back to the community
I submitted an issue to Ruby’s Redmine and soon after got an answer back from Yuki: he pointed out that an UnboundMethod
instance was being created on every delegated call. We also noticed that Nobuyoshi Nakada had fixed the issue on trunk revisions #57255, #57256 & #57257.
Finally, these revisions were included into the ruby_2_4
branch in revision #57913 in February 2017. Ruby 2.4.1 came out soon after and running the same benchmark in 2.4.1 gave me these results:
...
Comparison:
direct access: 6272727.2 i/s
delegate access: 2823600.3 i/s - 2.22x slower
custom delegate access: 2288089.7 i/s - 2.74x slower
… which means I could go back to using Ruby’s own Forwardable
again! By May 2017 I was able to do so and delete my own implementation. This is one of my favorite parts about writing code: deleting code.
Update on 2018-02-15
Ruby 2.5.0’s Forwardable
performs a bit slower now, but it’s still faster than my own implementation:
...
Comparison:
direct access: 6231242.7 i/s
delegate access: 2582759.7 i/s - 2.41x slower
custom delegate access: 2107914.3 i/s - 2.96x slower
🤔… Looking at this more closely, maybe I should drop the Forwardable
module and just define the methods directly to avoid this 2x slower business. I guess that’ll be a topic for a future blog post.
Notes
- You can see the full documented version of
Rambling::Trie::Forwardable
on GitHub. - The Redmine issue has been addressed and the fix was backported to Ruby 2.2.