
How to skew benchmarks in your favour

📖 tl;dr: Benchmarking is hard. There are many ways in which one can accidentally introduce bias towards a particular solution. Always double check that you're measuring what you think you are measuring.

A common rite of passage for every frontend framework is to compare its execution speed with the established players. After all, you want to know if your efforts paid off. Where does your framework position itself on the performance scale? Maybe you want to make this the main selling point of your framework, so benchmarks can be a motivating factor in squeezing just that little bit more speed out of your code. Whatever your motivation is, let me tell you: benchmarking is hard and has lots of gotchas. There are a lot of things that can introduce bias towards a particular solution.

I thought it'd be fun to share all the things every one of us got wrong about getting credible benchmark results at some point in our career. Benchmarking can be addictive, and those numbers can be pretty deceiving if you're not careful. I know, because I've been there. I made the same mistakes as outlined in this post.

Let's dive into how accidental bias is introduced in framework benchmarks.

Outdated versions

You've been working on your framework for quite a while now. You've had your benchmark suite since the beginning and have been solely focussed on making your framework the fastest out there. But because you've been at it for so long, the other frameworks have released new versions which improve their performance. Unbeknownst to you, you've accidentally introduced a bias into your suite that favours your own solution.

Outdated code

It's not just new framework versions though. With time, programming styles change. Chances are that a framework which has been around for a while has multiple ways of creating - let's say - a component. If you didn't update your benchmark, then you've been comparing your solution to how folks would've written code for the other framework months or even years ago. It's likely that the older style uses more closures, more method calls or something else that makes it slower than the most recent way to write code for that framework.
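To make that concrete, here's an illustrative sketch (not taken from any specific framework - "Component" and the class/function split are assumptions) of an older class-based component next to the function-based form most users would write today. A benchmark that still uses the older form carries overhead the framework's current users no longer pay:

// Older style: class component with an extra method call per render.
class OldCounter extends Component {
  render() {
    return <div>{this.props.value}</div>;
  }
}

// The modern equivalent most users would write today.
const NewCounter = (props) => <div>{props.value}</div>;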

Incorrect results

If you're rolling your own benchmark suite, you probably didn't bother to write code that asserts that each run rendered the correct result. Maybe you're still evaluating whether your approach is worth it and skipped the whole attributes thing because you're more interested in swapping rows. But then it hits you that the benchmark you've been working with uses attributes like 'class' to add CSS classes to every element. So what you've been measuring all this time is how quickly your framework renders no attributes versus a framework that does render them. Oops!
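A cheap safeguard is a sanity check that runs once after a warm-up render, before any timings are recorded. This is a minimal sketch with made-up expectations (1000 rows, a CSS class on each cell), so adjust it to whatever your benchmark actually renders:

// Hypothetical sanity check -- the row count and class name are assumptions.
function assertCorrectOutput(container) {
  const rows = container.querySelectorAll("tr");
  if (rows.length !== 1000) {
    throw new Error(`Expected 1000 rows, got ${rows.length}`);
  }
  // Make sure attributes actually end up in the DOM too.
  if (!rows[0].querySelector("td.col-md-1")) {
    throw new Error("Expected cells to carry their CSS classes");
  }
}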

The same is true if your code throws an exception early during a run. Depending on the benchmark runner that you're using, it might not show 'console.log's or errors. What you've been measuring in this scenario is how quickly your code stopped working. I know, I know. That's a stupid mistake. But no worries, this happens to the best of us!

Batching runs

At this point your framework has matured, you've added all the bells and whistles, and now you think to yourself: "Why should my framework render as fast as it can? Isn't it enough if I just render at 30fps, or 60fps at best?" You've seen some other player hype this up into space and you don't want to miss out on what the cool kids are doing. So you add scheduling to your framework and limit it to a certain number of renders to achieve a specific fps target.

What you didn't think of, though, is that you're now batching benchmark runs together. Picture this: if it takes your framework 14ms to do the work, then each run should take roughly that amount of time. There might be a little variance here and there, but the numbers for every run should fall into the same ballpark. The benchmark runner will receive something like: 14ms, 15ms, 13ms, 14ms, 14ms, 13ms, etc.

But if you're merging runs and only flushing on every fourth invocation, you're doing way less work per measured run. The benchmark runner now gets numbers looking like: 14ms, 1ms, 0ms, 2ms, 14ms, 0ms, etc. Naturally, the runner will think that your framework got faster. It didn't. You just fooled yourself. If you're using a decent runner, it will print a warning when the variance between runs is too high.
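Here's a rough sketch of how this happens. Assume the scheduler queues renders and flushes them at most once per animation frame (flushToDom is a made-up placeholder); the benchmark callback only measures the synchronous part of render(), which is now little more than an array push:

// Hypothetical scheduler sketch -- not any specific framework's implementation.
let queue = [];
let scheduled = false;

function render(vnode) {
  queue.push(vnode); // this is all the benchmark callback actually measures
  if (!scheduled) {
    scheduled = true;
    requestAnimationFrame(() => {
      scheduled = false;
      const batch = queue;
      queue = [];
      // One flush does the work of several benchmark "runs".
      batch.forEach(flushToDom);
    });
  }
}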

Accidental overhead

If you didn't go the custom template route for your framework, chances are that that JSX thing was a pretty compelling option. You've added your own createElement function (the thing JSX gets compiled to) to your package and use that in your benchmark. All the other virtual-dom based frameworks provide their own createElement implementation too, so it's fair game, right?

There is one problem though: the classic Babel transform only allows a single JSX pragma per file. That's annoying! You don't want to duplicate that section of code for each framework, as you're constantly playing around with different numbers of elements, text nodes and other things.
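For context, with the classic transform the pragma is set once for the whole file, either in the Babel config or via a comment, so every JSX expression in that file compiles to the same createElement function. A sketch, assuming a framework that exports h as its createElement:

/** @jsx h */
// With the classic transform, this pragma applies to the entire file:
// every JSX expression below compiles to h(...), so you can't mix two
// frameworks' createElement functions in one file.
import { h } from "my-framework"; // "my-framework" is a placeholder name

const vnodes = <div><h1>Hello world!</h1></div>;
// compiles to: h("div", null, h("h1", null, "Hello world!"))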

"No problem!", you think to yourself and create a little function that converts the return value of your createElement calls to whatever the other framework needs.

const vnodes = (
  <div>
    <h1>Hello world!</h1>
  </div>
);

new Benchmark()
  .add("my framework", () => {
    render(vnodes);
  })
  .add("other framework", () => {
    const app = convert(vnodes);
    renderOther(app);
  });

You run the benchmark again and the numbers are amazing! Your framework is miles ahead of everyone else! But wait! Is it really? You go back to your benchmark to verify the numbers. They look too good to be true. Lo and behold, you notice that you call the conversion function during the measured run for the other framework. So you didn't actually measure how long the other framework took to render; you measured its render time plus the cost of converting the input data (here: vnodes), an overhead your own framework never had to pay…
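The fix is to pay the conversion cost outside the measured callbacks, reusing the hypothetical convert and renderOther helpers from above (and assuming the converted tree can safely be rendered repeatedly):

// Convert once, up front, so both callbacks only measure rendering.
const app = convert(vnodes);

new Benchmark()
  .add("my framework", () => {
    render(vnodes);
  })
  .add("other framework", () => {
    renderOther(app);
  });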

Note that this is just one example of adding overhead. There are many more ways to do that for a specific player.

CPU bias

You've been doing this benchmarking thing for a while now and you're pretty satisfied with the results. Your framework repeatedly yields better numbers than the others. During one of your benchmark sessions you comment out your own benchmark case, which happens to be the first one. You run the benchmark again, but something is odd. Suddenly the numbers for the other framework have improved dramatically! What the…?

Congratulations, you've fallen victim to CPU throttling. Because benchmarks are typically very CPU intensive, the CPU has to do a lot of work and gets pretty hot. If it gets too hot, it must lower its frequency to cool down and avoid damage.

The problem is that at a lower frequency the CPU can't do as much work as before. For our benchmark this means that our framework ran at full speed, whereas sometime during the other framework's run the CPU got too hot and throttled down, leading to worse results for the other framework. This is especially common with laptops, which don't have space for extensive cooling systems due to their thinness.

JIT and GC

JavaScript being a just-in-time compiled language makes benchmarking even harder. The engine doesn't treat your benchmark file as anything special; it just views the whole thing as one big program. If it detects that a piece of code has no side effects and its return value isn't used, it assumes that this piece is dead code and will eliminate it. What you essentially measured is how quickly the engine detects that you're doing nothing.
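A classic shape of this mistake, and a common workaround, look roughly like this (bench and createRows are made-up placeholders); keeping a reference the engine can't prove is unused makes dead code elimination much less likely:

// Hazard: the result is never used, so the engine may skip the work entirely.
bench.add("create 1000 vnodes", () => {
  createRows(1000); // return value unused -- a candidate for dead code elimination
});

// Workaround: hold on to the result so the work can't be proven useless.
let sink;
bench.add("create 1000 vnodes (kept alive)", () => {
  sink = createRows(1000);
});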

Another common issue is GC (= Garbage Collection) pauses. Because the whole thing is just one program from the engine's perspective, the engine decides when it is the best time to run the GC. These heuristics are not deterministic and depend on a lot of other factors on your machine. Depending on when the engine decides to run the GC, your benchmark results might vary a lot from run to run.
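One common mitigation, sketched here for a Node-based runner (Chrome offers the same via --js-flags=--expose-gc), is to trigger a collection between runs so GC pauses land outside the measured region:

// Run with: node --expose-gc bench.js
import { performance } from "node:perf_hooks";

function measure(fn) {
  if (globalThis.gc) globalThis.gc(); // only available with --expose-gc
  const start = performance.now();
  fn();
  return performance.now() - start;
}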

Conclusion

I hope this post gave you a little bit of insight into how easy it is to create a benchmark that favours a specific player or introduces bias. There are obviously many more ways to do that than shown here, so always check how the numbers were arrived at before trusting the marketing material. When in doubt, the benchmark is likely wrong.

But know that this is often not done out of malice. Benchmarking is hard, way harder than it seems! Usually it's just an honest mistake and the developer in question wasn't aware of all the gotchas surrounding it.

Always double check that you're measuring what you think you are measuring!

Follow me on twitter, mastodon or via RSS to get notified when the next article comes online.