Hello. I'm the author of the PR that landed the tail-calling interpreter in CPython.
First, I want to say thank you to Nelson for spending almost a month to get to the root of this issue.
Secondly, I want to say I'm extremely embarrassed and sorry that I made such a huge oversight. I, and probably the rest of the CPython team, did not expect the compiler we were using for the baseline to have that bug.
I posted an apology blog post here. https://fidget-spinner.github.io/posts/apology-tail-call.htm...
Reading that you are extremely embarrassed and sorry that you made such a huge oversight, I was imagining you had broken something / worsened CPython's performance.
But it's nothing like that. You announced a 10-15% perf improvement, but that improvement is more like 1-5% on a non-buggy compiler. It's not even that the 10-15% figure is wrong; it's just that it's correct only under very specific conditions, without your knowing it.
IIUC, you did your homework: you made an improvement, you measured a 10-15% perf improvement, the PR was reviewed by other people, etc. It just so happens that this 10-15% figure is misleading because of an issue with the version of clang you happened to use to measure. Unless I'm missing something, it looks like a fair mistake anyone could have reasonably made. It even looks like it was hard to not fall into this trap. You could have been more suspicious seeing such a high number, but hindsight is 20/20.
Apparently, you still brought significant performance improvements, and your work also helped uncover a compiler regression. The wrong number seems quite minor in comparison. I wonder who was actually hurt by this. I'm only discovering the "case" right now, but at first glance it doesn't feel like you owe an apology to anyone. Kudos for all this!
> IIUC, you did your homework: you made an improvement, you measured a 10-15% perf improvement, the PR was reviewed by other people, etc. It just so happens that this 10-15% figure is misleading because of an issue with the version of clang you happened to use to measure. Unless I'm missing something, it looks like a fair mistake anyone could have reasonably made. It even looks like it was hard to not fall into this trap. You could have been more suspicious seeing such a high number, but hindsight is 20/20.
Hah! Is this a Gettier problem [0]?
1. True: The PR improves Python performance 10-15%.
2. True: Ken believes that the PR improves Python performance 10-15%.
3. True: Ken is justified in believing that the PR improves Python performance 10-15%.
Of course, PR discussions don't generally revolve around whether or not the PR author "knows" that the PR does what they claim it does. Still: these sorts of epistemological brain teasers seem to come up in the performance measurement field distressingly often. I wholeheartedly agree that Ken deserves all the kudos he has received; still, I also wonder if some of the strategies used to resolve the Gettier problem might be useful for code reviewers to center themselves every once in a while. Murphy's Law and all that.
[0]: https://en.wikipedia.org/wiki/Gettier_problem
Could very well be!
Interesting, I didn't know about the Gettier problem, thanks for sharing. You could try submitting that page as a proper HN post.
In some way, by indirectly helping fix this bug, they led to a ~10% performance increase for everyone who was using that faulty compiler! That's even better than an optional flag that many people won't know about or use.
That performance regression only hit code that uses a very large number of paths ending in the same table of computed gotos, which likely means only relatively complex interpreters were affected. So it's not a broad performance improvement. But it is nice to have an example of the compiler's new heuristic failing, as evidence that it needs to be tunable.
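For readers who haven't seen the pattern, here is a minimal sketch of replicated computed-goto dispatch (GNU C labels-as-values; an illustration with invented opcodes, not CPython's actual code). Historically, the point of the replication is that each opcode keeps its own indirect jump, giving the branch predictor one prediction site per opcode; per the post, the clang-19 regression effectively merged those copies back into a single shared dispatch.

    int run(const unsigned char *pc, int acc) {
        /* One table of label addresses, shared by every dispatch site. */
        static void *const table[] = { &&op_inc, &&op_dec, &&op_halt };

        goto *table[*pc];        /* initial dispatch */

    op_inc:
        acc += 1;
        pc += 1;
        goto *table[*pc];        /* per-opcode copy of the indirect jump */

    op_dec:
        acc -= 1;
        pc += 1;
        goto *table[*pc];        /* another copy: its own predictor slot */

    op_halt:
        return acc;
    }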
FWIW - the fix was merged since you wrote that blog post ;)
Beyond that - 3-5% is a lot for something as old as the python interpreter if it holds up. I would still be highly proud of that.
After 30 years, I've learned (like I expect you have) to be suspicious of any significant (i.e. >1%) performance improvement in a system that has existed a long time.
They happen for sure, but are less common. Often, people are shifting time around, so it just isn't part of your benchmark anymore [1]. Second, benchmarking is often done in controlled environments, to try to isolate the effect. Which seems like the right thing to do. But then the software is run in non-isolated environments (i.e. with a million other things on a VM or desktop computer), which isn't what you benchmarked it in. I've watched plenty of verifiably huge isolated gains disappear or go negative when put in production environments.
You have the particularly hard job that you have to target lots of environments - you can't even do something like say "okay look if it doesn't actually speed it up in production, it didn't really speed it up", because you have no single production target.
That's a really hard world to live in and try to improve.
In the end, performance tuning and measurement is really hard. You have nothing to be sorry for, except learning that :)
Don't let this cause you to be afraid to get it wrong - you will get it wrong anyway. We all do. Just do what you are doing here - say 'whoops, I think we screwed this up', try to figure out how to deal with it, and try to figure out how to avoid it in the future (if you can).
[1] This is common not just in performance, but in almost anything, including human processes. For example, to make something up, the team working on the code review tool would say "we've reduced code review time by 15% and thus sped up everyone's workflow!". Awesome! But actually, it turns out they made more work for some other part of the system, so the workflow didn't get any faster, they just moved the 15% into a part of the world they weren't measuring :)
> After 30 years, I've learned (like I expect you have) to be suspicious of any significant (i.e. >1%) performance improvement in a system that has existed a long time.
Sure, let me amend it to "I'm suspicious of any significant performance improvement in a system where performance actually matters, and has existed in a state where performance matters for a long time".
I think it's important to note that a primary motivation of the tail call interpreter design is to be less vulnerable to the whims of the optimizer. From my original blog article about this technique (https://blog.reverberate.org/2021/04/21/musttail-efficient-i...):
> Theoretically, this control flow graph paired with a profile should give the compiler all of the information it needs to generate the most optimal code [for a traditional switch()-based interpreter]. In practice, when a function is this big and connected, we often find ourselves fighting the compiler. It spills an important variable when we want it to keep it in a register. It hoists stack frame manipulation that we want to shrink wrap around a fallback function invocation. It merges identical code paths that we wanted to keep separate for branch prediction reasons. The experience can end up feeling like trying to play the piano while wearing mittens.
That second-to-last sentence is exactly what has happened here. The "buggy" compiler merged identical code paths, leading to worse performance.
The "fixed" compiler no longer does this, but the fix is basically just tweaking a heuristic inside the compiler. There's no actual guarantee that this compiler (or another compiler) will continue to have the heuristic tweaked in the way that benefits us the most.
The tail call interpreter, on the other hand, lets us express the desired machine code pattern in the interpreter itself. Between "musttail", "noinline", and "preserve_none" attributes, we can basically constrain the problem such that we are much less at the mercy of optimizer heuristics.
For this reason, I think the benefit of the tail call interpreter is more than just a 3-5% performance improvement. It's a reliable performance improvement that may be even greater than 3-5% on some compilers.
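As a rough illustration of that shape, here is a hedged sketch with invented names (handler, op_inc, op_halt, DISPATCH), not CPython's generated code. It assumes clang 19+ on x86-64/AArch64, since `preserve_none` only exists there; `musttail` turns a silently missed tail call into a compile error, so the no-stack-growth property is guaranteed rather than hoped for.

    #include <stdint.h>
    #include <stdio.h>

    /* Handlers share one calling convention so the musttail dispatch is legal. */
    typedef int64_t (__attribute__((preserve_none)) *handler)(const uint8_t *pc, int64_t acc);

    __attribute__((preserve_none)) static int64_t op_inc(const uint8_t *pc, int64_t acc);
    __attribute__((preserve_none)) static int64_t op_halt(const uint8_t *pc, int64_t acc);

    /* Hypothetical two-opcode bytecode: 0 = INC, 1 = HALT. */
    static const handler table[] = { op_inc, op_halt };

    /* Every handler ends by tail-calling the next one through the table. */
    #define DISPATCH(pc, acc) __attribute__((musttail)) return table[*(pc)]((pc), (acc))

    __attribute__((preserve_none, noinline))
    static int64_t op_inc(const uint8_t *pc, int64_t acc) {
        acc += 1;              /* the opcode's work */
        pc += 1;               /* advance to the next opcode */
        DISPATCH(pc, acc);     /* compiled as a jump, so the C stack never grows */
    }

    __attribute__((preserve_none, noinline))
    static int64_t op_halt(const uint8_t *pc, int64_t acc) {
        (void)pc;
        return acc;            /* one return unwinds the whole chain */
    }

    int main(void) {
        const uint8_t code[] = { 0, 0, 0, 1 };                  /* INC, INC, INC, HALT */
        printf("%lld\n", (long long)table[code[0]](code, 0));   /* prints 3 */
        return 0;
    }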
This is a good point. We already observed this in our LTO and PGO builds for the computed goto interpreter. On modern compilers, each LTO+PGO build has huge variance (1-2%) for the CPython interpreter. On macOS, we already saw a huge regression in performance because Xcode just decided to stop making LTO and PGO work properly on the interpreter. Presumably, the tail call interpreter would be immune to this.
I just wanted to say: respect for being able to say "sorry, I made a mistake". I hate the fake it till you make it mentality that seems to be the norm now.
I understand the frustration, but I don't think it needed to be said (the part about the mentality; the thanks is of course cool), because that's still not the norm.
Why do I even bring this up? I want to say: let's not let what we see in the news influence our perception of the real people of the world. Just because fraud and crimes get elevated in the news does not mean that the common man is a criminal or a fraud. :)
Why didn't this regression in baseline performance show up (or did it?) on the faster-cpython benchmarks page [0]? Could the benchmarks be improved to prevent similar issues in the future?
As alluded to in https://news.ycombinator.com/item?id=43319010, I see these tests were collected against just 2 Intel and 2 ARM CPUs. So, if you are looking for feedback to improve, you should probably also include (at least) an AMD Zen4 or Zen5 in there. CPU & compiler people have both been trying to "help perf while not trusting the other camp" for as long as I can remember, and I doubt that problem will ever go away.
A couple more CPUs will help but not solve generalizability of results. E.g., if somebody tests against some ancient 2008 Nehalem hardware, they might get very different answers. Similarly, answers today might not reflect 2035 very well.
The reality of our world of complex hardware deployment (getting worse with GPUs) is that "portable performance" is almost a contradiction in terms. We all just do the best we can at some semblance of the combination. The result is some kind of weighted average of "not #ifdef'd too heavily source" and "a bit" faster "informally averaged over our user population & their informally averaged workloads" and this applies at many levels of the whole computing stack.
EDIT: And, of course, a compiled impl like Cython or Nim is another way to go if you care about performance, but I do understand the pull & network effects of the Python ecosystem. So, that may not always be practical.
We don't normally test with bleeding-edge compilers on the faster cpython benchmarks page because that would invalidate historical data. E.g. if 2 years ago we used GCC 11 or something to compile and run a benchmark, we need to run it with GCC 11 again today to get comparable results.
Clang 19 was released last year. We only benched it a few months ago. We did notice there was a significant slowdown on macOS, but that was against Xcode Clang, which is a different compiler. I thought it might've been an Xcode thing, which has bitten CPython in the past (such as Xcode LTO working/not working versus normal Clang), so I didn't investigate deeper (facepalming now in retrospect) and chalked it up to a compiler difference.
TLDR: We didn't run benchmarks of Clang 19 versus 18. We only ran benchmarks of Clang 19 versus GCC, Xcode Clang, and MSVC, none of which are apples-to-apples with Clang 19, so I naively thought it was a compiler difference.
EDIT: As to how we could improve this process, I'm not too sure, but I know I'll be more discerning when there's a >4% perf hit now when upgrading compilers.
The blog post mentions it brings a 1-5% perf improvement. Which is still significant for CPython. It does not complicate the source because we use a DSL to generate CPython's interpreters. So the only complexity is in autogenerated code, which is usually meant for machine consumption anyways.
The other benefit (for us maintainers I guess), is that it compiles way faster and is more debuggable (perf and other tools work better) when each bytecode is a smaller function. So I'm inclined to keep it for perf and productivity reasons.
Being more robust to fragile compiler optimizations is also a nontrivial benefit. An interpreter loop is an extremely specialized piece of code whose control flow is too important to be left to compiler heuristics.
If the desired call structure can be achieved in a portable way, that's a win IMO.
There was a plan for a 5x speedup overall looking for funding back in 2022. Then a team with Guido and others involved (and MS backing?) got on the same bandwagon and made some announcements for speeding up CPython a lot.
Several releases in, have we seen even a 2x speedup? Or more like 0.2x at best?
Not trying to dismiss the interpreter changes - I just want to know if those speedup plans were even remotely realistic, and whether anything close to even 1/5 of what was promised will really come out of them...
The faster cpython project was for 5x over the 3.10 baseline. CPython 3.13 is currently running at something like 1.6x speed compared to 3.10. With the JIT enabled it goes up by a few more single-digit percentage points. With the changes in 3.14 it'll be something like a 1.8x speed-up.
So it's slowly getting there, I think the faster cpython project was mostly around the idea that the JIT can get a lot faster as it starts to optimise more and more and that only just got shipped in 3.13, so there's a lot of headroom. We know that PyPy (an existing JIT implementation) is close to 5x faster than CPython a lot of the time already.
There's also now the experimental free-threading build which speeds up multithreaded Python applications (Not by a lot right now though unfortunately).
There have been a lot of speed-ups coming out of that project. It's not being done in one go, though, but spread out over the last several releases. So while individual speed-ups may not look that significant, remember that they compound on the previous release's speed-up.
> His academic and commercial work is focused on compilers, virtual machines and static analysis for Python. His PhD was on building virtual machines for dynamic languages.
This dude looks God-level.
Half-joking: Maybe MSFT can also poach Lars Bak of Google V8-fame.
Thank you for your efforts in improving Python and especially thank you for helping to set the record straight in such a clear and visible manner.
Some of the sibling comments are saying there's "no need to apologize". One way this might be read is "no need to hold yourself to a higher standard". If this is the sentiment, then I disagree—we live in an age where accountability and intellectual integrity are undervalued. Being able to say that you and your team should have investigated a too-good-to-be-true result further before reporting it is setting a higher standard for yourselves.
But another way to read this is "no need to feel bad" and on this I agree. We all often fail to sufficiently challenge our own work, especially when it looks like something we worked hard on had a big impact.
> we live in an age where accountability and intellectual integrity are undervalued
I'm tired of this doomerism trope on HN. One thing has been constant in my life: People complaining that "we live in an age where accountability and intellectual integrity are undervalued". For me, it is on the same level as interviewing people for a local news show: "So, how's traffic these days?" "Oh, it's never been worse."
In what age were accountability and intellectual integrity "correctly" valued?
The way this was handled is incredibly good form! Thank you for going to such lengths to make sure the mistake was corrected. I respect and thank you (and all the developers working in Python) for all the work you do!
Yea, don't feel too bad, you're just unlucky that your mistake was so public, most of us have these kinds of mistakes weekly in a more private setting. I think everyone who knows anything about how optimized Python is would be impressed that you managed even a few percent improvement in speed! That will also probably save many, many GWh of power, even those few percent!
You have a new account here and your blog is just one article so far so you might be a new-ish developer(?), but you are doing great, keep it up! If you are not a new developer, you are still doing great, keep it up!
I don’t think you should be embarrassed or apologize. You still did a thing that improved performance - in the near term it worked around a bug that you weren’t even aware of, and long term there are still gains even with that bug fixed. But even if that weren’t the case, the only way anything gets done in software is to rely on abstractions. You can either get things done or know exactly how every last bit of your stack is implemented, but probably not both. It’s a very reasonable trade off.
Smart people go and build things. Other smart people find problems. Nothing's broken with that.
I feel bad for you, since a change on the order of 5% would still have been seen as a very nice bit of work. I appreciate the class with which you’re dealing with an honest mistake, and all of the hard work you’ve given the Python community.
No one cares about this oversight - more important is that you spend time on development that the rest of us don’t have (or don’t have the skill to do in my case). Thanks for your hard work.
You don't need a long heartfelt apology - a simple "god dammit" is enough for this. You made no mistake - you only got unlucky.
It's possible to improve your luck by applying more care, but it's also possible to apply so much care that you do not try things that would have been useful, so I'd rather you keep erring on the side that you did!
Hello, I was one of the people who contributed to this interpreter change. Thank you Nelson for the excellent write-up and for going to such lengths to get to the bottom of this. That said, I wanted to defend Ken a bit. :-)
Ken Jin is a volunteer who's been tirelessly working to make CPython faster over the past couple of years for not enough credit. IMHO, Ken did nothing wrong here, and there's really nothing to be embarrassed about. The fact that it took a month (more if you consider the folks helping to reproduce the original results) for anyone to notice anything wrong shows how complicated the situation was! Put yourself in Ken's shoes -- having updated to LLVM-19 to enable the preserve_none flag, would you then hypothesize that LLVM-19 may have introduced an unrelated performance regression that no one noticed for five months? Lots of people underestimate how hard benchmarking is IMO.
A 1-5% performance improvement is also pretty valuable, just not quite as spectacular as we thought originally :-)
Benchmarking is just insanely hard to do well. There are so many things which can mislead you.
I recently discovered a way to make an algorithm about 15% faster. At least, that's what all the benchmarks said. At some point I duplicated the faster function in my test harness, but did not call the faster version, just the original slower one... And it was still 15% faster. So code that never executed sped up the original code...!!! Obviously, this was due to code and memory layout issues, moving something so it aligned with some CPU cache better.
It's actually really really hard to know if speedups you get are because your code is actually "better" or just because you lucked out with some better alignment somewhere.
Casey Muratori has a really interesting series about things like this in his substack.
Aleksey Shipilёv, a long-time Java "performance engineer" (my term), has written and spoken extensively about the challenges of benchmarking. I highly recommend reading some of his blog posts or watching one of his talks about it.
I vaguely remember about some benchmarking project that deliberately randomised these compiler decisions, so that they could give you more stable estimates of how well your code actually performed, and not just how well you won or lost the linker lottery.
There was Stabilizer [1] which did this, although it is no longer maintained and doesn't work with modern versions of LLVM. I think there is something more current now that automates this, but can't remember what it's called.
As already mentioned this is likely Emery Berger’s project with the idea of intentionally slowing down different parts of the program, also to find out which parts are most sensitive to slowdowns (aka have the biggest effect on overall performance), with the assumption that these are also the parts that profit the most from optimisations.
That linker lottery led to a 15% improvement? I'm surprised. Do you know in what cases you get such a big improvement from something like that? Is it rare? How did you end up reasoning about it?
Various research has shown that the variation can be much higher than 15% due to things like this. It's not that rare; I keep bumping into it. Compilers and linkers do a reasonable job but fundamentally modern CPUs are extremely complex beasts.
I found Casey Muratori's series the best explanation of what is going on at the CPU level.
Some additional context. I was actually writing a benchmarking tool for certain kinds of search algorithm. I spent a long time reducing and controlling for external sources of noise. CPU pinning, doing hundreds of those runs with different data, and then repeating them several times and taking the best of each score with the same data (to control for transient performance issues due to the machine doing other things).
I got the benchmarking tool itself to give reasonably repeatable measurements.
The tool had high precision, but the accuracy of "which algorithm is better" was not reliable just due to these code layout issues.
I basically gave up and shelved the benchmarking project at that point, because it wasn't actually useful to determine which algorithm was better.
Kudos to the author for diving in and uncovering the real story here. The Python 3.14 tail-call interpreter is still a nice improvement (any few-percent gain in a language runtime is hard-won), just not a magic 15% free lunch. More importantly, this incident gave us valuable lessons about benchmarking rigor and the importance of testing across environments. It even helped surface a compiler bug that can now be fixed for everyone’s benefit. It’s the kind of deep-dive that makes you double-check the next big performance claim. Perhaps the most thought-provoking question is: how many other “X% faster” results out there are actually due to benchmarking artifacts or unknown regressions? And how can we better guard against these pitfalls in the future?
I guess the bigger question for me is, how was a 10% drop in Python performance not detected when that faulty compiler feature was pushed? Do we not benchmark the compilers themselves? Do the existing benchmarks on the compiler or python side not use that specific compiler?
As far as I am aware, the official CPython binaries on Linux have always been built using GCC, so you will have to build your own CPython using both Clang 18 and 19 to notice the speed difference. I think this is partly why no one has noticed the speed difference yet.
This is a very good example of how C is not "close to the machine" or "portable assembly", modern optimizers will do drastic changes to the logic as long as it has no observable effect.
As stated in the post: "Thus, we end up in this odd world where clang-19 compiles the computed-goto interpreter “correctly” – in the sense that the resulting binary produces all the same value we expect – but at the same time it produces an output completely at odds with the intention of the optimization. Moreover, we also see other versions of the compiler applying optimizations to the “naive” switch()-based interpreter, which implement the exact same optimization we “intended” to perform by rewriting the source code."
> This is a very good example of how C is not "close to the machine" or
> "portable assembly",
C is very much "portable assembly" from the perspective of other systems programming languages of the 80s-90s era. The C expression `a += 1` can be trusted to increment a numeric value, but the same expression in C++ might allocate memory or unwind the call stack or do who knows what. Similarly, `a = "a"` is a simple pointer assignment in C, but in C++ it might allocate memory or [... etc].
The phrase "C is portable assembly" isn't a claim that each statement gets compiled directly to equivalent machine code.
When the code has hit the IR in clang or gcc, there is no 'a' (we know that with certainty, since SSA form doesn't mutate but assigns to fresh variables). We don't know if there will be an increment of 1, the additions could be coalesced (or elided if the result can be inferred another way). The number can even decrease, say if things have been handled in chunks of 16, and needs to be adjusted down in the last chunk. Or the code may be auto-vectorized and completely rewritten, so that none of the variables at the C level are reflected on the assembler level.
From a high-level academic view, yes, the compiler is allowed to perform any legal transformation. But in practice C compilers are pretty conservative about what they emit, especially when code is compiled without -march=.
You don't have to take my word for it. Go find a moderately complex open-source library written in C, compile it, then open up the result in Hexrays/Ghidra/radare2/whatever. Compare the compiled functions with their original source and you'll see there's not that much magic going on.
The place where C compilers are conservative is when dealing with arrays and pointers. That's because it's impossible for C to know if a pointer is to an element of an array or something completely different. Pointer math further complicates what a pointer could actually reference.
Saying that something "is like XY" when you really mean "is like XY, at least in comparison to C++" isn't what most people mean.
C is not a portable assembler.
In C, "a += 1" could overflow, and signed overflow is undefined behavior--even though every individual ISA has completely defined semantics for overflow, and nearly all of them these days do two's complement wraparound arithmetic. With C's notion of undefined behavior, it doesn't even give you the same wraparound in different places in the same program. In fact, wraparound is so undefined that the program could do absolutely anything, and the compiler is not required to even tell you about it. Even without all the C++ abstraction madness, a C compiler can give you absolutely wild results due to optimizations, e.g. by evaluating "a += 1" at compile time and using a different overflow behavior than the target machine. Compile-time evaluation not matching runtime evaluation is one of a huge number of dumb things that C gives you.
Another is that "a += 1" may not even increment the variable. If this occurs as an expression, and not as a statement, e.g. "f(a += 1, a += 1)", you might only get one increment due to sequence points[1]--not to mention that the order of evaluation might be different depending on the target.
C is not a portable assembler.
C is a low-level language where vague machine-like programs get compiled to machine code that may or may not work, depending on whether it violates UB rules or not, and there are precious few diagnostics to tell if that happened, either statically or dynamically.
> The phrase "C is portable assembly" isn't a claim that each statement gets compiled directly to equivalent machine code.
Weasel words. Like a "self driving car" that requires a human driver with constant attention willing to take over within a few hundred milliseconds.
People advocate for C and use it in a way that implies they think it can achieve specific machine outcomes, and it usually does .. except when it doesn't. If people want a portable assembler they should build one.
As a general rule if you're reading a technical discussion and every single participant is using a particular phrase in a way that doesn't make sense to you then you should probably do a quick double-check to make sure you're on the same page.
For example, in this discussion about whether C is "portable assembly", you might be tempted to think back to the days of structured programming in assembly using macros. I no longer remember the exact syntax, but with if/while-style macros, programs could be written to look almost like a high-level language.
Assembly? Definitely! Portable? Eh, sort of! If you're willing to restrict yourself to DOS + POSIX and write an I/O abstraction layer then it'll probably run on i386/SPARC/Alpha/PA-RISC.
But that's not really what people are discussing, is it?
When someone says "C is portable assembly" they don't mean you can take C code and run it through a platform-specific macro expander. They don't mean it's literally a portable dialect of assembly. They expect the C compiler to perform some transformations -- maybe propagate some constants, maybe inline a small function here and there. Maybe you'd like to have named mutable local variables, which requires a register allocator. Reasonable people can disagree about exactly what transformations are legal, but at that point it's a matter of negotiation.
Anyway, now you've got a language that is more portable than assembler macros but still compiles more-or-less directly to machine code -- not completely divorced from the underlying hardware like Lisp (RIP Symbolics). How would you describe it in a few words? "Like assembly but portable" doesn't seem unreasonable.
> still compiles more-or-less directly to machine code
There's a lot hiding in "more or less". The same kind of example holds for e.g. C# : https://godbolt.org/noscript/csharp ; if you hit "Compile" it'll give you the native binary. If you write "x+1" it'll generate an add .. or be optimized away. Now does that mean it's portable assembler? Absolutely not.
Conversely there's a bunch of things that people expect to do in C, do in real code, but are not in the standard or are undefined or implementation-defined. As well as things that are present in assemblers for various platforms (things like the overflow flag) which aren't accessible from the C language.
What people actually seem to mean by "portable assembler" is "no guardrails". Memory unsafety as a feature.
> Reasonable people can disagree about exactly what transformations are legal, but at that point it's a matter of negotiation
And a matter of CVEs when you lose your negotiation with the compiler. Or less dramatic things like the performance fluctuations under discussion.
C might be low level from the perspective of other systems languages, but that is like calling Apollo 11 simple from the perspective of modern spacecraft. C as written is not all that close to what actually gets executed.
For a small example, there are many compilers who would absolutely skip incrementing 'a' in the following code:
uint32_t add_and_subtract_1(uint32_t a) {
    a += 1;
    a -= 1;
    return a;
}
Even though that code contains `a += 1;` clear as day, the chances of any incrementing being done are quite small IMO. It gets even worse in bigger functions where out-of-order execution starts being a thing.
Why would you want it to increment by 1 if we decrement by 1 from the same variable? That would be a waste of cycles, and a good compiler knows how to optimize it out, or what am I misunderstanding here? What do you expect "it" to do, and what does it really do?
I'm pretty sure that's replying directly to the comment about how c is close to assembly and that if you add that line of code somewhere you know there's a variable getting incremented. Doesn't really matter whether or not it's useful, the point is that the behavior isn't exactly what you wrote
To reiterate, claiming that C can be described as "portable assembly" is not a claim that it is literally a package of assembler macros that emit deterministic machine code for each individual source expression.
I linked these in another comment, but here's some examples of straightforward-looking integer addition emitting more complex compiler output for other languages that compile to native code:
That’s a contrived example, but in a serious program there would often be code in between or some level of indirection (e.g. one of those values is a lookup, a macro expansion, or the result of another function).
Nothing about that is cheating, it just says that even C programmers cannot expect to look at the compiled code and see a direct mapping from their source code. Your ability to reason about what’s actually executing requires you to internalize how the compiler works in addition to your understanding of the underlying hardware and your application.
What optimizer would remove the increment/decrement if the value was accessed in between? That seems like something that would be really easy to detect.
It would be very normal for a compiler to do an increment (or merge it into a later instruction), but never do the decrement, and instead use the old copy of the value.
Then in the next step, it would see that the result of the increment is never used and thus the increment instruction is dead code and can also be removed.
In what language that is not assembly can you do that, though? The higher-level the language is, the "worse" or more difficult it gets; perhaps I am not following the thread right.
Yes, this is normal for languages. The only pushback here is against the term “portable assembler” being applied to C, where it’s incomplete often enough that many people feel it’s no longer a helpful label.
I think it’s also reflecting the maturity and growth of the industry. A turn of the century programmer could relatively easily find areas where dropping down to assembly was useful, but over the subsequent decades that’s become not only uncommon but often actively harmful: your code hand-optimized for a particular processor is likely slower on newer processors than what a modern compiler emits and is definitely a barrier to portability in an era where not only are ARM and potentially RISC-V of interest but also where code is being run on SIMD units or GPUs. This makes the low-level “portable assembler” idea less useful because there’s less code written in that middle ground when you want either a higher-level representation which gives compilers more flexibility or precise control. For example, cryptography implementers want not just high performance but also rigid control of the emitted code to avoid a compiler optimizing their careful constant-time implementation into a vulnerability.
I'm not an embedded expert but a friend of mine has complained about compiler optimizations breaking things in his programs. I could see incrementing by one being used to set some bits in a memory location for a cycle that may mean something to some peripheral and then decrementing by one to set some other bits that may mean something else. In that case, the compiler removing those two lines would cause a very hard to debug issue.
> It gets even worse in bigger functions where out-of-order execution starts being a thing.
In addition, add that your processor isn't actually executing x86 (nor ARM etc) instructions, but interprets/compiles them to something more fundamental.
So there's an additional layer of out-of-order instructions and general shenanigans happening. Especially with branch prediction in the mix.
If by miscompile you mean that it fails the test that a "C expression `a += 1` can be trusted to increment a numeric value", then it is trivial: https://godbolt.org/z/G5dP9dM5q
The (implied) claim is that the C standard has enough sources of undefined behavior that even a simple integer addition can't be relied upon to actually perform integer addition.
But the sources of undefined behavior for integer addition in C are well-known and very clear, and any instruction set that isn't an insane science project is going to have an instruction to add integers.
Thus my comment. Show me a C compiler that takes that code and miscompiles it. I don't care if it returns a constant, spits out an infinite loop, jumps to 0x0000, calls malloc, whatever. Show me a C compiler that takes those four lines of C code and emits something other than an integer addition instruction.
Why are you talking about miscompilation? While the LLVM regression in the featured article makes the code slower, it is not a miscompilation. It is "correct" according to the contract of the C language.
You show one example where C doesn't have problems, but that's a much weaker claim than it sounds. "Here's one situation where this here gun won't blow your foot off!"
For what it's worth, C++ also passes your test here. You picked an example so simple that it's not very interesting.
'eru implied `a += 1` has undefined behavior; I provided a trivial counter-example. If you'd like longer examples of C code that performs unsigned integer addition then the internet has many on offer.
I'm not claiming that C (or C++) is without problems. I wrote code in them for ~20 years and that was more than enough; there's a reason I use Rust for all my new low-level projects. In this case, writing C without undefined behavior requires lots of third-party static analysis tooling that is unnecessary for Rust (due to being built in to the compiler).
But if you're going to be writing C as "portable assembly", then the competition isn't Rust (or Zig, or Fortran), it's actual assembly. And it's silly to object to C having undefined behavior for signed integer addition, when the alternative is to write your VM loop (or whatever) five or six times in platform-specific assembly.
I know that for 'int a' the statement 'a += 1' can give rather surprising results.
And you made a universal statement that 'a += 1' can be trusted. Not just that it can sometimes be trusted. In C++ the code you gave above can also be trusted as far as I can tell. At least as much as the C version.
In C there is no operator overloading, so an expression like `a += 1` is easy to understand as incrementing a numeric value by 1, where that value's type is one of a small set of built-in types.
You'd need to look further up in the function (and maybe chase down some typedefs) to see what that type is, but the set of possible types generally boils down to "signed int, unsigned int, float, pointer". Each of those types has well-defined rules for what `+= 1` means.
That means if you see `int a = some_fn(); assert(a < 100); a += 1` in the C code, you can expect something like `ADD EAX,1` somewhere in the compiler output for that function. Or going the other direction, when you're in a GDB prompt and you disassemble the current EIP and you see `ADD EAX,1` then you can pretty much just look at the C code and figure out where you are.
---
Neither of those is true in C++. The combination of completely ad-hoc operator overloading, function overloading, and implicit type conversion via constructors means that it can be really difficult to map between the original source and the machine code.
You'll have a core dump where EIP is somewhere in the middle of a function like this:
std::string some_fn() {
    some_ns::unsigned<int> a = 1;
    helper_fn(a, "hello");
    a += 1;
    return true;
}
and the disassembly is just dozens of function calls for no reason you can discern, and you're staring at the return type of `std::string` and the returned value of `true`, and in that moment you'll long for the happy days when undefined behavior on signed integer overflow was the worst you had to worry about.
> That means if you see `int a = some_fn(); assert(a < 100); a += 1` in the C code, you can expect something like `ADD EAX,1` somewhere in the compiler output for that function.
I completely agree that C++ is orders of magnitude worse but I’ve seen at least a couple counter-examples with code almost that simple. A researcher I used to support compared each release against a set of reference results, and got a surprise when they didn’t match but his program was working. This turned out to be a new compiler release being smart enough to inline and reorder his code to use a fused multiply-add instruction, which had greater internal precision and so the result was very slightly different from his saved referenced set. GCC has -fexcess-precision=standard for this but you have to understand the problem first.
error: could not convert 'true' from 'bool' to 'std::string' {aka 'std::__cxx11::basic_string<char>'}
I don't think anyone's claiming C nor C++'s dumpster fires have signed integer overflow at the top of the pile of problems, but when the optimizer starts deleting security or bounds checks and other fine things - because of signed integer overflow, or one of the million other causes of undefined behavior - I will pray for something as straightforward as a core dump, no matter where EIP has gone.
Signed integer overflow UB is the kind of UB that has a nasty habit of causing subtle heisenbugfuckery when triggered. The kind you might, hopefully, make shallow with ubsan and good test suite coverage. In other words, the kind you won't make shallow.
For context, I did not pick that type signature at random. It was in actual code that was shipping to customers. If I remember correctly there was some sort of bool -> int -> char -> std::string path via `operator()` conversions and constructors that allowed it to compile, though I can't remember what the value was (probably "\x01").
---
My experience with the C/C++ optimizer is that it's fairly timid, and only misbehaves when the input code is really bad. Pretty much all of the (many, many) bugs I've encountered and/or written in C would have also existed if I'd written directly in assembly.
I know there are libraries out there with build instructions like "compile with -O0 or the results will be wrong", but aside from the Linux kernel I've never encountered developers who put the blame on the compiler.
> but aside from the Linux kernel I've never encountered developers who put the blame on the compiler.
I encounter them frequently.
99.99% of the time it's undefined behavior and they're "wrong".
Frequently novices who have been failed by their teachers and documentation (see previous rant using atoi as an example of the poor quality of documentation about UB: https://news.ycombinator.com/item?id=14861917 .)
Less frequently, it's experienced devs half joking out of a need for catharsis.
Rarely, experienced devs finally getting to the end of their rope, and are finally beginning to seriously consider if they've got a codegen bug. They don't, but they're considering it. They know they were wrong the last 10 times they considered it, but they're considering it again damnit!
The linux kernel devs aren't quite unique in "just because you can, doesn't mean you should"ing their way into blaming the compiler for what could be argued to be defects in the standard or fundamental design of the language (the defect being making UB so common), but that's probably among the rarest slice of the pie of people blaming the compiler for UB. Few have the will to tilt at that windmill and voice their opinions when the compiler devs can easily just blame the standard - better to keep such unproductive rants close to heart instead, or switch to another language. Something actually productive.
0.01% of the time, it's a legitimate codegen bug on well-defined-behavior code. The last one I tracked down to a bug tracker was MSVC miscompiling 4x4 matrix multiplications by failing to spill a 17th value to the stack when it only had 16 SSE registers to work with. Caught by unit tests, but not by CI, since people updated compiler versions at their own random pace, and who runs `math_tests` on their personal machines when they're not touching `math`?
I heartily agree that C++ is a lot more annoying here than C, yes.
I'm just saying that C is already plenty annoying enough by itself, thanks eg to undefined behaviour.
> That means if you see `int a = some_fn(); assert(a < 100); a += 1` in the C code, you can expect something like `ADD EAX,1` somewhere in the compiler output for that function. Or going the other direction, when you're in a GDB prompt and you disassemble the current EIP and you see `ADD EAX,1` then you can pretty much just look at the C code and figure out where you are.
No, there's no guarantee of that. C compilers are allowed to do all kinds of interesting things. However you are often right enough in practice, especially if you run with -O0, ie turn off the optimiser.
It means that "a += 1` is easy to understand as incrementing a numeric value by 1" is not true and instead "it can be really difficult to map between the original source and the machine code".
> All of those look pretty straightforward to me -- again, what assembly would you expect to be emitted in those cases?
It is very straightforward indeed, but it is still not mapping primitive operations to direct machine code, but it is forwarding to out-of-line code. Same as operator overloading in other languages.
> It is very straightforward indeed, but it is still not mapping primitive
> operations to direct machine code, but it is forwarding to out-of-line code.
> Same as operator overloading in other languages.
I am not claiming that C is a collection of assembler macros. There is no expectation that a C compiler emit machine code that has exact 1:1 correspondence with the input source code.
> Same as operator overloading in other languages.
The lack of operator overloading, and other hidden complex control flow, is the reason that someone can read C code and have a pretty good idea of what it compiles to.
> That's just a symptom of allowing the compiler to inline the add code,
> otherwise the generated code is as straightforward:
No, that's just moving the instructions around. You've still got dynamic allocation and stack-unwinding being generated for a line that doesn't have any sign of entering a complex control flow graph.
Until someone calls longjmp() or a signal() is triggered. Extra bonus of fun if it happens to be a multithreaded application, or in the middle of a non-reentrant call.
> a+=1 will not produce any surprising results, signed integer overflow is well defined on all platforms that matter.
I'm not sure what you are talking about?
There's a difference between how your processor behaves when given some specific instructions, and what shenanigans your C compiler gets up to.
See eg https://godbolt.org/z/YY69Ezxnv and tell me where the ADD instruction shows up in the compiler output. Feel free to pick a different compiler target than Risc-V.
Take a closer look at 'eru's example and my follow-up.
He wrote an example where the result of `a+1` isn't necessary, so the compiler doesn't emit an ADDI even though the literal text of the C source contains the substring "a += 1".
Your version has the same issue:
unsigned int square2(unsigned int num) {
    unsigned int a = num;
    a += 1;
    if (num < a) return num * num;
    return num;
}
The return value doesn't depend on `a+1`, so the compiler can optimize it to just a comparison.
If you change it to this:
unsigned int square2(unsigned int num) {
    unsigned int a = num;
    a += 1;
    if (num < a) return num * a;
    return num;
}
then the result of `a+1` is required to compute the result in the first branch, and therefore the ADDI instruction is emitted.
The (implied) disagreement is whether a language can be considered to be "portable assembly" if its compiler elides unnecessary operations from the output. I think that sort of optimization is allowed, but 'eru (presumably) thinks that it's diverging too far from the C source code.
`a = num; a += 1; if (num < a)` is the same as `if (num < (num + 1))`, which for unsigned integer addition can be rewritten as `if (num != UINT_MAX)`. So there's no need to actually compute `a+1`, the comparison is against a constant.
If the code returns `num * a` then the value of `a` is now necessary, and must be computed before the function returns.
For signed integer addition the compiler is allowed to assume that `(num < (num + 1))` is true, so the comparison can be removed entirely.
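To make the contrast concrete, here is a signed variant (a hypothetical sketch, not from the thread's godbolt links; exact output depends on the compiler and flags, but gcc and clang at -O2 typically drop the comparison entirely, because signed overflow is undefined and `num < num + 1` may be assumed true):

    int square2_signed(int num) {
        int a = num;
        a += 1;
        /* Signed overflow is UB, so the compiler may assume num < num + 1
           always holds and emit an unconditional num * a. */
        if (num < a) return num * a;
        return num;
    }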
> For signed integer addition the compiler is allowed to assume that `(num < (num + 1))` is true, so the comparison can be removed entirely.
That's not directly what the compiler assumes. The direct problem is in 'a + 1' having undefined behaviour, and that transitively allows the assumption on the comparison that you mentioned.
This was an example where 'a + 1' doesn't compile to an add instruction.
So, the compiler is tinkering with the way the loop is organised so the whole tail-call interpreter thing is not as effective as announced... Not surprised.
1. CPU arch (and arch version) matters a lot. The problem is 95% about laying out the instruction dispatching code for the branch predictor to work optimally. C was never meant to support this.
2. The C abstract machine is also not low-level enough to express the intent properly. Any implementation becomes supersensitivy to a particular compiler's (and compiler version) quirks.
Certain paranoid interpreter implementations go back to writing assembly directly. LuaJIT is famous for implementing a macro system to make its super-efficient assembly loop implementation portable across architectures. This is also why I find it fun to tinker with these!
Anyways, a few years ago I put together an article and a test of popular interpreter loop implementation approaches:
> The problem is 95% about laying out the instruction dispatching code for the branch predictor to work optimally.
A fun fact I learned while writing this post is that that's no longer true! Modern branch predictors can pretty much accurately predict through a single indirect jump, if the run is long enough and the interpreted code itself has stable behavior!
My experiments on this project anecdotally agree; they didn't make it into the post but I also explored a few of the interpreters through hardware CPU counters and `perf stat`, and branch misprediction never showed up as a dominant factor.
How do you reconcile that with the observation that moving to a computed-goto style provides better codegen in Zig [1]? They make the claim that using their “labeled switch” (which is essentially computed goto) allows you to have multiple branches, which improves branch predictor performance. They even get a 13% speedup in their parser from switch to this style. If modern CPUs are good at predicting through a single branch, I wouldn’t expect this feature to make any difference.
While it's unlikely to be as neat as this, the blog post we're all commenting on is an "I thought we had a 10-15% speedup, but it turned out to be an LLVM optimisation misbehaving". And Zig (for now) uses LLVM for optimised builds too.
Yes, this was already becoming true around the time I was writing the linked article. And I also read the paper. :-) I also remember I had access to a pre-Haswell era Intel CPU vs something a bit more recent, and could see that the more complicated dispatcher no longer made as much sense.
Conclusion: the rise of popular interpreter-based languages led to CPUs with smarter branch predictors.
Trying to assess the performance of a Python build is extremely difficult, as there are a lot of build tricks you can do to improve it. Recently the Astral folks ran into this, showing how the conda-forge build is notably faster than most others:
In one of the referenced articles, https://simonwillison.net/2025/Feb/13/python-3140a5/, the author wrote: "So 3.14.0a5 scored 1.12 times faster than 3.13 on the benchmark (on my extremely overloaded M2 MacBook Pro)."
I'm quite confused by this. Did the author run the benchmark while the computer was overloaded with other processes? Wouldn't that make the results completely unreliable? I would have thought these benchmarks are conducted in highly controlled environments to eliminate external variables.
Simon Willison is a great guy, but he's not a Python core developer and his ad hoc benchmark is not what the CPython core team members are using. For the latter, see https://github.com/faster-cpython/benchmarking-public
While some people here are citing 10% as "large" and 1% as "normal", there are optimizations like partial inlining of doubly recursive Fibonacci that can reduce the actual work (and so also time) exponentially (factors >10X for double-digit arguments or 1000s of %, technically exponential in the difference of the recursion depth, not the problem size [1]).
C compilers can also be very finicky about their code inlining metrics. So, whether that enormous speed-up is realized can be very sensitive to the shape of your code.
So, while part of the problem is definitely that CPUs have gotten quite fancy/complex, another aspect is that compilers "beyond -O0 or -O1" have also gotten fancy/complex.
The article here, while good & worth reading, is also but one of many examples of how two complex things interacting can lead to very surprising results (and this is true outside of computing). People have a rather strong tendency to oversimplify -- no matter how many times this lesson shows up.
EDIT: Also, the article at least uses two CPUs, Intel & Apple M1 and two compilers (gcc & clang), but there are realistic deployment scenarios across many more generations/impls of Intel, AMD, and ARM and probably other compilers, too. So, it only even samples a small part of the total complexity. Also, if you want to be more scientific, esp. for differences like "1.01X" then time measurements should really have error bars of some kind (either std.dev of the mean or maybe better for a case like this std.dev of the min[2]) and to minimize those measurement errors you probably want CPU core affinity scheduling in the OS.
I recently did some benchmarking from Python 3.9 to 3.13.
And up to 3.11 it only got better. Python 3.12 and 3.13 were about 10% slower than 3.11.
I thought my homemade benchmark wasn't great enough, so I deployed it to a core service anyway, and I saw the same changes in our collected metrics.
Does anyone else have the same problem?
This is exactly the kind of content I love to see on HN. But I do wonder how this optimization is related to tail-call optimization. How the interpreter jump table is implemented shouldn't affect how stack frames are created, should it?
It's explained under the first link from the article [0]:
"A new type of interpreter has been added to CPython. It uses tail calls between small C functions that implement individual Python opcodes, rather than one large C case statement."
A big chunk of the performance gain comes from the fact that with tail calls the CPU doesn't have to reset the registers (configured through the `preserve_none` calling convention).
Well if you’d bother to read it, you’d discover this is about tail calls in C, not in Python. It has nothing to do with tail recursion in Python. Guido has explicitly said that Python will never have it.
To clarify: The situation is still not completely understood? It's not just only the computed gotos, but there is some other regression in Clang19? Basically, the difference between clang19.nocg and clang19 is not really clear?
Btw, what about some clang18.tc comparison, i.e. Clang18 with the new tail-call interpreter? I wonder how that compares to clang19.tc.
(post author here) Oh, this is something I could have called out explicitly: The tail-calling interpreter relies on a feature (the `preserve_none` calling convention) that only landed in clang-19. That means you can only test it on that version. That coincidence (that 19 added both this feature, and the regression) is part of why this was so easy to miss at first, and why I had to "triangulate" with so many different benchmarks to be confident I understood what was going on.
It's getting harder to characterize the C language as "close to the metal". Obviously optimizing compilers are a good thing, but it's frustrating when it feels like you're fighting with them to get the code emitted just the way you want.
Kudos to the author and the CPython team for reckoning with this. Nevertheless, it's still a very important improvement in the positive direction. That's all that matters. This is a great story about why benchmarking is so hard to get right. I always tell my team to benchmark your real-world use case as closely as possible and not rely on external results.
distributing the dispatch instructions lets the predictor pick up on specific patterns within the bytecode - for example, comparison instructions are probably followed by a conditional branch instruction
I think it's remarkable that nobody thought to benchmark against an independent build, maybe from Debian or Red Hat. I do this even for the local builds I use for development; it's just so absurdly obvious (and not just in hindsight).
No real harm was done beyond some embarrassment, but it's pretty silly.
Why hasn't anyone just carbon copied the Python compiler (coincidentally the same name but unrelated to the Python language) from SBCL? The semantics of CL aren't that different from Python (the language).
Hello. I'm the author of the PR that landed the tail-calling interpreter in CPython.
First, I want to say thank you to Nelson for spending almost a month to get to the root of this issue.
Secondly, I want to say I'm extremely embarrassed and sorry that I made such a huge oversight. I, and probably the rest of the CPython team did not expect the compiler we were using for the baseline to have that bug.
I posted an apology blog post here. https://fidget-spinner.github.io/posts/apology-tail-call.htm...
Reading that you are extremely embarrassed and sorry that you made such a huge oversight, I was imagining you had broken something / worsened CPython's performance.
But it's nothing like this. You announced a 10-15% perf improvement but that improvement is more like 1-5% on a non buggy compiler. It's not even like that 10-15% figure is wrong, it's just that it's correct only under very specific conditions, unknowingly to you.
IIUC, you did your homework: you made an improvement, you measured a 10-15% perf improvement, the PR was reviewed by other people, etc. It just so happens that this 10-15% figure is misleading because of an issue with the version of clang you happened to use to measure. Unless I'm missing something, it looks like a fair mistake anyone could have reasonably made. It even looks like it was hard to not fall into this trap. You could have been more suspicious seeing such a high number, but hindsight is 20/20.
Apparently, you still brought significant performance improvements, your work also helped uncover a compiler regression. The wrong number seems quite minor in comparison. I wonder who was actually hurt by this. I only discover the "case" right now but at a first glance it doesn't feel like you owe an apology to anyone. Kudos for all this!
> IIUC, you did your homework: you made an improvement, you measured a 10-15% perf improvement, the PR was reviewed by other people, etc. It just so happens that this 10-15% figure is misleading because of an issue with the version of clang you happened to use to measure. Unless I'm missing something, it looks like a fair mistake anyone could have reasonably made. It even looks like it was hard to not fall into this trap. You could have been more suspicious seeing such a high number, but hindsight is 20/20.
Hah! Is this a Gettier problem [0]?
1. True: The PR improves Python performance 15-20%. 2. True: Ken believes that the PR improves Python performance 15-20%. 3. True: Ken is justified in believing that the PR improves Python performance 15-20%.
Of course, PR discussions don't generally revolve around whether or not the PR author "knows" that the PR does what they claim it does. Still: these sorts of epistemological brain teasers seem to come up in the performance measurement field distressingly often. I wholeheartedly agree that Ken deserves all the kudos he has received; still, I also wonder if some of the strategies used to resolve the Gettier problem might be useful for code reviewers to center themselves every once in a while. Murphy's Law and all that.
[0]: https://en.wikipedia.org/wiki/Gettier_problem
Could very well be!
Interesting, I didn't know about the Gettier problem, thanks for sharing. You could try submitting that page as a proper HN post.
In some way, by indirectly helping fix this bug, they led to a ~10% performance increase for everyone who was using that faulty compiler! That's even better than an optional flag that many people won't know about or use.
That performance regression only hit code that was using a very large number of paths with the same table of computed gotos at the end. That's likely to only be relatively complex interpreters that were affected. So it's not a broad performance improvement. But it is nice to have an example of the compiler's new heuristic failing to prove evidence it needs to be tunable.
Well, that includes at least everyone using Python built with that compiler.
FWIW - the fix was merged since you wrote that blog post ;)
Beyond that - 3-5% is a lot for something as old as the Python interpreter, if it holds up. I would still be highly proud of that.
After 30 years, I've learned (like I expect you have) to be suspicious of any significant (i.e. >1%) performance improvement in a system that has existed a long time.
They happen for sure, but are less common. Often, people are shifting time around, and so it just isn't part of your benchmark anymore[1]. Secondarily, benchmarking is often done in controlled environments, to try to isolate the effect. Which seems like the right thing to do. But then the software is run in non-isolated environments (i.e. with a million other things on a VM or desktop computer), which isn't what you benchmarked it in. I've watched plenty of verifiably huge isolated gains disappear or go negative when put in production environments.
You have the particularly hard job that you have to target lots of environments - you can't even do something like say "okay look if it doesn't actually speed it up in production, it didn't really speed it up", because you have no single production target. That's a really hard world to live in and try to improve.
In the end, performance tuning and measurement is really hard. You have nothing to be sorry for, except learning that :)
Don't let this cause you to be afraid to get it wrong - you will get it wrong anyway. We all do. Just do what you are doing here - say 'whoops, i think we screwed this up', try to figure out how to deal with it, and try to figure out how to avoid it in the future (if you can).
[1] This is common not just in performance, but in almost anything, including human processes. For example, to make something up, the team working on the code review tool would say "we've reduced code review time by 15% and thus sped up everyone's workflow!". Awesome! But actually, it turns out they made more work for some other part of the system, so the workflow didn't get any faster, they just moved the 15% into a part of the world they weren't measuring :)
> After 30 years, i've learned (like i expect you have) to be suspicious of any significant (IE >1%) performance improvement in a system that has existed a long time.
Laughs in corporate code
Sure, let me amend it to "I'm suspicious of any significant performance improvement in a system where performance actually matters, and has existed in a state where performance matters for a long time".
That's a different case. Corporate code is never optimized for performance. Performance simply isn't a factor.
I'll believe you if you say 0.5% improvement, I'll believe you if you say 10,000% improvement, but 10%? That's fishy.
I think it's important to note that a primary motivation of the tail call interpreter design is to be less vulnerable to the whims of the optimizer. From my original blog article about this technique (https://blog.reverberate.org/2021/04/21/musttail-efficient-i...):
> Theoretically, this control flow graph paired with a profile should give the compiler all of the information it needs to generate the most optimal code [for a traditional switch()-based interpreter]. In practice, when a function is this big and connected, we often find ourselves fighting the compiler. It spills an important variable when we want it to keep it in a register. It hoists stack frame manipulation that we want to shrink wrap around a fallback function invocation. It merges identical code paths that we wanted to keep separate for branch prediction reasons. The experience can end up feeling like trying to play the piano while wearing mittens.
That second-to-last sentence is exactly what has happened here. The "buggy" compiler merged identical code paths, leading to worse performance.
The "fixed" compiler no longer does this, but the fix is basically just tweaking a heuristic inside the compiler. There's no actual guarantee that this compiler (or another compiler) will continue to have the heuristic tweaked in the way that benefits us the most.
The tail call interpreter, on the other hand, lets us express the desired machine code pattern in the interpreter itself. Between "musttail", "noinline", and "preserve_none" attributes, we can basically constrain the problem such that we are much less at the mercy of optimizer heuristics.
For this reason, I think the benefit of the tail call interpreter is more than just a 3-5% performance improvement. It's a reliable performance improvement that may be even greater than 3-5% on some compilers.
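For readers who haven't seen the pattern, here is a heavily simplified sketch of the shape of such an interpreter. This is not CPython's or protobuf's actual code: the opcodes, the `struct vm`, and the handler names are invented, and `musttail` needs a recent clang (or GCC 15+). The real CPython build additionally marks the handlers with the `preserve_none` calling convention on clang 19+, which is omitted here to keep the sketch short:

```c
/* Minimal sketch of a tail-call-dispatched interpreter. */
#include <stdio.h>

enum { OP_INCR, OP_HALT };

struct vm { const unsigned char *ip; long acc; };

typedef long (*handler)(struct vm *vm);
static long op_incr(struct vm *vm);
static long op_halt(struct vm *vm);

static const handler table[] = { op_incr, op_halt };

/* Every handler ends by tail-calling the next handler through the table,
 * so the "loop" is a chain of jumps rather than one giant switch. */
#define DISPATCH(vm) __attribute__((musttail)) return table[*(vm)->ip++](vm)

static long op_incr(struct vm *vm) {
    vm->acc += 1;
    DISPATCH(vm);
}

static long op_halt(struct vm *vm) {
    return vm->acc;
}

int main(void) {
    unsigned char code[] = { OP_INCR, OP_INCR, OP_HALT };
    struct vm vm = { code, 0 };
    printf("%ld\n", table[*vm.ip++](&vm));   /* prints 2 */
    return 0;
}
```

The point is that the jump-to-next-handler structure is written directly into the source, so it survives whatever the optimizer decides to do around it.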
This is a good point. We already observed this in our LTO and PGO builds for the computed goto interpreter. On modern compilers, each LTO+PGO build has huge variance (1-2%) for the CPython interpreter. On macOS, we already saw a huge regression in performance because Xcode just decided to stop making LTO and PGO work properly on the interpreter. Presumably, the tail call interpreter would be immune to this.
In full agreement with this. There is tremendous value in having code whose performance is robust to various compiler configurations.
I just wanted to say: respect for being able to say "sorry, I made a mistake". I hate the fake it till you make it mentality that seems to be the norm now.
I understand the frustration, but I don't think it needed to be said (the part about the mentality; the thanks is of course cool), because that's still not the norm.
Why do I even bring this up? I want to say that we shouldn't let what we see in the news influence our perception of the real people of the world. Just because fraud and crimes get elevated in the news does not mean that the common man is a criminal or a fraud. :)
Divide and conquer; we're supposed to hate each other and trust the state/elite/technology, at least that's the plan.
The real criminals, the people we should keep an eye on, are the plan's designers and implementers.
Why didn't this regression in baseline performance show up (or did it?) on the faster-cpython benchmarks page [0]? Could the benchmarks be improved to prevent similar issues in the future?
[0] https://github.com/faster-cpython/benchmarking-public
That is a better than average benchmark page.
As alluded to in https://news.ycombinator.com/item?id=43319010, I see these tests were collected against just 2 Intel and 2 ARM CPUs. So, if you are looking for feedback to improve, you should probably also include (at least) an AMD Zen4 or Zen5 in there. CPU & compiler people have been both trying to "help perf while not trusting the other camp" for as long as I can remember and I doubt that problem will ever go away.
A couple more CPUs will help but not solve generalizability of results. E.g., if somebody tests against some ancient 2008 Nehalem hardware, they might get very different answers. Similarly, answers today might not reflect 2035 very well.
The reality of our world of complex hardware deployment (getting worse with GPUs) is that "portable performance" is almost a contradiction in terms. We all just do the best we can at some semblance of the combination. The result is some kind of weighted average of "not #ifdef'd too heavily source" and "a bit" faster "informally averaged over our user population & their informally averaged workloads" and this applies at many levels of the whole computing stack.
EDIT: And, of course, a compiled impl like Cython or Nim is another way to go if you care about performance, but I do understand the pull & network effects of the Python ecosystem. So, that may not always be practical.
We don't normally test with bleeding-edge compilers on the faster cpython benchmarks page because that would invalidate historical data. E.g. if 2 years ago we used GCC 11 or something to compile and run a benchmark, we need to run it with GCC 11 again today to get comparable results.
Clang 19 was released last year. We only benched it a few months ago. We did notice there was a significant slowdown on macOS, but that was against Xcode Clang, which is a different compiler. I thought it might've been an Xcode thing, which has bitten CPython in the past (such as Xcode LTO working/not working versus normal Clang), so I didn't investigate deeper (facepalming now in retrospect) and chalked it up to a compiler difference.
TLDR: We didn't run benchmarks of Clang 19 versus 18. We only ran benchmarks of Clang 19 versus GCC, Xcode Clang, and MSVC, none of which are apples-to-apples with Clang 19, so I naively thought it was a compiler difference.
EDIT: As to how we could improve this process, I'm not too sure, but I know I'll now be more discerning when there's a >4% perf change after upgrading compilers.
One question arises: does the added code [1] bring any improvement, or does it merely complicate the source? Should it not be removed?
[1] https://github.com/python/cpython/pull/128718
That's a fair question.
The blog post mentions it brings a 1-5% perf improvement, which is still significant for CPython. It does not complicate the source, because we use a DSL to generate CPython's interpreters. So the only complexity is in autogenerated code, which is usually meant for machine consumption anyway.
The other benefit (for us maintainers I guess), is that it compiles way faster and is more debuggable (perf and other tools work better) when each bytecode is a smaller function. So I'm inclined to keep it for perf and productivity reasons.
Being more robust to fragile compiler optimizations is also a nontrivial benefit. An interpreter loop is an extremely specialized piece of code whose control flow is too important to be left to compiler heuristics.
If the desired call structure can be achieved in a portable way, that's a win IMO.
There was a plan for a 5x speedup overall looking for funding back in 2022. Then a team with Guido and others involved (and MS backing?) got on the same bandwagon and made some announcements for speeding up CPython a lot.
Several releases in, have we seen even a 2x speedup? Or more like 0.2x at best?
Not trying to dismiss the interpreter changes - I mostly want to know whether those speedup plans were even remotely realistic, and whether anything even close to 1/5 of what was promised will really come out of them...
The faster-cpython project was aiming for 5x over the 3.10 baseline. CPython 3.13 is currently running at something like 1.6x the speed of 3.10. With the JIT enabled it goes up by a few more single-digit percentage points. With the changes in 3.14 it'll be something like a 1.8x speed-up.
So it's slowly getting there, I think the faster cpython project was mostly around the idea that the JIT can get a lot faster as it starts to optimise more and more and that only just got shipped in 3.13, so there's a lot of headroom. We know that PyPy (an existing JIT implementation) is close to 5x faster than CPython a lot of the time already.
There's also now the experimental free-threading build which speeds up multithreaded Python applications (Not by a lot right now though unfortunately).
There has been a lot of speed ups coming out of that project. It's not being done in one go, though, but spread out over the last several releases. So while individual speedups may not look that significant, remember that it is compound on the previous release's speed up.
Original details here: https://github.com/markshannon/faster-cpython/blob/master/pl...
About the author: https://us.pycon.org/2023/speaker/profile/81/index.html
This dude looks God-level. Half-joking: Maybe MSFT can also poach Lars Bak of Google V8 fame.
You're doing great work pushing Python forwards. Thanks for all your efforts.
Thank you for your efforts in improving Python and especially thank you for helping to set the record straight in such a clear and visible manner.
Some of the sibling comments are saying there's "no need to apologize". One way this might be read is "no need to hold yourself to a higher standard". If this is the sentiment, then I disagree—we live in an age where accountability and intellectual integrity are undervalued. Being able to say that you and your team should have investigated a too-good-to-be-true result further before reporting it is setting a higher standard for yourselves.
But another way to read this is "no need to feel bad" and on this I agree. We all often fail to sufficiently challenge our own work, especially when it looks like something we worked hard on had a big impact.
In what age were accountability and intellectual integrity "correctly" valued?
The way this was handled is incredibly good form! Thank you for going to such lengths to make sure the mistake was corrected. I respect and thank you (and all the developers working in Python) for all the work you do!
Yea, don't feel too bad, you're just unlucky that your mistake was so public, most of us have these kinds of mistakes weekly in a more private setting. I think everyone who knows anything about how optimized Python is would be impressed that you managed even a few percent improvement in speed! That will also probably save many, many GWh of power, even those few percent!
You have a new account here and your blog is just one article so far so you might be a new-ish developer(?), but you are doing great, keep it up! If you are not a new developer, you are still doing great, keep it up!
Thank you for doing work that makes python better for millions of people.
I don’t think you should be embarrassed or apologize. You still did a thing that improved performance - in the near term it worked around a bug that you weren’t even aware of, and long term there are still gains even with that bug fixed. But even if that weren’t the case, the only way anything gets done in software is to rely on abstractions. You can either get things done or know exactly how every last bit of your stack is implemented, but probably not both. It’s a very reasonable trade off.
Smart people go and build things. Other smart people find problems. Nothing's broken with that.
I feel bad for you, since a change on the order of 5% would still have been seen as a very nice bit of work. I appreciate the class with which you're dealing with an honest mistake, and all of the hard work you've given the Python community.
You don't have to apologize, you did great work either way.
It's still a performance improvement, which at the scale of python deployment will probably mean quite a bit of global power savings :)
Don't feel so bad. 1 to 5 % speedup is still an extremely valuable contribution.
Don't apologise. Programming is about the process not the person.
No one cares about this oversight - more important is that you spend time on development that the rest of us don’t have (or don’t have the skill to do in my case). Thanks for your hard work.
You don't need a long heartfelt apology - a simple "god dammit" is enough for this. You made no mistake - you only got unlucky.
It's possible to improve your luck by applying more care, but it's also possible to apply so much care that you do not try things that would have been useful, so I'd rather you keep erring on the side that you did!
Hello, I was one of the people who contributed to this interpreter change. Thank you Nelson for the excellent write-up and for going to such lengths to get to the bottom of this. That said, I wanted to defend Ken a bit. :-)
Ken Jin is a volunteer who's been tirelessly working to make CPython faster over the past couple of years for not enough credit. IMHO, Ken did nothing wrong here, and there's really nothing to be embarrassed about. The fact that it took a month (more if you consider the folks helping to reproduce the original results) for anyone to notice anything wrong shows how complicated the situation was! Put yourself in Ken's shoes -- having updated to LLVM-19 to enable the preserve_none flag, would you then hypothesize that LLVM-19 may have introduced an unrelated performance regression that no one noticed for five months? Lots of people underestimate how hard benchmarking is IMO.
A 1-5% performance improvement is also pretty valuable, just not quite as spectacular as we thought originally :-)
Benchmarking is just insanely hard to do well. There are so many things which can mislead you.
I recently discovered a way to make an algorithm about 15% faster. At least, that's what all the benchmarks said. At some point I duplicated the faster function in my test harness, but did not call the faster version, just the original slower one... And it was still 15% faster. So code that never executed sped up the original code...!!! Obviously, this was due to code and memory layout issues, moving something so it aligned with some CPU cache better.
It's actually really really hard to know if speedups you get are because your code is actually "better" or just because you lucked out with some better alignment somewhere.
Casey Muratori has a really interesting series about things like this in his substack.
Aleksey Shipilёv, a long-time Java "performance engineer" (my term), has written and spoken extensively about the challenges of benchmarking. I highly recommend reading some of his blog posts or watching one of his talks about it.
I vaguely remember about some benchmarking project that deliberately randomised these compiler decisions, so that they could give you more stable estimates of how well your code actually performed, and not just how well you won or lost the linker lottery.
You're probably thinking of "Performance Matters" by Emery Berger, a Strange Loops talk. https://youtube.com/watch?v=r-TLSBdHe1A
There was Stabilizer [1] which did this, although it is no longer maintained and doesn't work with modern versions of LLVM. I think there is something more current now that automates this, but can't remember what it's called.
[1] https://emeryberger.com/research/stabilizer/
The Coz profiler from Emery Berger.
It can actually go a step further and give you decent estimate of what functions you need to change to have the desired latency/throughput increases.
Thanks, I was trying to remember that one!
LLD has a new option "--randomize-section-padding" for this purpose: https://github.com/llvm/llvm-project/pull/117653
Interesting, thanks!
"Producing wrong data without doing anything obviously wrong!"
https://doi.org/10.1145/1508244.1508275
"Producing wrong data without doing anything obviously wrong!"
[pdf]
https://users.cs.northwestern.edu/~robby/courses/322-2013-sp...
As already mentioned this is likely Emery Berger’s project with the idea of intentionally slowing down different parts of the program, also to find out which parts are most sensitive to slowdowns (aka have the biggest effect on overall performance), with the assumption that these are also the parts that profit the most from optimisations.
That linker lottery led to a 15% improvement? I'm surprised. Do you know in what cases you get such a big improvement from something like that? Is it rare? How did you end up reasoning about it?
Various research has shown that the variation can be much higher than 15% due to things like this. It's not that rare; I keep bumping into it. Compilers and linkers do a reasonable job but fundamentally modern CPUs are extremely complex beasts.
I found Casey Muratori's series the best explanation of what is going on at the CPU level.
Some additional context. I was actually writing a benchmarking tool for certain kinds of search algorithm. I spent a long time reducing and controlling for external sources of noise. CPU pinning, doing hundreds of those runs with different data, and then repeating them several times and taking the best of each score with the same data (to control for transient performance issues due to the machine doing other things).
I got the benchmarking tool itself to give reasonably repeatable measurements.
The tool had high precision, but the accuracy of "which algorithm is better" was not reliable just due to these code layout issues.
I basically gave up and shelved the benchmarking project at that point, because it wasn't actually useful to determine which algorithm was better.
Kudos to the author for diving in and uncovering the real story here. The Python 3.14 tail-call interpreter is still a nice improvement (any few-percent gain in a language runtime is hard-won), just not a magic 15% free lunch. More importantly, this incident gave us valuable lessons about benchmarking rigor and the importance of testing across environments. It even helped surface a compiler bug that can now be fixed for everyone’s benefit. It’s the kind of deep-dive that makes you double-check the next big performance claim. Perhaps the most thought-provoking question is: how many other “X% faster” results out there are actually due to benchmarking artifacts or unknown regressions? And how can we better guard against these pitfalls in the future?
I guess the bigger question for me is, how was a 10% drop in Python performance not detected when that faulty compiler feature was pushed? Do we not benchmark the compilers themselves? Do the existing benchmarks on the compiler or python side not use that specific compiler?
The author makes this point, too, and I agree it’s the most surprising thing about the entire scenario.
LLVM introduced a major CPython performance regression, and nobody noticed for six months?
As far as I am aware, the official CPython binaries on Linux have always been built using GCC, so you will have to build your own CPython using both Clang 18 and 19 to notice the speed difference. I think this is partly why no one has noticed the speed difference yet.
This is a very good example of how C is not "close to the machine" or "portable assembly", modern optimizers will do drastic changes to the logic as long as it has no observable effect.
As stated in the post: "Thus, we end up in this odd world where clang-19 compiles the computed-goto interpreter “correctly” – in the sense that the resulting binary produces all the same value we expect – but at the same time it produces an output completely at odds with the intention of the optimization. Moreover, we also see other versions of the compiler applying optimizations to the “naive” switch()-based interpreter, which implement the exact same optimization we “intended” to perform by rewriting the source code."
The phrase "C is portable assembly" isn't a claim that each statement gets compiled directly to equivalent machine code.
When the code has hit the IR in clang or gcc, there is no 'a' (we know that with certainty, since SSA form doesn't mutate but assigns to fresh variables). We don't know if there will be an increment of 1; the additions could be coalesced (or elided if the result can be inferred another way). The number can even decrease, say if things have been handled in chunks of 16 and it needs to be adjusted down in the last chunk. Or the code may be auto-vectorized and completely rewritten, so that none of the variables at the C level are reflected at the assembler level.
From a high-level academic view, yes, the compiler is allowed to perform any legal transformation. But in practice C compilers are pretty conservative about what they emit, especially when code is compiled without -march= .
You don't have to take my word for it. Go find a moderately complex open-source library written in C, compile it, then open up the result in Hexrays/Ghidra/radare2/whatever. Compare the compiled functions with their original source and you'll see there's not that much magic going on.
-O3 does autovectorization: turning your loops into a bunch of SIMD instructions, sometimes even drastically changing the performance profile.
If autovectorization is "not that much magic" then idk what else it is.
Any optimization you are familiar with is trivial and expected. Everything else is broken compilers optimizing UB to win benchmarks.
Nowadays it's -O2. I was also surprised when I first learned this.
They are as aggressive as they can be.
Here's an example of a C compiler completely eliminating a loop because it has figured out how to transform the loop into a constant calculation.
https://godbolt.org/z/cfndqMj4j
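I don't know exactly what's behind that godbolt link, but the classic shape of such an example is something like the following, which clang and gcc at -O2 will typically reduce to closed-form arithmetic with no loop at all:

```c
/* A loop that modern compilers routinely turn into straight-line math.
 * At -O2, clang and gcc typically emit something equivalent to n*(n-1)/2
 * in unsigned arithmetic, with no loop in the output. */
unsigned int sum_below(unsigned int n) {
    unsigned int total = 0;
    for (unsigned int i = 0; i < n; i++)
        total += i;
    return total;
}
```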
The place where C compilers are conservative is when dealing with arrays and pointers. That's because it's impossible for C to know if a pointer is to an element of an array or something completely different. Pointer math further complicates what a pointer could actually reference.
Saying that something "is like XY" when you really mean "is like XY, at least in comparison to C++" isn't what most people mean by the phrase.
C is not a portable assembler.
In C, "a += 1" could overflow, and signed overflow is undefined behavior--even though every individual ISA has completely defined semantics for overflow, and nearly all of them these days do two's complement wraparound arithmetic. With C's notion of undefined behavior, it doesn't even give you the same wraparound in different places in the same program. In fact, wraparound is so undefined that the program could do absolutely anything, and the compiler is not required to even tell you about it. Even without all the C++ abstraction madness, a C compiler can give you absolutely wild results due to optimizations, e.g. by evaluating "a += 1" at compile time and using a different overflow behavior than the target machine. Compile-time evaluation not matching runtime evaluation is one of a huge number of dumb things that C gives you.
Another is that "a += 1" may not even increment the variable. If this occurs as an expression, and not as a statement, e.g. "f(a += 1, a += 1)", you might only get one increment due to sequence points[1]--not to mention that the order of evaluation might be different depending on the target.
C is not a portable assembler.
C is a low-level language where vague machine-like programs get compiled to machine code that may or may not work, depending on whether it violates UB rules or not, and there are precious few diagnostics to tell if that happened, either statically or dynamically.
[1] https://en.wikipedia.org/wiki/Sequence_point
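To make the overflow point concrete, here is a sketch of the kind of transformation the compiler is allowed to make (typical clang/gcc behaviour at -O2, not a guarantee):

```c
/* For signed int, clang/gcc at -O2 will typically compile this to `return 1;`
 * because signed overflow is UB, so x + 1 > x is treated as always true. */
int always_true(int x) { return x + 1 > x; }

/* The unsigned version must honour wraparound, so it compiles to a real
 * comparison, effectively x != UINT_MAX. */
int sometimes_true(unsigned int x) { return x + 1 > x; }
```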
> The phrase "C is portable assembly" isn't a claim that each statement gets compiled directly to equivalent machine code.
Weasel words. Like a "self driving car" that requires a human driver with constant attention willing to take over within a few hundred milliseconds.
People advocate for C and use it in a way that implies they think it can achieve specific machine outcomes, and it usually does .. except when it doesn't. If people want a portable assembler they should build one.
As a general rule if you're reading a technical discussion and every single participant is using a particular phrase in a way that doesn't make sense to you then you should probably do a quick double-check to make sure you're on the same page.
For example, in this discussion about whether C is "portable assembly", you might be tempted to think back to the days of structured programming in assembly using macros. I no longer remember the exact syntax, but programs could be written to look like structured code, with IF/WHILE-style macros expanding to the underlying jumps and labels.
Assembly? Definitely! Portable? Eh, sort of! If you're willing to restrict yourself to DOS + POSIX and write an I/O abstraction layer then it'll probably run on i386/SPARC/Alpha/PA-RISC. But that's not really what people are discussing, is it?
When someone says "C is portable assembly" they don't mean you can take C code and run it through a platform-specific macro expander. They don't mean it's literally a portable dialect of assembly. They expect the C compiler to perform some transformations -- maybe propagate some constants, maybe inline a small function here and there. Maybe you'd like to have named mutable local variables, which requires a register allocator. Reasonable people can disagree about exactly what transformations are legal, but at that point it's a matter of negotiation.
Anyway, now you've got a language that is more portable than assembler macros but still compiles more-or-less directly to machine code -- not completely divorced from the underlying hardware like Lisp (RIP Symbolics). How would you describe it in a few words? "Like assembly but portable" doesn't seem unreasonable.
> still compiles more-or-less directly to machine code
There's a lot hiding in "more or less". The same kind of example holds for e.g. C# : https://godbolt.org/noscript/csharp ; if you hit "Compile" it'll give you the native binary. If you write "x+1" it'll generate an add .. or be optimized away. Now does that mean it's portable assembler? Absolutely not.
Conversely, there's a bunch of things that people expect to do in C, and do in real code, but that are not in the standard, or are undefined or implementation-defined. There are also things that are present in assemblers for various platforms (things like the overflow flag) which aren't accessible from the C language.
What people actually seem to mean by "portable assembler" is "no guardrails". Memory unsafety as a feature.
> Reasonable people can disagree about exactly what transformations are legal, but at that point it's a matter of negotiation
And a matter of CVEs when you lose your negotiation with the compiler. Or less dramatic things like the performance fluctuations under discussion.
Systems languages that predated C could already do that; that is the typical myth.
> The C expression `a += 1` can be trusted to increment a numeric value, [...]
Have you heard of undefined behaviour?
Show me a C compiler that miscompiles the following code and I'll concede the point:
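(The snippet itself isn't reproduced here; judging by the follow-ups about "those four lines of C code" performing unsigned integer addition, it was presumably something along these lines:)

```c
/* Presumed shape of the example: plain unsigned addition, no UB anywhere. */
unsigned int add_one(unsigned int a) {
    a += 1;
    return a;
}
```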
C might be low level from the perspective of other systems languages, but that is like calling Apollo 11 simple from the perspective of modern spacecraft. C as written is not all that close to what actually gets executed.
For a small example, there are many compilers that would absolutely skip incrementing 'a' in the following code:
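(The code isn't reproduced here either; from the replies it was evidently an increment immediately cancelled by a decrement, roughly:)

```c
/* Roughly the shape the replies describe: the += 1 is "clear as day" in the
 * source, yet no increment survives optimization because it is cancelled out. */
int touch(int a) {
    a += 1;
    a -= 1;
    return a;
}
```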
Even though that code contains `a += 1;` clear as day, the chances of any incrementing being done are quite small IMO. It gets even worse in bigger functions where out-of-order execution starts being a thing.
Why would you want it to increment by 1 if we decrement by 1 from the same variable? That would be a waste of cycles, and a good compiler knows how to optimize it out, or what am I misunderstanding here? What do you expect "it" to do and what does it really do?
See: https://news.ycombinator.com/item?id=43320495
I'm pretty sure that's replying directly to the comment about how C is close to assembly, and that if you add that line of code somewhere, you know there's a variable getting incremented. It doesn't really matter whether or not it's useful; the point is that the behavior isn't exactly what you wrote.
To reiterate, claiming that C can be described as "portable assembly" is not a claim that it is literally a package of assembler macros that emit deterministic machine code for each individual source expression.
I linked these in another comment, but here are some examples of straightforward-looking integer addition emitting more complex compiler output for other languages that compile to native code:
Haskell: https://godbolt.org/z/vdeMKMETT
C++: https://godbolt.org/z/dedcof9x5
The Haskell version deals with both laziness and also detects overflow.
Your C++ example has a lot more code than the C example; I'm not sure why you'd expect it to produce the same output.
https://godbolt.org/z/r39jK1ddv
It increments, then decrements with -O0 though.
I still do not see the issue, as the behavior is expected with -O0: it increments, then decrements.
There's nothing in the C standard that enforces the observed -O0 behaviour. Your compiler might change tomorrow.
How likely is that to happen, though? And in which languages can you choose whether or not to optimize AND be sure the compiler won't change tomorrow?
David Hume said that we cannot know whether the sun is going to rise tomorrow just because it always has before. See "problem of induction", https://philosophynow.org/issues/160/Humes_Problem_of_Induct....
That's a contrived example, but in a serious program there would often be code in between or some level of indirection (e.g. one of those values is a lookup, a macro expansion, or the result of another function).
Nothing about that is cheating; it just says that even C programmers cannot expect to look at the compiled code and see a direct mapping from their source code. Your ability to reason about what's actually executing requires you to internalize how the compiler works, in addition to your understanding of the underlying hardware and your application.
What optimizer would remove the increment/decrement if the value was accessed in between? That seems like something that would be really easy to detect.
I’ve never studied compilers though.
It would be very normal for a compiler to do an increment (or merge it into a later instruction), but never do the decrement, and instead use the old copy of the value.
Then in the next step, it would see that the result of the increment is never used and thus the increment instruction is dead code and can also be removed.
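Spelling that two-step reasoning out in source form (illustrative only; real compilers do this on their IR, not on C):

```c
/* Step 0: what was written. */
int touch_0(int a) { a += 1; a -= 1; return a; }

/* Step 1: the decrement is never performed; the old value of `a` is reused.
 * (The cast to void is only there to silence unused-variable warnings.) */
int touch_1(int a) { int tmp = a + 1; (void)tmp; return a; }

/* Step 2: `tmp` is never used, so the increment is dead code too. */
int touch_2(int a) { return a; }
```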
In what languages other than assembly can you avoid that, though? The higher-level the language is, the "worse" or more difficult it gets; perhaps I am not following the thread right.
Yes, this is normal for languages. The only pushback here is against the term “portable assembler” being applied to C, where it’s incomplete often enough that many people feel it’s no longer a helpful label.
I think it’s also reflecting the maturity and growth of the industry. A turn of the century programmer could relatively easily find areas where dropping down to assembly was useful, but over the subsequent decades that’s become not only uncommon but often actively harmful: your code hand-optimized for a particular processor is likely slower on newer processors than what a modern compiler emits and is definitely a barrier to portability in an era where not only are ARM and potentially RISC-V of interest but also where code is being run on SIMD units or GPUs. This makes the low-level “portable assembler” idea less useful because there’s less code written in that middle ground when you want either a higher-level representation which gives compilers more flexibility or precise control. For example, cryptography implementers want not just high performance but also rigid control of the emitted code to avoid a compiler optimizing their careful constant-time implementation into a vulnerability.
I'm not an embedded expert but a friend of mine has complained about compiler optimizations breaking things in his programs. I could see incrementing by one being used to set some bits in a memory location for a cycle that may mean something to some peripheral and then decrementing by one to set some other bits that may mean something else. In that case, the compiler removing those two lines would cause a very hard to debug issue.
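In practice, that scenario is exactly what `volatile` is for; here's a hedged sketch of the memory-mapped-register case (the register address and the bit meanings are invented):

```c
#include <stdint.h>

/* Hypothetical memory-mapped peripheral control register; address is made up. */
#define PERIPH_CTRL (*(volatile uint32_t *)0x40001000u)

void pulse(void) {
    /* Without volatile, the compiler could legally collapse or delete this
     * pair of writes, since their net effect on the stored value is zero.
     * With volatile, both the +1 and the -1 read-modify-write accesses must
     * actually happen, in order. */
    PERIPH_CTRL += 1;
    PERIPH_CTRL -= 1;
}
```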
I understand compiler optimizations having unintended consequences, e.g. https://godbolt.org/z/r39jK1ddv but there are a lot of options he may use to enable or disable optimizations (assuming GCC here): https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
This is even more difficult to do with higher-level languages.
It is unlikely as is, but it frequently arises from macro expansions and inlining.
If you want the compiler to treat your code as literal portable assembly turn off optimizations.
That's a property of a particular compiler, not of the C language (at least as described in the standard).
Exactly, pretty much what I said, or enable / disable the optimizations you want.
> It gets even worse in bigger functions where out-of-order execution starts being a thing.
In addition, add that your processor isn't actually executing x86 (nor ARM etc) instructions, but interprets/compiles them to something more fundamental.
So there's an additional layer of out-of-order instructions and general shenanigans happening. Especially with branch prediction in the mix.
If by miscompile you mean that it fails the test that a "C expression `a += 1` can be trusted to increment a numeric value", then it is trivial: https://godbolt.org/z/G5dP9dM5q
Here's another one, just for fun https://godbolt.org/z/TM1Ke4d5E
I was being somewhat terse.
The (implied) claim is that the C standard has enough sources of undefined behavior that even a simple integer addition can't be relied upon to actually perform integer addition.
But the sources of undefined behavior for integer addition in C are well-known and very clear, and any instruction set that isn't an insane science project is going to have an instruction to add integers.
Thus my comment. Show me a C compiler that takes that code and miscompiles it. I don't care if it returns a constant, spits out an infinite loop, jumps to 0x0000, calls malloc, whatever. Show me a C compiler that takes those four lines of C code and emits something other than an integer addition instruction.
Why are you talking about miscompilation? While the LLVM regression in the featured article makes the code slower, it is not a miscompilation. It is "correct" according to the contract of the C language.
You show one example where C doesn't have problems, but that's a much weaker claim than it sounds. "Here's one situation where this here gun won't blow your foot off!"
For what it's worth, C++ also passes your test here. You picked an example so simple that it's not very interesting.
'eru implied `a += 1` has undefined behavior; I provided a trivial counter-example. If you'd like longer examples of C code that performs unsigned integer addition then the internet has many on offer.
I'm not claiming that C (or C++) is without problems. I wrote code in them for ~20 years and that was more than enough; there's a reason I use Rust for all my new low-level projects. In this case, writing C without undefined behavior requires lots of third-party static analysis tooling that is unnecessary for Rust (due to being built in to the compiler).
But if you're going to be writing C as "portable assembly", then the competition isn't Rust (or Zig, or Fortran), it's actual assembly. And it's silly to object to C having undefined behavior for signed integer addition, when the alternative is to write your VM loop (or whatever) five or six times in platform-specific assembly.
Yes, 'a += 1' can have undefined behaviour in C when you use signed integers. (And perhaps also with floats? I don't remember.)
Your original comment didn't specify that you want to talk about unsigned integers only.
Forth might be a better competition for 'portable assembly', though.
Actually even here, C has some problems (and C++), too:
I don't think the standard says much about how to handle stack overflows?
I know that for 'int a' the statement 'a += 1' can give rather surprising results.
And you made a universal statement that 'a += 1' can be trusted. Not just that it can sometimes be trusted. In C++ the code you gave above can also be trusted as far as I can tell. At least as much as the C version.
I'll expand my point to be clearer.
In C there is no operator overloading, so an expression like `a += 1` is easy to understand as incrementing a numeric value by 1, where that value's type is one of a small set of built-in types.
You'd need to look further up in the function (and maybe chase down some typedefs) to see what that type is, but the set of possible types generally boils down to "signed int, unsigned int, float, pointer". Each of those types has well-defined rules for what `+= 1` means.
That means if you see `int a = some_fn(); assert(a < 100); a += 1` in the C code, you can expect something like `ADD EAX,1` somewhere in the compiler output for that function. Or going the other direction, when you're in a GDB prompt and you disassemble the current EIP and you see `ADD EAX,1` then you can pretty much just look at the C code and figure out where you are.
---
Neither of those is true in C++. The combination of completely ad-hoc operator overloading, function overloading, and implicit type conversion via constructors means that it can be really difficult to map between the original source and the machine code.
You'll have a core dump where EIP is somewhere in the middle of a function like this:
and the disassembly is just dozens of function calls for no reason you can discern, and you're staring at the return type of `std::string` and the returned value of `true`, and in that moment you'll long for the happy days when undefined behavior on signed integer overflow was the worst you had to worry about.
> That means if you see `int a = some_fn(); assert(a < 100); a += 1` in the C code, you can expect something like `ADD EAX,1` somewhere in the compiler output for that function.
I completely agree that C++ is orders of magnitude worse, but I've seen at least a couple of counter-examples with code almost that simple. A researcher I used to support compared each release against a set of reference results, and got a surprise when they didn't match but his program was working. This turned out to be a new compiler release being smart enough to inline and reorder his code to use a fused multiply-add instruction, which had greater internal precision, so the result was very slightly different from his saved reference set. GCC has -fexcess-precision=standard for this, but you have to understand the problem first.
Signed integer overflow UB is the kind of UB that has a nasty habit of causing subtle heisenbugfuckery when triggered. The kind you might, hopefully, make shallow with ubsan and good test suite coverage. In other words, the kind you won't make shallow.
For context, I did not pick that type signature at random. It was in actual code that was shipping to customers. If I remember correctly there was some sort of bool -> int -> char -> std::string path via `operator()` conversions and constructors that allowed it to compile, though I can't remember what the value was (probably "\x01").
---
My experience with the C/C++ optimizer is that it's fairly timid, and only misbehaves when the input code is really bad. Pretty much all of the (many, many) bugs I've encountered and/or written in C would have also existed if I'd written directly in assembly.
I know there are libraries out there with build instructions like "compile with -O0 or the results will be wrong", but aside from the Linux kernel I've never encountered developers who put the blame on the compiler.
> but aside from the Linux kernel I've never encountered developers who put the blame on the compiler.
I encounter them frequently.
99.99% of the time it's undefined behavior and they're "wrong".
Frequently novices who have been failed by their teachers and documentation (see previous rant using atoi as an example of the poor quality of documentation about UB: https://news.ycombinator.com/item?id=14861917 .)
Less frequently, it's experienced devs half joking out of a need for catharsis.
Rarely, experienced devs finally getting to the end of their rope, and are finally beginning to seriously consider if they've got a codegen bug. They don't, but they're considering it. They know they were wrong the last 10 times they considered it, but they're considering it again damnit!
The linux kernel devs aren't quite unique in "just because you can, doesn't mean you should"ing their way into blaming the compiler for what could be argued to be defects in the standard or fundamental design of the language (the defect being making UB so common), but that's probably among the rarest slice of the pie of people blaming the compiler for UB. Few have the will to tilt at that windmill and voice their opinions when the compiler devs can easily just blame the standard - better to keep such unproductive rants close to heart instead, or switch to another language. Something actually productive.
0.01% of the time, it's a legitimate codegen bug on well-defined-behavior code. The last one I tracked down to a bug tracker was MSVC miscompiling 4x4 matrix multiplications by failing to spill a 17th value to the stack when it only had 16 SSE registers to work with. Caught by unit tests, but not by CI, since people updated compiler versions at their own random pace, and who runs `math_tests` on their personal machines when they're not touching `math`?
I heartily agree that C++ is a lot more annoying here than C, yes.
I'm just saying that C is already plenty annoying enough by itself, thanks eg to undefined behaviour.
> That means if you see `int a = some_fn(); assert(a < 100); a += 1` in the C code, you can expect something like `ADD EAX,1` somewhere in the compiler output for that function. Or going the other direction, when you're in a GDB prompt and you disassemble the current EIP and you see `ADD EAX,1` then you can pretty much just look at the C code and figure out where you are.
No, there's no guarantee of that. C compilers are allowed to do all kinds of interesting things. However, you are often right enough in practice, especially if you run with -O0, i.e. turn off the optimiser.
See eg https://godbolt.org/z/YY69Ezxnv and tell me where the ADD instruction shows up in the compiler output.
Counterexample: https://godbolt.org/z/z3jqjPT6o
I'm not sure that's a counter-example -- what assembly do you think should be emitted for floating-point math on an AVR microcontroller?
It means that "`a += 1` is easy to understand as incrementing a numeric value by 1" is not true, and instead "it can be really difficult to map between the original source and the machine code".
More examples of non-trivial mapping from C code to generated code: https://godbolt.org/z/jab6vh6dM
All of those look pretty straightforward to me -- again, what assembly would you expect to be emitted in those cases?
For contrast, here's the assembly generated for Haskell for integer addition: https://godbolt.org/z/vdeMKMETT
And here's assembly for C++: https://godbolt.org/z/dedcof9x5
> All of those look pretty straightforward to me -- again, what assembly would you expect to be emitted in those cases?
It is very straightforward indeed, but it is still not mapping primitive operations to direct machine code; it is forwarding to out-of-line code. Same as operator overloading in other languages.
> And here's assembly for C++: https://godbolt.org/z/dedcof9x5
That's just a symptom of allowing the compiler to inline the add code, otherwise the generated code is as straightforward:
Ref: https://godbolt.org/z/xo1es9TcW
> ... and other hidden complex control flow, ...
Until someone calls longjmp() or a signal() is triggered. Extra bonus fun if it happens to be a multithreaded application, or in the middle of a non-reentrant call.
Or your emulated float or atomic relies on non-local state, like a control word or a spin-lock pool.
a += 1 will not produce any surprising results; signed integer overflow is well defined on all platforms that matter.
And we all know about the looping behavior, it isn't surprising.
The only surprising part would be if the compiler decides to use inc vs add, not that it really matters to the result.
> a+=1 will not produce any surprising results, signed integer overflow is well defined on all platforms that matter.
I'm not sure what you are talking about?
There's a difference between how your processor behaves when given some specific instructions, and what shenanigans your C compiler gets up to.
See eg https://godbolt.org/z/YY69Ezxnv and tell me where the ADD instruction shows up in the compiler output. Feel free to pick a different compiler target than Risc-V.
I don't think "dead-code elimination removes dead code" adds much to the discussion.
If you change the code so that the value of `a` is used, then the output is as expected: https://godbolt.org/z/78eYx37WG
The parent example can be made clearer like this: https://godbolt.org/z/MKWbz9W16
Dead code elimination only works here because integer overflow is UB.
Take a closer look at 'eru's example and my follow-up.
He wrote an example where the result of `a+1` isn't necessary, so the compiler doesn't emit an ADDI even though the literal text of the C source contains the substring "a += 1".
Your version has the same issue:
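(The exact snippet isn't preserved here; from the rest of the thread it was presumably shaped like this, with the returned value not depending on the incremented `a`:)

```c
/* Presumed shape: the branch outcome depends on a, but the returned value
 * does not, so the comparison can be folded to "num != UINT_MAX" and no
 * add needs to appear in the output. */
unsigned int f(unsigned int num) {
    unsigned int a = num;
    a += 1;
    if (num < a)
        return 1;
    return 0;
}
```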
The return value doesn't depend on `a+1`, so the compiler can optimize it to just a comparison. If you change it to this:
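(Presumably something like this, where the taken branch actually consumes `a`:)

```c
/* Now the taken branch returns num * a, so the value of a (= num + 1) must
 * actually be computed, and an add instruction shows up in the output. */
unsigned int f(unsigned int num) {
    unsigned int a = num;
    a += 1;
    if (num < a)
        return num * a;
    return 0;
}
```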
then the result of `a+1` is required to compute the result in the first branch, and therefore the ADDI instruction is emitted. The (implied) disagreement is whether a language can be considered to be "portable assembly" if its compiler elides unnecessary operations from the output. I think that sort of optimization is allowed, but 'eru (presumably) thinks that it's diverging too far from the C source code.
C compilers are just too smart IMO to make C portable assembly. Your example doesn’t always ADDI either, for example if it is inlined
https://godbolt.org/z/vv9rvKsxn
This isn’t qualitatively different from what the JVM JIT would do, but Java isn’t considered portable assembly.
I guess if you compile with optimizations completely off, you get something that is assembly-like, but I’ve never seen that in prod code.
In what world does the return value not depend on 'a' in this code?
A control dependency is still a dependency.
`a = num; a += 1; if (num < a)` is the same as `if (num < (num + 1))`, which for unsigned integer addition can be rewritten as `if (num != UINT_MAX)`. So there's no need to actually compute `a+1`; the comparison is against a constant.
If the code returns `num * a` then the value of `a` is now necessary, and must be computed before the function returns.
For signed integer addition the compiler is allowed to assume that `(num < (num + 1))` is true, so the comparison can be removed entirely.
> For signed integer addition the compiler is allowed to assume that `(num < (num + 1))` is true, so the comparison can be removed entirely.
That's not directly what the compiler assumes. The direct problem is in 'a + 1' having undefined behaviour, and that transitively allows the assumption on the comparison that you mentioned.
This was an example where 'a + 1' doesn't compile to an add instruction.
That will be inlined by any C compiler, and then pretty much anything can happen to the 'a += 1'.
Stretching "no observable effect" all the way to a 10000 word blog post.
So, the compiler is tinkering with the way the loop is organised so the whole tail-call interpreter thing is not as effective as announced... Not surprised.
1. CPU arch (and arch version) matters a lot. The problem is 95% about laying out the instruction dispatching code for the branch predictor to work optimally. C was never meant to support this.
2. The C abstract machine is also not low-level enough to express the intent properly. Any implementation becomes super-sensitive to a particular compiler's (and compiler version's) quirks.
Certain paranoid interpreter implementations go back to writing assembly directly. LuaJIT is famous for implementing a macro system to make its super-efficient assembly loop implementation portable across architectures. This is also why I find it fun to tinker with these!
Anyways, a few years ago I put together an article and a test of popular interpreter loop implementation approaches:
https://github.com/vkazanov/bytecode-interpreters-post
(author here)
> The problem is 95% about laying out the instruction dispatching code for the branch predictor to work optimally.
A fun fact I learned while writing this post is that that's no longer true! Modern branch predictors can pretty much accurately predict through a single indirect jump, if the run is long enough and the interpreted code itself has stable behavior!
Here's a paper that studied this (for both real hardware and a certain simulated branch predictor): https://inria.hal.science/hal-01100647/document
My experiments on this project anecdotally agree; they didn't make it into the post but I also explored a few of the interpreters through hardware CPU counters and `perf stat`, and branch misprediction never showed up as a dominant factor.
How do you reconcile that with the observation that moving to a computed-goto style provides better codegen in Zig[1]? They make the claim that using their "labeled switch" (which is essentially computed goto) allows you to have multiple dispatch branches, which improves branch predictor performance. They even get a 13% speedup in their parser from going from a plain switch to this style. If modern CPUs are good at predicting through a single branch, I wouldn't expect this feature to make any difference.
[1] https://ziglang.org/download/0.14.0/release-notes.html#Code-...
While it's unlikely to be as neat as this, the blog post we're all commenting on is an "I thought we had a 10-15% speedup, but it turned out to be an LLVM optimisation misbehaving". And Zig (for now) uses LLVM for optimised builds too.
Yes, this was already becoming true around the time I was writing the linked article. And I also read the paper. :-) I also remember I had access to pre-Haswell era Intel CPUs vs something a bit more recent, and could see that the more complicated dispatcher no longer made as much sense.
Conclusion: the rise of popular interpreter-based languages led to CPUs with smarter branch predictors.
What's interesting is that a token-threaded interpreter dominated my benchmark (https://github.com/vkazanov/bytecode-interpreters-post/blob/...).
This trick is meant to simplify dispatching logic and also spread branches in the code a bit.
Trying to assess the performance of a Python build is extremely difficult, as there are a lot of build tricks you can do to improve it. Recently the Astral folks ran into this, showing how the conda-forge build is notably faster than most others:
https://github.com/astral-sh/python-build-standalone/pull/54...
I'd be interested to know how the tail-call interpreter performs with other build optimisations that exist.
Compare https://donsbot.com/2009/03/09/evolving-faster-haskell-progr...
The author uses a genetic algorithm to try out lots of different compiler and optimisation flag combinations.
Related discussions:
https://docs.python.org/3.14/whatsnew/3.14.html#whatsnew314-... --> https://news.ycombinator.com/item?id=42999672 (66 points | 25 days ago | 22 comments)
https://blog.reverberate.org/2025/02/10/tail-call-updates.ht... --> https://news.ycombinator.com/item?id=43076088 (124 points | 18 days ago | 92 comments)
Great article! One detail caught my attention.
In one of the referenced articles, https://simonwillison.net/2025/Feb/13/python-3140a5/, the author wrote: "So 3.14.0a5 scored 1.12 times faster than 3.13 on the benchmark (on my extremely overloaded M2 MacBook Pro)."
I'm quite confused by this. Did the author run the benchmark while the computer was overloaded with other processes? Wouldn't that make the results completely unreliable? I would have thought these benchmarks are conducted in highly controlled environments to eliminate external variables.
Simon Willison is a great guy, but he's not a Python core developer and his ad hoc benchmark is not what the CPython core team members are using. For the latter, see https://github.com/faster-cpython/benchmarking-public
While some people here are citing 10% as "large" and 1% as "normal", there are optimizations, like partial inlining of doubly recursive Fibonacci, that can reduce the actual work (and so also the time) exponentially: factors >10X for double-digit arguments, i.e. 1000s of % (technically exponential in the difference of recursion depths, not the problem size [1]).
C compilers can also be very finicky about their code inlining metrics. So, whether that enormous speed-up is realized can be very sensitive to the shape of your code.
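To make the Fibonacci example concrete, here is a hand-done version of the kind of partial inlining meant above (my own illustration, not taken from the StackOverflow link):

```c
#include <stdio.h>

/* Naive doubly recursive Fibonacci: the call tree grows like ~1.618^n. */
static long fib_naive(long n) {
    return n < 2 ? n : fib_naive(n - 1) + fib_naive(n - 2);
}

/* One level of the recursion inlined by hand:
 *   fib(n) = fib(n-1) + fib(n-2) = 2*fib(n-2) + fib(n-3)
 * The call count now satisfies T(n) = T(n-2) + T(n-3), which grows like
 * ~1.32^n, so the speed-up factor itself grows exponentially with n. */
static long fib_inlined(long n) {
    if (n < 2) return n;
    if (n == 2) return 1;
    if (n == 3) return 2;
    return 2 * fib_inlined(n - 2) + fib_inlined(n - 3);
}

int main(void) {
    printf("%ld %ld\n", fib_naive(30), fib_inlined(30));  /* both print 832040 */
    return 0;
}
```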
So, while part of the problem is definitely that CPUs have gotten quite fancy/complex, another aspect is that compilers "beyond -O0 or -O1" have also gotten fancy/complex.
The article here, while good & worth reading, is also but one of many examples of how two complex things interacting can lead to very surprising results (and this is true outside of computing). People have a rather strong tendency to oversimplify -- no matter how many times this lesson shows up.
EDIT: Also, the article at least uses two CPUs (Intel & Apple M1) and two compilers (gcc & clang), but there are realistic deployment scenarios across many more generations/implementations of Intel, AMD, and ARM, and probably other compilers too. So it only samples a small part of the total complexity. Also, if you want to be more scientific, especially for differences like "1.01X", the time measurements should really have error bars of some kind (either the std.dev of the mean or, maybe better for a case like this, the std.dev of the min [2]), and to minimize those measurement errors you probably want CPU core affinity scheduling in the OS.
[1] https://stackoverflow.com/questions/360748/computational-com...
[2] https://github.com/c-blake/bu/blob/main/doc/tim.md
I recently did some benchmarking from Python 3.9 to 3.13, and up to 3.11 it only got better. Python 3.12 and 3.13 were about 10% slower than 3.11.
I thought my homemade benchmark wasn't good enough, so I deployed it to a core service anyway, and I saw the same changes in our collected metrics. Does anyone else have the same problem?
Yes, I found a loop performance regression [0] in 3.12 and 3.13.
[0]: https://github.com/python/cpython/issues/123540
This is exactly the kind of content I love to see on HN. But I do wonder how this optimization is related to tail-call optimization. How the interpreter jump table is implemented shouldn't affect how stack frames are created, should it?
It's explained under the first link from the article [0]:
"A new type of interpreter has been added to CPython. It uses tail calls between small C functions that implement individual Python opcodes, rather than one large C case statement."
[0] https://docs.python.org/3.14/whatsnew/3.14.html#whatsnew314-...
A big chunk of the performance gain comes from the fact that with tail calls the compiler doesn't have to save and restore registers around each call (enabled through the `preserve_none` calling convention).
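For anyone curious what that looks like in practice, here is a heavily simplified sketch of the technique (not CPython's actual code; `musttail` and `preserve_none` are Clang extensions, the latter only available since clang 19):

```c
typedef struct VM VM;
typedef long (__attribute__((preserve_none)) *Handler)(VM *vm, const unsigned char *ip);

struct VM {
    long acc;
    Handler table[3];
};

enum { OP_INC, OP_DEC, OP_HALT };

/* Each opcode is a tiny function that ends by tail-calling the next
 * handler. `musttail` guarantees the call is compiled as a jump, and
 * `preserve_none` lets the interpreter state stay in registers across
 * the whole chain instead of being saved/restored at every call. */
#define DISPATCH(vm, ip) \
    __attribute__((musttail)) return (vm)->table[*(ip)]((vm), (ip) + 1)

__attribute__((preserve_none)) static long op_inc(VM *vm, const unsigned char *ip) {
    vm->acc++;
    DISPATCH(vm, ip);
}

__attribute__((preserve_none)) static long op_dec(VM *vm, const unsigned char *ip) {
    vm->acc--;
    DISPATCH(vm, ip);
}

__attribute__((preserve_none)) static long op_halt(VM *vm, const unsigned char *ip) {
    (void)ip;
    return vm->acc;
}

long run(VM *vm, const unsigned char *code) {
    vm->table[OP_INC]  = op_inc;
    vm->table[OP_DEC]  = op_dec;
    vm->table[OP_HALT] = op_halt;
    return vm->table[code[0]](vm, code + 1);
}
```

Usage would be something like `VM vm = {0}; const unsigned char prog[] = {OP_INC, OP_INC, OP_DEC, OP_HALT}; run(&vm, prog);`, which returns 1.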
Well if you’d bother to read it, you’d discover this is about tail calls in C, not in Python. It has nothing to do with tail recursion in Python. Guido has explicitly said that Python will never have it.
To clarify: the situation is still not completely understood? It's not only the computed gotos, but there is some other regression in Clang 19? Basically, the difference between clang19.nocg and clang19 is not really clear?
Btw, what about some clang18.tc comparison, i.e. Clang18 with the new tail-call interpreter? I wonder how that compares to clang19.tc.
> Btw, what about some clang18.tc comparison
(post author here) Oh, this is something I could have called out explicitly: The tail-calling interpreter relies on a feature (the `preserve_none` calling convention) that only landed in clang-19. That means you can only test it on that version. That coincidence (that 19 added both this feature, and the regression) is part of why this was so easy to miss at first, and why I had to "triangulate" with so many different benchmarks to be confident I understood what was going on.
The article includes a link to the pull request which fixes the regression in LLVM: https://github.com/llvm/llvm-project/pull/114990
It's getting harder to characterize the C language as "close to the metal". Obviously optimizing compilers are a good thing, but it's frustrating when it feels like you're fighting with them to get the code emitted just the way you want.
Kudos to the author and the CPython team for acknowledging this. Nevertheless, it's still a very important improvement in the positive direction, and that's all that matters. It's also a great story about why benchmarking is so hard to get right. I always tell my team to benchmark their real-world use case as closely as possible and not rely on external results.
I'm struggling to confirm my intuition about why the compiler optimisation is faster than the naive switch expansion.
Is it about the layout of where the assembly instructions end up, and spacing around them? Or the CPU pipelining working better? Or...?
Distributing the dispatch instructions lets the predictor pick up on specific patterns within the bytecode - for example, a comparison instruction is probably followed by a conditional branch instruction.
Ah of course! It's improving the branch target predictor. Makes sense, thanks!
Thanks for the post! Would love to read similarly styled dives into GIL performance and experimental multi-threaded Python builds!
Hah, I found something similar in gcc a few years back. Affecting postgres' expression evaluation interpreter:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71785
Very interesting! Clearly something else is going on, though, if the 2% vs 9% thing is true.
Congratulations! You have nothing to apologize for.
This is some solid work and a great blog post to go hand-in-hand with. Wow.
I think it's remarkable that nobody thought to benchmark against an independent build, maybe from Debian or Red Hat. I do this even for the local builds I use for development; it's just so absurdly obvious (and not just in hindsight).
No real harm was done beyond some embarrassment, but it's pretty silly.
Why hasn't anyone just carbon copied the Python compiler (coincidentally the same name but unrelated to the Python language) from SBCL? The semantics of CL aren't that different from Python (the language).
The compiler is named Python, but has nothing to do with Python the language. I mean __really__. Way to keep your invention buried.
Because they started work on Python (the compiler) in 1980!
That is what happens in a project when you purge all opposition and a few people work on corporate projects that are supposed to succeed.
Free-threading had massive regressions (50% slower); this one has almost no speedup.
People are scared of corporate repression or repression from the Steering Council, so they shut up, like in the Soviet Union.