The hunting of the Snark
In the latter stages of testing the forthcoming release of Esparto v2, I was "stress" testing it by throwing everything including the kitchen sink at it to see how it coped.Short story is : it didn't. Low workloads were absolutely fine, but once the rate got above a certain figure - and I couldn't work out exactly how big - it fell in a screaming heap, which is the one thing it is designed specifically NOT to do.
Very frequently it would crash with a stack dump and it was clear that heap exhaustion was the culprit. What slowed me down was that I had spent a lot of time building in heap protection. But no matter how careful and extensive my protection was, it was failing. I searched - literally - for days, stitching in almost as much diagnostic code as there was "real" code. On a couple of occasions, I created a "heisenbug" with my diagnostics: after putting them in, the bug went away. Take them back out, it reappeared. "Aha!" I thought that's a sure-fire sign of a timing problem.
Given that the whole thing is wrapped very tightly round a microsecond-indexed custom priority_queue on the main loop which has to co-operate with asynchronous events via a mutex, it was no surprise. And I had had problems with that core module in the past. And I had made a lot of changes to it -so it was no surprise to find that that's where a lot of my "wasted" days were spent...
Ironically, it's not actually a lot of code - about 500 lines of which about 30-odd percent or more is comment and/or vertical spacing for clarity so maybe only 3-400 lines of code, but its hellishly complex - for me at least.
For a start although I have been Cing and C++ing for many a year, I only looked st the standard library about 9 months ago, as It had rapidly become clear that "functional programming" was the only way I could get a timer to call back to a class method which was a fundamental pre-requisite for an Arduino library. Once I had got my head around lambdas and function objects and the how-did-I-ever-do-without-it bind mechanism with placeholder parameters, I was cooking! I then delved into containers and I'm now sold sold sold and wouldn't ever dream of doing things any other way. I did have to spend a lot of time with my head in Stroustrup and whining on Stack Overflow, but I got there. Mostly.
Secondly, there is the very nature of the beast: that rare area that mostly only OS designers play with more than once: task synchronisation, critical sections, interlocked atomic access, volatiles, mutexes and a slew of buzz phrases most programmers never meet in a full career...luckily I'd had some previous experience - on mainframes in the mid 1970s...Oddly enough, it was a huge help when it all came back to me. The words may be different in ESP8266land, but the tune is the same.
Thirdly I was new-ish to the ESP8266 having got into the embedded scene (as many others) with AVR-based Arduinos.
All in all, a "random" timing bug under heavy loads was all I didn't need. After days of searching and not finding, I came reluctantly to the Sherlock Holmes conclusion:
"When you have eliminated the impossible, whatever is left - no matter how improbable - must be the truth"
The truth in this case was that the reason I couldn't find the bug in my code is because - it wasn't in my code! Therefore it must be in someone else's! The prospect of digging through profoundly complex library code written by geniuses - obviously for what would be days and days - did not fill me with deep joy. Once I had finally tracked down and nailed the problem it turned out, it wasn't actually a bug but a combination of two things, both of them on the very limits of my knowledge comfort zone. But I'm jumping ahead.
The first big clue was my voluminous diagnostics revealing a sudden and huge drop in the available heap shortly before lights-out. Every time. At least that part had a pattern. I knew though bitter sweat and very very late nights to the sound of much swearing, I could account for every byte of memory I used, every pointer RAII'd until it bled, every heap fluctuation possible - I had the spreadsheets and graphs to prove it. Yet here was a sudden massive bite out of a very small cherry while I sat with my jaw clamped shut.
We all know the ESP8266 is heap-limited. There is a ton of other code on top of my core scheduler: a web server doing dynamic updates in real-time via web sockets, a real-time pin management library, auto WiFi / MQTT connection management with AP fallback , yada yada yada all the way up to 410k plus. The app gets out of bed with less than 30k heap to play with. By the time it tells you MQTT is waiting to hear from you, there is only 27k left in the bank.
So who or what was stealing - out of nowhere - 16k or 60% of my available space? It would be OK when I was at 27, but if the heap-thief helped itself when I had 11k spare, guess what? Oblivion. Sadly, it did exactly that quite often.
Now the implementers of the std library on this platform are obviously well aware of the ESP8266's pint pot into which they had to shoehorn a barell-full of stdlib. Something had to give. Gone is any type info, gone is the ability to get the target of a function object...I could go on, suffice to say that while it is mostly functional for my humble needs it is a (necessarily) heavily-hacked offering.
The extensive stack back trace looked like an explosion in an alphabet soup factory in Moscow. I couldn't make head nor tail once it had got below the precious few of my own routines but I persevered and finally hit pay dirt as the second of the issues became apparent to me: my lack of experience in looking "under the hood" of stl.
It's a memory problem, I know that much. I also know I can't see the problem because its not in my code, so I'm going to have put diagnostics in - gasp - the stl code. After recovering from a near-faint at the thought, I vaguely recalled a little-if-ever used feature of the custom allocator. I had glossed over it as I knew I would never be needing it. Until now, of course! If I could write a custom allocator - yes, this is how desperate I was getting - I could sprinkle it with printfs and finally nail the problem.
By jingo, that's what I started to do. I found a skeleton malloc-based proof-of-concept template on't'web and blindly cut and pasted it it, as I sure as hell couldn't understand a word of it! So it was back to Bjarne and StackOverflow and I concentrated on how this foreign beast was meant to do it magic, I never even considered the when and why...
Blinkering my vision was the fact that because I "knew" I only had a few tasks in my queue, I was concentrating on numerous other potential culprits. What I hadn't seen was that a second (yes! how odd!) bug was causing me to double- and even triple-dip at stuffing tasks into the queue. The logic simply ignored the excess and/or duplicates, so it just wasn't apparent anywhere.
I think the exact Eureka! moment came as I delved deeper into the allocator's black box that I heard myself asking "Why is the last thing just after I have added one of my meagre handful of taks to the queue?" At this point, the experienced stl'ers amongst you will have probably already seen the problem, which could even be - by some standards - classed as a "schoolboy error". Those of you who use it on big machines with unlimited virtual memory may not yet have spotted it though, because what happens on an ESP8266 never happens in your big world.
Hiding complexity is one of stl's great strengths, but it has it's dark side too, as was the case here. I knew that these magic containers would stretch to fit. What I didn't know was a) when b) by how much. What I also couldn't see was that - albeit in error - I had filled up my small queue. How small? Who knew? Only the folk who implemented the stl on the '8266. Some arbitrary default initial size that would automatically stretch to fit as and when required.Without asking. Or telling. All that nasty implementation stuff is hidden from view, which - after all - is why we use the thing in the first place isn't it?
Drifting off into the new territory of dynamic resizing, I was trying to find the place in the allocator to put the printf with current heap size at the point where my queue was allocated, so I could see how much free heap there was and what else might be stealing it But I couldn't spot where an initial allocation was made and/or a resizing allocation was made. I didn't want to waste my efforts and put it anywhere near resizing code, because my queue was so small it would never need resizing! In an idle moment I mused on what might happen in that scenario: how much bigger would it grow and where in this mess would I even start to look for that?
"Because.." I thought, musing to myself to avoid having to solve this impossible bug, "...if it did for some bizarre reason allocate a huge chunk...".
Then it hit me like a brick: I was looking for something under the hood that not only had the potential to, but was actually grabbing a huge chunk of memory without being asked, wasn't I? I'd got so many levels nested in the complexity that I'd forgotten my call return address. Once I unwound my own mental stack, I realised I was staring the problem right in the eyes: It wasn't something else stealing the memory, it was the queue itself!
"But why? It's tiny, look, this extra diag will show the size....!. Oh. Oh, that's a shock - where did all those tasks come from? Is there a bug in clearing the queue? Nope, checked that, works fine. is it a timing thing? Queue so busy that some underlying memory-freeing thingamajig an't get a look in?"
No, it's me accidentally busting the queue and the queue very obligingly and silently trying to expand to nicely accommodate me ...by not-so-nicely grabbing 60% of the little I have left. Bingo!
I rapidly found out how to set the capacity of the underlying vector container, hunted down the duh! bug that was tripling my queue, daringly removed (well commented out, I'm not that stupid) almost all the diagnostics, crossed my fingers, legs, arms and eyes and hit "run".
You know the rest: bug splatted. All the other code worked fine too and suddenly I'm almost ready to release. See the more recent posts here for a quick look-in - ignore the earlier posts from a few months back, it's changed a lot (because of this) since then.
On a big machine in the real world, I'd have never even noticed this. I'd have gone for release with an inefficient, heap-hungry monster. Such beasts don't live long in the mbeeded wrold, This particular one lived a lot longer than it should have, but - hey, we learn by our mistakes. Or by reading about others'!
The morals of this story are manifold:
- Get to know a new technology in detail before playing with it in anger
- Preallocate stl containers wherever possible
- Where you can't preallocate, guard all accesses with heap checks
- When you have eliminated the impossible...
- Find the person who thinks 16k is a good size for reallocation on an '8266 and "have a word"
Anyone want to buy a brand-new completely unused custom stl memory allocator? It's an ideal self-teaching tool...
No comments:
Post a Comment