Beware of the (watch)dog!

Your house is full of smoke. The smoke alarm is beeping. Do you a) turn off the smoke alarm and burn alive or b) find the source of the smoke and put out the fire?

I'm sure most folk would opt for b), but the most common problem I see when helping newbies in various ESP8266 forums is the programming equivalent of a). For reasons that are explained elsewhere on this blog LINK, programming the ESP8266 isn't the same as programming a "simple" AVR / Arduino etc., and part of that difference frequently causes a "watchdog" timer reset - essentially a "crash" followed by a reboot.

These things generally only happen when the programmer doesn't fully grasp all the issues mentioned in the above LINK, but their first attempts to "fix the problem" usually involve "shooting the messenger" and turning off the smoke alarm...

If you already know what a WDT is, how it works and why, then you will probably disagree with some aspects of my next statement... in which case, pop off somewhere else and allow this to sink in for those who don't yet know those answers:

DO NOT TOUCH THE WATCHDOG TIMER. YOU DON'T NEED IT. FORGET IT EVEN EXISTS! WHATEVER YOU THINK THE PROBLEM IS, IT IS ABSOLUTELY NOT THE WATCHDOG TIMER! DON'T FEED IT. DON'T DISABLE IT.

D O N ' T   T O U C H  I T!!!

The “watchdog” timer (WDT) is the ESP8266’s smoke alarm. It goes off when there is a fundamental problem with your code. You need to find and fix that problem, not mess around with the WDT.

Embedded systems often don’t have the luxury of a screen and/or keyboard and are frequently fitted in difficult-to-access places where they are never seen by the human eye, such as behind your living room wall or under the hood of your car - or in my case - 25 feet up on a barn roof... When something fatal occurs, they have little option but to automatically reset themselves, thus many such devices have a WDT built into the hardware. This monitors the state of the system and if it freezes, locks up or loops indefinitely for more than an “acceptable” amount of time, the WDT will reboot the device. After all, an occasionally faulty device is better than no device at all - especially if it controls your brakes.

I see many forum posts where the programmer says one of:
  • “I need to understand how the WDT works”
  • “There is something wrong with the WDT”
  • “My code runs fine on xxxx, but when I run it on the ESP8266, I get a WDT reset”
  • “Every time I run my code, I see: WDT reset, please help”
My answers usually are:
  • Oh no you don't (see above)
  • Oh no there isn't
  • So what?
  • Read this blog
It really helps if you have already read the article on "Asynchronous programming". If you haven't, then you need to, because WDT problems are the tip of an iceberg and you need to understand the whole iceberg to get the best out of your ESP8266.

The usual cause of a WDT reset is that your code is “blocking”, which means it's stopping other processes or "threads" from running. This is often caused by taking too long to do what you think it needs to do. The most common causes I see are indiscriminate use of delay() calls and/or waiting in a loop for an external resource, e.g. a remote website.
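As a rough illustration of what "not blocking" can look like, here is a minimal sketch of an interval check that never waits. It is plain C++ so you can try it on a desktop. The class name IntervalTimer and its due() method are made up for this example and are not part of any ESP8266 library. On the chip itself, the time you pass in would be the value returned by millis().

```cpp
#include <cstdint>

// Hypothetical helper (not a library class): reports when an interval has
// elapsed, but never waits for it.
class IntervalTimer {
public:
    explicit IntervalTimer(uint32_t interval_ms) : interval_ms_(interval_ms) {}

    // Returns true when at least interval_ms_ has passed since the last time
    // this returned true. Unsigned subtraction keeps the comparison correct
    // when the millisecond counter wraps around.
    bool due(uint32_t now_ms) {
        if (now_ms - last_ms_ >= interval_ms_) {
            last_ms_ = now_ms;
            return true;
        }
        return false;
    }

private:
    uint32_t interval_ms_;
    uint32_t last_ms_ = 0;
};
```

In a sketch, the idea would be something like `if (blink.due(millis())) { /* toggle the LED */ }` inside loop(), so that loop() returns quickly every time round instead of sitting still while the interval passes.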

So how long is “too long” and what is an “acceptable” period of time, when your code already runs fine on an Arduino / STM32 / Cray-1 / HP pocket calculator? Perhaps more importantly - why?

The ESP8266 is a WiFi-capable device – that’s why you bought it, right? Connecting to, disconnecting from, and – more importantly – maintaining a WiFi link is not magic - it takes processing time. There is only one CPU. The most important thing to grasp is that the code you write is not the only code running in the chip. About 200k+ of ESP code is loaded in before you even get to think about blinking an LED. And when does that code run? All the time. It runs “in the background” and you cannot easily see it or find out exactly what it’s doing and when. It just does its thing. Until you interfere with it and stop it doing its thing. Then the WDT kicks in. It's really quite simple.

If your code stops the WiFi code from running for more than a very short period of time, the WDT says “oops! System has locked up, reboot!”. There is a reason why I have left you thinking "what does 'very short' mean? How long exactly is it?" and the reason is that, if you write your programs correctly, you don't need to know. If you really want to, google it.
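If it helps to picture it, here is a toy model of that idea in plain C++. It is only a sketch of the concept, not how the ESP8266 actually implements its watchdog. The names WatchdogModel, background_ran() and would_reset() are invented for illustration, and the timeout is whatever number you pass in, not the real figure.

```cpp
#include <cstdint>

// Toy model only: a made-up class showing why starving the background
// code leads to a reset. Not the real ESP8266 mechanism.
class WatchdogModel {
public:
    explicit WatchdogModel(uint32_t timeout_ms) : timeout_ms_(timeout_ms) {}

    // Call this whenever the background (WiFi) code gets a chance to run.
    void background_ran(uint32_t now_ms) { last_ok_ms_ = now_ms; }

    // True when the background code has been kept waiting longer than the
    // timeout, i.e. the point at which a real watchdog would reboot.
    bool would_reset(uint32_t now_ms) const {
        return now_ms - last_ok_ms_ > timeout_ms_;
    }

private:
    uint32_t timeout_ms_;
    uint32_t last_ok_ms_ = 0;
};
```

The point of the model: nothing in your own code appears in it. All that matters is how long the background code was kept waiting, which is why the fix is always in the code that kept it waiting.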

Yes, you can try to turn off the WDT to “fix” the problem, but like the smoke alarm, it doesn’t remove the source of the fire, it just delays the inevitable. You can turn off the smoke alarm too, but if that is your preferred solution, I won’t be staying at your house, thank you. Even if you turn it off but still don't fix your code, the hardware WDT will probably kick in after a few seconds - and you can't turn that one off, so you are still going to crash - just several seconds later than if you hadn't turned off the software WDT.

Yes, there are ways you can "cheat" and "feed" the watchdog, but all you are doing is putting a blanket over the beeping smoke alarm to obscure the problem and hiding your bad code. Bad code generally finds a way to bite you in the ass no matter what you do, so it's best to find it and get rid of it, don't you think?

The only solution is to find the part of your code which blocks the background processing and then change it so that it doesn't. How to change it is a whole other (complex) story and for that, you definitely need to understand the link you haven't read yet...How do I know you haven't read it? Easy - because if you had, you wouldn't need to be reading this. Now go and read it.

The only way to absolutely guarantee no WDT resets is to write your code so that it can run asynchronously, co-operate fully with other processes and obey all the rules that multitasking requires. Unfortunately, that a) requires a whole new way of thinking and b) can be quite complex. With some basic rules, you can avoid most of the problems, but don't forget: we are talking about the tip of an iceberg here.

Until you get more experienced and fully understand the above paragraph, try to stick to the following:

1. Never forget that yours is not the only code running.
2. The problem is in your code. Messing with the WDT won’t fix that.
3. Try to avoid delay() if at all possible. Only ever include delay() if it is absolutely needed and you truly understand why it is needed. If both of those aren't true, take it out.
4. Never sit in a loop waiting for an external event to happen. Instead, have the event set a volatile global, then test and reset that global in the main loop. The same goes for callbacks and timer events. Or, write your code properly (see above link).
5. Call yield() in your main loop.
6. If a library has a “run” or “handle” or “loop” method, always call it; it’s there for a reason! This is usually the way library code does what your code also needs to do: co-operate with all other code running in the CPU. The best place to call it is in your main loop.
7. Never disable the WDT; it’s there for a reason!
