MbedOS currently implements error/exception handlers which gets invoked when the system encounters a fatal error/exception. The error handler capture information such as register context/thread info etc and these are valuable information required to debug the problem later. This information is currently printed over the serial port, but in many cases the serial port is not accessible and the serial terminal log is not captured, particularly in the case of field deployed devices. We cannot save this information by sending it over network or writing to a file, as the state of the system might be unstable after the fatal error. And thus a different mechanism is needed to record and report this data. The idea here is to auto-reboot the system after a fatal error has occurred to bring the system back in stable state, without losing the RAM contents where we have the error information collected, and we can then save this information reliably to be logged externally(E.g:- ARM Pelion cloud) or can be written to file system.
The current error handling implementation in MbedOS already collects error and exception context. With this feature the above mentioned data structures should be placed in an uninitialized RAM region so that the data is retained after an auto-reboot(warm-reset).
System should implement mechanism to track number of times the system has auto-rebooted and be able to stop auto-reboot when a configurable limit is reached.
As depicted in the above diagram, when the system gets into fatal error state the information collected by error and fault handlers are saved into RAM space allocated for Crash-Report. This is followed by a auto-reboot triggered from error handler. On reboot the the initialization routine validates the contents of Crash-Report space in RAM. This validation serves two purposes - to validate the captured content itself and also it tells the system if the previous reboot was caused by a fatal error. It then reads this information and calls an application defined callback function passing the crash-report information. The callback is invoked just before the entry to main() and thus the callback implementation may access libraries and other resources as other parts of the system have already initialized(like SDK, HAL etc) or can just capture the error information in application space to be acted upon later.
Current error and exception handling implementation in Mbed OS already collects error and exception context. But currently these data structures are implemented as statically allocated memory locations. With this feature these data structures should be placed in an uninitialized RAM region so that the data is retained after auto-reboot(warm-reset). So, this should be allocated as a dedicated region using linker command file(or in scatter file) for the corresponding target for each toolchain. Also note that this region should be marked as uninitialized region using the right toolchain attributes. For example, for ARM compiler we can define a new section as below:
```
RW_m_crash_data m_crash_report_ram_start EMPTY m_crash_report_ram_size { ; Dedicated Region to store crash report data
}
```
Note that the actual location of the data should be carefully chosen without affecting the current usage of other regions such as interrupt table region, flash configuration area etc. The absolute location of this Crash-Report RAM region may also differ for each target.
The current mbed_error() implementation should be modified to cause an auto-reboot at the end of error handling if this feature is enabled. The mechanism used for rebooting should make sure it doesn't cause a reset of RAM contents. This can be done by calling system_reset() function already implemented by MbedOS which cause the system to reboot without resetting the RAM. The mbed_error() implementation also should make sure it updates the error context stored in Crash-Report RAM with the right CRC value and it should also implement mechanism to track the reboot count caused by fatal errors. The below pueudo-code shows how the mbed_error() implementation should be modified.
MbedOS error handling system should implement necessary APIs for application to retrieve and reset the error and/or fault context of the previous fatal error after auto-reboot.
The below API can be called by application to retrieve the error context captured in the Crash-Report RAM. The error context is copied into the location pointed by *error_info*. Note that the caller should allocate the memory for this location.
The function should return MBED_ERROR_NOT_FOUND if there is no error context currently stored.
The below API can be called by application to retrieve the fault context captured in the Crash-Report RAM. The error context is copied into the location pointed by *fault_context*. Note that the caller should allocate the memory for this location. Note that the fault context is valid only if the previous reboot was caused by an exception. Whether the previous reboot was caused by an exception can be determined from the error code stored in error context information retrieved using mbed_get_reboot_error_info() API above.
The function should return MBED_ERROR_NOT_FOUND if there is no fault context currently stored.
### Mechanism to report the error data after reboot
MbedOS initialization sequence should check if the reboot is caused by a fatal error and should report the same to the application using callback mechanism. The system may also print this to terminal if enabled.
MbedOS initialization sequence should be modified as shown in below diagram to report the crash report and invoke the callback.
The error handing system in MbedOS will call this callback function if it detects that the current reboot has been caused by a fatal error. This function will be defined with MBED_WEAK attribute by default and applications wanting to process the error report should override this function in application implementation.
### System should implement mechanism to track number of times the system is auto-rebooted and be able to stop auto-reboot when a configurable limit is reached
Many a times rebooting may be a solution to bring the erroring device back into good state(after a fatal error for example)but there might be scenarios when the system has a permanent issue causing it to run into fatal error on every boot. In such scenarios, auto-reboot mechanism can cause a constant reboot loop situation. In order to avoid this, implementation should provide a mechanism to prevent constant reboot loop by limiting the number of auto-reboots. The number of times auto-reboot happens on fatal errors should be configurable. A configuration option should be provided to configure the
maximum number of auto-reboots(warm-resets). In order to implement this, system should track the number of times auto-reboot was effected using the Crash-Report RAM region.
Crash reporting implementation should provide enough parameters to control different aspects of crash reporting behavior so that developers can configure this feature to conform to their system design. Implementation should provide following configuration options to control the behavior as below.
In order to implement the callback the user can override the default callback function(*mbed_error_reboot_callback()*) implemented with MBED_WEAK attribute in platform layer as below.
The error context captured can be retrieved using mbed_get_reboot_error_info() API. See the below code
for example usage of that API. In the example below, a status variable reboot_error_detected has been used to track the presence of error context capture.
Below is the list of new configuration options added to configure error reporting functionality. All of these options are capture in mbed_lib.json file in platform folder.