Presently, error handling in Mbed OS is not standardized, either in the error codes used or in how errors are handled. Error handling should instead be unified at the OS level and offered as a service to every component in the system. This makes developing for Mbed OS easier, because it provides a standard way to describe and communicate error scenarios, and it can enable better error analysis and analytics tools in the future.

A standardized implementation should provide APIs for setting errors and for retrieving the runtime errors that have occurred in the system. The OS-level error handler should collect the data needed to help developers triage issues later. The implementation should also handle fault exceptions such as HardFault, BusFault, MemManage and UsageFault: it should capture the fault context, report it as required and halt the system depending on the exception type. Sufficient documentation, including Doxygen code comments, should be provided to make the error handling APIs and data structures easy to use.

This document defines the software design and test strategy for implementing standardized error codes and error handling in Mbed OS.
### Requirements and assumptions
This feature requires serial terminal support to emit the error report when the system encounters an error scenario. Error reports are printed for non-release builds only. For release builds, the system still captures the error context (without printing it), and the context can be retrieved using the error handling APIs.
# System architecture and high-level design
Below are the high-level design goals for the "Standardized Error Coding and Error Handling" feature.
**Common error code definitions**
Generic error codes should be defined for common error scenarios. For example, running out of memory is a common error that needs to be identified with an error code. Any layer of Mbed OS should be able to use these error codes to report errors. It is up to the receiver or handler of the error code to decide whether the error should be considered fatal in the current context. Other functionality, such as converting an error code to a human-readable string, may also be implemented depending on the capabilities of the target platform, such as available memory and build profile. Applications should also be able to extend the error code definitions or define custom ones as required.
1. Error setting API – An API should be provided for the application to set a system level error using the error code definitions.
1. Error retriever API – Application should be able to read the errors reported over a period, get the number of errors or read the last reported error.
1. Error clearing API – Application should be able to reset/clear the current set of errors recorded by the system.
1. Fatal Error API – APIs should be provided to handle fatal error scenarios, so that an application can report an error as fatal based on its context. The system should then record the error and halt.
**Error logging**
The implementation should include support for error logging, in which the system captures and logs the last N errors reported. These errors can be retrieved later for device-health reporting or to analyze past errors. The error log should capture the error code, the error type and other information as required. The error handling functions mentioned in Section 4.2 should use this logging mechanism to record or retrieve errors.
**Error reporting**
The implementation should include mechanisms to report errors or fault exceptions to the user through standard input/output or other channels as required when an error scenario happens. The error report should include the information needed to triage the error scenario efficiently. Once an error is captured, it can be reported in the following ways:
1. Print the error report to STDIO/Serial terminal – Almost all development boards have access to serial port as its STDIO and thus reporting it through serial port is required.
**Fault exceptions handling and reporting**
When the system crashes due to fault exceptions, the error handling infrastructure should handle the exception scenario and generate a crash dump containing relevant information for triaging the scenario. The crash dump generated should be reported using the error reporting mechanisms as mentioned in Section 4.4.
**Application hook for error handling**
In many cases, application developers may want to implement custom error handling, either for all errors or for custom-defined errors. The implementation should include a mechanism to register a custom error handler and should call that handler in the event of an error scenario.
As shown in the diagram above, all software components, including applications, platform code and drivers, can use the common error code definitions to represent error conditions and to report errors to the Error Handling component. The Error Handling component interfaces with the Error Logging and Error Reporting components to log and report errors, and halts the system if required. Fault exceptions are handled by the Exception Handling component and then logged and reported. Note also that a copy of the error log may be placed in a reserved area of RAM, so that it can be reported after a warm reset, once the system is back in a good state.
Common error scenarios should be identified, and each of them should have a corresponding error code defined. The error codes should be extensible by applications that need to define custom error codes. To use memory efficiently, a 32-bit value should represent each error and encode the following information in separate fields.
The error codes, entities and types should be defined in a platform-level header file accessible to all layers of the system (applications, drivers, SDKs and so on).
Presently, many modules under Mbed OS (such as filesystems) use POSIX error codes to report errors back into the system. The POSIX error code definitions should not overlap with the standardized Mbed OS error codes, and it should be easy for developers to report POSIX error codes into the Mbed OS error coding and handling system if required. Although POSIX error codes are supported for backward compatibility, it is highly encouraged that all future Mbed OS focused implementations use the Mbed OS error code definitions, so that the errors reported work seamlessly with the error reporting and handling implementation in Mbed OS.
To incorporate POSIX error codes into Mbed OS, a portion of the error space is allocated for them. Since Mbed OS error codes are always negative, the error code definitions capture the negative of the actual POSIX error code. For example, the equivalent of EPERM in the Mbed OS error code space is -EPERM. This aligns with the Mbed OS convention of using negative values, while the magnitude stays the same as the POSIX error code.
**Using error codes**
The Mbed OS error codes can be used under 2 circumstances:
1. It can be used as a return value when a function returns indicating the error situation.
A successful return should be either 0 or a positive value. The Mbed OS defined error codes should always be negative as mentioned above.
1. It can also be used to report a fatal/non-fatal error into the Mbed OS error handling system to be reported and recorded in the error log.
**NOTE: If you are using a POSIX error code to report into the Mbed OS error handling system, make sure you use the negative of the POSIX error code.**
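The return-value convention above can be illustrated with a minimal sketch. The function `copy_if_permitted()` and the helper `is_error()` are hypothetical names used only for this illustration and are not part of Mbed OS; only `EPERM` and `EINVAL` from `<errno.h>` are standard.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical helper, for illustration only.
 * Returns the number of bytes copied (zero or positive) on success,
 * or the negative of a POSIX error code on failure. */
int copy_if_permitted(char *dst, const char *src, size_t len, int permitted)
{
    if (dst == NULL || src == NULL) {
        return -EINVAL;   /* negative of the POSIX code */
    }
    if (!permitted) {
        return -EPERM;
    }
    for (size_t i = 0; i < len; i++) {
        dst[i] = src[i];
    }
    return (int)len;
}

/* With this convention a caller only needs to test the sign. */
int is_error(int status)
{
    return status < 0;
}
```

Because success is always zero or positive, the same return value can carry a useful result (here, a byte count) without a separate output parameter.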
Below are the error types supported. Mbed OS defined error codes should be classified as system error codes. A platform or implementation can always define its own error codes, but it should use custom error types to classify them. In addition, POSIX error codes should be supported in error code definitions, handling and reporting. This enables better diagnostics and routing for defects reported by the system. To capture all these error types in the negative integer space, each error type is assigned a range of the negative error space. See Section 5.1.6 - Error Codes Values/Ranges for details.
There should be definitions that capture the origin or location of the error. Below are some of the entity definitions that the error handling system should capture to identify where an error originates.
ENTITY_ANY ( = 0 ),
ENTITY_APPLICATION,
ENTITY_PLATFORM,
ENTITY_KERNEL,
ENTITY_NETWORK_STACK,
ENTITY_HAL,
ENTITY_MEMORY_SUBSYSTEM,
ENTITY_FILESYSTEM,
ENTITY_BLOCK_DEVICE,
ENTITY_DRIVER,
ENTITY_DRIVER_SERIAL,
ENTITY_DRIVER_RTC,
ENTITY_DRIVER_I2C,
ENTITY_DRIVER_SPI,
ENTITY_DRIVER_GPIO,
ENTITY_DRIVER_ANALOG,
ENTITY_DRIVER_DIGITAL,
ENTITY_DRIVER_CAN,
ENTITY_DRIVER_ETHERNET,
ENTITY_DRIVER_CRC,
ENTITY_DRIVER_PWM,
ENTITY_DRIVER_QSPI,
ENTITY_DRIVER_USB,
ENTITY_TARGET_SDK,
The bit field representation of these fields in an unsigned 32-bit integer should be as below. The error code will always be represented as negative.
The implementation should also provide convenient macros that combine the error code, error type and entity to build the error status. For example, the following could be the helper macro.
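As a minimal sketch of such a helper macro: the field positions and widths below (16-bit code, 8-bit entity, 2-bit type, top bit always set so the value is negative) are assumptions made for illustration, since the bit layout is not reproduced in this text. The names `error_status_t`, `MAKE_ERROR` and the `GET_ERROR_*` accessors are likewise hypothetical.

```c
#include <stdint.h>

typedef int32_t error_status_t;   /* hypothetical type name */

/* Assumed field layout, for illustration only. */
#define ERROR_CODE_POS      0u
#define ERROR_CODE_MASK     UINT32_C(0x0000FFFF)
#define ERROR_ENTITY_POS    16u
#define ERROR_ENTITY_MASK   UINT32_C(0x00FF0000)
#define ERROR_TYPE_POS      29u
#define ERROR_TYPE_MASK     UINT32_C(0x60000000)
#define ERROR_NEGATIVE_BIT  UINT32_C(0x80000000)

/* Packs type, entity and code into one status word. The top bit is always
 * set, so the result is negative on two's complement targets. */
#define MAKE_ERROR(type, entity, code)                                      \
    ((error_status_t)(ERROR_NEGATIVE_BIT |                                  \
        (((uint32_t)(type)   << ERROR_TYPE_POS)   & ERROR_TYPE_MASK)   |    \
        (((uint32_t)(entity) << ERROR_ENTITY_POS) & ERROR_ENTITY_MASK) |    \
        (((uint32_t)(code)   << ERROR_CODE_POS)   & ERROR_CODE_MASK)))

/* Accessors that unpack the individual fields again. */
#define GET_ERROR_TYPE(status) \
    ((((uint32_t)(status)) & ERROR_TYPE_MASK) >> ERROR_TYPE_POS)
#define GET_ERROR_ENTITY(status) \
    ((((uint32_t)(status)) & ERROR_ENTITY_MASK) >> ERROR_ENTITY_POS)
#define GET_ERROR_CODE(status) \
    ((((uint32_t)(status)) & ERROR_CODE_MASK) >> ERROR_CODE_POS)
```

Masking each field before combining keeps an out-of-range argument from corrupting its neighbours, at the cost of silently truncating it.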
The implementation should provide the necessary APIs for error handling and error management. These APIs are expected to be callable from both C and C++ code, because they will also be used by SDK, target and HAL code. The system should also capture each error together with some context, specified by a data structure. That error context data structure should capture information such as the current thread and the filename of the source file from which the error is logged. A sample error context data structure may be as below.
The filename is captured as an ASCII string. The maximum size of that string depends on the memory and resources provided by the platform and can be made configurable.
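A minimal sketch of what such a context structure could look like follows. The type name `error_ctx_t`, its field names, the initializer `error_ctx_init()` and the 16-byte filename limit are all assumptions chosen for illustration; the text above only requires that the thread and the source filename be captured, with a bounded filename length.

```c
#include <stdint.h>
#include <string.h>

/* Assumed limit; in practice this would be a platform configuration option. */
#define ERROR_FILENAME_MAX_LEN 16

/* Hypothetical error context, for illustration only. */
typedef struct {
    int32_t  error_status;                            /* negative error status  */
    uint32_t thread_id;                               /* thread that reported   */
    char     error_filename[ERROR_FILENAME_MAX_LEN];  /* NUL-terminated ASCII   */
    uint32_t error_line_number;                       /* line in the source file */
} error_ctx_t;

/* Fills in a context, truncating the filename so it always fits. */
void error_ctx_init(error_ctx_t *ctx, int32_t status, uint32_t thread_id,
                    const char *filename, uint32_t line)
{
    memset(ctx, 0, sizeof(*ctx));
    ctx->error_status = status;
    ctx->thread_id = thread_id;
    ctx->error_line_number = line;
    if (filename != NULL) {
        /* Copy at most MAX-1 bytes; the memset above leaves the final
         * byte as the terminating NUL. */
        strncpy(ctx->error_filename, filename, ERROR_FILENAME_MAX_LEN - 1);
    }
}
```

A fixed-size array rather than a pointer keeps the context self-contained, so it can be copied into the error log or a reserved RAM area without dangling references.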
This API should return the total number of errors reported since the last boot. Note that this differs from the number of errors logged, which may be only a subset of the most recently reported errors.
This API is called to retrieve the error context information that was logged as part of a set_error() call. It should take the index of the error for which the information is requested. If no error information is found, it should return NULL for error_info and the function should return an error value (for example, ERROR_NOT_FOUND). If the index is invalid, it should return an error code (for example, ERROR_INVALID_INDEX).
This API is called to retrieve the error context information that was logged as part of the last set_error() call. If no error information is found, it should return NULL for error_info and the function should return an error value (for example, ERROR_NOT_FOUND).
This API should return the total number of errors in the error log. Note that this is the set of most recently reported errors, limited by the size of the error log.
The implementation should provide an error logging system that records a limited number of the most recent errors. The number of entries in the log may be configurable depending on memory constraints. The error log can be implemented as a circular buffer that captures the most recent errors, and the APIs described above act on this buffer. The buffer need not be exposed externally; users and applications should use the APIs to interact with it. The error log should also be able to record some data specific to the error being reported, capped at a predefined number of bytes (for example, 64 bytes per entry). This may be configurable depending on the use case. The diagram below shows the interaction between the error log and the APIs that access the log buffer.
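The circular-buffer behaviour described above can be sketched as follows. This is an illustrative sketch, not the Mbed OS implementation: the names `error_log_t`, `error_entry_t` and the `error_log_*` functions, as well as the log size of 4, are assumptions. It keeps the count of errors reported since boot separate from the count of entries currently retained.

```c
#include <stddef.h>
#include <stdint.h>

#define ERROR_LOG_SIZE 4u   /* assumed; would be configurable */

typedef struct {
    int32_t  status;   /* negative error status */
    uint32_t value;    /* error-specific data   */
} error_entry_t;

typedef struct {
    error_entry_t entries[ERROR_LOG_SIZE];
    uint32_t next;            /* slot the next entry is written to      */
    uint32_t count;           /* entries currently retained in the log  */
    uint32_t total_reported;  /* errors reported since boot             */
} error_log_t;

/* Records an error, overwriting the oldest entry once the log is full. */
void error_log_put(error_log_t *log, int32_t status, uint32_t value)
{
    log->entries[log->next].status = status;
    log->entries[log->next].value = value;
    log->next = (log->next + 1u) % ERROR_LOG_SIZE;
    if (log->count < ERROR_LOG_SIZE) {
        log->count++;
    }
    log->total_reported++;
}

/* Returns the entry at index (0 = oldest retained), or NULL if invalid. */
const error_entry_t *error_log_get(const error_log_t *log, uint32_t index)
{
    if (index >= log->count) {
        return NULL;
    }
    uint32_t oldest = (log->next + ERROR_LOG_SIZE - log->count) % ERROR_LOG_SIZE;
    return &log->entries[(oldest + index) % ERROR_LOG_SIZE];
}

/* Returns the most recently logged entry, or NULL if the log is empty. */
const error_entry_t *error_log_get_last(const error_log_t *log)
{
    return (log->count == 0u) ? NULL : error_log_get(log, log->count - 1u);
}

/* Clears the retained entries but keeps the since-boot total. */
void error_log_clear(error_log_t *log)
{
    log->next = 0u;
    log->count = 0u;
}
```

With a log size of 4, reporting six errors leaves the third through sixth in the log, while the since-boot total reads 6. A real implementation would also need to protect the buffer against concurrent access from threads and interrupt context, which this sketch omits.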
When a subsystem or any other component in the system encounters a fatal error, the error is reported to the error reporting subsystem, which should print the error information through STDOUT (usually the serial terminal). The information printed should include the error code, the address of the function that reported the error, the stack trace (raw data from the current stack), the time of the crash and the current task information. For example, below is a sample of what could be reported as part of fatal error handling.
The error reporting subsystem may support backing up this error log to a filesystem, if the platform provides one. Every time the log is backed up to the filesystem, the current log in RAM is also cleared. Backing up the error log to the filesystem should be triggered by calling an explicit API such as below.
Cortex-M based processors trigger fault exceptions when the core encounters an unrecoverable error. Below are the fault exceptions triggered by Cortex-M based processors.
- MemManage Exception - Memory accesses that violate the setup in the MPU and certain illegal memory accesses trigger memory management faults.
- BusFault Exception - When an error response is received during a transfer on the AHB interfaces, it produces bus faults.
- UsageFault Exception - Division by zero, unaligned accesses and trying to execute coprocessor instructions can cause usage faults.
- HardFault Exception - Triggered on all fault conditions or if the corresponding fault handler (one of the above) is not enabled.
Not all fault exceptions are supported by all cores. For example, Cortex-M0/M0+ processors (or any ARMv6-M processors) do not implement the MemManage, BusFault and UsageFault exceptions; on those cores, all faults are reported as HardFault exceptions. On ARMv7-M processors, MemManage, BusFault and UsageFault exceptions trigger only if they are enabled in the System Handler Control and State Register (SHCSR). When these exceptions occur, the corresponding exception handlers should generate a crash dump and update the error log with this information. The crash dump should also be reported over STDOUT (serial terminal). The crash information should contain the register context at the time of the exception, the exception type, the current threads in the system and so on. The diagram below depicts how the exception handling works.
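To make the SHCSR requirement concrete, the sketch below sets the three enable bits defined by the ARMv7-M architecture (MEMFAULTENA, BUSFAULTENA and USGFAULTENA at bits 16, 17 and 18). The function name `enable_configurable_faults()` is hypothetical, and the register is passed as a pointer so the logic can be exercised off-target; on a real ARMv7-M device the register lives at address 0xE000ED24 and is normally accessed through the CMSIS `SCB->SHCSR` field.

```c
#include <stdint.h>

/* Enable bits in the System Handler Control and State Register (ARMv7-M). */
#define SHCSR_MEMFAULTENA (UINT32_C(1) << 16)
#define SHCSR_BUSFAULTENA (UINT32_C(1) << 17)
#define SHCSR_USGFAULTENA (UINT32_C(1) << 18)

/* Hypothetical helper: enables the MemManage, BusFault and UsageFault
 * exceptions, leaving all other SHCSR bits unchanged. Without these bits
 * set, the corresponding faults escalate to HardFault. */
void enable_configurable_faults(volatile uint32_t *shcsr)
{
    *shcsr |= SHCSR_MEMFAULTENA | SHCSR_BUSFAULTENA | SHCSR_USGFAULTENA;
}
```

On target, the call would be `enable_configurable_faults((volatile uint32_t *)0xE000ED24UL);`, typically made early during startup and only on cores that implement these exceptions.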
Some applications may need custom error handling in some scenarios. To facilitate this, the error handling subsystem should provide a mechanism for the application to register a custom error handling (hook) function. When such a hook is registered, the error handling system should call it when an error is encountered, before handling the error itself.
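One possible shape for such a hook mechanism is sketched below. The names `error_hook_t`, `register_error_hook()`, `dispatch_error()` and `record_status_hook()` are hypothetical and chosen only for this illustration; the sketch shows the ordering described above, with the hook invoked before the default handling.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hook signature: receives the error status and an opaque
 * pointer supplied at registration time. */
typedef void (*error_hook_t)(int32_t status, void *user_data);

static error_hook_t registered_hook = NULL;
static void *registered_user_data = NULL;

/* Registers (or, with NULL, removes) the application error hook. */
void register_error_hook(error_hook_t hook, void *user_data)
{
    registered_hook = hook;
    registered_user_data = user_data;
}

/* Entry point of the error handler in this sketch. Calls the hook first,
 * if one is registered. Returns 1 if a hook was called, 0 otherwise. */
int dispatch_error(int32_t status)
{
    int hook_called = 0;
    if (registered_hook != NULL) {
        registered_hook(status, registered_user_data);
        hook_called = 1;
    }
    /* Default handling (logging, reporting, halting) would follow here. */
    return hook_called;
}

/* Example application hook: stores the last status it was given. */
void record_status_hook(int32_t status, void *user_data)
{
    *(int32_t *)user_data = status;
}
```

An application would call `register_error_hook(record_status_hook, &last_status)` once at startup. Because the hook may run in a fault or interrupt context, it should be kept short and avoid blocking calls.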
### Platform configuration options for error handling infrastructure
Below is the list of new configuration options added to configure error handling functionality. All of these options are captured in the mbed_lib.json file in the platform folder.
Enables capture of the filename and line number as part of the error context. This works only for debug and develop builds; on release builds, filename capture is always disabled.
The error handling implementation is generic enough that other components should no longer need to implement their own error codes or handling. For example, fault exception handling currently implements part of error handling itself (such as halting the system); this is no longer needed and can be switched to the common error handling, which defines system behavior on a fatal error.