# Crash Reporting

# Table of contents

- [Crash Reporting](#crash-reporting)
- [Table of contents](#table-of-contents)
    - [Revision history](#revision-history)
- [Introduction](#introduction)
    - [Overview and background](#overview-and-background)
    - [Requirements and assumptions](#requirements-and-assumptions)
- [System architecture and high-level design](#system-architecture-and-high-level-design)
    - [System architecture and component interaction](#system-architecture-and-component-interaction)
- [Detailed design](#detailed-design)
- [Usage scenarios and examples](#usage-scenarios-and-examples)
- [Tools and configuration changes](#tools-and-configuration-changes)
- [Other information](#other-information)
    - [Reusability](#reusability)
    - [Deprecations](#deprecations)
    - [References](#references)

### Revision history

1.0 - Initial version - Senthil Ramakrishnan - 10/22/2018

# Introduction

### Overview and background

Mbed OS currently implements error and exception handlers that are invoked when the system encounters a fatal error or exception. These handlers capture information such as the register context and thread information, which is valuable for debugging the problem later. This information is currently printed over the serial port, but in many cases the serial port is not accessible and the serial terminal log is not captured, particularly on field-deployed devices. We cannot save this information at the time of the error by sending it over the network or writing it to a file, because the system may be unstable after a fatal error. A different mechanism is therefore needed to record and report this data. The idea is to auto-reboot the system after a fatal error to bring it back to a stable state, without losing the RAM contents that hold the collected error information. After the reboot, this information can be saved reliably, either by logging it externally (for example, to the ARM Pelion cloud) or by writing it to a file system.

### Requirements and assumptions

This feature requires 256 bytes of dedicated RAM allocated for storing the error and fault context information.

# System architecture and high-level design

Below are the high-level goals for the "Crash Reporting" feature:

**Error information collection including exception context**

The current error handling implementation in Mbed OS already collects error and exception context. With this feature, the data structures holding that context should be placed in an uninitialized RAM region so that the data is retained after an auto-reboot (warm-reset).

**Mechanism to auto-reboot (also called warm-reset) the system without losing RAM contents where error info is stored**

Implement auto-reboot functionality to reboot the system automatically after a fatal error. The auto-reboot feature should be configurable.

**Mechanism to retrieve the error data after reboot**

Provide platform APIs to read the saved error information after reboot.

**Mechanism to report the error data after reboot**

During reboot, the system should check whether the reboot was caused by a fatal error and, if so, report it using a callback mechanism.

**Implementation should provide a mechanism to prevent constant reboot loop by limiting the number of auto-reboots**

The system should implement a mechanism to track the number of times it has auto-rebooted and stop auto-rebooting when a configurable limit is reached.

**Implementation should provide following configuration options**

1. Configuration option to enable or disable this feature
1. Configuration option to enable or disable auto-reboot when the system enters a fatal error scenario
1. Configuration option to limit the number of auto-reboots
1. Configuration option to print the saved error report on reboot

### System architecture and component interaction

The diagram below shows the overall architecture of the crash-reporting implementation.

![System architecture and component interaction](./diagrams/crash-report1.jpg)

As depicted in the diagram above, when the system enters a fatal error state, the information collected by the error and fault handlers is saved into the RAM space allocated for the Crash-Report. The error handler then triggers an auto-reboot. On reboot, the initialization routine validates the contents of the Crash-Report space in RAM. This validation serves two purposes: it confirms that the captured content is intact, and it tells the system whether the previous reboot was caused by a fatal error. If the content is valid, the initialization routine reads this information and calls an application-defined callback function, passing it the crash-report information. The callback is invoked just before the entry to main(). Because other parts of the system (such as the SDK and HAL) have already been initialized by then, the callback implementation may access libraries and other resources, or it can simply capture the error information in application space to be acted upon later.

# Detailed design

### Error information collection including exception context

The current error and exception handling implementation in Mbed OS already collects error and exception context, but these data structures are currently implemented as statically allocated memory locations. With this feature, these data structures should be placed in an uninitialized RAM region so that the data is retained after an auto-reboot (warm-reset). This region should therefore be allocated as a dedicated region using the linker command file (or scatter file) of the corresponding target for each toolchain, and it should be marked as uninitialized using the appropriate toolchain attributes. For example, for the ARM compiler we can define a new section as below:
```
RW_m_crash_data m_crash_report_ram_start EMPTY m_crash_report_ram_size { ; Dedicated Region to store crash report data
} 
```
Note that the actual location of the data should be carefully chosen without affecting the current usage of other regions such as interrupt table region, flash configuration area etc. The absolute location of this Crash-Report RAM region may also differ for each target.
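
On the C side, the data that lives in this region would typically be tied to the dedicated region through a toolchain-specific section attribute. The sketch below is only an illustration under stated assumptions: the section name `.crash_data_ram`, the `CRASH_DATA_SECTION` and `CRASH_REPORT_RAM_SIZE` macros, the `crash_report_region_t` type and the `crash_report_ptr()` helper are hypothetical names, not part of Mbed OS, and the attribute syntax shown is the GCC one. On a host build the macro expands to nothing, so the code still compiles as ordinary static storage.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed size of the dedicated region, matching the 256-byte requirement. */
#define CRASH_REPORT_RAM_SIZE 256u

/* Hypothetical section name; the target's linker script would have to define
 * it and mark it as not initialized at startup. */
#if defined(__arm__) && defined(__GNUC__)
#define CRASH_DATA_SECTION __attribute__((section(".crash_data_ram")))
#else
#define CRASH_DATA_SECTION /* host build: ordinary static storage */
#endif

typedef struct {
    uint8_t bytes[CRASH_REPORT_RAM_SIZE];
} crash_report_region_t;

/* Fail the build if the layout ever outgrows the reserved region. */
static_assert(sizeof(crash_report_region_t) == CRASH_REPORT_RAM_SIZE,
              "crash report layout must match the reserved RAM size");

CRASH_DATA_SECTION static crash_report_region_t crash_report_region;

/* Returns a pointer to `len` bytes at `offset` inside the region,
 * or NULL if the requested range does not fit. */
void *crash_report_ptr(size_t offset, size_t len)
{
    if (offset > CRASH_REPORT_RAM_SIZE || len > CRASH_REPORT_RAM_SIZE - offset) {
        return NULL;
    }
    return &crash_report_region.bytes[offset];
}
```

The compile-time size check is one way to catch a layout change that would silently overflow into a neighbouring region such as the interrupt table.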

### Mechanism to auto-reboot (also called warm-reset) the system without losing RAM contents where error info is stored

The current mbed_error() implementation should be modified to trigger an auto-reboot at the end of error handling if this feature is enabled. The mechanism used for rebooting must not reset the RAM contents. This can be done by calling the system_reset() function already implemented by Mbed OS, which reboots the system without resetting the RAM. Before resetting, mbed_error() should also update the error context stored in the Crash-Report RAM with the correct CRC value and track the number of reboots caused by fatal errors. The pseudo-code below shows how the mbed_error() implementation should be modified.

```
mbed_error_status_t mbed_error( ... )
{
    //Handle the error just as we do now and then do the following to save the context into Crash-Report RAM and reset

    Read the current Crash Report and calculate CRC
    If CRC matches what's in Crash-Report RAM:
        Update the location with new error information
        Update Reboot Count
        Calculate new CRC
        Update with new CRC value
    Else (if CRC doesn't match)
        //This is the case when we don't have a crash report already stored.
        Update the location with new error information
        Set Reboot count to 1
        Calculate new CRC
        Update with new CRC value

    Do a system reset //using system_reset() function
}
```
The Crash-Report RAM region should also be used for tracking other pieces of information such as the CRC value and the
auto-reboot count. 
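
The update step in the pseudo-code can be sketched in C roughly as follows. This is a minimal illustration, not the Mbed OS implementation: the `crash_report_t` layout, the function names and the choice of a bitwise CRC-32 are all assumptions made for the example, and the final system reset is left out so the logic can run on a host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified record kept in the Crash-Report RAM region. */
typedef struct {
    int32_t  error_status;
    uint32_t error_address;
    uint32_t reboot_count;
    uint32_t crc;            /* CRC-32 over all preceding fields */
} crash_report_t;

/* Plain bitwise CRC-32 (reflected polynomial 0xEDB88320). */
static uint32_t crc32_calc(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;
    while (len--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

/* CRC over everything in the record except the stored CRC itself. */
uint32_t report_crc(const crash_report_t *r)
{
    return crc32_calc(r, offsetof(crash_report_t, crc));
}

/* Store a new error in the record and refresh the reboot count and CRC. */
void crash_report_record(crash_report_t *r, int32_t status, uint32_t address)
{
    assert(r != NULL);
    /* A matching CRC means a previous report exists: bump its count.
     * Otherwise this is the first report, so the count starts at 1. */
    uint32_t count = (report_crc(r) == r->crc) ? r->reboot_count + 1u : 1u;

    r->error_status  = status;
    r->error_address = address;
    r->reboot_count  = count;
    r->crc           = report_crc(r);
    /* On the target, the system reset would follow here. */
}
```

Because the CRC covers the reboot count as well as the error fields, random RAM contents after a cold boot are very unlikely to be mistaken for a valid report.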

### Mechanism to retrieve and reset the error data after reboot

The Mbed OS error handling system should implement the APIs necessary for the application to retrieve and reset the error and/or fault context of the previous fatal error after an auto-reboot.
The APIs below should be implemented.

The API below can be called by the application to retrieve the error context captured in the Crash-Report RAM. The error context is copied into the location pointed to by *error_info*. Note that the caller should allocate the memory for this location.
The function should return MBED_ERROR_NOT_FOUND if there is no error context currently stored.
```C
//Retrieve the reboot error context
mbed_error_status_t mbed_get_reboot_error_info(mbed_error_ctx *error_info);
```

The API below can be called by the application to retrieve the fault context captured in the Crash-Report RAM. The fault context is copied into the location pointed to by *fault_context*. Note that the caller should allocate the memory for this location. The fault context is valid only if the previous reboot was caused by an exception, which can be determined from the error code stored in the error context retrieved using the mbed_get_reboot_error_info() API above.
The function should return MBED_ERROR_NOT_FOUND if there is no fault context currently stored.
```C
//Call this function to retrieve the last reboot fault context
mbed_error_status_t mbed_get_reboot_fault_context(mbed_fault_context_t *fault_context);
```

The API below can be called by the application to reset the error context captured in the Crash-Report RAM.
The function should return MBED_ERROR_NOT_FOUND if there is no error context currently stored.
```C
//Reset the reboot error context
mbed_error_status_t mbed_reset_reboot_error_info(void);
```

### Mechanism to report the error data after reboot

The Mbed OS initialization sequence should check whether the reboot was caused by a fatal error and, if so, report it to the application using a callback mechanism. The system may also print the report to the terminal if that option is enabled.

The Mbed OS initialization sequence should be modified as shown in the diagram below to report the crash report and invoke the callback.

![Error report on reboot](./diagrams/boot-error-report.jpg)

The callback for reporting the error information should have the signature shown below.

The error handling system in Mbed OS will call this callback function if it detects that the current reboot was caused by a fatal error. This function will be defined with the MBED_WEAK attribute by default, and applications that want to process the error report should override it in the application implementation.
```C
void mbed_error_reboot_callback(mbed_error_ctx *error_context);
```
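
The boot-time decision described in this section can be illustrated with the host-runnable sketch below. It is a simplification under stated assumptions: `crash_report_t`, `report_crash_on_boot()` and `example_reboot_callback()` are hypothetical names, the CRC validation result is passed in as a flag, and a function pointer stands in for the MBED_WEAK override so that the example does not depend on linker behavior.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, simplified stand-in for the stored error context. */
typedef struct {
    int32_t  error_status;
    uint32_t reboot_count;
} crash_report_t;

typedef void (*reboot_callback_t)(const crash_report_t *report);

/* Called once during initialization, before main().
 * Returns true if a crash was reported, false for a normal boot. */
bool report_crash_on_boot(const crash_report_t *report, bool report_valid,
                          bool print_enabled, reboot_callback_t callback)
{
    assert(report != NULL);
    if (!report_valid) {
        return false;   /* No valid crash report: this was not a crash reboot. */
    }
    if (print_enabled) {
        printf("Reboot after fatal error: status=%ld, reboot count=%lu\n",
               (long)report->error_status, (unsigned long)report->reboot_count);
    }
    if (callback != NULL) {
        callback(report);   /* Hand the report to the application. */
    }
    return true;
}

/* Example application callback: remember the status for later processing. */
static int32_t last_reported_status;

void example_reboot_callback(const crash_report_t *report)
{
    last_reported_status = report->error_status;
}
```

Keeping the callback short, as in the example, leaves the heavier work (such as sending the report over the network) to the application once main() is running.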

### System should implement a mechanism to track the number of times the system has auto-rebooted and be able to stop auto-reboot when a configurable limit is reached

Rebooting is often enough to bring a failing device back into a good state (after a fatal error, for example), but there may be scenarios in which the system has a permanent issue that causes a fatal error on every boot. In such scenarios, the auto-reboot mechanism can cause a constant reboot loop. To avoid this, the implementation should limit the number of auto-reboots that happen on fatal errors, and a configuration option should be provided to set the maximum number of auto-reboots (warm-resets). To implement this, the system should track the number of auto-reboots performed using the Crash-Report RAM region.
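
A minimal sketch of this limit check is shown below. The names `fatal_error_action_t` and `fatal_error_next_action()` are hypothetical, and the sketch assumes that the stored count holds the number of auto-reboots already performed; the actual implementation may count differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    FATAL_ERROR_HALT,    /* Limit reached: stay halted instead of rebooting. */
    FATAL_ERROR_REBOOT   /* Below the limit: perform another auto-reboot.    */
} fatal_error_action_t;

/* Decide what to do at the end of fatal error handling.
 * `reboot_count` points at the counter kept in the Crash-Report RAM region
 * and is incremented only when a reboot is going to happen. */
fatal_error_action_t fatal_error_next_action(uint32_t *reboot_count,
                                             uint32_t max_reboots)
{
    assert(reboot_count != NULL);
    if (*reboot_count >= max_reboots) {
        return FATAL_ERROR_HALT;
    }
    (*reboot_count)++;
    return FATAL_ERROR_REBOOT;
}
```

With a limit of 3, a device that faults on every boot would reboot three times and then halt, which breaks the reboot loop while still giving a transient fault a few chances to clear.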

### Implementation should provide following configuration options

The crash reporting implementation should provide enough parameters to control different aspects of crash reporting behavior so that developers can configure this feature to conform to their system design. The implementation should provide the following configuration options:

1. Configuration option to enable or disable the entire feature
1. Configuration option to enable or disable auto-reboot when the system enters a fatal error scenario
1. Configuration option to limit the number of auto-reboots
1. Configuration option to print the saved error report on reboot

# Usage scenarios and examples

Below (pseudo code) are some common usage scenarios using the new error reporting APIs.

### Implementing crash reporting callback
To implement the callback, the user can override the default callback function (*mbed_error_reboot_callback()*), which is implemented with the MBED_WEAK attribute in the platform layer, as below.

```C
mbed_error_ctx my_error_ctx;
//Callback during reboot
void mbed_error_reboot_callback(mbed_error_ctx *error_context) {
    printf("Error callback received\n");
    //Copy the error context into a global struct for processing it later
    memcpy(&my_error_ctx, error_context, sizeof(my_error_ctx));
}
```
The above function will be called during boot with a pointer to *error_context* structure.

### Retrieving error info after reboot
The captured error context can be retrieved using the mbed_get_reboot_error_info() API. See the code below
for example usage of that API. In the example, a status variable, reboot_error_detected, is used to track whether an error context was captured.

```C
mbed_error_ctx error_ctx;
int reboot_error_detected = 0;

//Callback during reboot
void mbed_error_reboot_callback(mbed_error_ctx *error_context) {
    printf("error callback received\n");
    reboot_error_detected = 1;
}

// main() runs in its own thread in the OS
int main() {

    if(reboot_error_detected == 1) {
        if(MBED_SUCCESS == mbed_get_reboot_error_info(&error_ctx)) {
            printf("\nSuccessfully read error context\n");
        }
    }
    //main continues...
}
```

### Retrieving fault context after reboot
The fault context captured can be retrieved using mbed_get_reboot_fault_context() API. See the below code
for example usage of that API. The example code below checks for error_status using the error context and then
retrieves the fault context using mbed_get_reboot_fault_context() API.

```C
mbed_error_ctx error_ctx;
mbed_fault_context_t fault_ctx;
int reboot_error_detected = 0;

//Callback during reboot
void mbed_error_reboot_callback(mbed_error_ctx *error_context) {
    printf("error callback received");
    reboot_error_detected = 1;
}

// main() runs in its own thread in the OS
int main() {
  
    if(reboot_error_detected == 1) {
        if(MBED_SUCCESS == mbed_get_reboot_error_info(&error_ctx)) {
            printf("\nRead in reboot info\n");
            if(error_ctx.error_status == MBED_ERROR_HARDFAULT_EXCEPTION) {
               if(MBED_SUCCESS == mbed_get_reboot_fault_context(&fault_ctx)) {
                   printf("\nRead in fault context info\n");
               }
            }
        }
    }
    //main continues...
}
```

# Tools and configuration changes

### Platform configuration options for error reporting infrastructure

Below is the list of new configuration options added to configure error reporting functionality. All of these options are captured in the mbed_lib.json file in the platform folder.

**crash-capture-enabled**

Enables crash context capture when the system enters a fatal error/crash. When this is disabled it should also disable other dependent options.

**fatal-error-auto-reboot-enabled**

Enables auto-reboot on fatal errors.

**reboot-crash-report-enabled**

Enables printing of the crash report over the terminal when the system reboots after a fatal error/crash.

**error-reboot-max**

Maximum number of auto reboots permitted when an error happens.
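
The Mbed OS configuration system conventionally exposes a library option to C code as a `MBED_CONF_<LIBRARY>_<OPTION>` macro, so these options would be expected to appear as `MBED_CONF_PLATFORM_CRASH_CAPTURE_ENABLED` and so on. The sketch below shows one way the options could be consumed, including the rule that disabling crash capture also disables the dependent options. The `crash_config_t` type, both functions and the fallback default values are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Fallbacks for builds where the config system has not defined the macros.
 * These default values are assumptions made for the sketch. */
#ifndef MBED_CONF_PLATFORM_CRASH_CAPTURE_ENABLED
#define MBED_CONF_PLATFORM_CRASH_CAPTURE_ENABLED 1
#endif
#ifndef MBED_CONF_PLATFORM_FATAL_ERROR_AUTO_REBOOT_ENABLED
#define MBED_CONF_PLATFORM_FATAL_ERROR_AUTO_REBOOT_ENABLED 1
#endif
#ifndef MBED_CONF_PLATFORM_REBOOT_CRASH_REPORT_ENABLED
#define MBED_CONF_PLATFORM_REBOOT_CRASH_REPORT_ENABLED 1
#endif
#ifndef MBED_CONF_PLATFORM_ERROR_REBOOT_MAX
#define MBED_CONF_PLATFORM_ERROR_REBOOT_MAX 1
#endif

typedef struct {
    bool     crash_capture_enabled;       /* crash-capture-enabled           */
    bool     auto_reboot_enabled;         /* fatal-error-auto-reboot-enabled */
    bool     crash_report_print_enabled;  /* reboot-crash-report-enabled     */
    uint32_t error_reboot_max;            /* error-reboot-max                */
} crash_config_t;

/* Snapshot of the build-time configuration. */
crash_config_t crash_config_from_build(void)
{
    crash_config_t cfg = {
        MBED_CONF_PLATFORM_CRASH_CAPTURE_ENABLED != 0,
        MBED_CONF_PLATFORM_FATAL_ERROR_AUTO_REBOOT_ENABLED != 0,
        MBED_CONF_PLATFORM_REBOOT_CRASH_REPORT_ENABLED != 0,
        (uint32_t)MBED_CONF_PLATFORM_ERROR_REBOOT_MAX
    };
    return cfg;
}

/* Apply the dependency rule: without crash capture, nothing else is active. */
void crash_config_resolve(crash_config_t *cfg)
{
    assert(cfg != NULL);
    if (!cfg->crash_capture_enabled) {
        cfg->auto_reboot_enabled = false;
        cfg->crash_report_print_enabled = false;
        cfg->error_reboot_max = 0u;
    }
}
```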

# Other information

### Reusability

### Deprecations

### References