Huawei KunLun Mission Critical Server: The Key to In-Memory Application Continuity

Preface

More and more enterprises rely on databases for production, operation, and management, and databases have become critical IT infrastructure. Core databases that carry an enterprise’s mission-critical services usually must run 24/7 without interruption to minimize the losses caused by system breakdowns.

The Huawei KunLun mission critical server meets the demanding performance and reliability requirements of mission-critical applications. In terms of performance, in addition to ranking No.1 in multiple SPEC benchmark tests, KunLun took first place in both the SAP B4H and SD2 benchmark tests, which target online analytical processing (OLAP) and online transaction processing (OLTP) respectively. In terms of reliability, leveraging Huawei’s reliability, availability, and serviceability (RAS) 2.0 technologies, KunLun supports CPU and memory hot swap and has surpassed traditional UNIX servers on some reliability indicators.

In traditional databases, a large amount of data is stored on storage devices such as disks. In the basic architecture of modern computers, storage devices sit farther from the CPUs than memory does, so CPUs access storage devices at much lower data rates and bandwidth than they access memory. As server CPUs support ever larger memory capacities, many database vendors have moved the running data of their databases into system memory for operation and management. This in-memory computing model significantly improves overall database performance.

As the DIMM count and memory capacity supported by servers keep increasing, enterprises are paying more attention to the reliability of the server memory subsystem. Huawei KunLun stands out in this respect. It incorporates reliability technologies such as double device data correction (DDDC), memory sparing, and memory mirroring. Moreover, KunLun is the first x86 server in the industry to support memory hot-swap. This technology, together with KunLun’s proactive failure analysis engine (PFAE), enables users to take preventive measures early, when a minor exception first occurs in the memory. As with most electronic devices, the failure rate of DIMMs plotted over time has the shape of a bathtub, commonly known as the bathtub curve. DIMMs experience higher failure rates in the Early-Life Failures period and the Wear-Out Failures period. For KunLun, DIMMs that fail in the Early-Life Failures period are filtered out in production tests so that they do not reach customers. To avoid wear-out failures, KunLun’s memory hot-swap technology allows enterprises to replace DIMMs online before they enter the Wear-Out Failures period, ensuring service continuity. In this way, DIMMs used by service systems remain in the low-failure-rate period, avoiding critical faults or even system breakdowns caused by the high failure rates that follow long-term operation.

Failure rate bathtub curve
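
The bathtub shape can be illustrated with a common reliability-engineering model: the sum of a decreasing, a constant, and an increasing Weibull hazard rate. The sketch below is only an illustration of that general model. The function names and every parameter value are assumptions chosen to produce a readable curve; they are not measured DIMM data and do not describe how KunLun or PFAE estimates failure rates.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t):
    """Illustrative bathtub curve: early-life + random + wear-out hazards.

    All parameters are made-up values for illustration (t in years).
    """
    early_life = weibull_hazard(t, shape=0.5, scale=2.0)   # shape < 1: decreasing
    random_fail = weibull_hazard(t, shape=1.0, scale=20.0)  # shape = 1: constant
    wear_out = weibull_hazard(t, shape=5.0, scale=10.0)     # shape > 1: increasing
    return early_life + random_fail + wear_out


# Print the modeled failure rate at a few points in the service life.
for years in (0.1, 1, 3, 5, 8, 10):
    print(f"t = {years:>4} years  hazard = {bathtub_hazard(years):.3f}")
```

In this toy model the hazard is high at the start, flattens in the middle, and rises again near the end, which is the window in which replacing a DIMM online keeps the system in the low-failure-rate region.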

 

Openness Accelerates Innovation

The always-on in-memory application solution, based on the Huawei KunLun mission critical server and the SUSE Linux Enterprise Server (SLES) OS, is the result of collaborative innovation between the two companies and builds on an open architecture and an open ecosystem.

Layered architecture of the memory hot swap function

Huawei proactively works with partners to build open ecosystems that accelerate innovation. Memory hot-swap is a complex, system-level technology that requires support from the CPU, BIOS firmware, server platform, and OS kernel. Building on their long-term cooperation, SUSE and Huawei took on this challenge together. SUSE assigned experts in memory management and the Advanced Configuration and Power Interface (ACPI) to carry out in-depth joint development with Huawei. The OS patch officially released by SUSE contains a large number of low-level code optimizations and hardening measures for the memory management module and the ACPI driver module, which improve the memory hot-swap process.

The always-on in-memory application solution jointly launched by Huawei and SUSE is the first of its kind in the industry for open architecture servers. Open architecture servers, represented by x86 servers, are leading the trend of technological innovation. KunLun is no exception. It delivers industry-leading reliability while helping enterprises accelerate innovation.

Focusing on Service Experience Improvement

KunLun’s always-on in-memory application solution focuses not only on technologies but also on user experience.

Huawei and SUSE have cooperated throughout the long-term development of the memory hot-swap technology, from enabling the OS itself to support memory hot-swap to improving memory migration efficiency in different service scenarios. Step by step, the two parties have carried out in-depth research and continuous innovation in every aspect. They have also performed systematic verification and optimization specifically for in-memory services.

For the hot-swap of memory with latent faults, a critical step is to migrate running data from the risky memory to other functional, idle memory. After the migration is complete, the resource information related to the risky memory is deleted so that no new data is stored in it. The ways in which OSs and databases use memory are very complex, and different databases may access memory in different ways. Huawei and SUSE have optimized the memory migration mechanism for mainstream databases such as the Oracle database and SAP HANA, improving the migration success rate of individual memory pages and reducing the overall memory migration time by cutting down on retries.
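
The migrate-then-retire sequence described above can be sketched as a small simulation. This is a toy model under stated assumptions, not the SLES kernel implementation: `evacuate_block`, `make_flaky_migrator`, the retry limit, and the idea that a page fails a fixed number of times before it can be moved are all hypothetical and exist only to show why fewer retries per page shorten the overall migration.

```python
def make_flaky_migrator(transient_failures):
    """Return a hypothetical try_migrate(page) callable.

    transient_failures maps a page id to the number of times moving that
    page fails (for example, because it is briefly in use) before it succeeds.
    """
    remaining = dict(transient_failures)

    def try_migrate(page):
        if remaining.get(page, 0) > 0:
            remaining[page] -= 1
            return False
        return True

    return try_migrate


def evacuate_block(pages, try_migrate, max_retries=3):
    """Migrate every page off a risky block, retrying failed pages.

    The block is marked offline (no new data may be placed in it) only
    if every page was moved; otherwise it stays online.
    """
    migrated, failed, attempts = [], [], 0
    for page in pages:
        for _ in range(1 + max_retries):
            attempts += 1
            if try_migrate(page):
                migrated.append(page)
                break
        else:
            # All attempts for this page failed.
            failed.append(page)
    return {
        "migrated": migrated,
        "failed": failed,
        "attempts": attempts,
        "offlined": not failed,
    }


# Eight pages; page 2 fails once and page 5 fails twice before moving.
result = evacuate_block(range(8), make_flaky_migrator({2: 1, 5: 2}))
print(result)
```

In this model the total number of attempts is the page count plus the number of retries, so raising the first-attempt success rate of each page directly reduces the time spent evacuating the block.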

Simple Operation and Easy Maintenance

Memory hot swap is a complex process at the system level. However, it is an easy operation for users.

When KunLun’s PFAE detects a minor exception on a memory module that is likely to escalate into a critical fault, a pre-warning is displayed on the KunLun management user interface (UI). On seeing such a pre-warning, a user can locate the risky memory module from the information provided and click the hot swap button (status indicator) on the risky memory icon to trigger automatic memory hot removal.

Hot swap operation screen

When the memory hot removal process is complete, the KunLun management UI displays a message reminding the user to replace the DIMM. The user only needs to open the cabinet, pull out the faulty memory module, replace the faulty DIMM, insert the memory module back into the cabinet, and click the hot swap button again to trigger memory hot addition. The hot addition then completes automatically. Services are not interrupted at any point during the hot-swap. In addition to memory hot swap, KunLun also supports CPU hot swap.
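
For readers curious about what hot removal and hot addition look like at the OS level, the mainline Linux kernel exposes memory blocks under `/sys/devices/system/memory/`, where each `memoryN/state` file reports `online` or `offline` and accepts those values when written. The sketch below uses only that generic interface. It is an assumption that this is relevant to a given KunLun deployment: the article does not describe how the KunLun management UI drives the OS. The helper names are hypothetical, and writing to the real sysfs tree requires root privileges and a kernel built with memory hotplug support.

```python
from pathlib import Path

# Generic Linux memory-hotplug location; override `root` for testing.
SYSFS_MEMORY = Path("/sys/devices/system/memory")


def list_memory_blocks(root=SYSFS_MEMORY):
    """Return {block_name: state} for every memoryN directory under root."""
    blocks = {}
    for entry in sorted(Path(root).glob("memory[0-9]*")):
        state_file = entry / "state"
        if state_file.is_file():
            blocks[entry.name] = state_file.read_text().strip()
    return blocks


def set_block_state(name, state, root=SYSFS_MEMORY):
    """Request that a memory block go 'online' or 'offline'.

    On a real system the kernel may refuse an offline request, for
    example when pages in the block cannot be migrated.
    """
    if state not in ("online", "offline"):
        raise ValueError(f"unsupported state: {state!r}")
    (Path(root) / name / "state").write_text(state)


if __name__ == "__main__":
    # Read-only listing; safe to run on any Linux host that has the directory.
    for name, state in list_memory_blocks().items():
        print(name, state)
```

Pointing `root` at a scratch directory makes it possible to exercise the helpers without touching a live system.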

(Contributed by Chen Ben, Intelligent Computing Product Line)

Source: Huawei Enterprise Blog