Huawei KunLun Mission Critical Server’s Secret to Eliminate Potential Faults

>Unexpected Downtime Damages the Interests of Enterprises<

If you ask an IT manager at a large enterprise to list the IT issues that concern them most, infrastructure reliability and service continuity are likely to be the top two. The pressure on IT managers stems from the pressure on the enterprise’s own operations. For any enterprise, core service systems, such as financial, ordering, production management, and real-time analysis systems, carry the data and information flows critical to normal operation. They are usually expected to run 24/7 without interruption, because even a single unexpected outage can cause huge economic loss.

According to Information Technology Intelligence Consulting (ITIC)’s survey report on server hardware and operating system reliability, which polled over 800 customers worldwide, an 81% majority of respondents say that a single hour of downtime per year costs their company over $300,000 (USD).

>Find the IT Infrastructure Ideal for Mission-Critical Applications<

To minimize the probability and impact of downtime, enterprises can adopt a deployment mode better suited to core systems, for example, deploying multiple servers redundantly. Although this method increases service availability, it does not reduce the probability that any single machine fails, and it adds overhead in space, power consumption, and management. Even with redundancy, if the failure rate of each server node is too high, failovers between redundant nodes become frequent, and every failover carries a risk of service interruption.
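The trade-off can be made concrete with a back-of-the-envelope calculation. The following is a minimal sketch, assuming independent nodes with a constant failure rate; the function names and the MTBF figures are hypothetical values chosen for illustration, not KunLun specifications.

```python
HOURS_PER_YEAR = 8760


def expected_failovers_per_year(nodes: int, node_mtbf_hours: float) -> float:
    """Expected node failures per year across the cluster.

    Assumes independent nodes with a constant failure rate, and that every
    node failure triggers one failover.
    """
    return nodes * HOURS_PER_YEAR / node_mtbf_hours


def pair_availability(node_availability: float) -> float:
    """Availability of a two-node redundant pair.

    Optimistic model: failures are independent and failover always succeeds
    instantly, so the service is down only when both nodes are down.
    """
    return 1 - (1 - node_availability) ** 2


# Hypothetical MTBF values for a two-node cluster.
for mtbf_hours in (10_000, 100_000):
    failovers = expected_failovers_per_year(2, mtbf_hours)
    print(f"MTBF {mtbf_hours} h: {failovers:.3f} expected failovers per year")
```

Under these assumptions, a two-node cluster built from 10,000-hour-MTBF nodes sees about 1.75 failovers per year, while the same cluster built from 100,000-hour-MTBF nodes sees about 0.18. Redundancy raises availability on paper, but the number of risky failover events still scales with the single-node failure rate.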

Therefore, even in a redundant deployment, it is necessary to increase the reliability of each individual node. In the industry, the term mission-critical server refers to computing hardware that provides extremely high single-node reliability and can host an enterprise’s core services. In the past, these core services ran on UNIX servers. As technology has advanced, more and more of them now run on x86 servers. Huawei’s KunLun mission critical server is a leading example. It provides not only outstanding performance and efficient resource utilization, but also high reliability comparable to that of UNIX servers. With an open architecture, KunLun can also keep pace with innovation in the cloud and big data era, and it has gradually become an ideal choice for enterprises’ core services.

>KunLun RAS 2.0 Technology Overcomes x86 Architecture Limitations<

x86 is an open system architecture, which means that every component of an x86 server solution, such as the server hardware, operating system, and database, can come from a different vendor. This makes it more difficult for x86 server vendors to implement an end-to-end reliability design. Instead, server reliability usually relies solely on the reliability, availability, and serviceability (RAS) capabilities of the x86 processors.

To meet the high reliability requirements of mission-critical applications, KunLun takes a full-stack approach to reliability design. From the hardware layer through the firmware and operating system layers to the application layer, KunLun enhances and coordinates the reliability features of every layer to achieve fault isolation, early warning, hot swap of faulty components, and end-to-end reliability for typical application scenarios, including Oracle database and SAP HANA in-memory computing. This reliability design is named RAS 2.0 technology.

Hardware Layer: The most prominent features of KunLun’s hardware layer are its modular design and its support for both front- and rear-access maintenance. KunLun consists of a system compute enclosure (SCE), a central management enclosure (CMC), and a resource expansion enclosure (REE). The following figure shows the front view and rear view of an SCE.

[Figure: front view and rear view of an SCE]

The modular design of the SCE, CMC, and REE brings several benefits. First, you do not need to open the chassis cover to maintain components, and maintenance does not involve handling cables, which reduces both maintenance time and the risk of misoperation. Second, tools such as screwdrivers are not required, because every module is locked and unlocked by its own mechanical parts. Third, front- and rear-access maintenance facilitates the hot swap of components such as power supply units, fan modules, hard drives, CPU boards, and memory modules.

Firmware Layer: At this layer, the in-band system firmware and the out-of-band management software together form an out-of-band fault management system. This system detects server component faults and collects fault-related information. When a fault occurs, it reports the name of the faulty component and the related information. The management system also supports the Proactive Failure Analysis Engine (PFAE), which continuously analyzes historical fault information. When a component experiences a minor exception that is likely to evolve into a critical fault, the PFAE triggers a warning so that necessary measures, such as isolating the faulty module and replacing it online, can be taken in time.

Because the PFAE predicts faults proactively, it can effectively reduce the risk of serious failures that would lead to system downtime.
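The article does not describe how the PFAE makes its predictions. As a rough illustration of the general idea only, the sketch below uses a common predictive-failure heuristic: it counts minor (corrected) errors per component within a sliding time window and raises a warning once the count reaches a threshold. The class name, the parameters, and the component label `DIMM-07` are all hypothetical and are not part of any Huawei interface.

```python
from collections import defaultdict, deque


class MinorErrorMonitor:
    """Hypothetical predictor: warn when a component logs too many minor
    errors within a sliding time window. Not the PFAE algorithm."""

    def __init__(self, window_seconds: float, threshold: int):
        self.window_seconds = window_seconds
        self.threshold = threshold
        # Maps a component name to the timestamps of its recent minor errors.
        self._events = defaultdict(deque)

    def record(self, component: str, timestamp: float) -> bool:
        """Record one minor error; return True if a warning should be raised."""
        events = self._events[component]
        events.append(timestamp)
        # Discard errors that have aged out of the window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        return len(events) >= self.threshold


# Example: three minor errors on one memory module within an hour.
monitor = MinorErrorMonitor(window_seconds=3600, threshold=3)
for t in (0, 600, 1200):
    warning = monitor.record("DIMM-07", t)
    print(f"t={t}s warning={warning}")
```

A real engine would likely weigh error types, trends, and component history rather than apply a single fixed threshold. The sketch only shows how historical fault data can turn a series of minor exceptions into an early warning before a critical fault occurs.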

Operating System Layer and Application Layer: The reliability of these two layers is built on cooperation with mainstream ecosystems. With KunLun, Huawei is the only vendor in the industry to support online physical removal and insertion of CPUs and memory modules. CPU and memory module hot swap is a complex, system-wide technology that requires support from processors, system firmware, server platforms, operating systems, and databases. Since the launch of KunLun, Huawei has therefore been proactively working with partners to build an open ecosystem and accelerate innovation.

As of today, Huawei (KunLun) and SUSE (Linux) have jointly released a memory module hot swap solution for Oracle database and SAP HANA scenarios. Huawei will continue joint R&D with the two top Linux vendors, SUSE and Red Hat, and expects to implement CPU and memory module hot swap on the latest Intel® Xeon® Scalable processor platforms in the fourth quarter of 2018.

>Safeguarding Enterprises During UNIX-to-x86 Transformation<

Enterprise IT systems are migrating from UNIX to x86 servers, yet the continuity requirements for mission-critical applications have not relaxed. With its open x86 architecture, KunLun helps enterprises unlock their innovation potential. With RAS 2.0 technology, it provides high availability comparable to that of closed-architecture UNIX servers, offering high-quality mission-critical computing resources and always-on services.

Source: Huawei Enterprise Blog