With the rapid development of cloud computing, big data, and AI applications, ever higher demands are being placed on servers and computing capabilities. Global data center construction is advancing in both speed and scale. Deployments of tens or even hundreds of thousands of servers are now common.
According to a Gartner report, global server revenue increased by 25.7% in the fourth quarter of 2017, and the server industry is in a period of rapid growth. As services develop at high speed, IT infrastructure needs to be deployed quickly, brought online, and managed conveniently. Managing massive numbers of servers will become more and more complex, and traditional O&M faces many new challenges.
Challenges to Server Deployment
In data center expansion, migration, and consolidation scenarios, a newly purchased server needs to be assembled and commissioned, allocated network resources, and provisioned with configurations. This onsite work involves hardware installation, software deployment, and technical O&M personnel, and most of the operations are performed manually on site. According to statistics from Huawei IT departments, more than 50% of faults are caused by manual operations. Manual operations are inefficient and error-prone, and they consume extra labor, materials, and time.
Challenges to Energy Consumption Management
According to a Climate Change News report, global data centers accounted for 3% of global power consumption in 2017, and the proportion is expected to reach 20% in 2025. Statistics show that energy consumption accounts for 35% of data center operating expense (OPEX). Rapidly rising OPEX will become a global challenge. Customers’ requirements for energy consumption management are twofold: reliable power management policies that save energy efficiently, and accurate calculation and prediction of energy costs, which is critical to precise data center investment.
Challenges to Fault Prewarning and Diagnostics
In the traditional O&M mode, O&M personnel reactively wait for faults to occur and then rectify them. In this mode, each person can maintain 50–100 servers. As data centers grow, faults will occur more frequently and the associations between them will become more complex, which lowers the efficiency of traditional maintenance even further. In addition, traditional maintenance is based on alarm reporting: problems are noticed and fixed only after critical thresholds are crossed, which in turn leads to service interruption. Against this backdrop, it is difficult to guarantee a user-level service level agreement (SLA) of 99.95% or higher.
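To put the 99.95% SLA figure in perspective, the arithmetic below converts an availability target into a downtime budget. It is a minimal sketch: the function name and the choice of a 365-day year and a 30-day month are assumptions made for illustration, not part of any Huawei tooling.

```python
HOURS_PER_YEAR = 365 * 24   # 8760 hours, ignoring leap years
HOURS_PER_MONTH = 30 * 24   # 720 hours, assuming a 30-day month


def downtime_budget_hours(sla_percent: float, period_hours: float) -> float:
    """Maximum downtime (in hours) that still meets the availability target."""
    return period_hours * (1 - sla_percent / 100)


yearly = downtime_budget_hours(99.95, HOURS_PER_YEAR)
monthly_minutes = downtime_budget_hours(99.95, HOURS_PER_MONTH) * 60

print(f"{yearly:.2f} hours per year")            # 4.38 hours per year
print(f"{monthly_minutes:.1f} minutes per month")  # 21.6 minutes per month
```

With so small a budget, a handful of interruptions that are fixed only after an alarm fires can use up the whole allowance.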
So, how can these challenges be tackled?
Gartner proposed the concept of Algorithmic IT Operations (AIOps), a novel form of intelligent O&M, in 2016. The deployment ratio of AIOps was lower than 5% in 2016 but is expected to reach 25% globally in 2019. In other words, intelligent O&M will become the new normal. The AIOps platform is defined by 11 capabilities: historical data management, stream data management, log data extraction, network data extraction, algorithm data extraction, text and NLP document extraction, automatic model discovery and prediction, exception detection, root cause analysis, on-demand delivery, and software service delivery. These capabilities enable targeted solutions to the preceding pain points and represent the main development direction for server management in massive data centers.
Figure 1: Algorithmic IT Operations (AIOps) overview [Gartner, 2016]
AIOps is a long-term evolution process. It focuses on detection and prediction based on massive machine data and turns reactive O&M into proactive O&M. The optimization is mainly on the software side. However, to deliver a material leap forward in areas such as deployment, energy saving, and fault management, software and hardware synergy is vital.
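The difference between reactive and proactive O&M can be sketched in a few lines. The example below fits a straight line to recent sensor readings and estimates how many sampling intervals remain before the trend reaches a critical threshold, so that a warning could be raised before the alarm fires. The linear-trend model, the function name, and the sample readings are all assumptions chosen for illustration; the article does not specify which prediction algorithms AIOps platforms or Intelligent Servers use.

```python
from typing import Optional, Sequence


def steps_until_threshold(readings: Sequence[float], threshold: float) -> Optional[float]:
    """Estimate the sampling intervals left before the trend reaches the threshold.

    Fits a least-squares line to equally spaced readings.
    Returns 0.0 if the latest reading is already at or above the threshold,
    and None if the trend is flat or falling.
    """
    n = len(readings)
    if n < 2:
        raise ValueError("need at least two readings")
    if readings[-1] >= threshold:
        return 0.0

    mean_x = (n - 1) / 2
    mean_y = sum(readings) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(readings))
    slope = sxy / sxx
    if slope <= 0:
        return None

    intercept = mean_y - slope * mean_x
    crossing_x = (threshold - intercept) / slope
    return max(0.0, crossing_x - (n - 1))


# Hypothetical component temperatures (°C), one sample per interval.
temperatures = [60, 62, 64, 66]
print(steps_until_threshold(temperatures, threshold=80))  # 7.0
```

A threshold-only alarm would stay silent for all four readings above, while the trend estimate already indicates roughly seven intervals of lead time.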
In response, Huawei puts forward the concept of Intelligent Servers. Intelligent Servers integrate intelligent management chips and intelligent algorithms to implement server deployment, fault diagnostics and prediction, energy consumption management, mobile O&M, and version management.
Figure 2: Five major functions of the Intelligent Server
The Intelligent Server is an integrated software and hardware solution combining the O&M platform software, BMC software, and intelligent chips. So, what are the advantages of this holistic software-and-hardware solution?
Compared with traditional servers and OEM servers, Intelligent Servers provide intelligent management functions, such as single-node-level fault prediction and analysis, and intelligent power consumption management. In addition, the GUI is designed to be more user-friendly and intelligent, which simplifies operations, reduces O&M personnel costs, and improves the O&M experience. What’s more, the Intelligent Server allows maintenance personnel to access the server O&M system locally over Bluetooth and Wi-Fi, which greatly simplifies server deployment and fault locating.
For example, in deployment and maintenance scenarios, the Intelligent Server provides a one-click Wi-Fi hotspot button. After arriving at the site, the maintenance engineer can press the Wi-Fi hotspot button, use the mobile app to scan the bar code on the server to access the server O&M network, and then quickly maintain the server enclosure information and provision configurations. The maintenance engineer can also perform assembly and maintenance by following the guidance provided by the mobile app.
Figure 3: Mobile app for intelligent O&M
Compared with intelligent O&M, the Intelligent Server provides a hardware platform that supports feature-rich intelligent management, which greatly complements the intelligent O&M scenarios. In many scenarios, the main bottleneck in the manual operation of O&M personnel is not that the desired information is lost in an ocean of data, but that the hardware itself does not support intelligent management. Intelligent Servers bridge the gap between software and hardware and fundamentally resolve the problems that cannot be addressed by solely relying on the software in some O&M scenarios. In addition, thanks to the improvement in hardware chip capabilities, the servers are capable of providing some intelligent O&M capabilities and therefore make server management more timely and efficient. The hardware information collected by the servers is more comprehensive, which provides a more reliable reference for the O&M platform to make decisions.
For energy consumption management, the Intelligent Server integrates functions such as dynamic CPU frequency modulation, fan speed tuning, and power supply hibernation. When the service load is low at night, users can set the energy consumption mode to the energy-saving mode. The Intelligent Server then dynamically adjusts the CPU frequency and keeps power consumption within a specified value. It can also hibernate some power supply units (PSUs) to further reduce power consumption. When the service load is heavy during the daytime, users can set the energy consumption mode to the high-performance mode. The Intelligent Server then cancels the CPU frequency modulation restriction and the PSU hibernation configuration. In addition, it applies the high-performance heat dissipation specifications to the fans and intelligently associates the energy saving policies. Together, these features can save over 10% of the energy consumed by a single server cabinet. The intelligent power consumption management platform also provides intelligent control of cabinet-level power consumption: it recommends a power capping value based on historical power. In typical service scenarios, the deployment density of a single server cabinet can be increased by more than 10%.
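The two modes and the history-based power cap described above can be sketched as follows. This is an illustration under stated assumptions, not the Intelligent Server's actual interface: the class and function names, the 95th-percentile-plus-5%-headroom rule, and all wattage figures are hypothetical, since the article only says that the cap is recommended from historical power.

```python
import math
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class Mode(Enum):
    ENERGY_SAVING = "energy-saving"
    HIGH_PERFORMANCE = "high-performance"


@dataclass(frozen=True)
class PowerPolicy:
    cpu_frequency_scaling: bool        # dynamic CPU frequency modulation on/off
    psu_hibernation: bool              # whether idle PSUs may hibernate
    power_cap_watts: Optional[float]   # None means no cap is enforced
    fan_profile: str


def policy_for(mode: Mode, power_cap_watts: float) -> PowerPolicy:
    """Map a user-selected mode to the settings the article describes."""
    if mode is Mode.ENERGY_SAVING:
        return PowerPolicy(True, True, power_cap_watts, "energy-saving")
    return PowerPolicy(False, False, None, "high-performance")


def recommend_cap(history_watts: Sequence[float],
                  percentile: float = 95.0,
                  headroom: float = 0.05) -> float:
    """Assumed rule: nearest-rank percentile of past power draw plus headroom."""
    if not history_watts:
        raise ValueError("history_watts must not be empty")
    ordered = sorted(history_watts)
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    return ordered[rank - 1] * (1 + headroom)


def servers_per_cabinet(cabinet_budget_watts: float, per_server_watts: float) -> int:
    """How many servers fit in a cabinet's power budget."""
    return int(cabinet_budget_watts // per_server_watts)


# Made-up per-server power samples in watts.
history = [300, 320, 310, 400, 350, 330, 340, 360, 345, 355]
cap = recommend_cap(history)

print(policy_for(Mode.ENERGY_SAVING, cap))     # night: scaling, hibernation, cap
print(policy_for(Mode.HIGH_PERFORMANCE, cap))  # day: restrictions lifted
print(servers_per_cabinet(10_000, 500), servers_per_cabinet(10_000, cap))
```

In this made-up example, provisioning a 10 kW cabinet by a 500 W nameplate rating fits 20 servers, while provisioning by the recommended cap fits 23. The numbers only illustrate why capping can raise deployment density; they are not the measurements behind the article's figure of more than 10%.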
Figure 4: Intelligent energy consumption management platform
The Intelligent Server inherits all the existing functions of the intelligent O&M platform and provides a new direction for intelligent O&M evolution. With the Intelligent Server solution in place, traditional O&M personnel can be freed from daily mechanical, repetitive, and low-value work. Manual operations can be made smarter and automated as far as possible, which boosts the efficiency of onsite O&M personnel. In addition, intelligent energy consumption and fault management capabilities help meet SLAs and help customers save on OPEX.
Driven by innovation at the core chip level, Huawei Intelligent Servers position customers for future excellence and success with their data centers.
(Contributed by Jason Ding, IT Product Line)
The post Not All Servers Are Called Intelligent Servers appeared first on Huawei Enterprise Blog.