Huawei AI Evolves Again: CANN 3.0 Unleashes a "Computing Beast"

[Xinzhiyuan Editor's Note] Today, AI has entered the stage of full-scale deployment, but a huge gap remains before AI becomes as ubiquitous as water and electricity. To address the problems of high computing cost and low model development efficiency, Huawei designed the heterogeneous computing architecture CANN 3.0.
GPT-3, which recently took the world by storm, has 175 billion parameters, can unlock all kinds of skills, and can converse with humans. GPT-3 is the best example of the rise and rapid development of AI.
But while everyone marveled at it, few realized that training it cost OpenAI an estimated $4.6 million. Had it not been opened to ordinary users in the form of an API, this NLP marvel would have been out of reach for most people.

The high cost of AI training and the expense of computing power mean that development at this scale is the preserve of technology giants showing off their technical strength. In the public's mind, AI can seem like something that merely looks good from a distance.
Today, AI has entered the stage of full-scale deployment, yet a major gap remains before AI can be as ubiquitous as water and electricity. Beyond the high cost, there is still a long way from model development to commercial use. AI algorithms for various scenarios are relatively mature, but migrating and adapting models consumes far too much manpower and resources.
Xu Yingtong, president of Huawei's Ascend Computing Business, noted a common problem in current AI applications: the equipment on the training side and on the inference side are independent of each other, the business process is fragmented, and many links require human intervention, so rising development costs almost offset the convenience AI brings.

Moreover, the diversity of scenarios and devices raises the threshold for integrating AI into practical applications. For example, an AI application developed for a device may be incompatible with the cloud. Is there a general way to achieve efficient development across all scenarios?
Huawei's answer: the heterogeneous computing architecture CANN 3.0, bringing full-scenario AI within reach.
To solve the problems of high computing cost and low model development efficiency, Ascend designed the heterogeneous computing architecture CANN 3.0.
As early as October 10, 2018, when Huawei announced its AI strategy at the HUAWEI CONNECT conference, it released the heterogeneous computing architecture CANN 1.0. CANN 3.0 is the third version; it unifies the programming architecture and achieves full-scenario collaboration across device, edge, and cloud.

CANN 1.0 focused on edge-side inference, making inference in terminal applications faster.
CANN 2.0 connected the data center and the training side, improving training efficiency.
CANN 3.0 is highly scalable and adaptable across all scenarios: developers no longer need to worry about whether the terminal is a phone, a camera, or a robot, nor which operating system the hardware runs. Linux, Android, and HarmonyOS are all fine.
At present, CANN 3.0 covers 10+ operating systems and 14+ types of intelligent terminal devices. Write the code once and reuse it across device, edge, and cloud, greatly improving development efficiency.
So what advantages does CANN 3.0 have over the previous two versions?
Software decoupled from hardware: one set of code for every scenario
High-performance AI development requires software-hardware co-design. For example, Google's TensorFlow performs better on Google's own TPUs than on NVIDIA GPUs. Although this tightly coupled software-hardware mode is highly efficient, it has a drawback: a model that performs brilliantly on a TPU may misfire once the hardware changes.
The code of CANN 3.0 is universal: across device, edge, and cloud, it has no hard dependence on specific training or inference hardware. It provides users with a rich operator library and programming methods for both beginners and experts. Write one set of code and it can be reused on all kinds of hardware while still delivering full performance.
Instruction-level operator development satisfies experts' appetite for performance control
On operator development: CANN 3.0 supports both general and professional development modes to meet the needs of AI developers at different levels.
Ordinary developers are advised to use TBE-DSL. The DSL automatically handles data splitting and scheduling and covers 70% of operators; developers only need to focus on expressing the computation and calling existing operators, cutting operator development time by 70%.
Advanced developers with an uncompromising pursuit of performance can use TBE-TIK for instruction-level programming and tuning, which covers all operators. For developers familiar with the underlying principles, this approach can further tap the computing potential of the hardware.
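The difference between the two modes can be pictured with a toy analogy in plain Python. This is not the actual TBE API — the function names and the 4-element "tile" are invented for illustration; the point is only who controls data splitting:

```python
# Illustrative analogy only (NOT the real TBE-DSL/TBE-TIK API): the same
# element-wise "add" operator expressed at two levels of control.

# DSL-style: declare the computation; splitting/scheduling is automatic.
def vadd_dsl(a, b):
    # The framework decides how to tile and schedule this loop.
    return [x + y for x, y in zip(a, b)]

# TIK-style: the developer controls tiling explicitly, e.g. to match
# the size of an on-chip buffer (here a hypothetical 4-element tile).
def vadd_tik(a, b, tile=4):
    out = []
    for start in range(0, len(a), tile):          # manual data splitting
        block_a = a[start:start + tile]           # "load" one tile
        block_b = b[start:start + tile]
        out.extend(x + y for x, y in zip(block_a, block_b))  # compute tile
    return out

data_a, data_b = list(range(8)), list(range(8))
assert vadd_dsl(data_a, data_b) == vadd_tik(data_a, data_b)
```

Both produce identical results; the instruction-level style simply exposes the tiling knobs that the DSL hides.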
No matter which version you upgrade to, CANN 3.0 code can still be used
You put great effort into optimizing an operator, and then one day CANN is upgraded to 4.0 and your code stops working?
CANN 3.0 takes the full software life cycle into account. At the top level, it provides AscendCL, a unified programming interface with normalized APIs. Even if CANN 3.0 is upgraded to version 4.0 or 5.0, code developed on CANN 3.0 can still be used.
According to Xu Yingtong, after a CANN upgrade, the old code needs at most a recompile, with no modification, which is simply a blessing for developers.
Backward compatibility is a great challenge for CANN's R&D staff. Because CANN 3.0 is backward compatible, users enjoy the latest performance improvements while saving development cost, which keeps the software alive for the long term. "Keep the complexity for ourselves and leave the simplicity to customers" has always been Ascend's product philosophy.
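The mechanism behind this kind of compatibility promise can be sketched generically: applications code against a fixed interface, while the backend behind it is free to change. All names below are invented for illustration; this is the design pattern, not AscendCL itself:

```python
# Hypothetical sketch of a stable, unified API layer: the facade's
# signatures never change across backend versions, so client code
# survives upgrades untouched. Names are invented, not AscendCL.

class BackendV3:
    def run(self, model, data):
        return [x * 2 for x in data]          # stand-in for v3 execution

class BackendV4(BackendV3):
    def run(self, model, data):
        # newer implementation, same observable behaviour
        return list(map(lambda x: x * 2, data))

class StableAPI:
    """The signatures here are normalized and versionless."""
    def __init__(self, backend):
        self._backend = backend

    def execute_model(self, model, data):
        return self._backend.run(model, data)

# Code written against StableAPI keeps working after an upgrade:
for backend in (BackendV3(), BackendV4()):
    api = StableAPI(backend)
    assert api.execute_model("net", [1, 2, 3]) == [2, 4, 6]
```

Swapping BackendV3 for BackendV4 is the analogue of "at most a recompile": the caller's code does not change.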
How is the computing beast unleashed? "Commander" CANN 3.0 is also compatible both upward and downward
Beyond efficient development, CANN also clears the critical path between AI algorithms above and compute optimization below.
TensorFlow's openness, mature community, and support for many different hardware platforms give Google a considerable advantage in AI model development frameworks.
Besides TensorFlow, many vendors use frameworks such as PyTorch, Caffe, and MXNet. PyTorch in particular is on track to overtake TensorFlow; many state-of-the-art models in academia are developed in PyTorch.

This creates a problem: models developed in different frameworks cannot be used across different systems and devices without secondary development. Tools such as ONNX can support models from different frameworks, but these non-native tools run into all kinds of bugs that are hard to trace.
In the laboratory, many SOTA models are written in PyTorch, but if the deployment architecture only supports TensorFlow, rewriting the model is very inefficient.
Yet each framework has its own strengths, such as TensorFlow's stable performance, PyTorch's flexible dynamic graphs, and MXNet's systematic libraries of vision algorithms, so it is best to be compatible with all of them at the AI algorithm level.
Building the core competence of an AI computing platform involves two layers: AI algorithms on top and high-performance computing underneath.
CANN 3.0 can connect to AI algorithms from the various deep learning frameworks and accelerate training and inference across the whole series of Ascend AI chips; it is the real core of Ascend. So how does CANN 3.0 achieve chip acceleration?
NVIDIA's achievements in AI computing over the past few years are largely thanks to acceleration libraries such as CUDA and cuDNN, and CANN is the "CUDA" of the Ascend series.
Through AscendCL, CANN 3.0 provides C++ API libraries for device management, context management, stream management, memory management, model loading and execution, operator loading and execution, and media data processing, so that users can develop deep neural network applications. Users can call these APIs from any third-party framework without worrying about how computing resources are optimized.
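An API of this shape implies a typical resource lifecycle: initialize the runtime, claim a device, create a context and a stream, load and execute a model, then release everything in reverse order. Here is a schematic sketch of that ordering in plain Python with invented names (the real interface is a C++ library, not this code):

```python
# Schematic sketch (invented names, not AscendCL) of the acquire/release
# discipline a device-management API imposes: resources are created
# top-down and destroyed bottom-up, even if execution fails midway.
from contextlib import contextmanager, ExitStack

log = []

@contextmanager
def acquire(name):
    log.append(f"create {name}")      # stand-in for an aclXxxCreate call
    try:
        yield name
    finally:
        log.append(f"destroy {name}") # stand-in for the matching destroy

def run_inference(data):
    with ExitStack() as stack:        # unwinds in LIFO order on exit
        for resource in ("runtime", "device", "context", "stream", "model"):
            stack.enter_context(acquire(resource))
        return [x + 1 for x in data]  # stand-in for model execution

result = run_inference([1, 2, 3])
assert result == [2, 3, 4]
assert log[0] == "create runtime" and log[-1] == "destroy runtime"
```

The LIFO teardown (model before stream, stream before context, and so on) is the invariant such APIs require of callers.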
So how does CANN 3.0 allocate computing resources intelligently?
Neural networks built in frameworks like TensorFlow are computation graphs. In the past, these graphs were executed on the host CPU, which is efficient when resources permit; when resources are limited, the device side, i.e. the accelerator card, must share the work. Ascend's graph compiler now implements whole-graph sinking: both graphs and operators can execute on the device side, cutting the interaction time between the chip and the host CPU. By sinking the whole graph into the device, the computation is completed efficiently in cooperation with the CPU, and the Ascend chip's computing power is fully exploited.

Based on the characteristics of the graph and the allocation of computing resources, CANN can automatically split and fuse graphs to minimize interaction with the host CPU. With less interaction, the computing resources can keep running at high intensity.
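The core idea of such splitting can be shown with a toy partitioner. This is an invented illustration, not CANN's actual algorithm: consecutive operators that share a placement are grouped into one subgraph, so the number of host-device hand-offs is minimized:

```python
# Toy illustration (invented, not CANN's algorithm) of graph splitting:
# merge runs of same-placement operators into subgraphs, so hand-offs
# between host CPU and device happen only at subgraph boundaries.
from itertools import groupby

def partition(ops):
    """ops: list of (name, placement) in topological order."""
    return [(place, [name for name, _ in group])
            for place, group in groupby(ops, key=lambda op: op[1])]

graph = [("conv1", "device"), ("conv2", "device"),
         ("custom_py", "host"),
         ("matmul", "device"), ("softmax", "device")]

subgraphs = partition(graph)
# Five operators collapse into three subgraphs -> only two hand-offs.
assert subgraphs == [("device", ["conv1", "conv2"]),
                     ("host", ["custom_py"]),
                     ("device", ["matmul", "softmax"])]
```

Fewer subgraph boundaries means fewer round trips to the host, which is exactly the "less interaction" the text describes.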

Intelligent optimization of the data pipeline greatly improves data-processing efficiency: data is automatically split and intelligently distributed to keep the utilization of each computing unit as high as possible, thereby improving computing efficiency.
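The essence of a data pipeline is overlap: preprocessing runs concurrently with computation, so the compute unit is never idle waiting for input. A minimal sketch of that producer/consumer pattern with plain Python threads (illustrative only; the stage bodies are stand-ins):

```python
# Minimal sketch of pipelining: a preprocessing stage (producer) feeds a
# bounded queue while the compute stage (consumer) drains it, so the two
# stages overlap in time. Illustrative only, not CANN's pipeline.
import queue
import threading

def preprocess(raw, q):
    for item in raw:
        q.put(item * 10)          # stand-in for decode/resize work
    q.put(None)                   # sentinel: no more data

def compute(q, results):
    while (item := q.get()) is not None:
        results.append(item + 1)  # stand-in for the NPU computation

raw, results = [1, 2, 3], []
q = queue.Queue(maxsize=2)        # small buffer keeps both stages busy
producer = threading.Thread(target=preprocess, args=(raw, q))
consumer = threading.Thread(target=compute, args=(q, results))
producer.start(); consumer.start()
producer.join(); consumer.join()
assert results == [11, 21, 31]
```

The bounded queue is the key design choice: it lets the stages run ahead of each other just enough to hide latency without unbounded buffering.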
Besides automatic graph compilation and graph splitting and fusion, CANN 3.0's library of 1000+ operators gives your neural network an "instant" speed boost.
By comparison, NVIDIA's cuDNN has only 100-odd operators. CANN 3.0 not only includes the commonly used Caffe and TensorFlow operators, but also provides standalone acceleration libraries, callable through the unified AscendCL programming interface, such as a matrix multiplication interface.
CANN 3.0 now has a complete architecture and feature set. It provides drivers adapted to different hardware and operating systems, and supports heterogeneous communication between the AI chip and the CPU. Internally, it manages low-level resources such as streams and memory, and with its rich acceleration libraries, it supports general computation on operators, scalars, and vectors. It can preprocess image and video data efficiently, and its execution engine guarantees the execution of deep neural network computations.
Giving AI hardware wings: CANN 3.0 opens Pandora's box
With CANN 3.0 behind it, the Ascend AI inference card Atlas 300I shows greatly improved performance in mainstream inference scenarios.
Multi-channel HD video analysis is the perfect scenario for verifying inference performance: HD video itself carries heavy traffic, and fusing multiple channels tests a computing platform's concurrency. Measurements show that a single Atlas 300I inference card can process 80 channels of 1080p, 25 FPS HD video simultaneously, twice the channel count of comparable inference cards currently on the market.
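A quick back-of-the-envelope calculation makes the workload concrete (the channel and frame-rate figures are taken directly from the measurement above; the aggregate number is simple arithmetic):

```python
# Aggregate workload implied by the Atlas 300I measurement quoted above.
channels = 80            # concurrent 1080p streams on one card
fps = 25                 # frames per second per stream
frames_per_second = channels * fps
# One card must decode and analyze 2000 frames every second.
assert frames_per_second == 2000
```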

Scenarios such as traffic and security need to process far more video channels at once, from a few hundred to tens of thousands. If a single card can handle more tasks, the cost advantage grows with the scale of deployment, and less hardware makes deployment easier. That is why many AI vendors are building high-performance video analysis solutions on Ascend AI inference cards.
On the training side, Huawei also has a high-density "computing beast". Measurements show that the Atlas 800 training server outperforms comparable training servers in the industry when training multiple models, leading the industry by 2.5x in average performance.

Clearly, Ascend training and inference cards are formidable "solo soldiers" in combat. Assembled together, can they become a special force of computing power?
"Commander" CANN says: yes.
Constrained by inter-machine communication, when multiple machines handle one task simultaneously, overall performance drops sharply if computing resources are poorly allocated. CANN's job is to assign each "soldier" its task so that they cooperate efficiently. Practice has proved it: through the joint "command" optimization of the L2 network and CANN, the Ascend AI cluster achieves a 1+1>2 performance gain, far ahead of competitors.
CANN 3.0 not only lowers the difficulty of developing applications on Ascend chips across many fields, but also provides excellent middleware and base libraries to enable vendors of all kinds. Still, facing strong competition from international giants such as Google, NVIDIA, and Intel, survival requires building a complete ecosystem, contributing more computing power to academia and industry, and creating more value for partners.
CANN "does not fight alone", plus the developer's AK47
CANN is not alone; a computing architecture by itself cannot establish an AI computing ecosystem.

In March of this year, Huawei open-sourced MindSpore, an AI framework supporting all device-edge-cloud scenarios that can be widely used in AI fields such as computer vision and natural language processing.
MindSpore can significantly reduce training time and cost, achieving maximum efficiency with minimum resources. It natively supports Ascend AI processors, making AI model development more concise and efficient.
At the recent HAI conference, Huawei officially released the Ascend application enablement suite MindX, which includes the Atlas deep learning platform MindX DL, the Atlas intelligent edge platform MindX Edge, the optimized model library ModelZoo, and various industry SDKs.

Through unified management and scheduling of data center equipment and computing resources, MindX DL can quickly build a commercial deep learning system on a computing cluster. Xu Yingtong said that users can meet their business needs and go online quickly with simple secondary development.
MindX Edge is an inference platform that enables lightweight deployment and supports various forms of inference hardware such as cameras and drones.
ModelZoo and the industry SDKs contain a large number of mainstream models, helping developers quickly build AI applications for multiple scenarios.
With so many platforms and tools, from models to operators to industry SDKs, the development advantages of CANN, MindSpore, and MindX would evaporate if every link required its own development environment. So Huawei launched the full-process toolchain MindStudio 2.0, with which all development work can be completed in one environment, with one-click deployment.
As Zhou Bin, CTO of Huawei's Ascend Computing Business, put it: "MindStudio is the developer's AK47"! You can throw away your old box cannon.
According to China's New Generation Artificial Intelligence Development Plan, by 2030 the scale of AI-related industries will exceed 10 trillion yuan. At such an industrial scale, the required computing power will far exceed that of the 175-billion-parameter GPT-3, and the Ascend AI computing platform will fill this "computing vacuum", empowering thousands of industries and providing ubiquitous computing power for AI. The future of Ascend AI is promising!
