ANALYSIS Arm this week announced the availability of new high-end CPU and GPU core designs, ready to be built into system-on-chips for laptops, smartphones, and similar personal devices. These cores are expected to power next-generation Android phones through at least the end of 2024.
The announcements touched on a variety of topics, some obvious marketing, some not. Here are the highlights as we see them.
The new cores themselves
Arm announced the 64-bit Armv9.2 Cortex-X925 CPU core, which succeeds last year’s Cortex-X4. The X925 can clock up to 3.8GHz, can target 3nm process nodes, and, according to Arm, executes instructions roughly 15 percent faster than the X4 on a level playing field.
We’re told the CPU sports various architectural improvements, such as doubled L1 instruction and data cache bandwidth, a doubled instruction window, better branch prediction – one of the main drivers of performance – and a wider microarchitecture (eg, four rather than three load pipelines, doubled integer multiplication execution, and larger SIMD/FP issue queues). All the things CPU designers get excited about. The bottom line for users is that Arm believes X925-powered devices in real-world use will see up to a 36 percent increase in peak single-core performance compared to last year’s hardware, and about a 30 percent increase in average performance across a mix of workloads.
The X925 is intended to be the main application core, or cores, in a big CPU cluster of up to 14 cores total in future devices. How that cluster is configured is up to the system-on-chip designer licensing the technology from Arm. Other CPU cores in the lineup could include the new mid-range Cortex-A725 and the smaller, more efficient A520. The X925 can have up to 3MB of private L2 cache, while the A725 can go up to 1MB of L2. The cluster management system has been tuned to deliver power savings, too, we’re told.
Next up is the new Immortalis-G925 GPU that chip designers can license and add to their processors; a 14-core G925 cluster is said to deliver roughly 30 percent or more performance over its 12-core G720 predecessor. The GPU and its drivers are said to be optimized to boost machine-learning tasks in games and graphics applications, especially those built using Unity.
The G925, according to Arm, features some interesting hardware-level acceleration to reduce the amount of work CPU-based rendering threads have to do; this includes culling objects on the GPU, avoiding the need to draw things to the screen that would be hidden anyway, and similarly improved hidden-surface removal. This should improve performance and reduce power usage, which is good for battery-powered gear. There are also optimizations to hardware ray tracing, support for up to 24 GPU cores in a cluster, and improvements to the tiler and job dispatching to take advantage of that increase in GPU core count.
Overall, it’s more CPU and GPU core designs from Arm with the usual promises of increased performance and efficiency, meaning the next batch of Android phones will – among other things – run faster and not eat the battery as much. We’ll be waiting for independent reviews and comparisons of real devices.
Physical implementations
Normally, system-on-chip designers license cores and other parts from Arm to integrate into their processors; then, after performing rounds of testing, verification, and optimization, those chip designers pass the final layouts to a factory to be manufactured and put into devices.
Last year Arm began offering pre-baked designs – physical implementations – of its cores that had already been through optimization and validation with selected fabs; these designs were offered as Neoverse Compute Subsystems for datacenter-grade processors. This was pitched as a way for server-chip designers to get a jump start on creating high-performance components.
Now Arm has brought that shake-and-bake approach to personal or client devices, and will offer full physical implementations of the new Cortex CPU and GPU designs above under the “Compute Subsystems for Client” banner. These designs were drawn up with the help of TSMC and Samsung, specifically targeting those fabs’ 3nm process nodes. Again, the idea is that chip designers will license these physical implementations to include in their own processors, and by using TSMC or Samsung will get a head start on creating competitive, high-end processors for PCs and mobiles.
It’s also necessary, in Arm’s mind, because scaling below 7nm opens up engineering challenges that can’t be easily solved by system-on-chip designers alone. The SRAM that makes up on-chip memory, and the minute wires that carry signals from one part of a die to another, don’t shrink as readily at 3nm as they did at 7nm, or so we’re told. If you don’t get the scaling right at the microarchitecture level, the resulting chip may not perform as well as expected.
This has prompted Arm to provide these optimized physical implementations of its cores at 3nm, with the help of the fabs themselves, to spare processor designers what Arm staff described as a pain point in reaching 3nm. It’s a step closer to Arm fully designing entire chips for its customers, though we have a feeling the business isn’t ready or willing to enter that kind of territory just yet.
We understand it is neither required nor necessary for Arm’s licensees to use these compute subsystems; they can license and integrate the cores as they always have, but they’ll have to do all the tuning and optimization themselves, and find a way to overcome the 3nm scaling issues without hampering core performance. Nor are licensees required to use Arm’s GPU if they pick its CPU cores; we’re told there is no lock-in or similar situation here.
As we said, this is interesting, but not entirely revolutionary: Arm already offers this kind of pre-baked design IP for Neoverse. It’s just extending that access to client-grade chips now.
Who needs special AI accelerators?
This is where things start to get a little messy, and where Arm has to position itself carefully. Arm licenses its CPU and GPU designs to system-on-chip designers, who can themselves include in their processors their own hardware acceleration units for AI code. These units typically speed up the matrix multiplication and other operations essential to running neural networks, ideally offloading that work from the CPU and GPU cores, and are often called NPUs, or neural processing units.
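To make that concrete: virtually all of this workload boils down to dense matrix multiplication. Below is a purely illustrative, naive sketch in C of that core loop; everything discussed from here on – NPUs, GPUs, and the CPU extensions covered later – is about running this one operation, at scale, as fast and as efficiently as possible.

```c
/*
 * Purely illustrative: the dense matrix multiply at the heart of
 * neural-network inference, written as a naive loop in portable C.
 * This is the kind of work an NPU – or a CPU's matrix extensions –
 * exists to chew through far faster than scalar code.
 */
#include <stddef.h>

/* C[i][j] = sum over p of A[i][p] * B[p][j], all row-major. */
void matmul_f32(const float *a, const float *b, float *c,
                size_t m, size_t n, size_t k)
{
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; p++)
                acc += a[i * k + p] * b[p * n + j];
            c[i * n + j] = acc;
        }
}
```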
Arm’s licensees, from Qualcomm to Google, like to put their own AI acceleration into their processors, as it helps those designers differentiate their products from one another. And Arm doesn’t want to step on anyone’s toes too much by publicly stating it’s not a fan of that custom acceleration. Arm staff repeatedly emphasized to us that the company is not anti-NPU.
But.
Arm told us that, at least on Android, 70 percent of AI inference performed by apps typically runs on a device’s CPU cores – not the NPU, if present, nor the GPU. Most application code simply dumps its neural network and other ML operations onto the CPU cores. There are a number of reasons why this happens; we assume one is that app makers don’t want to make any assumptions about the hardware present on a device.
If it’s possible to use a framework that automatically detects available acceleration and uses it, great, but generally speaking, the common denominator is the CPU. Naturally, first-party apps, such as Google’s own mobile software, are expected to use whatever built-in acceleration is on hand – Google’s Tensor-branded NPUs in its Pixel range of phones, for instance.
And here’s the kicker: The Arm staff we spoke to want to see 80 to 90 percent of AI inference running on CPU cores. For one thing, this would stop third-party apps from missing out on the acceleration that first-party apps enjoy. That’s because, crucially, this approach simplifies the environment for developers: It’s OK to run AI work on CPU cores, because modern Arm CPU cores, such as the new Armv9.2 Cortex designs above, include acceleration for AI operations at the ISA level.
Specifically, we’re talking about Armv9’s Scalable Matrix Extension 2 (SME2) and Scalable Vector Extension 2 (SVE2) instructions.
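For the curious, here is roughly how software can check for those extensions at runtime. A minimal sketch, assuming a 64-bit Arm Linux or Android target with reasonably recent kernel headers (the SME and SME2 capability macros are newer than the SVE2 one, hence the guards):

```c
/*
 * A minimal sketch, assuming a 64-bit Arm Linux or Android target:
 * ask the kernel which instruction-set extensions the CPU actually
 * has, via the auxiliary vector's hardware-capability bits. The
 * HWCAP2_SME and HWCAP2_SME2 macros only exist in fairly recent
 * kernel headers, hence the #ifdef guards.
 */
#include <stdio.h>
#include <sys/auxv.h>

#if defined(__aarch64__)
#include <asm/hwcap.h>
#endif

#ifndef AT_HWCAP2
#define AT_HWCAP2 26 /* fallback for older C libraries */
#endif

int main(void)
{
#if defined(__aarch64__)
    unsigned long caps = getauxval(AT_HWCAP2);

#ifdef HWCAP2_SVE2
    printf("SVE2: %s\n", (caps & HWCAP2_SVE2) ? "yes" : "no");
#endif
#ifdef HWCAP2_SME
    printf("SME:  %s\n", (caps & HWCAP2_SME) ? "yes" : "no");
#endif
#ifdef HWCAP2_SME2
    printf("SME2: %s\n", (caps & HWCAP2_SME2) ? "yes" : "no");
#endif
#else
    puts("Not a 64-bit Arm target");
#endif
    return 0;
}
```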
Arm really wants chip designers to migrate to Armv9, which brings more neural-network acceleration to the CPU side. And that’s partly why Arm has this beef with Qualcomm, which is sticking with Armv8 (with NEON) and custom NPUs for its latest Nuvia-derived Snapdragon system-on-chips. On one hand you have the likes of Apple using Armv9 and SME2 in its latest M4 chips, and on the other Qualcomm and others continuing with NPUs. Arm would be happier without this fragmentation going forward.
And so that brings us to KleidiAI, a handy open source library Arm has made available, still in development, and said to be making its way upstream into projects such as the llama.cpp LLM inference engine. It provides a standard interface to whatever CPU-level acceleration is available on modern Arm architectures. It’s best illustrated with this informative slide:
Informative Arm-provided slide summarizing KleidiAI – Click to enlarge
The idea, ultimately, is that app developers won’t have to adopt any new frameworks or APIs, nor make any assumptions. They just keep using the engines they’re already using; hopefully those engines will incorporate KleidiAI so that the appropriate CPU-level acceleration is automatically selected at runtime depending on the device in use, and AI operations are handled efficiently by the CPU cores without needing to offload that work to a GPU or NPU.
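Under the hood, that kind of automatic selection boils down to runtime dispatch: probe the CPU once, then route work through the fastest micro-kernel it can run. What follows is an illustrative sketch of that general pattern only – not KleidiAI’s actual interface – with stub kernels standing in for real optimized code:

```c
/*
 * An illustrative sketch of the general pattern only -- not KleidiAI's
 * actual interface: probe the CPU once, pick the fastest micro-kernel
 * it can run, and call it through a function pointer from then on.
 * The kernel variants here are stubs standing in for real optimized code.
 */
#include <stddef.h>
#include <sys/auxv.h>
#if defined(__aarch64__)
#include <asm/hwcap.h>
#endif

#ifndef AT_HWCAP2
#define AT_HWCAP2 26 /* fallback for older C libraries */
#endif

typedef void (*matmul_fn)(const float *, const float *, float *,
                          size_t, size_t, size_t);

/* Portable fallback: the same plain-C loop sketched earlier. */
static void matmul_portable(const float *a, const float *b, float *c,
                            size_t m, size_t n, size_t k)
{
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; p++)
                acc += a[i * k + p] * b[p * n + j];
            c[i * n + j] = acc;
        }
}

/* Stubs: a real library would implement these with NEON or SME2
 * instructions; here they simply reuse the portable loop. */
static void matmul_neon(const float *a, const float *b, float *c,
                        size_t m, size_t n, size_t k)
{
    matmul_portable(a, b, c, m, n, k);
}

static void matmul_sme2(const float *a, const float *b, float *c,
                        size_t m, size_t n, size_t k)
{
    matmul_portable(a, b, c, m, n, k);
}

/* Called once at startup: choose the best kernel this CPU can execute. */
matmul_fn select_matmul(void)
{
#if defined(__aarch64__)
#ifdef HWCAP2_SME2
    if (getauxval(AT_HWCAP2) & HWCAP2_SME2)
        return matmul_sme2; /* matrix extension available */
#endif
    return matmul_neon;     /* NEON is mandatory on AArch64 */
#else
    return matmul_portable; /* plain C everywhere else */
#endif
}
```

The app, or the ML engine it embeds, never needs to know which variant it got – which is exactly the no-assumptions property Arm is selling here.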
We are led to believe that offloading that work to SME2 or SVE2 is preferable to plain NEON.
Arm says it is not opposed to NPUs and can see the benefit of offloading certain tasks to custom units. But our impression is that Arm is somewhat irked by the hype over AI accelerators, and by the notion that AI inference can only be done properly by custom units.
For 90 percent of applications, Arm would prefer you use its CPU cores and extensions like SME2 to run your neural networks. And that means more chip designers licensing more modern CPU cores from Arm, natch. ®