AI engine

AI engine is a computing architecture created by AMD (formerly by Xilinx, which AMD acquired in 2022[1]). It is commonly used for linear algebra operations[2], such as matrix multiplication, artificial intelligence[3][4] workloads, digital signal processing[5], and, more generally, high-performance computing[6][7]. The first products containing AI engines were the Versal adaptive compute acceleration platforms[8], which combine scalar, adaptable, and intelligent engines, all connected through a Network on Chip (NoC)[9].
AI engines have evolved significantly to keep pace with the changing demands of modern workloads, especially AI applications. The basic architecture of a single AI engine integrates a vector processor and a scalar processor, offering Single Instruction Multiple Data (SIMD)[10][11] capabilities. In terms of products, AI engines are today integrated with many other architectures, such as FPGAs, CPUs, and GPUs, forming a range of heterogeneous high-performance computing platforms employed in many domains[12][13][14].
"AI" does not stand for artificial intelligence or adaptable intelligent. Indeed, as specifically asserted by the company support, they do not mean any acronym for the AI word[15].
History
The AMD AI engines were originally released by Xilinx, Inc., an American company notable for its contributions to the field of FPGAs.[16] Their initial goal was to accelerate signal processing and, more generally, applications where data parallelism could offer significant improvements. AI engines were initially released alongside an FPGA layer in the novel Versal platforms[8]. The first systems, the VCK190 and VCK5000, were built around the VC1902 device, whose AI engine layer contained 400 AI engines. For connectivity, this architecture class relied on an innovative Network on Chip, a high-performance interconnect devised to become the core connectivity of modern FPGA fabrics[9].
In 2022, the AI engine project evolved when Xilinx was officially acquired by AMD[1], another American company already well established in the computing architecture market. The AI engines were integrated with other computing systems to target a wider range of applications, with particular benefits for AI workloads. Indeed, even though the Versal architecture proved powerful, it was complicated and unfamiliar to a large segment of the academic and industrial community[12]. For this reason, AMD, along with third-party developers, began releasing improved toolsets and software stacks aimed at simplifying the programming challenges posed by the platform, targeting productivity and programmability[17][18][19][20].
Aware of the needs of AI workloads, in 2023 AMD announced the AI engine ML (AIE-ML)[21], the second generation of the architecture. It added support for AI-specific data types such as bfloat16[22], a common data type in deep learning applications. This version retained the vector processing capabilities of the first generation but enlarged the local memory to support more intermediate computations[23]. Starting from this generation, AMD integrated AI engines with other processing units, such as CPUs and GPUs, in its Ryzen AI processors. In such systems, AI engines are usually referred to as Compute Tiles: self-contained processing blocks designed to efficiently execute AI and signal processing workloads. These blocks are integrated with two other types of tiles[17][24], namely the Memory Tile and the Shim Tile. The array combining the three interconnected kinds of tiles is named XDNA[25], and its first generation, XDNA 1, was released in Ryzen AI Phoenix PCs. Alongside this release, AMD continued its research on programmability, releasing Riallto as an open-source tool[26].
On a similar path, between late 2023 and early 2024, AMD announced XDNA 2, along with the Strix series of Ryzen AI architectures[27][28]. Unlike the first generation of XDNA architectures, the second one offers more units to target the massive workloads of ML systems. Continuing its efforts on the programmability side, AMD released the open-source Ryzen AI Software toolchain, which includes the tools and runtime libraries for optimizing and deploying AI inference on Ryzen AI PCs[25].
Lastly, as neural processing and deep learning applications spread across different domains, researchers and industry increasingly refer to XDNA architectures as Neural Processing Units (NPUs). The term, however, covers all architectures specifically designed for deep learning workloads[29], and several companies, such as Huawei[30] and Tesla[31], have proposed their own alternatives[30][31].
Hardware architecture
AI engine tile
A single AI engine is a 7-way VLIW[11][32] processor that offers vector and scalar capabilities, enabling parallel execution of multiple operations per clock cycle. The architecture includes a 128-bit wide vector unit capable of SIMD (Single Instruction, Multiple Data) execution, a scalar unit for control and sequential logic, and a set of load/store units for memory access. The maximum vector register size is 1024 bits, so the number of elements per vector depends on the vector data type[32].
In the first generation, each AI engine tile has 32 KB of data memory for partial computations and 16 KB of program memory[32].
AI engines are statically scheduled architectures. As widely studied in the literature, static scheduling suffers from code explosion, so AI engine kernels may require manual code optimizations to handle this side effect[20][11].
The main programming language for a single AI engine is C++, used both to declare the connections among multiple engines and to express the kernel logic executed by a specific AI engine tile[33], as in the sketch below. Different toolchains can, however, offer support for other programming languages, targeting specific applications or offering automation[20].
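As a hedged illustration of a tile-level kernel, the sketch below performs an element-wise vector multiplication using AMD's AIE C++ API. The kernel name, buffer length, and data types are illustrative assumptions, not taken from official documentation.

```cpp
#include <aie_api/aie.hpp>  // AMD AIE vector API

// Hypothetical kernel: element-wise product of two int16 buffers.
// The 256-element length and all names are illustrative only; buffer
// dimensions would normally be fixed in the enclosing AIE graph.
void elementwise_mul(adf::input_buffer<int16>& in_a,
                     adf::input_buffer<int16>& in_b,
                     adf::output_buffer<int16>& out) {
    auto pa = aie::begin_vector<32>(in_a);  // 32 x int16 per vector
    auto pb = aie::begin_vector<32>(in_b);
    auto po = aie::begin_vector<32>(out);
    for (unsigned i = 0; i < 256 / 32; ++i) {
        aie::vector<int16, 32> va = *pa++;
        aie::vector<int16, 32> vb = *pb++;
        auto acc = aie::mul(va, vb);        // SIMD multiply into the
                                            // native wide accumulator
        *po++ = acc.to_vector<int16>(0);    // narrow back to int16
    }
}
```

The loop body maps naturally onto the VLIW issue slots: the vector loads, the multiply, and the store can be scheduled by the compiler in the same cycle.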
First generation - the AI engine layer
In the first generation of Versal systems, each AI engine is connected to multiple other engines through three main interfaces, namely the cascade, memory, and stream interfaces. Each one represents a possible mechanism for an AI engine to communicate with the others[6].
The AI engine layer of the first Versal systems combined 400 AI engines[34]. Each AI engine has a 32 KB memory that can be extended up to 128 KB by using the memory of neighbouring engines. This reduces the number of engines available as actual compute cores but ensures an enlarged data memory[8][20].
Each AI engine can execute an independent function, or multiple functions by leveraging time multiplexing. The programming structure used to describe AI engine instantiation, placement, and connection is named the AIE graph, sketched below. The official programming model suggested by AMD requires writing such a file in C++. However, different programming toolchains, from both companies and research, can support different alternatives to improve programmability and/or performance[20][24].
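As a hedged sketch of this structure, the following AIE graph connects two kernels into a pipeline using the ADF C++ classes; the graph name, kernel names, and file names are hypothetical.

```cpp
#include <adf.h>
#include "kernels.h"  // hypothetical header declaring stage_one and stage_two

// Illustrative two-kernel pipeline; names are assumptions, not AMD examples.
class simple_graph : public adf::graph {
public:
    adf::input_plio  in;
    adf::output_plio out;
    adf::kernel      k1, k2;

    simple_graph() {
        // PLIO ports bridge the AI engine array and the programmable logic.
        in  = adf::input_plio::create(adf::plio_32_bits, "data/input.txt");
        out = adf::output_plio::create(adf::plio_32_bits, "data/output.txt");

        k1 = adf::kernel::create(stage_one);
        k2 = adf::kernel::create(stage_two);
        adf::source(k1) = "stage_one.cc";  // kernel implementation files
        adf::source(k2) = "stage_two.cc";

        adf::connect(in.out[0], k1.in[0]); // declare the dataflow edges
        adf::connect(k1.out[0], k2.in[0]);
        adf::connect(k2.out[0], out.in[0]);

        // Fraction of a tile's cycle budget each kernel may use; values
        // summing to at most 1.0 let the compiler time-multiplex kernels
        // on a single tile.
        adf::runtime<adf::ratio>(k1) = 0.9;
        adf::runtime<adf::ratio>(k2) = 0.9;
    }
};
```

The runtime ratio annotations are what enable the time multiplexing described above, while the connect calls give the compiler the dataflow it uses for placement and routing.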
To compile the application, the original toolchain relies on a closed-source AI engine compiler that automatically performs placement and routing, although custom placement indications can be given when writing the AIE graph[35].
As the AI engines were initially integrated in Versal systems only, thus combining AI engines with FPGA capabilities and Network on Chip connectivity, this architectural layer also offers a limited number of direct communications with both of them. Such communications need to be specified both in the AIE graph, to ensure a correct placement of the AI engines, and during the system-level design[20][7].
Second generation - the AI engine ML
The second generation of AMD's AI engines, the AI engine ML (AIE-ML), introduces some architectural modifications with respect to the first generation, focusing on performance and efficiency for machine-learning workloads[23].
AIE-ML offers almost twice the compute density per tile, improved memory bandwidth, and native support for data formats optimized for AI inference workloads, such as INT8 and the bfloat formats. These optimizations allow the second-generation engine to deliver up to three times more TOPS per watt than the original AI engine, which was primarily built for DSP-heavy workloads and required explicit SIMD programming and hand-coded data partitioning[3].
Recent publications from researchers and institutions[36] confirm that AIE-ML offers better scalability, more on-chip memory, and more computational power[3], making it better suited for modern edge-based ML inference workloads. These advances collectively counter the limitations of the first generation[23].
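As a hedged sketch of how the new data types surface to the programmer, the kernel below multiplies two bfloat16 buffers with the AIE C++ API; the kernel name and buffer length are illustrative assumptions.

```cpp
#include <aie_api/aie.hpp>

// Hypothetical AIE-ML kernel using the natively supported bfloat16 type.
// Names and the 256-element length are illustrative, not from AMD docs.
void bf16_mul(adf::input_buffer<bfloat16>& in_a,
              adf::input_buffer<bfloat16>& in_b,
              adf::output_buffer<bfloat16>& out) {
    auto pa = aie::begin_vector<16>(in_a);
    auto pb = aie::begin_vector<16>(in_b);
    auto po = aie::begin_vector<16>(out);
    for (unsigned i = 0; i < 256 / 16; ++i) {
        aie::vector<bfloat16, 16> va = *pa++;
        aie::vector<bfloat16, 16> vb = *pb++;
        // Products are kept in a wider floating-point accumulator
        // before being narrowed back to bfloat16.
        aie::accum<accfloat, 16> acc = aie::mul(va, vb);
        *po++ = acc.to_vector<bfloat16>();
    }
}
```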
According to the company's official documentation, the two architectures share some key similarities and differ in several respects[23]:
| Similarities between AIE-ML and AIE | Differences between AIE-ML and AIE |
| --- | --- |
| Same process, voltage, frequency, clock, and power distribution | AIE-ML doubles compute and memory, and adds a processor bus for direct read/write access to local tile memory-mapped registers |
| One VLIW SIMD processor per tile | AIE-ML increases the local memory capacity to 64 KB |
| Same debug functionality | AIE-ML improves power efficiency (TOPS/W) |
| Same connectivity with PL and NoC | AIE-ML improves the stream switch functionality, performing source-to-destination parity checks and deterministic merge |
| Same bandwidth for stream interconnect | AIE-ML features a grid-array architecture supporting both vertical (top-to-bottom) and horizontal (left-to-right) 512-bit cascades, versus the 384-bit horizontal-only cascade of AIE |
XDNA 1
The XDNA is the hardware layer combining three types of tiles[24][25]:
- The Compute Tile (AI engine ML) is responsible for executing vector and scalar operations.
- The Memory Tile provides 512 KB of local memory and performs pattern-specific data movements to serve Compute Tile fetch requests.
- The Shim Tile handles the interaction with host memory and controls the data exchanges between Memory and Compute Tiles.
The XDNA architecture is combined with other architectural layers, such as CPUs and GPUs, in the Ryzen AI Phoenix architectures, composing the AMD product line for energy-efficient inference and AI workloads[24].
XDNA 2
The second generation of XDNA layers is integrated within the Ryzen AI Strix architecture, and official documents from the producer describe it as specifically tailored for LLM inference workloads[25].
Tools and programming model
The main programming environment for AI engines officially supported by AMD is the Vitis flow, which uses the Vitis toolchain to program the hardware accelerator[33][37][7].

Vitis offers support for both hardware and software developers using a unified development environment, including high-level synthesis, RTL-based flows, and domain-specific libraries[38]. Vitis enables applications to be deployed onto heterogeneous platforms, including AI engines, FPGAs, and scalar processors[38].
Newer architectures are instead moving towards a design approach that uses Vitis for hardware and IP design while relying on Vivado for system integration and hardware setup. Vivado[39], also part of the AMD toolchain ecosystem, is primarily used for RTL design and IP integration, and offers a GUI-based design environment to create block designs and manage synthesis, implementation, and bitstream generation[39].
Within the AI engine layer, the main programming language for a single AI engine remains C++, used both for the connection declarations among multiple engines and for the kernel logic executed by a specific AI engine tile[33].
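On the host side, a compiled AIE graph can be controlled at runtime through the XRT library. The following minimal sketch uses XRT's experimental graph API; the device index, xclbin file name, and graph name are illustrative assumptions.

```cpp
#include <xrt/xrt_device.h>
#include <experimental/xrt_graph.h>

int main() {
    // Open the accelerator and load a compiled design; the file and
    // graph names below are hypothetical.
    xrt::device device{0};
    auto uuid = device.load_xclbin("aie_design.xclbin");

    // Attach to the AIE graph embedded in the design and execute it.
    xrt::graph graph{device, uuid, "simple_graph"};
    graph.run(16);   // launch 16 graph iterations
    graph.end();     // block until the iterations complete, then close
    return 0;
}
```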
Research toolchains
In parallel with the company's efforts in proposing programming models, design flows, and tools, researchers have also proposed their own toolchains targeting programmability, performance, or simplified development for a subset of applications[20][40][24][19].
Some of the main research toolchains are briefly described below[41][20][40][19].
- IRON is an open-source toolchain developed by AMD in collaboration with several researchers. The IRON toolchain uses MLIR as its intermediate representation[41]. At the user level, IRON offers a Python API for placing and orchestrating multiple AI engines. Such Python code is then translated into MLIR using one of two possible backends: a Vitis-based backend and an open-source backend using the Peano compiler[24]. IRON still relies on C++ for kernel development, supporting all the APIs of the standard AI engine kernel development flow[24].
- ARIES (An Agile MLIR-Based Compilation Flow for Reconfigurable Devices with AI Engines) presents a high-level, tile-based programming model and a unified MLIR intermediate representation encompassing both the AI engines and the FPGA fabric. It represents task-level, tile-level, and instruction-level parallelism in MLIR and accommodates global and local optimization passes. ARIES generates compact C++ code for AI engine kernels and data-movement logic, allowing kernel specification through Python[20].
- EA4RCA targets a specialized subclass of algorithms, the regular Communication-Avoiding algorithms. EA4RCA introduces a design environment optimized for the Versal heterogeneity, emphasizing AI engine performance and high-speed data streaming abstractions, so that algorithms exhibiting regular communication patterns can make the most of the parallelism and memory hierarchy of the Versal platform[40].
- CHARM is a framework for composing multiple diverse matrix-multiplication accelerators that work concurrently on different layers within one application. CHARM includes analytical models that guide design space exploration to determine accelerator partitions and layer scheduling[19].
See also
- Central processing unit
- Field programmable gate arrays
- Flynn's taxonomy
- Hardware acceleration
- Neural processing unit
- NVIDIA deep learning accelerator
- Vivado
References
[edit]- ^ a b "AMD Completes Acquisition of Xilinx". Advanced Micro Devices, Inc. 2022-02-14. Retrieved 2025-07-08.
- ^ "Developing a BLAS library for the AMD AI Engine Extended Abstract". arxiv.org. Retrieved 2025-07-09.
- ^ a b c Mhatre, Kaustubh; Taka, Endri; Arora, Aman (2025-04-15), GAMA: High-Performance GEMM Acceleration on AMD Versal ML-Optimized AI Engines, arXiv, doi:10.48550/arXiv.2504.09688, arXiv:2504.09688, retrieved 2025-07-08
- ^ Chen, Paul; Manjunath, Pavan; Wijeratne, Sasindu; Zhang, Bingyi; Prasanna, Viktor (2023-09-04). "Exploiting On-Chip Heterogeneity of Versal Architecture for GNN Inference Acceleration". International Conference on Field-Programmable Logic and Applications (FPL). IEEE: 219–227. doi:10.1109/FPL60245.2023.00038. ISBN 979-8-3503-4151-5.
- ^ Flores, Fernando; Peña, María Dolores Valdés; Sánchez, José Manuel Villapún; Pazo, Jesús Manuel Costa; Graña, Camilo Quintáns (2024-11-13). "Evaluation of the Versal Intelligent Engines for Digital Signal Processing Basic Core Units". 2024 39th Conference on Design of Circuits and Integrated Systems (DCIS). IEEE: 1–6. doi:10.1109/DCIS62603.2024.10769170. ISBN 979-8-3503-6439-2.
- ^ a b "AI Engine: Meeting the Compute Demands of Next-Generation Applications".
- ^ a b c Menzel, Johannes; Plessl, Christian (2025-05-04). "Efficient and Distributed Computation of Electron Repulsion Integrals on AMD AI Engines". 2025 IEEE 33rd Annual International Symposium on Field-Programmable Custom Computing Machines: 95–104. doi:10.1109/FCCM62733.2025.00044.
- ^ a b c Vissers, Kees (2019-02-20). "Versal: The Xilinx Adaptive Compute Acceleration Platform (ACAP)". Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. FPGA '19. New York, NY, USA: Association for Computing Machinery: 83. doi:10.1145/3289602.3294007. ISBN 978-1-4503-6137-8.
- ^ a b Swarbrick, Ian; Gaitonde, Dinesh; Ahmad, Sagheer; Gaide, Brian; Arbel, Ygal (2019-02-20). "Network-on-Chip Programmable Platform in VersalTM ACAP Architecture". Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. FPGA '19. New York, NY, USA: Association for Computing Machinery: 212–221. doi:10.1145/3289602.3293908. ISBN 978-1-4503-6137-8.
- ^ Chhugani, Jatin; Nguyen, Anthony D.; Lee, Victor W.; Macy, William; Hagog, Mostafa; Chen, Yen-Kuang; Baransi, Akram; Kumar, Sanjeev; Dubey, Pradeep (2008-08-01). "Efficient implementation of sorting on multi-core SIMD CPU architecture". Proc. VLDB Endow. 1 (2): 1313–1324. doi:10.14778/1454159.1454171. ISSN 2150-8097.
- ^ a b c Hennessy, John L.; Patterson, David A. (2019). Computer architecture: a quantitative approach. Krste Asanović (Sixth ed.). Cambridge, Mass: Morgan Kaufmann Publishers, an imprint of Elsevier. ISBN 978-0-12-811905-1.
- ^ a b Brown, Nick (2023-02-12). "Exploring the Versal AI Engines for Accelerating Stencil-based Atmospheric Advection Simulation". Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays. FPGA '23. New York, NY, USA: Association for Computing Machinery: 91–97. doi:10.1145/3543622.3573047. ISBN 978-1-4503-9417-8.
- ^ Shimamura, Kotaro; Ohno, Ayumi; Takamaeda-Yamazaki, Shinya (2025-02-17), Exploring the Versal AI Engine for 3D Gaussian Splatting, arXiv, doi:10.48550/arXiv.2502.11782, arXiv:2502.11782, retrieved 2025-07-08
- ^ Brown, Nick; Canal, Gabriel Rodríguez (2025-02-14), Seamless acceleration of Fortran intrinsics via AMD AI engines, arXiv, doi:10.48550/arXiv.2502.10254, arXiv:2502.10254, retrieved 2025-07-08
- ^ "AMD Customer Community - AI engine name". adaptivesupport.amd.com. Retrieved 2025-07-10.
- ^ Mehta, Nick (2014). "UltraScale Architecture: Highest Device Utilization, Performance, and Scalability" (PDF).
- ^ a b Levental, Maksim; Khan, Arham; Chard, Ryan; Chard, Kyle; Neuendorffer, Stephen; Foster, Ian (2024-06-19). "An End-to-End Programming Model for AI Engine Architectures". Proceedings of the 14th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies. HEART '24. New York, NY, USA: Association for Computing Machinery: 135–136. doi:10.1145/3665283.3665294. ISBN 979-8-4007-1727-7.
- ^ Nguyen, Tan; Blair, Zachary; Neuendorffer, Stephen; Wawrzynek, John (2023-09-04). "SPADES: A Productive Design Flow for Versal Programmable Logic". 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL): 65–71. doi:10.1109/FPL60245.2023.00017.
- ^ a b c d Zhuang, Jinming; Lau, Jason; Ye, Hanchen; Yang, Zhuoping; Du, Yubo; Lo, Jack; Denolf, Kristof; Neuendorffer, Stephen; Jones, Alex; Hu, Jingtong; Chen, Deming; Cong, Jason; Zhou, Peipei (2023-02-12). "CHARM: Composing Heterogeneous AcceleRators for Matrix Multiply on Versal ACAP Architecture". Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays. FPGA '23. New York, NY, USA: Association for Computing Machinery: 153–164. doi:10.1145/3543622.3573210. ISBN 978-1-4503-9417-8.
- ^ a b c d e f g h i Zhuang, Jinming; Xiang, Shaojie; Chen, Hongzheng; Zhang, Niansong; Yang, Zhuoping; Mao, Tony; Zhang, Zhiru; Zhou, Peipei (2025-02-27). "ARIES: An Agile MLIR-Based Compilation Flow for Reconfigurable Devices with AI Engines". Proceedings of the 2025 ACM/SIGDA International Symposium on Field Programmable Gate Arrays. FPGA '25. New York, NY, USA: Association for Computing Machinery: 92–102. doi:10.1145/3706628.3708870. ISBN 979-8-4007-1396-5.
- ^ Delaye, Elliott (2022-05-30). "CGRA4HPC 2022 Invited Speaker: Mapping ML to the AMD/Xilinx AIE-ML architecture". 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW): 628–628. doi:10.1109/IPDPSW55747.2022.00109.
- ^ Kalamkar, Dhiraj; Mudigere, Dheevatsa; Mellempudi, Naveen; Das, Dipankar; Banerjee, Kunal; Avancha, Sasikanth; Vooturi, Dharma Teja; Jammalamadaka, Nataraj; Huang, Jianyu (2019-06-13), A Study of BFLOAT16 for Deep Learning Training, arXiv, doi:10.48550/arXiv.1905.12322, arXiv:1905.12322, retrieved 2025-07-08
- ^ a b c d e "AMD Technical Information Portal - AIE-ML comparison with AIE". docs.amd.com. Retrieved 2025-07-09.
- ^ a b c d e f g h Hunhoff, Erika; Melber, Joseph; Denolf, Kristof; Bisca, Andra; Bayliss, Samuel; Neuendorffer, Stephen; Fifield, Jeff; Lo, Jack; Vasireddy, Pranathi; James-Roxby, Phil; Keller, Eric (2025-05-04). "Efficiency, Expressivity, and Extensibility in a Close-to-Metal NPU Programming Interface". The 33rd IEEE International Symposium on Field-Programmable Custom Computing Machines. IEEE: 85–94. doi:10.1109/FCCM62733.2025.00043. ISBN 979-8-3315-0281-2.
- ^ a b c d Rico, Alejandro; Pareek, Satyaprakash; Cabezas, Javier; Clarke, David; Ozgul, Baris; Barat, Francisco; Fu, Yao; Münz, Stephan; Stuart, Dylan; Schlangen, Patrick; Duarte, Pedro; Date, Sneha; Paul, Indrani; Weng, Jian; Santan, Sonal (2024-07-10). "AMD XDNA NPU in Ryzen AI Processors". IEEE Micro. 44 (6): 73–82. doi:10.1109/MM.2024.3423692. ISSN 1937-4143.
- ^ Schmidt, Andrew (2024-05-27). "RAW 2024 Invited Talk-9: Riallto: An Open-Source Exploratory Framework for Ryzen AI™". International Parallel and Distributed Processing Symposium Workshops. IEEE: 91–91. doi:10.1109/IPDPSW63119.2024.00030. ISBN 979-8-3503-6460-6.
- ^ Alcorn, Paul (July 15, 2024). "AMD deep-dives Zen 5 architecture — Ryzen 9000 and AI 300 benchmarks, RDNA 3.5 GPU, XDNA 2, and more". Tom's Hardware.
- ^ Bonshor, Gavin. "The AMD Zen 5 Microarchitecture: Powering Ryzen AI 300 Series For Mobile and Ryzen 9000 for Desktop". www.anandtech.com. Retrieved 2025-07-09.
- ^ Lee, Kyuho J. (2021-01-01), Kim, Shiho; Deka, Ganesh Chandra (eds.), "Chapter Seven - Architecture of neural processing unit for deep neural networks", Advances in Computers, Hardware Accelerator Systems for Artificial Intelligence and Machine Learning, vol. 122, Elsevier, pp. 217–245, doi:10.1016/bs.adcom.2020.11.001, retrieved 2025-07-08
- ^ a b Liao, Heng; Tu, Jiajin; Xia, Jing; Liu, Hu; Zhou, Xiping; Yuan, Honghui; Hu, Yuxing (2021-02-27). "Ascend: a Scalable and Unified Architecture for Ubiquitous Deep Neural Network Computing : Industry Track Paper". 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA): 789–801. doi:10.1109/HPCA51647.2021.00071.
- ^ a b Talpes, Emil; Sarma, Debjit Das; Venkataramanan, Ganesh; Bannon, Peter; McGee, Bill; Floering, Benjamin; Jalote, Ankit; Hsiong, Christopher; Arora, Sahil; Gorti, Atchyuth; Sachdev, Gagandeep S. (2020-03-24). "Compute Solution for Tesla's Full Self-Driving Computer". IEEE Micro. 40 (2): 25–35. doi:10.1109/MM.2020.2975764. ISSN 1937-4143.
- ^ a b c "Very Long Instruction Word (VLIW) Architecture". GeeksforGeeks. 2020-12-01. Retrieved 2025-07-07.
- ^ a b c "AMD Technical Information Portal - Tools". docs.amd.com. Retrieved 2025-07-08.
- ^ "VCK5000 Versal Development Card - Documentation". AMD. Retrieved 2025-07-11.
- ^ "AMD Technical Information Portal - AI engine compiler". docs.amd.com. Retrieved 2025-07-09.
- ^ "Design Rationale of Two Generations of AI Engines" (PDF). indico.cern.ch. Archived from the original (PDF) on 2024-12-17. Retrieved 2025-07-08.
- ^ "AMD Technical Information Portal - AI Engine programming model". docs.amd.com. Retrieved 2025-07-09.
- ^ a b Kathail, Vinod (2020-02-24). "Xilinx Vitis Unified Software Platform". Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. FPGA '20. New York, NY, USA: Association for Computing Machinery: 173–174. doi:10.1145/3373087.3375887. ISBN 978-1-4503-7099-8.
- ^ a b Zhao, Zhipeng; Hoe, James C. (2017-02-22). "Using Vivado-HLS for Structural Design: a NoC Case Study (Abstract Only)". Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. FPGA '17. New York, NY, USA: Association for Computing Machinery: 289. doi:10.1145/3020078.3021772. ISBN 978-1-4503-4354-1.
- ^ a b c Zhang, Wenbo; Liu, Yiqi; Zang, Tianhao; Bao, Zhenshan (2024-11-19). "EA4RCA: Efficient AIE accelerator design framework for regular Communication-Avoiding Algorithm". ACM Trans. Archit. Code Optim. 21 (4): 71:1–71:24. doi:10.1145/3678010. ISSN 1544-3566.
- ^ a b Lattner, Chris; Amini, Mehdi; Bondhugula, Uday; Cohen, Albert; Davis, Andy; Pienaar, Jacques; Riddle, River; Shpeisman, Tatiana; Vasilache, Nicolas; Zinenko, Oleksandr (2021-02-21). "MLIR: Scaling Compiler Infrastructure for Domain Specific Computation". 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO): 2–14. doi:10.1109/CGO51591.2021.9370308.
Further reading
- Bonilla, Joel Lopez; Fond, Benoit; Graichen, Henrik; Hamann, Jan; Beyrau, Frank; Boye, Gunar (2021-09-30). "Thermal characterization of high-performance battery cells during charging and discharging using optical temperature measurement methods". FISITA World Congress 2021 - Technical Programme. FISITA. doi:10.46720/f2021-adm-145.
- Perryman, Noah; George, Alan; Goodwill, Justin; Sabogal, Sebastian; Wilson, David; Wilson, Christopher (2025). "Comparative Analysis of Next-Generation Space Computing Applications on AMD-Xilinx Versal Architecture". Journal of Aerospace Information Systems. 22 (2): 103–115. doi:10.2514/1.I011455. ISSN 1940-3151.
- Silvano, Cristina; Ielmini, Daniele; Ferrandi, Fabrizio; Fiorin, Leandro; Curzel, Serena; Benini, Luca; Conti, Francesco; Garofalo, Angelo; Zambelli, Cristian; Calore, Enrico; Schifano, Sebastiano; Palesi, Maurizio; Ascia, Giuseppe; Patti, Davide; Petra, Nicola (2025-06-13). "A Survey on Deep Learning Hardware Accelerators for Heterogeneous HPC Platforms". ACM Comput. Surv. 57 (11): 286:1–286:39. doi:10.1145/3729215. ISSN 0360-0300.
External links
[edit]- "IRON API and MLIR-based AI Engine Toolchain".
- "ARIES: An Agile MLIR-Based Compilation Flow for Reconfigurable Devices with AI Engines (FPGA'25)".
- "AI Engine Development - User Guide".
- "CHARM: Composing Heterogeneous AcceleRators for Matrix Multiply on Versal ACAP Architecture (FPGA'23)".
- "AI Engine Intrinsics - documentation".
- "Vitis Tutorials - AI Engine development".