Intel Xe HP GPU | 糖橙结 ITCJ's Blog

Cluster usage

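JumpServer is the world's first open-source bastion host. It is released under the GNU GPL v3.0 license and is an operations security audit system that complies with the 4A specification. JumpServer is developed mainly in Python / Django, follows the Web 2.0 specification, and ships with an industry-leading Web Terminal solution with an attractive interface and a good user experience. JumpServer adopts a distributed architecture, supports multi-datacenter, cross-region deployment, scales horizontally, and places no limit on the number of assets or concurrent connections. Change the world, starting with small steps. Open source: zero barrier to entry, quick to obtain and install online; distributed: easily supports large-scale concurrent access; plugin-free: only a browser is needed for the full Web Terminal experience; multi-cloud: one system manages assets on different clouds; cloud storage: audit recordings are stored in the cloud and never lost; multi-tenant: one system shared by multiple subsidiaries and departments; multi-application: databases, Windows remote applications, Kubernetes. JumpServer is a security product; please follow the basic security recommendations when deploying and installing it. If you find a security issue, you can contact us directly: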

https://docs.jumpserver.org/zh/master/

Intel® oneAPI Toolkits

Intel® oneAPI products deliver the freedom to develop with a unified toolset and to deploy applications and solutions across CPU, GPU, and FPGA architectures.

https://www.intel.cn/content/www/cn/zh/developer/tools/oneapi/toolkits.html

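Architecture information: Intel® Iris® Xe GPU Architecture

The competition probably uses Intel Arctic Sound (Xe-HP).

Photo of the physical card

Intel Arctic Sound (Xe-HP) revealed, with 32GB HBM2e memory - 滄者極限

Intel Xe-HP is the high-end product in the Xe architecture family, aimed mainly at data centers and professional workstation use rather than being a gaming card for players. Hmm... no idea what its mining hash rate would be XD. Actual hardware has recently surfaced in two models: Arctic Sound 1T and Arctic Sound 2T. Intel Arctic Sound 1T is a single-die configuration with 512 EUs, but part of the chip is disabled, so the actual spec is 384 EUs. If Xe works the same way as earlier generations, that works out to 3072 shading units (stream processors). The card carries 16GB of HBM2e memory, has a 150W TDP, and uses a PCIe 4.0 interface. Intel Arctic Sound 2T has two dies with 480 EUs each, for 7680 shading units, still short of the full 8192 configuration; the 2T model carries 32GB of HBM2e memory and has a 300W TDP. Intel Xe product positioning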

https://www.coolaler.com/index/intel-arctic-sound%EF%BC%88xe-hp%EF%BC%89%E6%9B%9D%E5%85%89-%E6%90%AD-32gb-hbm2e-%E8%A8%98%E6%86%B6%E9%AB%94/

Architecture information

Intel® Iris® Xe GPU Architecture

An Execution Unit (EU) is the smallest thread-level building block of the Intel Iris Xe-LP GPU architecture. Each EU is simultaneously multithreaded (SMT) with seven threads. The primary computation unit consists of an 8-wide Single Instruction Multiple Data (SIMD) Arithmetic Logic Unit (ALU) supporting SIMD8 FP/INT operations and a 2-wide SIMD ALU supporting SIMD2 extended math operations.

https://www.intel.com/content/www/us/en/develop/documentation/oneapi-gpu-optimization-guide/top/xe-arch.html

EU → Xe Core

Compared with the low-power Xe-LP, Xe-HPC and Xe-HPG use an Xe-core that resembles an Xe-LP dual subslice.

Basic information

8 vector and 8 matrix engines

vector engine: 512 bit FP32 SIMD16 (FMA)

512 FP16
256 FP32
256 FP64 operations/cycle

💡

The original text describes 8 vector engines, each executing 16 FP32 SIMD operations, i.e. 128 FMAs, which corresponds to 256 operations/cycle. Note that FP32 and FP64 throughput are equal.
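The arithmetic in this note can be checked with a few lines of Python. This is a minimal sketch that uses only the figures quoted above (8 vector engines, 512-bit width, an FMA counted as 2 operations); the constant and function names are made up for illustration.

```python
# Per-Xe-core vector throughput, derived from the figures quoted above.
VECTOR_ENGINES_PER_CORE = 8   # "8 vector and 8 matrix engines"
VECTOR_WIDTH_BITS = 512       # width of one vector engine
OPS_PER_FMA = 2               # a fused multiply-add counts as multiply + add


def simd_lanes(element_bits: int) -> int:
    """Number of elements of the given width that fit in one vector engine."""
    return VECTOR_WIDTH_BITS // element_bits


def ops_per_cycle(element_bits: int) -> int:
    """Peak operations per cycle for one Xe-core if every lane issues an FMA."""
    fma_per_cycle = VECTOR_ENGINES_PER_CORE * simd_lanes(element_bits)
    return fma_per_cycle * OPS_PER_FMA


fp32_lanes = simd_lanes(32)    # 16 lanes, i.e. SIMD16
fp32_ops = ops_per_cycle(32)   # 8 * 16 * 2
fp16_ops = ops_per_cycle(16)   # 8 * 32 * 2

print(f"FP32: {fp32_lanes} lanes/VE, {fp32_ops} ops/cycle/Xe-core")
print(f"FP16: {fp16_ops} ops/cycle/Xe-core")
```

The width-based formula reproduces the quoted FP16 and FP32 figures. It would give 128 for FP64, not the quoted 256, so the FP64 = FP32 parity has to be taken from the source rather than derived from the vector width.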

matrix engine: 4096 bit

8192 int8
4096 FP16/BF16
2048 FP32 operations/cycle

512KB L1 cache/SLM
512B/cycle load/store

Xe Slice

16 * Xe core

512KB * 16 = 8MB L1

16 * ray tracing units

1 hardware context

Xe Stack

4 * Xe-slice

64 * Xe-cores

64 ray tracing units

4 hardware contexts

4 HBM2e controllers

1 media engine

8 Xe-Link high speed coherent fabric

shared L2 cache

Xe-HPC 2-Stack Ponte Vecchio GPU

2 stacks

8 slices

128 Xe-cores

128 ray tracing units

8 hardware contexts

8 HBM2e controllers

16 Xe-Links

Xe-HPG GPU

Xe-HPG, by contrast, has 16 vector engines and 16 matrix engines per Xe-core, but each vector engine is narrowed to 256 bits, i.e. 8 FP32 lanes.

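ATS-P parameter notes

The PVC core parameters can be used as a reference for ATS-P, but the cache and other specifications differ.

Each tile has 30 Xe-cores and 480 EUs (VEs); 2 tiles together give 960 EUs (VEs).

💡

480/30=16, so an ATS-P Xe-core contains 16 VEs, and the average amount of cache per VE is correspondingly smaller.

Computing the number of FP64 units:

ATS-P 2-tile clinfo output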

THREAD

MEMORY

Ponte Vecchio parameter comparison

ㅤ	Xe Stack	Xe Slice	Xe Core	Xe VE	FMA	operation
ㅤ	2	2 * 4 = 8	8 * 16 = 128	128 * 8 = 1024	1024 * 16 SIMD = 16384	32768

Execution model overview

GPU Execution Model Overview

The General Purpose GPU (GPGPU) compute model consists of a host connected to one or more compute devices. Each compute device consists of many GPU Compute Engines (CE), also known as Execution Units (EU) or Vector Engines (VE).

https://www.intel.com/content/www/us/en/develop/documentation/oneapi-gpu-optimization-guide/top/execution-model.html

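1 host - multiple compute devices - multiple GPU Compute Engines (EU/VE)

Each compute device has its own cache, memory, and other components.

An application consists of the part that runs on the host and the kernels submitted to the VEs.

Kernel functions are bundled into a context and sent to the device for execution; the context carries memory index information and variable arguments.

💡

A device can have its own command queue to run different tasks. Is running the same code on different platforms achieved through different commands?

One execution of a kernel function = one work-item

Multiple executing work-items are gathered into a work-group, and the compute device manages work-items through work-groups. This feels similar to CUDA indexing: a work-item is identified either by its global ID or by group ID + local ID.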
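The relationship between the two ways of identifying a work-item can be sketched in one dimension. This is a minimal illustration with made-up function names and an arbitrary work-group size of 64, assuming no global offset; it mirrors the familiar CUDA expression `blockIdx.x * blockDim.x + threadIdx.x`.

```python
def global_id(group_id: int, local_id: int, local_size: int) -> int:
    """Global ID of a work-item from its work-group ID and its ID inside the group."""
    return group_id * local_size + local_id


def group_and_local_id(gid: int, local_size: int) -> tuple[int, int]:
    """Inverse mapping: recover (group ID, local ID) from a global ID."""
    return divmod(gid, local_size)


LOCAL_SIZE = 64  # hypothetical work-group size, chosen only for this example

gid = global_id(group_id=3, local_id=5, local_size=LOCAL_SIZE)
print(gid)                                  # 3 * 64 + 5 = 197
print(group_and_local_id(gid, LOCAL_SIZE))  # (3, 5)
```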

work group：runs the same kernel on several unit items in a group

💡

SIMD means that a VE executes multiple work-items at the same time.

Functions such as barriers apply to every work-item in the work-group.

Compiler generates SIMD code to map several work-items to be executed simultaneously within a given hardware thread.

SYCL Thread Mapping

SYCL

What do you think of the features of DPC++, the Intel oneAPI programming language?

DPC++ stands for Data Parallel ...

https://www.zhihu.com/question/504501358/answer/2276515129

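SYCL is another deep rabbit hole. It is a programming model for writing portable code. Intel built DPC++ on top of SYCL and LLVM; it follows the C++17 standard and makes porting convenient.

SYCL to Xe GPU mapping

Work-items are indexed in up to 3 dimensions and are organized into thread groups called work-groups.

The threads (work-items) in a work-group are further divided into vector groups called sub-groups.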

Work-item

A work-item represents one of a collection of parallel executions of a kernel.

💡

A work-item is one thread of the kernel function; every thread runs the same instruction stream.

Sub-group

A sub-group represents a short range of consecutive work-items that are processed together as a SIMD vector of length 8, 16, 32, or a multiple of the native vector length of a CPU with Intel® UHD Graphics.

💡

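Because of the hardware design, multiple threads have to be executed together in SIMD fashion, which is why the sub-group concept exists. Work-items in the same sub-group are assigned to the same hardware unit (VE). A work-group is split into multiple sub-groups so that it maps onto the hardware.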
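How a work-group breaks down into sub-groups can be sketched as follows. This is an illustration only: it assumes a 1-dimensional work-group, a fixed sub-group size, and that sub-groups are formed from consecutive local IDs (as the quoted definition suggests); the real mapping is chosen by the compiler and runtime. The function name is hypothetical.

```python
def split_into_sub_groups(work_group_size: int, sub_group_size: int) -> list[list[int]]:
    """Partition local IDs 0..work_group_size-1 into runs of consecutive IDs."""
    return [
        list(range(start, min(start + sub_group_size, work_group_size)))
        for start in range(0, work_group_size, sub_group_size)
    ]


sub_groups = split_into_sub_groups(work_group_size=64, sub_group_size=16)

print(len(sub_groups))      # number of sub-groups the work-group needs
print(sub_groups[1][:4])    # first lanes of the second sub-group

# Each sub-group is what gets executed as one SIMD vector;
# local ID 37 lands in sub-group 37 // 16, lane 37 % 16.
sub_group_id, lane = divmod(37, 16)
print(sub_group_id, lane)
```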

Work-group

A work-group is a 1-, 2-, or 3-dimensional set of threads within the thread hierarchy. In SYCL, synchronization across work-items is only possible with barriers for the work-items within the same work-group.

💡

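A work-group is a set of threads (work-items) of a kernel that behave alike and map to the hardware in the same way, which is why work-items inside one group can synchronize with each other; different groups can only be synchronized as a whole (for example between kernels).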

nd_range

sycl::nd_range — SYCL Reference documentation

nd stands for the number of dimensions.

💡

In general, choosing dimensions that mirror the memory layout makes the threads' behavior easier to reason about.

💡

Dimension 2 is the dimension that sub-groups correspond to.
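Why dimension 2 lines up with sub-groups can be seen from how a 3-dimensional ID is linearized: in SYCL the last dimension varies fastest. The sketch below assumes sub-groups are cut from consecutive linear IDs and uses an arbitrary local range of (2, 4, 16) with a sub-group size of 16; both values and the function name are chosen only for illustration.

```python
def linear_id(idx: tuple[int, int, int], rng: tuple[int, int, int]) -> int:
    """Row-major linearization: dimension 2 is contiguous, dimension 0 is slowest."""
    i0, i1, i2 = idx
    _, r1, r2 = rng
    return (i0 * r1 + i1) * r2 + i2


LOCAL_RANGE = (2, 4, 16)   # hypothetical work-group shape
SUB_GROUP_SIZE = 16        # hypothetical sub-group size

first = linear_id((1, 2, 0), LOCAL_RANGE)    # start of one row along dimension 2
last = linear_id((1, 2, 15), LOCAL_RANGE)    # end of the same row
next_row = linear_id((1, 3, 0), LOCAL_RANGE)

# Work-items that differ only in dimension 2 share a sub-group;
# stepping dimension 1 moves to the next sub-group.
print(first // SUB_GROUP_SIZE, last // SUB_GROUP_SIZE, next_row // SUB_GROUP_SIZE)
```

With this layout, making dimension 2 a multiple of the sub-group size keeps each sub-group on one contiguous row, which is usually what you want for memory access as well.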

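Thread indexing

Each small block is one work-item, treated as one operation. Is one group assigned to one Xe EU when it executes??

One VE can execute 32 FP64 operations per clock cycle.

💡

If this is correct, then an Xe-HP core also has 8 VEs.

💡

Correction: one sub-group is assigned to one VE.

Thread synchronization within a kernel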

``mem_fence`` inserts a memory fence on global and local memory access across all work-items in a work-group.

``barrier`` inserts a memory fence and blocks the execution of all work-items within the work-group until all work-items have reached its location.

⭐ Work-group to Xe-core mapping and occupancy

The official site has an architecture table for Xe-LP; the analogous table for Xe-HP is written out below.

1 tile	XVE	thread	Operation	max work group size
Xe-core	8 ? 16	8 * 16 = 128	128 * 2 = 256 ?	512 ? 1024
total	16 * 30 = 480	128 * 30 = 3840	256 * 30 = 7680	512 ? 1024
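A rough occupancy estimate can be built from the table above. This is a sketch under stated assumptions, not Intel's occupancy formula: it takes the tentative 16 XVEs and 8 hardware threads per XVE from the table, assumes one sub-group occupies one hardware thread, assumes a work-group must fit inside a single Xe-core, and ignores SLM and barrier limits. All names are hypothetical.

```python
def hw_threads_per_work_group(work_group_size: int, sub_group_size: int) -> int:
    """Hardware threads one work-group needs (one sub-group per thread, rounded up)."""
    return (work_group_size + sub_group_size - 1) // sub_group_size


def xe_core_occupancy(
    work_group_size: int,
    sub_group_size: int,
    ve_per_core: int = 16,     # tentative value from the table
    threads_per_ve: int = 8,   # tentative value from the table
) -> tuple[int, float]:
    """Return (resident work-groups per Xe-core, fraction of hardware threads used)."""
    threads_per_core = ve_per_core * threads_per_ve
    threads_per_wg = hw_threads_per_work_group(work_group_size, sub_group_size)
    resident = threads_per_core // threads_per_wg
    return resident, resident * threads_per_wg / threads_per_core


print(xe_core_occupancy(512, 16))   # 32 threads per group, 4 groups fill 128 threads
print(xe_core_occupancy(768, 16))   # 48 threads per group, only 2 groups fit
```

Under these assumptions a work-group size that does not divide the per-core thread count evenly leaves hardware threads idle, which is the effect the occupancy discussion is concerned with.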

Ponte Vecchio

Xe Stack	Xe Slice	Xe Core	Xe VE	FMA	operation
2	2 * 4 = 8	8 * 16 = 128	128 * 8 = 1024	1024 * 16 SIMD = 16384	32768

Arctic Sound (HP)

Xe Stack → Tile	~~Xe Slice~~	Xe Core	Xe VE	FMA	operation
2	~~2 * 4 = 8~~	2 * 30 = 60	~~60 * 8 = 480~~ 60 * 16 = 960	~~480 * 16 = 7680~~ 960 * 16 SIMD = 15360	~~15360~~ 30720
Shares the L2 cache, memory controllers, switching fabric, etc.	Adds ray tracing units, but this is not yet clear for the HP architecture	Shares load/store and the L1/SLM cache	ㅤ	ㅤ	ㅤ

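The server has 2 of these GPUs. A GPU at a comparable performance level is the AMD MI250 at 45.3 TFLOPS; its memory bandwidth is 3276.8GB/s, with ECC support, an 8192-bit bus width, and a 1.6GHz memory clock.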
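The quoted MI250 bandwidth is consistent with the bus width and memory clock. The sketch below reproduces it; the factor of two transfers per clock (double data rate) is an assumption added here to make the numbers agree, and the names are illustrative.

```python
BUS_WIDTH_BITS = 8192       # quoted bus width
MEMORY_CLOCK_GHZ = 1.6      # quoted memory clock
TRANSFERS_PER_CLOCK = 2     # assumption: double data rate
PEAK_TFLOPS = 45.3          # quoted peak compute


def bandwidth_gb_per_s(bus_bits: int, clock_ghz: float, transfers_per_clock: int) -> float:
    """Peak bandwidth in GB/s: bytes per transfer times transfers per nanosecond."""
    bytes_per_transfer = bus_bits // 8
    return bytes_per_transfer * clock_ghz * transfers_per_clock


bandwidth = bandwidth_gb_per_s(BUS_WIDTH_BITS, MEMORY_CLOCK_GHZ, TRANSFERS_PER_CLOCK)
bytes_per_flop = bandwidth / (PEAK_TFLOPS * 1000)   # GB/s divided by GFLOP/s

print(f"{bandwidth:.1f} GB/s")
print(f"{bytes_per_flop:.4f} bytes per FLOP at peak")
```

The bytes-per-FLOP ratio gives a quick sense of how much arithmetic a kernel must do per byte fetched before it stops being bandwidth-bound.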

💡

This mixes the thread and SIMD concepts, but each thread can issue an FMA every cycle, which counts as 2 operations.
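The operation columns in the Ponte Vecchio and Arctic Sound tables above can be reproduced with the same rule (VEs × 16 SIMD lanes × 2 operations per FMA). This is a minimal sketch; the function name is made up, and the 16-lane width for Arctic Sound is carried over from the tables rather than confirmed.

```python
OPS_PER_FMA = 2    # an FMA counts as a multiply and an add
SIMD_LANES = 16    # lanes per VE, as used in the tables


def peak_per_cycle(xe_cores: int, ve_per_core: int) -> tuple[int, int, int]:
    """Return (VE count, FMAs per cycle, operations per cycle) for a whole GPU."""
    ves = xe_cores * ve_per_core
    fma = ves * SIMD_LANES
    return ves, fma, fma * OPS_PER_FMA


# Ponte Vecchio: 2 stacks * 4 slices * 16 Xe-cores, 8 VEs per core.
ponte_vecchio = peak_per_cycle(xe_cores=2 * 4 * 16, ve_per_core=8)

# Arctic Sound: 2 tiles * 30 Xe-cores, first with the struck-out guess of
# 8 VEs per core, then with the corrected 16 VEs per core.
arctic_sound_8ve = peak_per_cycle(xe_cores=2 * 30, ve_per_core=8)
arctic_sound_16ve = peak_per_cycle(xe_cores=2 * 30, ve_per_core=16)

for name, row in [
    ("Ponte Vecchio", ponte_vecchio),
    ("Arctic Sound (8 VE/core)", arctic_sound_8ve),
    ("Arctic Sound (16 VE/core)", arctic_sound_16ve),
]:
    print(name, row)
```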

16 EUs per Xe-core

7 thread contexts (7 work-groups pushed in)

The second work-group is assigned to the second Xe-core, 6 in total

💡

Does this improve efficiency by reducing the number of dispatches???