IPCC 22 Pre-Competition Lecture | 糖橙结 ITCJ's Blog

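Competing for compute resources during the contest

Profiling, network access
Grabbing nodes
Benchmarking the software and hardware platform

Understand the local code logic: where is the room for optimization?

Code hotspots

Manual instrumentation

Put timers around every major for loop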
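
A minimal sketch of manual instrumentation, assuming a POSIX system with `clock_gettime`. The function `timed_axpy` and its loop are hypothetical stand-ins for a real hot loop; the point is only the pattern of reading a monotonic clock before and after the loop.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L
#endif
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <time.h>

/* Wall-clock time in seconds, read from a monotonic clock (POSIX). */
double now_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical hot loop y[i] += a * x[i]; returns the elapsed seconds. */
double timed_axpy(size_t n, double a, const double *x, double *y) {
    assert(x != NULL && y != NULL);
    double t0 = now_seconds();
    for (size_t i = 0; i < n; i++) {
        y[i] += a * x[i];
    }
    double t1 = now_seconds();
    /* Report on stderr so the timing does not mix with program output. */
    fprintf(stderr, "axpy loop: %.6f s\n", t1 - t0);
    return t1 - t0;
}
```

Wrapping each major loop this way gives a coarse per-loop breakdown before reaching for a profiler. For very short loops the timer overhead dominates, so it is worth timing many repetitions.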

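Tools: gprof, VTune, ncu; hardware cost

Compute and bandwidth efficiency

Is the code compute-bound or memory-bound?
Is it approaching the hardware limit?

Improve code locality

Loop fusion
Loop reordering (interchange)
Operator fusion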
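
The two loop transformations above can be sketched as follows. All function names are made up for illustration, and the matrix is assumed to be stored row-major in a flat array.

```c
#include <assert.h>
#include <stddef.h>

/* Unfused: b[] is written in one pass and read again in a second pass. */
void scale_then_sum_unfused(size_t n, const double *a, double *b, double *sum) {
    for (size_t i = 0; i < n; i++) b[i] = 2.0 * a[i];
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += b[i];
    *sum = s;
}

/* Fused: one pass; b[i] is consumed while it is still in a register. */
void scale_then_sum_fused(size_t n, const double *a, double *b, double *sum) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        b[i] = 2.0 * a[i];
        s += b[i];
    }
    *sum = s;
}

/* Column index in the outer loop: stride-cols accesses on row-major data. */
double sum_cols_outer(size_t rows, size_t cols, const double *m) {
    double s = 0.0;
    for (size_t j = 0; j < cols; j++)
        for (size_t i = 0; i < rows; i++)
            s += m[i * cols + j];
    return s;
}

/* Interchanged: row index in the outer loop, so memory is read contiguously. */
double sum_rows_outer(size_t rows, size_t cols, const double *m) {
    double s = 0.0;
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            s += m[i * cols + j];
    return s;
}
```

Both variants of each pair compute the same values here. Only the memory access pattern changes, which is where the locality gain comes from on large arrays.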

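Mixed precision

Memory-bound: store in low precision, compute in high precision
Compute-bound: store in high precision, compute in low precision
Iterative algorithms: low precision first, then high precision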
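
A minimal sketch of the memory-bound case, assuming `float` as the storage type and `double` as the arithmetic type. The function name `dot_f32_storage_f64_accum` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product with low-precision storage and high-precision arithmetic:
 * the arrays are float (half the memory traffic of double), while every
 * product and the running sum are computed in double. */
double dot_f32_storage_f64_accum(size_t n, const float *x, const float *y) {
    assert(x != NULL && y != NULL);
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        acc += (double)x[i] * (double)y[i];
    }
    return acc;
}
```

With inputs such as `{1e8f, 1.0f}` and `{1.0f, 1.0f}`, the double accumulator keeps the trailing 1 that a float accumulator would round away, while the data in memory stays in 4-byte floats.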

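Hardware features

Core binding, memory allocation? NUMA?
-O3 can unroll loops too aggressively

Compilers

How do different compilers affect different parts of the program, and why?
Can another compiler achieve the same effect?
Are aocc/icc/icx/clang always better than gcc?
Is a newer version always better than an older one?

GPGPU-Sim
icc 17 differs a lot from icc 16, 18, and 19, and the difference varies from one part of the program to another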

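💡

Key strategies
* gcc 7.5.0
* Build and test with multiple versions of aocc/icc/icx/clang/gcc
* Measure each compiler's run time on each code section (on both the contest cluster and the debugging cluster)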

Compiler Principles (《编译原理》)

2 Supercomputing Fundamentals and Methodology

3 Introduction to Optimization on Modern Processors

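Processor Architecture - YatCPU Lab Documentation

By: howardlau1999. This project supports the RISC-V RV32I integer instruction set. Two microarchitectures are available: a three-stage, single-issue, in-order pipeline similar to Z-scale and riscv-mini, and a five-stage, single-issue, in-order pipeline similar to the CPU described in "Computer Organization and Design, RISC-V edition".

The three-stage pipeline consists of instruction fetch (IF), decode (DE), and a combined execute/write-back/memory stage (EX, WB, MEM). The five-stage pipeline consists of instruction fetch (IF), decode (DE), execute (EX), memory access (MEM), and write-back (WB).

Instruction fetch (InstructionFetch): the fetch unit contains the program counter (PC). On each rising clock edge, if the PC stall signal PCStall is asserted, the PC keeps its value; otherwise, if the jump flag JumpFlag is asserted, the PC is set to the jump target address JumpAddress; otherwise the PC is incremented by 4. The fetch stage uses the PC to read the instruction from memory over the AXI4Lite bus and passes the instruction and its address downstream. Loading instructions from ROM into memory at CPU reset, and reading each instruction from memory over the bus, both take several clock cycles. While an instruction is not ready, the fetch stage passes a nop downstream and stalls the PC (asserting StallFlagIF).

https://yatcpu.sysu.tech/tutorial/arch/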

LCTES

4 MPI

Avoiding deadlock

Rearrange the dependencies (the order of sends and receives)

Use non-blocking interfaces
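
A minimal sketch of the non-blocking approach, assuming an MPI installation (build with `mpicc`, run with `mpirun`). The ring exchange pattern is an illustration chosen here, not something taken from the lecture.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank sends to its right neighbour and receives from its left. */
    int right = (rank + 1) % size;
    int left = (rank + size - 1) % size;
    int send_val = rank;
    int recv_val = -1;
    MPI_Request reqs[2];

    /* Both calls return immediately, so no rank can block in a send
     * while waiting for its neighbour to post the matching receive. */
    MPI_Irecv(&recv_val, 1, MPI_INT, left, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&send_val, 1, MPI_INT, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* Computation that does not touch send_val or recv_val could run
     * here, overlapping with the communication. */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    printf("rank %d received %d from rank %d\n", rank, recv_val, left);

    MPI_Finalize();
    return 0;
}
```

If every rank instead called a blocking `MPI_Send` first and `MPI_Recv` second, the program would depend on system buffering to make progress and could deadlock for large messages.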

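Blocking vs. non-blocking

Implementing computation/communication overlap

System buffering

Data may first be copied into a system buffer, which leads to the communication modes below.

Buffered mode: suits many small messages. It does not guarantee that the receiver has started receiving; the call returns as soon as the sender has copied the data into the send buffer.

Ready mode: avoids extra buffer operations, connection setup, and similar overhead, but the matching receive must already be posted.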

Collective communication

Regular, frequent communication patterns

💡

A common bug is calling a collective inside an if, so that not every process reaches the call.

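Barrier synchronization

Data movement

allgather: every process ends up with the complete data

alltoall: a transpose (commonly used to exchange data between processes, for example in today's large-scale communication)

Reduction

reduce: only the root process gets the result

allreduce: every process gets the result (10 in the slide's example); used for gradient reduction in neural networks

Helper functions

MPI example

The non-parallelizable part stays the same, and parallelization overhead is added on top.

Error in the result: a numerical precision issue? The order in which allreduce combines values matters.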
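
The ordering effect can be reproduced without MPI. The sketch below assumes IEEE 754 single-precision arithmetic evaluated in `float` (the usual case on x86-64 and ARM); the function names are hypothetical. It sums the same three values in two orders, as a reduction across processes might.

```c
#include <assert.h>
#include <stddef.h>

/* Sum from the first element to the last. */
float sum_forward(size_t n, const float *v) {
    assert(v != NULL);
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) s += v[i];
    return s;
}

/* Sum the same values from the last element to the first. */
float sum_backward(size_t n, const float *v) {
    assert(v != NULL);
    float s = 0.0f;
    for (size_t i = n; i > 0; i--) s += v[i - 1];
    return s;
}
```

For the input `{1e8f, -1e8f, 1.0f}`, the forward order cancels the two large values first and keeps the 1, while the backward order adds 1 to -1e8 first, where it is rounded away. Floating-point addition is not associative, so a parallel reduction need not match the serial result bit for bit, and comparing against a reference usually needs a tolerance.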

5 OpenMP

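Shared-memory parallelism: there is no need to express sends and receives explicitly; threads simply access the shared memory.

Parallel region constructs

private: each thread's copy is uninitialized on entry and must be assigned inside the region. firstprivate: each thread's copy is initialized from the master thread's value.

reduction: on leaving the parallel region, the per-thread copies are combined with the given operator.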
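
The three clauses can be sketched together in one loop. The function `offset_sum` is a made-up example; build with OpenMP enabled (for instance `-fopenmp` with gcc) to run it in parallel. Without that flag the pragma is ignored and the loop runs serially with the same result.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the sum of (v[i] + offset) over all i. */
double offset_sum(int n, const double *v, double offset) {
    assert(v != NULL);
    double total = 0.0;
    double tmp; /* scratch value; private, so uninitialized in each thread */

    #pragma omp parallel for private(tmp) firstprivate(offset) reduction(+:total)
    for (int i = 0; i < n; i++) {
        tmp = v[i] + offset; /* assign the private copy before reading it */
        total += tmp;        /* accumulates into a per-thread partial sum */
    }
    /* The partial sums have been combined with + into total here. */
    return total;
}
```

`offset` is only read, so `shared` would also work. `firstprivate` is used here purely to show that each thread's copy starts from the master thread's value.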

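Work-sharing constructs

schedule: scheduling control

For dynamic scheduling the default chunk size is 1, but the finer the granularity, the higher the scheduling overhead, so the chunk size often has to be tuned by trial. guided adjusts it automatically: the later the chunk, the smaller it is.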
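
A minimal sketch of the two schedules on an uneven workload. `triangular_work` is a hypothetical load in which iteration `i` costs roughly `i` steps, and the chunk size 16 is an arbitrary starting point, not a recommendation.

```c
#include <assert.h>

/* Hypothetical uneven work item: the cost grows linearly with i. */
long triangular_work(int i) {
    long s = 0;
    for (int k = 0; k <= i; k++) s += k;
    return s;
}

/* dynamic with an explicit chunk: threads grab 16 iterations at a time,
 * trading some load balance for fewer scheduling operations. */
long total_work_dynamic(int n) {
    long total = 0;
    #pragma omp parallel for schedule(dynamic, 16) reduction(+:total)
    for (int i = 0; i < n; i++) total += triangular_work(i);
    return total;
}

/* guided: chunks start large and shrink as the loop nears its end. */
long total_work_guided(int n) {
    long total = 0;
    #pragma omp parallel for schedule(guided) reduction(+:total)
    for (int i = 0; i < n; i++) total += triangular_work(i);
    return total;
}
```

Both functions return the same value; only the assignment of iterations to threads differs. Timing them on the real loop is the practical way to pick a schedule and chunk size.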

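runtime: the schedule is chosen at run time (for example through the OMP_SCHEDULE environment variable)

auto: give up and let the compiler and runtime decide

ordered: enforce sequential execution order

collapse: merge nested loops into a single iteration space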
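
A small sketch of `collapse(2)`, assuming a perfectly nested, rectangular loop pair. The function `fill_index` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Fill a row-major rows x cols matrix so that each entry holds its own
 * flat index. collapse(2) turns the two loops into one iteration space
 * of rows * cols iterations, which is then divided among the threads. */
void fill_index(int rows, int cols, int *m) {
    assert(m != NULL);
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            m[i * cols + j] = i * cols + j;
        }
    }
}
```

This helps mostly when the outer loop alone has too few iterations to keep all threads busy, for example 2 rows on a 64-core node.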

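nowait: behavior on leaving the work-sharing region (threads do not wait at the implicit barrier)

No need to write results back to the master thread

Other work-sharing constructs

Synchronization constructs

critical: a code block executed by one thread at a time

atomic: a single statement executed atomically (atomic usually has somewhat lower overhead)

💡

Presumably because of hardware support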
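
A minimal sketch contrasting the two constructs; both function names are hypothetical. `atomic` protects one simple update of a scalar, while `critical` protects an arbitrary block, here a compare followed by a store.

```c
#include <assert.h>
#include <stddef.h>

/* Count the even elements; the single increment is protected by atomic. */
int count_even_atomic(int n, const int *v) {
    assert(v != NULL);
    int count = 0;
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        if (v[i] % 2 == 0) {
            #pragma omp atomic
            count++;
        }
    }
    return count;
}

/* Find the maximum; compare-and-update is two steps, so use critical. */
int max_critical(int n, const int *v) {
    assert(v != NULL && n > 0);
    int best = v[0];
    #pragma omp parallel for
    for (int i = 1; i < n; i++) {
        #pragma omp critical
        {
            if (v[i] > best) best = v[i];
        }
    }
    return best;
}
```

Both loops serialize on the shared variable, so in practice a `reduction` clause is usually the faster choice for counts, sums, and maxima. The sketch only shows which construct fits which kind of update.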

Summary