The_2nd_High_Performance_Computing_Youth_Forum_Workshop

ChunyuYe included in MEETING

2024-11-19 2260 words 5 minutes


Dates: November 27 to December 1, 2024

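Location: Chongqing, China

Venue: Chongqing Wudu Hotel (24 Zengjiayan, Yuzhong District, Chongqing)

Host: Institute of Computing Technology, Chinese Academy of Sciences

Co-organizer: China Computer Federation (CCF) Technical Committee on High Performance Computing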

Organizer: Western Institute of Computing Technology, China

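Workshop topics:

AI4Science and scientific applications.
Application porting and optimization on Chinese domestic supercomputers.
Machine learning systems.
Mixed-precision computing methods.
Data compression.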

Agenda

Date	Event	Time
11/27	Registration
11/28	Day 1	8:20-11:50 13:00-17:35 17:35-18:30 18:30-20:30
11/29	Day 2	8:20-11:50 13:00-17:35
11/30	Day 3
12/01	Day 4

Talks

Time	Title	Speaker	Affiliation	Notes
November 28, morning
8:20-8:30	Introduction and Greeting	谭光明
8:30-8:55	Pushing the Limit of Quantum Mechanical Simulation to the Raman Spectra of a Biological System with 100 Million Atoms	商红慧	University of Science and Technology of China
8:55-9:20	A Performance-Portable Kilometer-Scale Global Ocean Model Across Various Architecture Systems	韦健	Computer Network Information Center, Chinese Academy of Sciences
9:20-9:45	A High-Quality Workflow for Multi-Resolution Scientific Data Reduction and Visualization	刘泽辉	Institute of Computing Technology, Chinese Academy of Sciences
9:45-10:10	Moiræ: Generating High-Performance Composite Stencil Programs with Global Optimizations	刘笑明	Beihang University	1. Memory analysis 2. Stencil computation 3. High-performance code generation
10:10-10:35	Towards Highly Compatible IO-aware Workload Scheduling on HPC Systems	戴俊政	National University of Defense Technology	1. Workflow scheduling
10:35-11:00	MCFuser: High-Performance and Rapid Fusion of Memory-Bound Compute-Intensive Operators	张帆	Wuhan University	1. Operator fusion
11:00-11:25	Boosting Data Center Performance via Intelligently Managed Multi-backend Disaggregated Memory	王靖	Shanghai Jiao Tong University	1. Memory configuration
11:25-11:50	SMIless: Serving DAG-based Inference with Dynamic Invocations under Serverless Computing	卢浩远	Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
November 28, afternoon
13:00-13:25	Scaling Molecular Dynamics with ab initio Accuracy to 149 Nanoseconds per Day	李剑锋	Institute of Computing Technology, Chinese Academy of Sciences
13:25-13:50	A Conflict-aware Divide-and-Conquer Algorithm for Symmetric Sparse Matrix-Vector Multiplication	邱奥中	National University of Defense Technology
13:50-14:15	Accelerating Distributed DLRM Training with Optimized TT Decomposition and Micro-Batching	王威威	Wuhan University
14:15-14:40	Scaling New Heights: Transformative Cross-GPU Sampling for Training Billion-Edge Graphs	夏亚东	Wuhan University
14:40-15:05	AmgX: Algebraic Multigrid Solver on Many Cores	曾礼杰	China University of Petroleum, Beijing
15:05-15:30	LoRAStencil: Low-Rank Adaptation of Stencil Computation on Tensor Cores	张祥胜	Institute of Computing Technology, Chinese Academy of Sciences
15:30-15:55	GVARP: Detecting Performance Variance on Large-Scale Heterogeneous System	游心	Beihang University
15:55-16:20	Mille-feuille: A Tile-Grained Mixed-Precision Single-Kernel Conjugate Gradient Solver on GPUs	杨逸翔	China University of Petroleum, Beijing	1. Mixed precision
16:20-16:45	DBSSR: An Efficient Storage Format for Vectorizing Sparse Triangular Solvers on Structured Grids	杨南剑	National University of Defense Technology
16:45-17:10	Enabling 1K-atom Stencil-based ON Calculations via Low-rank Approximations and High-performance Computing on Leadership Supercomputers	吴文斌	University of Science and Technology of China
17:10-17:35	MIXQ: Taming Dynamic Outliers in Mixed-Precision Quantization by Online Prediction	陈逸东	Tsinghua University	1. Mixed precision
17:35-18:30	Poster session
18:30-20:30	Reception
November 29, morning
8:30-8:55	MIXQ: Taming Dynamic Outliers in Mixed-Precision Quantization by Online Prediction	陈逸东	Tsinghua University	1. Mixed precision
8:55-9:20	Long Sequences: Accelerating Parameter-Efficient Fine-Tuning for LLMs under Shadowy Sparsity	王拓为	Tsinghua University
9:20-9:45	Exploring Efficient Partial Differential Equation Solution using Speed Galerkin Transformer	朱英浩	China University of Petroleum (East China)
9:45-10:10	Unlocking High Performance with Low-Bit NPUs and CPUs for Highly Optimized HPL-MxP on Cloud Brain II	薛伟诚	Peng Cheng Laboratory
10:10-10:35	APTMoE: Affinity-aware Pipeline Tuning for MoE Models on Bandwidth-constrained GPU Nodes	韦媛媛	Sun Yat-sen University
10:35-11:00	Enumeration of Billions of Maximal Bicliques in Bipartite Graphs without Using GPUs	潘哲	Zhejiang University
11:00-11:25	UNR: Unified Notifiable RMA Library for HPC	丰光南	Sun Yat-sen University
11:25-11:50	EXO: Accelerating Storage Paravirtualization with eBPF	仇奕	Xiamen University
November 29, afternoon
13:00-13:25	Hydrogen: Contention-Aware Hybrid Memory for Heterogeneous CPU-GPU Architectures	李一鸣	Tsinghua University	1. Hybrid memory
13:25-13:50	Accelerated Communication in Deep Learning Recommendation Model Training with Dual-Level Adaptive Lossy Compression	杨锦武	Institute of Computing Technology, Chinese Academy of Sciences
13:50-14:15	FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplication on Tensor Cores	石金良	Beijing University of Posts and Telecommunications
14:15-14:40	GLUMI: Fast Connectivity Check Based on ULTs for Efficient Graph Pattern Matching	曹桂辰	Institute of Computing Technology, Chinese Academy of Sciences
14:40-15:05	COMPSO: Optimizing Gradient Compression for Distributed Training with Second-order Optimizers	谷裕达	Institute of Computing Technology, Chinese Academy of Sciences
15:05-15:30	Mario: Near Zero-cost Activation Checkpointing in Pipeline Parallelism	刘伟建	Institute of Computing Technology, Chinese Academy of Sciences
15:30-15:55	Harnessing Inter-GPU Shared Memory for Seamless MOE Communication-Computation Fusion	王成林	Wuhan University
15:55-16:20	TA: A Tensor Property-Aware Optimization System for Long-Context DNN Programs	钟恒昌	Tsinghua University
16:20-16:45	Helios: Efficient Distributed Dynamic Graph Sampling for Online GNN Inference	徐定	Zhejiang University
16:45-17:10	WePipe: Weight Pipeline Parallelism for Communication-Efficient Long-Context Large Model Training	林俊峰	Tsinghua University
17:10-17:35	Swift Unfolding of Communities: GPU-Accelerated Louvain Algorithm	王智彬	Nanjing University
17:35-18:00	Effectively Virtual Page Prefetching via Spatial-Temporal Patterns for Memory-intensive Cloud Applications	汪涛	Shanghai Jiao Tong University
18:00-	To be determined