{
"cells": [
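{
"cell_type": "markdown",
"metadata": {},
"source": [
"# COMMUNICATION Summary\n",
"\n",
"Analysis of communication operator data in cluster scenarios.\n",
"\n",
"This notebook covers the following three sets of statistics:\n",
"1. Duration statistics for communication operators across the whole cluster, grouped by operator type\n",
"2. Duration statistics for communication operators on each rank, grouped by operator type\n",
"3. The TOP communication operators with the longest mean duration across the whole cluster"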
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Data Preparation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from IPython.display import display, HTML\n",
"display(HTML(\"<style>.container { width:95% !important; }</style>\"))\n",
"\n",
"import plotly.offline as pyo\n",
"\n",
"def is_lab_notebook():\n",
" import re\n",
" import psutil\n",
" return any(re.search('jupyter--lab-script', x) for x in psutil.Process().parent().cmdline())\n",
"\n",
"if is_lab_notebook():\n",
" pyo.init_notebook_mode()\n",
"\n",
"import pandas as pd\n",
"pd.options.plotting.backend = \"plotly\"\n",
"pd.set_option(\"display.max_rows\", 100)\n",
"pd.set_option(\"display.width\", 1000)\n",
"\n",
"import cluster_display\n",
"\n",
"all_stats_df = pd.read_csv(\"all_stats.csv\", index_col=\"OpType\")\n",
"rank_stats_df = pd.read_csv(\"rank_stats.csv\", index_col=\"OpType\")\n",
"top_op_stats_df = pd.read_csv(\"top_op_stats.csv\", index_col=\"OpName\")"
]
},
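{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Cluster Communication Operator Duration Analysis\n",
"\n",
"Communication operators from all ranks in the cluster are aggregated and grouped by operator type, and their durations are summarized. All times are in microseconds (us).\n",
"\n",
"The following statistics are reported:\n",
"- Count: number of operators\n",
"- Mean: mean duration\n",
"- Std: standard deviation\n",
"- Min: minimum duration\n",
"- Q1: first quartile (25th percentile)\n",
"- Median: median duration\n",
"- Q3: third quartile (75th percentile)\n",
"- Max: maximum duration\n",
"- Sum: total duration"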
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"display(all_stats_df)\n",
"fig_all_rank = cluster_display.display_duration_boxplots(None, all_stats_df, x_title=\"Hccl OpType\")\n",
"fig_per_rank = cluster_display.display_graph(None, all_stats_df.index, all_stats_df[[\"Q1(Us)\", \"Median(Us)\", \"Q3(Us)\"]], title=\"50% of Distribution\", x_title=\"Hccl OpType\")"
]
},
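{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Per-Rank Communication Operator Duration Analysis\n",
"\n",
"Communication operators on each rank in the cluster are aggregated and grouped by operator type, and their durations are summarized. All times are in microseconds (us).\n",
"\n",
"The following statistics are reported:\n",
"- Count: number of operators\n",
"- Mean: mean duration\n",
"- Std: standard deviation\n",
"- Min: minimum duration\n",
"- Q1: first quartile (25th percentile)\n",
"- Median: median duration\n",
"- Q3: third quartile (75th percentile)\n",
"- Max: maximum duration\n",
"- Sum: total duration"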
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"rank_stats_gdf = rank_stats_df.groupby(rank_stats_df.index)\n",
"cluster_display.display_stats_per_rank_groups_combobox(rank_stats_gdf)"
]
},
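{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Cluster TOP-N Communication Operator Duration Analysis\n",
"\n",
"Statistics for the TOP-N most time-consuming communication operators in the cluster. All times are in microseconds (us).\n",
"\n",
"The following statistics are reported:\n",
"- Count: number of operators\n",
"- Mean: mean duration\n",
"- Std: standard deviation\n",
"- Min: minimum duration\n",
"- Q1: first quartile (25th percentile)\n",
"- Median: median duration\n",
"- Q3: third quartile (75th percentile)\n",
"- Max: maximum duration\n",
"- Sum: total duration\n",
"- MinRank: the rank on which the operator had its shortest duration\n",
"- MaxRank: the rank on which the operator had its longest duration"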
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"display(top_op_stats_df)\n",
"fig_top_op = cluster_display.display_duration_boxplots(None, top_op_stats_df, x_title=\"Hccl OpName\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.12.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}