{
 "cells": [
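  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# COMMUNICATION Summary\n",
    "\n",
    "Analysis of communication operator data in cluster scenarios.\n",
    "\n",
    "It covers the following three statistics:\n",
    "1. Duration statistics for communication operators across the whole cluster, grouped by operator type\n",
    "2. Duration of communication operators on each rank, grouped by operator type\n",
    "3. The TOP communication operators with the longest average duration across the whole cluster"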
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data preparation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from IPython.display import display, HTML\n",
    "display(HTML(\"<style>.container { width:95% !important; }</style>\"))\n",
    "\n",
    "import plotly.offline as pyo\n",
    "\n",
    "def is_lab_notebook():\n",
    "    import re\n",
    "    import psutil\n",
    "    return any(re.search('jupyter--lab-script', x) for x in psutil.Process().parent().cmdline())\n",
    "\n",
    "if is_lab_notebook():\n",
    "    pyo.init_notebook_mode()\n",
    "\n",
    "import pandas as pd\n",
    "pd.options.plotting.backend = \"plotly\"\n",
    "pd.set_option(\"display.max_rows\", 100)\n",
    "pd.set_option(\"display.width\", 1000)\n",
    "\n",
    "import cluster_display\n",
    "\n",
    "all_stats_df = pd.read_csv(\"all_stats.csv\", index_col=\"OpType\")\n",
    "rank_stats_df = pd.read_csv(\"rank_stats.csv\", index_col=\"OpType\")\n",
    "top_op_stats_df = pd.read_csv(\"top_op_stats.csv\", index_col=\"OpName\")"
   ]
  },
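  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Cluster communication operator duration analysis\n",
    "\n",
    "Aggregates the communication operators from all ranks in the cluster, groups them by operator type, and computes duration statistics. Times are in microseconds (us).\n",
    "\n",
    "The following statistics are included:\n",
    "- Count: number of operators\n",
    "- Mean: mean duration\n",
    "- Std: standard deviation\n",
    "- Min: minimum\n",
    "- Q1: first quartile (25th percentile)\n",
    "- Median: median\n",
    "- Q3: third quartile (75th percentile)\n",
    "- Max: maximum\n",
    "- Sum: total duration"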
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "display(all_stats_df)\n",
    "fig_all_rank = cluster_display.display_duration_boxplots(None, all_stats_df, x_title=\"Hccl OpType\")\n",
    "fig_per_rank = cluster_display.display_graph(None, all_stats_df.index, all_stats_df[[\"Q1(Us)\", \"Median(Us)\", \"Q3(Us)\"]], title=\"50% of Distribution\", x_title=\"Hccl OpType\")"
   ]
  },
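  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Per-rank communication operator duration analysis\n",
    "\n",
    "Aggregates the communication operators on each rank in the cluster, groups them by operator type, and computes duration statistics. Times are in microseconds (us).\n",
    "\n",
    "The following statistics are included:\n",
    "- Count: number of operators\n",
    "- Mean: mean duration\n",
    "- Std: standard deviation\n",
    "- Min: minimum\n",
    "- Q1: first quartile (25th percentile)\n",
    "- Median: median\n",
    "- Q3: third quartile (75th percentile)\n",
    "- Max: maximum\n",
    "- Sum: total duration"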
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rank_stats_gdf = rank_stats_df.groupby(rank_stats_df.index)\n",
    "cluster_display.display_stats_per_rank_groups_combobox(rank_stats_gdf)"
   ]
  },
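  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Cluster TOP-N communication operator duration analysis\n",
    "\n",
    "Statistics for the TOP-N communication operators with the longest duration in the cluster. Times are in microseconds (us).\n",
    "\n",
    "The following statistics are included:\n",
    "- Count: number of operators\n",
    "- Mean: mean duration\n",
    "- Std: standard deviation\n",
    "- Min: minimum\n",
    "- Q1: first quartile (25th percentile)\n",
    "- Median: median\n",
    "- Q3: third quartile (75th percentile)\n",
    "- Max: maximum\n",
    "- Sum: total duration\n",
    "- MinRank: the rank on which the shortest-duration instance of the operator ran\n",
    "- MaxRank: the rank on which the longest-duration instance of the operator ran"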
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "display(top_op_stats_df)\n",
    "fig_top_op = cluster_display.display_duration_boxplots(None, top_op_stats_df, x_title=\"Hccl OpName\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.12.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}