4a7a9654创建于 2025年6月18日历史提交
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Semantic Layer on CSV\n",
    "\n",
    "In this notebook, we will show how to create a semantic layer on a CSV file.\n",
    "The semantic layer works as a bridge between the raw data and the natural language layer.\n",
    "\n",
    "### Why use a Semantic Layer?\n",
    "- Adds context and meaning to data columns\n",
    "- Makes it easier for the large language model to understand context\n",
    "- Set once, use across multiple sessions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Import PandasAI"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandasai as pai"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Read raw data\n",
    "\n",
    "For this example, we will use a small dataset of heart disease patients from [Kaggle](https://www.kaggle.com/datasets/arezaei81/heartcsv)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the heart disease dataset\n",
    "file_df = pai.read_csv(\"./dataheart.csv\")\n",
    "\n",
    "# Display the first few rows\n",
    "file_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create the Semantic Layer\n",
    "\n",
    "Requirements for the semantic layer:\n",
    "- `path`: Must be in format 'organization/dataset'\n",
    "- `name`: A descriptive name for the dataset\n",
    "-  `df`: A dataframe\n",
    "- `description`: Brief overview of the dataset\n",
    "- `columns`: List of dictionaries with format:\n",
    "  ```python\n",
    "  {\n",
    "      \"name\": \"column_name\",\n",
    "      \"type\": \"column_type\",  # string, number, date, datetime\n",
    "      \"description\": \"column_description\"\n",
    "  }\n",
    "  ```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset = pai.create(path=\"organization/heart\",\n",
    "    name=\"Heart\",\n",
    "    description=\"Heart Disease Dataset\",\n",
    "    df = file_df,\n",
    "    columns=[\n",
    "        {\n",
    "            \"name\": \"Age\",\n",
    "            \"type\": \"integer\",\n",
    "            \"description\": \"Age of the patient in years\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"Sex\",\n",
    "            \"type\": \"string\",\n",
    "            \"description\": \"Gender of the patient (M: Male, F: Female)\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"ChestPainType\",\n",
    "            \"type\": \"string\",\n",
    "            \"description\": \"Type of chest pain (ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic, TA: Typical Angina)\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"RestingBP\",\n",
    "            \"type\": \"integer\",\n",
    "            \"description\": \"Resting blood pressure in mm Hg\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"Cholesterol\",\n",
    "            \"type\": \"integer\",\n",
    "            \"description\": \"Serum cholesterol in mg/dl\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"FastingBS\",\n",
    "            \"type\": \"integer\",\n",
    "            \"description\": \"Fasting blood sugar (1: if FastingBS > 120 mg/dl, 0: otherwise)\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"RestingECG\",\n",
    "            \"type\": \"string\",\n",
    "            \"description\": \"Resting electrocardiogram results (Normal, ST: having ST-T wave abnormality, LVH: showing probable or definite left ventricular hypertrophy)\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"MaxHR\",\n",
    "            \"type\": \"integer\",\n",
    "            \"description\": \"Maximum heart rate achieved\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"ExerciseAngina\",\n",
    "            \"type\": \"string\",\n",
    "            \"description\": \"Exercise-induced angina (Y: Yes, N: No)\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"Oldpeak\",\n",
    "            \"type\": \"float\",\n",
    "            \"description\": \"ST depression induced by exercise relative to rest\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"ST_Slope\",\n",
    "            \"type\": \"string\",\n",
    "            \"description\": \"Slope of the peak exercise ST segment (Up, Flat, Down)\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"HeartDisease\",\n",
    "            \"type\": \"integer\",\n",
    "            \"description\": \"Target variable - Heart disease presence (1: heart disease, 0: normal)\"\n",
    "        }\n",
    "    ])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load Semantic Dataframe\n",
    "\n",
    "Once you have saved the dataframe with its semantic layer, you can load it in any session using the `load()` method. This allows you to:\n",
    "- Maintain data context across sessions\n",
    "- Ask questions about your data in natural language\n",
    "- Generate more accurate analysis and visualizations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the semantically enhanced dataset\n",
    "df = pai.load(\"organization/heart\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Chat with your dataframe\n",
    "\n",
    "You can now ask questions about your data in natural language to your dataframe using the `chat()` method. PandasAI can be used with several LLMs. For the purpose of this example, we are using LiteLLM."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandasai_litellm.litellm import LiteLLM\n",
    "\n",
    "# Initialize LiteLLM with your OpenAI model\n",
    "llm = LiteLLM(model=\"gpt-4.1-mini\", api_key=\"YOUR_OPENAI_API_KEY\")\n",
    "\n",
    "# Configure PandasAI to use this LLM\n",
    "pai.config.set({\n",
    "    \"llm\": llm\n",
    "})\n",
    "\n",
    "response = df.chat(\"What is the correlation between age and cholesterol?\")\n",
    "\n",
    "print(response)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}