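A large-scale medical reasoning dataset generated with NVIDIA Nemotron-3-Nano-30B, containing 444,544 samples. Every sample includes detailed chain-of-thought reasoning, making the dataset suitable for fine-tuning medical reasoning models and building medical question-answering systems.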

dataset_info:
  features:
  - name: messages
    list:
    - name: content
      dtype: string
    - name: reasoning_content
      dtype: string
    - name: role
      dtype: string
  splits:
  - name: train
    num_bytes: 4453767623
    num_examples: 444544
  download_size: 2424411824
  dataset_size: 4453767623
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
task_categories:
- text-generation
- question-answering
language:
- en
tags:
- medical
- reasoning
- healthcare
- clinical
- chain-of-thought
- thinking
- sft
license: apache-2.0

Medical-Reasoning-SFT-Nemotron-Nano-30B

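A large-scale medical reasoning dataset generated using nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16, containing over 444,000 samples with detailed chain-of-thought reasoning for medical and healthcare questions.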

Dataset Overview

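Metric	Value
Model	nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
Total samples	444,544
Samples with reasoning	444,544 (100%)
Estimated total tokens	~1.01 billion
Content tokens	~808 million
Reasoning tokens	~200 million
Language	English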
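
As a quick sanity check on the figures above, the sketch below derives per-sample averages from the table. The token counts are the rounded estimates from the table, so the results are approximate; the function name `summarize` is only illustrative.

```python
# Rounded figures copied from the overview table (estimates, not exact counts).
TOTAL_SAMPLES = 444_544
TOTAL_TOKENS = 1.01e9       # ~1.01 billion
CONTENT_TOKENS = 808e6      # ~808 million
REASONING_TOKENS = 200e6    # ~200 million


def summarize():
    """Derive simple per-sample averages and shares from the rounded totals."""
    return {
        "avg_tokens_per_sample": TOTAL_TOKENS / TOTAL_SAMPLES,
        "avg_reasoning_tokens_per_sample": REASONING_TOKENS / TOTAL_SAMPLES,
        "reasoning_share": REASONING_TOKENS / TOTAL_TOKENS,
        "content_plus_reasoning": CONTENT_TOKENS + REASONING_TOKENS,
    }


if __name__ == "__main__":
    for key, value in summarize().items():
        print(f"{key}: {value:,.2f}")
```

On these rounded inputs an average sample comes to roughly 2,270 tokens, of which about a fifth is reasoning. The content and reasoning estimates sum to about 1.008 billion, which agrees with the ~1.01 billion total after rounding.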

Data Structure

Each sample follows a conversational message format that includes reasoning content:

{
  "messages": [
    {
      "role": "system",
      "content": "You are a medical expert...",
      "reasoning_content": null
    },
    {
      "role": "user",
      "content": "What are the symptoms of diabetes?",
      "reasoning_content": null
    },
    {
      "role": "assistant",
      "content": "The main symptoms of diabetes include...",
      "reasoning_content": "Let me think through this systematically. Diabetes affects blood sugar regulation, so I should consider symptoms related to hyperglycemia..."
    }
  ]
}

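Fields

Field	Type	Description
`messages`	list	Array of message objects in the conversation
`messages[].role`	string	One of "system", "user", or "assistant"
`messages[].content`	string	The main content of the message
`messages[].reasoning_content`	string or null	Chain-of-thought reasoning (assistant messages only)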
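
The field table can be turned into a small structural check. The following is a minimal sketch, not part of the dataset tooling: `validate_sample` and `ALLOWED_ROLES` are hypothetical names, and the rule that reasoning appears only on assistant messages is taken from the table's description.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_sample(sample):
    """Return a list of problems found in one sample; an empty list means it looks well-formed."""
    problems = []
    messages = sample.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    for i, msg in enumerate(messages):
        role = msg.get("role")
        if role not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {role!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i}: 'content' must be a string")
        reasoning = msg.get("reasoning_content")
        if reasoning is not None and not isinstance(reasoning, str):
            problems.append(f"message {i}: 'reasoning_content' must be a string or None")
        # Per the field table, reasoning is only expected on assistant messages.
        if reasoning and role != "assistant":
            problems.append(f"message {i}: reasoning on non-assistant role {role!r}")
    return problems


if __name__ == "__main__":
    example = {
        "messages": [
            {"role": "user", "content": "What are the symptoms of diabetes?", "reasoning_content": None},
            {"role": "assistant", "content": "The main symptoms...", "reasoning_content": "Let me think..."},
        ]
    }
    print(validate_sample(example))
```

Running a check like this over a slice of the data before training is a cheap way to catch rows that deviate from the documented schema.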

Usage

Loading with the Datasets Library

from datasets import load_dataset

dataset = load_dataset("OpenMed/Medical-Reasoning-SFT-Nemotron-Nano-30B")

Accessing Samples

# Get a sample
sample = dataset['train'][0]

# Access messages
for msg in sample['messages']:
    print(f"Role: {msg['role']}")
    print(f"Content: {msg['content'][:100]}...")
    if msg['reasoning_content']:
        print(f"Reasoning: {msg['reasoning_content'][:100]}...")
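
For downstream processing it can be convenient to collapse each conversation into a flat record. The helper below is a sketch under the assumption that each sample has at most one message per role (as in the example above); `flatten_sample` is a hypothetical name, not something the dataset provides.

```python
def flatten_sample(sample):
    """Collapse one conversation into a flat dict keyed by role.

    Assumes at most one message per role matters; if a role repeats,
    the last occurrence wins.
    """
    record = {"system": None, "user": None, "assistant": None, "reasoning": None}
    for msg in sample["messages"]:
        role = msg["role"]
        if role in ("system", "user", "assistant"):
            record[role] = msg["content"]
        if role == "assistant":
            # Only assistant messages are documented to carry reasoning.
            record["reasoning"] = msg.get("reasoning_content")
    return record


if __name__ == "__main__":
    example = {
        "messages": [
            {"role": "system", "content": "You are a medical expert...", "reasoning_content": None},
            {"role": "user", "content": "What are the symptoms of diabetes?", "reasoning_content": None},
            {"role": "assistant", "content": "The main symptoms...", "reasoning_content": "Let me think..."},
        ]
    }
    print(flatten_sample(example))
```

With the Datasets library, a function like this should be usable as `dataset['train'].map(flatten_sample, remove_columns=['messages'])` to obtain one column per role.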

Filtering by Reasoning

# Get samples with reasoning content
samples_with_reasoning = dataset['train'].filter(
    lambda x: x['messages'][-1]['reasoning_content'] is not None
)
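
Because the overview reports that 100% of samples include reasoning, a plain `is not None` check is expected to keep every row. A length threshold is more selective; the sketch below shows one possible predicate, where `has_min_reasoning` and the `min_chars` default are illustrative assumptions rather than recommended values.

```python
def has_min_reasoning(sample, min_chars=200):
    """Return True if the final message carries at least `min_chars` of reasoning text."""
    reasoning = sample["messages"][-1].get("reasoning_content")
    return reasoning is not None and len(reasoning.strip()) >= min_chars


if __name__ == "__main__":
    example = {
        "messages": [
            {"role": "user", "content": "Q", "reasoning_content": None},
            {"role": "assistant", "content": "A", "reasoning_content": "step " * 100},
        ]
    }
    print(has_min_reasoning(example))
```

The predicate can be passed directly to `dataset['train'].filter(has_min_reasoning)`, or with a different threshold via `fn_kwargs={'min_chars': 500}`.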

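Model Characteristics

NVIDIA Nemotron-3-Nano-30B offers the following characteristics:

Efficient architecture: a mixture-of-experts (MoE) design with 30 billion total parameters
Fast inference: optimized for speed while maintaining quality
Strong reasoning: able to produce detailed, step-by-step medical explanations
Production-ready: designed for real-world deployment scenarios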

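Intended Use

This dataset is intended for:

Fine-tuning medical reasoning models: training large language models (LLMs) to provide detailed, step-by-step medical reasoning
Chain-of-thought training: developing models that show their reasoning process
Medical question answering: building QA systems for healthcare applications
Research: studying reasoning patterns in medical AI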
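
For the fine-tuning and chain-of-thought use cases, the separate `reasoning_content` field usually has to be merged into the assistant text before a standard chat template can be applied. The following is a minimal sketch of one way to do that. The `<think>` and `</think>` delimiters are an assumption, not something this dataset prescribes, and `to_training_messages` is a hypothetical helper; substitute whatever format the target model expects.

```python
def to_training_messages(sample, open_tag="<think>", close_tag="</think>"):
    """Fold reasoning_content into the assistant text and drop the extra field.

    The tag strings are an assumption; use whatever the target model expects.
    """
    out = []
    for msg in sample["messages"]:
        content = msg["content"]
        reasoning = msg.get("reasoning_content")
        if msg["role"] == "assistant" and reasoning:
            # Reasoning goes first, followed by the final answer.
            content = f"{open_tag}\n{reasoning}\n{close_tag}\n\n{content}"
        out.append({"role": msg["role"], "content": content})
    return out


if __name__ == "__main__":
    example = {
        "messages": [
            {"role": "user", "content": "What are the symptoms of diabetes?", "reasoning_content": None},
            {"role": "assistant", "content": "The main symptoms...", "reasoning_content": "Let me think..."},
        ]
    }
    for message in to_training_messages(example):
        print(message)
```

The output is a list of plain `role`/`content` messages, which is the shape most chat-template tooling accepts.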

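Limitations and Considerations

This dataset was generated by an AI model and is not a substitute for professional medical advice
Responses may contain inaccuracies and should be verified by medical professionals
Not suitable for clinical decision-making without expert review
The reasoning traces reflect the model's own reasoning process and do not necessarily represent optimal clinical reasoning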

Citation

If you use this dataset, please cite:

@dataset{medical_reasoning_sft_nemotron_nano_30b,
  title={Medical-Reasoning-SFT-Nemotron-Nano-30B},
  author={OpenMed},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/OpenMed/Medical-Reasoning-SFT-Nemotron-Nano-30B}
}

License

Apache 2.0
