---
dataset_info:
  features:
  - name: messages
    list:
    - name: content
      dtype: string
    - name: reasoning_content
      dtype: string
    - name: role
      dtype: string
  splits:
  - name: train
    num_bytes: 4453767623
    num_examples: 444544
  download_size: 2424411824
  dataset_size: 4453767623
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
task_categories:
- text-generation
- question-answering
language:
- en
tags:
- medical
- reasoning
- healthcare
- clinical
- chain-of-thought
- thinking
- sft
license: apache-2.0
---
Medical-Reasoning-SFT-Nemotron-Nano-30B
A large-scale medical reasoning dataset generated using nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16, containing over 444,000 samples with detailed chain-of-thought reasoning for medical and healthcare questions.
Dataset Overview
| Metric | Value |
|---|---|
| Model | nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 |
| Total Samples | 444,544 |
| Samples with Reasoning | 444,544 (100%) |
| Estimated Tokens | ~1.01 Billion |
| Content Tokens | ~808 Million |
| Reasoning Tokens | ~200 Million |
| Language | English |
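The token figures in the table are rounded, so it can help to see how they relate. The following sketch only does arithmetic on the numbers stated in this card (the table above and the `dataset_size` metadata field); the per-sample averages it prints are derived estimates, not measured values.

```python
# Figures copied from the Dataset Overview table and the card metadata.
TOTAL_SAMPLES = 444_544
CONTENT_TOKENS = 808_000_000     # "~808 Million" (approximate)
REASONING_TOKENS = 200_000_000   # "~200 Million" (approximate)
DATASET_BYTES = 4_453_767_623    # dataset_size from the metadata

# Derived quantities; all are estimates because the inputs are rounded.
total_tokens = CONTENT_TOKENS + REASONING_TOKENS
tokens_per_sample = total_tokens / TOTAL_SAMPLES
reasoning_share = REASONING_TOKENS / total_tokens
bytes_per_sample = DATASET_BYTES / TOTAL_SAMPLES

print(f"Total tokens:      {total_tokens / 1e9:.2f} billion")
print(f"Tokens per sample: {tokens_per_sample:,.0f}")
print(f"Reasoning share:   {reasoning_share:.1%}")
print(f"Bytes per sample:  {bytes_per_sample:,.0f}")
```

Under these rounded inputs, content and reasoning tokens sum to about 1.01 billion, which matches the "Estimated Tokens" row, and reasoning accounts for roughly a fifth of all tokens.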
Schema
Each sample follows the conversational messages format with reasoning content:
{
  "messages": [
    {
      "role": "system",
      "content": "You are a medical expert...",
      "reasoning_content": null
    },
    {
      "role": "user",
      "content": "What are the symptoms of diabetes?",
      "reasoning_content": null
    },
    {
      "role": "assistant",
      "content": "The main symptoms of diabetes include...",
      "reasoning_content": "Let me think through this systematically. Diabetes affects blood sugar regulation, so I should consider symptoms related to hyperglycemia..."
    }
  ]
}
Fields
| Field | Type | Description |
|---|---|---|
| messages | list | Array of message objects in the conversation |
| messages[].role | string | One of "system", "user", or "assistant" |
| messages[].content | string | The main message content |
| messages[].reasoning_content | string or null | Chain-of-thought reasoning; populated for assistant messages only, null otherwise |
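The field descriptions above can be turned into a lightweight sanity check. This is a minimal sketch, not part of the dataset tooling: `validate_sample` and `ALLOWED_ROLES` are hypothetical names introduced here, and the rules simply mirror the table (known roles, string content, reasoning only on assistant messages).

```python
ALLOWED_ROLES = {"system", "user", "assistant"}  # roles listed in the Fields table


def validate_sample(sample):
    """Return a list of problems found in one sample (empty list means valid)."""
    messages = sample.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems = []
    for i, msg in enumerate(messages):
        role = msg.get("role")
        if role not in ALLOWED_ROLES:
            problems.append(f"messages[{i}]: unexpected role {role!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"messages[{i}]: content must be a string")
        reasoning = msg.get("reasoning_content")
        if reasoning is not None and not isinstance(reasoning, str):
            problems.append(f"messages[{i}]: reasoning_content must be a string or null")
        elif reasoning and role != "assistant":
            # Non-empty reasoning is only expected on assistant messages.
            problems.append(f"messages[{i}]: reasoning_content set on a {role} message")
    return problems


example = {
    "messages": [
        {"role": "user", "content": "What are the symptoms of diabetes?", "reasoning_content": None},
        {"role": "assistant", "content": "The main symptoms include...", "reasoning_content": "Let me think..."},
    ]
}
print(validate_sample(example))
```

A check like this is useful before training, since a single malformed record can be hard to trace once it has been tokenized.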
Usage
Loading with Datasets Library
from datasets import load_dataset
dataset = load_dataset("OpenMed/Medical-Reasoning-SFT-Nemotron-Nano-30B")
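Because the download is about 2.4 GB, it may be convenient to inspect a few records without fetching everything. The sketch below assumes the `streaming=True` option of `load_dataset`, which returns an iterable dataset; `open_stream` and `first_n` are hypothetical helper names introduced for illustration.

```python
from itertools import islice


def first_n(samples, n):
    """Materialize at most n items from any iterable of samples."""
    return list(islice(samples, n))


def open_stream(repo_id="OpenMed/Medical-Reasoning-SFT-Nemotron-Nano-30B", split="train"):
    """Open the split lazily; records are fetched only as they are iterated."""
    # Imported here so the helper above stays usable without the library installed.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=True)
```

Calling `first_n(open_stream(), 3)` would then pull only the first three records over the network instead of the full split.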
Accessing Samples
# Get a sample
sample = dataset['train'][0]

# Access messages
for msg in sample['messages']:
    print(f"Role: {msg['role']}")
    print(f"Content: {msg['content'][:100]}...")
    if msg['reasoning_content']:
        print(f"Reasoning: {msg['reasoning_content'][:100]}...")
Filtering by Reasoning
# Get samples whose final message has reasoning content
samples_with_reasoning = dataset['train'].filter(
    lambda x: x['messages'][-1]['reasoning_content'] is not None
)
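Since every sample is reported to contain reasoning, a presence check alone keeps the whole split. A stricter variant is to filter on reasoning length. The sketch below is one way to do that; `final_reasoning` and `has_min_reasoning` are hypothetical helpers, and the 200-character default is an arbitrary illustration, not a recommended threshold.

```python
def final_reasoning(sample):
    """Return the reasoning text of the last message, or "" if it is missing."""
    messages = sample["messages"]
    if not messages:
        return ""
    return messages[-1].get("reasoning_content") or ""


def has_min_reasoning(sample, min_chars=200):
    """True if the final message carries at least min_chars of reasoning."""
    return len(final_reasoning(sample).strip()) >= min_chars


short = {"messages": [{"role": "assistant", "content": "A", "reasoning_content": "brief"}]}
long_ = {"messages": [{"role": "assistant", "content": "A", "reasoning_content": "x" * 250}]}
print(has_min_reasoning(short), has_min_reasoning(long_))
```

The predicate can be passed straight to the filter shown above, for example `dataset['train'].filter(has_min_reasoning, fn_kwargs={"min_chars": 500})`.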
Model Characteristics
NVIDIA Nemotron-3-Nano-30B is known for:
- Efficient Architecture: Advanced Mixture-of-Experts (MoE) design with 30B total parameters
- Fast Inference: Optimized for speed while maintaining quality
- Strong Reasoning: Capable of detailed step-by-step medical explanations
- Production Ready: Designed for practical deployment scenarios
Intended Use
This dataset is designed for:
- Fine-tuning medical reasoning models: Train LLMs to provide detailed, step-by-step medical reasoning
- Chain-of-thought training: Develop models that show their thinking process
- Medical QA systems: Build question-answering systems for healthcare applications
- Research: Study reasoning patterns in medical domain AI
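For fine-tuning, each sample usually has to be flattened into a single training string. The sketch below shows one possible layout; the `<think>` delimiters and the `role: text` format are assumptions made for illustration, not a template prescribed by this dataset or by any particular model. In practice the target model's own chat template should take precedence.

```python
def to_training_text(sample, think_open="<think>", think_close="</think>"):
    """Flatten one sample into text, placing assistant reasoning before the answer."""
    parts = []
    for msg in sample["messages"]:
        body = msg["content"]
        reasoning = msg.get("reasoning_content")
        if msg["role"] == "assistant" and reasoning:
            # Assumed layout: reasoning wrapped in delimiters, then the final answer.
            body = f"{think_open}\n{reasoning}\n{think_close}\n{body}"
        parts.append(f"{msg['role']}: {body}")
    return "\n\n".join(parts)


demo = {
    "messages": [
        {"role": "user", "content": "Q", "reasoning_content": None},
        {"role": "assistant", "content": "A", "reasoning_content": "R"},
    ]
}
print(to_training_text(demo))
```

Keeping the delimiters as parameters makes it easy to switch to whatever reasoning markers the base model was trained with.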
Limitations and Considerations
- This dataset is generated by an AI model and should not be used as a substitute for professional medical advice
- Responses may contain inaccuracies and should be validated by medical professionals
- Not intended for clinical decision-making without expert review
- The reasoning traces reflect the model's approach, not necessarily optimal clinical reasoning
Citation
If you use this dataset, please cite:
@dataset{medical_reasoning_sft_nemotron_nano_30b,
  title={Medical-Reasoning-SFT-Nemotron-Nano-30B},
  author={OpenMed},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/OpenMed/Medical-Reasoning-SFT-Nemotron-Nano-30B}
}
License
Apache 2.0