---
dataset_info:
  features:
  - name: messages
    list:
    - name: content
      dtype: string
    - name: reasoning_content
      dtype: string
    - name: role
      dtype: string
  splits:
  - name: train
    num_bytes: 4453767623
    num_examples: 444544
  download_size: 2424411824
  dataset_size: 4453767623
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
task_categories:
- text-generation
- question-answering
language:
- en
tags:
- medical
- reasoning
- healthcare
- clinical
- chain-of-thought
- thinking
- sft
license: apache-2.0
---

Medical-Reasoning-SFT-Nemotron-Nano-30B

A large-scale medical reasoning dataset generated using nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16, containing over 444,000 samples with detailed chain-of-thought reasoning for medical and healthcare questions.

Dataset Overview

Metric	Value
Model	nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
Total Samples	444,544
Samples with Reasoning	444,544 (100%)
Estimated Tokens	~1.01 Billion
Content Tokens	~808 Million
Reasoning Tokens	~200 Million
Language	English
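
As a quick sanity check on the figures above, the following sketch derives a few per-sample averages from the table and the dataset metadata. The variable names are illustrative, and the token counts are the card's own rounded estimates, so the results are approximate.

```python
# Figures copied from the overview table and the dataset_info metadata.
num_examples = 444_544
content_tokens = 808_000_000        # approximate, as stated in the table
reasoning_tokens = 200_000_000      # approximate, as stated in the table
dataset_size_bytes = 4_453_767_623
download_size_bytes = 2_424_411_824

# Derived quantities.
total_tokens = content_tokens + reasoning_tokens
avg_tokens_per_sample = total_tokens / num_examples
reasoning_share = reasoning_tokens / total_tokens
avg_bytes_per_sample = dataset_size_bytes / num_examples
download_ratio = download_size_bytes / dataset_size_bytes

print(f"Total tokens:           ~{total_tokens / 1e9:.2f}B")
print(f"Avg tokens per sample:  ~{avg_tokens_per_sample:.0f}")
print(f"Reasoning token share:  ~{reasoning_share:.1%}")
print(f"Avg bytes per sample:   ~{avg_bytes_per_sample:.0f}")
print(f"Download / dataset size: {download_ratio:.2f}")
```

The content and reasoning estimates sum to roughly the 1.01 billion total reported in the table, and the download is a little over half the size of the uncompressed dataset.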

Schema

Each sample follows the conversational messages format with reasoning content:

{
  "messages": [
    {
      "role": "system",
      "content": "You are a medical expert...",
      "reasoning_content": null
    },
    {
      "role": "user",
      "content": "What are the symptoms of diabetes?",
      "reasoning_content": null
    },
    {
      "role": "assistant",
      "content": "The main symptoms of diabetes include...",
      "reasoning_content": "Let me think through this systematically. Diabetes affects blood sugar regulation, so I should consider symptoms related to hyperglycemia..."
    }
  ]
}

Fields

Field	Type	Description
`messages`	list	Array of message objects in the conversation
`messages[].role`	string	One of "system", "user", or "assistant"
`messages[].content`	string	The main message content
`messages[].reasoning_content`	string or null	Chain-of-thought reasoning (assistant messages only)
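
To make the field descriptions concrete, here is a minimal sketch of a checker that tests a sample against the schema above. The function name `validate_sample` and the exact rules are assumptions for illustration, based only on the table and the example; this is not an official validator shipped with the dataset.

```python
from typing import Any

# Roles listed in the Fields table.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_sample(sample: dict[str, Any]) -> list[str]:
    """Return a list of schema problems; an empty list means none were found."""
    messages = sample.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"messages[{i}]: expected an object")
            continue
        role = msg.get("role")
        reasoning = msg.get("reasoning_content")
        if role not in ALLOWED_ROLES:
            problems.append(f"messages[{i}]: unexpected role {role!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"messages[{i}]: content must be a string")
        if reasoning is not None and not isinstance(reasoning, str):
            problems.append(f"messages[{i}]: reasoning_content must be a string or null")
        # Assumption: reasoning is expected on assistant messages only.
        if role != "assistant" and reasoning:
            problems.append(f"messages[{i}]: reasoning_content on a non-assistant message")
    return problems


example = {
    "messages": [
        {"role": "system", "content": "You are a medical expert...", "reasoning_content": None},
        {"role": "user", "content": "What are the symptoms of diabetes?", "reasoning_content": None},
        {
            "role": "assistant",
            "content": "The main symptoms of diabetes include...",
            "reasoning_content": "Let me think through this systematically...",
        },
    ]
}
print(validate_sample(example))  # prints []
```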

Usage

Loading with Datasets Library

from datasets import load_dataset

dataset = load_dataset("OpenMed/Medical-Reasoning-SFT-Nemotron-Nano-30B")

Accessing Samples

# Get a sample
sample = dataset['train'][0]

# Access messages
for msg in sample['messages']:
    print(f"Role: {msg['role']}")
    print(f"Content: {msg['content'][:100]}...")
    if msg['reasoning_content']:
        print(f"Reasoning: {msg['reasoning_content'][:100]}...")

Filtering by Reasoning

# Keep samples whose final message (the assistant turn) has reasoning content
samples_with_reasoning = dataset['train'].filter(
    lambda x: x['messages'][-1]['reasoning_content'] is not None
)
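
The filter above inspects only the final message and treats an empty string as present. If a stricter check is wanted, a predicate along the following lines may help. The name `has_reasoning` is hypothetical, and the sketch assumes each message carries the `role` and `reasoning_content` keys described in the schema.

```python
def has_reasoning(sample: dict) -> bool:
    """True if any assistant message has non-empty, non-whitespace reasoning."""
    return any(
        msg["role"] == "assistant"
        and bool((msg["reasoning_content"] or "").strip())
        for msg in sample["messages"]
    )


# Small in-memory samples to show the behaviour without downloading anything.
with_reasoning = {
    "messages": [
        {"role": "user", "content": "Q", "reasoning_content": None},
        {"role": "assistant", "content": "A", "reasoning_content": "Step 1..."},
    ]
}
blank_reasoning = {
    "messages": [
        {"role": "user", "content": "Q", "reasoning_content": None},
        {"role": "assistant", "content": "A", "reasoning_content": "   "},
    ]
}
print(has_reasoning(with_reasoning))   # prints True
print(has_reasoning(blank_reasoning))  # prints False
```

Because the predicate takes a single example and returns a boolean, it can be passed directly to the filter call, as in `dataset['train'].filter(has_reasoning)`.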

Model Characteristics

NVIDIA Nemotron-3-Nano-30B is known for:

Efficient Architecture: Mixture-of-Experts (MoE) design with 30B total parameters, of which roughly 3B are active per token (the "A3B" in the model name)
Fast Inference: Optimized for speed while maintaining quality
Strong Reasoning: Capable of detailed step-by-step medical explanations
Production Ready: Designed for practical deployment scenarios

Intended Use

This dataset is designed for:

Fine-tuning medical reasoning models: Train LLMs to provide detailed, step-by-step medical reasoning
Chain-of-thought training: Develop models that show their thinking process
Medical QA systems: Build question-answering systems for healthcare applications
Research: Study reasoning patterns in medical domain AI
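
For the fine-tuning and chain-of-thought use cases above, one common preprocessing step is to fold `reasoning_content` into the assistant text so that a standard chat template can consume it. The sketch below shows one way to do that. The function name `to_training_messages` and the `<think>`/`</think>` delimiters are assumptions for illustration; the dataset does not prescribe a delimiter, so use whatever the target model's chat template expects.

```python
def to_training_messages(sample, open_tag="<think>", close_tag="</think>"):
    """Return role/content messages with reasoning inlined into assistant turns.

    The tag strings are hypothetical defaults, not part of the dataset.
    """
    out = []
    for msg in sample["messages"]:
        content = msg["content"]
        reasoning = msg.get("reasoning_content")
        if msg["role"] == "assistant" and reasoning:
            # Place the reasoning trace before the final answer.
            content = f"{open_tag}\n{reasoning}\n{close_tag}\n\n{content}"
        out.append({"role": msg["role"], "content": content})
    return out


sample = {
    "messages": [
        {"role": "system", "content": "You are a medical expert...", "reasoning_content": None},
        {"role": "user", "content": "What are the symptoms of diabetes?", "reasoning_content": None},
        {
            "role": "assistant",
            "content": "The main symptoms of diabetes include...",
            "reasoning_content": "Let me think through this systematically...",
        },
    ]
}

for message in to_training_messages(sample):
    print(message["role"], "->", message["content"][:60])
```

The returned list keeps only `role` and `content`, which is the shape most chat-template tooling accepts. Keeping the reasoning in a separate field until this step leaves room to train with or without the trace.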

Limitations and Considerations

This dataset is generated by an AI model and should not be used as a substitute for professional medical advice
Responses may contain inaccuracies and should be validated by medical professionals
Not intended for clinical decision-making without expert review
The reasoning traces reflect the model's approach, not necessarily optimal clinical reasoning

Citation

If you use this dataset, please cite:

@dataset{medical_reasoning_sft_nemotron_nano_30b,
  title={Medical-Reasoning-SFT-Nemotron-Nano-30B},
  author={OpenMed},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/OpenMed/Medical-Reasoning-SFT-Nemotron-Nano-30B}
}

License

Apache 2.0