---
dataset_info:
  features:
  - name: messages
    list:
    - name: content
      dtype: string
    - name: reasoning_content
      dtype: string
    - name: role
      dtype: string
  splits:
  - name: train
    num_bytes: 4453767623
    num_examples: 444544
  download_size: 2424411824
  dataset_size: 4453767623
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
task_categories:
- text-generation
- question-answering
language:
- en
tags:
- medical
- reasoning
- healthcare
- clinical
- chain-of-thought
- thinking
- sft
license: apache-2.0
---

Medical-Reasoning-SFT-Nemotron-Nano-30B

A large-scale medical reasoning dataset generated using nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16, containing over 444,000 samples with detailed chain-of-thought reasoning for medical and healthcare questions.

Dataset Overview

| Metric | Value |
|---|---|
| Model | nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 |
| Total Samples | 444,544 |
| Samples with Reasoning | 444,544 (100%) |
| Estimated Tokens | ~1.01 Billion |
| Content Tokens | ~808 Million |
| Reasoning Tokens | ~200 Million |
| Language | English |
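
As a rough sanity check, the figures above can be turned into per-sample averages. The sketch below uses only the numbers from the overview and the dataset metadata; the token counts are approximate, so the results are ballpark values rather than measurements.

```python
# Figures copied from the overview table and the dataset metadata.
NUM_EXAMPLES = 444_544
DATASET_SIZE_BYTES = 4_453_767_623
DOWNLOAD_SIZE_BYTES = 2_424_411_824
CONTENT_TOKENS = 808e6    # approximate, from the overview
REASONING_TOKENS = 200e6  # approximate, from the overview

total_tokens = CONTENT_TOKENS + REASONING_TOKENS
avg_tokens_per_sample = total_tokens / NUM_EXAMPLES
reasoning_share = REASONING_TOKENS / total_tokens
avg_bytes_per_sample = DATASET_SIZE_BYTES / NUM_EXAMPLES
# Ratio of the in-memory dataset size to the compressed download size.
size_ratio = DATASET_SIZE_BYTES / DOWNLOAD_SIZE_BYTES

print(f"avg tokens per sample: {avg_tokens_per_sample:,.0f}")
print(f"reasoning share:       {reasoning_share:.1%}")
print(f"avg bytes per sample:  {avg_bytes_per_sample:,.0f}")
print(f"dataset/download size: {size_ratio:.2f}x")
```

This works out to roughly 2,270 tokens and about 10 KB per sample, with reasoning making up about a fifth of all tokens.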

Schema

Each sample follows the conversational messages format with reasoning content:

{
  "messages": [
    {
      "role": "system",
      "content": "You are a medical expert...",
      "reasoning_content": null
    },
    {
      "role": "user",
      "content": "What are the symptoms of diabetes?",
      "reasoning_content": null
    },
    {
      "role": "assistant",
      "content": "The main symptoms of diabetes include...",
      "reasoning_content": "Let me think through this systematically. Diabetes affects blood sugar regulation, so I should consider symptoms related to hyperglycemia..."
    }
  ]
}

Fields

| Field | Type | Description |
|---|---|---|
| messages | list | Array of message objects in the conversation |
| messages[].role | string | One of "system", "user", or "assistant" |
| messages[].content | string | The main message content |
| messages[].reasoning_content | string or null | Chain-of-thought reasoning; populated for assistant messages, null for system and user messages |
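
The field descriptions above can be expressed as a small structural check. This is a minimal sketch, not part of the dataset tooling: the function name `validate_sample` and the exact rules are assumptions derived from the schema as described here.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_sample(sample):
    """Return a list of problems found in one sample (empty list = looks valid).

    Hypothetical helper: the rules mirror the field table, nothing more.
    """
    messages = sample.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems = []
    for i, msg in enumerate(messages):
        role = msg.get("role")
        if role not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {role!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i}: 'content' must be a string")
        reasoning = msg.get("reasoning_content")
        if reasoning is not None and not isinstance(reasoning, str):
            problems.append(f"message {i}: 'reasoning_content' must be a string or None")
        # Per the field table, reasoning belongs to assistant messages only.
        if reasoning and role != "assistant":
            problems.append(f"message {i}: reasoning on a {role!r} message")
    return problems


example = {
    "messages": [
        {"role": "system", "content": "You are a medical expert...", "reasoning_content": None},
        {"role": "user", "content": "What are the symptoms of diabetes?", "reasoning_content": None},
        {
            "role": "assistant",
            "content": "The main symptoms of diabetes include...",
            "reasoning_content": "Let me think through this systematically...",
        },
    ]
}
print(validate_sample(example))
```

Returning a list of problems rather than raising on the first one makes it easier to audit a batch of samples and count each kind of violation.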

Usage

Loading with Datasets Library

from datasets import load_dataset

dataset = load_dataset("OpenMed/Medical-Reasoning-SFT-Nemotron-Nano-30B")
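
The metadata lists a download size of about 2.4 GB, so it may be convenient to inspect a few samples without fetching everything first. The sketch below uses the `streaming=True` option of `load_dataset`; the helper names `stream_train_split` and `peek` are illustrative assumptions, not part of any library.

```python
from itertools import islice

REPO_ID = "OpenMed/Medical-Reasoning-SFT-Nemotron-Nano-30B"


def stream_train_split(repo_id=REPO_ID):
    """Return an iterable over the train split that fetches data lazily."""
    # Imported here so the helper below stays usable without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, split="train", streaming=True)


def peek(samples, n=3):
    """Collect the first n items from any iterable of samples."""
    return list(islice(samples, n))
```

Calling `peek(stream_train_split(), n=3)` should then return the first three samples as dictionaries with a `messages` key, downloading only as much data as is needed to produce them.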

Accessing Samples

# Get a sample
sample = dataset['train'][0]

# Access messages
for msg in sample['messages']:
    print(f"Role: {msg['role']}")
    print(f"Content: {msg['content'][:100]}...")
    if msg['reasoning_content']:
        print(f"Reasoning: {msg['reasoning_content'][:100]}...")

Filtering by Reasoning

# Keep only samples whose final message carries reasoning content
samples_with_reasoning = dataset['train'].filter(
    lambda x: x['messages'][-1]['reasoning_content'] is not None
)
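
The filter above inspects only the last message and treats an empty string as present. A slightly stricter variant, sketched below under the assumption that reasoning may appear on any assistant turn, checks every assistant message and ignores whitespace-only reasoning. The name `has_reasoning` is illustrative.

```python
def has_reasoning(sample):
    """True if any assistant message has non-empty reasoning_content."""
    return any(
        msg["role"] == "assistant"
        and bool((msg.get("reasoning_content") or "").strip())
        for msg in sample["messages"]
    )


with_reasoning = {
    "messages": [
        {"role": "user", "content": "Q", "reasoning_content": None},
        {"role": "assistant", "content": "A", "reasoning_content": "Step 1..."},
    ]
}
without_reasoning = {
    "messages": [
        {"role": "user", "content": "Q", "reasoning_content": None},
        {"role": "assistant", "content": "A", "reasoning_content": "   "},
    ]
}
print(has_reasoning(with_reasoning), has_reasoning(without_reasoning))
```

The predicate can be passed directly to the filter call, for example `dataset['train'].filter(has_reasoning)`.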

Model Characteristics

NVIDIA Nemotron-3-Nano-30B is known for:

  • Efficient Architecture: Mixture-of-Experts (MoE) design with 30B total parameters; the A3B suffix in the model name indicates roughly 3B active parameters per token
  • Fast Inference: Optimized for speed while maintaining quality
  • Strong Reasoning: Capable of detailed step-by-step medical explanations
  • Production Ready: Designed for practical deployment scenarios

Intended Use

This dataset is designed for:

  • Fine-tuning medical reasoning models: Train LLMs to provide detailed, step-by-step medical reasoning
  • Chain-of-thought training: Develop models that show their thinking process
  • Medical QA systems: Build question-answering systems for healthcare applications
  • Research: Study reasoning patterns in medical domain AI
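
For the fine-tuning and chain-of-thought use cases listed above, each sample has to be flattened into training text that keeps the reasoning next to the final answer. The sketch below shows one possible layout. The `<think>` tags and the `role: text` format are assumptions chosen for illustration; in practice the chat template of the model being fine-tuned would normally decide the format.

```python
def to_training_text(messages, think_open="<think>", think_close="</think>"):
    """Flatten a list of message dicts into a single training string.

    Hypothetical formatter: assistant reasoning is wrapped in think tags
    and placed before the assistant's final content.
    """
    parts = []
    for msg in messages:
        body = msg["content"]
        reasoning = msg.get("reasoning_content")
        if msg["role"] == "assistant" and reasoning:
            body = f"{think_open}\n{reasoning}\n{think_close}\n{body}"
        parts.append(f"{msg['role']}: {body}")
    return "\n\n".join(parts)


demo_messages = [
    {"role": "user", "content": "What are the symptoms of diabetes?", "reasoning_content": None},
    {
        "role": "assistant",
        "content": "The main symptoms of diabetes include...",
        "reasoning_content": "Diabetes affects blood sugar regulation...",
    },
]
print(to_training_text(demo_messages))
```

Keeping the tag strings as parameters makes it straightforward to switch to whatever reasoning delimiters the target model expects.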

Limitations and Considerations

  • This dataset is generated by an AI model and should not be used as a substitute for professional medical advice
  • Responses may contain inaccuracies and should be validated by medical professionals
  • Not intended for clinical decision-making without expert review
  • The reasoning traces reflect the model's approach, not necessarily optimal clinical reasoning

Citation

If you use this dataset, please cite:

@dataset{medical_reasoning_sft_nemotron_nano_30b,
  title={Medical-Reasoning-SFT-Nemotron-Nano-30B},
  author={OpenMed},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/OpenMed/Medical-Reasoning-SFT-Nemotron-Nano-30B}
}

License

Apache 2.0