

license: other

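NemoCurator FineWeb Nemotron-4 Edu Classifier

Model Overview

This is a text classification model that rates the educational value of a piece of text on a scale from 0 (low) to 5 (high). It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but with annotations generated by Nemotron-4-340B-Instruct. The original FineWeb-Edu classifier, by contrast, was trained on annotations from Llama 3 70B-Instruct. This classifier was used as part of a classifier ensemble in the creation of the Nemotron-CC dataset. The model was fine-tuned from Snowflake/snowflake-arctic-embed-m.

License

GOVERNING TERMS: Use of this model is governed by the NVIDIA Open Model License Agreement. Additional Information: Apache 2.0.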

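Model Architecture

Architecture Type: Transformer (BERT)
Network Architecture: Snowflake/snowflake-arctic-embed-m

How to Use in NeMo Curator

NeMo Curator improves generative AI model accuracy by processing text, image, and video data at scale for training and customization. It also provides pre-built pipelines for generating synthetic data to customize and evaluate generative AI systems.

The inference code for this model is available in the NeMo Curator GitHub repository. See this example notebook to get started.

How to Use in Transformers

To use the FineWeb Nemotron-4 Edu Classifier, use the following example code: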

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Example input: a list of strings to score.
texts = ["To make lemonade, you will need lemon juice, water, and sugar."]

# Load the classifier in bfloat16 and move it to the GPU when one is available.
model = AutoModelForSequenceClassification.from_pretrained(
    "nvidia/nemocurator-fineweb-nemotron-4-edu-classifier",
    torch_dtype=torch.bfloat16,
)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

tokenizer = AutoTokenizer.from_pretrained(
    "nvidia/nemocurator-fineweb-nemotron-4-edu-classifier"
)

# Tokenize the batch, padding to the longest text and truncating at 512 tokens.
inputs = tokenizer(
    texts,
    return_tensors="pt",
    padding="longest",
    truncation=True,
    max_length=512,
).to(device)

# Run inference without gradients; the model emits one regression score per text.
with torch.no_grad():
    outputs = model(**inputs)
    # Cast to float32 before converting, since NumPy has no bfloat16 dtype.
    logits = outputs.logits.squeeze(-1).float().cpu().numpy()

# Raw score, score clamped to [0, 5] and rounded, and a binary label at 2.5.
float_score = logits.tolist()
int_score = [int(round(max(0, min(score, 5)))) for score in logits]
pred_labels = ["high_quality" if score >= 2.5 else "low_quality" for score in logits]

print("Score:", float_score)
print("Rounded score:", int_score)
print("Predicted label:", pred_labels)
# Score: [1.0859375]
# Rounded score: [1]
# Predicted label: ['low_quality']
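
For filtering a larger corpus, the same scoring step can be applied batch by batch. The following is a minimal sketch, not the NeMo Curator implementation: `batched`, `filter_high_quality`, and the `score_fn` callable are hypothetical names introduced here, and `score_fn` is assumed to wrap the tokenizer and model calls from the example above and return one float per input text. The 2.5 threshold is taken from that example.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def filter_high_quality(
    texts: Sequence[str],
    score_fn: Callable[[Sequence[str]], Sequence[float]],
    batch_size: int = 32,
    threshold: float = 2.5,
) -> List[Tuple[str, float]]:
    """Return (text, score) pairs whose score is at or above `threshold`.

    `score_fn` is a hypothetical callable that scores one batch of texts,
    for example by wrapping the tokenizer and model calls shown earlier.
    """
    kept: List[Tuple[str, float]] = []
    for batch in batched(texts, batch_size):
        scores = score_fn(batch)
        if len(scores) != len(batch):
            raise ValueError("score_fn must return one score per text")
        kept.extend(
            (text, float(score))
            for text, score in zip(batch, scores)
            if score >= threshold
        )
    return kept
```

Keeping the model call behind `score_fn` lets the batching and thresholding logic be exercised with a stand-in scorer before the real model is loaded.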

Input and Output

Input

Input Type: Text
Input Format: String
Input Parameters: 1D
Other Properties Related to Input: Token limit of 512 tokens
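
The example code handles the 512-token limit by truncating, so text beyond the limit is never scored. The card does not prescribe any other strategy. One possible alternative, shown here purely as an assumption, is to split a long document's token IDs into windows and score each window separately. `split_into_windows` is a hypothetical helper; its default of 510 assumes the tokenizer adds two special tokens per sequence, as BERT-style tokenizers do.

```python
from typing import List, Sequence


def split_into_windows(
    token_ids: Sequence[int],
    max_length: int = 510,  # assumption: leaves room for two special tokens
    overlap: int = 0,
) -> List[Sequence[int]]:
    """Split `token_ids` into windows of at most `max_length` tokens.

    Consecutive windows share `overlap` tokens. The last window may be shorter.
    """
    if max_length < 1:
        raise ValueError("max_length must be at least 1")
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must satisfy 0 <= overlap < max_length")

    step = max_length - overlap
    windows: List[Sequence[int]] = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + max_length])
        # Stop once a window has reached the end of the document.
        if start + max_length >= len(token_ids):
            break
    return windows
```

How the per-window scores are combined (mean, maximum, or something else) is a separate design choice that the card does not address.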

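Output

Output Type: Classification score
Output Format: Float
Output Parameters: 1D
Other Properties Related to Output: The output ranges from 0 to 5, representing educational value from low to high.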
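
The Transformers example clamps the raw score to the 0-5 range before rounding and labels a text as high quality at 2.5 or above. That post-processing can be isolated in a small helper, sketched below. `postprocess` is a hypothetical name, and the 2.5 threshold is the one used in the example rather than a documented recommendation.

```python
from typing import Tuple


def postprocess(score: float, threshold: float = 2.5) -> Tuple[int, str]:
    """Map a raw score to (integer score in 0..5, quality label)."""
    # Clamp to the documented output range before rounding.
    clamped = max(0.0, min(float(score), 5.0))
    # Python's round() rounds halves to the nearest even integer,
    # so 2.5 becomes 2 and 3.5 becomes 4.
    int_score = int(round(clamped))
    # The label is based on the raw score, as in the example code.
    label = "high_quality" if score >= threshold else "low_quality"
    return int_score, label


print(postprocess(1.0859375))  # (1, 'low_quality')
```

Because the label uses the raw score while the integer uses round-half-to-even, a score of exactly 2.5 is labeled `high_quality` but rounds to 2.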

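Software Integration

Runtime Engine(s):

Python 3.10 and NeMo Curator

Supported Hardware Microarchitecture Compatibility:

NVIDIA GPU, Volta™ or higher (compute capability 7.0+), CUDA 12 (or above)

Operating System(s):

Ubuntu 22.04/20.04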
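
The compute capability requirement listed above can be checked programmatically before loading the model. The helper below is a small sketch with a hypothetical name, `meets_minimum_capability`. It only compares `(major, minor)` tuples, so it runs without a GPU.

```python
from typing import Tuple


def meets_minimum_capability(
    capability: Tuple[int, int],
    minimum: Tuple[int, int] = (7, 0),  # Volta, per the requirement above
) -> bool:
    """Return True if a (major, minor) compute capability meets the minimum."""
    # Tuples compare element by element, so (7, 5) >= (7, 0) and (6, 1) < (7, 0).
    return tuple(capability) >= tuple(minimum)


print(meets_minimum_capability((9, 0)))  # True
print(meets_minimum_capability((6, 1)))  # False
```

With PyTorch installed and a CUDA device present, the tuple can be obtained from `torch.cuda.get_device_capability()`.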

Model Version(s):

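Training, Testing, and Evaluation Dataset

The model was trained on the text of this dataset: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-llama3-annotations (a 467k-document subset of the FineWeb dataset), with annotations generated by Nemotron-4-340B-Instruct.

Training Dataset:

Link: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-llama3-annotations

Data Collection Method by Dataset

Automated

Labeling Method by Dataset

Synthetic

Properties: The model was trained on the text of the fineweb-edu-llama3-annotations dataset, but with annotations from Nemotron-4-340B-Instruct rather than the original Llama 3 70B-Instruct annotations. The dataset is a randomly sampled subset of 467k documents from the FineWeb dataset, which contains filtered documents crawled from the web. For more details, see https://arxiv.org/abs/2406.17557.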

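Evaluation Results

In the Nemotron-CC paper, these models were shown to be useful for classifying high-quality content for LLM pretraining when used as part of a classifier ensemble. See Table 9 of the paper.

In that table, "Ours-mistral" refers to the NemoCurator FineWeb Mixtral Edu Classifier and "Ours-nemotron-340B" refers to the NemoCurator FineWeb Nemotron-4 Edu Classifier (this model). "Ours-ensembled" includes the NemoCurator FineWeb Mixtral Edu Classifier, the NemoCurator FineWeb Nemotron-4 Edu Classifier, and DCLM.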
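
This card does not say how the ensemble combines its member classifiers. The sketch below shows one possible scheme, offered only as an assumption: convert each classifier's scores into rank-based buckets so that differently scaled outputs become comparable, then take the maximum bucket per document. `rank_buckets`, `ensemble_max`, and the bucket count of 20 are hypothetical choices for illustration.

```python
from typing import List, Sequence


def rank_buckets(scores: Sequence[float], num_buckets: int = 20) -> List[int]:
    """Assign each score a bucket in 0..num_buckets-1 based on its rank."""
    n = len(scores)
    # Indices of the documents, ordered from lowest to highest score.
    # sorted() is stable, so tied scores keep their input order.
    order = sorted(range(n), key=lambda i: scores[i])
    buckets = [0] * n
    for rank, idx in enumerate(order):
        buckets[idx] = rank * num_buckets // n
    return buckets


def ensemble_max(
    per_classifier_scores: Sequence[Sequence[float]],
    num_buckets: int = 20,
) -> List[int]:
    """Combine classifiers by taking the maximum rank bucket per document."""
    bucketed = [rank_buckets(s, num_buckets) for s in per_classifier_scores]
    # zip(*bucketed) groups the buckets that belong to the same document.
    return [max(doc_buckets) for doc_buckets in zip(*bucketed)]


scores_a = [0.1, 4.0, 2.0, 3.0]  # made-up scores from a first classifier
scores_b = [3.5, 0.2, 1.0, 2.0]  # made-up scores from a second classifier
print(ensemble_max([scores_a, scores_b], num_buckets=4))  # [3, 3, 1, 2]
```

Taking the maximum favors recall: a document is ranked highly if any member classifier ranks it highly.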

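Inference

Engine: Python 3.10 and PyTorch
Test Hardware: NVIDIA H100

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets the requirements of the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI concerns here.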
