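modern-gliner-bi-large-v1.0: A General-Purpose Named Entity Recognition Model Built on a Bi-Encoder Architecture

A GLiNER bi-encoder model that pairs ModernBERT-large with BGE-base-en. It recognizes arbitrary entity types, offers faster inference and strong generalization, supports a context length of up to 8,192 tokens, and is up to 4x more efficient than DeBERTa-based models. [This summary was AI-generated.]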


license: apache-2.0
language:
  - en
library_name: gliner
datasets:
  - urchade/pile-mistral-v0.1
  - numind/NuNER
  - knowledgator/GLINER-multi-task-synthetic-data
pipeline_tag: token-classification
tags:
  - NER
  - GLiNER
  - information extraction
  - encoder
  - entity recognition
  - modernbert
base_model:
  - answerdotai/ModernBERT-large
  - BAAI/bge-base-en-v1.5

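About

GLiNER is a Named Entity Recognition (NER) model that can identify any entity type using bidirectional transformer encoders (BERT-like). It is a practical alternative both to traditional NER models, which are limited to predefined entity types, and to Large Language Models (LLMs), which are flexible but too large and computationally expensive for resource-constrained scenarios.

This version uses a bi-encoder architecture: the text encoder is ModernBERT-large, and the entity label encoder is the sentence transformer BGE-base-en.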

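Compared with uni-encoder GLiNER, this architecture has the following advantages:

  • no upper limit on the number of entity types that can be recognized at once;
  • faster inference when the entity embeddings are precomputed;
  • better generalization to unseen entities.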

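With the ModernBERT encoder, the model is up to 4 times more efficient at inference than DeBERTa-based models while keeping comparable performance, and it supports a context length of up to 8,192 tokens.

[Figure: inference time comparison]

The bi-encoder architecture also has drawbacks. For example, it lacks inter-label interaction, which can make it hard for the model to disambiguate semantically similar entities that appear in different contexts.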
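The trade-off described above can be illustrated with a toy sketch. It is not GLiNER's actual implementation: the 3-dimensional vectors are hand-made stand-ins for encoder outputs, and `score_spans` is a hypothetical helper. It shows only the general bi-encoder idea: label embeddings are computed once and reused, and each span is scored against each label independently, so labels never interact with one another.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical, hand-made embeddings. In a real bi-encoder the label vectors
# would come from the label encoder and the span vectors from the text encoder.
label_embeddings = {
    "person": [1.0, 0.0, 0.0],
    "team": [0.0, 1.0, 0.0],
}
span_embeddings = {
    "Cristiano Ronaldo": [4.0, -1.0, 0.0],
    "Al Nassr": [-1.0, 3.0, 0.5],
}


def score_spans(span_embeddings, label_embeddings, threshold=0.5):
    """Score every span against every label independently.

    The label vectors are only read here, so they can be cached and reused
    across any number of texts. No label sees any other label.
    """
    matches = []
    for span, span_vec in span_embeddings.items():
        for label, label_vec in label_embeddings.items():
            score = sigmoid(dot(span_vec, label_vec))
            if score >= threshold:
                matches.append((span, label, round(score, 3)))
    return matches


matches = score_spans(span_embeddings, label_embeddings)
print(matches)
```

Because adding a label only adds one more vector to compare against, the number of labels is not bounded by the text encoder's input length. The same independence is why two near-synonymous labels cannot be contrasted against each other during scoring.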

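Installation & Usage

Install or update the gliner package:

pip install gliner -U

This model requires the latest version of transformers:

pip install git+https://github.com/huggingface/transformers.git

Once the GLiNER library is installed, you can import the GLiNER class. Load this model with GLiNER.from_pretrained and predict entities with predict_entities.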

from gliner import GLiNER

model = GLiNER.from_pretrained("knowledgator/modern-gliner-bi-large-v1.0")

text = """
Cristiano Ronaldo dos Santos Aveiro (Portuguese pronunciation: [kɾiʃˈtjɐnu ʁɔˈnaldu]; born 5 February 1985) is a Portuguese professional footballer who plays as a forward for and captains both Saudi Pro League club Al Nassr and the Portugal national team. Widely regarded as one of the greatest players of all time, Ronaldo has won five Ballon d'Or awards,[note 3] a record three UEFA Men's Player of the Year Awards, and four European Golden Shoes, the most by a European player. He has won 33 trophies in his career, including seven league titles, five UEFA Champions Leagues, the UEFA European Championship and the UEFA Nations League. Ronaldo holds the records for most appearances (183), goals (140) and assists (42) in the Champions League, goals in the European Championship (14), international goals (128) and international appearances (205). He is one of the few players to have made over 1,200 professional career appearances, the most by an outfield player, and has scored over 850 official senior career goals for club and country, making him the top goalscorer of all time.
"""

labels = ["person", "award", "date", "competitions", "teams"]

entities = model.predict_entities(text, labels, threshold=0.3)

for entity in entities:
    print(entity["text"], "=>", entity["label"])

Output:

Cristiano Ronaldo dos Santos Aveiro => person
5 February 1985 => date
Al Nassr => teams
Portugal national team => teams
Ballon d'Or => award
UEFA Men's Player of the Year Awards => award
European Golden Shoes => award
UEFA Champions Leagues => competitions
UEFA European Championship => competitions
UEFA Nations League => competitions
Champions League => competitions
European Championship => competitions
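For downstream use it is often handy to group the predictions by label. The sketch below is a hypothetical helper, not part of the gliner package. It relies on the "text" and "label" keys used in the example above and assumes each entity also carries a numeric "score" key. The sample entities and their scores are made up for illustration.

```python
from collections import defaultdict


def group_by_label(entities, min_score=0.0):
    """Group entity texts by label, dropping duplicates and low scores.

    Assumes each entity is a dict with "text" and "label" keys and,
    optionally, a "score" key (treated as 1.0 when missing).
    """
    grouped = defaultdict(list)
    for entity in entities:
        if entity.get("score", 1.0) < min_score:
            continue
        texts = grouped[entity["label"]]
        if entity["text"] not in texts:
            texts.append(entity["text"])
    return dict(grouped)


# Made-up sample in the shape described above; the scores are illustrative.
sample_entities = [
    {"text": "Al Nassr", "label": "teams", "score": 0.91},
    {"text": "Champions League", "label": "competitions", "score": 0.42},
    {"text": "Ballon d'Or", "label": "award", "score": 0.88},
    {"text": "Portugal national team", "label": "teams", "score": 0.95},
]

grouped = group_by_label(sample_entities, min_score=0.5)
print(grouped)
```

Filtering after prediction like this lets you run the model once with a permissive threshold and tune the cut-off later without re-running inference.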

To use Flash Attention or to extend the sequence length, see the code below.

First, install the Flash Attention and Triton packages:

pip install flash-attn triton

Then load the model with the corresponding options:

model = GLiNER.from_pretrained(
    "knowledgator/modern-gliner-bi-large-v1.0",
    _attn_implementation="flash_attention_2",
    max_len=2048,
).to("cuda:0")
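Flash Attention needs a CUDA GPU and the flash-attn package, so a script meant to run on different machines may want to fall back gracefully. The following is a minimal sketch under that assumption. `choose_load_kwargs` is a hypothetical helper. The keyword names `_attn_implementation` and `max_len` are taken from the snippet above, and passing `max_len` without Flash Attention is assumed to be acceptable.

```python
import importlib.util


def choose_load_kwargs(cuda_available, max_len=2048):
    """Build keyword arguments for GLiNER.from_pretrained.

    Flash Attention is requested only when a CUDA device is available and
    the flash_attn package can be found; otherwise the library default
    attention implementation is left in place.
    """
    kwargs = {"max_len": max_len}
    flash_attn_installed = importlib.util.find_spec("flash_attn") is not None
    if cuda_available and flash_attn_installed:
        kwargs["_attn_implementation"] = "flash_attention_2"
    return kwargs


# On a CPU-only machine only the sequence length is set.
print(choose_load_kwargs(cuda_available=False))
```

In a real script the flag would typically come from `torch.cuda.is_available()`, and the result would be unpacked into the loading call, for example `GLiNER.from_pretrained("knowledgator/modern-gliner-bi-large-v1.0", **kwargs)`.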

If you have a large number of entities and want to pre-embed them, see the following code snippet:

labels = ["your entities"]
texts = ["your texts"]

entity_embeddings = model.encode_labels(labels, batch_size = 8)

outputs = model.batch_predict_with_embeds(texts, entity_embeddings, labels)
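When the text collection is large, the label embeddings can be computed once and reused for every chunk of texts. The helper below is a hypothetical sketch built only on the two calls shown above, `encode_labels` and `batch_predict_with_embeds`. It assumes that `batch_predict_with_embeds` returns one list of entities per input text. The chunk size is an arbitrary illustrative default.

```python
def predict_in_chunks(model, texts, labels, chunk_size=16, label_batch_size=8):
    """Run prediction over many texts while encoding the labels only once.

    `model` is expected to expose encode_labels(labels, batch_size=...) and
    batch_predict_with_embeds(texts, embeddings, labels), as in the snippet
    above. Returns one result per input text, in the original order.
    """
    # The label side of the bi-encoder is independent of the texts,
    # so it is computed a single time up front.
    entity_embeddings = model.encode_labels(labels, batch_size=label_batch_size)

    results = []
    for start in range(0, len(texts), chunk_size):
        chunk = texts[start:start + chunk_size]
        # Assumption: one list of entities is returned per text in the chunk.
        results.extend(
            model.batch_predict_with_embeds(chunk, entity_embeddings, labels)
        )
    return results
```

With a loaded model this would be called as `predict_in_chunks(model, texts, labels)`. Chunking keeps peak memory bounded, while the cached embeddings avoid re-encoding the same labels for every chunk.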

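Benchmarks

Results on different datasets

The tables below show benchmark results on a range of named entity recognition datasets:

| Dataset | Score |
|---|---|
| ACE 2004 | 30.5% |
| ACE 2005 | 26.7% |
| AnatEM | 37.2% |
| Broad Tweet Corpus | 72.1% |
| CoNLL 2003 | 69.3% |
| FabNER | 22.0% |
| FindVehicle | 40.3% |
| GENIA_NER | 55.6% |
| HarveyNER | 16.1% |
| MultiNERD | 73.8% |
| Ontonotes | 39.2% |
| PolyglotNER | 49.1% |
| TweetNER7 | 39.6% |
| WikiANN en | 54.7% |
| WikiNeural | 83.7% |
| bc2gm | 53.7% |
| bc4chemd | 52.1% |
| bc5cdr | 67.0% |
| ncbi | 61.7% |
| Average | 49.7% |

| Dataset (zero-shot benchmark) | Score |
|---|---|
| CrossNER_AI | 58.1% |
| CrossNER_literature | 60.0% |
| CrossNER_music | 73.0% |
| CrossNER_politics | 72.8% |
| CrossNER_science | 66.5% |
| mit-movie | 47.6% |
| mit-restaurant | 40.6% |
| Average (zero-shot benchmark) | 59.8% |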
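The two reported averages are unweighted means of the per-dataset scores, which can be checked with a few lines. The numbers below are copied from the results above. The variable names are only illustrative.

```python
from statistics import mean

# Per-dataset scores (percent) for the first group of 19 datasets.
general_scores = {
    "ACE 2004": 30.5, "ACE 2005": 26.7, "AnatEM": 37.2,
    "Broad Tweet Corpus": 72.1, "CoNLL 2003": 69.3, "FabNER": 22.0,
    "FindVehicle": 40.3, "GENIA_NER": 55.6, "HarveyNER": 16.1,
    "MultiNERD": 73.8, "Ontonotes": 39.2, "PolyglotNER": 49.1,
    "TweetNER7": 39.6, "WikiANN en": 54.7, "WikiNeural": 83.7,
    "bc2gm": 53.7, "bc4chemd": 52.1, "bc5cdr": 67.0, "ncbi": 61.7,
}

# Per-dataset scores (percent) for the zero-shot benchmark.
zero_shot_scores = {
    "CrossNER_AI": 58.1, "CrossNER_literature": 60.0, "CrossNER_music": 73.0,
    "CrossNER_politics": 72.8, "CrossNER_science": 66.5,
    "mit-movie": 47.6, "mit-restaurant": 40.6,
}

general_average = round(mean(general_scores.values()), 1)
zero_shot_average = round(mean(zero_shot_scores.values()), 1)

print(general_average)    # 49.7
print(zero_shot_average)  # 59.8
```

Since the mean is unweighted, small datasets such as HarveyNER count as much as large ones, which is worth keeping in mind when comparing the averages with those of other models.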

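Join Our Discord

Join our community on Discord for the latest news, technical support, and discussion about our models. Join Discord.

Citation

If you use this model in your work, please cite: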

@misc{modernbert,
      title={Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference}, 
      author={Benjamin Warner and Antoine Chaffin and Benjamin Clavié and Orion Weller and Oskar Hallström and Said Taghadouini and Alexis Gallagher and Raja Biswas and Faisal Ladhak and Tom Aarsen and Nathan Cooper and Griffin Adams and Jeremy Howard and Iacopo Poli},
      year={2024},
      eprint={2412.13663},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.13663}, 
}
@misc{zaratiana2023gliner,
      title={GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer}, 
      author={Urchade Zaratiana and Nadi Tomeh and Pierre Holat and Thierry Charnois},
      year={2023},
      eprint={2311.08526},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
@misc{stepanov2024gliner,
      title={GLiNER multi-task: Generalist Lightweight Model for Various Information Extraction Tasks}, 
      author={Ihor Stepanov and Mykhailo Shtopko},
      year={2024},
      eprint={2406.12925},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
