Commit ee68da9a, created on November 18, 2024

license: mit
language:
  • en
  • zh
pipeline_tag: text-generation
frameworks:
  • PyTorch
hardwares:
  • NPU
library_name: openmind

bilingual-gpt-neox-4b


Overview

This repository provides an English-Japanese bilingual GPT-NeoX model of 3.8 billion parameters.


Benchmarking

  • Japanese benchmark

    Our evaluation experiments suggest that the bilingual-gpt-neox-4b model performs slightly better than the previous [Japanese GPT-NeoX 3.6B] on Japanese tasks (56.12 vs. 55.07 on the 4-task average and 51.83 vs. 50.32 on the 6-task average).

    • The 4-task average accuracy is based on results of JCommonsenseQA, JNLI, MARC-ja, and JSQuAD.
    • The 6-task average accuracy is based on results of JCommonsenseQA, JNLI, MARC-ja, JSQuAD, XWinograd, and JAQKET-v2.
    | Model | 4-task average accuracy | 6-task average accuracy |
    |---|---|---|
    | bilingual-gpt-neox-4b-instruction-ppo | 61.01 | 61.16 |
    | bilingual-gpt-neox-4b-instruction-sft | 61.02 | 61.69 |
    | bilingual-gpt-neox-4b | 56.12 | 51.83 |
    | japanese-gpt-neox-3.6b-instruction-ppo | 59.86 | 60.07 |
    | japanese-gpt-neox-3.6b | 55.07 | 50.32 |
  • English benchmark

    Using the [EleutherAI Language Model Evaluation Harness], we found that bilingual-gpt-neox-4b performs comparably to English/multilingual models of similar size.

    • The average accuracy is based on results of ARC-Challenge, ARC-Easy, BoolQ, COPA, HellaSwag, OpenBookQA, PIQA, PROST, SWAG, and WinoGrande.
    | Model | Average accuracy |
    |---|---|
    | mpt-7b | 59.30 |
    | llama-7b | 57.35 |
    | bloom-7b | 51.51 |
    | xglm-7.5b | 50.96 |
    | xglm-4.5b | 50.15 |
    | bilingual-gpt-neox-4b | 49.49 |
    | bloom-3b | 48.56 |
    | xglm-2.9b | 47.44 |
    | bloom-1.7b | 46.54 |

How to use

import argparse

import torch
from openmind import AutoModelForCausalLM, AutoTokenizer
from openmind import is_torch_npu_available


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_name_or_path", type=str, required=True, help="Path to model")
    return parser.parse_args()


if __name__ == '__main__':
    # Run on the first NPU when one is available; otherwise fall back to CPU.
    device = "npu:0" if is_torch_npu_available() else "cpu"
    args = parse_args()
    model_path = args.model_name_or_path

    # use_fast=False is needed for the whitespace and byte-fallback handling
    # described in the Tokenization section.
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, use_fast=False)
    model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True).to(device)

    # The prompt and sampling settings below are illustrative only.
    prompt = "Hello, my name is"
    token_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt").to(device)

    # Generate a continuation without tracking gradients.
    with torch.no_grad():
        output_ids = model.generate(
            token_ids,
            max_new_tokens=100,
            do_sample=True,
            top_p=0.95,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    print(tokenizer.decode(output_ids[0].tolist()))

Tokenization

The model uses a [sentencepiece]-based tokenizer.

  • The tokenizer has a vocabulary size of 65,536.
  • It uses byte fallback to decompose unknown text pieces into UTF-8 byte pieces to avoid producing <UNK> tokens.
  • It can recognize consecutive whitespaces, newlines, and tabs to handle structured texts better.
  • We turned off the default behaviour of prepending a leading whitespace because it is not beneficial for processing Japanese.
  • Specifically, a single whitespace is always processed as one token, so an English word never carries a preceding whitespace marker as it does in many other tokenizers (e.g. _Hello).
    • This decision trades English processing efficiency for a unified way of treating whitespace.
    • It leads to a significantly lower next-token prediction loss on English data because whitespace tokens are easy to predict.
  • Set use_fast=False when loading the tokenizer; otherwise the features above do not function correctly.
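
The byte-fallback behaviour listed above can be illustrated with a toy sketch. This is not the sentencepiece implementation: the character-level vocabulary `toy_vocab` and the helpers `byte_fallback_pieces` and `pieces_to_text` are hypothetical names introduced here, and a real tokenizer works on learned subword pieces rather than single characters. Only the `<0xXX>` piece format follows sentencepiece's byte-fallback convention.

```python
def byte_fallback_pieces(text, vocab):
    """Split text into characters; replace out-of-vocabulary ones with UTF-8 byte pieces."""
    pieces = []
    for ch in text:
        if ch in vocab:
            pieces.append(ch)
        else:
            # Unknown character: emit one <0xXX> piece per UTF-8 byte instead of <UNK>.
            pieces.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return pieces


def pieces_to_text(pieces):
    """Invert byte_fallback_pieces by reassembling the raw bytes and decoding them."""
    buf = bytearray()
    for p in pieces:
        if len(p) == 6 and p.startswith("<0x") and p.endswith(">"):
            buf.append(int(p[3:5], 16))
        else:
            buf.extend(p.encode("utf-8"))
    return buf.decode("utf-8")


# Toy vocabulary (an assumption for illustration): only "a", "b", and the space are known.
toy_vocab = {"a", "b", " "}
pieces = byte_fallback_pieces("a 猫", toy_vocab)
print(pieces)                  # ['a', ' ', '<0xE7>', '<0x8C>', '<0xAB>']
print(pieces_to_text(pieces))  # a 猫
```

Because every byte value has its own piece, any input string can be encoded and decoded losslessly, which is why no <UNK> token is needed.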