bilingual-gpt-neox-4b


Overview

This repository provides an English-Japanese bilingual GPT-NeoX model of 3.8 billion parameters.
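As a rough sanity check on the 3.8 billion figure, the parameter count can be estimated from numbers stated elsewhere in this README (36 layers, hidden size 2816, vocabulary size 65,536). The sketch below assumes a standard GPT-NeoX-style block with a 4x MLP expansion and untied input and output embeddings, and it ignores biases and layer norms; none of these assumptions are confirmed by this README, and the function name is hypothetical.

```python
def estimate_parameters(n_layers, hidden_size, vocab_size, tied_embeddings=False):
    """Approximate the parameter count of a decoder-only transformer."""
    # Per layer: 4*h^2 for attention (QKV + output projection)
    # plus 8*h^2 for an MLP with 4x expansion (assumed).
    per_layer = 12 * hidden_size * hidden_size
    # Input embedding, plus a separate output projection if untied (assumed).
    embedding = vocab_size * hidden_size
    embeddings = embedding if tied_embeddings else 2 * embedding
    return n_layers * per_layer + embeddings


total = estimate_parameters(n_layers=36, hidden_size=2816, vocab_size=65536)
print(f"{total:,} parameters (~{total / 1e9:.1f}B)")
```

Under these assumptions the estimate lands close to the stated 3.8 billion.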

Library

The model was trained using code based on [EleutherAI/gpt-neox].

Model architecture

A 36-layer, 2816-hidden-size transformer-based language model.

Pre-training

The model was trained on around 524B tokens from a mixture of the following corpora:
- [Japanese CC-100]
- [Japanese C4]
- [The Pile]
- [RedPajama]
- [Wikipedia]

Model Series

Variant	Link
Bilingual 4B MiniGPT4
Bilingual 4B PPO	https://modelers.cn/models/FuJianAscend/bilingual-gpt-neox-4b-instruction-ppo
Bilingual 4B SFT	https://modelers.cn/models/FuJianAscend/bilingual-gpt-neox-4b-instruction-sft
Bilingual 4B 8K
Bilingual 4B	https://modelers.cn/models/FuJianAscend/bilingual-gpt-neox-4b
Japanese 3.6B PPO
Japanese 3.6B SFT-v2
Japanese 3.6B SFT
Japanese 3.6B

Contributors
- [Tianyu Zhao]
- [Toshiaki Wakatsuki]
- [Akio Kaga]
- [Koh Mitsuda]
- [Kei Sawada]

Benchmarking

Japanese benchmark

Our evaluation experiments suggest that the bilingual-gpt-neox-4b model performs slightly better than the previous [Japanese GPT-NeoX 3.6B] on Japanese tasks (56.12 vs. 55.07 on the 4-task average and 51.83 vs. 50.32 on the 6-task average).

The 4-task average accuracy is based on results of JCommonsenseQA, JNLI, MARC-ja, and JSQuAD.
The 6-task average accuracy is based on results of JCommonsenseQA, JNLI, MARC-ja, JSQuAD, XWinograd, and JAQKET-v2.

Model	4-task average accuracy	6-task average accuracy
bilingual-gpt-neox-4b-instruction-ppo	61.01	61.16
bilingual-gpt-neox-4b-instruction-sft	61.02	61.69
bilingual-gpt-neox-4b	56.12	51.83
japanese-gpt-neox-3.6b-instruction-ppo	59.86	60.07
japanese-gpt-neox-3.6b	55.07	50.32
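The gaps between models in the table above can be computed directly. The snippet below is a small illustration that copies the scores from the table; the dictionary and the `gap` helper are hypothetical names introduced here, not part of any evaluation code.

```python
# (4-task average, 6-task average) accuracy, copied from the table above.
scores = {
    "bilingual-gpt-neox-4b-instruction-ppo": (61.01, 61.16),
    "bilingual-gpt-neox-4b-instruction-sft": (61.02, 61.69),
    "bilingual-gpt-neox-4b": (56.12, 51.83),
    "japanese-gpt-neox-3.6b-instruction-ppo": (59.86, 60.07),
    "japanese-gpt-neox-3.6b": (55.07, 50.32),
}


def gap(model_a, model_b):
    """Return the per-column accuracy difference (a minus b), rounded to 2 decimals."""
    return tuple(round(a - b, 2) for a, b in zip(scores[model_a], scores[model_b]))


# Base bilingual model versus the previous Japanese base model.
print(gap("bilingual-gpt-neox-4b", "japanese-gpt-neox-3.6b"))
# PPO-tuned bilingual model versus its own base model.
print(gap("bilingual-gpt-neox-4b-instruction-ppo", "bilingual-gpt-neox-4b"))
```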

English benchmark

Using the [EleutherAI Language Model Evaluation Harness], we found that bilingual-gpt-neox-4b performs comparably to English and multilingual models of similar size (for example, 49.49 versus 50.15 for xglm-4.5b and 48.56 for bloom-3b).
- The average accuracy is based on results of ARC-Challenge, ARC-Easy, BoolQ, COPA, HellaSwag, OpenBookQA, PIQA, PROST, SWAG, and WinoGrande.

Model	Average accuracy
mpt-7b	59.30
llama-7b	57.35
bloom-7b	51.51
xglm-7.5b	50.96
xglm-4.5b	50.15
bilingual-gpt-neox-4b	49.49
bloom-3b	48.56
xglm-2.9b	47.44
bloom-1.7b	46.54

How to use

import argparse

from openmind import AutoModel, AutoTokenizer
from openmind import is_torch_npu_available


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model_name_or_path",
        type=str,
        required=True,
        help="Path to model",
    )
    return parser.parse_args()


if __name__ == "__main__":
    # Run on an Ascend NPU when one is available, otherwise fall back to CPU.
    device = "npu:0" if is_torch_npu_available() else "cpu"
    args = parse_args()
    model_path = args.model_name_or_path

    # use_fast=False is needed for the tokenizer features described
    # in the Tokenization section to function correctly.
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, use_fast=False)

    model = AutoModel.from_pretrained(model_path, trust_remote_code=True).to(device)

    inputs = tokenizer.encode("def print_hello_world():\tprint('Hello World!')", return_tensors="pt").to(device)
    # The first output of the base model is the final hidden states,
    # one vector per input token.
    embedding = model(inputs)[0]
    print(f'Dimension of the embedding: {embedding[0].size()}')
    print(embedding)
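
The script above returns hidden states rather than text. For text generation, a minimal sketch could look like the following. It assumes the checkpoint loads with the Hugging Face transformers `AutoModelForCausalLM` class, which this README does not confirm; the function name, the sampling settings, and the path in the usage comment are illustrative choices, not recommended values.

```python
def generate_text(model_path, prompt, max_new_tokens=64):
    """Load a causal-LM checkpoint and return a sampled continuation of `prompt`."""
    # Imported lazily so the function can be defined without the heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # use_fast=False follows the advice in the Tokenization section.
    tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
    model = AutoModelForCausalLM.from_pretrained(model_path)
    model.eval()

    # add_special_tokens=False keeps special tokens (such as EOS) out of the prompt.
    input_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(
            input_ids,
            max_new_tokens=max_new_tokens,
            do_sample=True,    # sampling settings are assumptions
            temperature=1.0,
            top_p=0.95,
            pad_token_id=tokenizer.pad_token_id,
        )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call (hypothetical local path):
# print(generate_text("/path/to/bilingual-gpt-neox-4b", "Hello, my name is"))
```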

Tokenization

The model uses a [sentencepiece]-based tokenizer.

- The tokenizer has a vocabulary size of 65,536.
- It uses byte fallback to decompose unknown text pieces into UTF-8 byte pieces, which avoids producing <UNK> tokens.
- It recognizes consecutive whitespace characters, newlines, and tabs so that it handles structured text better.
- We turned off the default behaviour of prepending a leading whitespace because it is not beneficial for processing Japanese.
- Specifically, a single whitespace is always processed as one token, so an English word never carries a preceding whitespace as it does in many other tokenizers (e.g. _Hello).
  - This decision trades English processing efficiency for a unified way to treat whitespace.
  - It leads to a significantly lower next-token prediction loss on English data because whitespace is easy to predict.
- Don't forget to set use_fast=False to make the above features function correctly.
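
Two of the behaviours above can be illustrated without loading the tokenizer. The sketch below is plain Python that mimics the idea, not the tokenizer itself: `byte_fallback_pieces` spells out a string as SentencePiece-style `<0xNN>` byte pieces, and `mark_whitespace` imitates the whitespace marking step with and without a prepended leading whitespace (the `add_dummy_prefix` option in SentencePiece). Both function names are hypothetical, and the real tokenizer only falls back to bytes for pieces that are missing from its vocabulary.

```python
def byte_fallback_pieces(text):
    """Spell out a string as one <0xNN> piece per UTF-8 byte."""
    return [f"<0x{byte:02X}>" for byte in text.encode("utf-8")]


def mark_whitespace(text, add_dummy_prefix):
    """Replace spaces with the U+2581 marker, optionally prepending one first."""
    if add_dummy_prefix:
        text = " " + text
    return text.replace(" ", "\u2581")


# An emoji assumed (for illustration) to be absent from the vocabulary.
print(byte_fallback_pieces("😀"))
# Many tokenizers prepend a whitespace, so the first word carries the marker.
print(mark_whitespace("Hello World", add_dummy_prefix=True))
# With the prefix turned off, as described above, the first word stays bare.
print(mark_whitespace("Hello World", add_dummy_prefix=False))
```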