license: mit
language:
- en
- zh
pipeline_tag: text-generation
frameworks:
- PyTorch
hardwares:
- NPU
library_name: openmind
bilingual-gpt-neox-4b

Overview
This repository provides an English-Japanese bilingual GPT-NeoX model of 3.8 billion parameters.
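The 3.8 billion figure can be roughly reproduced from numbers stated elsewhere in this card: 36 layers, a hidden size of 2816, and a vocabulary of 65,536. The sketch below is a back-of-the-envelope estimate, not the exact count. It assumes a standard transformer block (about 12 x hidden^2 weights per layer, from a 4x-wide MLP plus attention projections), ignores biases and layer norms, and assumes the input and output embeddings are not tied. The function name `estimate_params` is hypothetical.

```python
def estimate_params(n_layers: int, hidden: int, vocab: int,
                    tied_embeddings: bool = False) -> int:
    """Rough transformer parameter count (biases and layer norms ignored)."""
    # Assumption: 4*h^2 for attention (Q, K, V, output) + 8*h^2 for a 4x-wide MLP.
    per_layer = 12 * hidden * hidden
    # Assumption: separate input embedding and output projection unless tied.
    embeddings = vocab * hidden * (1 if tied_embeddings else 2)
    return n_layers * per_layer + embeddings


total = estimate_params(n_layers=36, hidden=2816, vocab=65536)
print(f"{total / 1e9:.2f}B parameters")  # prints "3.79B parameters"
```

Under these assumptions the estimate lands within rounding distance of the stated 3.8 billion.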
- Library
  The model was trained using code based on [EleutherAI/gpt-neox].
- Model architecture
  A 36-layer, 2816-hidden-size transformer-based language model.
- Pre-training
  The model was trained on around 524B tokens from a mixture of the following corpora:
  - [Japanese CC-100]
  - [Japanese C4]
  - [The Pile]
  - [Redpajama]
  - [Wikipedia]
- Model Series

  | Variant | Link |
  | --- | --- |
  | Bilingual 4B MiniGPT4 | |
  | Bilingual 4B PPO | https://modelers.cn/models/FuJianAscend/bilingual-gpt-neox-4b-instruction-ppo |
  | Bilingual 4B SFT | https://modelers.cn/models/FuJianAscend/bilingual-gpt-neox-4b-instruction-sft |
  | Bilingual 4B 8K | |
  | Bilingual 4B | https://modelers.cn/models/FuJianAscend/bilingual-gpt-neox-4b |
  | Japanese 3.6B PPO | |
  | Japanese 3.6B SFT-v2 | |
  | Japanese 3.6B SFT | |
  | Japanese 3.6B | |

- Contributors
  - [Tianyu Zhao]
  - [Toshiaki Wakatsuki]
  - [Akio Kaga]
  - [Koh Mitsuda]
  - [Kei Sawada]
Benchmarking
- Japanese benchmark
  Our evaluation experiments suggest that the bilingual-gpt-neox-4b model performs slightly better than the previous [Japanese GPT-NeoX 3.6B] on Japanese tasks (56.12 vs. 55.07 on the 4-task average).
  - The 4-task average accuracy is based on results of JCommonsenseQA, JNLI, MARC-ja, and JSQuAD.
  - The 6-task average accuracy is based on results of JCommonsenseQA, JNLI, MARC-ja, JSQuAD, XWinograd, and JAQKET-v2.

  | Model | 4-task average accuracy | 6-task average accuracy |
  | --- | --- | --- |
  | bilingual-gpt-neox-4b-instruction-ppo | 61.01 | 61.16 |
  | bilingual-gpt-neox-4b-instruction-sft | 61.02 | 61.69 |
  | bilingual-gpt-neox-4b | 56.12 | 51.83 |
  | japanese-gpt-neox-3.6b-instruction-ppo | 59.86 | 60.07 |
  | japanese-gpt-neox-3.6b | 55.07 | 50.32 |

- English benchmark
  Using the [EleutherAI Language Model Evaluation Harness], we found that bilingual-gpt-neox-4b performs comparably to English/multilingual models of similar size.
  - The average accuracy is based on results of Arc-Challenge, Arc-Easy, BoolQ, COPA, HellaSwag, OpenBookQA, PIQA, PROST, SWAG, and WinoGrande.

  | Model | Average accuracy |
  | --- | --- |
  | mpt-7b | 59.30 |
  | llama-7b | 57.35 |
  | bloom-7b | 51.51 |
  | xglm-7.5b | 50.96 |
  | xglm-4.5b | 50.15 |
  | bilingual-gpt-neox-4b | 49.49 |
  | bloom-3b | 48.56 |
  | xglm-2.9b | 47.44 |
  | bloom-1.7b | 46.54 |
How to use
import argparse

from openmind import AutoModel, AutoTokenizer
from openmind import is_torch_npu_available


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model_name_or_path",
        type=str,
        help="Path to model",
        required=True,
    )
    return parser.parse_args()


if __name__ == '__main__':
    # Run on an Ascend NPU when one is available; otherwise fall back to CPU.
    device = "npu:0" if is_torch_npu_available() else "cpu"

    args = parse_args()
    model_path = args.model_name_or_path

    # use_fast=False is needed for the tokenizer features described under Tokenization.
    # add_eos_token=True requests an EOS token at the end of each tokenized sequence.
    tokenizer = AutoTokenizer.from_pretrained(
        model_path, trust_remote_code=True, use_fast=False, add_eos_token=True
    )
    # AutoModel loads the base transformer without a language-modeling head,
    # so this example returns hidden states rather than generated text.
    model = AutoModel.from_pretrained(model_path, trust_remote_code=True).to(device)

    inputs = tokenizer.encode("def print_hello_world():\tprint('Hello World!')", return_tensors="pt").to(device)
    # The first output holds the last hidden states, one vector per input token.
    embedding = model(inputs)[0]
    print(f'Dimension of the embedding: {embedding[0].size()}')
    print(embedding)
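Since the card's pipeline tag is text-generation, a generation-oriented variant may also be useful. The following is a minimal sketch, not a tested recipe: it assumes that `openmind` exposes `AutoModelForCausalLM` in the same way it exposes `AutoModel`, and the helper names (`build_generation_kwargs`, `generate_text`) and sampling values are illustrative choices rather than settings recommended by the authors.

```python
def build_generation_kwargs(tokenizer, max_new_tokens=100):
    """Collect arguments for model.generate(); sampling values are illustrative."""
    return {
        "max_new_tokens": max_new_tokens,
        "do_sample": True,
        "temperature": 1.0,
        "top_p": 0.95,
        "pad_token_id": tokenizer.pad_token_id,
        "eos_token_id": tokenizer.eos_token_id,
    }


def generate_text(model_path, prompt, max_new_tokens=100):
    """Continue `prompt` with the causal LM stored at `model_path`."""
    # Imported lazily so the helper above works without these packages installed.
    import torch
    # Assumption: openmind provides AutoModelForCausalLM alongside AutoModel.
    from openmind import AutoModelForCausalLM, AutoTokenizer
    from openmind import is_torch_npu_available

    device = "npu:0" if is_torch_npu_available() else "cpu"
    # use_fast=False follows the guidance in the Tokenization section.
    tokenizer = AutoTokenizer.from_pretrained(
        model_path, trust_remote_code=True, use_fast=False
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_path, trust_remote_code=True
    ).to(device)
    model.eval()

    token_ids = tokenizer.encode(
        prompt, add_special_tokens=False, return_tensors="pt"
    ).to(device)
    with torch.no_grad():
        output_ids = model.generate(
            token_ids, **build_generation_kwargs(tokenizer, max_new_tokens)
        )
    return tokenizer.decode(output_ids[0].tolist())
```

A call such as `generate_text("/path/to/bilingual-gpt-neox-4b", "Hello, ")` would then return the prompt followed by a sampled continuation; the path shown is a placeholder.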
Tokenization
The model uses a [sentencepiece]-based tokenizer.
- The tokenizer has a vocabulary size of 65,536.
- It uses byte fallback to decompose unknown text pieces into UTF-8 byte pieces to avoid producing <UNK> tokens.
- It can recognize consecutive whitespaces, newlines, and tabs to handle structured texts better.
- We turned off the default behaviour of prepending leading whitespace because it is not beneficial for processing Japanese.
- Specifically, single whitespace is always processed as one token so that any English word won't have a preceding whitespace like in many other tokenizers (e.g. _Hello).
- This decision trades the English processing efficiency for a unified way to treat whitespaces.
- It leads to a significantly lower loss of next token prediction on English data because whitespaces are easy to predict.
- Don't forget to set use_fast=False to make the above features function correctly.
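To make two of the points above concrete, here is a toy illustration of byte fallback and of whitespace being kept as its own token. It is not the model's tokenizer and does not use sentencepiece: the functions `byte_fallback` and `toy_tokenize` and the one-word vocabulary are hypothetical, and the toy treats every whitespace character separately, whereas the real tokenizer can also recognize runs of whitespace. Only the `<0xNN>` spelling of byte pieces follows the sentencepiece convention.

```python
import re


def byte_fallback(piece: str) -> list[str]:
    """Spell a piece as UTF-8 byte tokens in the <0xNN> style."""
    return [f"<0x{b:02X}>" for b in piece.encode("utf-8")]


def toy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Toy tokenizer: whole-word lookup, whitespace as its own token."""
    tokens: list[str] = []
    # Split on single whitespace characters and keep them as separate items.
    for item in re.split(r"(\s)", text):
        if not item:
            continue
        if item.isspace() or item in vocab:
            # Known words carry no leading-whitespace marker such as _Hello.
            tokens.append(item)
        else:
            # Unknown pieces become byte tokens instead of <UNK>.
            tokens.extend(byte_fallback(item))
    return tokens


print(toy_tokenize("Hello é", {"Hello"}))
# ['Hello', ' ', '<0xC3>', '<0xA9>']
```

The space appears as a standalone token between the two words, and the out-of-vocabulary character is recoverable from its two bytes.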