
language:
- en
license:
- cc-by-4.0
multilinguality:
- monolingual
size_categories:
- 10M<n<100M
source_datasets:
- amphion/Emilia-YODAS
pretty_name: NeuCodec Emilia-YODAS
tags:
- speech
- codec
- audio
- FSQ
- audio-compression
dataset_info:
  features:
  - name: _id
    dtype: string
  - name: dnsmos
    dtype: float64
  - name: duration
    dtype: float64
  - name: language
    dtype: string
  - name: phone_count
    dtype: int64
  - name: speaker
    dtype: string
  - name: text
    dtype: string
  - name: codes
    sequence: int64
  splits:
  - name: train
    num_bytes: 120073409108
    num_examples: 30592309
  download_size: 41111605599
  dataset_size: 120073409108
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*

Dataset Card for NeuCodec Emilia-YODAS

Dataset Description

Repository: NeuCodec repository
Paper: Coming Soon

Dataset Summary

The NeuCodec Emilia-YODAS dataset is an English-language dataset containing >30M audio samples (>78k hours), taken from the English-language subset of Emilia-YODAS and compressed with NeuCodec.
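
The headline figures can be turned into a few back-of-the-envelope numbers. The sketch below is only an estimate: it uses the example count and byte sizes from the card metadata, and treats the ">78k hours" lower bound as if it were the exact total, which is an assumption.

```python
# Figures taken from the card metadata and summary.
NUM_EXAMPLES = 30_592_309        # num_examples of the train split
DATASET_BYTES = 120_073_409_108  # dataset_size
DOWNLOAD_BYTES = 41_111_605_599  # download_size
TOTAL_HOURS = 78_000             # assumption: the ">78k hours" lower bound used as the total

# Average clip length in seconds (a lower-bound estimate).
mean_clip_seconds = TOTAL_HOURS * 3600 / NUM_EXAMPLES

# Average size of one example, and download size relative to dataset size.
mean_bytes_per_example = DATASET_BYTES / NUM_EXAMPLES
download_ratio = DOWNLOAD_BYTES / DATASET_BYTES

print(f"mean clip length: ~{mean_clip_seconds:.1f} s")
print(f"mean bytes per example: ~{mean_bytes_per_example:.0f}")
print(f"download size / dataset size: ~{download_ratio:.2f}")
```

Under these assumptions the average clip is roughly 9 seconds long.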

Usage

import torch
from datasets import load_dataset
from neucodec import NeuCodec

# load dataset and model
dataset = load_dataset("neuphonic/emilia-yodas-english-neucodec", split="train", streaming=True)
model = NeuCodec.from_pretrained("neuphonic/neucodec")
model.eval()

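# reconstruct a sample
# "codes" is a flat list of ints; add two leading dims to get shape (1, 1, T)
fsq_codes = torch.tensor(next(iter(dataset))["codes"])[None, None, :]
print(f"FSQ codes shape: {fsq_codes.shape}")
with torch.no_grad():  # inference only, so gradients are not needed
    recon = model.decode_code(fsq_codes)
print(f"Recon shape: {recon.shape}")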
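
To listen to a reconstruction, the waveform can be written to a WAV file at the decoder's 24 kHz output rate. The helper below is a minimal sketch using only the Python standard library. `write_wav` is a hypothetical name, not part of NeuCodec, and it assumes the samples are mono floats in the range [-1.0, 1.0].

```python
import struct
import wave


def write_wav(path, samples, sample_rate=24_000):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        # Clip to the valid range, then scale to signed 16-bit integers.
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))


# Possible usage with the snippet above, assuming recon has shape (1, 1, T):
# write_wav("recon.wav", recon[0, 0].tolist())
```

For long clips a vectorised writer from an audio library would be faster; the loop here keeps the example dependency-free.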

Dataset Structure

Data Instances

For each instance, the corresponding metadata from the Emilia-YODAS dataset is preserved. The columns are id, dnsmos, duration, phone_count, speaker, text, and codes.

{'id': 'EN_9eylmAUb-SQ_W000139',
 'dnsmos': 3.0064,
 'duration': 5.67,
 'phone_count': 102,
 'speaker': 'EN_9eylmAUb-SQ_SPEAKER_00',
 'text': 'In the Soviet Union during the 1920s, Yiddish...',
 'codes': [3254, 49895, 26484, 869, 23077, 27555, 20391,...]}

Each parquet file is approximately 200 MB, and there are 241 parquet files in total. Each code sequence is meant to be decoded with our NeuCodec decoder, which currently supports an output sampling rate of 24 kHz.
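
Because every row carries both a duration and a code sequence, the relationship between the two can be checked directly. The sketch below uses a hypothetical helper, `describe_codes`, to estimate the number of codes per second for a row and the number of samples a 24 kHz waveform of that duration would contain. The actual decoder output length may differ slightly from this estimate; that is not verified here.

```python
def describe_codes(row, sample_rate=24_000):
    """Summarise how a row's code sequence relates to its audio duration."""
    n_codes = len(row["codes"])
    duration = row["duration"]  # seconds, as reported by Emilia-YODAS
    return {
        "n_codes": n_codes,
        # Empirical code rate for this row; NaN if the duration is not positive.
        "codes_per_second": n_codes / duration if duration > 0 else float("nan"),
        # Number of samples in a waveform of this duration at the given rate.
        "expected_samples": round(duration * sample_rate),
    }


# Possible usage with a streamed row:
# print(describe_codes(next(iter(dataset))))
```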

Data Fields

id: a string containing the corresponding id from Emilia-YODAS
dnsmos: a float containing the DNSMOS score from Emilia-YODAS
duration: a float containing the sample duration from Emilia-YODAS
phone_count: an integer giving the number of phones in the sample, from Emilia-YODAS
speaker: a string containing the corresponding speaker from Emilia-YODAS
text: a string containing the utterance text
codes: a list containing the compressed audio segment as NeuCodec codes
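
The metadata fields make it straightforward to select a subset of the data, for example by quality score and length. The generator below is a minimal sketch; `filter_rows` and its default thresholds are illustrative assumptions, not recommended values.

```python
def filter_rows(rows, min_dnsmos=3.0, max_duration=15.0):
    """Yield rows with dnsmos >= min_dnsmos and duration <= max_duration (seconds)."""
    for row in rows:
        if row["dnsmos"] >= min_dnsmos and row["duration"] <= max_duration:
            yield row


# Works on any iterable of dict-like rows, including a streamed dataset:
# for row in filter_rows(dataset, min_dnsmos=3.2):
#     ...
```

Since this is a generator, rows are filtered lazily, which suits streaming mode. The `datasets` library's own `filter` method is an alternative.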
