language:
- en
license:
- cc-by-4.0
multilinguality:
- monolingual
size_categories:
- 10M<n<100M
source_datasets:
- amphion/Emilia-YODAS
pretty_name: NeuCodec Emilia-YODAS
tags:
- speech
- codec
- audio
- FSQ
- audio-compression
dataset_info:
  features:
  - name: _id
    dtype: string
  - name: dnsmos
    dtype: float64
  - name: duration
    dtype: float64
  - name: language
    dtype: string
  - name: phone_count
    dtype: int64
  - name: speaker
    dtype: string
  - name: text
    dtype: string
  - name: codes
    sequence: int64
  splits:
  - name: train
    num_bytes: 120073409108
    num_examples: 30592309
  download_size: 41111605599
  dataset_size: 120073409108
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
Dataset Card for NeuCodec Emilia-YODAS
Dataset Description
- Repository: NeuCodec repository
- Paper: Coming Soon
Dataset Summary
The NeuCodec Emilia-YODAS dataset is an English-language dataset containing >30M audio samples (>78k hours), taken from the English-language subset of Emilia-YODAS and compressed with NeuCodec.
Usage
import torch
from datasets import load_dataset
from neucodec import NeuCodec
# load dataset and model
dataset = load_dataset("neuphonic/emilia-yodas-english-neucodec", split="train", streaming=True)
model = NeuCodec.from_pretrained("neuphonic/neucodec")
model.eval()
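# reconstruct the first sample; [None, None, :] adds two leading singleton dimensions
sample = next(iter(dataset))
fsq_codes = torch.tensor(sample["codes"])[None, None, :]
print(f"FSQ codes shape: {fsq_codes.shape}")
with torch.no_grad():  # inference only, so no gradients are needed
    recon = model.decode_code(fsq_codes)
print(f"Recon shape: {recon.shape}")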
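To listen to a reconstruction, the decoded waveform has to be written to an audio file at the decoder's 24 kHz output rate. The following is a minimal sketch using only the Python standard library; the helper name `write_wav_24k` is hypothetical, and it assumes the decoded waveform has already been flattened to a sequence of floats roughly in [-1, 1] (for example with `recon.squeeze().tolist()`), which this card does not state explicitly. A synthetic tone stands in for the decoded audio so that the snippet runs without downloading the model.

```python
import math
import os
import struct
import tempfile
import wave


def write_wav_24k(samples, path, sample_rate=24_000):
    """Write mono float samples as 16-bit PCM and return the duration in seconds.

    Assumption: samples are floats in roughly [-1, 1]; values outside are clipped.
    """
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))  # clip to the valid range
        frames += struct.pack("<h", int(round(s * 32767)))  # little-endian int16
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 2 bytes per sample = 16-bit PCM
        wf.setframerate(sample_rate)
        wf.writeframes(bytes(frames))
    return len(samples) / sample_rate


# Stand-in for decoded audio: 0.1 s of a 440 Hz tone at 24 kHz.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 24_000) for n in range(2_400)]
demo_path = os.path.join(tempfile.gettempdir(), "neucodec_demo_tone.wav")
seconds = write_wav_24k(tone, demo_path)
print(f"Wrote {seconds:.2f} s of audio to {demo_path}")
```

With a real reconstruction, the same helper could be called as `write_wav_24k(recon.squeeze().tolist(), "recon.wav")`, under the range assumption above.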
Dataset Structure
Data Instances
Each instance preserves the corresponding metadata from the Emilia-YODAS dataset: the columns id, dnsmos, duration, language, phone_count, speaker, and text, plus the NeuCodec codes.
{'id': 'EN_9eylmAUb-SQ_W000139',
'dnsmos': 3.0064,
'duration': 5.67,
'phone_count': 102,
'speaker': 'EN_9eylmAUb-SQ_SPEAKER_00',
'text': 'In the Soviet Union during the 1920s, Yiddish...',
'codes': [3254, 49895, 26484, 869, 23077, 27555, 20391,...]}
The dataset is stored as 241 parquet files of approximately 200 MB each. Each code sequence is meant to be decoded with our NeuCodec decoder, which currently supports an output sampling rate of 24 kHz.
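As a rough sanity check, the sizes quoted in this card can be related to each other with a few lines of arithmetic. This is only a sketch: it assumes that `download_size` in the metadata is the total size of the parquet files, and it uses 78,000 hours as a lower bound because the summary states ">78k hours".

```python
# Figures copied from this card's metadata and summary.
DOWNLOAD_SIZE_BYTES = 41_111_605_599
NUM_PARQUET_FILES = 241
NUM_EXAMPLES = 30_592_309
MIN_TOTAL_HOURS = 78_000  # lower bound: the card states ">78k hours"

# Average parquet file size in decimal megabytes (assumes download_size covers all files).
mb_per_file = DOWNLOAD_SIZE_BYTES / NUM_PARQUET_FILES / 1e6

# Average compressed storage per example, metadata and codes together.
bytes_per_example = DOWNLOAD_SIZE_BYTES / NUM_EXAMPLES

# Lower bound on the mean clip duration in seconds.
mean_seconds_per_example = MIN_TOTAL_HOURS * 3600 / NUM_EXAMPLES

print(f"Average parquet file size: {mb_per_file:.1f} MB")
print(f"Average bytes per example: {bytes_per_example:.0f}")
print(f"Mean duration per example: at least {mean_seconds_per_example:.2f} s")
```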
Data Fields
- id: a string containing the corresponding id from Emilia-YODAS
- dnsmos: a float containing the DNSMOS score from Emilia-YODAS
- duration: a float containing the sample duration from Emilia-YODAS
- phone_count: an integer of the number of phones in a sample from Emilia-YODAS
- speaker: a string containing the corresponding speaker from Emilia-YODAS
- text: a string containing the utterance text
- codes: a list containing the compressed audio segment as NeuCodec codes
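The metadata fields make it possible to select a subset before decoding any audio. The sketch below filters on dnsmos and duration and derives a speaking-rate estimate from phone_count; the function names and the thresholds (3.0 and 10.0) are hypothetical choices for illustration, not recommendations from the dataset authors. The record reuses the values from the example instance above, limited to the fields the functions need.

```python
def keep_sample(sample, min_dnsmos=3.0, max_duration=10.0):
    """Return True if the sample meets the (hypothetical) quality and length thresholds."""
    return sample["dnsmos"] >= min_dnsmos and sample["duration"] <= max_duration


def phones_per_second(sample):
    """Rough speaking-rate estimate: number of phones divided by duration in seconds."""
    return sample["phone_count"] / sample["duration"]


# Values taken from the example instance in this card.
example = {
    "dnsmos": 3.0064,
    "duration": 5.67,
    "phone_count": 102,
    "speaker": "EN_9eylmAUb-SQ_SPEAKER_00",
}

print(keep_sample(example))
print(f"{phones_per_second(example):.1f} phones per second")
```

With the streaming dataset from the Usage section, the same predicate can be passed to `dataset.filter(keep_sample)`, which applies it lazily as samples are iterated.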
Dataset Creation
Source Data
The data was sourced from the English-language subset of Emilia-YODAS (amphion/Emilia-YODAS).
Additional Information
Licensing Information
The NeuCodec Emilia-YODAS dataset is released under the CC-BY-4.0 license.
Citation Information
Coming soon