---
language:
- en
license:
- cc-by-4.0
multilinguality:
- monolingual
size_categories:
- 10M<n<100M
source_datasets:
- amphion/Emilia-YODAS
pretty_name: NeuCodec Emilia-YODAS
tags:
- speech
- codec
- audio
- FSQ
- audio-compression
dataset_info:
  features:
  - name: _id
    dtype: string
  - name: dnsmos
    dtype: float64
  - name: duration
    dtype: float64
  - name: language
    dtype: string
  - name: phone_count
    dtype: int64
  - name: speaker
    dtype: string
  - name: text
    dtype: string
  - name: codes
    sequence: int64
  splits:
  - name: train
    num_bytes: 120073409108
    num_examples: 30592309
  download_size: 41111605599
  dataset_size: 120073409108
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

Dataset Card for NeuCodec Emilia-YODAS

Dataset Description

Dataset Summary

The NeuCodec Emilia-YODAS dataset is an English-language dataset containing >30M audio samples (>78k hours), taken from the English-language subset of Emilia-YODAS and compressed with NeuCodec.

Usage

import torch
from datasets import load_dataset
from neucodec import NeuCodec

# load dataset and model
dataset = load_dataset("neuphonic/emilia-yodas-english-neucodec", split="train", streaming=True)
model = NeuCodec.from_pretrained("neuphonic/neucodec")
model.eval()

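# reconstruct a sample
sample = next(iter(dataset))
# add two leading singleton dimensions: (num_codes,) -> (1, 1, num_codes)
fsq_codes = torch.tensor(sample["codes"])[None, None, :]
print(f"FSQ codes shape: {fsq_codes.shape}")
with torch.no_grad():  # inference only, so gradients are not needed
    recon = model.decode_code(fsq_codes)
print(f"Recon shape: {recon.shape}")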
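To listen to a reconstruction, the decoded waveform can be written to disk. The following is a minimal sketch using only the Python standard library. `write_wav` is a hypothetical helper, not part of NeuCodec, and it assumes the samples are mono floats in the range [-1.0, 1.0] at the 24 kHz output rate stated in this card.

```python
import struct
import wave

SAMPLE_RATE = 24_000  # NeuCodec decoder output rate stated in this card


def write_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono float samples as a 16-bit PCM WAV file (hypothetical helper).

    Assumes samples are floats in [-1.0, 1.0]; values outside are clipped.
    Returns the number of samples written.
    """
    # Clip to the valid range, then scale to signed 16-bit integers.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)            # mono
        wav_file.setsampwidth(2)            # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        # WAV stores PCM data as little-endian signed shorts.
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return len(pcm)
```

Assuming `recon` has shape (1, 1, num_samples), which the shape printed by the snippet above should confirm, a call such as `write_wav("sample.wav", recon[0, 0].detach().cpu().tolist())` would save the first reconstruction.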

Dataset Structure

Data Instances

For each instance, the corresponding metadata from the Emilia-YODAS dataset is preserved. Each row has columns for id, dnsmos, duration, phone_count, speaker, and text, plus the NeuCodec codes. The dataset metadata additionally lists a language column.

{'id': 'EN_9eylmAUb-SQ_W000139',
 'dnsmos': 3.0064,
 'duration': 5.67,
 'phone_count': 102,
 'speaker': 'EN_9eylmAUb-SQ_SPEAKER_00',
 'text': 'In the Soviet Union during the 1920s, Yiddish...',
 'codes': [3254, 49895, 26484, 869, 23077, 27555, 20391,...]}

The dataset is stored as 241 Parquet files of approximately 200 MB each. Each code sequence is meant to be decoded with our NeuCodec decoder, which currently supports an output sampling rate of 24 kHz.
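Because every row stores both `duration` and `codes`, the relationship between code count and audio length can be checked directly from the data. The sketch below is illustrative only: `codes_per_second` and `expected_num_samples` are hypothetical helpers, and the synthetic row uses made-up values rather than real dataset content.

```python
OUTPUT_SAMPLE_RATE = 24_000  # decoder output rate stated in this card


def codes_per_second(instance):
    """Estimate how many codes represent one second of audio in this row."""
    return len(instance["codes"]) / instance["duration"]


def expected_num_samples(instance, sample_rate=OUTPUT_SAMPLE_RATE):
    """Approximate decoded waveform length implied by the stored duration."""
    return round(instance["duration"] * sample_rate)


# Synthetic row for illustration only; these values are made up.
row = {"duration": 2.0, "codes": [0] * 100}
rate = codes_per_second(row)
num_samples = expected_num_samples(row)
print(f"{rate} codes/s, about {num_samples} output samples")
```

Running `codes_per_second` on a few real rows gives an empirical estimate of the codec frame rate, and `expected_num_samples` gives a rough sanity check against the length of the decoded waveform.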

Data Fields

  • id: a string containing the corresponding id from Emilia-YODAS
  • dnsmos: a float containing the DNSMOS score from Emilia-YODAS
  • duration: a float containing the sample duration from Emilia-YODAS
  • phone_count: an integer giving the number of phones in the sample, from Emilia-YODAS
  • speaker: a string containing the corresponding speaker from Emilia-YODAS
  • text: a string containing the utterance text
  • codes: a list containing the compressed audio segment as NeuCodec codes
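The metadata fields above can be used to select a subset of the data before decoding. The following is a minimal sketch of such a filter. `keep_instance` is a hypothetical helper and its thresholds are arbitrary illustrative values, not recommendations from the dataset authors.

```python
def keep_instance(instance, min_dnsmos=3.0, max_duration=10.0):
    """Return True for rows that pass illustrative quality and length thresholds."""
    return instance["dnsmos"] >= min_dnsmos and instance["duration"] <= max_duration


# Metadata of the example instance shown in this card (text and codes omitted).
example = {
    "id": "EN_9eylmAUb-SQ_W000139",
    "dnsmos": 3.0064,
    "duration": 5.67,
    "phone_count": 102,
    "speaker": "EN_9eylmAUb-SQ_SPEAKER_00",
}
print(keep_instance(example))                  # True
print(keep_instance(example, min_dnsmos=3.5))  # False
```

With the streaming dataset from the Usage section, `dataset.filter(keep_instance)` should apply the same predicate lazily, so rows are only checked as they are iterated.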

Dataset Creation

Source Data

The data was sourced from the English-language subset of Emilia-YODAS.

Additional Information

Licensing Information

The NeuCodec Emilia-YODAS dataset is released under the CC-BY-4.0 license.

Citation Information

Coming soon