Distilled Speech Encoder

A Data2Vec-style bidirectional speech encoder trained via distillation from AuriStream models.

Model Details

  • Architecture: 12-layer transformer with RoPE positional encoding
  • Hidden size: 768
  • Attention heads: 12
  • Parameters: ~85M
  • Teacher model: TuKoResearch/AuriStream100M_40Pred_BigAudioDataset_500k
  • Training step: 100,000
  • Input: 16kHz raw audio waveform
  • Output: 50Hz contextualized representations (768-dim)
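
The 16kHz input and 50Hz output imply a hop of 320 samples per output frame. The sketch below estimates the number of output frames for a given audio length; the exact count may differ by a frame or so depending on the convolutional encoder's padding, so treat it as an approximation.

# Estimate the number of 50Hz output frames for a given number of input samples.
# The 320-sample hop follows from the 16kHz / 50Hz spec above; exact counts
# depend on the conv encoder's padding, so this is only an approximation.
def estimated_num_frames(num_samples, sample_rate=16000, frame_rate=50):
    hop = sample_rate // frame_rate  # 320 samples per output frame
    return num_samples // hop

print(estimated_num_frames(16000))   # ~50 frames for 1 second of audio
print(estimated_num_frames(160000))  # ~500 frames for 10 seconds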

Usage

from transformers import AutoModel, Wav2Vec2FeatureExtractor
import torch

# Load model and feature extractor
model = AutoModel.from_pretrained("TuKoResearch/AuriStreamDistill_100M40PredTeacher_librispeech960", trust_remote_code=True)
model.eval()  # set to eval mode for inference (disables dropout)
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("TuKoResearch/AuriStreamDistill_100M40PredTeacher_librispeech960")

# Prepare audio (16kHz, mono)
audio = torch.randn(16000).numpy()  # 1 second of dummy audio (random noise) at 16kHz

# Extract features
inputs = feature_extractor(audio, return_tensors="pt", sampling_rate=16000)
with torch.no_grad():
    outputs = model(inputs.input_values, output_hidden_states=True)

# Get representations
last_hidden = outputs.last_hidden_state  # (1, 50, 768) for 1 second
all_hidden = outputs.hidden_states  # Tuple of 13 tensors
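
For utterance-level tasks, the 50Hz frames can be pooled into a single vector. The sketch below reuses the outputs from the snippet above; mean pooling is just one reasonable choice, not a prescribed recipe.

# Mean-pool the frame-level representations into one utterance-level embedding
utterance_embedding = last_hidden.mean(dim=1)  # (1, 768)

# The same pooling works on any intermediate layer, e.g. layer 6
layer6_embedding = all_hidden[6].mean(dim=1)  # (1, 768)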

Hidden States

When output_hidden_states=True, the model returns hidden states from all layers:

  • hidden_states[0]: Feature projection output (after conv encoder + projection)
  • hidden_states[1] to hidden_states[12]: Transformer layer outputs
  • hidden_states[12]: Final layer output (same as last_hidden_state)

This makes the model suitable for linear probing experiments at different layers.
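
Below is a minimal probing sketch along these lines. It assumes scikit-learn is installed and that train_set is a hypothetical list of (16kHz waveform, label) pairs; the layer choice and mean pooling are illustrative, not part of the model's API.

import numpy as np
from sklearn.linear_model import LogisticRegression

probe_layer = 6  # which layer to probe (0 = feature projection, 1-12 = transformer layers)

features, labels = [], []
for waveform, label in train_set:  # train_set: your own (waveform, label) pairs
    inputs = feature_extractor(waveform, return_tensors="pt", sampling_rate=16000)
    with torch.no_grad():
        out = model(inputs.input_values, output_hidden_states=True)
    # Mean-pool the chosen layer's 50Hz frames into one vector per utterance
    features.append(out.hidden_states[probe_layer].mean(dim=1).squeeze(0).numpy())
    labels.append(label)

probe = LogisticRegression(max_iter=1000).fit(np.stack(features), labels)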

Training

This model was trained using Data2Vec-style distillation:

  1. A frozen AuriStream teacher model generates target representations
  2. The student sees masked audio and learns to predict teacher representations
  3. Loss is computed only on masked positions
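
As a schematic of step 3, the loss below is restricted to masked positions. The masking strategy, the construction of teacher targets (e.g. averaging several teacher layers), and the choice of MSE are assumptions for illustration; the exact training recipe is not specified here.

import torch
import torch.nn.functional as F

def distillation_loss(student_hidden, teacher_hidden, mask):
    # student_hidden: (B, T, D) student outputs for the masked audio
    # teacher_hidden: (B, T, D) frozen-teacher targets for the unmasked audio
    # mask:           (B, T) bool tensor, True at masked positions
    # Plain MSE on masked positions only; Data2Vec-style recipes may instead
    # use a smooth-L1 loss and average several top teacher layers.
    return F.mse_loss(student_hidden[mask], teacher_hidden[mask])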

Citation

If you use this model, please cite:

@misc{distilled_speech_encoder,
  title={Distilled Speech Encoder},
  author={TuKo Research},
  year={2025},
  url={https://huggingface.co/TuKoResearch/AuriStreamDistill_100M40PredTeacher_librispeech960}
}