Stable Audio Open Small - Core AI (on-device music generation)
The model zoo's first music/audio generation model for Apple Core AI. Type a prompt, get ~11 s
of 44.1 kHz stereo audio, generated entirely on-device on Apple Silicon. A community port of
stabilityai/stable-audio-open-small (Stability AI + Arm) to Core AI.
A latent diffusion text-to-audio model: a T5 text encoder conditions a DiT (diffusion transformer) that denoises a latent over 8 rectified-flow steps, then an Oobleck VAE decodes the latent to a waveform. Distilled (ARC) for few-step generation, so it's fast.
What's in the bundle (macos/)
Three Core AI .aimodel bundles + a tiny host sampler loop:
| bundle | role | I/O |
|---|---|---|
| sa_cond_fp16b | T5-base encoder + number conditioner | input_ids[1,64], attention_mask[1,64], seconds_norm[1] → cross_attn_cond[1,65,768], global_embed[1,768], cond_mask[1,65] |
| sa_dit_fp16 | diffusion transformer (run 8×) | x[1,64,256], t[1], cross_attn_cond, global_embed, cross_attn_cond_mask → v[1,64,256] |
| sa_vae_fp16 | Oobleck VAE decoder | latent[1,64,256] → audio[1,2,524288] |
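The shapes in the table imply a few fixed ratios, which the short sketch below works out. It assumes the last latent axis (256) is time and the first (64) is channels, and that the extra 65th conditioning position is a single token from the number conditioner; neither is stated in the table, so treat both as a reading of the shapes, not a specification.

```python
# Arithmetic on the tensor shapes listed in the bundle table.
SAMPLE_RATE = 44_100        # Hz, from the README
LATENT_FRAMES = 256         # assumed: last axis of latent[1,64,256] is time
AUDIO_SAMPLES = 524_288     # last axis of audio[1,2,524288]
T5_TOKENS = 64              # input_ids[1,64]
COND_POSITIONS = 65         # cross_attn_cond[1,65,768]

# Audio samples produced per latent frame by the VAE decoder.
samples_per_latent_frame = AUDIO_SAMPLES // LATENT_FRAMES

# Length of the decoded buffer in seconds.
buffer_seconds = AUDIO_SAMPLES / SAMPLE_RATE

# Conditioning positions beyond the T5 tokens (assumed: the seconds token).
extra_cond_positions = COND_POSITIONS - T5_TOKENS

print(samples_per_latent_frame, round(buffer_seconds, 2), extra_cond_positions)
```

The decoded buffer comes out slightly longer than the ~11 s quoted above; how the runner trims or pads it is not described in this README.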
Host loop (StableAudioRunner): tokenize (T5, t5_tokenizer/) → conditioner → start from Gaussian
noise → 8-step rectified-flow Euler x = x + (t_next - t)·v over the fixed schedule
[1.0, .9944, .9845, .9579, .8909, .7455, .5125, .2739] → 0 → VAE decode → 44.1 kHz stereo wav.
No KV cache, no CFG (cfg_scale 1.0; the model is ARC-distilled).
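The Euler update in the host loop can be sketched in a few lines of plain Python. This is an illustration only, not the StableAudioRunner code: `euler_sample`, `toy_velocity`, and `TARGET` are hypothetical names, and the toy velocity field stands in for the DiT so the loop can run without the model.

```python
import random

# Fixed timestep schedule quoted above, with the final target t = 0 appended.
SCHEDULE = [1.0, 0.9944, 0.9845, 0.9579, 0.8909, 0.7455, 0.5125, 0.2739, 0.0]


def euler_sample(velocity_fn, x, schedule=SCHEDULE):
    """Integrate from t = 1 (noise) to t = 0 using x = x + (t_next - t) * v."""
    for t, t_next in zip(schedule[:-1], schedule[1:]):
        v = velocity_fn(x, t)   # one DiT call per step in the real pipeline
        dt = t_next - t         # negative: time runs from 1 down to 0
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


TARGET = 0.5  # hypothetical "clean" value the toy field flows toward


def toy_velocity(x, t):
    """Stand-in for the DiT (assumption): the exact rectified-flow velocity
    when every data point equals TARGET, i.e. v = (x_t - data) / t."""
    return [(xi - TARGET) / t for xi in x]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(64 * 256)]  # flattened [1,64,256] latent
latent = euler_sample(toy_velocity, noise)

num_steps = len(SCHEDULE) - 1
max_error = max(abs(value - TARGET) for value in latent)
print(num_steps, max_error)
```

Because the toy field is a straight-line flow, the eight Euler steps land on `TARGET` up to floating-point error. With the real DiT the velocity is only approximately straight, which is why the schedule is fixed and front-loaded near t = 1.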
Performance (M4 Max, GPU)
| metric | value |
|---|---|
| 8-step DiT | ~200 ms (25 ms/step) |
| VAE decode | ~185 ms |
| total | |
| size | fp16, ~1.0 GB (DiT 651M + cond 210M + VAE 149M) |
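As a quick consistency check on the table, the sketch below redoes the arithmetic behind the per-step and size figures. It assumes "M" in the size row means megabytes. The DiT + VAE sum is not the end-to-end total, since tokenization and the conditioner pass are not listed.

```python
# Figures copied from the performance table.
dit_total_ms = 200
dit_steps = 8
vae_ms = 185
sizes_mb = {"DiT": 651, "cond": 210, "VAE": 149}  # assumed unit: MB

per_step_ms = dit_total_ms / dit_steps     # matches the 25 ms/step in the table
dit_plus_vae_ms = dit_total_ms + vae_ms    # excludes tokenizer and conditioner
total_size_mb = sum(sizes_mb.values())     # roughly 1.0 GB

print(per_step_ms, dit_plus_vae_ms, total_size_mb)
```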
Numerics: each bundle engine-gated vs the reference at cos ≥ 0.9999; full pipeline reproduces the reference audio exactly.
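A cosine-similarity gate of this kind might look like the following minimal sketch. The function names and the synthetic test vectors are hypothetical; the actual gating harness is not shown in this README.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length flattened tensors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def passes_gate(candidate, reference, threshold=0.9999):
    """True when the candidate output is within the cosine threshold."""
    return cosine_similarity(candidate, reference) >= threshold


# Synthetic stand-ins for a reference output and a converted-bundle output.
reference = [math.sin(0.01 * i) for i in range(1000)]
close_copy = [r + 1e-4 * math.cos(i) for i, r in enumerate(reference)]
flipped = [-r for r in reference]

print(passes_gate(close_copy, reference), passes_gate(flipped, reference))
```

Cosine similarity ignores overall scale, so a gate like this is usually paired with a magnitude check when absolute levels matter.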
Roadmap
- iPhone (h18p) build: bundles AOT-compile; device RTF pending
- int8 (further size cut)
- a music-generation tab in the zoo app
Credits & license
A community Core AI conversion; all credit to Stability AI (and Arm) for Stable Audio Open Small; T5 text encoder by Google. This bundle is governed by the Stability AI Community License (free for non-commercial use and for commercial use under $1M annual revenue; review the license before use). No retraining, conversion only.
Part of the Core AI model zoo.