Stable Audio Open Small — Core AI (on-device music generation)

The model zoo's first music/audio generation model for Apple Core AI. Type a prompt and get ~11.9 s of 44.1 kHz stereo audio, generated entirely on-device on Apple Silicon. This is a community port of stabilityai/stable-audio-open-small (Stability AI + Arm) to Core AI.

A latent diffusion text-to-audio model: a T5 text encoder conditions a DiT (diffusion transformer) that denoises a latent over 8 rectified-flow steps, then an Oobleck VAE decodes the latent to a waveform. The model is ARC-distilled for few-step generation, which is what keeps it fast.

What's in the bundle (`macos/`)

Three Core AI .aimodel bundles + a tiny host sampler loop:

| Bundle | Role | I/O |
|---|---|---|
| `sa_cond_fp16b` | T5-base encoder + number conditioner | `input_ids[1,64], attention_mask[1,64], seconds_norm[1] → cross_attn_cond[1,65,768], global_embed[1,768], cond_mask[1,65]` |
| `sa_dit_fp16` | diffusion transformer (run 8×) | `x[1,64,256], t[1], cross_attn_cond, global_embed, cross_attn_cond_mask → v[1,64,256]` |
| `sa_vae_fp16` | Oobleck VAE decoder | `latent[1,64,256] → audio[1,2,524288]` |
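The decoder shapes above fix the clip length. As a quick arithmetic check (a sketch that uses only the shapes in the table and the 44.1 kHz sample rate; the constant names are made up for illustration):

```python
# Shapes taken from the bundle table; names are illustrative only.
SAMPLE_RATE_HZ = 44_100
LATENT_FRAMES = 256       # last axis of latent[1,64,256]
AUDIO_SAMPLES = 524_288   # last axis of audio[1,2,524288]

# Audio samples produced per latent frame by the VAE decoder.
samples_per_frame = AUDIO_SAMPLES // LATENT_FRAMES

# Clip length in seconds at the output sample rate.
duration_s = AUDIO_SAMPLES / SAMPLE_RATE_HZ

print(samples_per_frame)     # 2048
print(round(duration_s, 2))  # 11.89
```

So each of the 256 latent frames expands to 2048 samples per channel, which gives the ~11.9 s clip length.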

Host loop (`StableAudioRunner`):

1. Tokenize the prompt with the T5 tokenizer (`t5_tokenizer/`).
2. Run the conditioner.
3. Start from Gaussian noise.
4. Take 8 rectified-flow Euler steps, `x = x + (t_next − t)·v`, over the fixed schedule `[1.0, 0.9944, 0.9845, 0.9579, 0.8909, 0.7455, 0.5125, 0.2739]`, with the last step ending at `t = 0`.
5. Decode with the VAE and write a 44.1 kHz stereo WAV.

No KV cache and no CFG (`cfg_scale` 1.0, since the model is ARC-distilled).
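The Euler update in the host loop can be sketched in a few lines of plain Python. This is an illustration of the update rule and schedule quoted above, not the actual `StableAudioRunner` code: `euler_sample` and `velocity_fn` are hypothetical names, and `velocity_fn` stands in for one call to the DiT bundle. The toy check at the end assumes a straight-line path between noise and data, which is an assumption for the demo only.

```python
# Fixed 8-step schedule from the host-loop description; the last step ends at t = 0.
SCHEDULE = [1.0, 0.9944, 0.9845, 0.9579, 0.8909, 0.7455, 0.5125, 0.2739]


def euler_sample(x, velocity_fn, schedule=SCHEDULE):
    """Integrate x from t = schedule[0] down to t = 0 with explicit Euler steps.

    x can be any value supporting + and scalar * (a float here; in practice
    it would be an array shaped like the latent, [1, 64, 256]).
    velocity_fn(x, t) is a hypothetical stand-in for one DiT evaluation.
    """
    times = list(schedule) + [0.0]
    for t, t_next in zip(times[:-1], times[1:]):
        v = velocity_fn(x, t)
        x = x + (t_next - t) * v  # t_next < t, so each step moves toward t = 0
    return x


# Toy check (assumption, not the real model): on the straight path
# x_t = (1 - t) * data + t * noise, the velocity is the constant noise - data,
# so Euler integration from t = 1 to t = 0 should land on `data`.
noise, data = 0.8, -0.3
result = euler_sample(noise, lambda x, t: noise - data)
print(result)  # close to -0.3, up to floating-point rounding
```

Because the step sizes sum to exactly −1 over the schedule, a constant velocity field is integrated without discretization error; the real DiT velocity varies with `x` and `t`, which is where the choice of schedule matters.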

Performance (M4 Max, GPU)

| Metric | Value |
|---|---|
| 8-step DiT | ~200 ms (25 ms/step) |
| VAE decode | ~185 ms |
| Total | ~0.4 s for ~11.9 s of audio (~30× real-time) |
| Size | fp16, ~1.0 GB (DiT 651 MB + cond 210 MB + VAE 149 MB) |
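The headline figures follow from the per-stage numbers. A small sketch of the arithmetic, assuming the total counts only the DiT and VAE stages (the conditioner's latency is not listed) and that the per-bundle sizes are in megabytes; variable names are illustrative:

```python
# Per-stage figures from the performance table (M4 Max, GPU).
dit_ms_per_step = 25
steps = 8
vae_ms = 185
audio_s = 524_288 / 44_100  # clip length from the VAE output shape

dit_ms = dit_ms_per_step * steps        # 200 ms for the 8 DiT steps
total_s = (dit_ms + vae_ms) / 1000      # 0.385 s, rounded to ~0.4 s in the table
speedup = audio_s / total_s             # seconds of audio per second of compute

size_mb = 651 + 210 + 149               # DiT + conditioner + VAE

print(dit_ms, total_s, round(speedup, 1), size_mb)
```

With the unrounded 0.385 s the ratio comes out slightly above 30; with the rounded 0.4 s it is slightly below, so "~30× real-time" holds either way.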

Numerics: each bundle is engine-gated against the reference at cosine similarity ≥ 0.9999; the full pipeline reproduces the reference audio exactly.
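A cosine-similarity gate of this kind can be sketched as follows. This is a generic illustration, not the zoo's actual test harness: `cosine_similarity` and `passes_gate` are hypothetical helpers, and the inputs are assumed to be flattened output tensors of equal length.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length sequences of floats."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def passes_gate(candidate, reference, threshold=0.9999):
    """True if the candidate output matches the reference at cos >= threshold."""
    return cosine_similarity(candidate, reference) >= threshold


# Made-up values: a reference output and a copy with small fp16-like error.
reference = [0.10, -0.20, 0.30, 0.40]
candidate = [0.1001, -0.1999, 0.3001, 0.3999]

print(passes_gate(candidate, reference))        # small perturbation
print(passes_gate([1.0, 0.0], [0.0, 1.0]))      # orthogonal vectors
```

Cosine similarity ignores overall scale, so a gate like this is usually paired with an end-to-end comparison such as the full-pipeline check mentioned above.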

Roadmap

- iPhone (h18p) build: bundles AOT-compile; device RTF pending
- int8 (further size cut)
- a music-generation tab in the zoo app

Credits & license

A community Core AI conversion — all credit to Stability AI (and Arm) for Stable Audio Open Small; T5 text encoder by Google. This bundle is governed by the Stability AI Community License (free for non-commercial use and for commercial use under $1M annual revenue; review the license before use). No retraining — conversion only.

Part of the Core AI model zoo.
