How to contribute data

#29
by BAROTDHRUMIL21 - opened

Hey, I have some good and useful Fable convos. Roughly 15M tokens, I think.

How can I share it with you for you to use it in training?

Hey, that's really generous, thank you. Authentic Fable 5 conversations are exactly what I'm short on right now (Fable got sunset, so I can't generate fresh ones myself anymore), so ~15M tokens could genuinely help the next version.
Two easy ways to get it to me:

  1. A HuggingFace dataset repo: cleanest for this size. Create one under your account; private is fine, just add yuxinlu1 as a collaborator, or make it public and drop the link here. It handles large files well and I can pull it directly.
  2. If you'd rather not make it public: DM me on X (@Dadahelper1 (https://x.com/Dadahelper1)) and send a private link there. A single JSONL or a zip is totally fine.

Couple of things that would help a lot:

  • Keep it as close to the original export as you can. I verify Fable provenance from the model field in the raw session JSON. I've had "Fable" data turn out to be other models or synthetic before, so raw exports with their metadata intact are ideal (rather than text that's already been cleaned or reformatted).
  • A rough idea of what's in there (coding, agentic/tool use, general reasoning, languages…) so I know where it fits.
  • A quick OK that it's fine to use for training, with the resulting model staying Apache-2.0. Everything I release is open.
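
For anyone who wants to sanity-check an export before sending it, here is a minimal Python sketch of a per-record model tally. It is not the pipeline itself. It assumes each line of the export is a JSON object whose model id sits in a `model` key, either at the top level or nested under `message`; that layout and the function name `count_models` are assumptions for illustration, so adjust them to match the actual files.

```python
import json
from collections import Counter
from typing import Iterable


def count_models(jsonl_lines: Iterable[str]) -> Counter:
    """Tally the `model` value found in each record of a JSONL export."""
    counts: Counter = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            counts["<unparseable>"] += 1
            continue
        if not isinstance(record, dict):
            continue
        # Assumption: the model id is at the top level or under "message".
        model = record.get("model")
        message = record.get("message")
        if model is None and isinstance(message, dict):
            model = message.get("model")
        if model is not None:
            counts[str(model)] += 1
    return counts
```

Calling it as `count_models(open(path, encoding="utf-8"))` on one file gives a quick view of which models appear in it and how often; records without a model field (user turns, for example) are simply not counted.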

On privacy: your conversations run straight through my existing cleaning/dedup pipeline, which is fully automated through Claude Code. I'm not going to be sitting there reading your chats. Provenance checks and filtering all happen programmatically.

No pressure on format: whatever's least work for you, I'll handle the rest. Really appreciate you offering it up.

Just fyi @yuxinlu1 - your DMs are not open on X

@yuxinlu1 sure. Also, my convos are scattered across Claude web and CC instances on multiple systems. And I'm afraid there were instances where the model got switched back to Opus 4.8 in the middle of a conversation.

Is there a way to do a clean, unified export? And also to filter out things like secrets, keys, personal info, etc.?

@BAROTDHRUMIL21 First, sorry for the slow reply. I just started collaborating with an AI lab on an open-source model, so I've been swamped and I've fallen behind on the community; I'm getting to everyone as fast as I can find the time. And thank you again: authentic Fable 5 data is exactly what I'm short on for the next version, so this is genuinely valuable.

(And thanks @suhaild, you're right, I'll get my X DMs opened.)

On a clean, unified export, here's the least-painful path:

  1. Claude.ai web chats → Settings → Privacy → Export Data. Anthropic emails you a zip with a conversations.json that keeps the metadata intact. That's the ideal raw form, no cleanup needed.

  2. Claude Code sessions (each machine) → they're stored locally as JSONL:

  • macOS/Linux: ~/.claude/projects/**/*.jsonl
  • Windows: C:\Users\<you>\.claude\projects\**\*.jsonl
    Copy those files off every system into one folder. They're the gold source because every assistant message carries its own model field.
  3. Don't worry about the mid-chat switches to Opus 4.8. This is the important part: because the model field is per message, my provenance pipeline keeps only the genuine claude-fable-5 turns and drops the Opus-4.8 ones automatically. So a session that flipped models midway isn't a problem, and you don't need to sort anything by hand. Raw export = best; I'll do the splitting.
  4. Secrets / keys / PII: my pipeline scrubs these (key patterns, emails, etc.) and it's fully automated, so I'm never sitting there reading your chats. But since they're your secrets, if you'd rather pre-scrub for peace of mind, run gitleaks over the folder (or a quick regex pass for things like sk-…, ghp_…, AKIA…, -----BEGIN … PRIVATE KEY-----, emails) and redact the hits before anything reaches the model.
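
If you go the quick-regex route for pre-scrubbing, a rough Python sketch could look like the following. The patterns are illustrative assumptions, not an exhaustive or authoritative list: token lengths and formats vary by provider, the email pattern is deliberately loose, and the names `PATTERNS` and `redact` are made up here. A dedicated scanner such as gitleaks will catch far more.

```python
import re
from typing import Dict, Tuple

# Assumption: illustrative patterns only; real token formats vary by provider.
PATTERNS = {
    "private_key": re.compile(
        r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----",
        re.DOTALL,
    ),
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_token": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "sk_key": re.compile(r"\bsk-[A-Za-z0-9_-]{20,}"),
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
}


def redact(text: str) -> Tuple[str, Dict[str, int]]:
    """Replace each match with a placeholder and report hit counts per pattern."""
    hits: Dict[str, int] = {}
    for name, pattern in PATTERNS.items():
        text, n = pattern.subn(f"[REDACTED_{name.upper()}]", text)
        if n:
            hits[name] = n
    return text, hits
```

Running `redact` over each file's text and writing the result to a copy (not over the original) keeps the raw export available in case a pattern turns out to be too aggressive. The placeholders contain no quotes or backslashes, so replacing inside a JSONL line should leave it parseable.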

To send it: cleanest is a private HuggingFace dataset repo with yuxinlu1 added as a collaborator. It handles big files and keeps the metadata. Drop the repo link here, or DM it to me on X (once my DMs are open). A single zip/JSONL is totally fine; whatever's least work for you, I'll handle the rest. 🙏
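
If the single-zip route is easier, one possible way to bundle the collected session files is sketched below in Python. It assumes the JSONL files have already been copied into one folder; the function name `bundle_sessions` is hypothetical, and relative paths are kept so files from different machines do not collide.

```python
import zipfile
from pathlib import Path


def bundle_sessions(source_dir: Path, zip_path: Path) -> int:
    """Zip every .jsonl file under source_dir, keeping relative paths.

    Returns the number of files written to the archive.
    """
    files = sorted(p for p in source_dir.rglob("*.jsonl") if p.is_file())
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in files:
            # Store paths relative to source_dir so the folder layout survives.
            archive.write(path, arcname=path.relative_to(source_dir).as_posix())
    return len(files)
```

The files go in byte-for-byte, so the per-message metadata is untouched; only files ending in `.jsonl` are picked up, so anything else in the folder is left out.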
