AraMix
AraMix is a SOTA Arabic pretraining dataset
AdaMLLab/AraMix • Updated Jan 30 • 394M • 1.94k • 7
Paper: Mix, MinHash, and Match: Cross-Source Agreement for Multilingual Pretraining Datasets • 2512.18834 • Published Dec 21, 2025 • 4
Fineweb-Edu-Ar
Largest (as of 2024) machine-translated Arabic educational corpus
kaust-generative-ai/fineweb-edu-ar • Updated Nov 12, 2024 • 363M • 327 • 13
Paper: Fineweb-Edu-Ar: Machine-translated Corpus to Support Arabic Small Language Models • 2411.06402 • Published Nov 10, 2024 • 2