DNABERT-6 — Promoter Motif Classifier (demo)

A fine-tune of zhihan1996/DNA_bert_6 that classifies whether a DNA sequence contains the TATA box consensus motif (TATAAA), a common promoter signal.

⚠️ Learning/demo model. Trained on a small synthetic dataset to demonstrate the DNABERT fine-tuning pipeline — not validated for real genomic analysis.

Model Details

Developed by: adipras1407
Base model: zhihan1996/DNA_bert_6 (BERT, 6-mer tokenization)
Task: binary sequence classification (motif present / absent)
Language(s): DNA sequences (A/C/G/T)
License: apache-2.0

Intended Uses

Direct use: classify whether a DNA sequence contains the TATAAA motif.
Out of scope: real promoter prediction, clinical/diagnostic use, non-TATA motifs.

How to Use

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

repo = "adipras1407/my-dnabert6-promoter"
tok = AutoTokenizer.from_pretrained(repo)
# eval() switches off dropout so repeated calls give the same output
model = AutoModelForSequenceClassification.from_pretrained(repo).eval()

def seq2kmer(seq, k=6):
    # DNABERT expects overlapping k-mers (stride 1) separated by spaces
    return " ".join(seq[i:i+k] for i in range(len(seq) - k + 1))

def predict(seq):
    # Inputs longer than max_length tokens are truncated
    x = tok(seq2kmer(seq), return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        # Probability of class 1 (motif present)
        prob = torch.softmax(model(**x).logits, -1)[0, 1].item()
    # "confidence" is P(motif present), not confidence in the predicted label
    return {"has_motif": int(prob > 0.5), "confidence": round(prob, 3)}

print(predict("GGGCGCTATAAACGCGCGATCG"))
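To make the tokenization step concrete, here is a small sketch of what seq2kmer produces, runnable without downloading the model. The input string is an arbitrary illustration, not taken from the training data. Because the motif is exactly 6 bases long and k is 6, an intact TATAAA shows up as a single k-mer in the list.

```python
def seq2kmer(seq, k=6):
    """Split a sequence into overlapping k-mers (stride 1), space-separated."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

# Illustrative 10-base input (hypothetical, chosen to contain the motif once)
example = "GCTATAAACG"
kmers = seq2kmer(example).split()

print(kmers)               # the overlapping 6-mers
print(len(kmers))          # len(example) - 6 + 1
print("TATAAA" in kmers)   # the motif appears as one whole k-mer
```

A sequence of length n yields n - k + 1 k-mers, so inputs shorter than 6 bases produce an empty string.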

Training Data

Synthetic dataset: 1,200 random DNA sequences (length 100). Positives have TATAAA inserted at a random position; negatives are guaranteed not to contain it. 80/20 train/test split.
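The original generation script is not part of this card, so the following is only a minimal sketch of how such a dataset could be built. Several details are assumptions rather than facts from the source: a balanced 600/600 class split, overwriting a 6-base window with the motif (so positives stay at length 100) instead of lengthening the sequence, rejection sampling for negatives, and the seed value. All function names are hypothetical.

```python
import random

MOTIF = "TATAAA"

def random_seq(rng, length):
    """Uniform random DNA string over A/C/G/T."""
    return "".join(rng.choice("ACGT") for _ in range(length))

def make_negative(rng, length=100):
    """Resample until the motif is absent (rejection sampling)."""
    while True:
        seq = random_seq(rng, length)
        if MOTIF not in seq:
            return seq

def make_positive(rng, length=100):
    """Overwrite a random window with the motif; length is unchanged (assumption)."""
    seq = make_negative(rng, length)
    pos = rng.randrange(length - len(MOTIF) + 1)
    return seq[:pos] + MOTIF + seq[pos + len(MOTIF):]

def build_dataset(n=1200, length=100, test_frac=0.2, seed=0):
    """Return (train, test) lists of (sequence, label) pairs."""
    rng = random.Random(seed)
    # Balanced classes are an assumption; the card does not state the ratio
    data = [(make_positive(rng, length), 1) for _ in range(n // 2)]
    data += [(make_negative(rng, length), 0) for _ in range(n - n // 2)]
    rng.shuffle(data)
    n_test = round(n * test_frac)
    return data[n_test:], data[:n_test]

train, test = build_dataset()
print(len(train), len(test))
```

With the sizes stated above, this yields 960 training and 240 test examples.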

Training Procedure

3 epochs, batch size 16, learning rate 2e-5
6-mer tokenization (stride 1), max length 128
Trained on a single Colab GPU using 🤗 Transformers Trainer
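As a quick sanity check on these settings, the arithmetic below works out how many tokens a 100-base sequence occupies and how many optimizer steps the run implies. It assumes the tokenizer adds the two standard BERT special tokens ([CLS] and [SEP]) and that no gradient accumulation was used; the card states neither explicitly.

```python
import math

# Values taken from the card
seq_len, k, max_length = 100, 6, 128
n_examples, train_frac = 1200, 0.8
epochs, batch_size = 3, 16

# Stride-1 k-mers per sequence, plus [CLS] and [SEP] (assumed)
n_kmers = seq_len - k + 1
n_tokens = n_kmers + 2

# Longest sequence that fits in max_length without truncation
max_bases = (max_length - 2) + (k - 1)

# Optimizer steps, assuming no gradient accumulation
n_train = round(n_examples * train_frac)
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs

print(n_kmers, n_tokens, max_bases)
print(n_train, steps_per_epoch, total_steps)
```

Under these assumptions each training sequence uses 97 of the 128 available token positions, so nothing is truncated during training, and the whole run is 180 optimizer steps.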

Limitations

Trained only on synthetic data and a single motif. Real promoters don't always contain a clean TATA box, so this model will not generalize to genuine genomic data. Built as a fine-tuning learning exercise.

Model size: 89.2M params (Safetensors, tensor type F32)
