DNABERT-6: Promoter Motif Classifier (demo)

A fine-tune of zhihan1996/DNA_bert_6 that detects whether a DNA sequence contains the TATA box motif (TATAAA), a common promoter signal.

โš ๏ธ Learning/demo model. Trained on a small synthetic dataset to demonstrate the DNABERT fine-tuning pipeline โ€” not validated for real genomic analysis.

Model Details

  • Developed by: adipras1407
  • Base model: zhihan1996/DNA_bert_6 (BERT, 6-mer tokenization)
  • Task: binary sequence classification (motif present / absent)
  • Language(s): DNA sequences (A/C/G/T)
  • License: apache-2.0

Intended Uses

  • Direct use: classify whether a DNA sequence contains the TATAAA motif.
  • Out of scope: real promoter prediction, clinical/diagnostic use, non-TATA motifs.

How to Use

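import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

repo = "adipras1407/my-dnabert6-promoter"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo).eval()

def seq2kmer(seq, k=6):
    # Split the sequence into overlapping k-mers (stride 1), space-separated.
    return " ".join(seq[i:i+k] for i in range(len(seq) - k + 1))

def predict(seq):
    # Uppercase the input so it matches the A/C/G/T alphabet the model expects.
    x = tok(seq2kmer(seq.upper()), return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        # Softmax probability of class index 1, taken here to mean "motif present".
        prob = torch.softmax(model(**x).logits, -1)[0, 1].item()
    # "confidence" is P(motif present), not confidence in the predicted label.
    return {"has_motif": int(prob > 0.5), "confidence": round(prob, 3)}

print(predict("GGGCGCTATAAACGCGCGATCG"))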
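To make the 6-mer conversion concrete, the sketch below re-declares the `seq2kmer` helper so it runs without downloading the model, and prints what the tokenizer actually receives for the example sequence. It also shows an edge case the snippet above does not guard against: a sequence shorter than k produces no k-mers at all.

```python
def seq2kmer(seq, k=6):
    # Same helper as in the usage snippet: overlapping k-mers, stride 1.
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

seq = "GGGCGCTATAAACGCGCGATCG"
kmers = seq2kmer(seq).split()

# A sequence of length L yields L - k + 1 k-mers.
print(len(seq), len(kmers))    # 22 17
print(kmers[:3])               # ['GGGCGC', 'GGCGCT', 'GCGCTA']

# 0-based index of the one k-mer that is exactly the motif.
print(kmers.index("TATAAA"))   # 6

# Edge case: fewer than k bases gives an empty string, so the
# tokenizer would see nothing but its special tokens.
print(repr(seq2kmer("ACGT")))  # ''
```

Because the k-mers overlap, the motif also appears partially in the five k-mers on either side of index 6, which is the context the classifier has to work with.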

Training Data

Synthetic dataset: 1,200 random DNA sequences (length 100). Positives have TATAAA inserted at a random position; negatives are guaranteed not to contain it. 80/20 train/test split.
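A dataset matching this description could be generated along the following lines. This is a minimal sketch, not the script actually used: the function names, the seed, the balanced 50/50 class split, and the choice to overwrite bases with the motif (so every sequence stays at length 100) are all assumptions.

```python
import random

MOTIF = "TATAAA"
ALPHABET = "ACGT"

def random_seq(length, rng):
    return "".join(rng.choice(ALPHABET) for _ in range(length))

def make_negative(length, rng):
    # Rejection sampling: redraw until the motif is absent.
    while True:
        seq = random_seq(length, rng)
        if MOTIF not in seq:
            return seq

def make_positive(length, rng):
    # Assumption: the motif overwrites existing bases, keeping the length fixed.
    seq = make_negative(length, rng)
    pos = rng.randrange(length - len(MOTIF) + 1)
    return seq[:pos] + MOTIF + seq[pos + len(MOTIF):]

def build_dataset(n=1200, length=100, test_frac=0.2, seed=0):
    rng = random.Random(seed)
    # Assumption: balanced classes (n/2 positives, n/2 negatives).
    data = [(make_positive(length, rng), 1) for _ in range(n // 2)]
    data += [(make_negative(length, rng), 0) for _ in range(n // 2)]
    rng.shuffle(data)
    n_test = round(n * test_frac)
    return data[n_test:], data[:n_test]  # (train, test)

train, test = build_dataset()
print(len(train), len(test))  # 960 240
```

Rejection sampling is cheap here: a random 100-base sequence contains TATAAA only about 2% of the time, so negatives rarely need a redraw.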

Training Procedure

  • 3 epochs, batch size 16, learning rate 2e-5
  • 6-mer tokenization (stride 1), max length 128
  • Trained on a single Colab GPU using 🤗 Transformers Trainer
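The settings above imply some simple arithmetic, worked through in the sketch below. It assumes the tokenizer adds one [CLS] and one [SEP] token (standard for BERT tokenizers) and that training ran on one device without gradient accumulation; neither detail is stated in this card. The variable names are illustrative.

```python
import math

seq_len, k, max_length = 100, 6, 128
n_examples, train_frac = 1200, 0.8
epochs, batch_size = 3, 16

# Token budget per training sequence.
n_kmers = seq_len - k + 1           # overlapping 6-mers at stride 1
n_tokens = n_kmers + 2              # assumed: plus [CLS] and [SEP]
fits = n_tokens <= max_length

# Longest raw sequence that avoids truncation at max_length.
longest_seq = (max_length - 2) + k - 1

# Optimizer steps (assumed: single device, no gradient accumulation).
n_train = round(n_examples * train_frac)
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs

print(n_kmers, n_tokens, fits)                # 95 97 True
print(longest_seq)                            # 131
print(n_train, steps_per_epoch, total_steps)  # 960 60 180
```

Under these assumptions, training sequences fit comfortably within the 128-token limit, while inputs longer than 131 bases would be truncated at inference time.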

Limitations

Trained only on synthetic data and a single motif. Real promoters don't always contain a clean TATA box, so this will not generalize to genuine genomic data. Built as a fine-tuning learning exercise.
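Since the label is defined by an exact substring, a plain string search reproduces the ground truth for this task. The sketch below (helper names are hypothetical) shows such a reference check, which can be useful for sanity-checking the model's predictions on new synthetic sequences.

```python
MOTIF = "TATAAA"

def has_motif(seq, motif=MOTIF):
    # Exact, case-insensitive substring match on the given strand only.
    return int(motif in seq.upper())

def agreement(predicted_labels, sequences):
    # Fraction of sequences where a predicted label matches the string search.
    truth = [has_motif(s) for s in sequences]
    hits = sum(int(p == t) for p, t in zip(predicted_labels, truth))
    return hits / len(sequences)

seqs = ["GGGCGCTATAAACGCGCGATCG", "GGGCGCGCGCGCGATCG", "ccctataaaggg"]
print([has_motif(s) for s in seqs])  # [1, 0, 1]
print(agreement([1, 0, 0], seqs))    # two of three labels agree
```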

Model size: 89.2M params (Safetensors, tensor type F32)