DNABERT-6: Promoter Motif Classifier (demo)
Fine-tuned zhihan1996/DNA_bert_6 to detect
whether a DNA sequence contains the TATA box motif (TATAAA), a common promoter signal.
⚠️ Learning/demo model. Trained on a small synthetic dataset to demonstrate the DNABERT fine-tuning pipeline; not validated for real genomic analysis.
Model Details
- Developed by: adipras1407
- Base model: zhihan1996/DNA_bert_6 (BERT, 6-mer tokenization)
- Task: binary sequence classification (motif present / absent)
- Language(s): DNA sequences (A/C/G/T)
- License: apache-2.0
Intended Uses
- Direct use: classify whether a DNA sequence contains the TATAAA motif.
- Out of scope: real promoter prediction, clinical/diagnostic use, non-TATA motifs.
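Because the target is an exact motif, the true label for any sequence can be computed with a plain substring check. The sketch below is a reference baseline for sanity-checking model predictions, not part of the released model; the function name `has_motif` is hypothetical, and it assumes only the forward strand is searched.

```python
MOTIF = "TATAAA"  # TATA box motif named in this card

def has_motif(seq: str, motif: str = MOTIF) -> bool:
    """Exact, case-insensitive substring check (forward strand only; an assumption)."""
    return motif in seq.upper()

print(has_motif("GGGCGCTATAAACGCGCGATCG"))  # True
print(has_motif("GGGCGCGCGCGATCG"))         # False
```

Comparing the model's `has_motif` output against this check on held-out sequences gives a quick correctness test for the classifier.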
How to Use
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

repo = "adipras1407/my-dnabert6-promoter"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo).eval()

def seq2kmer(seq, k=6):
    # Split the sequence into overlapping k-mers (stride 1), space-separated.
    return " ".join(seq[i:i+k] for i in range(len(seq) - k + 1))

def predict(seq):
    x = tok(seq2kmer(seq), return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        # Probability of class 1 (motif present).
        prob = torch.softmax(model(**x).logits, -1)[0, 1].item()
    return {"has_motif": int(prob > 0.5), "confidence": round(prob, 3)}

print(predict("GGGCGCTATAAACGCGCGATCG"))
Training Data
Synthetic dataset: 1,200 random DNA sequences (length 100). Positives have TATAAA inserted at a random position; negatives are guaranteed not to contain it. 80/20 train/test split.
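The original generation script is not included in this card, so the following is only a minimal sketch of how such a dataset could be built. Several details are assumptions: classes are balanced 50/50, the motif overwrites a 6-base window so every sequence stays at length 100, negatives are produced by rejection sampling, and the seed and all function names are hypothetical.

```python
import random

MOTIF = "TATAAA"
BASES = "ACGT"

def random_seq(rng, length=100):
    return "".join(rng.choice(BASES) for _ in range(length))

def make_negative(rng, length=100):
    # Rejection sampling: redraw until the motif is absent.
    while True:
        seq = random_seq(rng, length)
        if MOTIF not in seq:
            return seq

def make_positive(rng, length=100):
    # Assumption: the motif overwrites a window, keeping the length fixed.
    seq = random_seq(rng, length)
    pos = rng.randrange(length - len(MOTIF) + 1)
    return seq[:pos] + MOTIF + seq[pos + len(MOTIF):]

def build_dataset(n=1200, length=100, test_frac=0.2, seed=0):
    rng = random.Random(seed)
    # Assumption: balanced classes (the card does not state the ratio).
    data = [(make_positive(rng, length), 1) for _ in range(n // 2)]
    data += [(make_negative(rng, length), 0) for _ in range(n - n // 2)]
    rng.shuffle(data)
    n_test = round(n * test_frac)
    return data[n_test:], data[:n_test]  # (train, test)

train, test = build_dataset()
print(len(train), len(test))  # 960 240
```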
Training Procedure
- 3 epochs, batch size 16, learning rate 2e-5
- 6-mer tokenization (stride 1), max length 128
- Trained on a single Colab GPU using 🤗 Transformers Trainer
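The settings above imply some simple arithmetic, sketched here as a sanity check. It assumes each 6-mer maps to exactly one token, that the tokenizer adds the usual BERT `[CLS]` and `[SEP]` tokens, and that training uses one device with no gradient accumulation; none of these is stated explicitly in the card.

```python
import math

def seq2kmer(seq, k=6):
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

seq_len, k, max_length = 100, 6, 128

# A length-100 sequence yields 100 - 6 + 1 = 95 overlapping 6-mers at stride 1.
n_kmers = len(seq2kmer("A" * seq_len, k).split())
n_tokens = n_kmers + 2  # assumption: plus [CLS] and [SEP]
print(n_tokens, n_tokens <= max_length)  # 97 True

# 80% of 1,200 sequences, batch size 16, 3 epochs.
n_train = 1200 * 80 // 100
steps_per_epoch = math.ceil(n_train / 16)
total_steps = steps_per_epoch * 3
print(steps_per_epoch, total_steps)  # 60 180
```

Under these assumptions, the max length of 128 never truncates a training sequence, and a full run is about 180 optimizer steps.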
Limitations
Trained only on synthetic data and a single motif. Real promoters do not always contain a clean TATA box, so the model should not be expected to generalize to genuine genomic data. Built as a fine-tuning learning exercise.