Qwen3.6-35B-A3B — cigna + Palmyra-identity LoRA SFT merged and quantized (no-reasoning / production)

This repo contains the LoRA adapter https://huggingface.co/Writer/qwen3.6-35b-a3b-cigna-palmyra-noreason-sft/tree/main/lr1e-4_ep3_r16 merged into Qwen/Qwen3.6-35B-A3B in BF16. The merged model is then averaged with the base model, also in BF16, and the resulting weights are requantized to FP8 using the Megatron SWIFT library.
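
As a sketch of what the merge-then-average step does to each adapted weight matrix, assume the standard LoRA parameterization with low-rank factors B and A, rank r (16, going by `r16` in the adapter path), and scaling factor α, and assume the average is an equal-weight mean. Neither assumption is stated in this card. Under them, averaging with the base model is equivalent to applying the LoRA update at half strength:

```latex
\begin{aligned}
W_{\text{merged}} &= W_{\text{base}} + \frac{\alpha}{r}\, B A \\
W_{\text{avg}}    &= \tfrac{1}{2}\left(W_{\text{base}} + W_{\text{merged}}\right)
                   = W_{\text{base}} + \frac{\alpha}{2r}\, B A
\end{aligned}
```

Weights that the adapter does not target satisfy W_merged = W_base, so they pass through both steps unchanged before FP8 quantization.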


Running inference (non-thinking)

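# Fill in the placeholders before running.
GPU=<AVAILABLE_GPU>; PORT=<AVAILABLE_PORT>
MODEL=<HF_REPO_PATH>   # local directory holding the model files; mounted at /model below
VLLM_SQSH=/fsx/kiran/containers/vllm-v0.24.0dev-20260624.sqsh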
export ENROOT_DATA_PATH=$HOME/enroot/data ENROOT_CACHE_PATH=$HOME/enroot/cache ENROOT_RUNTIME_PATH=$HOME/enroot/runtime
mkdir -p "$ENROOT_DATA_PATH" "$ENROOT_CACHE_PATH" "$ENROOT_RUNTIME_PATH"
export NVIDIA_VISIBLE_DEVICES=all CUDA_VISIBLE_DEVICES=$GPU PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
enroot remove --force cigna-fp8-serve 2>/dev/null
enroot create --name cigna-fp8-serve "$VLLM_SQSH"
enroot start --rw --mount "$MODEL:/model" \
  --env HF_HUB_OFFLINE=1 --env NVIDIA_VISIBLE_DEVICES --env CUDA_VISIBLE_DEVICES --env PYTORCH_CUDA_ALLOC_CONF \
  cigna-fp8-serve /model \
  --served-model-name cigna-fp8 --reasoning-parser qwen3 \
  --max-model-len 32768 --tensor-parallel-size 1 \
  --additional-config '{"gdn_prefill_backend":"triton"}' \
  --gpu-memory-utilization 0.90 --host 0.0.0.0 --port $PORT

Because these are non-thinking models, request with enable_thinking=False so the chat template doesn't open a <think> block:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # replace 8000 with the $PORT used above
resp = client.chat.completions.create(
    model="cigna-fp8",  # must match --served-model-name in the launch command
    messages=[{"role": "user", "content": "Who are you?"}],
    temperature=0,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},  # non-thinking
)
print(resp.choices[0].message.content)   # -> "I am palmyra, ... created by Writer."
# message.reasoning will be empty — these models intentionally do not emit reasoning.
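
For environments where the openai package is not installed, the same non-thinking request can be sketched with the Python standard library alone. This is a minimal illustration, not part of the original card: the helper names `build_chat_payload` and `chat` are hypothetical, and the model name assumes the `--served-model-name cigna-fp8` value from the launch command. The key detail is that `chat_template_kwargs` sits at the top level of the JSON body, which is where the OpenAI client's `extra_body` places it.

```python
import json
import urllib.request


def build_chat_payload(model: str, prompt: str) -> dict:
    """Build a /v1/chat/completions body with thinking disabled (hypothetical helper)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        # Top-level field, equivalent to extra_body in the OpenAI client.
        "chat_template_kwargs": {"enable_thinking": False},
    }


def chat(base_url: str, payload: dict, timeout: float = 60.0) -> str:
    """POST the payload to an OpenAI-compatible server and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer EMPTY",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage, with the server from the launch command running
# (replace 8000 with the port you chose):
#   payload = build_chat_payload("cigna-fp8", "Who are you?")
#   print(chat("http://localhost:8000/v1", payload))
```

Keeping payload construction separate from the network call makes it easy to inspect or log the exact request before sending it.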
