Qwen3.6-35B-A3B — cigna + Palmyra-identity LoRA SFT merged and quantized (no-reasoning / production)
This repo contains the LoRA adapter https://huggingface.co/Writer/qwen3.6-35b-a3b-cigna-palmyra-noreason-sft/tree/main/lr1e-4_ep3_r16 merged into Qwen/Qwen3.6-35B-A3B in BF16. The merged model is then averaged with the base model, also in BF16, and the resulting weights are requantized to FP8 using the Megatron SWIFT library.
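To make the merge-then-average step concrete, here is a minimal pure-Python sketch on a toy 2x2 weight matrix. It assumes the standard LoRA merge rule W + (alpha / rank) * B @ A and an equal-weight (0.5/0.5) average. The averaging weight, `alpha`, the toy numbers, and every helper name are illustrative assumptions rather than values or code taken from this repo, and the FP8 quantization step is not shown.

```python
# Toy illustration of "merge LoRA, then average with the base" on small matrices.
# All names and numbers are hypothetical; real checkpoints hold tensors, not lists.

def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def merge_lora(w, lora_a, lora_b, alpha, rank):
    """Standard LoRA merge: W' = W + (alpha / rank) * (B @ A)."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[wij + scale * dij for wij, dij in zip(wr, dr)] for wr, dr in zip(w, delta)]

def average(w1, w2, weight=0.5):
    """Element-wise interpolation; weight=0.5 is an assumed equal average."""
    return [[weight * a + (1.0 - weight) * b for a, b in zip(r1, r2)]
            for r1, r2 in zip(w1, w2)]

base = [[1.0, 0.0], [0.0, 1.0]]   # toy base weight (2x2)
lora_a = [[2.0, 4.0]]             # A: rank x in_features (rank 1 here)
lora_b = [[1.0], [3.0]]           # B: out_features x rank
merged = merge_lora(base, lora_a, lora_b, alpha=1.0, rank=1)
averaged = average(merged, base)
print(merged)    # [[3.0, 4.0], [6.0, 13.0]]
print(averaged)  # [[2.0, 2.0], [3.0, 7.0]]
```

Under these assumptions, averaging with the base halves the LoRA update: the result equals base + 0.5 * (alpha / rank) * B @ A.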
Running inference (non-thinking)
# Choose a free GPU index and port. MODEL must be a host path, since it is bind-mounted at /model.
GPU=<AVAILABLE_GPU>; PORT=<AVAILABLE_PORT>
MODEL=<HF_REPO_PATH>
VLLM_SQSH=/fsx/kiran/containers/vllm-v0.24.0dev-20260624.sqsh
# Keep enroot data, cache, and runtime directories under $HOME.
export ENROOT_DATA_PATH=$HOME/enroot/data ENROOT_CACHE_PATH=$HOME/enroot/cache ENROOT_RUNTIME_PATH=$HOME/enroot/runtime
mkdir -p "$ENROOT_DATA_PATH" "$ENROOT_CACHE_PATH" "$ENROOT_RUNTIME_PATH"
export NVIDIA_VISIBLE_DEVICES=all CUDA_VISIBLE_DEVICES=$GPU PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
# Recreate the container from the vLLM image.
enroot remove --force cigna-fp8-serve 2>/dev/null
enroot create --name cigna-fp8-serve "$VLLM_SQSH"
# Mount the model at /model and start the server.
enroot start --rw --mount "$MODEL:/model" \
  --env HF_HUB_OFFLINE=1 --env NVIDIA_VISIBLE_DEVICES --env CUDA_VISIBLE_DEVICES --env PYTORCH_CUDA_ALLOC_CONF \
  cigna-fp8-serve /model \
  --served-model-name cigna-fp8 --reasoning-parser qwen3 \
  --max-model-len 32768 --tensor-parallel-size 1 \
  --additional-config '{"gdn_prefill_backend":"triton"}' \
  --gpu-memory-utilization 0.90 --host 0.0.0.0 --port "$PORT"
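After the container starts, the server may need some time to load the weights before it accepts requests. A small readiness check might look like the sketch below. It assumes the server exposes the OpenAI-compatible `/v1/models` route that vLLM serves; `models_url` and `wait_for_server` are hypothetical helper names, and the port must be the one passed to `--port`.

```python
import json
import time
import urllib.error
import urllib.request

def models_url(host, port):
    """Build the URL of the OpenAI-compatible model listing route."""
    return f"http://{host}:{port}/v1/models"

def wait_for_server(url, timeout_s=600.0, interval_s=5.0):
    """Poll `url` until it answers with JSON or `timeout_s` elapses.

    Returns the list of served model ids on success, or None on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                payload = json.load(resp)
            return [m["id"] for m in payload.get("data", [])]
        except (urllib.error.URLError, OSError, ValueError):
            pass  # not up yet: connection refused, timeout, or non-JSON reply
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval_s)

# Example call (8000 is a placeholder for the chosen port):
# wait_for_server(models_url("localhost", 8000))
```

If the launch command above is used unchanged, the returned list would be expected to contain the name given to `--served-model-name`.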
Because this is a non-thinking model, send requests with enable_thinking=False so
that the chat template doesn't open a <think> block:
from openai import OpenAI
# Replace 8000 with the PORT used when starting the server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="cigna-fp8",  # must match --served-model-name
    messages=[{"role": "user", "content": "Who are you?"}],
    temperature=0,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},  # non-thinking
)
print(resp.choices[0].message.content)  # -> "I am palmyra, ... created by Writer."
# message.reasoning will be empty; the model intentionally does not emit reasoning.
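For clients that do not use the OpenAI SDK, it can help to see where the flag ends up in the request. The sketch below builds the JSON body by hand, assuming that `extra_body` fields are merged into the top level of the request and that the server reads `chat_template_kwargs` from there, as vLLM's OpenAI-compatible server does. `build_chat_request` is a hypothetical helper name, and the model name is the one given to `--served-model-name` in the launch command.

```python
import json

def build_chat_request(model, prompt, enable_thinking=False):
    """Return the JSON-serializable body for POST /v1/chat/completions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        # Top-level field; the OpenAI SDK sends it via extra_body.
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }

body = build_chat_request("cigna-fp8", "Who are you?")
print(json.dumps(body, indent=2))
```

The printed body can be sent with any HTTP client to the server's `/v1/chat/completions` route.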
Model tree for Writer/qwen3.6-35b-a3b-cigna-palmyra-noreason-sft_lr1e-4_ep3_r16_merged_averaged_quantized_fp8
Base model
Qwen/Qwen3.6-35B-A3B