Generative AI tools such as large language models (LLMs) and diffusion models have become ubiquitous in everyday life, raising urgent safety concerns due to their potential to generate harmful content. In this research thrust, we develop control-theoretic algorithms that monitor safety during inference and steer these models toward producing safer outputs without compromising their performance.

Preemptive Detection and Steering of LLM Misalignment via Latent Reachability
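As a rough illustration of the inference-time monitor-and-steer idea, the sketch below uses a hypothetical linear "safety probe" over a decoder's hidden states and nudges flagged latents down the probe's gradient. The probe, threshold, and step sizes are placeholder assumptions for illustration only, not the reachability-based formulation from the work above.

```python
import torch
import torch.nn as nn

# Hypothetical safety probe: a linear value over the model's hidden states,
# assumed to be trained elsewhere so that higher scores indicate a higher
# risk of unsafe continuations. Dimensions and constants are placeholders.
HIDDEN_DIM = 768


class SafetyProbe(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, hidden_dim) latent state at the current decoding step
        return self.scorer(hidden).squeeze(-1)  # (batch,) unsafety score


def monitor_and_steer(hidden: torch.Tensor,
                      probe: SafetyProbe,
                      threshold: float = 0.0,
                      step_size: float = 0.5,
                      max_steps: int = 5) -> torch.Tensor:
    """Monitor the unsafety score of each latent; if it exceeds the threshold,
    nudge that latent down the probe's gradient until it re-enters the 'safe'
    region or the step budget runs out. Latents already judged safe are left
    unchanged, so benign generations are not perturbed."""
    steered = hidden.detach().clone()
    for _ in range(max_steps):
        scores = probe(steered)
        if scores.max().item() <= threshold:
            break  # every latent in the batch is already in the safe region
        steered = steered.detach().requires_grad_(True)
        grad, = torch.autograd.grad(probe(steered).sum(), steered)
        unsafe = (probe(steered) > threshold).float().unsqueeze(-1)
        # Only steer the latents whose scores are flagged as unsafe.
        steered = (steered - step_size * unsafe * grad).detach()
    return steered


if __name__ == "__main__":
    torch.manual_seed(0)
    probe = SafetyProbe(HIDDEN_DIM)
    latent = torch.randn(2, HIDDEN_DIM)  # stand-in for a decoder hidden state
    print("scores before:", probe(latent).tolist())
    latent = monitor_and_steer(latent, probe)
    print("scores after: ", probe(latent).tolist())
```

In an actual deployment, such a hook would read and overwrite the hidden state of a real language model at each decoding step; the toy tensors here only stand in for that interface.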

For a more complete list of our research work, please go HERE!