Sympatheia: Emotionally Adaptive Voice Assistant with Continuous Affect Conditioning

Sukru Samet Dindar, Riki Shimizu, Xilin Jiang, Nima Mesgarani · Columbia University

We introduce Sympatheia, a speech-to-speech dialogue framework conditioned on affect inferred from the user's speech and explicit valence–arousal (VA) control from multimodal sensing modules. Sympatheia outperforms speech conversational baselines in generating responses that are both semantically appropriate and emotionally aligned. A shared VA interface integrates emotion estimates from facial expression, biosignals, and textual affect descriptions, improving response alignment when speech provides limited emotional evidence.


System Overview

Sympatheia combines implicit affect inference from user speech with optional explicit control through a continuous valence–arousal (VA) interface. Optional pluggable sensing modules (facial expression, EEG/biosignals, and textual affect descriptions) feed into the same VA interface, enabling emotionally aligned spoken responses even when speech cues alone are subtle or ambiguous.

Example Responses to Neutral Queries with Emotion Conditioning

The user speaks a neutral question. All models receive the target label for the user's emotion via the system prompt. This demonstrates how models adapt their emotional tone even when the speech carries no semantic or acoustic emotional cues.

Example Responses to Emotional Queries

The user's voice itself carries emotion. No model receives emotion context in its system prompt; each processes the same emotionally expressive audio query without guidance. This tests how well models adapt their tone using only the semantic and acoustic cues in speech.

Continuous Emotion Interpolation

Sympatheia operates in continuous VA space, enabling smooth blending between any two emotional states. The same query audio is used for all five responses; valence and arousal are linearly interpolated between Happy (V=+0.85, A=+0.35) and Sad (V=−0.75, A=−0.65), and between Anxious (V=−0.40, A=+0.65) and Relaxed (V=+0.25, A=−0.60), demonstrating fine-grained tonal control.
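The interpolation described above can be sketched in a few lines of Python. This is only an illustration of linear blending between two VA points, not the project's actual code: the `VA` class, the `interpolate_va` function, and the choice of evenly spaced steps are assumptions made for this example. Only the four endpoint values and the count of five responses come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VA:
    """A point in valence-arousal space (hypothetical helper type)."""
    valence: float
    arousal: float


def interpolate_va(start: VA, end: VA, steps: int = 5) -> list[VA]:
    """Return `steps` evenly spaced VA points from `start` to `end`, inclusive.

    Assumption: even spacing. The page states only that valence and
    arousal are linearly interpolated across five responses.
    """
    if steps < 2:
        raise ValueError("steps must be at least 2 to include both endpoints")
    points = []
    for i in range(steps):
        t = i / (steps - 1)  # blend weight, 0.0 at start and 1.0 at end
        points.append(
            VA(
                valence=round(start.valence + t * (end.valence - start.valence), 4),
                arousal=round(start.arousal + t * (end.arousal - start.arousal), 4),
            )
        )
    return points


# Endpoint values taken from the text above.
HAPPY = VA(0.85, 0.35)
SAD = VA(-0.75, -0.65)
ANXIOUS = VA(-0.40, 0.65)
RELAXED = VA(0.25, -0.60)

if __name__ == "__main__":
    for name, (a, b) in {
        "Happy -> Sad": (HAPPY, SAD),
        "Anxious -> Relaxed": (ANXIOUS, RELAXED),
    }.items():
        print(name)
        for point in interpolate_va(a, b):
            print(f"  V={point.valence:+.2f}, A={point.arousal:+.2f}")
```

Under these assumptions, each returned point would serve as the conditioning target for one of the five responses to the same query audio.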

Input query

Sympatheia-18K Dataset Samples

Sympatheia-18K is a synthesized dataset of emotion-controlled speech-to-speech dialogue pairs. Each sample is conditioned on one of 12 target emotions. The dataset includes two splits: the Emotional Split (12k), which contains emotion-matched query-response pairs, and the Neutral Split (6k), which contains neutral queries paired with all 12 emotional response variants.

Emotional Split

Neutral Split

Query (shared)