We introduce Sympatheia, a speech-to-speech dialogue framework conditioned on affect inferred from the user's speech and on explicit valence–arousal (VA) control from multimodal sensing modules. Sympatheia outperforms conversational speech baselines in generating responses that are both semantically appropriate and emotionally aligned. A shared VA interface integrates emotion estimates from facial expression, biosignals, and textual affect descriptions, improving response alignment when speech provides limited emotional evidence.
Sympatheia combines implicit affect inference from user speech with optional explicit control through a continuous valence–arousal (VA) interface. Optional pluggable sensing modules (facial expression, EEG/biosignals, and textual affect descriptions) feed into the same VA interface, enabling emotionally aligned spoken responses even when speech cues alone are subtle or ambiguous.
The user asks a question in a neutral voice. All models receive the target user-emotion label via the system prompt. This setting shows how models adapt their emotional tone when the speech itself carries no semantic or acoustic emotional cues.
The user's voice itself carries emotion. No model receives emotion context in its system prompt; each processes the same emotionally expressive audio query without guidance. This tests how well models adapt their tone using only the semantic and acoustic cues in the speech.
Sympatheia operates in continuous VA space, enabling smooth blending between any two emotional states. The same query audio is used for all five responses; valence and arousal are linearly interpolated between Happy (V=+0.85, A=+0.35) and Sad (V=−0.75, A=−0.65), and between Anxious (V=−0.40, A=+0.65) and Relaxed (V=+0.25, A=−0.60), demonstrating fine-grained tonal control.
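The interpolation described above can be sketched in a few lines. This is a minimal illustration rather than Sympatheia's actual code: the function name `interpolate_va` is hypothetical, and the sketch assumes the five responses correspond to five evenly spaced points that include both endpoints, which the text does not state explicitly.

```python
from typing import List, Tuple

VA = Tuple[float, float]  # (valence, arousal)

# Endpoint values taken from the description above.
HAPPY: VA = (0.85, 0.35)
SAD: VA = (-0.75, -0.65)
ANXIOUS: VA = (-0.40, 0.65)
RELAXED: VA = (0.25, -0.60)


def interpolate_va(start: VA, end: VA, steps: int = 5) -> List[VA]:
    """Return `steps` evenly spaced VA points from start to end, inclusive.

    Assumption: even spacing with both endpoints included.
    """
    if steps < 2:
        raise ValueError("steps must be at least 2")
    points = []
    for i in range(steps):
        t = i / (steps - 1)  # interpolation weight in [0, 1]
        valence = (1 - t) * start[0] + t * end[0]
        arousal = (1 - t) * start[1] + t * end[1]
        # Round to suppress floating-point noise in the printed values.
        points.append((round(valence, 4), round(arousal, 4)))
    return points


happy_to_sad = interpolate_va(HAPPY, SAD)
anxious_to_relaxed = interpolate_va(ANXIOUS, RELAXED)

if __name__ == "__main__":
    for v, a in happy_to_sad:
        print(f"V={v:+.2f}, A={a:+.2f}")
```

Under these assumptions, the midpoint of the Happy to Sad path lands near the neutral origin (V≈+0.05, A≈−0.15), which is what makes the blend read as a gradual shift rather than a switch between two discrete labels.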
Sympatheia-18K is a synthesized dataset of emotion-controlled speech-to-speech dialogue pairs. Each sample is conditioned on one of 12 target emotions. The dataset has two splits: an Emotional Split (12k) of emotion-matched query-response pairs, and a Neutral Split (6k) of neutral queries paired with all 12 emotional response variants.