🎬 CMU-MOSI — Multimodal Sentiment Analysis

Upload a video review to predict sentiment using cross-modal transformers.

71.49M parameters | 3-class sentiment + intensity regression | 6 cross-modal transformer pairs

Upload a video with speech for best multimodal fusion results.


Architecture: Text (DeBERTa, 768-d) + Audio (Whisper, 512-d) + Visual (ViT, 768-d) → modality encoders → 6 cross-modal transformer pairs (4 layers each) → sentiment classifier + intensity regressor
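To make the "6 cross-modal transformer pairs" concrete: with three modalities there are six ordered (target ← source) directions, and in each one the target modality's tokens attend to the source modality's tokens. The following is a minimal NumPy sketch of that fusion pattern, not the model's actual code: it uses single-head attention with random weights standing in for learned parameters, and the sequence lengths are illustrative; only the feature dimensions (768/512/768) come from the architecture line above.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(query_seq, context_seq, d_model, rng):
    # Single-head sketch: the query (target) modality attends to the
    # context (source) modality. Random projections stand in for
    # learned weights; a real block would add layer norm, residuals,
    # multiple heads, and a feed-forward sublayer, stacked 4 deep.
    Wq = rng.standard_normal((query_seq.shape[-1], d_model)) / np.sqrt(query_seq.shape[-1])
    Wk = rng.standard_normal((context_seq.shape[-1], d_model)) / np.sqrt(context_seq.shape[-1])
    Wv = rng.standard_normal((context_seq.shape[-1], d_model)) / np.sqrt(context_seq.shape[-1])
    Q, K, V = query_seq @ Wq, context_seq @ Wk, context_seq @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d_model))  # (len_query, len_context)
    return attn @ V                             # query tokens enriched with context

rng = np.random.default_rng(0)
d_model = 64                                     # illustrative fusion width
text   = rng.standard_normal((20, 768))          # DeBERTa token features, 768-d
audio  = rng.standard_normal((50, 512))          # Whisper frame features, 512-d
visual = rng.standard_normal((16, 768))          # ViT frame features, 768-d

# Six ordered (target <- source) pairs, one cross-modal branch each.
modalities = {"text": text, "audio": audio, "visual": visual}
fused = {}
for tgt, tgt_seq in modalities.items():
    for src, src_seq in modalities.items():
        if tgt != src:
            fused[f"{tgt}<-{src}"] = cross_modal_attention(tgt_seq, src_seq, d_model, rng)

print(sorted(fused))                # the six directed modality pairs
print(fused["text<-audio"].shape)   # (20, 64): text tokens attending to audio frames
```

In the full model the six fused streams are what feed the sentiment classifier and intensity regressor; how they are pooled and concatenated before those heads is not specified here.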

Built by Kareem Waly · GitHub · Google Scholar