🎬 CMU-MOSI — Multimodal Sentiment Analysis
Upload a video review to predict sentiment using cross-modal transformers.
71.49M params | 3-class sentiment + intensity | 6 cross-modal transformer pairs
Upload a video that contains speech: all three modalities (text, audio, visual) are needed for the best fusion results.
Try these examples
Architecture: Text (DeBERTa, 768d) + Audio (Whisper, 512d) + Visual (ViT, 768d) → modality encoders → 6 cross-modal transformer pairs (one per directed modality pair, 4 layers each) → sentiment classifier + intensity regressor
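To illustrate what one of the cross-modal transformer pairs does, here is a minimal numpy sketch of a single cross-modal attention step, where one modality (e.g. text) queries another (e.g. audio). The projection matrices, sequence lengths, and random inputs are illustrative assumptions, not the model's actual weights; the real model stacks 4 such layers per directed pair, with 6 pairs over the 3 modalities.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(target, source, d_model=768, d_k=64):
    """Target modality queries the source modality (e.g. text -> audio).

    Random projections stand in for learned weights (illustration only).
    """
    Wq = rng.standard_normal((target.shape[-1], d_k)) / np.sqrt(target.shape[-1])
    Wk = rng.standard_normal((source.shape[-1], d_k)) / np.sqrt(source.shape[-1])
    Wv = rng.standard_normal((source.shape[-1], d_model)) / np.sqrt(source.shape[-1])
    Q, K, V = target @ Wq, source @ Wk, source @ Wv
    # Each target token attends over all source frames.
    attn = softmax(Q @ K.T / np.sqrt(d_k))
    return attn @ V  # shape: (target_len, d_model)

# Example: 12 text tokens (768-d) attend to 50 audio frames (512-d),
# producing an audio-enriched text sequence back in 768-d.
text = rng.standard_normal((12, 768))
audio = rng.standard_normal((50, 512))
fused = cross_modal_attention(text, audio)
print(fused.shape)  # (12, 768)
```

Note the asymmetry: text→audio and audio→text are separate blocks, which is why 3 modalities yield 6 directed pairs.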
Built by Kareem Waly · GitHub · Google Scholar