Multimodal Learning from Videos: Exploring Models and Task Complexities