Multimodal Representation Learning Under Weak Supervision