Natural language-guided image and video understanding