Grounding Visual Content Via Natural Language Expressions