Towards natural human-AI interactions in vision and language