Towards Ecologically Valid Evaluations of Language Models