Lessons learned about benchmarking, adversarial testing, the dangers of over- and under-claiming, and AI alignment.

Transcript: https://web.stanford.edu/class/cs224u/podcast/bowman/

Sam's website
Sam on Twitter
NYU Linguistics
NYU Data Science
NYU Computer Science
Anthropic
SNLI paper: A large annotated corpus for learning natural language inference
SNLI leaderboard
FraCaS
SICK
A SICK cure for the evaluation of compositional distributional semantic models
SemEval-2014 Task 1: Evaluation of Compositional Distributional Semantic Models on Full Sentences through Semantic Relatedness and Textual Entailment
RTE Knowledge Resources
Richard Socher
Chris Manning
Andrew Ng
Ray Kurzweil
SQuAD
Gabor Angeli
Adina Williams
Adina Williams podcast episode
MultiNLI paper: A broad-coverage challenge corpus for sentence understanding through inference
MultiNLI leaderboards
Twitter discussion of LLMs and negation
GLUE
SuperGLUE
DecaNLP
GPT-3 paper: Language Models are Few-Shot Learners
FLAN
Winograd schema challenges
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
JSALT: General-Purpose Sentence Representation Learning
Ellie Pavlick
Ellie Pavlick podcast episode
Tal Linzen
Ian Tenney
Dipanjan Das
Yoav Goldberg
Fine-grained Analysis of Sentence Embeddings Using Auxiliary Prediction Tasks
Big Bench
Upwork
Surge AI
Dynabench
Douwe Kiela
Douwe Kiela podcast episode
Ethan Perez
NYU Alignment Research Group
Eliezer Shlomo Yudkowsky
Alignment Research Center
Redwood Research
Percy Liang podcast episode
Richard Socher podcast episode