Eureka
今天把jim fan的采访视频又听了一遍
https://youtu.be/yMGGpMyW_vw?si=m-4_xS7ib7gXyeIY
有几个有意思的东西
一个是eureka项目,让LLM来写reward code并根据反馈持续优化code。其中一个demo是自动学会五个手指转笔。
论文和代码网上都有
EUREKA’s generality is made possible through three key algorithmic design choices: environment as context, evolutionary search, and reward reflection. First, by taking the environment source code as context, EUREKA can zero-shot generate executable reward functions from the backbone coding LLM (GPT-4). Then, EUREKA substantially improves the quality of its rewards by performing evolutionary search, iteratively proposing batches of reward candidates and refining the most promising ones within the LLM context window. This in-context improvement is made effective via reward reflection, a textual summary of the reward quality based on policy training statistics that enables automated and targeted reward editing
idea就是拿llm做一个code generation engine。通过提示工程,模拟反馈,和演化搜索来优化。