I spent three months on the Test-Time Compute for Generative Robot Policies project, and it turned out to be far more complicated than I expected. I'm documenting the problems we encountered so we can return to it better prepared while we shift our focus to the Generative World Models project.

Literature Review

  1. Unsure whether to trust benchmark numbers reported in the literature, and unsure of the proper way to read papers. I wasn't reading research critically enough.
  2. We didn't track the state of the art and wasted time trying to beat a benchmark just to test a novel idea. This came down to my inexperience with managing literature and finding good, reliable sources. I probably need a good tutorial on the right way to use arXiv.

Documentation

  1. Keeping track of experiments and the directions we explored. We focused on velocity and neglected documentation; when we did document, we didn't "connect the dots" or follow the Zettelkasten method, so the thoughts are scattered everywhere.

Development Best Practices

  1. When collaborating, don't fixate on software best practices. For example, don't worry about test-driven development or forking; develop directly on a branch of the main repo.
  2. Agentic coding tools increase complexity quickly, so we need to regularly set aside time to refactor and prune the codebase.
  3. We didn't build on the state-of-the-art methods identified in step 1, so we wasted time engineering the codebase from scratch. We also lacked the D4RL dataset that everyone uses (see the sketch after this list). It's better to stick with established benchmarks than to chase the newest ones; even when papers showcase new benchmarks, they run on the old ones first. This was probably due to a "future-proof and perfectionist" mentality.
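
As a reference point for how little code an established benchmark needs, below is a minimal sketch of loading a D4RL dataset. It assumes the gym and d4rl packages are installed; the halfcheetah-medium-v2 task is only an example, not necessarily the one we would use.

    # Minimal sketch: loading an established offline-RL benchmark from D4RL.
    # Assumes `pip install gym d4rl`; the task name is only an example.
    import gym
    import d4rl  # noqa: F401  (importing registers the D4RL environments with gym)

    env = gym.make("halfcheetah-medium-v2")
    dataset = env.get_dataset()  # dict with 'observations', 'actions', 'rewards', ...
    print(dataset["observations"].shape, dataset["actions"].shape)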

Infrastructure & Tooling

  1. Getting training-cluster access and Dockerizing the environment early can save a lot of time when running experiments in parallel (a minimal sketch follows).
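
For the parallel-run point, here is a hypothetical Python sketch that fans several seeds out across Docker containers once an image exists. The image name robot-policy:latest, the train.py entry point, and the /data/results mount path are placeholders, not our actual setup.

    # Hypothetical sketch: launch several training seeds as parallel Docker runs.
    # "robot-policy:latest", train.py, and /data/results are placeholder names.
    import subprocess

    procs = []
    for seed in range(4):
        cmd = [
            "docker", "run", "--rm", "--gpus", "all",
            "-v", "/data/results:/results",  # shared results directory on the host
            "robot-policy:latest",
            "python", "train.py", f"--seed={seed}",
        ]
        procs.append(subprocess.Popen(cmd))

    for p in procs:
        p.wait()  # block until every run finishes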