I spent three months on the Test-Time Compute for Generative Robot Policies project, and it turned out to be far more complicated than I expected. I'm documenting the problems we encountered so we can return to it better prepared while we shift our focus to the Generative World Models project.

Literature Review

  1. Unsure whether to trust benchmark numbers reported in the literature, and unsure of the proper way to read papers. I wasn't reading research critically enough.
  2. We didn't track the state of the art and wasted time trying to beat a benchmark just to test a novel idea. This came down to my inexperience with managing literature and finding good, reliable sources. I probably need a good tutorial on the right way to use arXiv.

Documentation

  1. Keeping track of experiments and the directions we explored. We focused on velocity and neglected documentation; when we did document, we didn't "connect the dots" or follow the Zettelkasten method, so the thoughts are scattered everywhere.

Development Best Practices

  1. When collaborating, don't fixate on software best practices. For example, don't worry about test-driven development or forking; develop directly on a branch of the main repo.
  2. Agentic coding tools increase complexity quickly, so we need to regularly set aside time to refactor and prune the codebase.
  3. We didn't build on the state-of-the-art methods identified in step 1, so we wasted time engineering the codebase from scratch. We also lacked the D4RL dataset that everyone uses (see the sketch after this list). It's better to stick with established benchmarks than to chase the newest ones; even when papers showcase new benchmarks, they run on the old ones first. This was probably due to a "future-proof and perfectionist" mentality.
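
As a reference point for how little code an established benchmark needs, below is a minimal sketch of loading a D4RL dataset. It assumes the gym and d4rl packages are installed; the halfcheetah-medium-v2 task is only an example, not necessarily the one we would use.

    # Minimal sketch: loading an established offline-RL benchmark from D4RL.
    # Assumes `pip install gym d4rl`; the task name is only an example.
    import gym
    import d4rl  # noqa: F401  (importing registers the D4RL environments with gym)

    env = gym.make("halfcheetah-medium-v2")
    dataset = env.get_dataset()  # dict with 'observations', 'actions', 'rewards', ...
    print(dataset["observations"].shape, dataset["actions"].shape)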

Infrastructure & Tooling

  1. Getting training-cluster access and Dockerizing the environment early can save a lot of time when running experiments in parallel (a minimal sketch follows).
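
For the parallel-run point, here is a hypothetical Python sketch that fans several seeds out across Docker containers once an image exists. The image name robot-policy:latest, the train.py entry point, and the /data/results mount path are placeholders, not our actual setup.

    # Hypothetical sketch: launch several training seeds as parallel Docker runs.
    # "robot-policy:latest", train.py, and /data/results are placeholder names.
    import subprocess

    procs = []
    for seed in range(4):
        cmd = [
            "docker", "run", "--rm", "--gpus", "all",
            "-v", "/data/results:/results",  # shared results directory on the host
            "robot-policy:latest",
            "python", "train.py", f"--seed={seed}",
        ]
        procs.append(subprocess.Popen(cmd))

    for p in procs:
        p.wait()  # block until every run finishes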