Thanks for the great job! I noticed that you mentioned For the AIME and GPQA benchmarks, we report the average accuracy over 32 and 8 samples respectively (Avg@32, Avg@8) to mitigate result variance.
I can move code around but sampling math and creating samplers is far beyond my ability. I didn't write any of the original samplers: The sampler node connects to a group node. You can chain group ...
The National Assessment of Educational Progress (NAEP) is often called the nation’s report card because it is the only measure of student achievement given periodically to a sampling of students ...