Paper: TinyR1-32B-Preview: Boosting Accuracy with Branch-Merge Distillation
Authors: Lin Sun, Guangxiang Zhao, Xiaoqi Jian, Yuhan Wu, Weihong Lin, Yongfu Zhu, Change Jia, Linglin Zhang, Jinzhu Wu, Junfeng Ran, Sai-er Hu, Zihan Jiang, Junting Zhou, Wenrui Liu, Bin Cui, Tong Yang, Xiangzheng Zhang.
Presenter: Yuhan Wu, Peking University.
Q1 (Host): You mentioned using Arcee Fusion to integrate the models. What led you to this design decision? Did you try other merging methods, or was the choice based mainly on your experience?
A1: We tried many model merging algorithms, as shown in Figure 2; the other methods all scored lower than Arcee Fusion. However, we only compared them on the GPQA-Diamond benchmark, because science was the model’s weakest capability.
Q2 (Host): What motivated you to use the Branch-Merge algorithm? Intuitively it makes a lot of sense: the method essentially lets one model possess capabilities across different tracks. Is it your own invention, or had it been explored in the machine learning community before?
A2: First, this algorithm was not invented by us; it is an open-source tool. It is not even a paper, just an open-source repository. A common approach in fine-tuning is to use data mixtures that cover different domains: separate groups each search for good data and train a high-scoring model for a specific domain, and once they have found good data, they mix it together and fine-tune from scratch to obtain a strong mixed model. This process is time-consuming, and we noticed that some algorithms can merge models instead. I think model merging is not widely acknowledged in the industry. However, we found that it performs very well and is fast: you only need about four GPU hours to merge the models, while a data mixture requires reworking all the data.
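For intuition, here is a minimal sketch of what weight-space merging means, written in PyTorch. It shows plain linear interpolation of parameters between two domain-specialized checkpoints that share a base architecture; this is a deliberately simplified stand-in, not the Arcee Fusion algorithm the team actually used, and the checkpoint paths and alpha value below are hypothetical.

```python
import torch

def merge_state_dicts(sd_a, sd_b, alpha=0.5):
    """Linearly interpolate two state dicts with identical keys and shapes.

    This is plain weight averaging, a simplified illustration of the
    general idea behind merge methods; alpha controls how much of
    model A survives in the merged model.
    """
    merged = {}
    for name, tensor_a in sd_a.items():
        tensor_b = sd_b[name]
        assert tensor_a.shape == tensor_b.shape, f"shape mismatch at {name}"
        merged[name] = alpha * tensor_a + (1.0 - alpha) * tensor_b
    return merged

# Hypothetical usage: "math_model.pt" and "code_model.pt" are two
# domain-specialized checkpoints fine-tuned from the same base model.
if __name__ == "__main__":
    sd_math = torch.load("math_model.pt", map_location="cpu")
    sd_code = torch.load("code_model.pt", map_location="cpu")
    torch.save(merge_state_dicts(sd_math, sd_code, alpha=0.5),
               "merged_model.pt")
```

Because the merge is a single pass over the parameters, its cost scales with model size rather than dataset size, which is why it can finish in a few GPU hours while retraining on a remixed dataset cannot.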
Q3 (Host): You mentioned that the industry doesn’t use model merging much, so there must be a reason for that. We can clearly see that model merging consumes far fewer GPU hours than a data mixture. What is the key reason the industry hasn’t adopted model merging?
A3: I think there are two reasons. First, in the industry the final model needs to solve problems across many domains, not just three. Here we focus on three benchmarks, but the industry focuses on ten to twenty. We can merge two models, but we don’t know how the final model would perform if we merged ten different models. The second reason is the gap between model merging and data mixture: model merging requires more of the developers’ time to run experiments, and large companies care more about developers’ time than about GPU hours.
Q4 (Audience): Was there an “Aha moment” during the model merging process? If so, which domain’s data had the greatest impact on it?
A4: The most interesting moment was when we found that, after merging the three models, the test score rose from 73 to 78, which was beyond our expectations. We had only expected a score slightly above seventy.
Q5 (Host): Which domains’ data had the greatest impact on the final performance of your merged models?
A5: It’s hard to compare, but we found that if you train on math data, the coding and science scores also improve. The same holds for coding data and for science data.
Q6 (Host): Does the merged model always perform better, or is that not always the case?
A6: It’s not always the case. I think it’s an open question.