Federated Optimization of ℓ0-norm Regularized Sparse Learning
Round 1
Reviewer 1 Report
This paper extends ℓ0-norm regularized sparse learning to distributed scenarios and applies federated learning to reduce communication costs. The work is interesting, with solid theoretical analysis. However, the reviewer has the following questions and suggestions.
1. Please polish the writing throughout the whole paper carefully. For example, it should be "Theoretical analysis of IHT currently assumes that ..." in the third line of the abstract.
2. The motivation and advantages of sparsity-constrained statistical models should be presented more clearly.
3. The related work should include the literature on federated learning with sparse gradients, e.g., "Adaptive Gradient Sparsification for Efficient Federated Learning: An Online Learning Approach", "Deep gradient compression: Reducing the communication bandwidth for distributed training", and "A distributed synchronous SGD algorithm with global top-k sparsification for low bandwidth networks". The authors should highlight the differences between this paper and these works.
4. What is the detailed definition of \mathcal{H}_{\tau}, and how is the aggregated weight x_{t+1} computed in the algorithms based on \mathcal{H}_{\tau}? (A sketch of the standard hard-thresholding definition is given after this list for reference.)
5. How should p_i be chosen in the algorithms?
6. I recommend moving the proofs of the theorems and the corollary to the appendix to clarify the overall organization.
7. In the experiments, it is confusing that FedIter-HT performs better than Fed-HT, which in turn performs better than Distributed-HT, achieving a lower f(x) as a function of the number of communication rounds rather than of the actual amount of communicated data. Does this mean that FedIter-HT and Fed-HT achieve lower accuracy with fewer communication rounds than Distributed-HT? What are the drawbacks and costs of FedIter-HT and Fed-HT?
8. More experiments should be conducted on the performance of the proposed algorithms under varying K.
9. Only the number of communication rounds is evaluated in the experiments. What is the transmitted data size per communication round for each algorithm?
10. In Algorithm A1, \mathcal{H}_{\tau} is also applied. What is the critical difference between Algorithm A1 and the proposed Fed-HT? How is this difference clarified in the experiments?
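Regarding comment 4: in the IHT literature, \mathcal{H}_{\tau} usually denotes the hard thresholding operator that keeps the \tau largest-magnitude entries of a vector and zeroes the rest, and a common federated server step applies it to a p_i-weighted average of the clients' local iterates. The following is a minimal sketch under that assumed, standard definition; the function names, the weighted-averaging step, and the sample values are illustrative and are not taken from the paper under review.

```python
import numpy as np

def hard_threshold(x, tau):
    """Keep the tau largest-magnitude entries of x and zero the rest.

    This is the standard H_tau operator from the IHT literature
    (assumed definition, not quoted from the paper under review).
    """
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    if tau <= 0:
        return out
    top = np.argpartition(np.abs(x), -tau)[-tau:]  # indices of the tau largest |x_i|
    out[top] = x[top]
    return out

def aggregate(local_iterates, p, tau):
    """Illustrative server step: p_i-weighted average of client iterates,
    followed by hard thresholding, i.e. x_{t+1} = H_tau(sum_i p_i * x_i).
    The weights p_i are assumed nonnegative and summing to 1
    (e.g., proportional to local sample sizes, as in FedAvg-style methods).
    """
    avg = sum(p_i * x_i for p_i, x_i in zip(p, local_iterates))
    return hard_threshold(avg, tau)

if __name__ == "__main__":
    # Tiny usage example with hypothetical client iterates.
    clients = [np.array([0.9, -0.1, 0.3, 0.0]), np.array([1.1, 0.2, -0.4, 0.05])]
    p = [0.5, 0.5]
    print(aggregate(clients, p, tau=2))  # only the 2 largest-magnitude entries survive
```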
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
This work addresses sparse learning by proposing new methods based on IHT. Experimental results show the superior performance of the proposed methods. The manuscript is well organized. It can be accepted after some minor modifications.
1. On page 15, the URL https://www.csie.ntu.edu.tw/cjlin/libsvmtools/datasets/ is invalid. Please fix it.
2. Please take care of the citation formats, e.g., in [34], "IEEE transactions on neural networks and learning systems" should be "IEEE Transactions on Neural Networks and Learning Systems".
3. As sparse learning is an important research topic, it is suggested to enrich the article by citing more related techniques and their applications, such as https://link.springer.com/article/10.1007/s00521-022-07200-w and https://www.sciencedirect.com/science/article/abs/pii/S0925231215003896
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
All questions are well answered.