1. Introduction
The growth of the app market has been boosted by the maturity of smartphones, allowing users to conveniently browse and download apps from app stores (e.g., the Apple App Store and Google Play) and leave reviews, including star ratings and text-based feedback, for apps they have used. Many studies have shown that app reviews, which contain problem reports, feature requests, and other suggestions, can serve as references for the iterative design and development of apps [1,2,3]. Rapid iteration, one of the main factors in the success of an app's development [4], mainly includes bug fixing, feature modification, and the addition of new features. Hence, by analyzing user data from new versions, app designers and developers can formulate iteration strategies. Gathering the information revealed by user reviews in a timely and accurate manner can help developers maintain and update their apps and achieve effective word-of-mouth marketing [5,6].
Unfortunately, analyzing app reviews is a challenge, especially through manual methods. First, there is a great number of app reviews: for some popular apps, each version is appended with hundreds or thousands of reviews [7]. Second, the quality of reviews varies widely, and some are simply emotional evaluations (e.g., "Great!") that are of no value for app development. Third, the language of reviews is relatively informal and contains a great deal of noise, such as misspellings, casual grammatical structures, and non-English words [8]. To address these issues, many studies have been dedicated to automatically filtering out non-informative reviews [9], categorizing reviews and user requirements [10], and obtaining valuable topics from massive numbers of reviews [11] for the purpose of app maintenance and evolution.
These studies automatically extract information that is useful for developers (e.g., bug reports and feature requests). Developers can quickly reproduce and fix reported bugs, but for user requirements, whether to implement a user suggestion is not simply a subjective decision. In other words, the studies above can only extract user request topics or sentences from the reviews; they cannot indicate to what extent a request is worth implementing, or which feature requests should be implemented first. For example, the review "I wish we had the option of making our own stylized photos though" can be classified as a feature request, and the topics can be extracted as "make", "stylized", and "photo", which together express a request for "editing photos"; however, whether the request should be implemented in the next few releases is hard to say. Developers will consider questions such as "how many users proposed such a request" and "how do users perceive the present functionalities". Recently, a few studies have addressed this question. For instance, Nayebi and Ruhe [12] proposed a bi-criterion integer programming model to select optimized app functionalities based on feature value (e.g., rating) and cost (e.g., effort to implement). However, that study focuses mainly on the trade-off solution for the functionalities and neglects the requirements discussed in reviews based on user feedback.
In this paper, we focus our attention on the user requirements expressed in reviews and propose a novel approach to (1) extract requirement phrases for app functionalities; (2) calculate features, such as occurrence frequency and rating, for the requirement phrases; and (3) automatically predict which requirement phrases have a high priority to be implemented, based on the features extracted for the phrases.
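As a rough illustration of step (2), the following minimal sketch (with hypothetical review records, not the study's data or implementation) computes two of the features named above: the occurrence frequency of a requirement phrase and the mean star rating of the reviews that mention it.

```python
# Minimal sketch with hypothetical data: compute two per-phrase features,
# occurrence frequency and the mean star rating of mentioning reviews.
reviews = [
    ("wish we had the option of making our own stylized photo", 4),
    ("please support stylized photo editing", 3),
    ("app crashes on startup", 1),
]

def phrase_features(phrase, reviews):
    # Ratings of the reviews whose text mentions the phrase.
    hits = [rating for text, rating in reviews if phrase in text]
    frequency = len(hits)  # how many reviews raise the request
    mean_rating = sum(hits) / len(hits) if hits else 0.0
    return frequency, mean_rating

print(phrase_features("stylized photo", reviews))  # -> (2, 3.5)
```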
A total of 44,893 real-world reviews of six apps on the Apple App Store were collected to verify the feasibility of the approach. The results indicate that, after optimization, the optimal model can reach an average accuracy, precision, recall, F-measure, and ROC_AUC of 67.6%, 67.3%, 69.2%, 68.0%, and 71.4%, respectively. To annotate the true priority of the requirement phrases, a semi-automated method is proposed to link the requirement phrases with app changelogs; the requirement phrases successfully linked to changelogs are annotated as high priority. The app changelog (the list of changes in each release) is a short introduction presented in the App Store, written by the publisher, that describes the issues addressed and the new features in the latest version. The purpose of a changelog is to encourage users to update to and experience the new version. We conjecture that the requirement phrases mentioned in changelogs can be considered high priority requirements.
Figure 1 illustrates an example of a changelog history for Google Photos in the Apple App Store.
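As a rough sketch of this linking step, the snippet below measures the similarity between a requirement phrase and a changelog phrase using the publicly available Google News word2vec model via gensim; the 0.6 threshold and the phrases are illustrative assumptions, not the study's settings.

```python
# Minimal sketch of the changelog-linking step, assuming the public
# Google News word2vec model; threshold and phrases are illustrative.
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def phrase_similarity(phrase_a, phrase_b):
    """Cosine similarity between the mean word vectors of two phrases,
    ignoring words absent from the embedding vocabulary."""
    a = [w for w in phrase_a.lower().split() if w in kv]
    b = [w for w in phrase_b.lower().split() if w in kv]
    if not a or not b:
        return 0.0
    return kv.n_similarity(a, b)

# A requirement phrase that clears the similarity threshold against a
# changelog phrase is annotated as high priority.
if phrase_similarity("make stylized photo", "create stylized photos") > 0.6:
    print("annotate as high priority")
```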
The main contributions of this study are as follows. (1) A novel framework is implemented to automatically extract requirement phrases from app reviews and predict those with high priority based on fourteen features calculated for the phrases. (2) A novel method is proposed to semi-automatically annotate requirement phrases as high or low priority with the help of app changelogs, and its effectiveness is verified. (3) An empirical study was designed and performed to examine the effectiveness and interpretability of the novel approach for high priority requirement mining and to compare it with alternatives.
The rest of the paper is structured as follows. Section 2 introduces the related work. Section 3 explains the details of the framework of the approach. Section 4 describes the main research questions and the method of evaluation for our approach. Section 5 presents the results and discussion. Section 6 discusses the threats to validity, and Section 7 concludes the paper.
6. Threats to Validity
Threats to construct validity mainly concern the creation of the truth set, where we used a word2vec model to calculate the similarity between requirement phrases and functionality phrases manually selected by two authors. Different word2vec models and annotators could lead to bias. We alleviate this threat by applying the widely used Google News word2vec model and by specifying the manual annotation rules, including how conflicts are resolved. The annotators are not required to be experts in app development, since their most important task is to understand the changelogs and judge whether the extracted phrases are meaningful functionalities. The annotation does not involve estimating whether a request has high or low priority, which keeps it relatively objective. Finally, we double-checked the semi-manual annotation produced with the word2vec model by sampling the results and examining their accuracy.
Another threat is that the requirement phrases extracted from the reviews may be mixed with meaningless or irrelevant phrases, which can decrease the performance of the prediction task and increase the burden of result interpretation. To mitigate this threat, we applied a series of preprocessing steps to remove noise, followed by collocation and POS techniques to filter out meaningless phrases; however, the possibility remains, since we do not examine which preprocessing methods lead to the best results but instead apply general approaches.
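As an illustration of this filtering, the minimal sketch below (assuming NLTK; the kept POS patterns and cutoffs are simplifications, not the exact configuration used in the study) scores candidate bigrams by PMI and keeps only those whose POS pattern resembles a functionality phrase.

```python
# Minimal sketch of collocation extraction plus POS-based filtering;
# the kept tag patterns are illustrative simplifications.
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

text = "i wish we had the option of making our own stylized photos"
tokens = nltk.word_tokenize(text)
tag_of = dict(nltk.pos_tag(tokens))  # token -> POS tag

# Score candidate bigrams by pointwise mutual information (PMI).
finder = BigramCollocationFinder.from_words(tokens)
scored = finder.score_ngrams(BigramAssocMeasures.pmi)

# Keep only bigrams whose POS pattern resembles a functionality phrase,
# e.g., adjective/participle + noun or verb + noun.
KEEP = {("JJ", "NN"), ("JJ", "NNS"), ("VBN", "NNS"),
        ("VBG", "NNS"), ("VB", "NN")}
candidates = [bg for bg, _ in scored
              if (tag_of[bg[0]], tag_of[bg[1]]) in KEEP]
print(candidates)  # e.g., [('stylized', 'photos')]
```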
For internal validity, the method of truth set creation has a basic requirement: the apps' changelogs must cover a sufficient number of high priority requests. These requests are derived from the changelogs released by the app development teams. However, the comprehensiveness and accuracy of changelogs are unstable; often, the changelog for a release contains only one or two sentences and is not detailed enough. How the changelogs state the new functionalities therefore also seriously affects the request annotation stage. For this reason, instead of analyzing reviews version by version, we match all the collected reviews against the corresponding historical changelogs; the cost is that the requests detected may have been implemented in previous versions.
The machine learning method finally chosen is random forest (RF), since RF obtains relatively better results; however, the differences are not significant, as other models also perform well. We cannot exclude the possibility that other algorithms or hyper-parameter settings could achieve better performance on this task. Release planning is a very complex process, and the prediction results for high priority requests can only be considered a recommendation for developers.
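For concreteness, the following is a minimal sketch of this kind of tuning, assuming scikit-learn and a random search over an illustrative parameter space; X and y stand for the per-phrase feature matrix and priority labels and are not defined here.

```python
# Minimal sketch: random-search tuning of a random forest classifier.
# The parameter space, n_iter, and scoring choice are illustrative.
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={
        "n_estimators": randint(50, 500),
        "max_depth": randint(2, 20),
        "max_features": ["sqrt", "log2"],
    },
    n_iter=20,
    scoring="roc_auc",  # one of the metrics reported in the evaluation
    cv=5,
    random_state=42,
)
# search.fit(X, y)  # X: per-phrase features; y: high/low priority labels
# print(search.best_params_, search.best_score_)
```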
Threats to external validity relate to the generalization of our approach and findings. Since our approach runs successfully on different apps from different categories and platforms, it shows a degree of generality; however, whether the approach and findings can be generalized to apps in other categories (e.g., games), or to apps with insufficient reviews, was not investigated in this study. Additionally, since the RF estimator was trained for individual apps, a trained model cannot be applied directly to different apps.
Author Contributions
Conceptualization, C.Y. (Cheng Yang) and L.W.; methodology, C.Y. (Cheng Yang); software, L.W.; validation, C.Y. (Chunyang Yu) and Y.Z.; writing—original draft preparation, L.W.; writing—review and editing, L.W. and C.Y. (Chunyang Yu). All authors have read and agreed to the published version of the manuscript.
Funding
This research was supported by the National Natural Science Foundation of China (No. 62002321) and the Zhejiang Provincial Natural Science Foundation of China (No. Y18E050014).
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The data are not publicly available due to privacy and ethical restrictions.
Conflicts of Interest
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
References
- Genc-Nayebi, N.; Abran, A. A systematic literature review: Opinion mining studies from mobile app store user reviews. J. Syst. Softw. 2017, 125, 207–219.
- Xie, H.; Yang, J.; Chang, C.K.; Liu, L. A statistical analysis approach to predict user’s changing requirements for software service evolution. J. Syst. Softw. 2017, 132, 147–164.
- Jabangwe, R.; Edison, H.; Duc, A.N. Software engineering process models for mobile app development: A systematic literature review. J. Syst. Softw. 2018, 145, 98–111.
- Jha, A.K.; Lee, S.; Lee, W.J. An empirical study of configuration changes and adoption in Android apps. J. Syst. Softw. 2019, 156, 164–180.
- Palomba, F.; Linares-Vásquez, M.; Bavota, G.; Oliveto, R.; Penta, M.D.; Poshyvanyk, D.; Lucia, A.D. Crowdsourcing user reviews to support the evolution of mobile apps. J. Syst. Softw. 2018, 137, 143–162.
- Noei, E.; Zhang, F.; Wang, S.; Zou, Y. Towards prioritizing user-related issue reports of mobile applications. Empir. Softw. Eng. 2019, 24, 1964–1996.
- Pagano, D.; Maalej, W. User feedback in the Appstore: An empirical study. In Proceedings of the 2013 21st IEEE International Requirements Engineering Conference (RE), Rio de Janeiro, Brazil, 15–19 July 2013; pp. 125–134.
- Gao, C.; Zeng, J.; Lyu, M.R.; King, I. Online app review analysis for identifying emerging issues. In Proceedings of the 40th International Conference on Software Engineering, Gothenburg, Sweden, 27 May–3 June 2018; pp. 48–58.
- Chen, N.; Lin, J.; Hoi, S.C.; Xiao, X.; Zhang, B. AR-miner: Mining informative reviews for developers from mobile app marketplace. In Proceedings of the 36th International Conference on Software Engineering, Hyderabad, India, 31 May–7 June 2014; pp. 767–778.
- Li, C.; Huang, L.; Ge, J.; Luo, B.; Ng, V. Automatically classifying user requests in crowdsourcing requirements engineering. J. Syst. Softw. 2018, 138, 108–123.
- Suprayogi, E.; Budi, I.; Mahendra, R. Information extraction for mobile application user review. In Proceedings of the 2018 International Conference on Advanced Computer Science and Information Systems (ICACSIS), Yogyakarta, Indonesia, 27–28 October 2018; pp. 343–348.
- Nayebi, M.; Ruhe, G. Optimized functionality for super mobile apps. In Proceedings of the 2017 IEEE 25th International Requirements Engineering Conference (RE), Lisbon, Portugal, 4–8 September 2017; pp. 388–393.
- Gu, X.; Kim, S. What parts of your apps are loved by users? In Proceedings of the 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), Lincoln, NE, USA, 9–13 November 2015; pp. 760–770.
- Scalabrino, S.; Bavota, G.; Russo, B.; Penta, M.D.; Oliveto, R. Listening to the crowd for the release planning of mobile Apps. IEEE Trans. Softw. Eng. 2019, 45, 68–86.
- Panichella, S.; Di Sorbo, A.; Guzman, E.; Visaggio, C.A.; Canfora, G.; Gall, H.C. How can I improve my App? Classifying user reviews for software maintenance and evolution. In Proceedings of the 2015 IEEE International Conference on Software Maintenance and Evolution (ICSME), Bremen, Germany, 29 September–1 October 2015; pp. 281–290.
- Maalej, W.; Kurtanović, Z.; Nabil, H.; Stanik, C. On the automatic classification of App reviews. Requir. Eng. 2016, 21, 311–331.
- McIlroy, S.; Ali, N.; Khalid, H.; Hassan, A.E. Analyzing and automatically labelling the types of user issues that are raised in mobile app reviews. Empir. Softw. Eng. 2016, 21, 1067–1106.
- Guzman, E.; El-Haliby, M.; Bruegge, B. Ensemble methods for App review classification: An approach for software evolution. In Proceedings of the 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), Lincoln, NE, USA, 9–13 November 2015; pp. 771–776.
- Jha, N.; Mahmoud, A. Mining non-functional requirements from App Store reviews. Empir. Softw. Eng. 2019, 24, 3659–3695.
- Cambria, E.; Schuller, B.; Xia, Y.; Havasi, C. New avenues in opinion mining and sentiment analysis. IEEE Intell. Syst. 2013, 28, 15–21.
- Ranjan, S.; Mishra, S. Comparative sentiment analysis of App reviews. In Proceedings of the 2020 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT), Kharagpur, India, 1–3 July 2020; pp. 1–7.
- Nayebi, M.; Ruhe, G. Asymmetric release planning: Compromising satisfaction against dissatisfaction. IEEE Trans. Softw. Eng. 2019, 45, 839–857.
- Jo, Y.; Oh, A.H. Aspect and sentiment unification model for online review analysis. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, Hong Kong, China, 9–12 February 2011; pp. 815–824.
- Guzman, E.; Maalej, W. How do users like this feature? A fine grained sentiment analysis of app reviews. In Proceedings of the 2014 IEEE 22nd International Requirements Engineering Conference (RE), Karlskrona, Sweden, 25–29 August 2014; pp. 153–162.
- Shuyo, N. Language Detection Library for JAVA. Available online: https://github.com/shuyo/language-detection (accessed on 19 April 2021).
- Palomba, F.; Salza, P.; Ciurumelea, A.; Panichella, S.; Gall, H.; Ferrucci, F.; De Lucia, A. Recommending and localizing change requests for mobile apps based on user reviews. In Proceedings of the 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE), Buenos Aires, Argentina, 20–28 May 2017; pp. 106–117.
- Sarro, F.; Al-Subaihin, A.A.; Harman, M.; Jia, Y.; Martin, W.; Zhang, Y. Feature lifecycles as they spread, migrate, remain, and die in app stores. In Proceedings of the 2015 IEEE 23rd International Requirements Engineering Conference (RE), Ottawa, ON, Canada, 24–28 August 2015; pp. 76–85.
- Banerjee, S.; Bhattacharyya, S.; Bose, I. Whose online reviews to trust? Understanding reviewer trustworthiness and its impact on business. Decis. Support Syst. 2017, 96, 17–26.
- Zhang, J.; Wang, Y.; Xie, T. Software feature refinement prioritization based on online user review mining. Inf. Softw. Technol. 2019, 108, 30–34.
- Manning, C.; Schutze, H. Foundations of Statistical Natural Language Processing; MIT Press: Cambridge, MA, USA, 1999.
- Bird, S.; Klein, E.; Loper, E. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit; O’Reilly Media, Inc.: Newton, MA, USA, 2009.
- Cheng, W.; Greaves, C.; Warren, M. From n-gram to skipgram to concgram. Int. J. Corpus Linguist. 2006, 11, 411–433.
- Liang, T.P.; Li, X.; Yang, C.T.; Wang, M. What in consumer reviews affects the sales of mobile apps: A multifacet sentiment analysis approach. Int. J. Electron. Commer. 2015, 20, 236–260.
- Chong, A.Y.L.; Ch’ng, E.; Liu, M.J.; Li, B. Predicting consumer product demands via Big Data: The roles of online promotional marketing and online reviews. Int. J. Prod. Res. 2017, 55, 5142–5156.
- Bouma, G. Normalized (pointwise) mutual information in collocation extraction. In Proceedings of the Biennial GSCL Conference, Potsdam, Germany, 30 September 2009; pp. 31–40.
- Islam, A.; Inkpen, D. Semantic text similarity using corpus-based word similarity and string similarity. ACM Trans. Knowl. Discov. Data 2008, 2, 1–25.
- Rehurek, R.; Sojka, P. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, Valletta, Malta, 22 May 2010; pp. 45–50.
- Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.; Dean, J. Distributed representations of words and phrases and their compositionality. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–10 December 2013; pp. 1–9.
- Mikolov, T.; Yih, W.T.; Zweig, G. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Atlanta, GA, USA, 9–14 June 2013; pp. 746–751.
- Chawla, N.V.; Japkowicz, N.; Kotcz, A. Special issue on learning from imbalanced data sets. ACM SIGKDD Explor. Newsl. 2004, 6, 1–6.
- Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
- Maalej, W.; Nabil, H. Bug report, feature request, or simply praise? On automatically classifying app reviews. In Proceedings of the 2015 IEEE 23rd International Requirements Engineering Conference (RE), Ottawa, ON, Canada, 24–28 August 2015; pp. 116–125.
- Wang, C.; Zhang, F.; Liang, P.; Daneva, M.; van Sinderen, M. Can app changelogs improve requirements classification from app reviews? An exploratory study. In Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, Oulu, Finland, 11–12 October 2018; pp. 1–4.
- Martens, D.; Maalej, W. Towards understanding and detecting fake reviews in app stores. Empir. Softw. Eng. 2019, 24, 3316–3355.
- Baeza-Yates, R.; Ribeiro-Neto, B. Modern Information Retrieval; ACM Press: New York, NY, USA, 1999; Volume 463.
- Carreno, L.V.G.; Winbladh, K. Analysis of user comments: An approach for software requirements evolution. In Proceedings of the 2013 35th International Conference on Software Engineering (ICSE), San Francisco, CA, USA, 18–26 May 2013; pp. 582–591.
- Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. 2006, 27, 861–874.
- Ho, T.K. Random decision forests. In Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, Canada, 14–16 August 1995; Volume 1, pp. 278–282.
- Waskom, M.L. Seaborn: Statistical data visualization. J. Open Source Softw. 2021, 6, 3021.
- Mann, H.B.; Whitney, D.R. On a test of whether one of two random variables is stochastically larger than the other. Ann. Math. Stat. 1947, 18, 50–60.
- Guyon, I.; Elisseeff, A. An introduction to variable and feature selection. J. Mach. Learn. Res. 2003, 3, 1157–1182.
- Guyon, I.; Weston, J.; Barnhill, S.; Vapnik, V. Gene selection for cancer classification using support vector machines. Mach. Learn. 2002, 46, 389–422.
- Chen, X.W.; Jeong, J.C. Enhanced recursive feature elimination. In Proceedings of the 6th International Conference on Machine Learning and Applications, ICMLA 2007, Cincinnati, OH, USA, 13–15 December 2007; pp. 429–435.
- Bergstra, J.; Bengio, Y. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 2012, 13, 281–305.
- Hinton, G.E. A practical guide to training restricted Boltzmann machines. In Neural Networks: Tricks of the Trade; Springer: Berlin/Heidelberg, Germany, 2012; pp. 599–619.
- Bernard, S.; Heutte, L.; Adam, S. Influence of hyperparameters on random forest accuracy. In Proceedings of the International Workshop on Multiple Classifier Systems, Reykjavik, Iceland, 10–12 June 2009; Springer: Berlin/Heidelberg, Germany, 2009; pp. 171–180.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).