Time Series Forecasting during Software Project State Analysis
Abstract
:1. Introduction
2. Related Works
- Hosting and management of the software source code, including documentation and the various resources needed for the project;
- Allowing a team of software engineers to co-develop the software. The software repository services use one or more version control systems such as Git [8], Mercurial [9], SVN [10] or CVS [11]. These version control systems allow for branching the code, logging changes in the code, verifying the source code and analyzing it;
- Controlling software development by using task tracking functions and monitoring project metrics that reflect various characteristics of the project and the completed tasks;
- Performing project maintenance for automated testing and quality control, building and deploying the software;
- Storing project documentation as markdown files (such as read.me) and wiki pages.
- The following studies are aimed at analyzing the IT project hosting services:
- Software analysis aimed at improving its architecture and reducing the workload [12];
- Analyzing the performance impact and software development quality of high-level programming languages [13];
- Analyzing the business processes of software development management to validate the development result [16];
- Using deep learning methods to generate source code [17];
- Analyzing the source code to find various defects [18];
- Managing development based on git repository analysis [19].
3. Methodology
- Number of commits: A commit is a small group of meaningful changes in the project, i.e., changed lines of source code, downloaded binaries, etc. This number allows for the estimation of the speed and volume of changes made during the project development;
- The number of commits by a specific author allows for the evaluation of the contribution of a specific developer to the development of a project.
- The number of project dependencies is the number of third-party software components and libraries used. This value allows for the evaluation of the complexity of the project;
- The number of project entity classes. Entity classes are the models of the objects of the problem area in the program system;
- The number of project files allows for the evaluation of the complexity of the project;
- The number of project classes allows for the evaluation of the complexity of the project;
- The number of project interfaces allows for the evaluation of the complexity of the project integrations;
- The number of business processes described in the project classes. The business process reflects the process of data management and transformation.
- A software that can be characterized by an expert throughout the entire development period is chosen;
- We generate time series for the entire set of TS indicators;
- Every time series is smoothed, and the trends are grouped by their direction. This means that if a trend for the first time period was to “shrink”, and the next trend is also to “shrink”, it is considered that the trend from the start of the first time period to the end of the next time period was to “shrink”. The intensity of the trend is not considered currently. The previously completed experiments demonstrated that a rough grouping of the trends allows for the simplifying of the rule base establishment process and makes the base less unwieldy without losing any valuable data from the software projects. ;
- Equally long time periods are extracted from the time series sets grouped by their trends. , and the appropriate time series are formed;
- The expert forms a text description to characterize the project at its current stage. This is carried out for every time period;
- We create a rule according to equation 1 for each pair of time series with a matching time period. For example, if two time series ( and ) have an equal time interval with growing tendencies then we can construct the rule in the following form: . For our approach, we define three types of tendencies: grow, shrink and stable. These approaches roughly approximate the tendencies of the development process indicator but reduce the size of the rule base.
- Forming a set of fuzzy rules based on the expert characterization of an existing project (see also: forming rule base);
- Extracting a set of time series related to the project being analyzed. These time series belong to the set;
- Extracting time series points from the set;
- Forecasting the trends. A third-party program service performs this operation;
- Forecasted and fuzzified trends are used to generate a logical output based on the fuzzy rule base established earlier. The Mamdani Fuzzy Model is used for their generation;
- The result of the source code repository evaluation is a set of expert interments, listed as rule consequents whenever these rules are active.
4. Creating a Training Sample
- Classic exponential smoothing models
- ∘
- w/o trend and seasonality;
- ∘
- w/o trend, with additive seasonality;
- ∘
- w/o trend, with multiplicative seasonality;
- ∘
- additive trend, w/o seasonality;
- ∘
- multiplicative trend, w/o seasonality;
- ∘
- dumping additive trend, w/o seasonality;
- ∘
- dumping multiplicative trend, w/o seasonality;
- ∘
- additive trend with additive seasonality;
- ∘
- additive trend with multiplicative seasonality;
- ∘
- dumping additive trend with additive seasonality;
- ∘
- additive trend with multiplicative seasonality;
- ∘
- dumping multiplicative trend with additive seasonality;
- ∘
- multiplicative trend with additive seasonality;
- ∘
- multiplicative trend with multiplicative seasonality;
- ∘
- dumping additive trend with multiplicative seasonality;
- ∘
- dumping multiplicative trend with multiplicative seasonality.
- Fuzzy models
- ∘
- Fuzzy model Direct Set Assignment;
- ∘
- Fuzzy model following [22], no trend, no seasonality;
- ∘
- Fuzzy model following [22], additive trend, no seasonality;
- ∘
- Fuzzy model following [22], no trend, additive seasonality;
- ∘
- Fuzzy model following [22], additive trend, additive seasonality;
- ∘
- Fuzzy model following [23], no trend, no seasonality;
- ∘
- Fuzzy model following [23], no trend, additive seasonality;
- ∘
- Fuzzy model following [23], no trend, multiplicative seasonality;
- ∘
- Fuzzy model following [23], additive trend, no seasonality;
- ∘
- Fuzzy model following [23], additive trend, additive seasonality;
- ∘
- Fuzzy model following [23], additive trend, multiplicative seasonality;
- ∘
- Fuzzy model following [23], multiplicative trend, no seasonality;
- ∘
- Fuzzy model following [23], multiplicative trend, additive seasonality;
- ∘
- Fuzzy model following [23], multiplicative trend, multiplicative seasonality.
- Normalizing data. It was decided to scale the time series within the [0.5; 1.5] range. This allows the characteristics of the time series with varying structures to have the same ranges. The range was moved from the default [0; 1] to allow for multiplicative forecast model usage. This scaling also makes the minimum and maximum values insignificant;
- Calculating the time series characteristics. For every time series, its characteristics are calculated. These data are the input for the neural network;
- Time series forecasting. All described methods make a forecast for every time series. We evaluate the forecast accuracy based on the test data set. In this work, SMAPE was used. We rank the forecast methods by their accuracy score. These data are the neural network output data set.
5. Experimental Setup and Results
- 28 April 2018–22 November 2018: Project start phase. Students include libraries in the project and create the basis for further development;
- 22 November 2018–23 March 2019: Collective development start phase. There was an active change in the project but students duplicate some code due to lack of knowledge of the project features;
- 23 March 2019–18 April 2019: Active development phase. Students were required to implement all functional requirements and present a demo of their part of the work;
- 18 April 2019–2 November 2020: Phase of correcting defects, removing duplicate code and optimization.
- Determining the complexity of the project based on its commit history. This action is required at the start of the analysis to select a project with a similar history;
- Determining the project condition based on its development stage. The number of opened/closed tasks, the number of developers and the number of developers making commits are determined;
- Determining the structural complexity of the project based on its file analysis (such as number of classes, their types, etc.);
- Determining the actuality of the project based on code change history, task activity and the number of marks the repository received from other users.
6. Conclusions
- -
- Numeric metrics for the source code repositories and IT project hosting services, that are used for analyzing the change dynamic, were determined;
- -
- The algorithms for selecting and extracting time series were developed;
- -
- An algorithm for forming a fuzzy rule base based on marked data was developed;
- -
- An algorithm for evaluating the software repositories was developed.
- Time series might not have enough length to calculate the trends and create a forecast, or some of the time series might not be able to be built for repositories with a short history. This leads to an impossibility for the repository analysis;
- The rules will not reflect the actual values for time series trends if an expert has created a rule base for a project that does not match the trends of the current project;
- There is a large amount of workload for the expert, and prohibitive time costs for large projects with a long history. The analysis time is based on analyzing the source code of every commit in the project history. This also leads to the time series becoming increasingly large in length, forcing the expert to evaluate each time interval separately.
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- GitHub. Available online: https://github.com (accessed on 26 October 2023).
- GitLab. Available online: https://gitlab.com (accessed on 26 October 2023).
- Bitbucket. Available online: https://bitbucket.org (accessed on 26 October 2023).
- Rating of Repository Services for Storing Code. Available online: https://tagline.ru/source-code-repository-rating/2016 (accessed on 26 October 2023).
- GitHub Repository to Learn Data Science. Available online: https://levelup.gitconnected.com/top-10-github-repository-to-learn-data-science-892935bcebdb (accessed on 26 October 2023).
- Repository for Research Work at Bauman MSTU. Available online: https://github.com/iu5git/Science (accessed on 26 October 2023).
- Neural-Style. Available online: https://github.com/jcjohnson/neural-style (accessed on 26 October 2023).
- Git. Available online: https://git-scm.com (accessed on 26 October 2023).
- Mercurial. Available online: https://www.mercurial-scm.org (accessed on 26 October 2023).
- Subversion. Available online: https://subversion.apache.org (accessed on 26 October 2023).
- CVS. Available online: https://cvs.nongnu.org (accessed on 26 October 2023).
- Filippov, A.; Romanov, A.; Skalkin, A.; Stroeva, J.; Yarushkina, N. Approach to Formalizing Software Projects for Solving Design Automation and Project Management Tasks. Software 2023, 2, 133–162. [Google Scholar] [CrossRef]
- Muna, A. Assessing programming language impact on software development productivity based on mining oss repositories. ACM SIGSOFT Softw. Eng. Notes 2022, 44, 36–38. [Google Scholar] [CrossRef]
- Abuhamad, M.; Rhim, J.S.; AbuHmed, T.; Ullah, S.; Kang, S.; Nyang, D. Code authorship identification using convolutional neural networks. Future Gener. Comput. Syst. 2019, 95, 104–115. [Google Scholar] [CrossRef]
- Zhang, Y.; Wang, T. CCEyes: An Effective Tool for Code Clone Detec-tion on Large-Scale Open Source Repositories. In Proceedings of the 2021 IEEE International Conference on Information Communication and Software Engineering (ICICSE), Chengdu, China, 19–21 March 2021; pp. 61–70. [Google Scholar]
- Heinze, T.S.; Stefanko, V.; Amme, W. Mining BPMN Processes on GitHub for tool validation and development. In Enterprise, Business-Process and Information Systems Modeling: 21st International Conference, BPMDS 2020, 25th International Conference, EMMSAD 2020, Held at CAiSE 2020, Grenoble, France, 8–9 June 2020; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; Proceedings 21; pp. 193–208. [Google Scholar]
- Le, T.H.M.; Chen, H.; Babar, M.A. Deep learning for source code modeling and generation: Models, applications, and challenges. ACM Comput. Surv. (CSUR) 2020, 53, 1–38. [Google Scholar] [CrossRef]
- Thota, M.K.; Shajin, F.H.; Rajesh, P. Survey on software defect prediction techniques. Int. J. Appl. Sci. Eng. 2020, 17, 331–344. [Google Scholar]
- Arndt, N.; Martin, M. Decentralized collaborative knowledge management using git. In Proceedings of the Companion Proceedings of The 2019 World Wide Web Conference, San Francisco, CA, USA, 13–17 May 2019; pp. 952–953. [Google Scholar]
- Scrum. Available online: https://www.scrum.org/learning-series/what-is-scrum (accessed on 26 October 2023).
- Manifest Agile. Available online: http://agilemanifesto.org/iso/ru/manifesto.html (accessed on 26 October 2023).
- Ge, P.; Wang, J.; Ren, P.; Gao, H.; Luo, Y. A new improved forecasting method integrated fuzzy time series with exponential smoothing method. Int. J. Environ. Pollut. 2013, 51, 206–221. [Google Scholar] [CrossRef]
- Viertl, R. Statistical Methods for Fuzzy Data; John Wiley & Sons, Ltd.: Hoboken, NJ, USA, 2011; 270p. [Google Scholar]
- CIF Dataset. Available online: https://irafm.osu.cz/cif2015/main.php (accessed on 26 October 2023).
- Romanov, A.A.; Filippov, A.A.; Voronina, V.V.; Guskov, G.; Yarushkina, N.G. Modeling the Context of the Problem Domain of Time Series with Type-2 Fuzzy Sets. Mathematics 2021, 9, 2947. [Google Scholar] [CrossRef]
ARIMA | AutoARIMA | StatsForecastAutoARIMA |
---|---|---|
ExponentialSmoothing | StatsForecastAutoCES | Theta |
FourTheta | StatsForecastAutoTheta | FFT |
KalmanForecaster | Croston | RandomForest |
RegressionModel | LinearRegressionModel | LightGBMModel |
CatBoostModel | XGBModel | RNNModel |
BlockRNNModel | NBEATSModel | NHiTSModel |
TCNModel | TransformerModel | TFTModel |
DLinearModel | NLinearModel | - |
Method | Number of Predicted Time Series |
---|---|
FFT | 27 |
MultTrendMultSeasonality | 12 |
MultTrendAddSeasonality | 10 |
AddTrendMultSeasonality | 5 |
RNNModel (LSTM) | 5 |
RNNModel (Vanilla RNN) | 3 |
NBEATSModel | 2 |
TransformerModel | 2 |
ARIMA | 1 |
CatBoostModel | 1 |
DATrendAddSeasonality | 1 |
DATrendMultSeasonality | 1 |
DLinearModel | 1 |
Fuzzy (Add,Add) | 1 |
FuzzyWithSets (Add,Add) | 1 |
FuzzyWithSets (Add,Mult) | 1 |
LightGBMModel | 1 |
NoTrendMultSeasonality | 1 |
TFTModel | 1 |
Repository Link | Repository Characteristic |
---|---|
https://github.com/killjoy1221/TabbyChat-2 (accessed on 26 October 2023) | An open repository with many developers and small number of commits. Used for testing models. |
https://github.com/helix-editor/helix (accessed on 26 October 2023) | An open repository with many developers. Used for testing models. |
https://github.com/apache/commons-lang (accessed on 26 October 2023) | An open repository with many developers and large number of commits. Used for testing models. |
https://git.athene.tech/romanov73/spring-mvc-example (accessed on 26 October 2023) | A repository with a small number of commits, one developer and a known development process. Used for testing models. |
https://git.athene.tech/romanov73/ng-tracker (accessed on 26 October 2023) | A repository with a known development process. Used for training models. |
https://git.athene.tech/romanov73/git-extractor (accessed on 26 October 2023) | A repository with a known development process. Used for testing models. |
Criteria | Trend | Criteria | Trend | Inference | |||
---|---|---|---|---|---|---|---|
If | Task time series | grow | and | Branch time series | grow | then | Some level of activity, likely starting a project |
If | Task time series | grow | and | Star time series | grow | then | Some level of activity, likely starting a project |
If | Branch time series | grow | and | Star time series | grow | then | Some level of activity, likely starting a project |
If | Branch time series | shrink | and | Star time series | shrink | then | Period of quick fixes |
If | Entity time series | grow | and | Commit time series | grow | then | Testing technology, creating the MVP of an app. |
If | Entity time series | shrink | and | Commit time series | shrink | then | User education |
If | Entity time series | grow | and | Commit time series | grow | then | Active work of several users |
If | Entity time series | shrink | and | Commit time series | shrink | then | Active task completion |
If | Branch time series | stable | and | Star time series | stable | then | Small fixes |
If | Branch time series | stable | and | Task time series | stable | then | Small fixes |
If | Star time series | stable | and | Task time series | stable | then | Small fixes |
Repository Link | Number of Commits | Time for Selecting the Time Intervals | Number of Selected Time Intervals |
---|---|---|---|
https://github.com/killjoy1221/TabbyChat-2 (accessed on 26 October 2023) | 323 | 120 | 2 |
https://github.com/helix-editor/helix (accessed on 26 October 2023) | 4648 | 1963 | 2 |
https://github.com/apache/commons-lang (accessed on 26 October 2023) | 7184 | 4020 | 5 |
https://git.athene.tech/romanov73/spring-mvc-example (accessed on 26 October 2023) | 3 | 10 | 2 |
https://git.athene.tech/romanov73/ng-tracker (accessed on 26 October 2023) | 920 | 20 | 4 |
https://git.athene.tech/romanov73/git-extractor (accessed on 26 October 2023) | 298 | 60 | 7 |
Repository Link | Number of the Operation (Manual/Suggested Way), Seconds | |||
---|---|---|---|---|
1 | 2 | 3 | 4 | |
https://github.com/killjoy1221/TabbyChat-2 (accessed on 26 October 2023) | 5/10 | 10/14 | 25/14 | 20/14 |
https://github.com/helix-editor/helix (accessed on 26 October 2023) | 5/10 | 175/15 | 310/15 | 20/15 |
https://github.com/apache/commons-lang (accessed on 26 October 2023) | 5/10 | 125/15 | 478/15 | 20/15 |
https://git.athene.tech/romanov73/spring-mvc-example (accessed on 26 October 2023) | 5/10 | 5/12 | 5/12 | 20/12 |
https://git.athene.tech/romanov73/ng-tracker (accessed on 26 October 2023) | 5/10 | 45/16 | 55/16 | 20/16 |
https://git.athene.tech/romanov73/git-extractor (accessed on 26 October 2023) | 5/10 | 25/16 | 20/16 | 20/16 |
Total | 5 | 64.1 | 148.8 | 20 |
/ | / | / | / | / |
Average | 10 | 14.6 | 14.6 | 14.6 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Romanov, A.; Yarushkina, N.; Filippov, A.; Sergeev, P.; Andreev, I.; Kiselev, S. Time Series Forecasting during Software Project State Analysis. Mathematics 2024, 12, 47. https://doi.org/10.3390/math12010047
Romanov A, Yarushkina N, Filippov A, Sergeev P, Andreev I, Kiselev S. Time Series Forecasting during Software Project State Analysis. Mathematics. 2024; 12(1):47. https://doi.org/10.3390/math12010047
Chicago/Turabian StyleRomanov, Anton, Nadezhda Yarushkina, Alexey Filippov, Pavel Sergeev, Ilya Andreev, and Sergey Kiselev. 2024. "Time Series Forecasting during Software Project State Analysis" Mathematics 12, no. 1: 47. https://doi.org/10.3390/math12010047
APA StyleRomanov, A., Yarushkina, N., Filippov, A., Sergeev, P., Andreev, I., & Kiselev, S. (2024). Time Series Forecasting during Software Project State Analysis. Mathematics, 12(1), 47. https://doi.org/10.3390/math12010047