The Scalable Detection and Resolution of Data Clumps Using a Modular Pipeline with ChatGPT

Baumgartner, Nils; Iyenghar, Padma; Schoemaker, Timo; Pulvermüller, Elke

doi:10.3390/software4010003

Open Access Article

¹ Research Group Software Engineering, Institute of Computer Science, Department of Mathematics and Computer Science, University of Osnabrück, 49074 Osnabrück, Germany

² Innotec GmbH, Hornbergstrasse 45, 70794 Filderstadt, Germany

* Author to whom correspondence should be addressed.

Software 2025, 4(1), 3; https://doi.org/10.3390/software4010003

Submission received: 13 December 2024 / Revised: 27 January 2025 / Accepted: 31 January 2025 / Published: 2 February 2025

(This article belongs to the Topic Applications of NLP, AI, and ML in Software Engineering)

Abstract

This paper explores a modular pipeline architecture that integrates ChatGPT, a Large Language Model (LLM), to automate the detection and refactoring of data clumps, a prevalent type of code smell that complicates software maintainability. Data clumps are groups of variables, such as fields or method parameters, that repeatedly appear together and should ideally be refactored to improve code quality. The pipeline leverages ChatGPT’s ability to understand context and generate structured outputs, making it suitable for complex refactoring tasks. Through systematic experimentation, our study addresses the outlined research questions and demonstrates that the pipeline can accurately identify data clumps, performing particularly well in cases that require semantic understanding, where localized clumps are embedded within larger codebases. While the solution significantly enhances the refactoring workflow and facilitates the management of clumps distributed across multiple files, it also presents challenges such as occasional compiler errors and high computational costs. Feedback from developers underscores the usefulness of LLMs in software development but also highlights the essential role of human oversight in correcting inaccuracies. These findings demonstrate the pipeline’s potential to offer a scalable and efficient solution for addressing code smells in real-world projects, contributing to the broader goal of enhancing software maintainability in large-scale projects.

Keywords: data clumps; modular pipeline; large language models; ChatGPT; refactoring; scalability; software engineering
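To make the notion of a data clump concrete, the following is a minimal illustrative sketch and not the pipeline described in the article. It shows three parameters that travel together through two functions, one possible refactoring into a single class, and a naive name-based check for shared parameters. All names (`Point3D`, `shared_parameters`, the example functions) and the threshold of three shared parameters are assumptions made for illustration only.

```python
from dataclasses import dataclass
from inspect import signature
from itertools import combinations


# Before: the parameters x, y, z appear together in several signatures.
def distance_to_origin(x, y, z):
    return (x * x + y * y + z * z) ** 0.5


def translate(x, y, z, dx):
    return (x + dx, y, z)


# After: the clump is extracted into one class (hypothetical name).
@dataclass(frozen=True)
class Point3D:
    x: float
    y: float
    z: float

    def distance_to_origin(self):
        return (self.x ** 2 + self.y ** 2 + self.z ** 2) ** 0.5

    def translate(self, dx):
        return Point3D(self.x + dx, self.y, self.z)


def shared_parameters(functions, min_size=3):
    """Naive check: for each pair of functions, report the parameter
    names they have in common if there are at least min_size of them.
    The threshold is an assumption, not a value taken from the article."""
    clumps = {}
    for f, g in combinations(functions, 2):
        common = set(signature(f).parameters) & set(signature(g).parameters)
        if len(common) >= min_size:
            clumps[(f.__name__, g.__name__)] = sorted(common)
    return clumps
```

Calling `shared_parameters([distance_to_origin, translate])` flags the pair because both signatures contain `x`, `y`, and `z`. A purely name-based comparison like this cannot judge whether the shared variables belong together semantically, which is the kind of case where the abstract reports that the LLM-based approach is most useful.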
