1. Introduction
Significant advancements in artificial intelligence (AI) and human–computer interaction have been achieved, particularly with the development of large language models (LLMs). AI, essentially the replication of human intelligence by computer systems, has led to the development of various chatbots. The two chatbots used in this study are ChatGPT (
https://openai.com/blog/chatgpt) (accessed 2 April 2024) and Gemini (
https://gemini.google.com/app) (accessed 3 April 2024), which exhibit remarkable capabilities in processing and generating human-like text across diverse domains [
1,
2]. ChatGPT is based on the Generative Pre-Trained Transformer (GPT) architecture, including GPT-3 and GPT-4. In contrast, Gemini is built on Google’s Pathways Language Model 2 (PaLM 2), a sophisticated model that supports both language and multimodal (image and text) tasks. Both chatbots showcase remarkable proficiency in natural language processing (NLP), encompassing text generation, translation, and other language-related tasks; they rely on complex machine learning models, specifically the transformer architecture, and are trained on a diverse range of internet texts, including books, scholarly articles, and websites. OpenAI’s ChatGPT applications span domains including language translation, text summarization, and creative writing assistance, making it an adaptable and scalable tool for researchers exploring complex topics and generating insights from textual data. Meanwhile, Google’s Gemini, a newer entrant, offers broader capabilities as a multimodal AI, integrating language and image processing [
3]. Continuously evolving algorithms ensure that both models improve based on user interactions, providing up-to-date and accurate information. Their interfaces are designed for efficiency and precision, catering to users in need of swift and reliable information. Recently, these chatbots have had significant implications across various research domains [
4].
The application of “ChatGPT” is more prominent in computer science, medicine, and the social sciences, constituting a substantial share [
5]. An analysis of the trend in total publications shows an impressive growth rate and highlights the importance of these models in the research field. Similarly, an examination of research articles emphasizing the applications of “Gemini” was conducted using the Scopus database, revealing only 18 published articles, as the model was only recently released; most of this research has been conducted in medicine, the social sciences, and neuroscience [
5]. Saikat et al. conducted a comparative analysis between Google Bard and ChatGPT, focusing on their respective capabilities, strengths, and limitations [
6]. They emphasize how these chatbots differ from traditional search engines like Microsoft Bing in terms of response times, accuracy, and relevance [
6]. Furthermore, Gonzalez et al. examined the potential of large language models as educational tools in advancing knowledge. Their study highlighted the strengths and weaknesses of different AI models in handling IDP-related content [
7]. It is therefore important to examine additional topics to further analyze the performance of these chatbots.
Numerous studies have been conducted to analyze the application of chatbots across various fields. However, a comprehensive examination detailing the performance of AI chatbots and their understanding of biophysical phenomena remains notably absent in the existing literature. Understanding the underlying principles governing liquid–liquid phase separation (LLPS) is essential for deciphering the complex mechanisms guiding cellular behavior. LLPS plays a fundamental role in cellular organization and function and involves the spontaneous formation of distinct liquid phases within the cell, important for processes such as gene expression regulation, signal transduction, and stress response [
8,
9]. Studying biophysical phenomena like LLPS in the literature is challenging due to the complexity of the molecular interactions, the interdisciplinary nature of the field, and the rapid evolution of research findings. The integration of AI and big data analysis into LLPS research can enhance information retrieval and facilitate informed decision making. AI chatbots can aid by summarizing vast amounts of literature, offering explanations of complex concepts, and providing references to recent studies, thereby making the research process more efficient and accessible. The unique features of these models position them as compelling subjects for comparative analysis within the rapidly evolving AI landscape.
In this study, we conducted a comprehensive comparison of two widely used chatbots, ChatGPT and Gemini, exploring their applications, performance, and capabilities. Our analysis specifically targets chat-based answer bots, with a particular emphasis on their ability to handle inquiries related to biological phenomena such as LLPS. A crucial aspect of our study is the in-depth analysis of performance metrics conducted on these chatbots. Our objective was to assess and compare the accuracy and consistency of the responses generated by the premium versions of ChatGPT4 and Gemini to queries concerning LLPS, including its formation, molecular mechanisms, implications, therapeutic interventions, and its role in drug discovery. We submitted a series of queries spanning five categories and evaluated the responses generated by the premium paid versions of ChatGPT and Gemini, emphasizing how these chatbots differ from one another in generating responses. While ChatGPT generally proves reliable, it may occasionally yield inaccurate or irrelevant responses, particularly on highly specialized or niche topics, owing to its responses being confined to the scope of its training data. In contrast, Gemini tends to provide more data-driven and less conversational responses. As both models continue to evolve, they promise even more sophisticated applications in the future [
10,
11]. By subjecting these LLMs to a diverse array of queries ranging from basic principles to complex mechanisms of LLPS, we aim to evaluate their capability to deliver precise explanations, highlighting their strengths, limitations, and potential insights into these biological phenomena.
2. Materials and Methods
A query set comprising 30 queries was employed to evaluate the performance of ChatGPT4 and Gemini in responding to LLPS-related inquiries. The selection and evaluation of these questions were aimed at assessing how each chatbot handles specific aspects of LLPS.
Table 1 shows the 30 queries submitted to both chatbots. Details on the 30 queries used in this study, along with their responses, are provided in the
Supplementary Materials. Our comprehensive set of questions was designed to cover diverse aspects of LLPS, encompassing its functions, common misconceptions, challenges, and roles within biological systems. These queries were divided into the following five categories: fundamentals and principles of LLPS, biological and functional implications of LLPS, mechanisms and modulation of LLPS, research techniques and challenges in LLPS, and computational modeling of LLPS and drug discovery. To ensure the chatbots had not been previously exposed to our queries or conditioned by prior interactions from the account, we established a new account to avoid such potential biases. Each interaction followed a standardized format, and responses were solely based on the initial output generated by the newly created account. The queries ranged from basic to intermediate to advanced complexity levels. Additionally, to maintain consistency and rigor, we utilized a systematic approach throughout our engagement with the large language models (LLMs). All answers were provided in real time, and we analyzed whether they were correct.
Each score was assigned by consensus. We assigned scores ranging from 1 to 5 to reflect the accuracy of each answer generated. To ensure objectivity, we conducted a blind review process, whereby the reviewers were not informed which chatbot had generated each response. The specific scoring methodology is detailed as follows:
Score 5: Extremely accurate—the AI’s response is spot on and consistent with all current biophysical knowledge and best practices;
Score 4: Reliable—the AI’s response is largely accurate, with only minor inconsistencies;
Score 3: Roughly correct;
Score 2: Absence of data analysis;
Score 1: Wrong—the AI’s response needs to be corrected.
We also conducted a comparative analysis of the referencing and citation capabilities of both AI chatbots in response to specific inquiries (
Table S2 in the Supplementary Materials). Our evaluation involved generating references using both chatbots and verifying their accuracy. We documented the details of the sources and any recommendations provided by the chatbots to ensure the reliability of the references.
Comparative Assessment of Chatbot Responses: Analyzing Word Frequency, Response Time, Length, and Similarity
We examined the word frequency of the responses generated by each chatbot to check interpretative consistency. The number of words generated by each chatbot and the total time taken to generate each response were determined. A follow-up question was posed to each chatbot asking it to report its response length, and a manual count was also performed to correct any erroneous counts. The response time was measured using a stopwatch. Based on these measurements, we plotted comparisons for ChatGPT and Gemini. To quantify the similarity between the responses of the two chatbots, we employed the term frequency–inverse document frequency (TF-IDF) method for vectorization, followed by a cosine similarity measurement to quantify the degree of overlap. The procedure was implemented using Python’s scikit-learn library. The text data were first transformed into a numerical format using the TF-IDF vectorization method. After vectorization, the similarity between the text vectors was computed using the cosine similarity metric. Mathematically, it measures the cosine of the angle between two vectors projected in a multidimensional space [
12]. The cosine similarity is particularly useful in positive space, where the outcome is neatly bounded in [0, 1]. The cosine similarity index is defined as follows:

$$\mathrm{CSI} = \cos\theta = \frac{\mathbf{A} \cdot \mathbf{B}}{\lVert \mathbf{A} \rVert \, \lVert \mathbf{B} \rVert},$$

where $\mathbf{A}$ and $\mathbf{B}$ are the TF-IDF vectors for which the similarity is being measured.
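A minimal sketch of this procedure, assuming scikit-learn’s TfidfVectorizer and cosine_similarity utilities, is shown below; the two example strings are placeholders for full chatbot responses and are not taken from the study.

```python
# Minimal sketch (not the authors' original script) of the TF-IDF vectorization
# and cosine similarity measurement described above, using scikit-learn.
# The two strings below are placeholders for a pair of full chatbot responses.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

response_chatgpt = "LLPS drives the formation of biomolecular condensates within cells ..."
response_gemini = "Liquid-liquid phase separation organizes proteins into dynamic condensates ..."

# Fit a shared TF-IDF space over both responses and transform them into vectors.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform([response_chatgpt, response_gemini])

# Cosine of the angle between the two TF-IDF vectors; non-negative weights
# keep the result bounded in [0, 1].
csi = cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])[0, 0]
print(f"Cosine similarity index: {csi:.3f}")
```

In practice, each pair of responses to the same query would be vectorized and compared in this way, yielding one CSI value per query.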
3. Results and Discussion
This comparative analysis marks the first systematic assessment of ChatGPT4 and Gemini in understanding the biophysical phenomenon of liquid–liquid phase separation (LLPS). We initiated our study by deploying a set of structured introductory queries (
Table 1), designed to evaluate how effectively each AI model handles queries related to LLPS. Details of the 30 queries characterized under five different categories used in this study, along with their responses, are provided in the
Supplementary Materials. To qualitatively evaluate the performance of ChatGPT4 and Gemini, we generated several key metrics. We first evaluated the accuracy of each response using the accuracy scoring scheme described in
Section 2.
Figure 1 shows a heat map of the accuracy score across 30 queries submitted to both AI chatbots.
Figure 2 presents a histogram detailing the frequency distribution of the words in the responses generated by ChatGPT and Gemini. For the quantitative assessment, we measured the response times and lengths.
Figure 3a shows the time taken by each AI model to deliver its response, whereas
Figure 3b displays the word counts of the responses from both models.
Figure 4 illustrates a cosine similarity index (CSI) that compares the concordance between the responses provided by the two models, highlighting their interpretative consistency. The following section provides a detailed explanation of our assessment methods and a comprehensive comparative analysis of the results.
3.1. Performance Evaluation
An analysis of the performance of the two AI models, ChatGPT and Gemini, over the 30 diverse queries from five categories reveals notable fluctuations in their accuracy, suggesting that the effectiveness of each model depends significantly on the specific query (
Figure 1). Both models demonstrated high accuracy on specific queries, such as Queries 21 and 29, for which they scored a “5”, indicating their potential to achieve top performance. However, the results also show inconsistencies; for example, on Queries 1, 5, and 20, Gemini scored a “4” or above, while ChatGPT scored below “3”, highlighting differences in how the models process and respond to certain queries. In the category of “biological and functional implications of LLPS”, ChatGPT performed better, whereas Gemini performed better in all other categories. The average scores for both models were above “3”, suggesting generally satisfactory performance. This variability in scores indicates strengths and weaknesses in each model, which could be crucial for users to consider when selecting a model for specific tasks. The evaluation suggests that ChatGPT excels in broader biological context and synthesis, while Gemini is stronger in technical accuracy and specialization in LLPS principles, mechanisms, and research techniques. The identification of queries for which one model outperforms the other could lead to targeted improvements and optimization, enhancing overall efficacy in practical applications.
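As an illustration of how the per-query scores behind Figure 1 can be aggregated and visualized, the sketch below uses randomly generated placeholder scores (not the actual scores from this study) with NumPy and Matplotlib.

```python
# Illustrative aggregation and heat map of accuracy scores (cf. Figure 1).
# The score arrays are random placeholders, not the scores reported in this study.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
chatgpt_scores = rng.integers(2, 6, size=30)  # placeholder scores on the 1-5 scale
gemini_scores = rng.integers(2, 6, size=30)   # placeholder scores on the 1-5 scale
scores = np.vstack([chatgpt_scores, gemini_scores])

print("Average accuracy:",
      {"ChatGPT": float(chatgpt_scores.mean()), "Gemini": float(gemini_scores.mean())})

fig, ax = plt.subplots(figsize=(10, 2))
im = ax.imshow(scores, cmap="viridis", vmin=1, vmax=5, aspect="auto")
ax.set_yticks([0, 1])
ax.set_yticklabels(["ChatGPT", "Gemini"])
ax.set_xlabel("Query number")
fig.colorbar(im, ax=ax, label="Accuracy score")
plt.show()
```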
3.2. Analysis of Word Frequency
In our analysis of the word frequency from the responses provided by ChatGPT and Gemini, we focused on identifying the terms most frequently utilized by each AI model in discussing LLPS (
Figure 2a,b). The ten most frequently used words in the ChatGPT-generated responses were LLPS (243), proteins (129), condensates (100), phase separation (77), cellular (50), molecules (33), cells (26), dynamic (24), RNA (23), and diseases (20). Similarly, the ten most frequently used words in the Gemini-generated responses were LLPS (225), phase separation (220), cellular (100), condensates (93), proteins (87), interactions (65), dynamic (45), cells (27), disease (25), and organelles (24). The word “LLPS” appeared most frequently in responses from both ChatGPT and Gemini. Additionally, of the top 10 most-used words, 5 were common to both models and appeared with similar frequencies, highlighting a significant overlap in the vocabulary used by the two chatbots. However, there was considerable variation in the frequencies of the remaining words. The overall word frequencies indicate that the responses are scientifically relevant to the topic of the queries.
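A minimal sketch of how such a word-frequency tally can be produced is given below; the tokenization rule and the handling of multi-word terms such as “phase separation” are assumptions, since the study reports only the resulting counts.

```python
# Sketch of a word-frequency tally over pooled chatbot responses.
# Tokenization and the treatment of multi-word terms are simplifying assumptions.
import re
from collections import Counter

def top_terms(responses, n=10):
    """Return the n most frequent lowercase tokens across all responses."""
    tokens = []
    for text in responses:
        tokens.extend(re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower()))
    return Counter(tokens).most_common(n)

# Placeholder input; in the study, all 30 responses per chatbot were pooled.
chatgpt_responses = ["LLPS drives condensate formation in cells ...",
                     "Proteins and RNA undergo phase separation ..."]
print(top_terms(chatgpt_responses))
```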
Word frequency analysis reveals that both ChatGPT and Gemini prioritize core LLPS-related concepts, including ‘LLPS’, ‘proteins’, ‘phase separation’, and ‘cellular’. Our findings expose distinct patterns in language utilization and provide a clear indication of the strengths and weaknesses inherent to each model. For instance, where one model may excel at detailing complex interactions, another might better summarize overarching themes or principles, thereby suggesting a complementary utility in scientific inquiry. The measurement of word frequency offers a practical metric for refining AI models and highlights potential areas for enhancement.
3.3. Comparison of Response Time and Length
Figure 3a,b show the response times and lengths in a comparison between the chatbots. In terms of response time, ChatGPT exhibited the fastest average response time of 3.6 s, while Gemini lagged behind with an average response time of 10.2 s. This disparity can be attributed to ChatGPT’s optimized architecture and utilization of pretrained models, enabling it to quickly process and respond to user queries. Gemini, on the other hand, had a longer response time, which may reflect a less optimized architecture and additional processing during response generation. ChatGPT thus demonstrated remarkable efficiency, responding to queries in roughly one-third of the time taken by Gemini. When analyzing the response length, ChatGPT generated a higher average word count of 351 words compared to Gemini’s average of 272 words, indicating that ChatGPT tends to produce longer, more detailed responses.
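For reference, the comparison in Figure 3 could be rendered from the reported averages with a simple bar plot, as sketched below; only the average values quoted above are used, and per-query measurements are not reproduced.

```python
# Sketch of a two-panel bar comparison using only the reported averages (cf. Figure 3).
import matplotlib.pyplot as plt

models = ["ChatGPT", "Gemini"]
avg_time_s = [3.6, 10.2]   # reported average response times (s)
avg_words = [351, 272]     # reported average response lengths (words)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.bar(models, avg_time_s, color=["tab:blue", "tab:orange"])
ax1.set_ylabel("Average response time (s)")
ax2.bar(models, avg_words, color=["tab:blue", "tab:orange"])
ax2.set_ylabel("Average response length (words)")
fig.tight_layout()
plt.show()
```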
3.4. Similarity Analysis of Responses
Figure 4 shows the cosine similarity index (CSI) between responses from ChatGPT and Gemini across 30 distinct queries. The CSI values fluctuated significantly across the queries, ranging approximately from 0.5 to 0.8. This suggests a varying degree of similarity in the responses between ChatGPT and Gemini, indicating that while sometimes the models provide highly similar answers, at other times their responses are substantially different. Peaks around Queries 5, 14, 16, and 24 suggest instances where both models provide highly similar responses, possibly due to shared training data or algorithmic similarities. Conversely, lower points at Queries 7, 17, 21, and 30 indicate significant divergence in responses, likely stemming from differing interpretations of the queries or prioritization strategies.
It is important to acknowledge that the CSI computed using TF-IDF might not yield perfect results when comparing similarities between paragraphs. TF-IDF calculates the importance of a word within a document relative to its frequency across a corpus. However, it does not capture the semantic meaning or synonyms of words. Despite this limitation, TF-IDF remains an effective method for quantifying text overlaps [
12].
3.5. Review of Information Sources and References in Chatbot Responses
We also examined the referencing and citation capabilities of chatbots, as outlined in
Table S2. It was observed that ChatGPT often generates illustrative and fabricated references, citing a lack of real-time internet source access and suggesting the use of databases like PubMed or Scholarly to obtain accurate references, whereas Gemini provides correct references along with the detailed source information. Therefore, although chatbots are highly advantageous in providing authentic information, the user must be careful when using the associated citation and source information. Overall, the results of this study provide valuable insights into the performance of ChatGPT and Gemini in natural language processing, machine learning, and user experience.