1. Introduction
The shift to online work has made video conferencing essential, leading to a significant increase in online presentations at conferences and meetings. However, the excessive use of online meetings has increased fatigue among participants [1,2]. It is now even common practice to record online meetings for later review, as participants struggle to process all the information during live sessions. Recent advances in artificial intelligence (AI) offer solutions by helping users to retrieve key information efficiently from recorded meetings, reducing the need for time-consuming reviews [3].
AI can capture key information from recorded meetings using audio, visual, and/or text data by applying information extraction (IE) methods [4]. IE methods convert unstructured data into structured formats in order to highlight relevant information. However, existing proprietary systems limit flexibility and adaptability in implementing new models and restrict accessibility for key information retrieval, whereas an open-source approach offers more flexibility in adopting the latest models to achieve the best results. Integrating multiple models and methods into one system requires careful selection. Therefore, in this paper, we investigate whether fully open-source models can be effectively integrated and implemented for the generation of meeting minutes from recorded online meetings.
For this purpose, suitable open-source models for the system need to be identified. Then, the integration of the selected models needs to be conducted in order to develop a modular and adaptable system. Finally, the developed system needs to be compared with existing proprietary systems. The use of open-source models and methods is driven by their wide accessibility, cost effectiveness, and the benefits of continuous community contributions. Coupled with the rapid advancements in natural language processing (NLP), this approach offers the ability to quickly adopt the latest innovations. Thus, it enhances the system performance and improves the quality of generated meeting minutes. This is highly advantageous and beneficial for individuals or organizations in need of such a system.
In this paper, an integrated and modular fully open-source system is proposed by combining multiple methods of video analysis and information extraction. Since online presentations often comprise multiple slides, scene-change detection is applied to segment the video based on slide changes. Then, since the speaker’s speech also contains valuable information, an audio-to-text method is employed to capture this content. Because the resulting text data can be extensive, summarization and keyword extraction techniques are used to quickly convey the main points of the presentation. Considering that the video and textual information are processed separately, inaccurate matching could cause confusion for readers. To address this, an information correlation mechanism using OCR and regular expressions (regex) is applied.
The proposed system was evaluated using ten presentation videos that were created under controlled conditions, although they may not fully reflect the complexity of real-world online presentations. The system’s performance was measured in terms of accuracy and efficiency, using statistical metrics appropriate for each respective method. Additionally, user feedback was collected through a questionnaire, focusing on the quality of transcriptions, summaries, keyword extraction, and the intuitiveness of the user interface.
Building upon the explanation of the proposed system, namely a fully open-source model for generating meeting minutes, the contributions of the proposal can be summarized into the following points:
Designing a meeting minutes generation system by combining several methods that have been implemented with open-source programs.
Selecting the most appropriate open-source models for each method in the system from a wide range of available options, ensuring the best fit for each task, including scene-change detection, audio-to-text conversion, and summarization.
Presenting higher accuracy in transcription and similar performance in summarization and keyword extraction tasks compared to existing proprietary systems.
The paper is structured as follows:
Section 2 provides an overview of previous studies and methodologies.
Section 3 presents the proposed system.
Section 4 presents the selection process used for the open-source models.
Section 5 assesses the system through experiments.
Section 6 discusses the findings and the implications of the results from the system’s evaluation.
Section 7 concludes the paper and summarizes the key outcomes and insights.
2. Related Work in Literature
This section reviews related work on the methodologies used in this paper.
2.1. Video-Splitting Methods
The system segments the recorded presentation video using a scene-change detection algorithm to analyze differences between scenes. Scene-change detection can be classified into two approaches: image pixel data, which refers to the color values of each pixel, and image structure data, which considers the spatial arrangement of edges, shapes, and textures [5].
Common algorithms, such as mean square error (MSE) and histogram matching, offer faster scene detection but fail when pixel-level or geometric alterations occur [6,7]. In contrast, structure-based approaches (e.g., texture, contrast, and luminance) better preserve image information during compression [8]. The structural similarity index measure (SSIM) focuses on structural changes rather than pixel differences, assessing image quality based on human visual perception [9].
Given these insights, the SSIM algorithm is ideal for our use case, which involves analyzing compressed presentation videos without geometric alterations. This choice will be tested in subsequent experiments to validate its effectiveness and justify its selection over other methods to ensure that it meets the specific needs of our system.
2.2. Audio-to-Text
Capturing information from audio is achieved by converting the audio signal into text data [10]. Various approaches for audio-to-text conversion exist, but transformer-based models have recently gained attention for their strong performance and ability to maintain long-term dependencies in audio data. Several transformer-based models, such as Wav2Vec 2.0 and HuBERT, focus on self-supervised learning and are effective for low-resource languages [11,12]. The massively multilingual speech (MMS) model utilizes self-supervised learning to support thousands of languages, offering broad linguistic coverage. However, its extensive coverage comes at the cost of increased computational time [13]. Whisper excels in multilingual and noisy environments, as it is trained on weakly annotated and noisy data [14], although it suffers from occasional hallucinations, where incorrect words or sentences are generated.
Given Whisper’s strengths in handling multilingual and noisy environments, it appears to be a strong candidate for our system. However, further experiments comparing these models are needed to evaluate their suitability against other options and to validate their performance.
2.3. Abstractive Summarization
Abstractive summarization allows for creating summaries with human-like qualities, generating new words and sentence structures while preserving the original meaning of the source text [15]. Several methods have leveraged the transformer architecture due to its self-attention mechanism, which can preserve information from lengthy text data. The text-to-text transfer transformer (T5) handles a wide range of NLP tasks by framing them as text-to-text problems, benefiting from its training on diverse datasets [16]. However, this versatility comes at the cost of sub-optimal performance in specialized abstractive summarization tasks. The pre-training with extracted gap-sentences for abstractive summarization (PEGASUS) model, explicitly designed for summarization by Google, performs exceptionally well in this domain [17]. However, its specialized nature limits generalization across topics and requires computationally expensive fine-tuning for optimal results. In contrast, the bidirectional and auto-regressive transformer (BART) balances generalization and accuracy by leveraging diverse datasets and combining a bidirectional encoder with an auto-regressive decoder [18]. This dual approach improves word generation and preserves context over long sequences, although it increases computational cost and slows down processing due to its complex architecture.

Given the diverse topics covered in the recorded presentation videos and prior evaluation [19], BART was selected for its broad generalizability across different topics, making it suitable for generating summaries in this use case.
2.4. Keyword Extraction
In addition to summarization, the extraction of keywords from individual documents helps to achieve a concise understanding of the presented information. Recent approaches leverage automatic keyword extraction methods that employ language model embeddings, statistical approaches, graph-based architectures, and unsupervised learning to effectively manage complex relationships between words. KeyBERT uses bidirectional encoder representations from transformers (BERT) embeddings to rank keywords, providing high accuracy but requiring significant computational resources [20]. TextRank, a graph-based method, measures the influence of words based on their co-occurrence, which is effective but increases the computational load when handling longer texts [21]. RAKE and YAKE are both fast and simple due to their reliance on statistical techniques, although they lack the contextual depth and complexity needed for more complicated tasks [22,23]. Meanwhile, RaKUn2 combines a graph-based approach with unsupervised learning to cluster similar words, then applies load centrality to identify the most influential word clusters [24]. This mechanism reduces the number of graphs and lowers both the computational load and complexity, offering a more scalable solution for large and complex texts.

Given the diverse topics in the recorded presentation videos and the prior evaluation of various keyword extraction methods [25], RaKUn2 was selected for its ability to balance computational efficiency with accurate keyword extraction, making it well suited for managing the complexity of this use case.
2.5. Optical Character Recognition (OCR)
Extracting text from visual data enables a machine to understand the information it contains. OCR utilizes pattern recognition to interpret the characters in an image, enabling the digitization and extraction of text for various applications.
Recent OCR models leverage machine learning, resulting in varying performance based on datasets and architectures.
EasyOCR combines LSTM and CNN for text recognition, offering a simple and fast architecture but sacrificing accuracy, especially in complex extraction scenarios with varying font sizes [26]. Tesseract, developed by Google, provides broad language support and performs well on high-quality printed or written documents. However, its accuracy diminishes when handling low-resolution or compressed images, or text with small fonts [27]. In contrast, PaddleOCR excels at extracting text from images with complex orientation, compression, or composition due to its multi-stage process and training on diverse image styles. However, this complexity results in higher computational costs, making it less suitable for real-time applications or systems with limited computational power [28].
Given Tesseract’s struggles with small fonts and EasyOCR’s sensitivity to background variations, PaddleOCR offers a more robust solution for extracting text from challenging images, including those that are slightly blurred or low-resolution. While its more complex architecture increases execution time, leveraging cloud computing can mitigate this drawback by enhancing computational power, making PaddleOCR a suitable choice even in environments where performance is critical.
3. Proposal of Meeting Minutes Generation System
This section describes the methodology and workflow of the proposed system.
3.1. System Overview
The proposed system aims to provide users with flexibility in its use of open-source models as well as accessibility. For this purpose, the proposed system is implemented on top of several functions and deployed as a web application, as shown in Figure 1. The system architecture consists of the client side and the server side. The client side comprises the input function and the output function. The input function accepts a pair consisting of a recorded online presentation and the document used during the presentation. The output function visualizes the transcribed audio, the generated summaries, the extracted keywords, and the matched slides through the user interface (UI). The server side is responsible for processing the video to generate meeting minutes. Since each slide in the video may contain different information, the system first segments the video by slides. This process uses the split video function, which includes the scene-change detection algorithm and the segmentation mechanism [29]. The scene-change detection algorithm compares visual information between adjacent frames and splits the video at the frame where the visual information changes.
Then, the audio from each segmented video is transcribed into text using the audio-to-text function [30,31]. This process converts the speaker’s voice into a text format. From the text, the information extraction function generates the summary and extracts the keywords to help users quickly grasp the key points of the speaker’s message [19,32].
At this stage, the system has generated the transcription, summary, and keywords for each video segment. However, these data still need to be linked to the corresponding slide. To achieve this, the information correlation function is applied. This function uses OCR, the python-pptx library, and regular expressions (regex) to match the extracted information with the corresponding slide [33]. Finally, the system organizes the correlated information through the slide output information function, which is displayed to the user through the client-side interface.
3.2. Input Function
The presentation video contains both visual and audio data, with a single presenter using screen sharing. The video format must be compatible with ffmpeg, and the resolution should be . The presentation document provides the text content of the presentation, which is used as the ground truth for validation. Currently, the document must be in .pptx format, with no images, shapes, or animations, and a single background color to minimize complexity. Figure 2 shows an example of an acceptable document layout.
3.3. Split-Video Function
As explained above, since the presentation video may contain multiple slides, there is a possibility of information overlap between them. A scene-change detection algorithm and a segmentation mechanism are applied to prevent this. Inspired by the work of Bulut et al. [34], we adopt the split video function to separate the input video into multiple segments using a scene-change detection algorithm. This algorithm detects differences between adjacent frames to determine segmentation points. Then, the segmentation mechanism splits the video at these segmentation points. Finally, the function produces segmented videos and segmentation indices. The segmentation index denotes the order of the segments within the input video and serves as the identifier for each segment. These segmented videos are then processed individually in the audio-to-text function to transcribe the audio. The segmentation index is attached to each extracted piece of information produced by the information extraction function.
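For illustration, a minimal sketch of the frame-level comparison underlying this function is given below. It assumes the OpenCV and scikit-image libraries, and the similarity threshold is an illustrative value rather than the one tuned for our implementation.

```python
# Minimal sketch: decide whether two adjacent frames belong to different slides.
# Assumes OpenCV and scikit-image; the 0.9 threshold is illustrative only.
import cv2
from skimage.metrics import structural_similarity as ssim

def is_scene_change(frame_a, frame_b, threshold=0.9):
    """Return True when the structural similarity of two frames falls below the threshold."""
    gray_a = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)
    score = ssim(gray_a, gray_b)  # 1.0 means the frames are structurally identical
    return score < threshold
```

A segmentation point is recorded whenever this check returns True, and the video is cut at that frame.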
3.4. Audio-to-Text Function
This function employs an audio-to-text algorithm to convert audio signals into text. Before the audio from each segmented video is transcribed, normalization is performed: resampling with ffmpeg converts the audio to the standard sampling rate of 16 kHz. Current audio-to-text models use transformer architectures because of their efficiency in handling long-range dependencies in audio data.
Then, pre-processing converts the raw audio into a Mel spectrogram, a 2D representation of the audio’s frequency content over time. Some models add a convolutional feature encoder to extract patterns and reduce the data’s complexity. The core component of the transformer architecture is the context model built on the self-attention mechanism, which captures important connections across the entire audio sequence and allows the model to relate both short and long word sequences. The sequence order is encoded by positional encoding. Lastly, post-processing ensures the transcription is coherent, adding punctuation and making grammatical adjustments where needed. This design allows the transformer-based model to process entire sequences in parallel, improving transcription accuracy by attending to different parts of the input simultaneously.
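A minimal sketch of this step is shown below. It assumes that ffmpeg is available on the system path and uses the distil-whisper/distil-medium.en checkpoint listed in the model selection; the file names are placeholders.

```python
# Minimal sketch: normalize a segment's audio to 16 kHz mono and transcribe it.
# Assumes ffmpeg on the PATH and the Hugging Face Transformers library.
import subprocess
from transformers import pipeline

# Resample the segment's audio track to the standard 16 kHz, single-channel WAV.
subprocess.run(
    ["ffmpeg", "-y", "-i", "segment_01.mp4", "-ar", "16000", "-ac", "1", "segment_01.wav"],
    check=True,
)

# Transcribe the normalized audio with a Whisper-family model.
asr = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-medium.en",
    chunk_length_s=30,  # process long audio in 30 s chunks
)
transcript = asr("segment_01.wav")["text"]
print(transcript)
```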
3.5. Information Extraction Function
The information extraction function produces summaries and extracts keywords from the text, attaching the corresponding segmentation index to each result.
3.5.1. Abstractive Summarization
Abstractive summarization generates summaries while retaining the original semantic content of the source text [35]. In this study, the bidirectional and auto-regressive transformers language model (BART LM) is employed to generate accurate and coherent summaries [36]. This selection follows the results of our previous work measuring the performance of abstractive summarization models [19].
The BART LM model is downloaded through the HuggingFace implementation [37]. The BART LM model adopts a bidirectional encoder and an auto-regressive decoder. The bidirectional encoder allows the model to capture contextual representations between words appearing before and after the current word in the transcribed text. It encodes these data into numerical vectors, amplifying significant features and attenuating uninfluential ones using an attention mechanism. The auto-regressive decoder enables the model to produce text by predicting the most probable next word given the preceding words. This iterative process is repeated through the layers of the architecture. Finally, the model transforms these numerical vectors back into text.
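A minimal sketch of this step using the Hugging Face Transformers pipeline is shown below. The checkpoint name facebook/bart-large-cnn is an assumed, commonly used summarization fine-tune given for illustration and may differ from the exact checkpoint used in our implementation; the input text is a placeholder.

```python
# Minimal sketch: abstractive summarization of a transcribed segment with a BART LM.
# The checkpoint name is an assumption for illustration purposes.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

transcribed_text = (
    "The transcribed speech of one video segment goes here. "
    "It may span several sentences describing the content of the slide."
)

# max_length and min_length bound the length of the generated summary (in tokens).
summary = summarizer(transcribed_text, max_length=60, min_length=10, do_sample=False)
print(summary[0]["summary_text"])
```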
3.5.2. Keyword Extraction
The keyword extraction process automatically ranks and extracts the most relevant words from a source text. The ranks are determined based on word correlations, frequencies, and semantic meaning [38]. The rank-based keyword extraction via unsupervised learning (RaKUn2) model is downloaded from its GitHub repository [39]. In this study, RaKUn2 is employed due to its ability to deliver fast and accurate results across different keyword datasets [25]. This selection aligns with the findings of our previous work, where various keyword extraction methods were evaluated and the effectiveness of RaKUn2 in diverse scenarios was confirmed [19].
The keyword extraction first divides each sentence of the transcribed text into words and characters called tokens, a process known as tokenization. The relationship between tokens is determined by their co-occurrence frequency, which is calculated from how often each token appears within a sentence. A graph is then built in which the tokens are the nodes and the edges carry the co-occurrence frequencies, and the tokens are ranked according to the number of edges in the graph. Finally, post-processing removes duplicated keyword candidates.
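A minimal usage sketch is shown below. The class name and hyperparameter keys follow the examples in the rakun2 GitHub repository [39]; the hyperparameter values and the input text are illustrative only and may need adjustment.

```python
# Minimal sketch: keyword extraction with RaKUn2, following the usage shown in
# the rakun2 repository; hyperparameter values here are illustrative.
from rakun2 import RakunKeyphraseDetector

hyperparameters = {
    "num_keywords": 10,      # number of keywords to return
    "merge_threshold": 1.1,  # controls merging of similar candidates
    "alpha": 0.3,
    "token_prune_len": 3,    # drop very short tokens
}

detector = RakunKeyphraseDetector(hyperparameters)
transcribed_text = "The transcribed speech of one video segment goes here."
keywords = detector.find_keywords(transcribed_text, input_type="string")
print(keywords)  # list of (keyword, score) pairs
```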
3.6. Information Correlation Function
The information correlation function compares and validates the text data from the recorded presentation video and the extracted information against the original text from the presentation document. The workflow of this function is visualized in Figure 3.
First, it recognizes the text from each frame in the video using OCR. Since the video was resized, only the text in the title area is recognized and extracted. For the comparison, the original text is gathered from the title text of each slide in the presentation document using the python-pptx package.
Then, it compares the OCR text with the original text using the regex string matching to determine which slide index the OCR text corresponds to. The slide index refers to the order of slides that appear in the presentation document. This process determines the identical strings that appear in both OCR text and the original text.
Finally, after the slide index is known, the extracted information is assigned to the similar segmentation index. This process ensures that the information extracted from the recorded presentation is accurately correlated with the corresponding slides, thus improving the overall accuracy and usefulness of the extracted information and avoiding confusion for the readers.
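A minimal sketch of this matching step is given below. It assumes the PaddleOCR and python-pptx packages; the matching rule (all title tokens must appear in the OCR text) and the file names are illustrative simplifications of the regex-based matching described above.

```python
# Minimal sketch: correlate OCR'd title text with slide titles from the .pptx file.
# Assumes PaddleOCR and python-pptx; the matching rule is a simplified illustration.
import re
from paddleocr import PaddleOCR
from pptx import Presentation

ocr = PaddleOCR(lang="en")

def ocr_title_text(frame_path):
    """Recognize the text in a frame image and return it as one lowercase string."""
    result = ocr.ocr(frame_path)
    lines = result[0] if result and result[0] else []
    return " ".join(line[1][0] for line in lines).lower()

def slide_titles(pptx_path):
    """Collect (index, title) pairs for every slide in the presentation document."""
    prs = Presentation(pptx_path)
    return [(i, slide.shapes.title.text.lower())
            for i, slide in enumerate(prs.slides)
            if slide.shapes.title is not None]

def match_slide_index(ocr_text, titles):
    """Return the index of the slide whose title tokens all occur in the OCR text."""
    for index, title in titles:
        tokens = re.findall(r"\w+", title)
        if tokens and all(re.search(rf"\b{re.escape(t)}\b", ocr_text) for t in tokens):
            return index
    return None
```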
3.7. Slide Output Information Function
Following the validation of data pairs by the information correlation function, the data need to be organized for easy selection and management. The data include the extracted information, the word-error rate (WER) score for the audio-to-text conversion [40], the character-error rate (CER) score for the OCR accuracy [41], and the image of each slide. This function arranges the data in JSON format to ensure efficient handling and presentation. Figure 4 shows an example of the organized data for each slide, ready for display through the UI on the client side.
In Figure 4, the slide_index indicates the order of the slides. The slide_image represents the visual information shown in the recorded video. The slide_ocr_text shows the result of OCR. The slide_presentation_text shows the original text of the slide. The slide_cer displays the validation result of the information correlation process. The slide_summary and slide_keywords provide a brief explanation of the information. The slide_convert_text displays the transcribed text from the audio-to-text process. The slide_wer indicates the accuracy of the audio-to-text conversion.
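To make the structure concrete, a minimal sketch of one such per-slide record is given below. The field names follow the description above, while all values are placeholders rather than actual system output.

```python
# Minimal sketch: one per-slide record arranged in JSON; values are placeholders.
import json

slide_record = {
    "slide_index": 1,
    "slide_image": "slides/slide_01.png",
    "slide_ocr_text": "introduction",
    "slide_presentation_text": "Introduction",
    "slide_cer": 0.0,
    "slide_summary": "A short abstractive summary of this segment.",
    "slide_keywords": ["meeting", "minutes", "generation"],
    "slide_convert_text": "The transcribed speech of this segment.",
    "slide_wer": 0.02,
}

print(json.dumps(slide_record, indent=2))
```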
3.8. Output
The output displays the structured data in a web application built with the Streamlit Python package [42]. Streamlit integrates with the React user interface framework to develop web applications. Figure 5a shows the UI where users upload the recorded presentation video and the associated document. Two input fields are provided to upload the required data. Once the video is loaded, it is displayed. Figure 5b shows the UI after the uploaded data have been processed, where the results are displayed based on the data stored in JSON format. It shows the corresponding slide image along with the extracted information, such as the WER and CER scores, the text extracted via OCR, and the title text extracted using the python-pptx library. Additionally, the transcriptions, summaries, and keywords generated from the transcription are displayed.
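A minimal sketch of such a client-side page is given below. The widget labels and the way results are loaded from JSON are illustrative; the actual interface shown in Figure 5 may differ.

```python
# Minimal sketch: a Streamlit page with two upload fields and per-slide results.
# Widget labels and the JSON loading path are illustrative placeholders.
import json
import streamlit as st

st.title("Meeting Minutes Generation System")

video_file = st.file_uploader("Recorded presentation video", type=["mp4"])
pptx_file = st.file_uploader("Presentation document", type=["pptx"])

if video_file is not None:
    st.video(video_file)  # display the uploaded video

if video_file is not None and pptx_file is not None:
    # In the real system, the server-side pipeline produces this JSON file.
    with open("meeting_minutes.json") as f:
        slides = json.load(f)
    for slide in slides:
        st.subheader(f"Slide {slide['slide_index']}")
        st.image(slide["slide_image"])
        st.write("Summary:", slide["slide_summary"])
        st.write("Keywords:", ", ".join(slide["slide_keywords"]))
        st.write(f"WER: {slide['slide_wer']:.2%}, CER: {slide['slide_cer']:.2%}")
```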
4. Selection of Open-Source Models
This section outlines the process for selecting appropriate open-source models for the system.
4.1. Selection of Scene-Change Detection Algorithms
Since the algorithm needs to determine the transition points in video recordings based on scene changes, an accurate and fast technique is essential. However, while speed is important, accuracy must not be compromised, as it would impact the quality and disrupt subsequent processes. To meet these requirements, several algorithms from the OpenCV and Scikit-Image libraries were compared. The results are presented in Table 1.
Table 1 demonstrates the superiority of SSIM over the other image comparison algorithms. SSIM achieves perfect precision, recall, and F1 scores. In contrast, histogram comparison and MSE struggle, especially with brightness variations and compression artifacts, leading to misclassified frames. Though SSIM requires a longer processing time than MSE, its structural approach results in higher accuracy, since it considers luminance, contrast, and image structure. This precision justifies the extra computation time and makes SSIM the preferred choice for scene-change detection.
4.2. Selection of Audio-to-Text Models
The audio from each segmented video is extracted and converted into text. Selecting an accurate model is essential to ensure that subsequent processes accurately reflect the information conveyed in the audio. Additionally, model efficiency is essential to reduce computational time. Experiments were conducted to measure the WER and the computational duration (in seconds) of models from HuggingFaceHub to identify the most suitable model. Table 2 shows the results.
Table 2 compares the WER and the computational duration of the models. The Whisper model achieved the lowest WER and the shortest duration. While Wav2Vec had a relatively short duration, its WER was the highest. Therefore, the Whisper model is the most suitable and is selected for the system’s audio-to-text conversion.
4.3. Selection of Summarization Models
Each transcribed text from the segmented video is summarized to help readers quickly grasp the presentation’s main points. The summary is generated using an abstractive approach, leveraging a language model (LM) to produce coherent and concise output. Experiments were conducted to select the most suitable model from HuggingFaceHub. Table 3 shows the results.
Table 3 shows that the BART model outperforms PEGASUS and T5, achieving the highest ROUGE-1, ROUGE-2, and ROUGE-L scores as well as the fastest computational time. BART’s bidirectional encoder and auto-regressive decoder enable it to effectively process information from both the beginning and the end of sentences, allowing it to generate more coherent summaries. Despite its relatively complex architecture, BART also achieved the best computational efficiency among the compared models. Based on these results, BART is the most suitable model for generating summaries in the system.
4.4. Selection of Keyword Extraction Models
Keywords are extracted to emphasize the key topics discussed in each transcribed segment. Various models were tested for keyword extraction, and a detailed comparison was made to identify the best-performing model. Table 4 shows the results.
Table 4 highlights the performance differences across the keyword extraction models. RaKUn2 achieved the highest cosine similarity score at 47.70%, suggesting that it is relatively effective at identifying relevant keywords within transcribed segments. Based on these results, RaKUn2 emerged as the most suitable model for keyword extraction and is employed in the system. However, this measurement only considers lexical similarity; other factors, such as keyword diversity, duplication, and representativeness, are not captured by this metric alone. Therefore, these aspects should be explored further to gain a more complete picture of keyword extraction accuracy.
4.5. Selection of OCR Programs
The extracted information and slide image are aligned based on the extracted OCR text. Therefore, an accurate OCR program must be selected. To justify the selection, multiple OCR programs are compared, and the lowest CER is chosen as the OCR for the system.
Table 5 shows that PaddleOCR consistently delivers the fastest processing time and maintains high accuracy at any image resolution, showing the best overall performance. Tesseract performs well at higher resolutions but suffers from a significant accuracy drop at lower ones, reaching 100% CER below , as denoted by the bold values in Table 5. EasyOCR is the slowest and least accurate, especially at lower resolutions, where error rates rise sharply. Thus, PaddleOCR is the most reliable and is selected as the best fit for the system.
5. Evaluations
In this section, the implementation of the proposed system is evaluated.
5.1. Experiment Preparation
Ten recorded presentation videos on different topics with varying slide lengths were prepared as experimental materials to test the proposed system’s ability to handle diverse content. The videos were recorded by speakers with five non-native English accents, from Vietnam (V), Congo (C), Singapore (S), Malaysia (M), and Indonesia (I). The recordings were created using the Zoom screen-share function and captured through its meeting recording function. The output was a video file in .mp4 format, with a resolution of .
Table 6 provides the details of each recorded presentation video.
The video recordings have durations ranging from 4 to 8 minutes and an average bitrate of 611 kbps, as shown in Table 6. The proposed system is evaluated through statistical measurements and a user questionnaire. Each model’s and algorithm’s output is compared to reference data. This includes evaluating the F1-score for scene-change detection, the WER for audio-to-text conversion, ROUGE-N and ROUGE-L for abstractive summarization, cosine similarity for keyword extraction, and the CER for OCR. Then, the user questionnaire assessed the system’s practical performance and the user experience. Each participant’s session lasted about 45 min. Participants watched ten presentation videos covering varied topics and accents to introduce speech pattern diversity. After viewing, users reviewed the system-generated meeting minutes and completed the questionnaire, evaluating the system’s usability, interface intuitiveness, meeting minute quality, audio transcription accuracy, and scene segmentation effectiveness.
5.2. Limitations and Biases of Recorded Videos
The recorded videos used in this study were designed to test the system across varied topics, presentation styles, and speaker accents, though some limitations should be noted. Speakers from Vietnam, Congo, Singapore, Malaysia, and Indonesia introduce accent variations. However, this sample set does not fully capture global English diversity. The topics, focusing on technical fields like AI and biotechnology, present structured, content-rich scenarios that may not depict more casual presentations. Audio quality was managed by recording in quiet rooms using standard laptop microphones, minimizing noise but possibly influencing transcription accuracy. Speech patterns were kept natural, though accents may still introduce phonetic challenges for the system. Presentation slides were standardized in the .pptx format without images or complex formatting, ensuring consistency but limiting the representation of real-world design diversity. These constraints provide a controlled testing environment but may not fully reflect broader real-world variability.
5.3. Performance of Split Video Function
This section evaluates the practical impact of the SSIM algorithm. It categorizes video frames based on slide transitions as “change” or “same” to accurately segment the video. The video resolution was reduced to , and the algorithm was applied to one frame per second instead of every frame to optimize processing time. This approach significantly lowered the execution time, enabling SSIM to segment videos within one minute while maintaining a perfect F1 score, as shown in Table 7. These results indicate that each detected frame change accurately corresponds to an actual scene transition in the video.
5.4. Performance of Audio-to-Text Function
The audio-to-text function utilizes the Whisper model to convert audio to text, and its performance is evaluated using the WER. The audio is normalized to a sampling rate of 16 kHz. Despite the varied speech of speakers from non-English-speaking countries, including Vietnam, Congo, Singapore, Malaysia, and Indonesia, and the varied durations, the Whisper model demonstrates a low average WER, as shown in Table 8.
Table 8 shows that the Whisper model consistently achieved low WER values across different accents on all the slides, emphasizing Whisper’s robustness in handling speakers with varied accents. However, some slides that contain the main information (e.g., slide 2 and slide 3) show slightly higher WER values. This behavior may be attributed to the technical terms used in the presentation. Nevertheless, on the other slides, the model continues to deliver accurate transcriptions that are comparable to those of professional human transcribers [60].
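For reference, the WER between a reference transcript and a model hypothesis can be computed as sketched below; the jiwer package is assumed, and the example strings are placeholders.

```python
# Minimal sketch: word-error rate between a reference transcript and a hypothesis.
import jiwer

reference = "the proposed system generates meeting minutes from recorded presentations"
hypothesis = "the proposed system generates meeting minutes from recorded presentation"

wer = jiwer.wer(reference, hypothesis)  # fraction of word-level substitutions, insertions, deletions
print(f"WER: {wer:.2%}")
```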
5.5. Performance of Information Extraction Function
The information extraction function uses BART LM for abstractive summarization and RaKUn2 for keyword extraction. BART LM is integrated into the proposed system due to its strong text generation capabilities, as evidenced by the prior selection and reported in several studies [61,62]. Similarly, RaKUn2 was chosen for its extraction of representative keywords, supported by the previous selection criteria and confirmed by [63].
Table 9 shows that the BART LM model performs variably across slides and topics, with ROUGE-1, ROUGE-2, and ROUGE-L scores peaking on slides 2, 3, and 4 across most topics, reflecting the typical presentation structure. ROUGE scores assess summary quality by comparing generated summaries to reference ones. Specifically, ROUGE-1 measures unigram (single-word) overlap, ROUGE-2 evaluates fluency through bigram (two-word) overlap, and ROUGE-L checks the longest common sub-sequence (continuous adjacent words) to assess structural coherence. High scores across technical and general topics indicate that the system produces coherent, reliable summaries across varied domains. This consistency suggests that BART LM is suitable for various summarization tasks.
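For reference, the ROUGE scores between a generated summary and its reference can be computed as sketched below; the rouge-score package is assumed, and the texts are placeholders.

```python
# Minimal sketch: ROUGE-1, ROUGE-2, and ROUGE-L between a reference and a generated summary.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference_summary = "The slide introduces the proposed meeting minutes generation system."
generated_summary = "The slide presents the proposed system for generating meeting minutes."

scores = scorer.score(reference_summary, generated_summary)
for name, score in scores.items():
    print(name, f"F1 = {score.fmeasure:.2%}")
```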
The keyword extraction is evaluated by measuring the similarity between the extracted and reference keywords using cosine similarity [64]. Table 10 illustrates the cosine similarity of the extracted keywords for each slide of every topic. It shows that the RaKUn2 algorithm achieved an average cosine similarity score of 0.509 across all the slides and topics, accurately capturing over half of the reference keywords. While effective for general keyword identification, cosine similarity relies mainly on lexical overlap, potentially missing deeper semantic nuances, especially in topics requiring complex contextual understanding [65].
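The lexical cosine similarity between an extracted keyword list and a reference list can be computed as sketched below; scikit-learn is assumed, and the keyword lists are placeholders.

```python
# Minimal sketch: lexical cosine similarity between extracted and reference keyword lists.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

extracted = ["meeting", "minutes", "whisper", "summarization"]
reference = ["meeting", "minutes", "transcription", "summarization"]

# Represent each keyword list as a bag-of-words vector, then compare the two vectors.
vectors = CountVectorizer().fit_transform([" ".join(extracted), " ".join(reference)])
similarity = cosine_similarity(vectors[0], vectors[1])[0, 0]
print(f"Cosine similarity: {similarity:.3f}")
```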
5.6. Performance of Information Correlation Function
The information correlation function aligns the extracted text, summaries, and keywords with the corresponding presentation slides. Supported by the selection in Table 5, PaddleOCR is used to detect the title text from segmented video frames, which is then matched to the presentation document through regex matching. Since this matching mechanism relies on string-level similarity, the perfect OCR text extraction shown in Table 11 ensures error-free matching. However, as the presentations in this study are controlled, these results may not directly generalize to real-world scenarios where design variations are common.
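For reference, the CER between an OCR result and the corresponding slide title can be computed as sketched below; the jiwer package is assumed, and the strings are placeholders.

```python
# Minimal sketch: character-error rate between a slide title and its OCR result.
import jiwer

title_from_slide = "proposal of meeting minutes generation system"
title_from_ocr = "proposal of meeting minutes generation system"

cer = jiwer.cer(title_from_slide, title_from_ocr)  # 0.0 means a perfect OCR match
print(f"CER: {cer:.2%}")
```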
5.7. Performance Comparison with Other Meeting Minutes Systems
This experiment evaluates the performance of the proposed system by comparing it with other existing systems. Due to the proprietary nature of these systems, data were manually gathered from their free or trial versions. The execution time, meeting transcription, summary generation, keyword extraction, interface, and customization capabilities were compared to demonstrate the effectiveness of integrating open-source models relative to professionally developed systems. The key metrics in this comparison included the WER for transcription accuracy, ROUGE-N scores for summary quality, and cosine similarity for keyword extraction relevance. Table 12 shows the comparison results.
In Table 12, the bolded values represent the highest scores, while the underlined values denote the second-highest scores. These results indicate that the proposed system achieved the lowest WER of 1.88%, outperforming the other systems in transcription accuracy. While MeetingBooster had the fastest execution time, it only provided transcription. In contrast, our system handles transcriptions, summaries, and keywords while achieving the second-best time (1:34). For summarization, Piglyph achieved the highest ROUGE-N metrics, with our system following closely with the second-highest scores: 41.06% for ROUGE-1, 23.76% for ROUGE-2, and 32.12% for ROUGE-L. Piglyph also led in keyword extraction with a cosine similarity of 0.521, while our system achieved 0.509. The browser-based design offers easy access, and the system’s use of customizable open-source models enables model updates, ensuring adaptability to the latest advancements.
Although our system does not consistently achieve the highest scores, its performance closely aligns with professionally developed systems. The system performs best with simple presentation formats, and further testing is needed to confirm its effectiveness with more complex layouts, diverse speaker styles, and varied slide designs.
5.8. Questionnaire Results on Usability and Effectiveness
The usability and effectiveness of the proposed system were evaluated using a questionnaire based on the Performance, Information, Economics, Control and Security, Efficiency, and Service (PIECES) framework [71]. The questionnaire was administered to 31 respondents. Each respondent first reviewed a provided presentation document, listened to the presentation recording, and executed the proposed system. Feedback was subsequently collected through the questionnaire, with the specific questions listed in Table 13.
Each question represents a key aspect of the framework. Questions Q1, Q2, and Q5 focus on usability, such as ease of use, an intuitive interface, and the likelihood of recommending the system. Specifically, Q1 evaluates the Service aspect by measuring how quickly users can extract information. Q2 examines the Control and Security aspects by assessing how intuitively users can manage the system. Q5 checks the Economics and Efficiency aspects by gauging whether users would recommend the system for similar tasks. Q3 and Q4 focus on the effectiveness of the system. Q3 assesses the Information aspect by evaluating the accuracy of the audio-to-text conversion. Q4 addresses the Performance aspect, reflecting the quality of the summaries and keyword extraction. Table 14 shows the responses to the questionnaire.
As shown in Table 14, the system received positive feedback on ease of use, interface intuitiveness, and recommendation likelihood. Most users found the system easy to navigate: twenty-two respondents agreed that it was simple to extract information from videos, and twenty-one appreciated its intuitive design. These findings align with the performance of the Whisper model, which achieved a low WER. However, the mixed responses on summary and keyword quality, as indicated by the answers to Q4, suggest that while the BART model and RaKUn2 performed well, there is room for improvement. The system accurately aligned the extracted information with the presentation slides, as indicated by the responses to Q2, utilizing PaddleOCR and regex. Overall, the system is user-friendly, accurate, and effective, with potential areas for enhancement.
6. Discussion
Online meetings and online presentations have become essential with the rise of remote work, but they often leave participants fatigued due to information overload and tightly packed meetings. To address this, the proposed system extracts, summarizes, and correlates the key information from recorded meetings using fully open-source AI models. The system enables users to review essential content without replaying entire sessions and to update the system whenever a new advancement in AI models occurs. Following a modular and task-specific approach, the audio, text, and visual data are each processed individually to maximize computational efficiency and accuracy.
The WER achieved by Whisper is very close to the accuracy of human transcription [14]. Although the ROUGE-N scores, cosine similarity, and user feedback obtained in this study indicate slight limitations in generating cohesive summaries and representative keywords in practical settings, their usefulness often depends on how effectively they capture the critical information discussed in the meeting.
Another limitation of the proposed system is that it was tested with controlled recorded online meeting presentations. These presentations excluded images, figures, and animations and used a single background color with the standard font size and style. The speech pattern was kept uniform, with no variation in speed or intonations.
Considering these limitations, future investigations will focus on varied presentation formats and speech patterns. More complex scenarios will represent real-world challenges. Additional techniques or model fine-tuning may be necessary to ensure the system can generate usable information from more complex inputs.
7. Conclusions
Online meetings and presentations have become essential with the rise of remote work, often leading to participant fatigue from information overload and back-to-back sessions. The proposed system addresses this issue by extracting, summarizing, and correlating key information from recorded meetings. Built entirely from open-source AI models, the system employs Whisper for audio-to-text transcription, BART LM for summarization, RaKUn2 for keyword extraction, SSIM for scene-change detection, and PaddleOCR together with regex for extracting and correlating visual and textual data.
The evaluation of ten recorded presentations demonstrates the system’s accuracy, emphasizing its potential to streamline note-taking for meetings, conferences, and seminars. The practical impact could benefit organizations and educational settings that rely on efficient meeting reviews. However, accommodating diverse and dynamic presentation designs and varied speech patterns remains an area for future investigation.
Author Contributions
Conceptualization, A.L.H., N.F. and S.S.; methodology, A.L.H.; software, A.L.H.; validation, A.L.H.; resources, S.S.; data curation, E.D.F.; writing—original draft preparation, A.L.H.; writing—review and editing, N.F., Y.Y.F.P. and S.S. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.
Acknowledgments
The authors thank the reviewers for their thorough reading and helpful comments.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Standaert, W.; Thunus, S.; Schoenaers, F. Virtual meetings and wellbeing: Insights from the COVID-19 pandemic. Inf. Technol. People 2023, 36, 1766–1789. [Google Scholar] [CrossRef]
- Bergmann, R.; Rintel, S.; Baym, N.; Sarkar, A.; Borowiec, D.; Wong, P.; Sellen, A. Meeting (the) pandemic: Videoconferencing fatigue and evolving tensions of sociality in enterprise video meetings during COVID-19. Comput. Support. Coop. Work. (CSCW) 2023, 32, 347–383. [Google Scholar] [CrossRef] [PubMed]
- Apostolidis, E.; Adamantidou, E.; Metsai, A.I.; Mezaris, V.; Patras, I. Video summarization using deep neural networks: A survey. arXiv 2021, arXiv:2101.06072. [Google Scholar] [CrossRef]
- Mannai, M.; Karâa, W.B.A.; Ghezala, H.H.B. Information extraction approaches: A survey. In Proceedings of the Information and Communication Technology, Bangkok, Thailand, 12–13 December 2016; pp. 289–297. [Google Scholar]
- Yu, Y.; Wang, C.; Fu, Q.; Kou, R.; Huang, F.; Yang, B.; Yang, T.; Gao, M. Techniques and challenges of image segmentation: A review. Electronics 2023, 12, 1199. [Google Scholar] [CrossRef]
- Fuad, M.; Ernawan, F.; Hui, L. Video scene change detection based on histogram analysis for hiding message. J. Phys. Conf. Ser. IOP Publ. 2021, 1918, 042141. [Google Scholar] [CrossRef]
- Bhuiyan, M.A.A.; Khan, A.R. Image quality assessment employing RMS contrast and histogram similarity. Int. Arab J. Inf. Technol. 2018, 15, 983–989. [Google Scholar]
- Gore, A.; Gupta, S. Full reference image quality metrics for JPEG compressed images. AEU Int. J. Electron. Commun. 2015, 69, 604–608. [Google Scholar] [CrossRef]
- Shen, J.; Jiang, X.; Zhong, J.; Yao, S. Scene change detection based on sequence statistics using structural similarity. In Proceedings of the 2022 4th International Academic Exchange Conference on Science and Technology Innovation (IAECST), Guangzhou, China, 9–11 December 2022; pp. 1179–1182. [Google Scholar]
- Alharbi, S.; Alrazgan, M.; Alrashed, A.; Alnomasi, T.; Almojel, R.; Alharbi, R.; Alharbi, S.; Alturki, S.; Alshehri, F.; Almojil, M. Automatic speech recognition: Systematic literature review. IEEE Access 2021, 9, 131858–131876. [Google Scholar] [CrossRef]
- Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 2020, 33, 12449–12460. [Google Scholar]
- Hsu, W.N.; Bolte, B.; Tsai, Y.H.H.; Lakhotia, K.; Salakhutdinov, R.; Mohamed, A. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3451–3460. [Google Scholar] [CrossRef]
- Pratap, V.; Tjandra, A.; Shi, B.; Tomasello, P.; Babu, A.; Kundu, S.; Elkahky, A.; Ni, Z.; Vyas, A.; Fazel-Zarandi, M.; et al. Scaling speech technology to 1000+ languages. J. Mach. Learn. Res. 2024, 25, 1–52. [Google Scholar]
- Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust speech recognition via large-scale weak supervision. In Proceedings of the International Conference on Machine Learning. PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 28492–28518. [Google Scholar]
- Jangra, A.; Mukherjee, S.; Jatowt, A.; Saha, S.; Hasanuzzaman, M. A survey on multi-modal summarization. ACM Comput. Surv. 2023, 55, 1–36. [Google Scholar] [CrossRef]
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
- Zhang, J.; Zhao, Y.; Saleh, M.; Liu, P. Pegasus: Pre-training with extracted gap-sentences for abstractive summarization. In Proceedings of the International Conference on Machine Learning. PMLR, Virtual Event, 13–18 July 2020; pp. 11328–11339. [Google Scholar]
- Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv 2019, arXiv:1910.13461. [Google Scholar]
- Haz, A.L.; Funabiki, N.; Fajrianti, E.D.; Sukaridhoto, S. A Study of Summarization and Keyword Extraction Function in Meeting Note Generation System from Voice Records. In Proceedings of the 2023 12th International Conference on Networks, Communication and Computing, Osaka, Japan, 15–17 December 2023; pp. 106–112. [Google Scholar]
- Grootendorst, M. KeyBERT: Minimal Keyword Extraction with BERT. 2020. Available online: https://doi.org/10.5281/zenodo.4461265 (accessed on 26 October 2024).
- Mihalcea, R.; Tarau, P. Textrank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, 25–26 July 2004; pp. 404–411. [Google Scholar]
- Rose, S.; Engel, D.; Cramer, N.; Cowley, W. Automatic keyword extraction from individual documents. In Text Mining: Applications and Theory; Wiley: Hoboken, NJ, USA, 2010; pp. 1–20. [Google Scholar]
- Campos, R.; Mangaravite, V.; Pasquali, A.; Jorge, A.; Nunes, C.; Jatowt, A. YAKE! Keyword extraction from single documents using multiple local features. Inf. Sci. 2020, 509, 257–289. [Google Scholar] [CrossRef]
- Škrlj, B.; Koloski, B.; Pollak, S. Retrieval-efficiency trade-off of Unsupervised Keyword Extraction. In Proceedings of the International Conference on Discovery Science, Montpellier, France, 10–12 October 2022; pp. 379–393. [Google Scholar]
- Škrlj, B.; Repar, A.; Pollak, S. RaKUn: Rank-based Keyword extraction via Unsupervised learning and meta vertex aggregation. In Proceedings of the Statistical Language and Speech Processing: 7th International Conference, SLSP 2019, Ljubljana, Slovenia, 14–16 October 2019; pp. 311–323. [Google Scholar]
- Salehudin, M.; Basah, S.; Yazid, H.; Basaruddin, K.; Safar, M.; Som, M.M.; Sidek, K. Analysis of Optical Character Recognition using EasyOCR under Image Degradation. J. Phys. Conf. Ser. IOP Publ. 2023, 2641, 012001. [Google Scholar] [CrossRef]
- de Luna, R.G. A Tesseract-based Optical Character Recognition for a Text-to-Braille Code Conversion. Int. J. Adv. Sci. Eng. Inf. Technol. 2020, 10, 128–136. [Google Scholar] [CrossRef]
- Shahin, M.; Chen, F.F.; Hosseinzadeh, A. Machine-based identification system via optical character recognition. Flex. Serv. Manuf. J. 2023, 1–28. [Google Scholar] [CrossRef]
- Mohajer, M.M.; Hassanpour, H. Fast Exam Video Summarization Using Targeted Evaluation of Scene Changes Based on User Behavior. In Proceedings of the 2023 9th International Conference on Signal Processing and Intelligent Systems (ICSPIS), Bali, Indonesia, 14–15 December 2023; pp. 1–5. [Google Scholar]
- Haz, A.L.; Fajrianti, E.D.; Funabiki, N.; Sukaridhoto, S. A Study of Audio-to-Text Conversion Software Using Whispers Model. In Proceedings of the 2023 Sixth International Conference on Vocational Education and Electrical Engineering (ICVEE), Bali, Indonesia, 14–15 December 2023; pp. 268–273. [Google Scholar]
- Nguyen, Q.; Nguyen, N.; Dang, T.; Tran, V. Vietnamese Voice2Text: A Web Application for Whisper Implementation in Vietnamese Automatic Speech Recognition Tasks: Vietnamese Voice2Text. In Proceedings of the 2023 7th International Conference on Computer Science and Artificial Intelligence, Beijing, China, 8–10 December 2023; pp. 312–318. [Google Scholar]
- Saxena, P.; El-Haj, M. Exploring Abstractive Text Summarisation for Podcasts: A Comparative Study of BART and T5 Models. In Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing, Varna, Bulgaria, 4–6 September 2023; pp. 1023–1033. [Google Scholar]
- Tang, Z.; Yang, Z.; Wang, G.; Fang, Y.; Liu, Y.; Zhu, C.; Zeng, M.; Zhang, C.; Bansal, M. Unifying vision, text, and layout for universal document processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 19254–19264. [Google Scholar]
- Bulut, F.; Osmanı, S. Scene Change Detection using Different Color Pallets and Performance Comparison. Balk. J. Electr. Comput. Eng. 2017, 5, 66–72. [Google Scholar] [CrossRef]
- Widyassari, A.P.; Rustad, S.; Shidik, G.F.; Noersasongko, E.; Syukur, A.; Affandy, A.; Setiadi, D.R.I.M. Review of automatic text summarization techniques & methods. J. King Saud-Univ.-Comput. Inf. Sci. 2022, 34, 1029–1046. [Google Scholar]
- Chen, Y.; Song, Q. News text summarization method based on bart-textrank model. In Proceedings of the 2021 IEEE 5th Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), Chongqing, China, 12–14 March 2021; pp. 2005–2010. [Google Scholar]
- Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. Facebook/Bart-large · Hugging Face. 2019. Available online: https://huggingface.co/facebook/bart-large (accessed on 24 October 2024).
- Nasar, Z.; Jaffry, S.W.; Malik, M.K. Textual keyword extraction and summarization: State-of-the-art. Inf. Process. Manag. 2019, 56, 102088. [Google Scholar] [CrossRef]
- Blaž, Š.; Koloski, B.; Pollak, S. Retrieval-Efficiency Trade-Off of Unsupervised Keyword Extraction. In Discovery Science; Springer: Cham, Switzerland, 2022; Volume 13601. [Google Scholar]
- Von Neumann, T.; Boeddeker, C.; Kinoshita, K.; Delcroix, M.; Haeb-Umbach, R. On word error rate definitions and their efficient computation for multi-speaker speech recognition systems. In Proceedings of the ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
- Schaefer, R.; Neudecker, C. A two-step approach for automatic OCR post-correction. In Proceedings of the The 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, Barcelona, Spain, 12 December 2020; pp. 52–57. [Google Scholar]
- Streamlit. A Faster Way to Build and Share Data Apps. Available online: https://streamlit.io/ (accessed on 26 October 2024).
- Scikit-Image Contributors. Structural Similarity Index. 2024. Available online: https://scikit-image.org/docs/stable/auto_examples/transform/plot_ssim.html (accessed on 20 October 2024).
- openCV Contributors. Histogram Comparison. 2024. Available online: https://docs.opencv.org/4.x/d8/dc8/tutorial_histogram_comparison.html (accessed on 20 October 2024).
- Scikit-Learn Developers. Mean_Squared_Error. 2024. Available online: https://scikit-learn.org/1.5/modules/generated/sklearn.metrics.mean_squared_error.html (accessed on 20 October 2024).
- Facebook-AI. Wav2Vec2 Base 960h. 2021. Available online: https://huggingface.co/facebook/wav2vec2-base-960h (accessed on 23 October 2024).
- Facebook-AI. HuBERT Large LS960 Fine-Tuned. 2021. Available online: https://huggingface.co/facebook/hubert-large-ls960-ft (accessed on 23 October 2024).
- Meta-AI. MMS-1B-FL102. 2023. Available online: https://huggingface.co/facebook/mms-1b-fl102 (accessed on 23 October 2024).
- Distil-Whisper. Distil-Whisper Medium English. 2023. Available online: https://huggingface.co/distil-whisper/distil-medium.en (accessed on 23 October 2024).
- Clivillé, J. flan-t5-3b-summarizer. 2023. Available online: https://huggingface.co/jordiclive/flan-t5-3b-summarizer (accessed on 24 October 2024).
- Google-Research. PEGASUS-XSum. 2020. Available online: https://huggingface.co/google/pegasus-xsum (accessed on 24 October 2024).
- Neelamohan, K.K. MEETING_SUMMARY. 2022. Available online: https://huggingface.co/knkarthick/MEETING_SUMMARY (accessed on 24 October 2024).
- Grootendorst, M. KeyBERT: Minimal Keyword Extraction with BERT. 2020. GitHub Repository. Available online: https://github.com/MaartenGr/KeyBERT (accessed on 23 October 2024).
- Surfer, C. RAKE-NLTK: Rapid Automatic Keyword Extraction using NLTK. 2018. GitHub Repository. Available online: https://github.com/csurfer/rake-nltk (accessed on 26 October 2024).
- Campos, R.; Mangaravite, V.; Pasquali, A.; Jorge, A.M.; Nunes, C.; Jatowt, A. YAKE: Keyword Extraction from Single Documents Using Multiple Features. 2020. GitHub Repository. Available online: https://github.com/LIAAD/yake (accessed on 26 October 2024).
- Nathan, P. PyTextRank Python Implementation of TextRank for Phrase Extraction and Summarization. 2020. GitHub Repository. Available online: https://github.com/DerwenAI/pytextrank (accessed on 26 October 2024).
- SkBlaz. RaKUn2—Rake Unsupervised Keyword Extraction. 2023. GitHub Repository. Available online: https://github.com/SkBlaz/rakun2 (accessed on 26 October 2024).
- Hu, S.; He, C.; Zhang, C.; Tan, Z.; Ge, B.; Zhou, X. Efficient scene text recognition model built with PaddlePaddle framework. In Proceedings of the 2021 7th International Conference on Big Data and Information Analytics (BigDIA), Chongqing, China, 4–10 June 2023; pp. 139–142. [Google Scholar]
- Smith, R. An Overview of the Tesseract OCR Engine. In Proceedings of the ICDAR ’07: Proceedings of the Ninth International Conference on Document Analysis and Recognition, Washington, DC, USA, 23–26 September 2007; pp. 629–633. [Google Scholar]
- Graham, C.; Roll, N. Evaluating OpenAI’s Whisper ASR: Performance analysis across diverse accents and speaker traits. JASA Express Lett. 2024, 4, 025206. [Google Scholar] [CrossRef] [PubMed]
- Zhang, T.; Irsan, I.C.; Thung, F.; Han, D.; Lo, D.; Jiang, L. iTiger: An automatic issue title generation tool. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Singapore, 14–18 November 2022; pp. 1637–1641. [Google Scholar]
- Raju, R.; Pati, P.B.; Gandheesh, S.; Sannala, G.S.; Suriya, K. Grammatical versus Spelling Error Correction: An Investigation into the Responsiveness of Transformer-Based Language Models Using BART and MarianMT. arXiv 2024, arXiv:2403.16655. [Google Scholar] [CrossRef]
- Škrlj, B.; Jukič, M.; Eržen, N.; Pollak, S.; Lavrač, N. Prioritization of COVID-19-related literature via unsupervised keyphrase extraction and document representation learning. In Proceedings of the Discovery Science: 24th International Conference, Halifax, NS, Canada, 11–13 October 2021; pp. 204–217. [Google Scholar]
- Saha, S.; Ghosh, M.; Ghosh, S.; Sen, S.; Singh, P.K.; Geem, Z.W.; Sarkar, R. Feature selection for facial emotion recognition using cosine similarity-based harmony search algorithm. Appl. Sci. 2020, 10, 2816. [Google Scholar] [CrossRef]
- Sarwar, T.B.; Noor, N.M.; Miah, M.S.U. Evaluating keyphrase extraction algorithms for finding similar news articles using lexical similarity calculation and semantic relatedness measurement by word embedding. PeerJ Comput. Sci. 2022, 8, e1024. [Google Scholar] [CrossRef]
- MeetingBooster. Meeting Management Software: Meetingbooster. Available online: https://www.meetingbooster.com/ (accessed on 20 October 2024).
- Fellow. Fellow Resources. 2023. Available online: https://fellow.app/ (accessed on 20 October 2024).
- Beenote. Meeting Management Solution: Agenda, Minutes. 2022. Available online: https://www.beenote.io/ (accessed on 20 October 2024).
- Piglyph. Interactive Whiteboard for Co-Creation Through Real-Time Visualization: Ricoh. Available online: https://piglyph.com/ (accessed on 20 October 2024).
- Tactiq. AI Meeting Transcripts for Google Meet, Zoom & Teams. Available online: https://tactiq.io/ (accessed on 20 October 2024).
- Fatoni, A.; Adi, K.; Widodo, A.P. PIECES framework and importance performance analysis method to evaluate the implementation of information systems. In Proceedings of the E3S Web of Conferences, Online Conference, 12–13 August 2020; Volume 202, p. 15007. [Google Scholar]
Figure 1. Overview of meeting minutes generation system.
Figure 2. Acceptable document format.
Figure 3. Workflow for information correlation function.
Figure 4. Sample of organized data for each slide.
Figure 5. (a) Input field for recorded video and presentation document; (b) extracted results for each slide.
Table 1. Comparison of image comparison algorithms.

| Metric | SSIM [43] | Histogram: Bhattacharyya [44] | Histogram: Chi-Square [44] | Histogram: Correlation [44] | MSE [45] |
|---|---|---|---|---|---|
| Precision | 100% | 35.29% | 26.32% | 38.46% | 11.36% |
| Recall | 100% | 60.00% | 100.00% | 50.00% | 100.00% |
| F1 Score | 100% | 44.44% | 41.67% | 43.48% | 20.41% |
| Duration (s) | 180.85 | 296.91 | 309.79 | 297.13 | 173.16 |
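As a point of reference for Table 1, the sketch below computes the three comparison measures for a single pair of sampled frames using OpenCV and scikit-image; the file names, grayscale conversion, and histogram settings are illustrative assumptions rather than the exact evaluation configuration.

```python
# Minimal sketch (not the paper's exact script): compare two sampled frames
# with SSIM, histogram distances, and MSE, as in Table 1.
import cv2
import numpy as np
from skimage.metrics import structural_similarity

frame_a = cv2.imread("frame_a.png")  # illustrative file names
frame_b = cv2.imread("frame_b.png")
gray_a = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
gray_b = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)

# SSIM: 1.0 for identical frames, lower values suggest a slide change.
ssim_score = structural_similarity(gray_a, gray_b)

# Histogram comparison with the three distances listed in Table 1.
hist_a = cv2.calcHist([gray_a], [0], None, [256], [0, 256])
hist_b = cv2.calcHist([gray_b], [0], None, [256], [0, 256])
cv2.normalize(hist_a, hist_a)
cv2.normalize(hist_b, hist_b)
bhattacharyya = cv2.compareHist(hist_a, hist_b, cv2.HISTCMP_BHATTACHARYYA)
chi_square = cv2.compareHist(hist_a, hist_b, cv2.HISTCMP_CHISQR)
correlation = cv2.compareHist(hist_a, hist_b, cv2.HISTCMP_CORREL)

# MSE: 0.0 for identical frames, larger values indicate more change.
mse = np.mean((gray_a.astype("float32") - gray_b.astype("float32")) ** 2)

print(ssim_score, bhattacharyya, chi_square, correlation, mse)
```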
Table 2. Comparison of WER and duration among audio-to-text models.

| Metric | Wav2Vec [46] | HuBERT [47] | MMS [48] | Whisper [49] |
|---|---|---|---|---|
| WER | 51.23% | 26.92% | 19.38% | 4.41% |
| Duration (s) | 6.484 | 16.044 | 104.832 | 4.912 |
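The word error rate (WER) values in Table 2 follow the standard definition: the number of word-level substitutions, deletions, and insertions divided by the number of words in the reference transcript. A minimal way to reproduce the metric, shown here with the open-source jiwer package and placeholder transcripts, is:

```python
# Hedged example: WER between a reference transcript and a model output.
# The strings are placeholders, not data from the evaluation set.
import jiwer

reference = "the proposed system segments the video at every slide change"
hypothesis = "the proposed system segment the video at every slide change"

wer = jiwer.wer(reference, hypothesis)
print(f"WER: {wer:.2%}")  # fraction of word-level errors
```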
Table 3. Comparison of ROUGE-N and duration among summarization models.

| Metric | T5 [50] | PEGASUS [51] | BART [52] |
|---|---|---|---|
| ROUGE-1 | 30.66% | 30.95% | 39.34% |
| ROUGE-2 | 15.05% | 16.30% | 22.43% |
| ROUGE-L | 25.56% | 26.16% | 35.35% |
| Duration (s) | 4.448 | 1.374 | 0.962 |
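The ROUGE-1, ROUGE-2, and ROUGE-L scores in Table 3 (and later in Table 9) can be reproduced with any standard implementation; the sketch below uses the open-source rouge-score package with placeholder texts and is only one possible setup.

```python
# Hedged example: ROUGE-1/2/L F1 between a reference and a generated summary.
from rouge_score import rouge_scorer

reference_summary = "The slide introduces blockchain consensus mechanisms."
generated_summary = "The slide explains consensus mechanisms used in blockchain."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference_summary, generated_summary)
for name, result in scores.items():
    print(name, f"F1 = {result.fmeasure:.2%}")
```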
Table 4. Comparison of cosine similarity among keyword extraction models.

| Metric | KeyBERT [53] | RAKE [54] | YAKE [55] | TextRank [56] | RaKUn2 [57] |
|---|---|---|---|---|---|
| Cosine Similarity | 28.46% | 27.83% | 34.01% | 33.97% | 47.70% |
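The cosine-similarity figures in Table 4 (and Table 10) compare extracted keywords against reference keywords in an embedding space. The sketch below is one way to compute such a score; the sentence-transformers model and the keyword strings are illustrative assumptions and not necessarily the setup used in the evaluation.

```python
# Hedged example: cosine similarity between reference and extracted keywords.
# Model name and keyword strings are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
reference_keywords = "quantum computing qubits superposition entanglement"
extracted_keywords = "quantum computer qubit states entangled particles"

embeddings = model.encode([reference_keywords, extracted_keywords])
cosine = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"Cosine similarity: {cosine:.3f}")
```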
Table 5. Comparison of OCR algorithms.

| Size | PaddleOCR [58]: Duration (s) | PaddleOCR [58]: CER | Tesseract [59]: Duration (s) | Tesseract [59]: CER | EasyOCR [26]: Duration (s) | EasyOCR [26]: CER |
|---|---|---|---|---|---|---|
| | 0.3442 | 0.00% | 1.2262 | 0.00% | 4.1956 | 0.00% |
| | 0.1951 | 0.00% | 0.9713 | 0.00% | 2.9337 | 0.00% |
| | 0.2672 | 0.00% | 0.8805 | 0.00% | 2.7653 | 0.00% |
| | 0.1892 | 0.00% | 0.6989 | 0.00% | 2.9206 | 0.00% |
| | 0.1278 | 0.00% | 0.1531 | 0.00% | 3.6544 | 0.00% |
| | 0.0991 | 5.22% | 0.1338 | 4.26% | 0.9936 | 8.88% |
| | 0.0773 | 7.02% | 0.1204 | 100.00% | 0.2197 | 55.56% |
| | 0.0594 | 15.40% | 0.1010 | 100.00% | 0.3462 | 100.00% |
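The character error rate (CER) used in Tables 5 and 11 is the character-level analogue of WER. A minimal, hedged example with jiwer and placeholder strings:

```python
# Hedged example: CER between ground-truth slide text and OCR output.
# The strings are placeholders, not data from the evaluation set.
import jiwer

ground_truth_text = "Renewable Energy Sources"
ocr_output_text = "Renewabl Energy Source5"

cer = jiwer.cer(ground_truth_text, ocr_output_text)
print(f"CER: {cer:.2%}")  # fraction of character-level errors
```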
Table 6. Recorded presentation videos for evaluations.

| Topic | Artificial Intelligence | Biotechnology | Blockchain | Cybersecurity | Renewable Energy | Environmental | Healthcare | Quantum Computing | Data Science | Space Exploration |
|---|---|---|---|---|---|---|---|---|---|---|
| Duration (m:s) | 5:50 | 5:29 | 6:44 | 6:37 | 6:53 | 6:05 | 5:28 | 4:32 | 6:13 | 8:10 |
| Bitrate (kbps) | 577 | 632 | 557 | 574 | 497 | 664 | 562 | 846 | 517 | 691 |
Table 7. SSIM algorithm results for per-second approach.

| Topic | Seconds | F1 Score | Execution Time (s) |
|---|---|---|---|
| Artificial Intelligence | 350 | 100% | 38.57 |
| Biotechnology | 329 | 100% | 50.47 |
| Blockchain | 404 | 100% | 47.17 |
| Cybersecurity | 397 | 100% | 42.16 |
| Renewable Energy | 413 | 100% | 42.37 |
| Environmental | 365 | 100% | 39.62 |
| Healthcare | 328 | 100% | 35.01 |
| Quantum Computing | 272 | 100% | 27.56 |
| Data Science | 373 | 100% | 39.86 |
| Space Exploration | 490 | 100% | 46.96 |
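The per-second approach evaluated in Table 7 samples one frame per second and flags a slide change when the SSIM against the previously sampled frame falls below a threshold. The sketch below illustrates this idea; the video path and the 0.9 threshold are assumptions for demonstration, not values taken from the paper.

```python
# Hedged sketch of per-second SSIM slide-change detection (Table 7).
import cv2
from skimage.metrics import structural_similarity

video = cv2.VideoCapture("presentation.mp4")  # illustrative path
fps = int(video.get(cv2.CAP_PROP_FPS))
threshold = 0.9  # assumed threshold for demonstration

previous_gray = None
change_timestamps = []
frame_index = 0
while True:
    ok, frame = video.read()
    if not ok:
        break
    if frame_index % fps == 0:  # keep one frame per second
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if previous_gray is not None:
            score = structural_similarity(previous_gray, gray)
            if score < threshold:
                change_timestamps.append(frame_index // fps)
        previous_gray = gray
    frame_index += 1
video.release()
print("Slide changes at seconds:", change_timestamps)
```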
Table 8. Averaged WER on each slide from different English accents.

| Accent | Slide 1 | Slide 2 | Slide 3 | Slide 4 | Slide 5 |
|---|---|---|---|---|---|
| Vietnam | 1.30% | 1.58% | 2.17% | 0.82% | 1.31% |
| Congo | 1.32% | 1.75% | 2.45% | 1.04% | 1.60% |
| Singapore | 0.89% | 1.74% | 2.37% | 0.98% | 1.43% |
| Malaysia | 1.35% | 1.57% | 2.30% | 0.98% | 1.64% |
| Indonesia | 1.36% | 1.76% | 2.41% | 1.79% | 1.40% |
Table 9. ROUGE-1, ROUGE-2, and ROUGE-L for each slide and topic.

| Metric | Slide | Artificial Intelligence | Biotechnology | Blockchain | Cybersecurity | Renewable Energy | Environmental | Healthcare | Quantum Computing | Data Science | Space Exploration |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ROUGE-1 | Slide 1 | 27.4% | 49.9% | 33.9% | 44.4% | 49.9% | 49.9% | 59.5% | 46.1% | 0% | 35.2% |
| ROUGE-1 | Slide 2 | 36.7% | 45.7% | 4.1% | 60.8% | 42.4% | 69.1% | 60.2% | 46.8% | 47.4% | 33.5% |
| ROUGE-1 | Slide 3 | 68.4% | 30.6% | 45.3% | 43.7% | 52.9% | 29.8% | 37.9% | 38% | 50.6% | 31.9% |
| ROUGE-1 | Slide 4 | 43.2% | 48.1% | 51.7% | 55.3% | 32.9% | 41% | 72.7% | 46.8% | 43.9% | 55.7% |
| ROUGE-1 | Slide 5 | 41.5% | - | 46.3% | 38.2% | 19% | 39.6% | 62.7% | 45.3% | 43.9% | 49% |
| ROUGE-2 | Slide 1 | 14% | 19.6% | 6.3% | 9.8% | 21.8% | 27.1% | 40.7% | 20.3% | 0% | 22.9% |
| ROUGE-2 | Slide 2 | 5.9% | 9.5% | 21.7% | 33.3% | 16.2% | 44.7% | 33.6% | 22.2% | 19.1% | 12.8% |
| ROUGE-2 | Slide 3 | 37.5% | 8.6% | 6.8% | 18.8% | 23.2% | 9.3% | 14.7% | 17.1% | 28.8% | 15.5% |
| ROUGE-2 | Slide 4 | 13.1% | 20.8% | 17.9% | 20.2% | 11.7% | 8.1% | 44.4% | 15.9% | 20.6% | 37.8% |
| ROUGE-2 | Slide 5 | 18.9% | - | 16.2% | 8.8% | 5.8% | 10.7% | 20.3% | 21.9% | 10.2% | 29.8% |
| ROUGE-L | Slide 1 | 23.5% | 38.4% | 22.6% | 37% | 45.8% | 49.9% | 59.5% | 46.2% | 0% | 35.2% |
| ROUGE-L | Slide 2 | 22.9% | 25.7% | 39.5% | 55.7% | 34.3% | 64.2% | 55.9% | 36.3% | 27.4% | 19.1% |
| ROUGE-L | Slide 3 | 50.6% | 27% | 34.6% | 37.4% | 44.1% | 20.8% | 31% | 30.9% | 45.8% | 23.9% |
| ROUGE-L | Slide 4 | 23.4% | 40.5% | 34.4% | 33.8% | 21.1% | 24.6% | 62.3% | 36.3% | 26.3% | 39.3% |
| ROUGE-L | Slide 5 | 23.7% | - | 37.6% | 27.6% | 9.5% | 31.6% | 46.5% | 34.6% | 31.7% | 43.1% |
Table 10. Keywords’ cosine similarity from each slide and each topic.

| Slide | Artificial Intelligence | Biotechnology | Blockchain | Cybersecurity | Renewable Energy | Environmental | Healthcare | Quantum Computing | Data Science | Space Exploration | Average (per slide) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Slide 1 | 0.714 | 0.503 | 0.669 | 0.462 | 0.487 | 0.721 | 0.874 | 0.801 | 0.394 | 0.566 | 0.619 |
| Slide 2 | 0.504 | 0.597 | 0.453 | 0.423 | 0.481 | 0.356 | 0.447 | 0.496 | 0.429 | 0.499 | 0.469 |
| Slide 3 | 0.478 | 0.577 | 0.516 | 0.484 | 0.549 | 0.499 | 0.522 | 0.592 | 0.509 | 0.462 | 0.519 |
| Slide 4 | 0.571 | 0.559 | 0.526 | 0.577 | 0.649 | 0.539 | 0.547 | 0.416 | 0.725 | 0.362 | 0.547 |
| Slide 5 | 0.306 | - | 0.316 | 0.340 | 0.360 | 0.399 | 0.416 | 0.547 | 0.269 | 0.441 | 0.377 |
| Average (per topic) | 0.515 | 0.559 | 0.496 | 0.457 | 0.505 | 0.503 | 0.561 | 0.570 | 0.465 | 0.466 | 0.509 |
Table 11. CER results from each slide and each topic from resized video using PaddleOCR.

| Slide | Artificial Intelligence | Biotechnology | Blockchain | Cybersecurity | Renewable Energy | Environmental | Healthcare | Quantum Computing | Data Science | Space Exploration |
|---|---|---|---|---|---|---|---|---|---|---|
| Slide 1 | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
| Slide 2 | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
| Slide 3 | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
| Slide 4 | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
| Slide 5 | 0% | - | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
Table 12. Performance comparison with other existing systems.

| Application | Time (m:s) | Transcription (WER) | Summary: ROUGE-1 | Summary: ROUGE-2 | Summary: ROUGE-L | Keywords (Cosine Similarity) | Runs on | Custom |
|---|---|---|---|---|---|---|---|---|
| MeetingBooster [66] | 1:18 | 2.24% | - | - | - | - | Browser | No |
| Fellow [67] | 2:18 | 2.03% | 40.47% | 23.11% | 31.53% | 0.511 | Browser, Desktop | No |
| Beenote [68] | 5:48 | 2.91% | 38.09% | 21.07% | 29.47% | 0.496 | Desktop | No |
| Piglyph [69] | 3:02 | 2.72% | 42.03% | 25.08% | 33.12% | 0.521 | Desktop | No |
| Tactiq [70] | 3:54 | 2.81% | 39.14% | 22.17% | 30.61% | 0.427 | Browser | No |
| Our proposal | 1:34 | 1.88% | 41.06% | 23.76% | 32.12% | 0.509 | Browser | Yes |
Table 13. Questionnaire questions on usability and effectiveness.

| Question ID | Questionnaire Questions |
|---|---|
| Q1 | The system was easy to use in extracting textual information from presentation videos. |
| Q2 | The system’s interface was intuitive and easy to navigate while correlating information between presentation videos and PowerPoint files. |
| Q3 | I am satisfied with the accuracy of the system in converting audio to text and correlating textual information from both presentation videos and PowerPoint files. |
| Q4 | I am satisfied with the quality of the generated summary and the extracted keywords. |
| Q5 | I would recommend this system to others for similar tasks requiring information correlation between presentation videos and PowerPoint files. |
Table 14. The answers to the questions in the questionnaire.

| Question ID | Strongly Disagree | Disagree | Neutral | Agree | Strongly Agree |
|---|---|---|---|---|---|
| Q1 | 3 | 0 | 6 | 14 | 8 |
| Q2 | 0 | 1 | 9 | 19 | 2 |
| Q3 | 0 | 0 | 11 | 10 | 10 |
| Q4 | 0 | 3 | 16 | 4 | 8 |
| Q5 | 2 | 4 | 3 | 14 | 8 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).