1. Introduction
The shift to online work has made video conferencing essential, leading to a significant increase in online presentations at conferences and meetings. However, the excessive use of online meetings has increased fatigue among participants [1,2]. It is now even common practice to record online meetings for later review, as participants struggle to process all the information during live sessions. Recent advances in artificial intelligence (AI) offer solutions by helping users to retrieve key information efficiently from recorded meetings, reducing the need for time-consuming reviews [3].
AI can capture key information from recorded meetings using audio, visual, and/or text data by applying information extraction (IE) methods [4]. IE methods convert unstructured data into structured formats in order to highlight relevant information. However, existing proprietary systems limit flexibility and adaptability in implementing new models and restrict accessibility for key information retrieval, whereas an open-source approach offers more flexibility in adopting the latest models to achieve the best results. Integrating multiple models and methods into one system requires careful selection. Therefore, in this paper, we investigate whether fully open-source models can be effectively integrated and implemented for the generation of meeting minutes from recorded online meetings.
For this purpose, suitable open-source models for the system need to be identified. Then, the integration of the selected models needs to be conducted in order to develop a modular and adaptable system. Finally, the developed system needs to be compared with existing proprietary systems. The use of open-source models and methods is driven by their wide accessibility, cost effectiveness, and the benefits of continuous community contributions. Coupled with the rapid advancements in natural language processing (NLP), this approach offers the ability to quickly adopt the latest innovations. Thus, it enhances the system performance and improves the quality of generated meeting minutes. This is highly advantageous and beneficial for individuals or organizations in need of such a system.
In this paper, an integrated and modular fully open-source system is proposed by combining multiple methods of video analysis and information extraction. Since online presentations often comprise multiple slides, scene-change detection is applied to segment the video based on slide changes. Then, since the speaker’s speech also contains valuable information, an audio-to-text method is employed to capture this content. Because the resulting text data can be extensive, summarization and keyword extraction techniques are used to quickly convey the main points of the presentation. Considering that the video and textual information are processed separately, inaccurate matching could cause confusion for readers. To address this, an information correlation mechanism using OCR and regular expressions (regex) is applied.
The proposed system was evaluated using ten presentation videos that were created under controlled conditions, although they may not fully reflect the complexity of real-world online presentations. The system’s performance was measured in terms of accuracy and efficiency, using statistical metrics appropriate for each respective method. Additionally, user feedback was collected through a questionnaire, focusing on the quality of transcriptions, summaries, keyword extraction, and the intuitiveness of the user interface.
Building upon the explanation of the proposed system, namely a fully open-source model for generating meeting minutes, the contributions of the proposal can be summarized into the following points:
Designing a meeting minutes generation system by combining several methods that have been implemented with open-source programs.
Selecting the most appropriate open-source models for each method in the system from a wide range of available options, ensuring the best fit for each task, including scene-change detection, audio-to-text conversion, and summarization.
Presenting higher accuracy in transcription and similar performance in summarization and keyword extraction tasks compared to existing proprietary systems.
The paper is structured as follows:
Section 2 provides an overview of previous studies and methodologies.
Section 3 presents the proposed system.
Section 4 presents the selection process used for the open-source models.
Section 5 assesses the system through experiments.
Section 6 discusses the findings and the implications of the results from the system’s evaluation.
Section 7 concludes the paper and summarizes the key outcomes and insights.
2. Related Work in Literature
This section reviews related work on the methodologies used in this paper.
2.1. Video-Splitting Methods
The system segments the recorded presentation video using a scene-change detection algorithm to analyze differences between scenes. Scene-change detection can be classified into two approaches: image pixel data, which refers to the color values of each pixel, and image structure data, which considers the spatial arrangement of edges, shapes, and textures [5].
Common algorithms, such as mean square error (MSE) and histogram matching, offer faster scene detection but fail when pixel-level or geometric alterations occur [6,7]. In contrast, structure-based approaches (e.g., texture, contrast, and luminance) better preserve image information during compression [8]. The structural similarity index measure (SSIM) focuses on structural changes rather than pixel differences, assessing image quality based on human visual perception [9].
Given these insights, the SSIM algorithm is ideal for our use case, which involves analyzing compressed presentation videos without geometric alterations. This choice will be tested in subsequent experiments to validate its effectiveness and justify its selection over other methods to ensure that it meets the specific needs of our system.
2.2. Audio-to-Text
Capturing information from audio is achieved by converting the audio signal into text data [10]. Various approaches for audio-to-text conversion exist, but transformer-based models have recently gained attention for their strong performance and ability to maintain long-term dependencies in audio data. Several transformer-based models, such as Wav2Vec 2.0 and HuBERT, focus on self-supervised learning and are effective for low-resource languages [11,12]. The massively multilingual speech (MMS) model utilizes self-supervised learning to support thousands of languages, offering broad linguistic coverage. However, its extensive coverage comes at the cost of increased computational time [13]. Whisper excels in multilingual and noisy environments, as it is trained on weakly annotated and noisy data [14], although it suffers from occasional hallucinations, where incorrect words or sentences are generated.
Given Whisper’s strengths in handling multilingual and noisy environments, it appears to be a strong candidate for our system. However, further experiments comparing these models are needed to evaluate their suitability against other options and to validate their performance.
2.3. Abstractive Summarization
Abstractive summarization allows for creating summaries with human-like qualities, generating new words and sentence structures while preserving the original meaning of the source text [15]. Several methods have leveraged the transformer architecture due to its self-attention mechanism, which can preserve information from lengthy text data. The text-to-text transfer transformer (T5) handles a wide range of NLP tasks by framing them as text-to-text problems, benefiting from its training on diverse datasets [16]. However, this versatility comes at the cost of sub-optimal performance in specialized abstractive summarization tasks. The pre-training with extracted gap-sentences for abstractive summarization (PEGASUS) model, explicitly designed for summarization by Google, performs exceptionally well in this domain [17]. However, its specialized nature limits generalization across topics and requires computationally expensive fine-tuning for optimal results. In contrast, the bidirectional and auto-regressive transformer (BART) balances generalization and accuracy by leveraging diverse datasets and combining a bidirectional encoder with an auto-regressive decoder [18]. This dual approach improves word generation and preserves context over long sequences, although it increases computational cost and slows down processing due to its complex architecture.

Given the diverse topics covered in the recorded presentation videos and prior evaluation [19], BART was selected for its broad generalizability across different topics, making it suitable for generating summaries in this use case.
2.4. Keyword Extraction
In addition to summarization, the extraction of keywords from individual documents helps to achieve a concise understanding of the presented information. Recent approaches leverage automatic keyword extraction methods that employ language model embeddings, statistical approaches, graph-based architectures, and unsupervised learning to effectively manage complex relationships between words. KeyBERT uses bidirectional encoder representations from transformers (BERT) embeddings to rank keywords, providing high accuracy but requiring significant computational resources [20]. TextRank, a graph-based method, measures the influence of words based on their co-occurrence, which is effective but increases the computational load when handling longer texts [21]. RAKE and YAKE are both fast and simple due to their reliance on statistical techniques, although they lack the contextual depth and complexity needed for more complicated tasks [22,23]. Meanwhile, RaKUn2 combines a graph-based approach with unsupervised learning to cluster similar words, then applies load centrality to identify the most influential word clusters [24]. This mechanism reduces the number of graphs and lowers both the computational load and complexity, offering a more scalable solution for large and complex texts.

Given the diverse topics in the recorded presentation videos and the prior evaluation of various keyword extraction methods [25], RaKUn2 was selected for its ability to balance computational efficiency with accurate keyword extraction, making it well suited for managing the complexity of this use case.
2.5. Optical Character Recognition (OCR)
Extracting text from visual data enables a machine to understand the information it contains. OCR utilizes pattern recognition to interpret the characters in an image, enabling the digitization and extraction of text for various applications.
Recent OCR models leverage machine learning, resulting in varying performance based on datasets and architectures.
EasyOCR combines LSTM and CNN for text recognition, offering a simple and fast architecture but sacrificing accuracy, especially in complex extraction scenarios with varying font sizes [26]. Tesseract, developed by Google, provides broad language support and performs well on high-quality printed or written documents. However, its accuracy diminishes when handling low-resolution or compressed images, or text with small fonts [27]. In contrast, PaddleOCR excels at extracting text from images with complex orientation, compression, or composition due to its multi-stage process and training on diverse image styles. However, this complexity results in higher computational costs, making it less suitable for real-time applications or systems with limited computational power [28].
Given Tesseract’s struggles with small fonts and EasyOCR’s sensitivity to background variations, PaddleOCR offers a more robust solution for extracting text from challenging images, including those that are slightly blurred or low-resolution. While its more complex architecture increases execution time, leveraging cloud computing can mitigate this drawback by enhancing computational power, making PaddleOCR a suitable choice even in environments where performance is critical.
3. Proposal of Meeting Minutes Generation System
This section describes the methodology and workflow of the proposed system.
3.1. System Overview
The proposed system aims to provide users with flexibility in its use of open-source models as well as accessibility. For this purpose, the proposed system is implemented on top of several functions and deployed as a web application, as shown in Figure 1. The system architecture consists of the client side and the server side. The client side comprises the input function and the output function. The input function accepts a pair consisting of a recorded online presentation and the document used during the presentation. The output function visualizes the transcribed audio, the generated summaries, the extracted keywords, and the matched slides through the user interface (UI). The server side is responsible for processing the video to generate meeting minutes. Since each slide in the video may contain different information, the system first segments the video by slides. This process uses the split video function, which includes the scene-change detection algorithm and the segmentation mechanism [29]. The scene-change detection algorithm compares visual information between adjacent frames and splits the video at the frame where the visual information changes.
Then, the audio from each segmented video is transcribed into text using the audio-to-text function [30,31]. This process converts the speaker’s voice into a text format. From the text, the information extraction function generates the summary and extracts the keywords to help users quickly grasp the key points of the speaker’s message [19,32].
At this stage, the system has generated the transcription, summary, and keywords for each video segment. However, these data still need to be linked to the corresponding slide. To achieve this, the information correlation function is applied. This function uses OCR, the python-pptx library, and regular expressions (regex) to match the extracted information with the corresponding slide [33]. Finally, the system organizes the correlated information through the slide output information function, which is displayed to the user through the client-side interface.
3.2. Input Function
The presentation video contains both visual and audio data, with a single presenter using screen sharing. The video format must be compatible with ffmpeg, and the resolution should be . The presentation document provides the text content of the presentation, which is used as the ground truth for validation. Currently, the document must be in .pptx format, with no images, shapes, or animations, and a single background color to minimize complexity. Figure 2 shows an example of an acceptable document layout.
3.3. Split-Video Function
As explained above, since the presentation video may contain multiple slides, there is a possibility of information overlap between them. A scene-change detection algorithm and a segmentation mechanism are applied to prevent this. Inspired by the work of Bulut et al. [34], we adopt the split video function to separate the input video into multiple segments using a scene-change detection algorithm. This algorithm detects differences between adjacent frames to determine segmentation points. Then, the segmentation mechanism splits the video at these segmentation points. Finally, the function produces segmented videos and segmentation indices. The segmentation index denotes the order of the segments within the input video and serves as the identifier for each segment. These segmented videos are then processed individually in the audio-to-text function to transcribe the audio. The segmentation index is attached to each extracted piece of information produced by the information extraction function.
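For illustration, a minimal sketch of the frame-level comparison underlying this function is given below. It assumes the OpenCV and scikit-image libraries, and the similarity threshold is an illustrative value rather than the one tuned for our implementation.

```python
# Minimal sketch: decide whether two adjacent frames belong to different slides.
# Assumes OpenCV and scikit-image; the 0.9 threshold is illustrative only.
import cv2
from skimage.metrics import structural_similarity as ssim

def is_scene_change(frame_a, frame_b, threshold=0.9):
    """Return True when the structural similarity of two frames falls below the threshold."""
    gray_a = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)
    score = ssim(gray_a, gray_b)  # 1.0 means the frames are structurally identical
    return score < threshold
```

A segmentation point is recorded whenever this check returns True, and the video is cut at that frame.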
3.4. Audio-to-Text Function
This function employs an audio-to-text algorithm to convert audio signals into text. Before the audio from each segmented video is transcribed, normalization is performed: resampling with ffmpeg converts the audio to the standard sampling rate of 16 kHz. Current audio-to-text models use transformer architectures because of their efficiency in handling long-range dependencies in audio data.
Then, pre-processing converts the raw audio into a Mel spectrogram, a 2D representation of the audio’s frequency content over time. Some models add a convolutional feature encoder to extract patterns and reduce the data’s complexity. The core component of the transformer architecture is the context model built on the self-attention mechanism, which captures important connections across the entire audio sequence and allows the model to relate both short and long word sequences. The sequence order is encoded by positional encoding. Lastly, post-processing ensures the transcription is coherent, adding punctuation and making grammatical adjustments where needed. This design allows the transformer-based model to process entire sequences in parallel, improving transcription accuracy by attending to different parts of the input simultaneously.
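A minimal sketch of this step is shown below. It assumes that ffmpeg is available on the system path and uses the distil-whisper/distil-medium.en checkpoint listed in the model selection; the file names are placeholders.

```python
# Minimal sketch: normalize a segment's audio to 16 kHz mono and transcribe it.
# Assumes ffmpeg on the PATH and the Hugging Face Transformers library.
import subprocess
from transformers import pipeline

# Resample the segment's audio track to the standard 16 kHz, single-channel WAV.
subprocess.run(
    ["ffmpeg", "-y", "-i", "segment_01.mp4", "-ar", "16000", "-ac", "1", "segment_01.wav"],
    check=True,
)

# Transcribe the normalized audio with a Whisper-family model.
asr = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-medium.en",
    chunk_length_s=30,  # process long audio in 30 s chunks
)
transcript = asr("segment_01.wav")["text"]
print(transcript)
```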
3.5. Information Extraction Function
The information extraction function produces summaries and extracts keywords from the text, attaching the corresponding segmentation index to each result.
3.5.1. Abstractive Summarization
Abstractive summarization generates summaries while retaining the original semantic content of the source text [35]. In this study, the bidirectional and auto-regressive transformers language model (BART LM) is employed to generate accurate and coherent summaries [36]. This selection follows the results of our previous work measuring the performance of abstractive summarization models [19].
The BART LM model is downloaded through the HuggingFace implementation [37]. The BART LM model adopts a bidirectional encoder and an auto-regressive decoder. The bidirectional encoder allows the model to capture contextual representations between words appearing before and after the current word in the transcribed text. It encodes these data into numerical vectors, amplifying significant features and attenuating uninfluential ones using an attention mechanism. The auto-regressive decoder enables the model to produce text by predicting the most probable next word given the preceding words. This iterative process is repeated through the layers of the architecture. Finally, the model transforms these numerical vectors back into text.
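A minimal sketch of this step using the Hugging Face Transformers pipeline is shown below. The checkpoint name facebook/bart-large-cnn is an assumed, commonly used summarization fine-tune given for illustration and may differ from the exact checkpoint used in our implementation; the input text is a placeholder.

```python
# Minimal sketch: abstractive summarization of a transcribed segment with a BART LM.
# The checkpoint name is an assumption for illustration purposes.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

transcribed_text = (
    "The transcribed speech of one video segment goes here. "
    "It may span several sentences describing the content of the slide."
)

# max_length and min_length bound the length of the generated summary (in tokens).
summary = summarizer(transcribed_text, max_length=60, min_length=10, do_sample=False)
print(summary[0]["summary_text"])
```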
3.5.2. Keyword Extraction
The keyword extraction process automatically ranks and extracts the most relevant words from a source text. The ranks are determined based on word correlations, frequencies, and semantic meaning [38]. The rank-based keyword extraction via unsupervised learning (RaKUn2) model is downloaded from its GitHub repository [39]. In this study, RaKUn2 is employed due to its ability to deliver fast and accurate results across different keyword datasets [25]. This selection aligns with the findings of our previous work, where various keyword extraction methods were evaluated and the effectiveness of RaKUn2 in diverse scenarios was confirmed [19].
The keyword extraction first divides each sentence of the transcribed text into words and characters called tokens, a process known as tokenization. The relationship between tokens is determined by their co-occurrence frequency, which is calculated from how often each token appears within a sentence. A graph is then built in which the tokens are the nodes and the edges carry the co-occurrence frequencies, and the tokens are ranked according to the number of edges in the graph. Finally, post-processing removes duplicated keyword candidates.
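A minimal usage sketch is shown below. The class name and hyperparameter keys follow the examples in the rakun2 GitHub repository [39]; the hyperparameter values and the input text are illustrative only and may need adjustment.

```python
# Minimal sketch: keyword extraction with RaKUn2, following the usage shown in
# the rakun2 repository; hyperparameter values here are illustrative.
from rakun2 import RakunKeyphraseDetector

hyperparameters = {
    "num_keywords": 10,      # number of keywords to return
    "merge_threshold": 1.1,  # controls merging of similar candidates
    "alpha": 0.3,
    "token_prune_len": 3,    # drop very short tokens
}

detector = RakunKeyphraseDetector(hyperparameters)
transcribed_text = "The transcribed speech of one video segment goes here."
keywords = detector.find_keywords(transcribed_text, input_type="string")
print(keywords)  # list of (keyword, score) pairs
```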
3.6. Information Correlation Function
The information correlation function compares and validates the text data from the recorded presentation video and the extracted information against the original text from the presentation document. The workflow of this function is visualized in Figure 3.
First, it recognizes the text from each frame in the video using OCR. Since the video was resized, only the text in the title area is recognized and extracted. For the comparison, the original text is gathered from the title text of each slide in the presentation document using the python-pptx package.
Then, it compares the OCR text with the original text using the regex string matching to determine which slide index the OCR text corresponds to. The slide index refers to the order of slides that appear in the presentation document. This process determines the identical strings that appear in both OCR text and the original text.
Finally, after the slide index is known, the extracted information is assigned to the similar segmentation index. This process ensures that the information extracted from the recorded presentation is accurately correlated with the corresponding slides, thus improving the overall accuracy and usefulness of the extracted information and avoiding confusion for the readers.
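A minimal sketch of this matching step is given below. It assumes the PaddleOCR and python-pptx packages; the matching rule (all title tokens must appear in the OCR text) and the file names are illustrative simplifications of the regex-based matching described above.

```python
# Minimal sketch: correlate OCR'd title text with slide titles from the .pptx file.
# Assumes PaddleOCR and python-pptx; the matching rule is a simplified illustration.
import re
from paddleocr import PaddleOCR
from pptx import Presentation

ocr = PaddleOCR(lang="en")

def ocr_title_text(frame_path):
    """Recognize the text in a frame image and return it as one lowercase string."""
    result = ocr.ocr(frame_path)
    lines = result[0] if result and result[0] else []
    return " ".join(line[1][0] for line in lines).lower()

def slide_titles(pptx_path):
    """Collect (index, title) pairs for every slide in the presentation document."""
    prs = Presentation(pptx_path)
    return [(i, slide.shapes.title.text.lower())
            for i, slide in enumerate(prs.slides)
            if slide.shapes.title is not None]

def match_slide_index(ocr_text, titles):
    """Return the index of the slide whose title tokens all occur in the OCR text."""
    for index, title in titles:
        tokens = re.findall(r"\w+", title)
        if tokens and all(re.search(rf"\b{re.escape(t)}\b", ocr_text) for t in tokens):
            return index
    return None
```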
3.7. Slide Output Information Function
Following the validation of data pairs by the information correlation function, the data need to be organized for easy selection and management. The data include the extracted information, the word-error rate (WER) score for the audio-to-text conversion [40], the character-error rate (CER) score for the OCR accuracy [41], and the image of each slide. This function arranges the data in JSON format to ensure efficient handling and presentation. Figure 4 shows an example of the organized data for each slide, ready for display through the UI on the client side.
In Figure 4, the slide_index indicates the order of the slides. The slide_image represents the visual information shown in the recorded video. The slide_ocr_text shows the result of OCR. The slide_presentation_text shows the original text of the slide. The slide_cer displays the validation result of the information correlation process. The slide_summary and slide_keywords provide a brief explanation of the information. The slide_convert_text displays the transcribed text from the audio-to-text process. The slide_wer indicates the accuracy of the audio-to-text conversion.
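To make the structure concrete, a minimal sketch of one such per-slide record is given below. The field names follow the description above, while all values are placeholders rather than actual system output.

```python
# Minimal sketch: one per-slide record arranged in JSON; values are placeholders.
import json

slide_record = {
    "slide_index": 1,
    "slide_image": "slides/slide_01.png",
    "slide_ocr_text": "introduction",
    "slide_presentation_text": "Introduction",
    "slide_cer": 0.0,
    "slide_summary": "A short abstractive summary of this segment.",
    "slide_keywords": ["meeting", "minutes", "generation"],
    "slide_convert_text": "The transcribed speech of this segment.",
    "slide_wer": 0.02,
}

print(json.dumps(slide_record, indent=2))
```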
3.8. Output
The output displays the structured data in a web application built with the Streamlit Python package [42]. Streamlit integrates with the React user interface framework to develop web applications. Figure 5a shows the UI where users upload the recorded presentation video and the associated document. Two input fields are provided to upload the required data. Once the video is loaded, it is displayed. Figure 5b shows the UI after the uploaded data have been processed, where the results are displayed based on the data stored in JSON format. It shows the corresponding slide image along with the extracted information, such as the WER and CER scores, the text extracted via OCR, and the title text extracted using the python-pptx library. Additionally, the transcriptions, summaries, and keywords generated from the transcription are displayed.
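A minimal sketch of such a client-side page is given below. The widget labels and the way results are loaded from JSON are illustrative; the actual interface shown in Figure 5 may differ.

```python
# Minimal sketch: a Streamlit page with two upload fields and per-slide results.
# Widget labels and the JSON loading path are illustrative placeholders.
import json
import streamlit as st

st.title("Meeting Minutes Generation System")

video_file = st.file_uploader("Recorded presentation video", type=["mp4"])
pptx_file = st.file_uploader("Presentation document", type=["pptx"])

if video_file is not None:
    st.video(video_file)  # display the uploaded video

if video_file is not None and pptx_file is not None:
    # In the real system, the server-side pipeline produces this JSON file.
    with open("meeting_minutes.json") as f:
        slides = json.load(f)
    for slide in slides:
        st.subheader(f"Slide {slide['slide_index']}")
        st.image(slide["slide_image"])
        st.write("Summary:", slide["slide_summary"])
        st.write("Keywords:", ", ".join(slide["slide_keywords"]))
        st.write(f"WER: {slide['slide_wer']:.2%}, CER: {slide['slide_cer']:.2%}")
```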
4. Selection of Open-Source Models
This section outlines the process for selecting appropriate open-source models for the system.
4.1. Selection of Scene-Change Detection Algorithms
Since the algorithm needs to determine the transition points in video recordings based on scene changes, an accurate and fast technique is essential. However, while speed is important, accuracy must not be compromised, as it would impact the quality and disrupt subsequent processes. To meet these requirements, several algorithms from the OpenCV and Scikit-Image libraries were compared. The results are presented in Table 1.
Table 1 demonstrates the superiority of SSIM over the other image comparison algorithms. SSIM achieves perfect precision, recall, and F1 scores. In contrast, histogram comparison and MSE struggle, especially with brightness variations and compression artifacts, leading to misclassified frames. Though SSIM requires a longer processing time than MSE, its structural approach results in higher accuracy, since it considers luminance, contrast, and image structure. This precision justifies the extra computation time and makes SSIM the preferred choice for scene-change detection.
4.2. Selection of Audio-to-Text Models
The audio from each segmented video is extracted and converted into text. Selecting an accurate model is essential to ensure that subsequent processes accurately reflect the information conveyed in the audio. Additionally, model efficiency is essential to reduce computational time. Experiments were conducted to measure the WER and the computational duration (in seconds) of models from HuggingFaceHub to identify the most suitable model. Table 2 shows the results.
Table 2 compares the WER and the computational duration of the models. The Whisper model achieved the lowest WER and the shortest duration. While Wav2Vec had a relatively short duration, its WER was the highest. Therefore, the Whisper model is the most suitable and is selected for the system’s audio-to-text conversion.
4.3. Selection of Summarization Models
Each transcribed text from the segmented video is summarized to help readers quickly grasp the presentation’s main points. The summary is generated using an abstractive approach, leveraging a language model (LM) to produce coherent and concise output. Experiments were conducted to select the most suitable model from HuggingFaceHub. Table 3 shows the results.
Table 3 shows that the BART model outperforms PEGASUS and T5, achieving the highest ROUGE-1, ROUGE-2, and ROUGE-L scores as well as the fastest computational time. BART’s bidirectional encoder and auto-regressive decoder enable it to effectively process information from both the beginning and the end of sentences, allowing it to generate more coherent summaries. Despite its relatively complex architecture, BART also achieved the best computational efficiency among the compared models. Based on these results, BART is the most suitable model for generating summaries in the system.
4.4. Selection of Keyword Extraction Models
Keywords are extracted to emphasize the key topics discussed in each transcribed segment. Various models were tested for keyword extraction, and a detailed comparison was made to identify the best-performing model. Table 4 shows the results.
Table 4 highlights the performance differences across the keyword extraction models. RaKUn2 achieved the highest cosine similarity score at 47.70%, suggesting that it is relatively effective at identifying relevant keywords within transcribed segments. Based on these results, RaKUn2 emerged as the most suitable model for keyword extraction and is employed in the system. However, this measurement only considers lexical similarity; other factors, such as keyword diversity, duplication, and representativeness, are not captured by this metric alone. Therefore, these aspects should be explored further to gain a more complete picture of keyword extraction accuracy.
4.5. Selection of OCR Programs
The extracted information and slide image are aligned based on the extracted OCR text. Therefore, an accurate OCR program must be selected. To justify the selection, multiple OCR programs are compared, and the lowest CER is chosen as the OCR for the system.
Table 5 shows that PaddleOCR consistently delivers the fastest processing time and maintains high accuracy at any image resolution, showing the best overall performance. Tesseract performs well at higher resolutions but suffers from a significant accuracy drop at lower ones, reaching 100% CER below , as denoted by the bold values in Table 5. EasyOCR is the slowest and least accurate, especially at lower resolutions, where error rates rise sharply. Thus, PaddleOCR is the most reliable and is selected as the best fit for the system.
5. Evaluations
In this section, the implementation of the proposed system is evaluated.
5.1. Experiment Preparation
Ten recorded presentation videos on different topics with varying slide lengths were prepared as experimental materials to test the proposed system’s ability to handle diverse content. The videos were recorded by speakers with five non-native English accents, from Vietnam (V), Congo (C), Singapore (S), Malaysia (M), and Indonesia (I). The recordings were created using the Zoom screen-share function and captured through its meeting recording function. The output was a video file in .mp4 format, with a resolution of .
Table 6 provides the details of each recorded presentation video.
The video recordings have durations ranging from 4 to 8 minutes and an average bitrate of 611 kbps, as shown in Table 6. The proposed system is evaluated through statistical measurements and a user questionnaire. Each model’s and algorithm’s output is compared to reference data. This includes evaluating the F1-score for scene-change detection, the WER for audio-to-text conversion, ROUGE-N and ROUGE-L for abstractive summarization, cosine similarity for keyword extraction, and the CER for OCR. Then, the user questionnaire assessed the system’s practical performance and the user experience. Each participant’s session lasted about 45 min. Participants watched ten presentation videos covering varied topics and accents to introduce speech pattern diversity. After viewing, users reviewed the system-generated meeting minutes and completed the questionnaire, evaluating the system’s usability, interface intuitiveness, meeting minute quality, audio transcription accuracy, and scene segmentation effectiveness.
5.2. Limitations and Biases of Recorded Videos
The recorded videos used in this study were designed to test the system across varied topics, presentation styles, and speaker accents, though some limitations should be noted. Speakers from Vietnam, Congo, Singapore, Malaysia, and Indonesia introduce accent variations. However, this sample set does not fully capture global English diversity. The topics, focusing on technical fields like AI and biotechnology, present structured, content-rich scenarios that may not depict more casual presentations. Audio quality was managed by recording in quiet rooms using standard laptop microphones, minimizing noise but possibly influencing transcription accuracy. Speech patterns were kept natural, though accents may still introduce phonetic challenges for the system. Presentation slides were standardized in the .pptx format without images or complex formatting, ensuring consistency but limiting the representation of real-world design diversity. These constraints provide a controlled testing environment but may not fully reflect broader real-world variability.
5.3. Performance of Split Video Function
This section evaluates the practical impact of the SSIM algorithm. It categorizes video frames based on slide transitions as “change” or “same” to accurately segment the video. The video resolution was reduced to , and the algorithm was applied to one frame per second instead of every frame to optimize processing time. This approach significantly lowered the execution time, enabling SSIM to segment videos within one minute while maintaining a perfect F1 score, as shown in Table 7. These results indicate that each detected frame change accurately corresponds to an actual scene transition in the video.
5.4. Performance of Audio-to-Text Function
The audio-to-text function utilizes the Whisper model to convert audio to text, and its performance is evaluated using the WER. The audio is normalized to a sampling rate of 16 kHz. Despite the varied speech of speakers from non-English-speaking countries, including Vietnam, Congo, Singapore, Malaysia, and Indonesia, and the varied durations, the Whisper model demonstrates a low average WER, as shown in Table 8.
Table 8 shows that the Whisper model consistently achieved low WER values across different accents on all the slides, emphasizing Whisper’s robustness in handling speakers with varied accents. However, some slides that contain the main information (e.g., slide 2 and slide 3) show slightly higher WER values. This behavior may be attributed to the technical terms used in the presentation. Nevertheless, on the other slides, the model continues to deliver accurate transcriptions that are comparable to those of professional human transcribers [60].
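For reference, the WER between a reference transcript and a model hypothesis can be computed as sketched below; the jiwer package is assumed, and the example strings are placeholders.

```python
# Minimal sketch: word-error rate between a reference transcript and a hypothesis.
import jiwer

reference = "the proposed system generates meeting minutes from recorded presentations"
hypothesis = "the proposed system generates meeting minutes from recorded presentation"

wer = jiwer.wer(reference, hypothesis)  # fraction of word-level substitutions, insertions, deletions
print(f"WER: {wer:.2%}")
```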
5.5. Performance of Information Extraction Function
The information extraction function uses BART LM for abstractive summarization and RaKUn2 for keyword extraction. BART LM is integrated into the proposed system due to its strong text generation capabilities, as evidenced by the prior selection and reported in several studies [61,62]. Similarly, RaKUn2 was chosen for its extraction of representative keywords, supported by the previous selection criteria and confirmed by [63].
Table 9 shows that the BART LM model performs variably across slides and topics, with ROUGE-1, ROUGE-2, and ROUGE-L scores peaking on slides 2, 3, and 4 across most topics, reflecting the typical presentation structure. ROUGE scores assess summary quality by comparing generated summaries to reference ones. Specifically, ROUGE-1 measures unigram (single-word) overlap, ROUGE-2 evaluates fluency through bigram (two-word) overlap, and ROUGE-L checks the longest common sub-sequence (continuous adjacent words) to assess structural coherence. High scores across technical and general topics indicate that the system produces coherent, reliable summaries across varied domains. This consistency suggests that BART LM is suitable for various summarization tasks.
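For reference, the ROUGE scores between a generated summary and its reference can be computed as sketched below; the rouge-score package is assumed, and the texts are placeholders.

```python
# Minimal sketch: ROUGE-1, ROUGE-2, and ROUGE-L between a reference and a generated summary.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference_summary = "The slide introduces the proposed meeting minutes generation system."
generated_summary = "The slide presents the proposed system for generating meeting minutes."

scores = scorer.score(reference_summary, generated_summary)
for name, score in scores.items():
    print(name, f"F1 = {score.fmeasure:.2%}")
```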
The keyword extraction is evaluated by measuring the similarity between the extracted and reference keywords using cosine similarity [64]. Table 10 illustrates the cosine similarity of the extracted keywords for each slide of every topic. It shows that the RaKUn2 algorithm achieved an average cosine similarity score of 0.509 across all the slides and topics, accurately capturing over half of the reference keywords. While effective for general keyword identification, cosine similarity relies mainly on lexical overlap, potentially missing deeper semantic nuances, especially in topics requiring complex contextual understanding [65].
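The lexical cosine similarity between an extracted keyword list and a reference list can be computed as sketched below; scikit-learn is assumed, and the keyword lists are placeholders.

```python
# Minimal sketch: lexical cosine similarity between extracted and reference keyword lists.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

extracted = ["meeting", "minutes", "whisper", "summarization"]
reference = ["meeting", "minutes", "transcription", "summarization"]

# Represent each keyword list as a bag-of-words vector, then compare the two vectors.
vectors = CountVectorizer().fit_transform([" ".join(extracted), " ".join(reference)])
similarity = cosine_similarity(vectors[0], vectors[1])[0, 0]
print(f"Cosine similarity: {similarity:.3f}")
```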
5.6. Performance of Information Correlation Function
The information correlation function aligns the extracted text, summaries, and keywords with the corresponding presentation slides. Supported by the selection in Table 5, PaddleOCR is used to detect the title text from segmented video frames, which is then matched to the presentation document through regex matching. Since this matching mechanism relies on string-level similarity, the perfect OCR text extraction shown in Table 11 ensures error-free matching. However, as the presentations in this study are controlled, these results may not directly generalize to real-world scenarios where design variations are common.
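For reference, the CER between an OCR result and the corresponding slide title can be computed as sketched below; the jiwer package is assumed, and the strings are placeholders.

```python
# Minimal sketch: character-error rate between a slide title and its OCR result.
import jiwer

title_from_slide = "proposal of meeting minutes generation system"
title_from_ocr = "proposal of meeting minutes generation system"

cer = jiwer.cer(title_from_slide, title_from_ocr)  # 0.0 means a perfect OCR match
print(f"CER: {cer:.2%}")
```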
5.7. Performance Comparison with Other Meeting Minutes Systems
This experiment evaluates the performance of the proposed system by comparing it with other existing systems. Due to the proprietary nature of these systems, data were manually gathered from their free or trial versions. The execution time, meeting transcription, summary generation, keyword extraction, interface, and customization capabilities were compared to demonstrate the effectiveness of integrating open-source models relative to professionally developed systems. The key metrics in this comparison included the WER for transcription accuracy, ROUGE-N scores for summary quality, and cosine similarity for keyword extraction relevance. Table 12 shows the comparison results.
In Table 12, the bolded values represent the highest scores, while the underlined values denote the second-highest scores. These results indicate that the proposed system achieved the lowest WER of 1.88%, outperforming the other systems in transcription accuracy. While MeetingBooster had the fastest execution time, it only provided transcription. In contrast, our system handles transcriptions, summaries, and keywords while achieving the second-best time (1:34). For summarization, Piglyph achieved the highest ROUGE-N metrics, with our system following closely with the second-highest scores: 41.06% for ROUGE-1, 23.76% for ROUGE-2, and 32.12% for ROUGE-L. Piglyph also led in keyword extraction with a cosine similarity of 0.521, while our system achieved 0.509. The browser-based design offers easy access, and the system’s use of customizable open-source models enables model updates, ensuring adaptability to the latest advancements.
Although our system does not consistently achieve the highest scores, its performance closely aligns with professionally developed systems. The system performs best with simple presentation formats, and further testing is needed to confirm its effectiveness with more complex layouts, diverse speaker styles, and varied slide designs.
5.8. Questionnaire Results on Usability and Effectiveness
The usability and effectiveness of the proposed system were evaluated using a questionnaire based on the Performance, Information, Economics, Control and Security, Efficiency, and Service (PIECES) framework [71]. The questionnaire was administered to 31 respondents. Each respondent first reviewed a provided presentation document, listened to the presentation recording, and executed the proposed system. Feedback was subsequently collected through the questionnaire, with the specific questions listed in Table 13.
Each question represents a key aspect of the framework. Questions Q1, Q2, and Q5 focus on usability, such as ease of use, an intuitive interface, and the likelihood of recommending the system. Specifically, Q1 evaluates the Service aspect by measuring how quickly users can extract information. Q2 examines the Control and Security aspects by assessing how intuitively users can manage the system. Q5 checks the Economics and Efficiency aspects by gauging whether users would recommend the system for similar tasks. Q3 and Q4 focus on the effectiveness of the system. Q3 assesses the Information aspect by evaluating the accuracy of the audio-to-text conversion. Q4 addresses the Performance aspect, reflecting the quality of the summaries and keyword extraction. Table 14 shows the responses to the questionnaire.
As shown in Table 14, the system received positive feedback on ease of use, interface intuitiveness, and recommendation likelihood. Most users found the system easy to navigate: twenty-two respondents agreed that it was simple to extract information from videos, and twenty-one appreciated its intuitive design. These findings align with the performance of the Whisper model, which achieved a low WER. However, the mixed responses on summary and keyword quality, as indicated by the answers to Q4, suggest that while the BART model and RaKUn2 performed well, there is room for improvement. The system accurately aligned the extracted information with the presentation slides, as indicated by the responses to Q2, utilizing PaddleOCR and regex. Overall, the system is user-friendly, accurate, and effective, with potential areas for enhancement.
6. Discussion
Online meetings and online presentations have become essential with the rise of remote work, but they often leave participants fatigued due to information overload and tightly packed meetings. To address this, the proposed system extracts, summarizes, and correlates the key information from recorded meetings using fully open-source AI models. The system enables users to review essential content without replaying entire sessions and to update the system whenever a new advancement in AI models occurs. Following a modular and task-specific approach, the audio, text, and visual data are each processed individually to maximize computational efficiency and accuracy.
The WER achieved by Whisper is very close to the accuracy of human transcription [14]. Although the ROUGE-N scores, cosine similarity, and user feedback obtained in this study indicate slight limitations in generating cohesive summaries and representative keywords in practical settings, their usefulness often depends on how effectively they capture the critical information discussed in the meeting.
Another limitation of the proposed system is that it was tested with controlled recorded online meeting presentations. These presentations excluded images, figures, and animations and used a single background color with the standard font size and style. The speech pattern was kept uniform, with no variation in speed or intonations.
Considering these limitations, future investigations will focus on varied presentation formats and speech patterns. More complex scenarios will represent real-world challenges. Additional techniques or model fine-tuning may be necessary to ensure the system can generate usable information from more complex inputs.
7. Conclusions
Online meetings and presentations have become essential with the rise of remote work, often leading to participant fatigue from information overload and back-to-back sessions. The proposed system addresses this issue by extracting, summarizing, and correlating key information from recorded meetings. Built entirely from open-source AI models, the system employs Whisper for audio-to-text transcription, BART LM for summarization, RaKUn2 for keyword extraction, SSIM for scene-change detection, and PaddleOCR together with regex for extracting and correlating visual and textual data.
The evaluation of ten recorded presentations demonstrates the system’s accuracy, emphasizing its potential to streamline note-taking for meetings, conferences, and seminars. The practical impact could benefit organizations and educational settings that rely on efficient meeting reviews. However, accommodating diverse and dynamic presentation designs and varied speech patterns remains an area for future investigation.
Author Contributions
Conceptualization, A.L.H., N.F. and S.S.; methodology, A.L.H.; software, A.L.H.; validation, A.L.H.; resources, S.S.; data curation, E.D.F.; writing—original draft preparation, A.L.H.; writing—review and editing, N.F., Y.Y.F.P. and S.S. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.
Acknowledgments
The authors thank the reviewers for their thorough reading and helpful comments.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Standaert, W.; Thunus, S.; Schoenaers, F. Virtual meetings and wellbeing: Insights from the COVID-19 pandemic. Inf. Technol. People 2023, 36, 1766–1789. [Google Scholar] [CrossRef]
- Bergmann, R.; Rintel, S.; Baym, N.; Sarkar, A.; Borowiec, D.; Wong, P.; Sellen, A. Meeting (the) pandemic: Videoconferencing fatigue and evolving tensions of sociality in enterprise video meetings during COVID-19. Comput. Support. Coop. Work. (CSCW) 2023, 32, 347–383. [Google Scholar] [CrossRef] [PubMed]
- Apostolidis, E.; Adamantidou, E.; Metsai, A.I.; Mezaris, V.; Patras, I. Video summarization using deep neural networks: A survey. arXiv 2021, arXiv:2101.06072. [Google Scholar] [CrossRef]
- Mannai, M.; Karâa, W.B.A.; Ghezala, H.H.B. Information extraction approaches: A survey. In Proceedings of the Information and Communication Technology, Bangkok, Thailand, 12–13 December 2016; pp. 289–297. [Google Scholar]
- Yu, Y.; Wang, C.; Fu, Q.; Kou, R.; Huang, F.; Yang, B.; Yang, T.; Gao, M. Techniques and challenges of image segmentation: A review. Electronics 2023, 12, 1199. [Google Scholar] [CrossRef]
- Fuad, M.; Ernawan, F.; Hui, L. Video scene change detection based on histogram analysis for hiding message. J. Phys. Conf. Ser. IOP Publ. 2021, 1918, 042141. [Google Scholar] [CrossRef]
- Bhuiyan, M.A.A.; Khan, A.R. Image quality assessment employing RMS contrast and histogram similarity. Int. Arab J. Inf. Technol. 2018, 15, 983–989. [Google Scholar]
- Gore, A.; Gupta, S. Full reference image quality metrics for JPEG compressed images. AEU Int. J. Electron. Commun. 2015, 69, 604–608. [Google Scholar] [CrossRef]
- Shen, J.; Jiang, X.; Zhong, J.; Yao, S. Scene change detection based on sequence statistics using structural similarity. In Proceedings of the 2022 4th International Academic Exchange Conference on Science and Technology Innovation (IAECST), Guangzhou, China, 9–11 December 2022; pp. 1179–1182. [Google Scholar]
- Alharbi, S.; Alrazgan, M.; Alrashed, A.; Alnomasi, T.; Almojel, R.; Alharbi, R.; Alharbi, S.; Alturki, S.; Alshehri, F.; Almojil, M. Automatic speech recognition: Systematic literature review. IEEE Access 2021, 9, 131858–131876. [Google Scholar] [CrossRef]
- Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 2020, 33, 12449–12460. [Google Scholar]
- Hsu, W.N.; Bolte, B.; Tsai, Y.H.H.; Lakhotia, K.; Salakhutdinov, R.; Mohamed, A. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3451–3460. [Google Scholar] [CrossRef]
- Pratap, V.; Tjandra, A.; Shi, B.; Tomasello, P.; Babu, A.; Kundu, S.; Elkahky, A.; Ni, Z.; Vyas, A.; Fazel-Zarandi, M.; et al. Scaling speech technology to 1000+ languages. J. Mach. Learn. Res. 2024, 25, 1–52. [Google Scholar]
- Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust speech recognition via large-scale weak supervision. In Proceedings of the International Conference on Machine Learning. PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 28492–28518. [Google Scholar]
- Jangra, A.; Mukherjee, S.; Jatowt, A.; Saha, S.; Hasanuzzaman, M. A survey on multi-modal summarization. ACM Comput. Surv. 2023, 55, 1–36. [Google Scholar] [CrossRef]
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
- Zhang, J.; Zhao, Y.; Saleh, M.; Liu, P. Pegasus: Pre-training with extracted gap-sentences for abstractive summarization. In Proceedings of the International Conference on Machine Learning. PMLR, Virtual Event, 13–18 July 2020; pp. 11328–11339. [Google Scholar]
- Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv 2019, arXiv:1910.13461. [Google Scholar]
- Haz, A.L.; Funabiki, N.; Fajrianti, E.D.; Sukaridhoto, S. A Study of Summarization and Keyword Extraction Function in Meeting Note Generation System from Voice Records. In Proceedings of the 2023 12th International Conference on Networks, Communication and Computing, Osaka, Japan, 15–17 December 2023; pp. 106–112. [Google Scholar]
- Grootendorst, M. KeyBERT: Minimal Keyword Extraction with BERT. 2020. Available online: https://doi.org/10.5281/zenodo.4461265 (accessed on 26 October 2024).
- Mihalcea, R.; Tarau, P. Textrank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, 25–26 July 2004; pp. 404–411. [Google Scholar]
- Rose, S.; Engel, D.; Cramer, N.; Cowley, W. Automatic keyword extraction from individual documents. In Text Mining: Applications and Theory; Wiley: Hoboken, NJ, USA, 2010; pp. 1–20. [Google Scholar]
- Campos, R.; Mangaravite, V.; Pasquali, A.; Jorge, A.; Nunes, C.; Jatowt, A. YAKE! Keyword extraction from single documents using multiple local features. Inf. Sci. 2020, 509, 257–289. [Google Scholar] [CrossRef]
- Škrlj, B.; Koloski, B.; Pollak, S. Retrieval-efficiency trade-off of Unsupervised Keyword Extraction. In Proceedings of the International Conference on Discovery Science, Montpellier, France, 10–12 October 2022; pp. 379–393. [Google Scholar]
- Škrlj, B.; Repar, A.; Pollak, S. RaKUn: Rank-based Keyword extraction via Unsupervised learning and meta vertex aggregation. In Proceedings of the Statistical Language and Speech Processing: 7th International Conference, SLSP 2019, Ljubljana, Slovenia, 14–16 October 2019; pp. 311–323. [Google Scholar]
- Salehudin, M.; Basah, S.; Yazid, H.; Basaruddin, K.; Safar, M.; Som, M.M.; Sidek, K. Analysis of Optical Character Recognition using EasyOCR under Image Degradation. J. Phys. Conf. Ser. IOP Publ. 2023, 2641, 012001. [Google Scholar] [CrossRef]
- de Luna, R.G. A Tesseract-based Optical Character Recognition for a Text-to-Braille Code Conversion. Int. J. Adv. Sci. Eng. Inf. Technol. 2020, 10, 128–136. [Google Scholar] [CrossRef]
- Shahin, M.; Chen, F.F.; Hosseinzadeh, A. Machine-based identification system via optical character recognition. Flex. Serv. Manuf. J. 2023, 1–28. [Google Scholar] [CrossRef]
- Mohajer, M.M.; Hassanpour, H. Fast Exam Video Summarization Using Targeted Evaluation of Scene Changes Based on User Behavior. In Proceedings of the 2023 9th International Conference on Signal Processing and Intelligent Systems (ICSPIS), Bali, Indonesia, 14–15 December 2023; pp. 1–5. [Google Scholar]
- Haz, A.L.; Fajrianti, E.D.; Funabiki, N.; Sukaridhoto, S. A Study of Audio-to-Text Conversion Software Using Whispers Model. In Proceedings of the 2023 Sixth International Conference on Vocational Education and Electrical Engineering (ICVEE), Bali, Indonesia, 14–15 December 2023; pp. 268–273. [Google Scholar]
- Nguyen, Q.; Nguyen, N.; Dang, T.; Tran, V. Vietnamese Voice2Text: A Web Application for Whisper Implementation in Vietnamese Automatic Speech Recognition Tasks: Vietnamese Voice2Text. In Proceedings of the 2023 7th International Conference on Computer Science and Artificial Intelligence, Beijing, China, 8–10 December 2023; pp. 312–318. [Google Scholar]
- Saxena, P.; El-Haj, M. Exploring Abstractive Text Summarisation for Podcasts: A Comparative Study of BART and T5 Models. In Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing, Varna, Bulgaria, 4–6 September 2023; pp. 1023–1033. [Google Scholar]
- Tang, Z.; Yang, Z.; Wang, G.; Fang, Y.; Liu, Y.; Zhu, C.; Zeng, M.; Zhang, C.; Bansal, M. Unifying vision, text, and layout for universal document processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 19254–19264. [Google Scholar]
- Bulut, F.; Osmanı, S. Scene Change Detection using Different Color Pallets and Performance Comparison. Balk. J. Electr. Comput. Eng. 2017, 5, 66–72. [Google Scholar] [CrossRef]
- Widyassari, A.P.; Rustad, S.; Shidik, G.F.; Noersasongko, E.; Syukur, A.; Affandy, A.; Setiadi, D.R.I.M. Review of automatic text summarization techniques & methods. J. King Saud-Univ.-Comput. Inf. Sci. 2022, 34, 1029–1046. [Google Scholar]
- Chen, Y.; Song, Q. News text summarization method based on bart-textrank model. In Proceedings of the 2021 IEEE 5th Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), Chongqing, China, 12–14 March 2021; pp. 2005–2010. [Google Scholar]
- Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. Facebook/Bart-large · Hugging Face. 2019. Available online: https://huggingface.co/facebook/bart-large (accessed on 24 October 2024).
- Nasar, Z.; Jaffry, S.W.; Malik, M.K. Textual keyword extraction and summarization: State-of-the-art. Inf. Process. Manag. 2019, 56, 102088. [Google Scholar] [CrossRef]
- Blaž, Š.; Koloski, B.; Pollak, S. Retrieval-Efficiency Trade-Off of Unsupervised Keyword Extraction. In Discovery Science; Springer: Cham, Switzerland, 2022; Volume 13601. [Google Scholar]
- Von Neumann, T.; Boeddeker, C.; Kinoshita, K.; Delcroix, M.; Haeb-Umbach, R. On word error rate definitions and their efficient computation for multi-speaker speech recognition systems. In Proceedings of the ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
- Schaefer, R.; Neudecker, C. A two-step approach for automatic OCR post-correction. In Proceedings of the The 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, Barcelona, Spain, 12 December 2020; pp. 52–57. [Google Scholar]
- Streamlit. A Faster Way to Build and Share Data Apps. Available online: https://streamlit.io/ (accessed on 26 October 2024).
- Scikit-Image Contributors. Structural Similarity Index. 2024. Available online: https://scikit-image.org/docs/stable/auto_examples/transform/plot_ssim.html (accessed on 20 October 2024).
- openCV Contributors. Histogram Comparison. 2024. Available online: https://docs.opencv.org/4.x/d8/dc8/tutorial_histogram_comparison.html (accessed on 20 October 2024).
- Scikit-Learn Developers. Mean_Squared_Error. 2024. Available online: https://scikit-learn.org/1.5/modules/generated/sklearn.metrics.mean_squared_error.html (accessed on 20 October 2024).
- Facebook-AI. Wav2Vec2 Base 960h. 2021. Available online: https://huggingface.co/facebook/wav2vec2-base-960h (accessed on 23 October 2024).
- Facebook-AI. HuBERT Large LS960 Fine-Tuned. 2021. Available online: https://huggingface.co/facebook/hubert-large-ls960-ft (accessed on 23 October 2024).
- Meta-AI. MMS-1B-FL102. 2023. Available online: https://huggingface.co/facebook/mms-1b-fl102 (accessed on 23 October 2024).
- Distil-Whisper. Distil-Whisper Medium English. 2023. Available online: https://huggingface.co/distil-whisper/distil-medium.en (accessed on 23 October 2024).
- Clivillé, J. flan-t5-3b-summarizer. 2023. Available online: https://huggingface.co/jordiclive/flan-t5-3b-summarizer (accessed on 24 October 2024).
- Google-Research. PEGASUS-XSum. 2020. Available online: https://huggingface.co/google/pegasus-xsum (accessed on 24 October 2024).
- Neelamohan, K.K. MEETING_SUMMARY. 2022. Available online: https://huggingface.co/knkarthick/MEETING_SUMMARY (accessed on 24 October 2024).
- Grootendorst, M. KeyBERT: Minimal Keyword Extraction with BERT. 2020. GitHub Repository. Available online: https://github.com/MaartenGr/KeyBERT (accessed on 23 October 2024).
- Surfer, C. RAKE-NLTK: Rapid Automatic Keyword Extraction using NLTK. 2018. GitHub Repository. Available online: https://github.com/csurfer/rake-nltk (accessed on 26 October 2024).
- Campos, R.; Mangaravite, V.; Pasquali, A.; Jorge, A.M.; Nunes, C.; Jatowt, A. YAKE: Keyword Extraction from Single Documents Using Multiple Features. 2020. GitHub Repository. Available online: https://github.com/LIAAD/yake (accessed on 26 October 2024).
- Nathan, P. PyTextRank Python Implementation of TextRank for Phrase Extraction and Summarization. 2020. GitHub Repository. Available online: https://github.com/DerwenAI/pytextrank (accessed on 26 October 2024).
- SkBlaz. RaKUn2—Rake Unsupervised Keyword Extraction. 2023. GitHub Repository. Available online: https://github.com/SkBlaz/rakun2 (accessed on 26 October 2024).
- Hu, S.; He, C.; Zhang, C.; Tan, Z.; Ge, B.; Zhou, X. Efficient scene text recognition model built with PaddlePaddle framework. In Proceedings of the 2021 7th International Conference on Big Data and Information Analytics (BigDIA), Chongqing, China, 4–10 June 2023; pp. 139–142. [Google Scholar]
- Smith, R. An Overview of the Tesseract OCR Engine. In Proceedings of the ICDAR ’07: Proceedings of the Ninth International Conference on Document Analysis and Recognition, Washington, DC, USA, 23–26 September 2007; pp. 629–633. [Google Scholar]
- Graham, C.; Roll, N. Evaluating OpenAI’s Whisper ASR: Performance analysis across diverse accents and speaker traits. JASA Express Lett. 2024, 4, 025206. [Google Scholar] [CrossRef] [PubMed]
- Zhang, T.; Irsan, I.C.; Thung, F.; Han, D.; Lo, D.; Jiang, L. iTiger: An automatic issue title generation tool. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Singapore, 14–18 November 2022; pp. 1637–1641. [Google Scholar]
- Raju, R.; Pati, P.B.; Gandheesh, S.; Sannala, G.S.; Suriya, K. Grammatical versus Spelling Error Correction: An Investigation into the Responsiveness of Transformer-Based Language Models Using BART and MarianMT. arXiv 2024, arXiv:2403.16655. [Google Scholar] [CrossRef]
- Škrlj, B.; Jukič, M.; Eržen, N.; Pollak, S.; Lavrač, N. Prioritization of COVID-19-related literature via unsupervised keyphrase extraction and document representation learning. In Proceedings of the Discovery Science: 24th International Conference, Halifax, NS, Canada, 11–13 October 2021; pp. 204–217. [Google Scholar]
- Saha, S.; Ghosh, M.; Ghosh, S.; Sen, S.; Singh, P.K.; Geem, Z.W.; Sarkar, R. Feature selection for facial emotion recognition using cosine similarity-based harmony search algorithm. Appl. Sci. 2020, 10, 2816. [Google Scholar] [CrossRef]
- Sarwar, T.B.; Noor, N.M.; Miah, M.S.U. Evaluating keyphrase extraction algorithms for finding similar news articles using lexical similarity calculation and semantic relatedness measurement by word embedding. PeerJ Comput. Sci. 2022, 8, e1024. [Google Scholar] [CrossRef]
- MeetingBooster. Meeting Management Software: Meetingbooster. Available online: https://www.meetingbooster.com/ (accessed on 20 October 2024).
- Fellow. Fellow Resources. 2023. Available online: https://fellow.app/ (accessed on 20 October 2024).
- Beenote. Meeting Management Solution: Agenda, Minutes. 2022. Available online: https://www.beenote.io/ (accessed on 20 October 2024).
- Piglyph. Interactive Whiteboard for Co-Creation Through Real-Time Visualization: Ricoh. Available online: https://piglyph.com/ (accessed on 20 October 2024).
- Tactiq. AI Meeting Transcripts for Google Meet, Zoom & Teams. Available online: https://tactiq.io/ (accessed on 20 October 2024).
- Fatoni, A.; Adi, K.; Widodo, A.P. PIECES framework and importance performance analysis method to evaluate the implementation of information systems. In Proceedings of the E3S Web of Conferences, Online Conference, 12–13 August 2020; Volume 202, p. 15007. [Google Scholar]
Figure 1. Overview of meeting minutes generation system.
Figure 2. Acceptable document format.
Figure 3. Workflow for information correlation function.
Figure 4. Sample of organized data for each slide.
Figure 5. (a) Input field for recorded video and presentation document; (b) extracted results for each slide.
Table 1. Comparison of image comparison algorithms.

| Metric | SSIM [43] | Histogram: Bhattacharyya [44] | Histogram: Chi-Square [44] | Histogram: Correlation [44] | MSE [45] |
|---|---|---|---|---|---|
| Precision | 100% | 35.29% | 26.32% | 38.46% | 11.36% |
| Recall | 100% | 60.00% | 100.00% | 50.00% | 100.00% |
| F1 Score | 100% | 44.44% | 41.67% | 43.48% | 20.41% |
| Duration (s) | 180.85 | 296.91 | 309.79 | 297.13 | 173.16 |
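As a point of reference for Table 1, the sketch below computes the three comparison measures for a single pair of sampled frames using OpenCV and scikit-image; the file names, grayscale conversion, and histogram settings are illustrative assumptions rather than the exact evaluation configuration.

```python
# Minimal sketch (not the paper's exact script): compare two sampled frames
# with SSIM, histogram distances, and MSE, as in Table 1.
import cv2
import numpy as np
from skimage.metrics import structural_similarity

frame_a = cv2.imread("frame_a.png")  # illustrative file names
frame_b = cv2.imread("frame_b.png")
gray_a = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
gray_b = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)

# SSIM: 1.0 for identical frames, lower values suggest a slide change.
ssim_score = structural_similarity(gray_a, gray_b)

# Histogram comparison with the three distances listed in Table 1.
hist_a = cv2.calcHist([gray_a], [0], None, [256], [0, 256])
hist_b = cv2.calcHist([gray_b], [0], None, [256], [0, 256])
cv2.normalize(hist_a, hist_a)
cv2.normalize(hist_b, hist_b)
bhattacharyya = cv2.compareHist(hist_a, hist_b, cv2.HISTCMP_BHATTACHARYYA)
chi_square = cv2.compareHist(hist_a, hist_b, cv2.HISTCMP_CHISQR)
correlation = cv2.compareHist(hist_a, hist_b, cv2.HISTCMP_CORREL)

# MSE: 0.0 for identical frames, larger values indicate more change.
mse = np.mean((gray_a.astype("float32") - gray_b.astype("float32")) ** 2)

print(ssim_score, bhattacharyya, chi_square, correlation, mse)
```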
Table 2. Comparison of WER and duration among audio-to-text models.

| Metric | Wav2Vec [46] | HuBERT [47] | MMS [48] | Whisper [49] |
|---|---|---|---|---|
| WER | 51.23% | 26.92% | 19.38% | 4.41% |
| Duration (s) | 6.484 | 16.044 | 104.832 | 4.912 |
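The word error rate (WER) values in Table 2 follow the standard definition: the number of word-level substitutions, deletions, and insertions divided by the number of words in the reference transcript. A minimal way to reproduce the metric, shown here with the open-source jiwer package and placeholder transcripts, is:

```python
# Hedged example: WER between a reference transcript and a model output.
# The strings are placeholders, not data from the evaluation set.
import jiwer

reference = "the proposed system segments the video at every slide change"
hypothesis = "the proposed system segment the video at every slide change"

wer = jiwer.wer(reference, hypothesis)
print(f"WER: {wer:.2%}")  # fraction of word-level errors
```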
Table 3. Comparison of ROUGE-N and duration among summarization models.

| Metric | T5 [50] | PEGASUS [51] | BART [52] |
|---|---|---|---|
| ROUGE-1 | 30.66% | 30.95% | 39.34% |
| ROUGE-2 | 15.05% | 16.30% | 22.43% |
| ROUGE-L | 25.56% | 26.16% | 35.35% |
| Duration (s) | 4.448 | 1.374 | 0.962 |
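The ROUGE-1, ROUGE-2, and ROUGE-L scores in Table 3 (and later in Table 9) can be reproduced with any standard implementation; the sketch below uses the open-source rouge-score package with placeholder texts and is only one possible setup.

```python
# Hedged example: ROUGE-1/2/L F1 between a reference and a generated summary.
from rouge_score import rouge_scorer

reference_summary = "The slide introduces blockchain consensus mechanisms."
generated_summary = "The slide explains consensus mechanisms used in blockchain."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference_summary, generated_summary)
for name, result in scores.items():
    print(name, f"F1 = {result.fmeasure:.2%}")
```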
Table 4. Comparison of cosine similarity among keyword extraction models.

| Metric | KeyBERT [53] | RAKE [54] | YAKE [55] | TextRank [56] | RaKUn2 [57] |
|---|---|---|---|---|---|
| Cosine Similarity | 28.46% | 27.83% | 34.01% | 33.97% | 47.70% |
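The cosine-similarity figures in Table 4 (and Table 10) compare extracted keywords against reference keywords in an embedding space. The sketch below is one way to compute such a score; the sentence-transformers model and the keyword strings are illustrative assumptions and not necessarily the setup used in the evaluation.

```python
# Hedged example: cosine similarity between reference and extracted keywords.
# Model name and keyword strings are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
reference_keywords = "quantum computing qubits superposition entanglement"
extracted_keywords = "quantum computer qubit states entangled particles"

embeddings = model.encode([reference_keywords, extracted_keywords])
cosine = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"Cosine similarity: {cosine:.3f}")
```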
Table 5. Comparison of OCR algorithms.

| Size | PaddleOCR [58]: Duration (s) | PaddleOCR [58]: CER | Tesseract [59]: Duration (s) | Tesseract [59]: CER | EasyOCR [26]: Duration (s) | EasyOCR [26]: CER |
|---|---|---|---|---|---|---|
| | 0.3442 | 0.00% | 1.2262 | 0.00% | 4.1956 | 0.00% |
| | 0.1951 | 0.00% | 0.9713 | 0.00% | 2.9337 | 0.00% |
| | 0.2672 | 0.00% | 0.8805 | 0.00% | 2.7653 | 0.00% |
| | 0.1892 | 0.00% | 0.6989 | 0.00% | 2.9206 | 0.00% |
| | 0.1278 | 0.00% | 0.1531 | 0.00% | 3.6544 | 0.00% |
| | 0.0991 | 5.22% | 0.1338 | 4.26% | 0.9936 | 8.88% |
| | 0.0773 | 7.02% | 0.1204 | 100.00% | 0.2197 | 55.56% |
| | 0.0594 | 15.40% | 0.1010 | 100.00% | 0.3462 | 100.00% |
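The character error rate (CER) used in Tables 5 and 11 is the character-level analogue of WER. A minimal, hedged example with jiwer and placeholder strings:

```python
# Hedged example: CER between ground-truth slide text and OCR output.
# The strings are placeholders, not data from the evaluation set.
import jiwer

ground_truth_text = "Renewable Energy Sources"
ocr_output_text = "Renewabl Energy Source5"

cer = jiwer.cer(ground_truth_text, ocr_output_text)
print(f"CER: {cer:.2%}")  # fraction of character-level errors
```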
Table 6. Recorded presentation videos for evaluations.

| Topic | Artificial Intelligence | Biotechnology | Blockchain | Cybersecurity | Renewable Energy | Environmental | Healthcare | Quantum Computing | Data Science | Space Exploration |
|---|---|---|---|---|---|---|---|---|---|---|
| Duration (m:s) | 5:50 | 5:29 | 6:44 | 6:37 | 6:53 | 6:05 | 5:28 | 4:32 | 6:13 | 8:10 |
| Bitrate (kbps) | 577 | 632 | 557 | 574 | 497 | 664 | 562 | 846 | 517 | 691 |
Table 7. SSIM algorithm results for per-second approach.

| Topic | Seconds | F1 Score | Execution Time (s) |
|---|---|---|---|
| Artificial Intelligence | 350 | 100% | 38.57 |
| Biotechnology | 329 | 100% | 50.47 |
| Blockchain | 404 | 100% | 47.17 |
| Cybersecurity | 397 | 100% | 42.16 |
| Renewable Energy | 413 | 100% | 42.37 |
| Environmental | 365 | 100% | 39.62 |
| Healthcare | 328 | 100% | 35.01 |
| Quantum Computing | 272 | 100% | 27.56 |
| Data Science | 373 | 100% | 39.86 |
| Space Exploration | 490 | 100% | 46.96 |
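The per-second approach evaluated in Table 7 samples one frame per second and flags a slide change when the SSIM against the previously sampled frame falls below a threshold. The sketch below illustrates this idea; the video path and the 0.9 threshold are assumptions for demonstration, not values taken from the paper.

```python
# Hedged sketch of per-second SSIM slide-change detection (Table 7).
import cv2
from skimage.metrics import structural_similarity

video = cv2.VideoCapture("presentation.mp4")  # illustrative path
fps = int(video.get(cv2.CAP_PROP_FPS))
threshold = 0.9  # assumed threshold for demonstration

previous_gray = None
change_timestamps = []
frame_index = 0
while True:
    ok, frame = video.read()
    if not ok:
        break
    if frame_index % fps == 0:  # keep one frame per second
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if previous_gray is not None:
            score = structural_similarity(previous_gray, gray)
            if score < threshold:
                change_timestamps.append(frame_index // fps)
        previous_gray = gray
    frame_index += 1
video.release()
print("Slide changes at seconds:", change_timestamps)
```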
Table 8. Averaged WER on each slide from different English accents.

| Accent | Slide 1 | Slide 2 | Slide 3 | Slide 4 | Slide 5 |
|---|---|---|---|---|---|
| Vietnam | 1.30% | 1.58% | 2.17% | 0.82% | 1.31% |
| Congo | 1.32% | 1.75% | 2.45% | 1.04% | 1.60% |
| Singapore | 0.89% | 1.74% | 2.37% | 0.98% | 1.43% |
| Malaysia | 1.35% | 1.57% | 2.30% | 0.98% | 1.64% |
| Indonesia | 1.36% | 1.76% | 2.41% | 1.79% | 1.40% |
Table 9. ROUGE-1, ROUGE-2, and ROUGE-L for each slide and topic.

| Metric | Slide | Artificial Intelligence | Biotechnology | Blockchain | Cybersecurity | Renewable Energy | Environmental | Healthcare | Quantum Computing | Data Science | Space Exploration |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ROUGE-1 | Slide 1 | 27.4% | 49.9% | 33.9% | 44.4% | 49.9% | 49.9% | 59.5% | 46.1% | 0% | 35.2% |
| ROUGE-1 | Slide 2 | 36.7% | 45.7% | 4.1% | 60.8% | 42.4% | 69.1% | 60.2% | 46.8% | 47.4% | 33.5% |
| ROUGE-1 | Slide 3 | 68.4% | 30.6% | 45.3% | 43.7% | 52.9% | 29.8% | 37.9% | 38% | 50.6% | 31.9% |
| ROUGE-1 | Slide 4 | 43.2% | 48.1% | 51.7% | 55.3% | 32.9% | 41% | 72.7% | 46.8% | 43.9% | 55.7% |
| ROUGE-1 | Slide 5 | 41.5% | - | 46.3% | 38.2% | 19% | 39.6% | 62.7% | 45.3% | 43.9% | 49% |
| ROUGE-2 | Slide 1 | 14% | 19.6% | 6.3% | 9.8% | 21.8% | 27.1% | 40.7% | 20.3% | 0% | 22.9% |
| ROUGE-2 | Slide 2 | 5.9% | 9.5% | 21.7% | 33.3% | 16.2% | 44.7% | 33.6% | 22.2% | 19.1% | 12.8% |
| ROUGE-2 | Slide 3 | 37.5% | 8.6% | 6.8% | 18.8% | 23.2% | 9.3% | 14.7% | 17.1% | 28.8% | 15.5% |
| ROUGE-2 | Slide 4 | 13.1% | 20.8% | 17.9% | 20.2% | 11.7% | 8.1% | 44.4% | 15.9% | 20.6% | 37.8% |
| ROUGE-2 | Slide 5 | 18.9% | - | 16.2% | 8.8% | 5.8% | 10.7% | 20.3% | 21.9% | 10.2% | 29.8% |
| ROUGE-L | Slide 1 | 23.5% | 38.4% | 22.6% | 37% | 45.8% | 49.9% | 59.5% | 46.2% | 0% | 35.2% |
| ROUGE-L | Slide 2 | 22.9% | 25.7% | 39.5% | 55.7% | 34.3% | 64.2% | 55.9% | 36.3% | 27.4% | 19.1% |
| ROUGE-L | Slide 3 | 50.6% | 27% | 34.6% | 37.4% | 44.1% | 20.8% | 31% | 30.9% | 45.8% | 23.9% |
| ROUGE-L | Slide 4 | 23.4% | 40.5% | 34.4% | 33.8% | 21.1% | 24.6% | 62.3% | 36.3% | 26.3% | 39.3% |
| ROUGE-L | Slide 5 | 23.7% | - | 37.6% | 27.6% | 9.5% | 31.6% | 46.5% | 34.6% | 31.7% | 43.1% |
Table 10. Keywords’ cosine similarity from each slide and each topic.

| Slide | Artificial Intelligence | Biotechnology | Blockchain | Cybersecurity | Renewable Energy | Environmental | Healthcare | Quantum Computing | Data Science | Space Exploration | Average (per slide) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Slide 1 | 0.714 | 0.503 | 0.669 | 0.462 | 0.487 | 0.721 | 0.874 | 0.801 | 0.394 | 0.566 | 0.619 |
| Slide 2 | 0.504 | 0.597 | 0.453 | 0.423 | 0.481 | 0.356 | 0.447 | 0.496 | 0.429 | 0.499 | 0.469 |
| Slide 3 | 0.478 | 0.577 | 0.516 | 0.484 | 0.549 | 0.499 | 0.522 | 0.592 | 0.509 | 0.462 | 0.519 |
| Slide 4 | 0.571 | 0.559 | 0.526 | 0.577 | 0.649 | 0.539 | 0.547 | 0.416 | 0.725 | 0.362 | 0.547 |
| Slide 5 | 0.306 | - | 0.316 | 0.340 | 0.360 | 0.399 | 0.416 | 0.547 | 0.269 | 0.441 | 0.377 |
| Average (per topic) | 0.515 | 0.559 | 0.496 | 0.457 | 0.505 | 0.503 | 0.561 | 0.570 | 0.465 | 0.466 | 0.509 |
Table 11. CER results from each slide and each topic from resized video using PaddleOCR.

| Slide | Artificial Intelligence | Biotechnology | Blockchain | Cybersecurity | Renewable Energy | Environmental | Healthcare | Quantum Computing | Data Science | Space Exploration |
|---|---|---|---|---|---|---|---|---|---|---|
| Slide 1 | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
| Slide 2 | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
| Slide 3 | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
| Slide 4 | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
| Slide 5 | 0% | - | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
Table 12. Performance comparison with other existing systems.

| Application | Time (m:s) | Transcription (WER) | Summary: ROUGE-1 | Summary: ROUGE-2 | Summary: ROUGE-L | Keywords (Cosine Similarity) | Runs on | Custom |
|---|---|---|---|---|---|---|---|---|
| MeetingBooster [66] | 1:18 | 2.24% | - | - | - | - | Browser | No |
| Fellow [67] | 2:18 | 2.03% | 40.47% | 23.11% | 31.53% | 0.511 | Browser, Desktop | No |
| Beenote [68] | 5:48 | 2.91% | 38.09% | 21.07% | 29.47% | 0.496 | Desktop | No |
| Piglyph [69] | 3:02 | 2.72% | 42.03% | 25.08% | 33.12% | 0.521 | Desktop | No |
| Tactiq [70] | 3:54 | 2.81% | 39.14% | 22.17% | 30.61% | 0.427 | Browser | No |
| Our proposal | 1:34 | 1.88% | 41.06% | 23.76% | 32.12% | 0.509 | Browser | Yes |
Table 13. Questionnaire questions on usability and effectiveness.

| Question ID | Questionnaire Questions |
|---|---|
| Q1 | The system was easy to use in extracting textual information from presentation videos. |
| Q2 | The system’s interface was intuitive and easy to navigate while correlating information between presentation videos and PowerPoint files. |
| Q3 | I am satisfied with the accuracy of the system in converting audio to text and correlating textual information from both presentation videos and PowerPoint files. |
| Q4 | I am satisfied with the quality of the generated summary and the extracted keywords. |
| Q5 | I would recommend this system to others for similar tasks requiring information correlation between presentation videos and PowerPoint files. |
Table 14. The answers to the questions in the questionnaire.

| Question ID | Strongly Disagree | Disagree | Neutral | Agree | Strongly Agree |
|---|---|---|---|---|---|
| Q1 | 3 | 0 | 6 | 14 | 8 |
| Q2 | 0 | 1 | 9 | 19 | 2 |
| Q3 | 0 | 0 | 11 | 10 | 10 |
| Q4 | 0 | 3 | 16 | 4 | 8 |
| Q5 | 2 | 4 | 3 | 14 | 8 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).