Topic Editors

Department of Electrical & Electronic Engineering, University of Bristol, Bristol, UK
Institute of Computer Science, University of Rostock, 18051 Rostock, Germany

Methods for Data Labelling for Intelligent Systems

Abstract submission deadline: closed (31 May 2023)
Manuscript submission deadline: closed (31 July 2023)
Viewed by: 53985

Topic Information

Dear Colleagues,

In our everyday life, we produce large quantities of data in different forms: sensor, video, audio, textual data, etc. Labelling these data is a central part of the design and evaluation of intelligent systems that aim to understand and support the user. It is essential, both in designing and training a system, to recognize and reason about the situation and context, either through the design of new sensing modalities, the definition of suitable observation and semantic models in knowledge-driven applications, or through the preparation of training data for learning tasks in data-driven models. Hence, the quality of annotations can have a significant impact on the performance of the derived systems. Labelling is also vital for validating and quantifying the performance of intelligent applications, as well as for selecting the best performing setup of input modalities and configurations. Yet, high-quality annotations can be costly, often requiring significant time, expertise and funding, and so labelling tasks must be approached with a pragmatic balance between quality, cost and use case in mind. With intelligent systems relying increasingly on large datasets with multiple heterogeneous data sources, the process of data labelling is becoming a major concern for the community.
To address the above challenges, this topic focuses on the following aspects of annotation: (1) intelligent and interactive tools and automated methods for annotating large datasets, including the aspects of continuous learning, adaptation and lifelong learning; (2) the role and impact of annotations and annotation structures in designing intelligent systems; (3) the process of labelling, and the requirements to produce high-quality annotations, especially in the context of big data; (4) methods for standardisation and normalisation in annotation practices.
We invite you to submit works that offer new empirical or theoretical insights into the challenges and innovative solutions associated with data labelling, as well as on the impact that labelling choices have on the user and the developed system. The topics of interest include, but are not limited to:

  • Methods and intelligent tools for annotating heterogeneous data;
  • Methods for standardisation and normalisation in annotation practices; 
  • Influence of interface on annotation; 
  • Processes of and best practices in annotating heterogeneous data;
  • Case studies in annotation for specific areas such as, but not limited to, wearable and ubiquitous computing, factory automation, medicine, healthcare, linguistics and law; 
  • Methods towards automation of the annotation process; 
  • Methods for improving and evaluating the quality of annotations; 
  • Ethical and privacy issues concerning data annotation; 
  • Beyond the labels: ontologies for semantic annotation of user data; 
  • High-quality and reusable annotation for publicly available datasets;
  • Impact of annotation on a system’s performance; 
  • Building machine learning models that are capable of dealing with multiple (noisy) annotations and/or making use of taxonomies/ontologies; 
  • The potential value of incorporating modelling of the annotators into predictive models.
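
As a small illustration of two of the topics listed above (dealing with multiple noisy annotations and evaluating annotation quality), the following Python sketch aggregates several annotators' labels by majority vote and measures agreement between two annotators with Cohen's kappa. It is only a minimal sketch under simple assumptions: the function names and the toy activity labels are hypothetical, and real annotation projects may call for more elaborate aggregation or agreement measures.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    if not labels:
        raise ValueError("at least one label is required")
    return Counter(labels).most_common(1)[0][0]


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy data: two annotators labelling four sensor segments.
annotator_1 = ["walk", "sit", "sit", "walk"]
annotator_2 = ["walk", "sit", "walk", "walk"]
kappa = cohens_kappa(annotator_1, annotator_2)
consensus = majority_vote(["sit", "walk", "sit"])
```

A kappa near 1 indicates agreement well above chance, while a value near 0 indicates agreement no better than chance; in this toy example the two annotators agree on three of four items.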

Dr. Emma Tonkin
Dr. Kristina Yordanova
Topic Editors

Keywords

  • annotation
  • labelling
  • coding
  • intelligent systems
  • automated methods
  • machine learning

Participating Journals

Journal Name (code)          Impact Factor   CiteScore   Launched Year   First Decision (median)   APC
AI (ai)                      3.1             7.2         2020            17.6 Days                 CHF 1600
Applied Sciences (applsci)   2.5             5.3         2011            17.8 Days                 CHF 2400
Data (data)                  2.2             4.3         2016            27.7 Days                 CHF 1600
Sensors (sensors)            3.4             7.3         2001            16.8 Days                 CHF 2600
Systems (systems)            2.3             2.8         2013            17.3 Days                 CHF 2400

Preprints.org is a multidisciplinary platform providing a preprint service dedicated to sharing your research from the start and empowering your research journey.

MDPI Topics is cooperating with Preprints.org and has built a direct connection between MDPI journals and Preprints.org. Authors are encouraged to enjoy the benefits by posting a preprint at Preprints.org prior to publication:

  1. Immediately share your ideas ahead of publication and establish your research priority;
  2. Protect your idea from being stolen with this time-stamped preprint article;
  3. Enhance the exposure and impact of your research;
  4. Receive feedback from your peers in advance;
  5. Have it indexed in Web of Science (Preprint Citation Index), Google Scholar, Crossref, SHARE, PrePubMed, Scilit and Europe PMC.

Published Papers (17 papers)

26 pages, 12907 KiB  
Article
CORTO: The Celestial Object Rendering TOol at DART Lab
by Mattia Pugliatti, Carmine Buonagura and Francesco Topputo
Sensors 2023, 23(23), 9595; https://doi.org/10.3390/s23239595 - 3 Dec 2023
Cited by 2 | Viewed by 1878
Abstract
The Celestial Object Rendering TOol (CORTO) offers a powerful solution for generating synthetic images of celestial bodies, catering to the needs of space mission design, algorithm development, and validation. Through rendering, noise modeling, hardware-in-the-loop testing, and post-processing functionalities, CORTO creates realistic scenarios. It offers a versatile and comprehensive solution for generating synthetic images of celestial bodies, aiding the development and validation of image processing and navigation algorithms for space missions. This work illustrates its functionalities in detail for the first time. The importance of a robust validation pipeline to test the tool’s accuracy against real mission images using metrics like normalized cross-correlation and structural similarity is also illustrated. CORTO is a valuable asset for advancing space exploration and navigation algorithm development and has already proven effective in various projects, including CubeSat design, lunar missions, and deep learning applications. While the tool currently covers a range of celestial body simulations, mainly focused on minor bodies and the Moon, future enhancements could broaden its capabilities to encompass additional planetary phenomena and environments.
(This article belongs to the Topic Methods for Data Labelling for Intelligent Systems)
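
The CORTO abstract mentions normalized cross-correlation as one metric for comparing synthetic and real images. As a rough illustration of what such a metric computes, here is a minimal Python sketch of the zero-mean normalized cross-correlation between two equally sized grayscale images given as nested lists. The function name and the toy images are hypothetical, and this is one common formulation of the metric, not necessarily the variant used in the paper.

```python
import math


def normalized_cross_correlation(image_a, image_b):
    """Zero-mean normalized cross-correlation of two same-sized images.

    Returns a value in [-1, 1]; 1 means the images differ only by a
    positive gain and an offset in brightness.
    """
    flat_a = [pixel for row in image_a for pixel in row]
    flat_b = [pixel for row in image_b for pixel in row]
    if len(flat_a) != len(flat_b) or not flat_a:
        raise ValueError("images must be non-empty and have the same size")
    mean_a = sum(flat_a) / len(flat_a)
    mean_b = sum(flat_b) / len(flat_b)
    # Covariance-like numerator and the product of the two spreads.
    numerator = sum((a - mean_a) * (b - mean_b) for a, b in zip(flat_a, flat_b))
    spread_a = sum((a - mean_a) ** 2 for a in flat_a)
    spread_b = sum((b - mean_b) ** 2 for b in flat_b)
    denominator = math.sqrt(spread_a * spread_b)
    if denominator == 0:
        raise ValueError("the metric is undefined for a constant image")
    return numerator / denominator


# Hypothetical 2 x 2 images: the second is a brighter, higher-contrast copy.
synthetic = [[0, 1], [2, 3]]
real = [[10, 12], [14, 16]]
score = normalized_cross_correlation(synthetic, real)
```

Because the means are subtracted and the result is normalized, the score is insensitive to global brightness and contrast changes, which is one reason this kind of metric suits comparisons between rendered and real camera images.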

12 pages, 328 KiB  
Data Descriptor
eMailMe: A Method to Build Datasets of Corporate Emails in Portuguese
by Akira A. de Moura Galvão Uematsu and Anarosa A. F. Brandão
Data 2023, 8(8), 127; https://doi.org/10.3390/data8080127 - 31 Jul 2023
Viewed by 1788
Abstract
One of the areas in which knowledge management has application is in companies that are concerned with maintaining and disseminating their practices among their members. However, studies involving these two domains may end up suffering from the issue of data confidentiality. Furthermore, it is difficult to find data regarding organizations processes and associated knowledge. Therefore, this paper presents a method to support the generation of a labeled dataset composed of texts that simulate corporate emails containing sensitive information regarding disclosure, written in Portuguese. The method begins with the definition of the dataset’s size and content distribution; the structure of its emails’ texts; and the guidelines for specialists to build the emails’ texts. It aims to create datasets that can be used in the validation of a tacit knowledge extraction process considering the 5W1H approach for the resulting base. The method was applied to create a dataset with content related to several domains, such as Federal Court and Registry Office and Marketing, giving it diversity and realism, while simulating real-world situations in the specialists’ professional life. The dataset generated is available in an open-access repository so that it can be downloaded and, eventually, expanded.

20 pages, 1010 KiB  
Article
Automatically Detecting Incoherent Written Math Answers of Fourth-Graders
by Felipe Urrutia and Roberto Araya
Systems 2023, 11(7), 353; https://doi.org/10.3390/systems11070353 - 10 Jul 2023
Cited by 2 | Viewed by 1754
Abstract
Arguing and communicating are basic skills in the mathematics curriculum. Making arguments in written form facilitates rigorous reasoning. It allows peers to review arguments, and to receive feedback about them. Even though it requires additional cognitive effort in the calculation process, it enhances long-term retention and facilitates deeper understanding. However, developing these competencies in elementary school classrooms is a great challenge. It requires at least two conditions: all students write and all receive immediate feedback. One solution is to use online platforms. However, this is very demanding for the teacher. The teacher must review 30 answers in real time. To facilitate the revision, it is necessary to automatize the detection of incoherent responses. Thus, the teacher can immediately seek to correct them. In this work, we analyzed 14,457 responses to open-ended questions written by 974 fourth graders on the ConectaIdeas online platform. A total of 13% of the answers were incoherent. Using natural language processing and machine learning algorithms, we built an automatic classifier. Then, we tested the classifier on an independent set of written responses to different open-ended questions. We found that the classifier achieved an F1-score = 79.15% for incoherent detection, which is better than baselines using different heuristics.

16 pages, 19867 KiB  
Data Descriptor
A Semantically Annotated 15-Class Ground Truth Dataset for Substation Equipment to Train Semantic Segmentation Models
by Andreas Anael Pereira Gomes, Francisco Itamarati Secolo Ganacim, Fabiano Gustavo Silveira Magrin, Nara Bobko, Leonardo Göbel Fernandes, Anselmo Pombeiro and Eduardo Félix Ribeiro Romaneli
Data 2023, 8(7), 118; https://doi.org/10.3390/data8070118 - 5 Jul 2023
Cited by 1 | Viewed by 2361
Abstract
The lack of annotated semantic segmentation datasets for electrical substations in the literature poses a significant problem for machine learning tasks; before training a model, a dataset is needed. This paper presents a new dataset of electric substations with 1660 images annotated with 15 classes, including insulators, disconnect switches, transformers and other equipment commonly found in substation environments. The images were captured using a combination of human, fixed and AGV-mounted cameras at different times of the day, providing a diverse set of training and testing data for algorithm development. In total, 50,705 annotations were created by a team of experienced annotators, using a standardized process to ensure accuracy across the dataset. The resulting dataset provides a valuable resource for researchers and practitioners working in the fields of substation automation, substation monitoring and computer vision. Its availability has the potential to advance the state of the art in this important area.

13 pages, 7719 KiB  
Data Descriptor
Deep Learning with Northern Australian Savanna Tree Species: A Novel Dataset
by Andrew J. Jansen, Jaylen D. Nicholson, Andrew Esparon, Timothy Whiteside, Michael Welch, Matthew Tunstill, Harinandanan Paramjyothi, Varma Gadhiraju, Steve van Bodegraven and Renee E. Bartolo
Data 2023, 8(2), 44; https://doi.org/10.3390/data8020044 - 20 Feb 2023
Cited by 2 | Viewed by 2576
Abstract
The classification of savanna woodland tree species from high-resolution Remotely Piloted Aircraft Systems (RPAS) imagery is a complex and challenging task. Difficulties for both traditional remote sensing algorithms and human observers arise due to low interspecies variability (species difficult to discriminate because they are morphologically similar) and high intraspecies variability (individuals of the same species varying to the extent that they can be misclassified), and the loss of some taxonomic features commonly used for identification when observing trees from above. Deep neural networks are increasingly being used to overcome challenges in image recognition tasks. However, supervised deep learning algorithms require high-quality annotated and labelled training data that must be verified by subject matter experts. While training datasets for trees have been generated and made publicly available, they are mostly acquired in the Northern Hemisphere and lack species-level information. We present a training dataset of tropical Northern Australia savanna woodland tree species that was generated using RPAS and on-ground surveys to confirm species labels. RPAS-derived imagery was annotated, resulting in 2547 polygons representing 36 tree species. A baseline dataset was produced consisting of: (i) seven orthomosaics that were used for in-field labelling; (ii) a tiled dataset at 1024 × 1024 pixel size in Common Objects in Context (COCO) format that can be used for deep learning model training; (iii) and the annotations.

21 pages, 5048 KiB  
Article
Segmentation and Classification of Zn-Al-Mg-Sn SEM BSE Microstructure
by Daniel Kuchar, Peter Gogola, Zuzana Gabalcova, Andrea Nemethova and Martin Nemeth
Appl. Sci. 2023, 13(2), 1045; https://doi.org/10.3390/app13021045 - 12 Jan 2023
Cited by 1 | Viewed by 1881
Abstract
The microstructure of materials is shaped not only by their chemical composition, but also by the thermomechanical processes used during the processing of a specific piece. The correct interpretation of the microstructure gives a rich source of information. This consists of several related steps, such as segmentation. Successful segmentation enables the qualitative as well as quantitative analysis of the individual microstructure components. The current paper deals with the segmentation and classification of four basic microstructure components of the Zn-Al-Mg-Sn alloy system. This is attempted with the help of several image processing techniques, where thresholding is the main one used. The investigated samples are the cast and annealed Zn-Al-Mg-Sn alloy bulks. The input data for this analysis are the SEM BSE images. These were taken for all alloys with a varying Sn content, covering a significant area of each investigated sample at different zoom levels. A semiautomatic algorithm running under Matlab is introduced. It addresses several tasks, such as preprocessing, noise filtering and decision methods. For the individual procedures, the time requirements for their execution are also indicated.

15 pages, 5438 KiB  
Article
Learned Semantic Index Structure Using Knowledge Graph Embedding and Density-Based Spatial Clustering Techniques
by Yuxiang Sun, Seok-Ju Chun and Yongju Lee
Appl. Sci. 2022, 12(13), 6713; https://doi.org/10.3390/app12136713 - 2 Jul 2022
Cited by 6 | Viewed by 2254
Abstract
Recently, a pragmatic approach toward achieving semantic search has made significant progress with knowledge graph embedding (KGE). Although many standards, methods, and technologies are applicable to the linked open data (LOD) cloud, there are still several ongoing problems in this area. As LOD are modeled as resource description framework (RDF) graphs, we cannot directly adopt existing solutions from database management or information retrieval systems. This study addresses the issue of efficient LOD annotation organization, retrieval, and evaluation. We propose a hybrid strategy between the index and distributed approaches based on KGE to increase join query performance. Using a learned semantic index structure for semantic search, we can efficiently discover interlinked data distributed across multiple resources. Because this approach rapidly prunes numerous false hits, the performance of join query processing is remarkably improved. The performance of the proposed index structure is compared with some existing methods on real RDF datasets. As a result, the proposed indexing method outperforms existing methods due to its ability to prune a lot of unnecessary data scanned during semantic searching.

15 pages, 424 KiB  
Article
Context Sensitive Verb Similarity Dataset for Legal Information Extraction
by Gathika Ratnayaka, Nisansa de Silva, Amal Shehan Perera, Gayan Kavirathne, Thirasara Ariyarathna and Anjana Wijesinghe
Data 2022, 7(7), 87; https://doi.org/10.3390/data7070087 - 28 Jun 2022
Cited by 1 | Viewed by 4443
Abstract
Existing literature demonstrates that verbs are pivotal in legal information extraction tasks due to their semantic and argumentative properties. However, granting computers the ability to interpret the meaning of a verb and its semantic properties in relation to a given context can be considered as a challenging task, mainly due to the polysemic and domain specific behaviours of verbs. Therefore, developing mechanisms to identify behaviors of verbs and evaluate how artificial models detect the domain specific and polysemic behaviours of verbs can be considered as tasks with significant importance. In this regard, a comprehensive dataset that can be used as an evaluation resource, as well as a training data set, can be considered as a major requirement. In this paper, we introduce LeCoVe, which is a verb similarity dataset intended towards facilitating the process of identifying verbs with similar meanings in a legal domain specific context. Using the dataset, we evaluated both domain specific and domain generic embedding models, which were developed using state-of-the-art word representation and language modelling techniques. As a part of the experiments carried out using the announced dataset, Sense2Vec and BERT models were trained using a corpus of legal opinion texts in order to capture domain specific behaviours. In addition to LeCoVe, we demonstrate that a neural network model, which was developed by combining semantic, syntactic, and contextual features that can be obtained from the outputs of embedding models, can perform comparatively well, even in a low resource scenario.

17 pages, 6715 KiB  
Article
Deep-Learning-Based Floor Path Model for Route Tracking of Autonomous Vehicles
by Mustafa Erginli and Ibrahim Cil
Systems 2022, 10(3), 83; https://doi.org/10.3390/systems10030083 - 15 Jun 2022
Cited by 2 | Viewed by 2384
Abstract
Real-time route tracking is an important research topic for autonomous vehicles used in industrial facilities. Traditional methods such as copper line tracking on the ground, wireless guidance systems, and laser systems are still used in route tracking. In this study, a deep-learning-based floor path model for route tracking of autonomous vehicles is proposed. A deep-learning floor path model and algorithm have been developed for highly accurate route tracking, which avoids collisions of vehicles and follows the shortest route to reach the destination. The floor path model consists of markers. Routes in the floor path model are created by using these markers. The floor path model is transmitted to autonomous vehicles as a vector by a central server. The server dispatches the target marker address to the vehicle to move. The vehicle calculates all possible routes to this address and chooses the shortest one. Marker images on the selected route are processed using image processing and classified with a pre-trained deep-CNN model. If the classified image and the image on the selected route are the same, the vehicle proceeds toward its destination. While the vehicle moves on the route, it sends the last classified marker to the server. Other autonomous vehicles use this marker to determine the location of this vehicle. Other vehicles on the route wait to avoid a collision. As a result of the experimental studies we have carried out, the route tracking of the vehicles has been successfully achieved.

20 pages, 1001 KiB  
Article
An Unsupervised Transfer Learning Framework for Visible-Thermal Pedestrian Detection
by Chengjin Lyu, Patrick Heyer, Bart Goossens and Wilfried Philips
Sensors 2022, 22(12), 4416; https://doi.org/10.3390/s22124416 - 10 Jun 2022
Cited by 6 | Viewed by 2362
Abstract
Dual cameras with visible-thermal multispectral pairs provide both visual and thermal appearance, thereby enabling detecting pedestrians around the clock in various conditions and applications, including autonomous driving and intelligent transportation systems. However, due to the greatly varying real-world scenarios, the performance of a detector trained on a source dataset might change dramatically when evaluated on another dataset. A large amount of training data is often necessary to guarantee the detection performance in a new scenario. Typically, human annotators need to conduct the data labeling work, which is time-consuming, labor-intensive and unscalable. To overcome the problem, we propose a novel unsupervised transfer learning framework for multispectral pedestrian detection, which adapts a multispectral pedestrian detector to the target domain based on pseudo training labels. In particular, auxiliary detectors are utilized and different label fusion strategies are introduced according to the estimated environmental illumination level. Intermediate domain images are generated by translating the source images to mimic the target ones, acting as a better starting point for the parameter update of the pedestrian detector. The experimental results on the KAIST and FLIR ADAS datasets demonstrate that the proposed method achieves new state-of-the-art performance without any manual training annotations on the target data.

21 pages, 6479 KiB  
Article
Semi-Automatic Method of Extracting Road Networks from High-Resolution Remote-Sensing Images
by Kaili Yang, Weihong Cui, Shu Shi, Yu Liu, Yuanjin Li and Mengyu Ge
Appl. Sci. 2022, 12(9), 4705; https://doi.org/10.3390/app12094705 - 7 May 2022
Cited by 3 | Viewed by 2115
Abstract
Road network extraction plays a critical role in data updating, urban development, and decision support. To improve the efficiency of labeling road datasets and addressing the problems of traditional methods of manually extracting road networks from high-resolution images, such as their slow speed and heavy workload, this paper proposes a semi-automatic method of road network extraction from high-resolution remote-sensing images. The proposed method needs only a few points to extract a single road in the image. After the roads are extracted one by one, the road network is generated according to the width of each road and the spatial relationships among the roads. For this purpose, we use regional growth, morphology, vector tracking, vector simplification, endpoint modification, road connections, and intersection connections to generate road networks. Experiments on four images with different terrains and different resolutions show that this method has high extraction accuracy under different image conditions. The comparisons with the semi-automatic GVF-snake method based on regional growth also showed its advantages and potentiality. The proposed method is a novel form of semi-automatic road network extraction, and it significantly increases the efficiency of road network extraction.

13 pages, 1626 KiB  
Article
Multi-Aspect Oriented Sentiment Classification: Prior Knowledge Topic Modelling and Ensemble Learning Classifier Approach
by Najwa AlGhamdi, Shaheen Khatoon and Majed Alshamari
Appl. Sci. 2022, 12(8), 4066; https://doi.org/10.3390/app12084066 - 18 Apr 2022
Cited by 7 | Viewed by 2865
Abstract
User-generated content on numerous sites is indicative of users’ sentiment towards many issues, from daily food intake to using new products. Amid the active usage of social networks and micro-blogs, notably during the COVID-19 pandemic, we may glean insights into any product or service through users’ feedback and opinions. Thus, it is often difficult and time consuming to go through all the reviews and analyse them in order to recognize the notion of the overall goodness or badness of the reviews before making any decision. To overcome this challenge, sentiment analysis has been used as an effective rapid way to automatically gauge consumers’ opinions. Large reviews will possibly encompass both positive and negative opinions on different features of a product/service in the same review. Therefore, this paper proposes an aspect-oriented sentiment classification using a combination of the prior knowledge topic model algorithm (SA-LDA), automatic labelling (SentiWordNet) and ensemble method (Stacking). The framework is evaluated using the dataset from different domains. The results have shown that the proposed SA-LDA outperformed the standard LDA. In addition, the suggested ensemble learning classifier has increased the accuracy of the classifier by more than ~3% when it is compared to baseline classification algorithms. The study concluded that the proposed approach is equally adaptable across multi-domain applications.

23 pages, 6564 KiB  
Article
Classification of Building Types in Germany: A Data-Driven Modeling Approach
by Abhilash Bandam, Eedris Busari, Chloi Syranidou, Jochen Linssen and Detlef Stolten
Data 2022, 7(4), 45; https://doi.org/10.3390/data7040045 - 9 Apr 2022
Cited by 18 | Viewed by 6536
Abstract
Details on building levels play an essential part in a number of real-world application models. Energy systems, telecommunications, disaster management, the internet-of-things, health care, and marketing are a few of the many applications that require building information. The essential variables that most of [...] Read more.
Building-level details play an essential part in a number of real-world application models. Energy systems, telecommunications, disaster management, the internet-of-things, health care, and marketing are a few of the many applications that require building information. The essential variables that most of these models require are building type, house type, area of living space, and number of residents. In order to acquire some of this information, this paper introduces a methodology and generates corresponding data. The study was conducted for specific applications in energy system modeling. Nonetheless, these data can also be used in other applications. Building locations and some of their details are openly available in the form of map data from OpenStreetMap (OSM). However, data regarding building types (e.g., residential, industrial, office, single-family house, or multi-family house) are only partially available in the OSM dataset. Therefore, a machine learning classification algorithm for predicting the building types on the basis of the OSM buildings’ data was introduced. Although the OSM dataset is the fundamental and most crucial one used for modeling, the machine learning algorithm’s training was performed on a dataset that was prepared by combining several features from three other datasets. The generated dataset consists of approximately 29 million buildings, of which about 19 million are residential, with 72% being single-family houses and the rest multi-family ones that include two-family houses and apartment buildings. Furthermore, the results were validated through a comparison with publicly available statistical data. The comparison of the resulting data with official statistics reveals a percentage error of 3.64% for residential buildings, 13.14% for single-family houses, and −15.38% for multi-family houses. Nevertheless, by incorporating the building types, this dataset is able to complement existing building information in studies in which building type information is crucial.
(This article belongs to the Topic Methods for Data Labelling for Intelligent Systems)

18 pages, 550 KiB  
Article
Rule-Enhanced Active Learning for Semi-Automated Weak Supervision
by David Kartchner, Davi Nakajima An, Wendi Ren, Chao Zhang and Cassie S. Mitchell
AI 2022, 3(1), 211-228; https://doi.org/10.3390/ai3010013 - 16 Mar 2022
Cited by 4 | Viewed by 4224
Abstract
A major bottleneck preventing the extension of deep learning systems to new domains is the prohibitive cost of acquiring sufficient training labels. Alternatives such as weak supervision, active learning, and fine-tuning of pretrained models reduce this burden but require substantial human input to select a highly informative subset of instances or to curate labeling functions. REGAL (Rule-Enhanced Generative Active Learning) is an improved framework for weakly supervised text classification that performs active learning over labeling functions rather than individual instances. REGAL interactively creates high-quality labeling patterns from raw text, enabling a single annotator to accurately label an entire dataset after initialization with three keywords for each class. Experiments demonstrate that REGAL extracts up to 3 times as many high-accuracy labeling functions from text as current state-of-the-art methods for interactive weak supervision, enabling REGAL to dramatically reduce the annotation burden of writing labeling functions for weak supervision. Statistical analysis reveals that REGAL performs as well as or significantly better than interactive weak supervision on five of six commonly used natural language processing (NLP) baseline datasets.
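As a rough illustration of what "labeling functions" seeded with a few keywords per class can look like, the following minimal sketch builds one keyword rule per seed word and combines their votes by simple majority. It is not REGAL and does not perform active learning over rules; the class names, the seed keywords, and the helper names `make_keyword_lf` and `weak_label` are all hypothetical and chosen only for this example.

```python
import re
from collections import Counter

ABSTAIN = None  # a labeling function returns this when it has no opinion

# Hypothetical seed keywords (three per class), for illustration only.
SEED_KEYWORDS = {
    "sports": ["match", "league", "goal"],
    "finance": ["stock", "market", "bank"],
}


def make_keyword_lf(keyword, label):
    """Build a labeling function that votes `label` if `keyword` occurs as a token."""
    def lf(text):
        tokens = re.findall(r"[a-z]+", text.lower())
        return label if keyword in tokens else ABSTAIN
    return lf


# One labeling function per (class, keyword) pair.
LABELING_FUNCTIONS = [
    make_keyword_lf(keyword, label)
    for label, keywords in SEED_KEYWORDS.items()
    for keyword in keywords
]


def weak_label(text):
    """Combine labeling-function votes by majority; abstain on no votes or a tie."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # tie between the two most-voted classes
    return ranked[0][0]
```

Majority voting is the simplest possible aggregation and is used here only to keep the sketch short; weak supervision systems typically replace it with a learned label model.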
(This article belongs to the Topic Methods for Data Labelling for Intelligent Systems)

16 pages, 5532 KiB  
Article
CarFree: Hassle-Free Object Detection Dataset Generation Using Carla Autonomous Driving Simulator
by Jaesung Jang, Hyeongyu Lee and Jong-Chan Kim
Appl. Sci. 2022, 12(1), 281; https://doi.org/10.3390/app12010281 - 28 Dec 2021
Cited by 9 | Viewed by 6153
Abstract
For safe autonomous driving, deep neural network (DNN)-based perception systems play an essential role, for which a vast number of driving images must be manually collected and labeled with ground truth (GT) for training and validation purposes. Motivated by the high cost and unavoidable human errors of manual GT generation, this study presents an open-source automatic GT generation tool, CarFree, based on the Carla autonomous driving simulator. With it, we aim to democratize the daunting task of (in particular) object detection dataset generation, which has been feasible only for large companies or institutes because of its high cost. CarFree comprises (i) a data extraction client that automatically collects relevant information from the Carla simulator’s server and (ii) post-processing software that produces precise 2D bounding boxes of vehicles and pedestrians on the gathered driving images. Our evaluation results show that CarFree can generate a considerable number of realistic driving images along with their GTs in a reasonable time. Moreover, using the synthesized training images with artificially made unusual weather and lighting conditions, which are difficult to obtain in real-world driving scenarios, CarFree significantly improves the object detection accuracy in the real world, particularly in the case of harsh environments. With CarFree, we expect its users to generate a variety of object detection datasets in a hassle-free way.
(This article belongs to the Topic Methods for Data Labelling for Intelligent Systems)

19 pages, 8381 KiB  
Article
3D Vehicle Trajectory Extraction Using DCNN in an Overlapping Multi-Camera Crossroad Scene
by Jinyeong Heo and Yongjin (James) Kwon
Sensors 2021, 21(23), 7879; https://doi.org/10.3390/s21237879 - 26 Nov 2021
Cited by 2 | Viewed by 2378
Abstract
The 3D vehicle trajectory in complex traffic conditions such as crossroads and heavy traffic is very useful in practice for autonomous driving. In order to accurately extract the 3D vehicle trajectory from a perspective camera in a crossroad where the vehicle has an angular range of 360 degrees, problems such as the narrow visual angle in a single-camera scene, vehicle occlusion under conditions of low camera perspective, and the lack of physical information about the vehicle must be solved. In this paper, we propose a method for estimating the 3D bounding boxes of vehicles and extracting trajectories using a deep convolutional neural network (DCNN) in an overlapping multi-camera crossroad scene. First, traffic data were collected using multiple overlapping cameras to obtain a wide range of trajectories around the crossroad. Then, 3D bounding boxes of vehicles were estimated and tracked in each single-camera scene through DCNN models (YOLOv4, multi-branch CNN) combined with camera calibration. Using the abovementioned information, the 3D vehicle trajectory could be extracted on the ground plane of the crossroad by combining the results obtained from the overlapping cameras using a homography matrix. Finally, in experiments, the errors of extracted trajectories were corrected through a simple linear interpolation and regression, and the accuracy of the proposed method was verified by calculating the difference from ground-truth data. Compared with other previously reported methods, our approach is shown to be more accurate and more practical.
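The homography step mentioned in the abstract can be illustrated with a small sketch: a 3×3 matrix maps an image point, in homogeneous coordinates, onto the ground plane, and the result is divided by the third coordinate. The function names and the averaging of per-camera estimates below are assumptions made for this example; the abstract does not state how the per-camera results are actually combined.

```python
def apply_homography(H, point):
    """Map an image point (x, y) to ground-plane coordinates using a 3x3 homography H."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if abs(w) < 1e-12:
        raise ValueError("point maps to infinity under this homography")
    # Divide by the homogeneous coordinate to get Cartesian ground-plane coordinates.
    return (u / w, v / w)


def fuse_ground_points(points):
    """Average ground-plane estimates of the same vehicle from several cameras.

    Plain averaging is an assumption of this sketch, not the paper's method.
    """
    if not points:
        raise ValueError("need at least one point")
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)
```

In use, each camera would have its own matrix `H` obtained from calibration, and the bottom-center of a tracked vehicle's box in each view would be projected with `apply_homography` before the per-camera estimates are fused.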
(This article belongs to the Topic Methods for Data Labelling for Intelligent Systems)

14 pages, 554 KiB  
Article
Hierarchical Concept Learning by Fuzzy Semantic Cells
by Linna Zhu, Wei Li and Yongchuan Tang
Appl. Sci. 2021, 11(22), 10723; https://doi.org/10.3390/app112210723 - 13 Nov 2021
Viewed by 1736
Abstract
Concept modeling and learning have been important research topics in artificial intelligence and knowledge discovery. This paper studies a hierarchical concept learning method that requires only a small amount of data to achieve competitive performance. The method starts from a set of fuzzy prototypes called Fuzzy Semantic Cells (FSCs). As a result of FSC parameter optimization, it creates a hierarchical structure of data–prototype–concept. Experiments are conducted to demonstrate the effectiveness of our approach in a classification problem. In particular, when faced with limited training data, our proposed method is comparable with traditional techniques in terms of robustness and generalization ability.
(This article belongs to the Topic Methods for Data Labelling for Intelligent Systems)