Figure 1.
Examples of strong and weak labels. A strong label consists of a sequence of sound event types and onset and offset times in the given recording. A weak label consists of sound event types only.
Figure 2.
Weakly supervised sound event detection framework. The outputs of the convolutional layers are segmentation maps, which are interpreted as audio tagging and sound event detection results. Because no ground truth for detection is provided, the detection loss is not used in training the network; only the classification loss is computed.
Figure 3.
U-Net architecture for weakly supervised sound event detection with full upsampling in the deconvolutional layers. The left half, the encoder, reduces the input to a smaller size through a series of convolutional and pooling layers. The right half, the decoder, restores the reduced bottom output map to the original input size. The number of channels at the final output layer is the same as the number of classes (a positive integer “C” in this example). For each output channel, a global pooling layer is applied to derive the audio tagging outputs, so that the number of target nodes is the same as the number of tagging classes.
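As an illustration of the tagging head described above, the sketch below shows how per-class segmentation maps can be pooled into clip-level tag scores. This is a toy example, not the paper's code: the array layout, the use of plain global average pooling (rather than GWRP or GTAP), and the 0.5 decision threshold are all illustrative assumptions.

```python
# Toy sketch: derive clip-level tag scores from per-class segmentation maps.
import numpy as np

def global_average_pooling(segmentation_maps: np.ndarray) -> np.ndarray:
    """segmentation_maps: (C, F, T) per-class maps in [0, 1].
    Returns a length-C vector of clip-level tag scores."""
    return segmentation_maps.mean(axis=(1, 2))

# C = 3 classes, F = 8 frequency bins, T = 10 frames (all made up).
rng = np.random.default_rng(0)
maps = rng.random((3, 8, 10))
tags = global_average_pooling(maps)   # shape (3,)
predicted = tags > 0.5                # hypothetical tagging threshold
print(tags, predicted)
```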
Figure 4.
Postprocessing procedures for audio tagging and sound event detection. The spectro-temporal 2D audio feature map, denoted by , is converted to a 2D segmentation map of the same size () and a 1D class prediction vector of length C (the number of event classes) to compute the prediction loss for audio tagging and model training. The segmentation maps are converted to C detection maps of length T (time), and thresholding is performed to find the onset and offset of the sound events ().
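A minimal sketch of the thresholding step in Figure 4, assuming one frame-wise detection curve per class; the threshold value, hop length, and function name are illustrative and not taken from the paper.

```python
# Binarize a per-class detection curve and report contiguous active runs
# as (onset, offset) pairs in seconds.
import numpy as np

def detect_events(curve: np.ndarray, threshold: float = 0.5,
                  hop_seconds: float = 0.02):
    """curve: (T,) frame-wise activation for one class."""
    active = curve > threshold
    events, start = [], None
    for t, a in enumerate(active):
        if a and start is None:
            start = t
        elif not a and start is not None:
            events.append((start * hop_seconds, t * hop_seconds))
            start = None
    if start is not None:
        events.append((start * hop_seconds, len(curve) * hop_seconds))
    return events

print(detect_events(np.array([0.1, 0.7, 0.9, 0.2, 0.6, 0.8, 0.1])))
# -> [(0.02, 0.06), (0.08, 0.12)]
```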
Figure 5.
U-Net architecture with limited upsampling in the deconvolutional layers. The left encoder part is the same as in U-Net, but the right decoder part upsamples only along the time axis to match the input time range.
Figure 6.
Training data generation procedure. Two audio samples are mixed with a background sound. Audio samples are clipped to a given length, normalized, and then mixed with normalized background noise. The x-axis is time in seconds, and the y-axis is the frequency bin index (larger indices correspond to higher frequencies).
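The mixing procedure of Figure 6 (clip, normalize, add onto a background) can be sketched as follows. The sample rate, clip lengths, peak normalization, and helper names are assumptions for illustration, not the exact DCASE synthesis code.

```python
# Illustrative mixing of clipped, normalized events onto a background sound.
import numpy as np

def peak_normalize(x: np.ndarray) -> np.ndarray:
    return x / (np.max(np.abs(x)) + 1e-9)

def mix(background: np.ndarray, events, sr: int = 16000) -> np.ndarray:
    """events: list of (waveform, onset_seconds, clip_seconds)."""
    out = peak_normalize(background).copy()
    for wav, onset, clip_len in events:
        wav = peak_normalize(wav[: int(clip_len * sr)])   # clip, then normalize
        start = int(onset * sr)
        end = min(start + len(wav), len(out))
        out[start:end] += wav[: end - start]              # add onto background
    return out

rng = np.random.default_rng(0)
bg = rng.standard_normal(10 * 16000) * 0.05   # 10 s of background noise
ev = rng.standard_normal(3 * 16000)           # a 3 s event
mixture = mix(bg, [(ev, 1.0, 2.0), (ev, 6.0, 2.0)])   # two clipped copies
```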
Figure 7.
Comparison of temporal overlaps of different mixing policies. (a) Original policy with no overlap; (b) longer clipping policy allowing some overlaps between events by using longer clips; (c) random onset policy allowing events to start at any time; some samples overlap, while others do not, due to the randomness; (d) mixed policy combining (b,c), in which overlaps are most frequently observed.
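The policies in Figure 7 differ only in how event start times are drawn. A rough sketch under assumed mixture and clip lengths follows; the actual onset values of the original policy are those listed in Table 1 and are not reproduced here.

```python
# Illustrative onset generation for two of the mixing policies.
import numpy as np

rng = np.random.default_rng(0)
MIX_LEN, CLIP_LEN = 10.0, 3.0   # assumed mixture and clip lengths in seconds

def original_onsets(n_events: int):
    # Events placed back to back, so they never overlap (policy (a)).
    return [i * CLIP_LEN for i in range(n_events)]

def random_onsets(n_events: int):
    # Onsets drawn uniformly; overlaps may or may not occur (policy (c)).
    return list(rng.uniform(0.0, MIX_LEN - CLIP_LEN, size=n_events))

print(original_onsets(2), random_onsets(2))
```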
Figure 8.
Sound event detection examples with various models and global pooling methods. The vertical axis in (b–h) represents the class number. (a) Spectrogram of the input mixture sample generated by the original synthetic policy. The x-axis is the time in seconds (10 s long), and the y-axis is the frequency (only are shown). (b) Ground truth labels. There are three distinct sound events, represented by bright red lines. For (b–h), the x-axis is the time in seconds aligned with the x-axis of the spectrogram in (a), and the y-axis represents the event labels. (c–h) display sound event labels predicted by (c) CNN with GWRP, (d) U-Net with GWRP, (e) LUU-Net with GWRP, (f) LUU-Net with AlphaMEX, (g) LUU-Net with MEX, and (h) LUU-Net with GTAP.
Figure 9.
Sound event detection examples with various models and global pooling methods. The vertical axis in (b–h) represents the class number. (a) Spectrogram of the input mixture sample generated by the longer clipping policy. (b) Ground truth labels. (c–h) display sound event labels predicted by (c) CNN with GWRP, (d) U-Net with GWRP, LUU-Net with (e) GWRP, (f) AlphaMEX, (g) MEX, and (h) GTAP. The x-axis is the time in seconds, and the y-axis represents the frequency (a) or event labels (b–h).
Figure 10.
Sound event detection examples with various models and global pooling methods. The vertical axis in (b–h) represents the class number. (a) Spectrogram of the input mixture sample generated by the random onset policy. (b) Ground truth labels. (c–h) display sound event labels predicted by (c) CNN with GWRP, (d) U-Net with GWRP, LUU-Net with (e) GWRP, (f) AlphaMEX, (g) MEX, and (h) GTAP. The x-axis is the time in seconds, and the y-axis represents the frequency (a) or event labels (b–h).
Figure 11.
Sound event detection examples with various models and global pooling methods. The vertical axis in (b–h) represents the class number. (a) Spectrogram of the input mixture sample generated by the mixed synthetic policy. (b) Ground truth labels. (c–h) display sound event labels predicted by (c) CNN with GWRP, (d) U-Net with GWRP, LUU-Net with (e) GWRP, (f) AlphaMEX, (g) MEX, and (h) GTAP. The x-axis is the time in seconds, and the y-axis represents the frequency (a) or event labels (b–h).
Figure 12.
Comparison of the number of steps per unit second. The first two sets of bars represent the CNN with GWRP and GTAP, each with 4 bars measured on the original, longer clipping, random onset, and mixed synthetic policies. The second two sets of bars represent U-Net, and the third two sets are for LUU-Net.
Figure 13.
Audio tagging (AT) performance comparison with various deep learning models, pooling methods, and training data generation policies. Average F1 scores are plotted; the x-axis shows the pooling methods, and the lines correspond to CNN, U-Net, and LUU-Net. The individual charts show the results with the original, longer clipping, random onset, and mixed synthetic policies.
Figure 14.
Sound event detection (SED) performance comparison with various deep learning models, pooling methods, and training data generation policies by average F1 scores.
Figure 15.
Illustrations of audio tagging and sound event detection performance under varying audio mixing conditions. The subfigures on the left are charts of the average precision scores of GTAP with . The x-axis is the audio synthetic policies. The upper chart shows audio tagging performance, and the lower one shows sound event detection performance. The subfigures at the center are charts of the average recall scores, and those on the right are the average F1 scores.
Table 1.
Audio synthetic policies. All numbers are in seconds. Column names: onset times are the beginnings of the events; max clip is the maximum clipping length; mean and std are the average and standard deviation of the clipping lengths. Row names: original is the policy suggested by the DCASE Challenge; longer clipping uses a longer maximum clipping length; random onset varies the onset times randomly; mixed combines the random onset and longer clipping policies.
| Policy | Onset Times | Max Clip | Mean | Std |
|---|---|---|---|---|
| original | , , | | | |
| longer clipping | , , | | | |
| random onset | uniformly random in | | | |
| mixed | uniformly random in | | | |
Table 2.
Basic convolutional blocks used in SED model construction. There are two types of convolutional layers, and , with and kernel sizes, respectively. There are also two types of deconvolutional layers. and have strides and , respectively. BN-ReLU is batch normalization and rectified linear unit activation at the output.
| Name | Kernel Size | Strides | Output Channels | Post Processing |
|---|---|---|---|---|
| | | | K | BN-ReLU |
| | | | | |
| | | | | |
| | | | | |
Table 3.
Pooling blocks and dropout layer used in SED model construction. The layer reduces the sizes by half in both the x- and y-axes, but reduces the y-axis by a factor of f, resizing the frequency axis, but not the time axis. is usually added after a convolutional layer, and is used in concatenating the output maps of different sizes in U-Net.
| Name | Description | Input Size | Output Size |
|---|---|---|---|
| | average pooling, stride | | |
| | average pooling, stride | | |
| | dropout with probability p | | |
Table 4.
Baseline CNN design. It is composed of 4 convolutional layers with kernel size , followed by a convolutional layer. The output of the last layer is for sound event detection.
| Name | Input Shape | Output Shape | Output Size |
|---|---|---|---|
| | | | |
| | | | 1,273,856 |
| | | | 2,547,712 |
| | | | 2,547,712 |
| | | | 19,904 |
| total output size | | | 7,006,208 + 19,904 |
Table 5.
U-Net design for sound event detection. It is divided into the encoder and decoder. The encoder consists of 3 convolutional blocks with average pooling, followed by a convolutional layer with dropout. The decoder is composed of 3 deconvolutional blocks with skip connections to the encoder feature maps, and the final convolutional layer is for event classification.
| | Name | Input Shape | Output Shape | Output Size |
|---|---|---|---|---|
| encoder | | | | |
| | | | | 79,872 |
| | | | | |
| | | | | 39,936 |
| | | | | |
| | | | | 19,968 |
| | | | | |
| | | | | 39,936 |
| decoder | | | | |
| | | | | 79,872 |
| | | | | |
| | | | | 159,744 |
| | | | | |
| | | | | 319,488 |
| | | | | 19,968 |
| total output size | | | | 738,816 + 19,968 |
Table 6.
The proposed LUU-Net (U-Net with limited upsampling) design for sound event detection. The encoder blocks are identical to those of U-Net, but the decoder uses , which upsamples along the time axis but not along the frequency axis. Therefore, the vertical size remains 8 throughout the decoder. In the layers, is applied, where , to match the vertical lengths of the encoder and decoder outputs.
| | Name | Input Shape | Output Shape | Output Size |
|---|---|---|---|---|
| encoder | | | | |
| | | | | 79,872 |
| | | | | |
| | | | | 39,936 |
| | | | | |
| | | | | 19,968 |
| | | | | |
| | | | | 39,936 |
| decoder | | | | |
| | with | | | 39,936 |
| | | | | |
| | with | | | 39,936 |
| | | | | |
| | with | | | |
| | | | | 2496 × C |
| total output size | | | | 299,520 + 2496 × C |
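The key difference from U-Net is that the decoder's transposed convolutions upsample only along time, so the frequency dimension stays at 8. A minimal PyTorch sketch of that contrast is shown below; the channel counts, kernel sizes, and the (frequency, time) axis convention are assumptions, not the exact layer settings of Table 6.

```python
# Contrast full upsampling (stride (2, 2)) with time-only upsampling
# (stride (1, 2)) in a transposed convolution.
import torch
import torch.nn as nn

full_up    = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=(2, 2))
limited_up = nn.ConvTranspose2d(64, 32, kernel_size=(1, 2), stride=(1, 2))

x = torch.randn(1, 64, 8, 39)    # (batch, channels, freq, time)
print(full_up(x).shape)          # torch.Size([1, 32, 16, 78])
print(limited_up(x).shape)       # torch.Size([1, 32, 8, 78]): freq stays at 8
```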
Table 7.
Classification of prediction results compared with ground truth labels. Ground truth labels are given, and predicted labels are the outputs of the binary classifiers. Symbols T, F, TP, FP, FN, and TN denote true, false, true positive, false positive, false negative, and true negative, respectively.
| | | Ground Truth | |
|---|---|---|---|
| | | T | F |
| predicted | T | TP | FP |
| | F | FN | TN |
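The precision, recall, and F1 scores underlying the class-wise means (mPrc, mRcl, mF1) reported in Tables 9–12 follow the standard definitions from the counts in Table 7. A worked example with made-up counts:

```python
# Precision, recall, and F1 from TP/FP/FN counts of one binary classifier.
def precision_recall_f1(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

print(precision_recall_f1(tp=40, fp=10, fn=20))  # (0.8, ~0.667, ~0.727)
```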
Table 8.
Grid search results on the dataset generated by the original policy. The model is the proposed LUU-Net. Classwise mean F1 scores of audio tagging () and sound event detection () tasks were computed, and their average was used to rank the hyperparameter values. The top 3 were .
| AT F1 | SED F1 | Average | Rank |
|---|---|---|---|
| 65.26 | 50.67 | 57.96 | 7 |
| 66.44 | 52.81 | 59.62 | 5 |
| 65.77 | 53.51 | 59.64 | 4 |
| 66.34 | 52.50 | 59.42 | 6 |
| 65.64 | 53.93 | 59.78 | 3 |
| 65.33 | 54.36 | 59.85 | 2 |
| 64.27 | 50.35 | 57.31 | 8 |
| 69.66 | 51.02 | 60.34 | 1 |
| 65.72 | 47.03 | 56.37 | 9 |
| 63.80 | 45.26 | 54.53 | 11 |
| 64.07 | 45.38 | 54.72 | 10 |
Table 9.
Audio tagging (AT) and sound event detection (SED) results on the dataset generated by the original synthetic policy. The neural network models were CNN, U-Net, and the proposed LUU-Net, whose configurations are shown in Table 4, Table 5, and Table 6, respectively. Various pooling methods were applied to the output of the LUU-Net: AlphaMEX, MEX, GWRP, and the proposed global threshold average pooling (GTAP) explained in Section 3.2 with varying . GTAP with the same values was also applied to the CNN and U-Net to compare the performance variations with LUU-Net. For all the experiments, the mean precision (mPrc), mean recall (mRcl), mean F1 score (mF1), and the number of steps per unit second were measured.
| Model | Pooling Method | AT mPrc | AT mRcl | AT mF1 | SED mPrc | SED mRcl | SED mF1 | #Step/s |
|---|---|---|---|---|---|---|---|---|
| CNN | GWRP | 47.1 | 70.7 | 53.1 | 40.2 | 45.1 | 39.5 | 3.69 |
| | GTAP | 35.2 | 75.9 | 46.9 | 74.1 | 11.7 | 19.1 | 4.54 |
| | GTAP | 34.3 | 75.1 | 45.8 | 72.6 | 13.3 | 21.2 | |
| | GTAP | 52.8 | 40.1 | 39.5 | 28.4 | 37.2 | 27.0 | |
| U-Net | GWRP | 53.0 | 80.9 | 62.9 | 40.2 | 66.7 | 49.2 | 8.97 |
| | GTAP | 48.2 | 79.3 | 58.9 | 56.7 | 49.0 | 50.6 | 13.11 |
| | GTAP | 46.5 | 81.2 | 58.0 | 53.0 | 53.3 | 51.7 | |
| | GTAP | 68.6 | 70.3 | 68.0 | 38.8 | 66.1 | 47.4 | |
| LUU-Net | AlphaMEX | 58.5 | 73.4 | 64.0 | 62.1 | 32.7 | 40.7 | 21.11 |
| | MEX | 56.7 | 76.6 | 64.1 | 50.1 | 55.6 | 51.5 | 32.25 |
| | GWRP | 56.8 | 77.4 | 64.1 | 45.9 | 60.0 | 50.7 | 28.99 |
| | GTAP | 56.7 | 77.9 | 64.5 | 56.0 | 52.0 | 52.5 | 35.68 |
| | GTAP | 55.7 | 79.0 | 64.4 | 53.2 | 55.6 | 53.1 | |
| | GTAP | 67.0 | 72.6 | 68.8 | 42.3 | 64.2 | 50.0 | |
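For reference, one plausible reading of the "global threshold average pooling" (GTAP) named above is to average only the segmentation-map values that exceed a threshold. The sketch below is a guess at that behaviour for illustration only, not the definition given in Section 3.2; the fallback to the global maximum is also an assumption.

```python
# Hedged sketch of a threshold-style global pooling over one class map.
import numpy as np

def threshold_average_pool(seg_map: np.ndarray, threshold: float) -> float:
    """seg_map: (F, T) per-class segmentation map in [0, 1]."""
    selected = seg_map[seg_map > threshold]
    # Average the above-threshold values; if none, fall back to the max.
    return float(selected.mean()) if selected.size else float(seg_map.max())

rng = np.random.default_rng(0)
print(threshold_average_pool(rng.random((8, 39)), threshold=0.7))
```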
Table 10.
AT and SED results on the dataset generated by the longer clipping policy. The configurations of the CNN, U-Net, and proposed LUU-Net are shown in Table 4, Table 5, and Table 6, respectively. The pooling methods are AlphaMEX, MEX, GWRP, and the proposed GTAP with varying values.
| Model | Pooling Method | AT mPrc | AT mRcl | AT mF1 | SED mPrc | SED mRcl | SED mF1 | #Step/s |
|---|---|---|---|---|---|---|---|---|
| CNN | GWRP | 48.0 | 71.1 | 54.2 | 46.1 | 34.4 | 36.7 | 3.57 |
| | GTAP | 35.7 | 74.0 | 46.8 | 77.2 | 7.2 | 12.5 | 4.37 |
| | GTAP | 34.4 | 77.7 | 46.6 | 76.3 | 9.2 | 15.7 | |
| | GTAP | 52.4 | 43.1 | 42.5 | 36.7 | 34.4 | 31.3 | |
| U-Net | GWRP | 54.2 | 80.8 | 63.7 | 48.0 | 50.1 | 47.4 | 8.64 |
| | GTAP | 50.2 | 78.2 | 59.7 | 62.5 | 34.6 | 42.8 | 12.54 |
| | GTAP | 48.2 | 80.4 | 59.2 | 60.4 | 37.5 | 44.4 | |
| | GTAP | 67.6 | 69.0 | 66.9 | 47.2 | 51.4 | 47.3 | |
| LUU-Net | AlphaMEX | 60.0 | 73.7 | 64.9 | 67.6 | 25.9 | 35.8 | 21.13 |
| | MEX | 57.0 | 77.0 | 64.3 | 56.6 | 40.0 | 45.4 | 32.33 |
| | GWRP | 56.8 | 78.0 | 64.4 | 53.0 | 44.6 | 46.8 | 28.38 |
| | GTAP | 57.2 | 76.8 | 64.4 | 62.3 | 36.9 | 44.8 | 35.12 |
| | GTAP | 55.8 | 77.8 | 64.1 | 60.1 | 40.1 | 46.6 | |
| | GTAP | 66.0 | 71.2 | 67.4 | 50.5 | 48.6 | 48.0 | |
Table 11.
AT and SED results on the dataset generated by the random onset policy. The configurations of the CNN, U-Net, and proposed LUU-Net are shown in Table 4, Table 5, and Table 6, respectively. The pooling methods are AlphaMEX, MEX, GWRP, and the proposed GTAP with varying values.
| Model | Pooling Method | AT mPrc | AT mRcl | AT mF1 | SED mPrc | SED mRcl | SED mF1 | #Step/s |
|---|---|---|---|---|---|---|---|---|
| CNN | GWRP | 36.9 | 60.9 | 42.6 | 31.0 | 37.0 | 30.8 | 3.66 |
| | GTAP | 30.6 | 66.1 | 40.4 | 61.7 | 13.5 | 20.7 | 4.54 |
| | GTAP | 28.4 | 68.4 | 38.7 | 54.9 | 17.0 | 24.2 | |
| | GTAP | 39.1 | 33.9 | 32.0 | 19.8 | 30.1 | 20.2 | |
| U-Net | GWRP | 39.4 | 66.7 | 48.1 | 29.2 | 49.2 | 35.6 | 8.67 |
| | GTAP | 38.0 | 65.7 | 46.8 | 48.1 | 36.1 | 39.6 | 12.97 |
| | GTAP | 35.5 | 66.6 | 45.1 | 43.4 | 38.5 | 39.6 | |
| | GTAP | 53.7 | 50.8 | 50.7 | 26.7 | 45.8 | 32.5 | |
| LUU-Net | AlphaMEX | 43.9 | 60.7 | 49.4 | 51.2 | 19.5 | 26.7 | 21.13 |
| | MEX | 40.8 | 64.2 | 48.6 | 35.6 | 41.9 | 37.2 | 32.33 |
| | GWRP | 42.9 | 64.5 | 49.8 | 34.3 | 43.7 | 37.1 | 27.26 |
| | GTAP | 43.4 | 64.2 | 50.5 | 46.9 | 37.7 | 40.8 | 33.64 |
| | GTAP | 42.3 | 65.0 | 50.0 | 42.3 | 39.7 | 40.0 | |
| | GTAP | 51.3 | 55.7 | 52.1 | 28.4 | 45.6 | 34.0 | |
Table 12.
AT and SED results on the dataset generated by the mixed policy. The configurations of the CNN, U-Net, and proposed LUU-Net are shown in Table 4, Table 5, and Table 6, respectively. The pooling methods are AlphaMEX, MEX, GWRP, and the proposed GTAP with varying values.
| Model | Pooling Method | AT mPrc | AT mRcl | AT mF1 | SED mPrc | SED mRcl | SED mF1 | #Step/s |
|---|---|---|---|---|---|---|---|---|
| CNN | GWRP | 32.2 | 57.4 | 38.7 | 32.4 | 27.8 | 27.9 | 3.59 |
| | GTAP | 27.4 | 62.3 | 36.7 | 59.8 | 9.0 | 14.8 | 4.42 |
| | GTAP | 26.6 | 62.9 | 36.1 | 62.6 | 10.4 | 16.7 | |
| | GTAP | 41.6 | 33.8 | 32.3 | 30.0 | 28.3 | 24.9 | |
| U-Net | GWRP | 36.0 | 62.3 | 44.2 | 33.8 | 37.3 | 34.1 | 8.64 |
| | GTAP | 34.4 | 61.1 | 42.9 | 54.2 | 25.8 | 33.5 | 12.51 |
| | GTAP | 33.5 | 62.0 | 42.5 | 50.9 | 28.2 | 34.8 | |
| | GTAP | 50.6 | 50.0 | 48.8 | 33.8 | 40.1 | 34.8 | |
| LUU-Net | AlphaMEX | 39.0 | 57.9 | 45.4 | 51.4 | 15.8 | 22.7 | 21.05 |
| | MEX | 36.8 | 60.7 | 44.3 | 37.8 | 32.4 | 33.1 | 32.35 |
| | GWRP | 38.5 | 60.1 | 45.5 | 37.8 | 33.8 | 34.3 | 28.21 |
| | GTAP | 37.7 | 60.4 | 45.4 | 50.4 | 27.3 | 34.1 | 34.92 |
| | GTAP | 38.7 | 60.9 | 46.0 | 47.4 | 29.9 | 35.3 | |
| | GTAP | 47.3 | 53.4 | 48.8 | 34.4 | 38.4 | 34.5 | |