1. Introduction
Musical performances with acoustic instruments and voice are often associated with electronic-digital equipment, ranging from an amplification system to more complex context-sensitive signal processing in which data from different kinds of sensors are combined. The term Augmented Musical Instrument (AMI), or augmented musical practice, is used to characterize these situations. Digital Musical Instruments (DMIs), on the other hand, are fully built from different types of sensors, at times mimicking traditional acoustic interfaces and physical vibrational sources [1]. These two broad categories are not neatly separated, one reason being that they can share a good deal of software and hardware. Augmented instruments can also be used for studying performance techniques and expressiveness in musical performances, with Yamaha’s Disklavier as a notable example [2].
High-quality sensors are normally expensive and, in many cases, require special installations and conditions for their use. In recent years, affordable 3D Micro-Electro-Mechanical Systems (MEMS) sensors, e.g., accelerometers and gyroscopes, have become available for consumer use [3]. These devices are known as Inertial Measurement Units (IMUs) and can be differentiated by the number of Degrees of Freedom (DoF) offered by the sensors implemented on them: 6 DoF (three-dimensional (3D) gyroscopes combined with 3D accelerometers) and 9 DoF (3D gyroscopes combined with 3D accelerometers and 3D magnetometers). The development of algorithms for the fusion of these data allows for estimating linear acceleration (isolated from gravity), angular velocity (rotation), and attitude (spatial orientation).
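The MetaMotionR used in this study performs such fusion onboard, and its exact algorithm is not reproduced here. Purely as an illustration of the principle, the Python sketch below (with made-up parameter values) fuses gyroscope and accelerometer readings into roll and pitch estimates using a basic complementary filter: the gyroscope provides the short-term dynamics, while the gravity direction measured by the accelerometer corrects the long-term drift.

```python
import numpy as np

def complementary_filter(gyro, accel, dt, alpha=0.98):
    """Fuse gyroscope (rad/s) and accelerometer (m/s^2) samples into
    roll and pitch estimates (rad) with a basic complementary filter.
    gyro, accel: arrays of shape (N, 3); dt: sample period in seconds."""
    n = len(gyro)
    roll = np.zeros(n)
    pitch = np.zeros(n)
    for i in range(1, n):
        # Short-term estimate: integrate angular velocity (drifts over time)
        roll_g = roll[i - 1] + gyro[i, 0] * dt
        pitch_g = pitch[i - 1] + gyro[i, 1] * dt
        # Long-term estimate: tilt from the gravity direction (noisy, drift-free)
        ax, ay, az = accel[i]
        roll_a = np.arctan2(ay, az)
        pitch_a = np.arctan2(-ax, np.hypot(ay, az))
        # Blend: trust the gyro in the short term, gravity in the long term
        roll[i] = alpha * roll_g + (1 - alpha) * roll_a
        pitch[i] = alpha * pitch_g + (1 - alpha) * pitch_a
    return roll, pitch
```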
In the field of music and new technologies, considerable effort has been made to adequately characterize these devices, which combine flexibility with budget constraints. In the field of new interfaces for musical expression, it is not uncommon for designs to be based on a good deal of intuition. In these designs, the search for personal expression and the development of expertise defy standardization. On the other hand, it is also possible to find studies dedicated to a systematic review of different types of sensors for musical expression [4]. There are also hybrid initiatives that strike a balance between these two poles, such as the present study, which associates artistic needs with the development of a research project in a short period of time.
We conducted an experiment at UFMG with a wearable 9 DoF wireless IMU, the MetaMotionR. This experiment was extended by a comparative study using a Qualisys optical motion capture system during a three-month research visit to McGill University. Our goal was to study the accuracy of the data generated by the sensor in a musical performance, the characteristics of its wireless transmission, and its potential for use in interactive music setups. To accomplish this, we adopted actual musical situations, capturing performances on a nylon-string guitar and comparing the data provided by the optical motion capture and inertial systems. Several issues arose during the experiment and during data analysis, and they are studied in detail in this work: the Bluetooth Low Energy (BLE) transmission protocol, the delay in the IMU response, the positioning of the sensor and markers, data synchronization, and the integration of acceleration curves. Thus, this study contemplates an evaluation of the effect of these issues on the IMU data, together with qualitative and quantitative analyses of the IMU measurements compared to the motion capture data provided by Qualisys.
Comparisons were made using rotational and translational data. The direct comparison of the attitude angles delivered by each system (rotational data) is easily addressed, in contrast to the more complex comparison of translation movements. This difference arises because the selected movements include a combination of translation and rotation, which affects the linear acceleration values delivered by the inertial sensors. Thus, we propose a compensation method for the double integration of acceleration curves in cyclic movements, which may be useful in situations not requiring high accuracy. Further comparisons are performed by differentiating the positional data. This study is exploratory, and its objective is to obtain a general qualitative view of the behavior and accuracy of the selected sensor under the conditions offered by our setup. All of the evaluation code and proposed methods, as well as the datasets, are publicly available at www.musica.ufmg.br/lapis.
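The compensation method itself is detailed in Section 3; the sketch below only conveys the underlying idea, under the assumption (ours, for illustration) that a cyclic strumming movement returns to its starting region, so the velocity obtained by integrating the acceleration can be de-trended cycle by cycle before the second integration.

```python
import numpy as np
from scipy.integrate import cumulative_trapezoid

def integrate_cyclic(acc, fs, cycle_bounds):
    """Double-integrate a 1D linear acceleration signal (m/s^2, sampled
    at fs Hz), removing the mean velocity within each cycle before the
    second integration. cycle_bounds are sample indices delimiting the
    cycles (assumed known here, e.g., from zero-crossings)."""
    t = np.arange(len(acc)) / fs
    vel = cumulative_trapezoid(acc, t, initial=0.0)
    vel_corr = vel.copy()
    for start, stop in zip(cycle_bounds[:-1], cycle_bounds[1:]):
        # Integration drift shows up as a non-zero mean velocity per cycle
        vel_corr[start:stop] -= vel[start:stop].mean()
    pos = cumulative_trapezoid(vel_corr, t, initial=0.0)
    return vel_corr, pos
```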
The paper is structured as follows. Section 2 presents related works that adopted motion capture systems and IMUs to analyze human movements, ranging from validation to exploratory studies. In Section 3, we describe the chosen IMU, some aspects of data streaming with the BLE protocol, the Qualisys configuration used in this study, and the setup used to receive and record sensor data and audio from one guitar on one notebook; the proposed compensation method for integration is also presented there. Section 4 describes the musical excerpts and the participants, and defines the adopted recording protocol. Section 5 describes the steps used to align the data, as well as the segmentation process. The presentation of results in Section 6 includes quantitative and qualitative analyses of the data comparison. Finally, Section 7 discusses the results and the prospects for using the selected sensor in everyday musical situations.
2. Related Work
Although the present study is exploratory and not dedicated to the validation of a specific sensor, it is precisely in this area that we can find a series of works that compare data from inertial sensors and 3D optical systems. Comparisons can be made in highly controlled environments or involve participants with diverse purposes. Cuesta-Vargas et al. [5] conducted an extensive literature review of works that “compare inertial sensors with any kind of gold standard”, indicating that “this gold standard has to be a tool for measuring human movement”. Lee and Jung [6] proposed a new method for local frame alignment between an IMU and a motion capture system, validating it with well-controlled experiments. Ricci et al. [7] used a robotic arm and commercial IMUs to study movements related to “typical human dynamics”, and obtained errors of up to 10°, depending on the frequency, amplitude, and orientation of the rotations.
Some validations are made based on data from sports practice [8,9], daily human activities [10], surgical practice [11], or even horse walking and trotting [12]. Several validation studies are aimed at clinical purposes and measure the angular movements of different parts of the body or gait [13,14,15,16,17,18]. In such cases, accuracy is essential to define limits between normal and impaired movements.
We encountered two works that used sensors from the same brand we have chosen for the present study. Anwary et al. [13] use IMUs made by Mbientlab for analyzing gait asymmetries, aided by an algorithm specially written for data fusion. They also use Qualisys measurements to validate the results that were obtained with the IMUs. Here, the distance estimation is made by double integration, using a method known as zero-velocity update (ZUPT). This method is justified as follows: “when a stationary period of the acceleration is detected the assumption is made that the foot is on the ground and the velocity at that time is set to 0”. The results, obtained for a young and an older group, are all above 88%.
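As an illustration of the ZUPT principle (not Anwary et al.’s actual implementation), a minimal Python sketch could detect stationary periods by thresholding the acceleration magnitude, clamp the integrated velocity to zero during those periods, and then integrate again to obtain the distance; the threshold below is an arbitrary placeholder.

```python
import numpy as np
from scipy.integrate import cumulative_trapezoid

def zupt_distance(acc, fs, still_thresh=0.05):
    """Estimate traveled distance from a 1D linear acceleration signal
    (m/s^2, sampled at fs Hz) with a zero-velocity update: whenever the
    acceleration magnitude stays below still_thresh, the velocity
    estimate is reset to zero."""
    t = np.arange(len(acc)) / fs
    vel = cumulative_trapezoid(acc, t, initial=0.0)
    stationary = np.abs(acc) < still_thresh
    vel = np.where(stationary, 0.0, vel)  # zero-velocity update
    return cumulative_trapezoid(np.abs(vel), t, initial=0.0)[-1]
```

A more careful implementation would also redistribute the drift accumulated between consecutive stationary periods, but the reset step above is the core of the method.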
Beange [16] conducted validation tests with MetaMotionR sensors in two different environments: a controlled one, using a motorized gimbal with rotations around the three axes, and an uncontrolled one, for the assessment of functional movement quality of the spine in patients with low back pain. Rotational data produced by the IMUs were compared with optical motion capture data produced by a Vicon system. The conclusion is that the IMUs “have acceptable performance in all axes when considering absolute angle orientation and motion tracking, and measurement of local dynamic stability; however, there is a low-to-moderate correlation in one non-primary axis, and that axis changes depending on the direction of motion”. This work also highlights the importance of proper sensor placement and reports problems with real-time streaming due to inconsistencies in the frame rate.
In all of these studies, the results depend on the experiment design (controlled environment or not, choice of sensors, type of movements), selection of participants, and intended goals. Most of them validate the angular displacement data produced by the IMUs. In clinical situations, Root-Mean-Square Errors (RMSE) of 2° or less are acceptable, and between 2 and 5° tolerable [16]. For in-field applications, Bessone et al. [8] consider that RMSE below 5° are acceptable, and below 10° tolerable.
In the music technology field, we also find studies that compare optical and inertial data, such as the ones conducted by Skogstad et al. [19] on synthesis control and by Solberg and Jensenius [20] on the analysis of dancing to electronic dance music. There are also works dedicated to conducting gestures, for use in interactive and multimodal environments [21] or in human-robot interaction [22]. Polfreman [23] compared the recognition of hand postures, for musical purposes, using optical, inertial, and muscular information.
A few works analyze communication protocols for interactive systems to estimate their viability in live performances. McPherson et al. [24] studied the latencies presented by standard configurations used in these systems, while Wang et al. [25] reviewed the capabilities and drawbacks of using BLE in musical applications. They focused on BLE Musical Instrument Digital Interface (MIDI) as compared to other wired and wireless implementations. The BLE protocol showed higher latency values than the other options. Despite that, the authors concluded that “BLE MIDI is a potentially interesting replacement for wired MIDI interfaces due to its wireless capability and extensive hardware and software support in modern systems.”
Works dedicated to guitarists’ gestures are not very common, especially those that focus on the strumming technique. They can be grouped into: creative applications in real-time [26,27]; development of new interfaces or augmented instruments [28,29,30,31]; and performance analysis [32,33,34,35].
The work by Visi et al. [26] is a real-time creative application in which wristbands with flex sensors and three-axis accelerometers are used on both arms of an electric guitar player. Acceleration data produced by the strumming gesture are mapped to sound processing, which “affects the timbre and decay of the strummed chord according to the intensity of the movement”. Pérez et al. [27] propose a method for extracting guitar instrumental controls in real-time by combining motion capture data, audio analysis, and the musical score. Guaus et al. [28] propose a gesture capture system able to gather the movements of the left-hand fingers, using capacitive sensors on the fingerboard of guitars. Larsen et al. [29] developed an actuated guitar that “utilizes a normal electrical guitar, sensors to capture the rhythmic motion of alternative fully functioning limbs, such as a foot, knee, or head, and a motorized fader moving a pick back and forth across the strings”. Harrison et al. [30] developed “four guitar-derivative DMIs to be suitable for performing strummed harmonic accompaniments to a folk tune” and studied the reactions of 32 players (half of them competent guitarists and the other half non-musicians) to different input modalities and physical forms. Verwulgen et al. [31] implemented two types of ergonomic adaptations to the guitar design and tested them with professional players using a Vicon motion capture system.
The works focused on performance analysis usually employ sensors to characterize players or technical patterns of guitar strumming. Matsushita and Iwase [32] present a wristwatch-like device that uses a three-axis gyro sensor to analyze the players’ guitar strumming; with it, they could clearly distinguish between beginners and experienced players. Freire et al. [33] describe a study of microtiming features present in accompaniment patterns played with strummed chords. Perez-Carrillo [34] presents two methods to reconstruct the trajectory of occluded markers in motion capture sessions of guitarists plucking strings: “a rigid-body model to track the motion of the guitar strings and a flexible-body model to track the motion of the hands.” Armondes et al. [35] propose a multimodal approach to the strumming technique, using multichannel audio recording, an IMU with six degrees of freedom, and high-frame-rate video recording to analyze the connection between gestures, rhythm, and the generated sound.
4. Participants, Excerpts, and Recording Procedure
Given the main purpose of this short-term study, we chose to collect data in a simulation of real musical situations, focused on percussion-related gestures and without very rigid control over the performances. The choice of a classical guitar with hexaphonic pickups has a few reasons: the development of augmented nylon guitars in the two partner institutions (McGill and UFMG), the fact that some of the researchers are also guitarists, and the possibility of further analyses including the multichannel audio recordings. The musical excerpts were remotely defined by one of the authors, and two others were responsible for the performances. Such a procedure is not uncommon in our laboratories, which focus on the development and use of new musical interfaces. After recording several versions of each excerpt, each musician chose the two best renditions of each passage. For the objectives of the present study, which do not include performance analyses or gesture characterization, the data provided by two semi-professional players are sufficient. Musician 1 is left-handed and Musician 2 is right-handed; both played the same instrument.
4.1. Excerpts and Takes
The chosen musical excerpts are: (1) a sequence of down-and-up strummed chords with a regular rhythm, based on Pink Floyd’s Breathe (Figure 8), with a duration of four seconds; (2) a more complex rhythmic pattern, typical of rock accompaniment, to be interpreted with some freedom (Figure 9), with a duration of ca. eight seconds; and (3) a sequence of three different percussive musical phrases on the guitar (Figure 10), with durations between one and three seconds.
The first and second excerpts explore cyclic and relatively fast down-and-up movements of the right hand. The gestures in the first excerpt are to be played at a frequency of 120 cycles per minute, and some gestures in the second excerpt are played at a frequency of 140 cycles per minute.
The takes will be referred to by a combination of letters and numbers indicating the musician (m1 or m2), the excerpt (r1, r2, or r3), and the take (t1 or t2). For example, m1r1t2 indicates the second take of excerpt (rhythm) one made by the first musician. Excerpts 1 and 2 were played by Musician 1 with a pick and by Musician 2 without a pick.
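For scripted batch analysis, such labels can be parsed mechanically; the helper below is a hypothetical Python illustration of the naming scheme, not part of the original toolchain.

```python
import re

def parse_take(label):
    """Split a take label such as 'm1r1t2' into musician, excerpt,
    and take numbers (illustrative helper only)."""
    match = re.fullmatch(r"m(\d)r(\d)t(\d)", label)
    if match is None:
        raise ValueError(f"unexpected take label: {label!r}")
    musician, excerpt, take = map(int, match.groups())
    return {"musician": musician, "excerpt": excerpt, "take": take}

print(parse_take("m1r1t2"))  # {'musician': 1, 'excerpt': 1, 'take': 2}
```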
4.2. Recording Procedure
Recordings in the Qualisys and GuiaRT/IMU systems were made independently. The musician’s chair was positioned at the origin of the reference frame, facing the positive x-axis. When the motion capture cameras started recording, the musician pressed a pedal, which started the recording in Max/MSP, along with a click played on the loudspeakers. After four clicks, the performance started. The capture duration was predefined in the motion capture software QTM; for the guitar and IMU, the release of the pedal marked the end. The IMU was attached in different positions, with double-sided adhesive tape, to test whether different orientations around the z-axis could influence the results, as depicted in Figure 11. We observed that some positions of the guitarists’ right hand favor the occurrence of confusion between the axes (gimbal lock). Thus, this factor was taken into consideration when positioning the sensor, in order to avoid singularities in the calculation of the rotation angles.
The data recorded on both systems were transferred to Matlab, and each take was stored as a data structure. All analyses were performed on this platform after a processing phase.
5. Data Processing
This section outlines the processing of the data yielded by the IMU and the Qualisys capture system. This processing consists of a data alignment phase and an excerpt segmentation that allows a more accurate analysis. Specific settings for filters and thresholds are also presented in this section.
5.1. Aligning the Data
The first step to align the data from both systems is to estimate the offset between the audio recordings, that is, the delay between the audio recorded on video by Qualisys and the audio recorded on GuiaRT. After that, we estimated the distance between the guitar and the camera at 4.5 m; this distance corresponds to a delay of approximately 13 ms to be compensated from the offset value. The IMU response delay was fixed at 24.4 ms, according to the empirical analysis in Section 3.1.4. Finally, we empirically observed that the best alignments were achieved with an additional delay of one video frame, that is, 40 ms.
Therefore, the offset between the audio recordings must be compensated by these three values in order to achieve the time alignment between the data from the motion capture system and the IMU. The result of this operation is rounded to the nearest multiple of 10 ms (as the sample rate is 100 Hz) and then applied to shift back the motion capture data.
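As a minimal numerical illustration of this compensation (the measured audio offset below is a made-up value and the sign convention is our assumption; only the three correction terms come from the text):

```python
# Hypothetical offset measured between the two audio recordings (ms)
audio_offset_ms = 152.0
# Correction terms described in the text
acoustic_delay_ms = 13.0       # ~4.5 m between guitar and camera
imu_response_delay_ms = 24.4   # empirically estimated IMU response delay
video_frame_delay_ms = 40.0    # one video frame

shift_ms = audio_offset_ms - (acoustic_delay_ms
                              + imu_response_delay_ms
                              + video_frame_delay_ms)
# Round to the nearest multiple of 10 ms (100 Hz sample period)
shift_ms = round(shift_ms / 10.0) * 10.0
print(shift_ms)  # 70.0 -> shift the motion capture data back by 70 ms
```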
The attitude angles also needed some alignment. The first step was to invert the sign of the IMU angles, for the reasons presented above (Section 3.4.1). In addition, as we were not dealing with the absolute orientation of the rigid bodies, and were also using different positions of the IMU on the hand, it was also necessary to align the Euler angles around the z-axis for each take. This was done by observing the minimal and maximal values of this quantity, from which an offset was estimated and further refined by visual inspection of the curves. All of the extrinsic rotations were done in the xyz order.
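As an illustration of this convention (using SciPy here, rather than the Matlab routines actually employed), Euler angles in extrinsic xyz order can be turned into rotation matrices after removing a per-take yaw offset; all numeric values below are arbitrary examples.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

# Example attitude angles in degrees (roll, pitch, yaw) -- arbitrary values
angles = np.array([[10.0, -5.0, 32.0],
                   [12.0, -4.0, 33.5]])

# Remove a constant yaw offset estimated for the take (hypothetical value)
yaw_offset = 30.0
angles[:, 2] -= yaw_offset

# Lower-case "xyz" in SciPy denotes extrinsic rotations about the fixed axes
rotations = R.from_euler("xyz", angles, degrees=True)
print(rotations.as_matrix().shape)  # (2, 3, 3)
```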
For the conversion of the acceleration values delivered by the IMU in g (acceleration of gravity), we used the value 9.81 m/s². Values are expressed in degrees for angles, in m/s for speeds, and in cm for displacements. The IMU origin point (Figure 4b) was considered to be the origin of its frame of reference, around which the rotations are calculated.
Occlusion is an intrinsic problem of optical capture systems; in this study, it was handled by performing a linear interpolation between the nearest captured points.
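A minimal sketch of this gap-filling step (in Python with NumPy, assuming occluded frames are flagged as NaN in a single marker coordinate) might read:

```python
import numpy as np

def fill_occlusions(x):
    """Linearly interpolate NaN gaps in a 1D marker trajectory,
    using the nearest captured samples on each side of the gap."""
    x = np.asarray(x, dtype=float).copy()
    missing = np.isnan(x)
    idx = np.arange(len(x))
    x[missing] = np.interp(idx[missing], idx[~missing], x[~missing])
    return x

print(fill_occlusions([1.0, np.nan, np.nan, 4.0, 5.0]))  # [1. 2. 3. 4. 5.]
```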
5.2. Segmentation
In performances, musicians widely explore two types of control: attacks (sudden bursts of energy) and continuous modulations of some physical or acoustical characteristic. A sensor attached to specific parts of the body is expected to provide both types of information. In the present case, the attacks are expressed by sudden variations in acceleration and angular velocity, while the modulations are linked to the attitude angles. In situations where the sensor fusion mode is not used, gravity is used instead.
However, accelerometers are not suitable for detecting slow translations (without rotations), as in these cases the acceleration values would be masked by the background noise. This is the reason why we decided to divide the third excerpt into three different parts before the analysis, as it demands a gradual translation of the right hand from the bridge to a region close to the guitar neck.
Excerpts 1 and 2 will be analyzed as a whole, from 100 ms before the first attack to the moment of the last attack; their durations are 4000 ms and 8000 ms, respectively. In excerpt 3, the rotation comparison will be made for each segment; for translations, it will be made for each single attack (as in gestures a and b) or compound attack (as in gesture c).
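The attack instants that delimit these analysis windows can be located, for instance, by peak-picking on the acceleration magnitude; the sketch below is our illustration of such a step, with arbitrary threshold values, and not the exact procedure used in the study.

```python
import numpy as np
from scipy.signal import find_peaks

def attack_times(acc_xyz, fs, min_height=5.0, min_gap_s=0.1):
    """Locate attack instants as peaks in the acceleration magnitude.
    acc_xyz: (N, 3) linear acceleration in m/s^2; fs: sample rate in Hz.
    min_height and min_gap_s are arbitrary illustrative values."""
    magnitude = np.linalg.norm(acc_xyz, axis=1)
    peaks, _ = find_peaks(magnitude, height=min_height,
                          distance=max(1, int(min_gap_s * fs)))
    return peaks / fs  # attack times in seconds

# The analysis window for excerpts 1 and 2 then runs from 100 ms before
# the first detected attack to the last one:
# attacks = attack_times(acc, 100.0)
# window = (attacks[0] - 0.1, attacks[-1])
```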
7. Discussion
We believe that the main goals of this study have been fulfilled. First, we were able to gain a deeper understanding of the BLE transmission protocol, estimate the sensor’s response delay, and implement real-time streaming with a fixed sample rate. It was also possible to evaluate the data produced by the selected sensor in real musical situations, exploring different types of complex hand gestures. The time alignment procedure proved to be adequate for the proposed comparisons. All of the measurements are affected by the use of the MetaMotionR sensor in real-time streaming (delay, jitter, filtering), the variety of gestures, the sensor positioning, and possibly also by the size of the sensor: the markers used to define it as a rigid body were placed at almost critical distances.
The three comparisons made here (between rotations, accelerations, and displacements) have different and somewhat cumulative sources of error. The most direct comparison, between the attitude angles, is affected by yaw offsets and drifts, and by the axes most explored by the gestures. Nevertheless, we obtained results between acceptable and tolerable when compared to those described in the literature [8,16].
The comparison between the accelerations, with the motion capture values obtained by differentiating the positional data, is also affected by the accuracy of the rotational data, due to the use of rotation matrices, and possibly also by the spatial orientation of the guitar. For this type of comparison, we did not find any reference values in the literature. As they involve fewer sources of error, these results could be used as relative references for the comparison of displacements, since they refer to the same translation movements.
The comparison of displacements obtained by integrating the acceleration data is the most complex, as it depends on several factors: the accuracy of the rotation angles, the effectiveness of the zero-crossing detection, the type of gesture under analysis, and the limitations of the integration method. In this case, the results have low accuracy, which could be improved by segmentation. Furthermore, even a rough estimation of the displacement during a short gesture can be very useful in interactive contexts. In the studied gestures, it was not possible to apply the assumption made by ZUPT (the existence of points with zero velocity and acceleration), due to the oscillatory characteristics of the strumming technique. The threshold values used in this study should not be taken as references but as starting points, and may vary for different musicians and gestures. For musical interaction purposes, one of the most important issues in parameterization is to offer quick and straightforward ways to test, verify, store, and retrieve different configurations or presets.
In general, the data generated by the IMU showed a good correlation with the data from Qualisys, given the differences between the two systems: kinematics versus dynamics, price, and complex versus wearable configuration. Based on this, we believe that this sensor can offer refined control in interactive musical performances. A strong positive point for performances is its wearability: it is wireless, small and light, and easy to attach to different parts of the body. A significant drawback is the jitter in the BLE transmission and its restricted bandwidth, which limits the number of sensors that can be used. Another critical issue is the response delay, which can impair its use in DMIs: an average delay of 24 ms would not be acceptable in situations that require clear attacks.
In conclusion, we consider the main contributions of this study to be the implementation of BLE streaming with a fixed sample rate, the estimation of the sensor response delay, an overview of the accuracy of the data produced by hand gestures in musical situations, and the development or adaptation of tools that can be used in interactive setups, mainly with augmented musical instruments.