4.1. Discussion of Results
This work reviews key features and functionalities demanded by MST users in the WaSH sector, and describes the development and application of a systematic evaluation framework for assessing the relative performance and suitability of seven MSTs for mobile data collection in WaSH. The framework is based on international standards for the evaluation of software quality, adapted to the specific case of MST assessment using information gathered from the MST literature and an online user survey, and was further refined through its application to seven MSTs currently used in the WaSH sector. The authors believe this to be the first rigorously designed and tested framework for evaluating MSTs for field-level data collection.
A review of user preferences revealed a strong focus on cost, speed, ease of use, and short learning curves. Respondents were often unable to distinguish between features of the MST software and those of the hardware and mobile network they were using, but required all of these systems to work together to facilitate rapid and easy data collection, with minimal training requirements and minimal risk of lost data. Users primarily reported using MSTs for waterpoint mapping, suggesting that integration of mapping with survey-based data collection may still be in its infancy among WaSH MST users. Overall, most MST users reported very high satisfaction with their current MST tool, regardless of its features. This suggests that the majority of WaSH MSTs in use are functionally adequate for many users’ current applications, but also that “word-of-mouth” may not be a useful method for selecting the best MST options available, since willingness to recommend an MST did not appear to be particularly sensitive to its features within our survey sample. These results may also suggest that most users find MSTs to be far superior to the paper-based data collection tools they may have used previously, regardless of the features and characteristics of the MST they use now. We may therefore speculate that while MSTs differ from each other with respect to features and performance, these differences may be small in comparison to the substantive advantages of MSTs in general over pen-and-paper tools.
One of the most important criteria identified by survey respondents for selecting MSTs, and one of the criteria for which the most variability was observed in this work, was cost. When cost was removed from the overall composite score, the aggregate performance of the five “paid” MSTs was not significantly better (at the 95% confidence level) than that of the two free MSTs tested. This suggests that, at the time of this study, differences in performance between paid and free MSTs may be small compared to differences in cost. Furthermore, performance did not vary monotonically with cost, suggesting that costlier MSTs did not always deliver greater value. Most of the MSTs studied offered broadly comparable levels of data security and instructional materials, irrespective of cost. The quality of technical support available for free vs. paid MSTs was not directly assessed in this study; however, it should be noted that ODK (one of the two free MSTs) offers support primarily through an online user community, while other MST developers (paid and free) offer support directly, in addition to any user communities they may have.
The systematic evaluation of the seven selected MSTs using the newly created evaluation framework also indicated that while most of these MSTs were adequate for the creation and completion of the standard test questionnaire (STQ) used as a case study, some tools lacked key question types (i.e., barcode and video) and skip logic features (i.e., multiple dependencies and dependencies on numeric data) needed to construct and complete the full STQ. Substantive differences also emerged with respect to the performance, ease of use, cost, speed of use, and learning curves of the various options. A combined ranking of evaluation results across these categories revealed that one of the MSTs tested performed far better than the others according to the current evaluation framework.
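For illustration, the two skip logic cases named above (multiple dependencies and dependencies on numeric data) can be expressed as simple predicates over previously collected answers, as in the minimal sketch below. The field names and thresholds are hypothetical and do not correspond to specific STQ questions.

```python
# Illustrative sketch only: skip-logic cases of the kind exercised by the STQ,
# expressed as predicates over previously collected answers.
# Field names and thresholds are hypothetical, not taken from the actual STQ.

def show_breakdown_question(answers: dict) -> bool:
    """Single dependency on a categorical answer."""
    return answers.get("functional") == "no"

def show_low_usage_question(answers: dict) -> bool:
    """Multiple dependencies, including one on numeric data:
    ask only if the waterpoint is functional AND serves fewer than 10 households."""
    return answers.get("functional") == "yes" and answers.get("households", 0) < 10

if __name__ == "__main__":
    print(show_low_usage_question({"functional": "yes", "households": 4}))   # True
    print(show_low_usage_question({"functional": "yes", "households": 25}))  # False
```

An MST lacking support for such compound or numeric conditions cannot express these rules directly, which is why their absence required work-arounds or restructured questions during testing.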
The evaluation framework indicated important differences between the MSTs tested, even though all seven appeared to be broadly suitable for most field-level monitoring applications, since major defects were not detected in any case. Moreover, in most cases the missing features and functionalities needed to construct and complete the standard test survey could be addressed in the field using simple work-arounds, such as manual entry of ID numbers from barcode tags, use of a separate barcode scanner app for collecting barcode data, or substitution of still images for video records when documenting routine field observations. Likewise, where skip logic features were missing, survey questions could likely be restructured to work around the gaps in the affected MSTs. The application of the evaluation framework therefore appeared to be successful in revealing the relative strengths of particularly well-suited MSTs from a field of acceptable options, something that user opinions and recommendations appeared unable to do (as users’ subjective levels of satisfaction and willingness to recommend MSTs varied little across technology options). Systematic evaluation frameworks such as the one developed in this work may thus add substantial value in helping implementers select the best available option for their application from among a field of acceptable choices, and may also assist MST developers in identifying key areas on which to focus when improving the next versions of their products.
4.2. Limitations of This Study
Several limitations of this study deserve mention. The MST user survey, which informed the development of the evaluation framework, was drawn from a small (n = 31) convenience sample consisting primarily of members of a WaSH MST user group who provided relevant responses to an online survey. While the diverse institutional backgrounds of respondents and the number of countries represented suggest a plurality of experiences and responses, this sample is by no means representative of all current or potential MST users. Furthermore, few government agencies were included in the sample; this may be indicative of slower adoption of MSTs by government, lower rates of representation on RWSN’s D-group listserve, differences in Internet access or willingness to complete online surveys, or any number of other factors. Thus, both selection bias and response bias may have been introduced by the convenience sampling approach. Without data on non-respondents within the D-group listserve, as well as on MST users who are not members of the group, it is not possible to assess the extent or nature of these potential biases. The response rate of 4% is also relatively low for online surveys; while it is not possible to determine why this response rate was not higher, contributing factors could include a substantial proportion of inactive members and email addresses on the D-group listserve, language barriers among international members, a large proportion of MST non-users among listserve members (who may have been more reticent to respond), limited Internet connectivity, limited interest in the survey, or any number of other factors.
Another potential source of bias is the possibility that some respondents may have had relationships or affiliations with the developers of specific MSTs, and conflicts of interest related to such relationships cannot be ruled out. Thus, the results obtained from this survey should be considered illustrative, but by no means representative, of the attitudes and preferences of MST users beyond the sample of survey respondents. To the extent that the survey results informed the development of the evaluation framework, these caveats should be kept in mind.
It should also be noted that the results of the evaluation framework are highly sensitive to the STQ used; specifically, the format and content of the questions included in the test questionnaire should be determined with the intended application in mind. Test surveys should be designed to encompass one example of each type of question and/or data type that may be used in the intended monitoring and evaluation applications, and to include several types of skip logic cases that may occur in typical field data collection instruments. The more closely the STQ can be customized to the intended application, the more representative and useful the evaluation results are likely to be. Furthermore, the STQ should be designed with as little prior knowledge as possible of the specific features of the individual MSTs to be tested, to avoid unintentionally introducing bias towards one tool or another. For this study, we attempted to use actual survey questions of the type commonly used in WaSH monitoring and evaluation work, with sufficient diversity in the types of data collected and the structuring of survey questions to highlight the strengths and weaknesses of the different MSTs tested. However, the researchers in this study had some familiarity with the features of several of the candidate MSTs, and thus the possibility that this knowledge introduced unintentional bias cannot be ruled out.
Furthermore, where MSTs crashed or performed incorrectly during the creation and completion of the STQ by testers, the extent to which these errors are attributable to the MST, the test device, the user, the network, or interactions among these four elements cannot be determined; thus, these results reflect the performance of the MSTs as deployed with the specific testers, devices, and networks used.
It should also be noted that the scoring rubric used in this work (equal-weighted rankings across multiple performance categories), while effective in differentiating among the MSTs tested, was very simple, and may not fully reflect the relative priorities that different end users place on different aspects and features of WaSH MSTs. For example, the speed and ease of data collection were weighted equally to the speed and ease of form construction in the current study, even though for many applications the former (which must be done thousands of times by field workers) may prove far more important than the latter (which may only be done a small number of times by IT professionals). Likewise, cost was weighted equally to performance metrics, even though for some small programs cost may be of paramount concern, while for large institutions it may be insignificant relative to performance considerations. Thus, more sophisticated and customizable scoring rubrics may improve the sensitivity and specificity of this evaluation method for different MST end users and applications. A sample worksheet providing a weighted ranking of the MSTs evaluated in this work is provided for illustrative purposes (Worksheet S1).
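To make the distinction between equal and customized weighting concrete, one generic weighted formulation of a composite ranking is sketched below; this is an illustrative formulation only and is not necessarily the exact calculation implemented in Worksheet S1.

```latex
% Illustrative sketch only: a generic weighted composite ranking,
% not necessarily the exact calculation used in Worksheet S1.
% r_{ij} = rank of MST j in performance category i (1 = best rank),
% k = number of categories, w_i = weight assigned to category i by the end user.
\[
  S_j \;=\; \sum_{i=1}^{k} w_i \, r_{ij},
  \qquad \sum_{i=1}^{k} w_i = 1, \quad w_i \ge 0 .
\]
% Equal weighting, as used in this study, corresponds to w_i = 1/k for all i;
% with ranks coded so that 1 is best, a lower composite score S_j indicates a
% better overall ranking.
```

Under such a formulation, an implementer who prioritizes data collection speed could, for example, assign that category a larger weight while down-weighting form construction or cost.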
Furthermore, it is useful to note that the purpose of this work is primarily to develop and test an MST evaluation method, rather than to rank existing WaSH MSTs. Thus, the sample of MSTs used in this work was neither an exhaustive nor a representative sample, and was selected for illustrative purposes. It would be a mistake to infer that the best-performing MST tool in this study was necessarily the best such tool available at the time the work was performed, or to generalize these results outside of the specific tools and versions tested in the specific time period during which this work was conducted.
Moreover, the testers used in this work were students and staff at UNC. While they attempted to replicate realistic field conditions and to apply the evaluation method with the mindset of field-level WaSH program staff in developing-country settings, it is not realistic to assume that results obtained using this framework for simulated monitoring with the STQ in the US context will be exactly representative of results achieved in the field, where MSTs are used by staff with diverse educational and technical backgrounds, in different geographic settings, and with different questionnaires, hardware, and network service conditions. Thus, the framework is meant to yield results that are indicative, but not necessarily representative, of the typical in-field performance that might be expected of MSTs evaluated for a given application. While the proposed evaluation framework is, to our knowledge, the most rigorous and systematic tool available for assessing MSTs for use in field-level monitoring and evaluation, and may be useful in establishing the overall strengths and weaknesses of different MSTs, implementers are advised to pilot candidate MST(s) adequately under actual field conditions as part of any MST selection, training, or implementation activity.
Finally, it is worth noting that many of the MSTs studied have released updates and/or new versions since the testing activities were conducted; thus, some results and information related to these MSTs may already be obsolete. However, the evaluation framework validated in this work, and the performance priorities highlighted by the associated user surveys, are likely to remain useful even as the MSTs tested here continue to evolve and mature.