3.1. Survey 1
As discussed in [
5], the survey was conducted online on a sample of industries and sectors including, among others, finance and insurance, automotive, information technology (IT), healthcare, research and education (see further details in
Table 1). The different groups were selected for their use of big data for creating and capturing not only economic value, but also social value in terms of public safety and security for the final users or customers [
5]. The survey was carried out from January to March 2017 through multiple channels: an online questionnaire form (
https://www.aegis-bigdata.eu/what-is-the-current-and-expected-use-of-big-data-technologies-a-glimpse-to-our-aegis-questionnaire-results/), face-to-face interviews, and telephone interviews. For the interviews, the sample comprised companies from the IT industry (for the big data technological infrastructure [
32,
33]) and finance (for the degree of information content of their product and information intensity of the value chain [
34]), the latter being an example of a data-driven industry [
35]. Ultimately, we received 77 replies to the questionnaire out of 110 invitations (a response rate of 70%). In what follows we discuss the main results from the survey and the interviews.
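The response rate reported above can be reproduced with a short calculation. The following is a minimal sketch; the function name `response_rate` is an illustrative assumption and is not part of the survey tooling.

```python
def response_rate(replies: int, invitations: int) -> float:
    """Return the response rate as a percentage of invitations sent."""
    if invitations <= 0:
        raise ValueError("invitations must be a positive integer")
    return 100.0 * replies / invitations


# Figures reported for the first survey: 77 replies out of 110 invitations.
rate = response_rate(77, 110)
print(f"Response rate: {rate:.0f}%")
```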
Most of the respondents came from IT or related industries, as shown by
Table 1. As to the geographical distribution, as shown by
Table 2, all the countries of the partners of the project were covered, with additional replies from Portugal, France, Belgium, Bulgaria, Luxembourg, the Netherlands, the United Kingdom (UK), and Spain and, outside Europe, Mexico, Argentina, and the United States (US). Furthermore, respondents were distributed between small and medium-sized enterprises (~75%) and large entities with more than 1000 employees (~25%) [
5]. Generally, while 55.3% of respondents already had a strategy for using big data and analytics, only 34.2% were effectively using them, 35.5% were starting their use, 13.2% were in a planning phase, and 17.1% had no experience [
5]. Concerning the data sources, the most cited were logs (~45%), transactions (~32%), events (40%), sensors (32%), and open data (30%). It is worth noting that open data had the highest rate of willingness for exploitation in the next five years, together with social media and free-form text [
5]. Moreover, the sample showed little interest in data coming from phone usage, reports to authorities, radio-frequency identification (RFID) scans or point of sale (POS), and geospatial data. In general, although ~72.6% of data sources were multilingual, only ~50% of the sample declared that they had the tools needed to handle different languages. As for data sources considered relevant although not yet fully exploited, the main obstacles to their use were related to security, privacy and legal issues, availability and discoverability of data, lack of a common data model, and lack of the necessary skills or strategy within the organization. Indeed, the largest group of respondents (40%) stated that less than 10% of the data collected is further processed for activities connected to value creation and capture, although they also foresaw an increase in the next five years [
5].
Thus, the limited exploitation of big data seems associated with a low degree of information capacity of the considered organizations [
19] and a gap in analytics capabilities [
36,
37]. These weaknesses could also be reflected by the fact that more than 60% of respondents performed both data collection and data analytics in-house, while only a few outsourced them. In general, it seems that those organizations require an IT transformation rather than a simple reconfiguration or renewal of the IT portfolio [
5]. As to this issue, the main technologies in use for big data analytics among the respondents were Apache Hadoop (21%) and Microsoft Power BI (17%). Finally, only 36.5% of respondents declared that they share data with other subjects [
5].
3.2. Survey 2
In this section, we discuss the results of the second survey carried out during the AEGIS project to collect further requirements from the potential stakeholders [
6]. The questionnaire was submitted between February and March 2018 to all the different sample groups of the first survey, although targeting some specific roles for the participants in their organizations. According to [
6], the roles identified were:
Manager: “a person responsible for controlling or administering an organization or group of staff, he/she has a high-level point of view about big data analytics but is the person that could benefit from them. He/she has a focus on business intelligence” ([6], p. 18).
IT Technical Operator: “a person responsible for the management of the data storage, curation and collection, he/she knows which could be the critical points of these tasks” ([6], p. 18).
Data Scientist: a “person that extracts information from data, using big data analytic tools, for instance following the instructions of the manager. He/she has the proper skills for data analysis and could identify the deficiencies of the existent tools” ([6], p. 18).
Table 3 shows for each role the main topics/features investigated through the survey.
An online version of the questionnaire (powered by Easy Feedback) was provided and is still available at the following link:
https://indivsurvey.com/aegis/117873/8il3tU (accessed on 17 September 2019). Moreover, each AEGIS partner sent direct email invitations to people in their personal networks or in LinkedIn and Facebook groups, and we eventually received 37 valuable replies to the questionnaire out of 56 submissions (see
Table 4 for the type of organizations of the respondents).
As shown by
Table 5 it is worth noting that 14 out of 37 respondents were from the Information Technology (IT) industry or related sectors (such as “Information Management”, “Statistics and Information Systems” or, generally, information and communication technology—“ICT”). As for the size of businesses, as shown by
Table 4, large enterprises and small, medium, and micro enterprises were almost equally represented, although the latter have to be considered as a single cluster for this survey; taken separately, each of those categories would average about four enterprises. As shown by
Table 6, considering the country of origin of the organizations of the respondents, the majority of the replies came from Austria (~31%), Greece (~17%), and Italy (~17%), followed by Spain (~13%) (it is worth noting that ~21% of respondents did not mention the country of their organization).
Considering now the use of big data (see
Table 7), 60% of the respondents from the organizations that participated in the survey declared that they were effectively using big data. Here we define ‘big data effectiveness’ as “the capacity to elaborate big data and use them to create value” on the basis of the discussion of the results of the first survey [
5] (see also the concluding remarks in this paper).
As said, the survey included the three different types of participants presented above and shown with their related features in
Table 3—thus, the answers came from managers (49%), data scientists (35%) and IT technical operators (16%).
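The role shares above follow from the reply counts given in this section: eighteen managers and thirteen data scientists out of 37 replies. The sketch below recomputes them under one assumption, namely that the number of IT technical operators is the remainder (37 - 18 - 13 = 6), a figure that is inferred here rather than stated in the text.

```python
TOTAL_REPLIES = 37

# Counts for managers and data scientists are reported in the text;
# the IT technical operator count is inferred as the remainder (assumption).
counts = {
    "Manager": 18,
    "Data Scientist": 13,
    "IT Technical Operator": TOTAL_REPLIES - 18 - 13,
}


def role_shares(role_counts: dict[str, int]) -> dict[str, int]:
    """Return each role's share of all replies, rounded to a whole percent."""
    total = sum(role_counts.values())
    return {role: round(100 * n / total) for role, n in role_counts.items()}


for role, share in role_shares(counts).items():
    print(f"{role}: {share}%")
```

Under this assumption the rounded shares match the percentages quoted for the three roles.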
Considering managers, there were eighteen replies (see
Table 4); four of them came from small, medium, and micro organizations planning to use big data, although without a designated team (internal or external) to perform data analysis. Moreover, the organizations that declared themselves beginners in the use of big data preferred to perform analysis through external consultants or, in some cases, an internal team, but not as a main activity. Instead, the organizations that were effectively using big data (the majority in the IT and automotive sectors) had an internal team of data scientists. Also, it is important to point out that 18% of the participants declared that, even though they had a dedicated budget for big data and analytics, the investment was not adequate. Accordingly, only 2% of participants declared that they had the proper hardware to manage big data [
6].
Moving now to the added value of big data and analytics,
Table 8 reports the perspective of managers and data scientists, while
Table 9 shows the main issues that the managers pointed out as key to the use of big data. Here, it is worth noting that contrary to the first survey the heterogeneity of available data was not considered an issue by managers and IT technical operators; the reason for this result could be related to the different degree of experience with big data exploitation required of the participants of the second survey.
Considering specifically the AEGIS big data value chain, one of the questions concerned which of its steps were actually implemented in the respondents’ organization: 62% of the respondents declared that they carried out “Data Analysis”, 48% “Data Acquisition”, 43% “Data Storage”, 37% were actually using the results of the analysis (“Data Usage”), and finally 18% performed “Data Curation”. It is worth noting that the respondents that declared that they were ‘effectively using big data’ were also already implementing all the steps of the big data value chain identified in the AEGIS project.
As for the origin of the data involved in the analysis by the various organizations in their activities,
Table 10 shows the main sources as identified by the managers, data scientists, and IT technical operators that replied to the survey (question: ‘Which are the data involved in the analysis of your organization?’). As to this issue, the respondents that used external or purchased data also declared that they had an agreement with the providers, which included a reference to further processing of previously collected personal data. Furthermore, all the participants agreed on the importance of linking datasets from different domains/data sources for their analyses, although real-time data were used by a limited number of participants (~15% for the three roles considered in the survey). Also, only 10% of the overall respondents declared that they used alerts, warnings or monitoring systems based on big data analytics as a support after an event, while 10% did not know, and 80% did not use that kind of automated feedback. Finally, as shown in
Table 11, the sharing of data and the related analyses was mainly with the customers and with colleagues of the same team.
Focusing now on the data scientists that participated in the survey, six respondents out of thirteen declared that there were different restrictions about data visibility in their organization, while only two declared that there were no restrictions about data visibility (four respondents replied that they did not know about it). Then, considering again the results shown by
Table 10, the data scientists pointed out that the data processed came mainly from external sources related to customers (53%) or from internal data related to customers (e.g., contracts) (46%). The other categories (open data, 38%; data internal to the organization, 30%; real-time data, 15%; and purchased data, 7%) scored considerably lower percentages. Also, all of the participants of the survey stated that they acquired data only when needed and through scheduled streaming. Furthermore, the types of data mainly used for the analysis were logs and sensor data, while, contrary to the first survey, the data types not yet exploited but that the participants would have liked to use were: geospatial data, phone usage, email, transactions, social media, audio, radio-frequency identification (RFID) scans or point of sale (POS) data, and earth observation. Considering now the tools used, 46% of data scientists answered that they had proper analytic tools for their needs, the most popular tools being R, Matlab, and Python, while other tools mentioned were Pandas, Microsoft Excel, Spark and SAS Base. The algorithms actually adopted by the respondents or that they would have liked to adopt for the analysis are reported in
Table 12, while the main output format identified for the results of the analyses was the tabular one (69%).
Taking the above issues into account, only three out of thirteen data scientists declared that their organization had scheduled automated analysis of data, while six declared that their organization did not perform scheduled automated analysis of data, and four did not know about it. The last question for each participant aimed to understand the main features/functionalities for a potential big data and analytics platform. To this end, a set of features/functionalities was listed (see
Table 13 and
Table 14) and the respondents could assign to each of them a level of interest ranging from ‘Not at all’ (“0”) to ‘Very’ (“3”).
As shown in
Table 13, the two features/functionalities with the highest median for the three categories of respondents are the ones related to metadata management, queries, and visualizations (cells in grey in
Table 13). Here, it is worth noting the interest of managers in metadata, usually a more technical subject than a business-oriented one; one reason for this result could be the already mentioned significant presence in the survey population of companies from the IT industry. Also, it is worth noting that the median value for managers’ preference for the feature/functionality “where you can manage the metadata related to your data” is 2.5, thus closer to 3 (“Very”) than the median values for all the other features/functionalities, which are in a range between 1 (“Slightly”) and 2 (“Moderately”).
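A median of 2.5 on an integer scale from 0 to 3 arises when the number of ratings is even and the two middle ratings are 2 and 3. The sketch below illustrates this with the standard-library `statistics.median`; the individual ratings are hypothetical, chosen only to illustrate the mechanism for a group of eighteen managers, and are not the survey data.

```python
from statistics import median

# Interest scale used in the questionnaire.
SCALE = {0: "Not at all", 1: "Slightly", 2: "Moderately", 3: "Very"}

# Hypothetical ratings from 18 managers (illustrative only, not survey data).
manager_ratings = [0, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3]


def median_interest(ratings: list[int]) -> float:
    """Return the median rating after checking every value is on the scale."""
    if any(r not in SCALE for r in ratings):
        raise ValueError("ratings must be integers between 0 and 3")
    return median(ratings)


# With an even number of ratings, the median is the mean of the two middle
# values of the sorted list (here the 9th and 10th ratings, 2 and 3).
print(median_interest(manager_ratings))
```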
Those results are reflected also by the values shown in
Table 14, where metadata management, queries, and visualizations received the highest percentage in the survey for features and functionalities that were considered as “very” interesting (grey cells in
Table 14), especially by managers and IT technical operators. However, it is also worth remarking that ‘being online and free’ is one of the key features that most interested data scientists in a platform for big data and analytics. This could be potentially interesting for further research, considering that, according to the results in
Table 10, data scientists were less oriented or more cautious with regard to openness in data sharing; thus, they seem to be more oriented to openness when they need to access tools than when they have to share results with subjects external to their organization. Looking now at the features that were evaluated as not interesting at all (grey cells in
Table 14 for “not at all” answers), a high percentage of managers and IT technical operators did not consider it worthwhile to have features for buying and selling assets, or for a prospective platform to provide a set of open assets. By contrast, data scientists were not interested in connecting in-house streaming datasets and being able to store analyses and data assets.