1. Introduction
In collaborative networks, the partners work together to create competitive advantages by defining the activities to be carried out by each organization, the business processes to be executed, the roles to be played, the communication channels, and the interoperability at both the process and system levels in order to achieve common business goals [1,2,3]; e.g., a supply chain process may involve several organizations [4]. Collaborative networks foster joint problem-solving through resource sharing and the fusion of complementary skills. This collaborative environment enhances organizations’ potential to create and acquire knowledge, leading to the innovation of products or services [5]. In this context of collaborative innovation, members of the supply chain plan and implement actions for knowledge sharing and knowledge application to develop new products and services quickly and efficiently, enabling them to maintain and improve their performance in the long term [6]. Furthermore, in Industry 4.0, end-to-end digital integration is required in the supply chain, with a business process design logic that crosses organizational boundaries. These business processes can be defined using the Business Process Model and Notation (BPMN) language [7,8], a standard for graphically representing the logic of the business process and its subsequent automation [9], which not only makes the logic more understandable but also makes it easier to integrate the control flow, the subprocesses, the data flows (internal or external), and the resources involved in the processes into a single BPMN diagram [10].
Data-driven approaches are characterized by decision-making based on the analysis and interpretation of data, rather than on observation alone, allowing decisions and solutions to be supported by facts [11,12]. Process mining techniques are implemented through data-driven approaches, making it possible to discover process patterns within event logs and to detect and diagnose differences between observed and modeled behavior [10,13], which helps in decision-making to improve and optimize business processes [14,15]. In this way, approaches based on process mining techniques have been implemented to verify and enhance business processes. These techniques support discovery, conformance checking, enhancement, and predictive analytics tasks [16,17,18]. In process discovery, the event data generated by the execution of business processes, known as event logs, are analyzed to identify the logic and behavior of the process. Process conformance checking consists of evaluating the alignment between the behavior of the actual business process model and the behavior discovered in the event log (generated by the business process itself) to detect possible deviations. In the process improvement task, various analyses are carried out, considering all the attributes available in the event log and in the real business process model, to detect possible bottlenecks, high time consumption in task execution, deviations, and duplicated execution of tasks by different resources, among others, making it possible to identify opportunities for improvement in the business process.
In approaches based on process mining techniques that implement tasks such as predictive process monitoring or trace clustering, an event log preprocessing stage is included, in which the input data must be encoded to feed the prediction or inference algorithm. At this stage, an encoding method is typically implemented to transform complex event data into a numerical or representative feature space [19]. One of the most important methods for this purpose is Doc2Vec, which is based on representation learning and was developed in natural language processing (NLP) [20]. Representation learning uses neural network models to automatically learn high-quality distributed vector representations of a concept of interest (for example, an activity or a trace). Doc2Vec extends word embedding models to compute continuous vector representations of variable-length texts, such as sentences or documents, from large, high-dimensional datasets. In the process mining domain, several approaches based on representation learning techniques have been presented to significantly improve the performance of the inference algorithm. In [21], the authors proposed representation models at the level of activities, traces, logs, and models to deal with the high dimensionality of real-life event logs and to generate a distributed representation that can be used in different process mining tasks. In [22], the authors presented a case-level solution that uses word embeddings for business process data to better encode process instances. For their part, the authors of [23] presented an approach for conformance verification based on vector representations of each activity/task present in the model and the event log. Therefore, the vectors generated by Doc2Vec can be used to find similarities between traces, allowing for the quick analysis of large event logs by expressing words in the vector-space model and considering the context when learning through the co-occurrence of activities.
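To make the idea of trace similarity in a vector space concrete, the following minimal sketch treats each trace as a document of activity words and compares traces with the cosine similarity. It is only an illustration: a simple count-based vector is used as a stand-in for a learned Doc2Vec embedding (unlike Doc2Vec, it ignores context and word order), and the activity names and the function names `trace_to_document`, `bag_of_words`, and `cosine_similarity` are hypothetical.

```python
import math
from collections import Counter


def trace_to_document(trace):
    """Treat a trace (ordered list of activity names) as a document of words."""
    return [activity.lower().replace(" ", "_") for activity in trace]


def bag_of_words(document):
    """Count-based sparse vector; a stand-in for a learned Doc2Vec embedding."""
    return Counter(document)


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dict-like counters."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical traces: two from similar ordering cases, one unrelated.
trace_a = ["Create order", "Send order", "Receive invoice"]
trace_b = ["Create order", "Send order", "Cancel order"]
trace_c = ["Register patient", "Request exam"]

vec_a, vec_b, vec_c = (
    bag_of_words(trace_to_document(t)) for t in (trace_a, trace_b, trace_c)
)
sim_ab = cosine_similarity(vec_a, vec_b)  # traces sharing two of three activities
sim_ac = cosine_similarity(vec_a, vec_c)  # traces sharing no activity
```

In this toy setting, traces that share activities obtain a higher similarity than traces that share none. A learned embedding would additionally capture the co-occurrence context of activities, which a count-based vector cannot.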
Recent studies have proposed solutions for different process mining tasks applied in intra-organizational business processes [24,25,26,27]. However, when process mining solutions are implemented in inter-organizational business processes (IOBP), aspects such as the privacy and autonomy of the processes; data with different levels of granularity; and event data stored in different sources, formats, and distribution forms must be considered. Therefore, managing independently generated event logs requires methodologies and algorithms to process, align, and merge the event logs generated by process-oriented information systems [28]. Importantly, events need to be correlated across organizational boundaries. Then, by implementing process mining techniques, the tasks of discovery, monitoring, compliance, and improvement of IOBPs, which have yet to be studied, can be carried out. Furthermore, the analysis can be extended to discover and verify the process choreography, which represents the formalization of the message-based interactions between the participants in an inter-organizational collaboration [29].
In this sense, automatic analysis of the historical information recorded from the execution of the business processes of the participating organizations can help to find relationships within the IOBP. This can be achieved through analysis at the level of structured data and at the level of process models. At the structured data level, the organizations participating in the IOBP are responsible for selecting and structuring the data from their information systems and, consequently, for choosing the appropriate level of abstraction and the point of view of the data. At the level of process models, the business process flow of the participating organizations is analyzed in search of patterns that can complement the analysis of structured data, to obtain sufficient information for identifying collaboration patterns between organizations and discovering the IOBP model. Different approaches in the state of the art partially address the analysis and discovery of process choreography, focusing on the information contained in event logs [30,31,32], in business process models [33,34,35,36,37], and in electronic documents and information related to the business process [38].
Therefore, this paper proposes a data-driven methodology supported by semi-automatic methods that enable the discovery of the IOBP model and the process choreography in a collaborative environment. The relationships between the organizations participating in the business process are identified, labeled, and defined using a method based on the Doc2Vec algorithm and on the cosine similarity measure between events, which identifies possible message-type tasks and their task subtype (send/receive). To this end, a set of definitions is specified to formalize the relationships, together with a group of rules for assigning the message task subtype. These criteria are formulated in terms of relationships at the trace level and the event level. Next, each collaboration participant’s intra-organizational business process model is determined, marking in the model the tasks previously identified as message-type tasks and their subtype, and defining the relationship between the processes through message flow connectors, which allows building an IOBP, including its process choreography. Subsequently, an inter-organizational event log is generated from the intra-organizational event logs by merging the traces related through the identified message-type tasks and their subtype, so that each merged trace contains the event data of both traces. Finally, the process choreography and the intra- and inter-organizational models are evaluated using the metrics of precision, recall, F-score, and generalization. The proposed approach was evaluated using four event logs derived from real-life IOBPs and two artificial event logs. In the discovery of the process choreography, the approach achieved an overall relationship precision of 0.86, a relationship recall of 0.89, and a relationship F-score of 0.86, with a performance above 89% in identifying message-type tasks. On the other hand, for the average evaluation of the quality level of the IOBP discovered, a precision of 0.94 was achieved, with a recall value of 0.99 and a generalization indicator of 0.63, which indicates that the model of the inter-organizational process discovered could reflect more than 94% of the behavior contained in the merged event log.
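As an illustration of how relationship-level metrics of this kind can be computed, the sketch below compares a set of discovered message relationships against a reference choreography. It assumes the standard set-based definitions of precision, recall, and F-score; the exact metric definitions used in the evaluation are not given in this section, the generalization metric is not covered, and the relationship pairs and the function name `relationship_scores` are hypothetical.

```python
def relationship_scores(discovered, reference):
    """Set-based precision, recall, and F-score over (send, receive) pairs.

    Assumption: a discovered relationship counts as correct only if the exact
    pair also appears in the reference choreography.
    """
    discovered, reference = set(discovered), set(reference)
    true_positives = len(discovered & reference)
    precision = true_positives / len(discovered) if discovered else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical reference choreography and discovery result.
reference = {
    ("Send order", "Receive order"),
    ("Send invoice", "Receive invoice"),
    ("Send goods", "Receive goods"),
}
discovered = {
    ("Send order", "Receive order"),
    ("Send invoice", "Receive invoice"),
    ("Send order", "Receive goods"),  # not part of the reference
}

precision, recall, f_score = relationship_scores(discovered, reference)
```

Here two of the three discovered relationships are correct and two of the three reference relationships are recovered, so precision, recall, and F-score all equal 2/3.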
2. Related Work
In [39], a technique to discover collaboration models from intra-organizational event logs was proposed. The structure of the event log was extended to support interaction data between participants by adding attributes for the message name, message identifier, participant role, and type of communication between participants. Interactions between participants are identified through an analysis of the event data in the event log, determining the correlation between the messages exchanged. Subsequently, the intra-organizational models discovered are combined with the information from the message interactions, which enables the generation of an inter-organizational business process model aligned with the BPMN language. Intra-organizational business process models are discovered for each participant in the collaboration by applying algorithms available in the state of the art. Similarly, the authors of [40] presented a process mining approach to discover inter-organizational business processes and process choreography from an extended event log. This log requires information about the participants and the messages exchanged between them, to discover a model of the inter-organizational process represented in the BPMN language. The extended event log includes the information required for the inter-organizational process model and the process choreography. For example, the participant attribute identifies the participant that executes the event, and another attribute contains the type of event; in the case of message-type events, the information of the participant who receives the message is required. A fundamental stage in this proposal is extracting all message-type events and the information related to the message: the participant who sends or receives it. With this additional event log, a model of the process with the message interactions between the participants involved in the collaboration is discovered. This model is used to build the process choreography and the inter-organizational model in conjunction with the intra-organizational process models discovered for each participant. Our proposal takes a different approach from the studies mentioned above. It does not require extending the event log or adding information about the messages and resources exchanged between the collaboration participants. Instead, it relies on a set of methods and formal rules that allow the identification of potential message tasks and the determination of the task’s subtype, which in turn defines the message’s direction (send/receive).
On the other hand, the authors of [41] proposed a process mining technique to merge intra-organizational event logs and discover an inter-organizational process model represented by a directly-follows graph. This approach is characterized by using only the common elements of an event log: case ID, timestamp, and activity. Furthermore, it is based on the premise that two related activities of different organizations occur consecutively with a very short time difference, for which several time thresholds are defined. Therefore, adjacent activities with the minimum time difference should be interconnected and extracted, since they belong to the same trace within an inter-organizational event log, forming the sequence of activities of the merged event log, ordered by the timestamp value. Each extracted activity pair is identified in this log by concatenating the original case IDs. The remaining events of the same trace of each participant (those that were not extracted) are embedded in the trace according to the timestamp value, so that the trace is constructed with all its events. This procedure is executed until no adjacent activities are identified in the event logs of each collaboration participant. In our case, the relationship between message tasks is determined by a cosine similarity measure that indicates whether two tasks (from different participants) are close and possibly related. Furthermore, the task’s subtype is determined through a set of rules that analyze the context of the message-type task, that is, the antecedent and consequent tasks for both parties of the collaboration.
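The timestamp-based merging idea described for [41] can be sketched as follows. This is a simplified illustration written from the description above, not the authors' algorithm: it links two traces when any pair of events from different organizations falls within a single time threshold, orders the merged events by timestamp, and concatenates the case IDs. The event tuples, case IDs, threshold value, and the function name `merge_traces` are all hypothetical.

```python
from datetime import datetime, timedelta


def merge_traces(case_a, trace_a, case_b, trace_b, threshold):
    """Merge two traces of (timestamp, activity) events from two organizations.

    Returns None when no cross-organization pair of events lies within
    `threshold`; otherwise returns the concatenated case ID, the merged events
    ordered by timestamp, and the linked event pairs.
    """
    linked = [
        (event_a, event_b)
        for event_a in trace_a
        for event_b in trace_b
        if abs(event_a[0] - event_b[0]) <= threshold
    ]
    if not linked:
        return None
    merged_events = sorted(trace_a + trace_b, key=lambda event: event[0])
    return f"{case_a}_{case_b}", merged_events, linked


# Hypothetical traces of a buyer (A1) and a supplier (B7).
t0 = datetime(2024, 1, 1, 9, 0, 0)
trace_a = [
    (t0, "Create order"),
    (t0 + timedelta(minutes=5), "Send order"),
]
trace_b = [
    (t0 + timedelta(minutes=5, seconds=2), "Receive order"),
    (t0 + timedelta(minutes=20), "Ship goods"),
]

merged = merge_traces("A1", trace_a, "B7", trace_b, timedelta(seconds=5))
not_merged = merge_traces("A1", trace_a, "B7", trace_b, timedelta(seconds=1))
```

With a 5-second threshold, only "Send order" and "Receive order" are linked and the two traces are merged into one; with a 1-second threshold, no link is found. This also shows why such an approach is sensitive to the choice of time thresholds.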
In contrast, the authors of [42] presented an approach based on a Petri net extension that supports the management of message attributes and resources exchanged in workflows (called RM_WF_net) to formalize healthcare processes in hospitals, particularly inter-departmental processes. From the formalization, algorithms are applied to discover intra-departmental models and identify collaboration patterns in each intra-departmental model, with which a collaboration model is built. The first algorithm discovers a control-flow structure based on WF-net. Subsequently, the event log is processed to identify messages and resources, which generates an RM_WF_net for each department. On the other hand, the authors of [43] presented a process mining approach in an inter-organizational environment for a cloud computing multi-tenancy architecture through declarative models. Through a set of business rules, information related to the processes of systems that run in the cloud is extracted, and distributed data are identified, enabling the building of an event log. This approach makes it possible to represent processes with high variability. The previous proposals differ from our approach, since using Petri nets reduces the expressiveness of the discovered model notation and does not support high-level notations, in contrast to a BPMN-based model. In addition, there may be difficulties in representing complex behaviors in the process logic, for example, event-based gateways, which does not happen in BPMN-based models.
3. Preliminary Formalization
This section introduces the main foundations of the proposed approach, which formalizes the methodology phases and enables the identification of message-type tasks from direct tasks (previous or subsequent) or non-direct tasks. The above facilitates the marking of message tasks by their subtype (send/receive), making it possible to merge event logs and discover the correlation of messages exchanged in a collaboration.
Definition 1 (Mapping an event to a sentence). This refers to a sentence of words that represents each event. The sentence is generated from the values of the attributes that make up the event, which serve as the sentence’s words.
Definition 2 (Mapping a trace to a document). This refers to the set of sentences representing each case in the event log L. This document is generated from the values of the attributes of each activity in the case, which serve as the document’s words (see Definition 1).
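Definitions 1 and 2 can be illustrated with the following minimal sketch, which turns the attribute values of an event into a list of words and concatenates the sentences of a case into a document. The attribute names, the lowercase whitespace tokenization, and the function names `event_to_sentence` and `trace_to_document` are assumptions made for illustration; which attributes are selected is left to each organization.

```python
def event_to_sentence(event, attributes):
    """Definition 1 (sketch): the values of the selected attributes become words."""
    words = []
    for name in attributes:
        value = event.get(name)
        if value is not None:
            # Assumed tokenization: lowercase and split on whitespace.
            words.extend(str(value).lower().split())
    return words


def trace_to_document(trace, attributes):
    """Definition 2 (sketch): concatenate the sentences of the events of a case."""
    document = []
    for event in trace:
        document.extend(event_to_sentence(event, attributes))
    return document


# Hypothetical case with two events and three attributes per event.
trace = [
    {"activity": "Create order", "resource": "Clerk A", "org": "Buyer"},
    {"activity": "Send order", "resource": "Clerk A", "org": "Buyer"},
]
selected = ["activity", "resource"]

sentence = event_to_sentence(trace[0], selected)
document = trace_to_document(trace, selected)
```

The resulting word lists are the kind of input that an embedding model such as Doc2Vec expects, with each event acting as a sentence and each case as a document.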
Definition 3 (Incoming (•θ) and Outgoing edges (θ•) for the task θ). Given a BPMN model and a task θ, •θ denotes its incoming edges and θ• denotes its outgoing edges [24].
Definition 4 (Direct predecessors of task m). Given a BPMN model and a task m, its set of direct predecessors comprises all tasks p such that there is a direct path between p and m, and this path is contained in the set of incoming edges of m (see Definition 3).
Definition 5 (Direct successors of task m). Given a BPMN model and a task m, its set of direct successors comprises all tasks s such that there is a direct path between m and s, and this path is contained in the set of outgoing edges of m (see Definition 3).
Definition 6 (Non-direct predecessors of task m). Given a BPMN model and a task m, the set of its non-direct predecessors is the set of tasks such that, for each task i in the set, there are one or more paths between the tasks i and m that visit the event.
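Definitions 3 to 6 can be sketched on a process model stored as a list of directed edges. The sketch rests on loudly stated assumptions: the model is reduced to a plain directed graph, a direct path is read as a single edge between two tasks, and a non-direct predecessor is read as a task that reaches m only through intermediate non-task nodes (such as events or gateways). The node names and function names are hypothetical, and the reading of Definition 6 is one possible interpretation.

```python
def incoming_edges(edges, node):
    """Definition 3 (sketch): edges that end in `node`."""
    return [(src, dst) for src, dst in edges if dst == node]


def outgoing_edges(edges, node):
    """Definition 3 (sketch): edges that start in `node`."""
    return [(src, dst) for src, dst in edges if src == node]


def direct_predecessors(edges, tasks, m):
    """Definition 4 (sketch): tasks connected to m by one incoming edge."""
    return {src for src, _ in incoming_edges(edges, m) if src in tasks}


def direct_successors(edges, tasks, m):
    """Definition 5 (sketch): tasks connected from m by one outgoing edge."""
    return {dst for _, dst in outgoing_edges(edges, m) if dst in tasks}


def non_direct_predecessors(edges, tasks, m):
    """Definition 6 (assumed reading): tasks reaching m via non-task nodes only."""
    found, seen = set(), set()
    stack = [src for src, _ in incoming_edges(edges, m) if src not in tasks]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        for src, _ in incoming_edges(edges, node):
            if src in tasks:
                found.add(src)
            else:
                stack.append(src)
    return found


# Hypothetical model: A -> B -> C -> e1 -> D, where e1 is an intermediate event.
tasks = {"A", "B", "C", "D"}
edges = [("A", "B"), ("B", "C"), ("C", "e1"), ("e1", "D")]

preds_c = direct_predecessors(edges, tasks, "C")
succs_b = direct_successors(edges, tasks, "B")
non_direct_d = non_direct_predecessors(edges, tasks, "D")
```

In this toy model, B is the direct predecessor of C, C is the direct successor of B, and C is a non-direct predecessor of D because the path from C to D passes through the intermediate event e1.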
6. Discussion
The approaches for discovering IOBP models and process choreography presented in [39,40,41] exhibited a similar objective to our proposal. In [39], the authors described the discovery of an IOBP model and the interaction of messages between the participants using a healthcare scenario, as used in our experimentation. Their experiment obtained values of 0.4 and 1.00 for the fitness and precision metrics, respectively, utilizing an extended version of the event log to identify messages between participants. For their part, the authors of [40] reported the discovery of an IOBP model using the same Healthcare scenario and an extended event log to manage the message data, reporting independent diagrams for the IOBP model and the process choreography discovered. In our experimentation, the quality assessment of the IOBP model of the Healthcare scenario obtained a value of 1.00 for the precision and recall metrics. Furthermore, in the quality assessment of the discovery of the intra-organizational models that made up the IOBP model, values of 0.98 and 1.00 were achieved for precision and recall, respectively.
On the other hand, the authors of [41] obtained results between 0.94 and 1.00 for the precision metric and between 0.905 and 1.00 for the recall metric in their discovery of a collaborative model using a classic event log (BPIC 2012) in the process mining domain. In their approach, no additional information is required to determine which tasks can be correlated; instead, a technique of adjacent activities is applied, identifying the minimum execution time between the tasks to assess their link. We presented an experiment with the event log of the Air Quality System scenario, which had characteristics and complexity similar to the BPIC 2012 event log. The evaluation of the identification of message tasks achieved a value of 1 for the precision and recall metrics. In discovering the IOBP model, a precision of 1 and a recall of 0.99 were obtained. Our approach demonstrated a high performance on most event logs considered in the experimentation, without including additional information in the event log to identify message-type tasks.
In this way, the proposal to discover the choreography of a process in an inter-organizational collaboration environment is governed by a set of configurable methods. For example, two configurable variables allow filtering cases with similar information and selecting events that are potentially considered message-type tasks, respectively. The word embedding representation used to calculate the cosine similarity at the case and event level is highly effective. However, its ability to generate a robust model that allows the discovery of the choreography between the participants of an IOBP is limited by the quantity and quality of the information within the event logs. Furthermore, the patterns established in Table 1, as well as the models of intra-organizational processes discovered by the Split-miner algorithm, were fundamental elements for identifying the subtype of the message-type tasks, enabling the discovery of the choreography of the process.
According to the results obtained in the evaluation of the discovery of the choreography of the process (see Table 3 and Table 4), the following classification can be defined based on the characteristics identified in the experimentation:
Complete choreography. This refers to the fact that the proposed approach can find the complete process choreography in an inter-organizational environment, with the same number of relationships and the same relationships as in the reference choreography. In this case, the case-level and event-level representations of the event log lead to the discovery of only message-type events.
Under-complete choreography. This refers to the approach finding only a percentage of the relationships in the choreography of the process. This situation may arise because the information was insufficient for the representation obtained from the word embedding model to reach a high similarity, making it difficult to relate all message-type events.
Over-complete choreography. This refers to the fact that the method finds part of the process’s choreography but also recovers irrelevant relationships not found in the reference choreography. This behavior occurs because the information used to obtain the representation from the event logs is very general. As a result, more relationships than those that actually exist meet the condition that the calculated similarity value exceeds the threshold, and they are therefore recovered.
Partially correct choreography. This refers to the model identifying a percentage of the process choreography correctly, while also finding partially correct relationships; that is, in a relationship between two identified events, one of the two events is incorrect with respect to the relationship expected according to the reference model. This may occur because the identified relationship has a higher degree of similarity than the expected relationship. This behavior is caused by the fact that the information used to obtain the representation is not sufficiently discriminating to separate the relationships correctly and that the word embedding model did not correctly learn from the information in the event logs, causing the generation of relationships with high similarity between message-type tasks and other event types.
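The four characteristics above can be expressed as set comparisons between the discovered and the reference relationships, as in the following sketch. The decision rules are an interpretation of the descriptions above rather than a formal definition from the evaluation: a discovered relationship outside the reference is treated as partially correct when it shares exactly one event with a missing reference relationship, and as irrelevant otherwise. Because the text reports scenarios that combine several characteristics, the hypothetical function `choreography_characteristics` returns a set of labels.

```python
def choreography_characteristics(discovered, reference):
    """Label a discovered choreography against a reference (assumed rules)."""
    discovered, reference = set(discovered), set(reference)
    missing = reference - discovered   # expected relationships not recovered
    extra = discovered - reference     # recovered relationships not expected

    # An extra relationship is partially correct if exactly one of its two
    # events matches a missing reference relationship.
    partially_correct = {
        (a, b)
        for (a, b) in extra
        if any((a == ref_a) != (b == ref_b) for (ref_a, ref_b) in missing)
    }
    irrelevant = extra - partially_correct

    labels = set()
    labels.add("complete" if not missing else "under-complete")
    if irrelevant:
        labels.add("over-complete")
    if partially_correct:
        labels.add("partially correct")
    return labels


# Hypothetical reference choreography with two message relationships.
reference = {("s1", "r1"), ("s2", "r2")}

labels_complete = choreography_characteristics(reference, reference)
labels_under = choreography_characteristics({("s1", "r1")}, reference)
labels_over = choreography_characteristics(reference | {("s3", "r9")}, reference)
labels_partial = choreography_characteristics({("s1", "r1"), ("s2", "r7")}, reference)
```

Under these rules, a partially correct relationship always implies that the expected relationship is missing, so the label appears together with under-complete, which matches the combination reported for one of the scenarios.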
In the experimentation carried out, the scenario with the greatest complexity in identifying message-type tasks was Healthcare, according to the weighted value of 0.54 in the FsR metric. In the Healthcare scenario, the process choreographies discovered had the characteristics of a partially correct choreography and an under-complete choreography. In the Air Quality System and Travel Agency scenarios, the process choreographies were classified as complete choreographies, indicating that the information in the event logs, as well as the patterns defined in the proposed methodology, supported the construction of a process choreography similar to the expected one. Moreover, in the Purchase order and Manufacturing process scenarios, process choreographies were generated with characteristics of complete choreography and over-complete choreography, which indicates that the complete choreography was recovered but that relationships that were not part of the choreography were also recovered, as seen in the RP values of 0.83 and 0.91, respectively. Finally, in the Transfer of goods scenario, choreographies with characteristics of over-complete choreography and under-complete choreography were obtained, which were reflected in the RP and RR metrics, indicating that both true relationships and relationships that were not part of the choreography of the process were recovered.