Identifying WeChat Message Types without Using Traditional Traffic
Abstract
1. Introduction
1.1. Contributions
- First, we enumerate the characteristics of the APNs and analyze its traffic patterns.
- We extract five first-order statistical features to distinguish APNs traffic from background traffic and compare the performance of four widely used machine learning algorithms for APNs traffic recognition. The results show that the proposed classifier outperforms the state-of-the-art classifiers in both efficiency and accuracy.
- We identify six WeChat message types from APNs traffic using a hash table lookup method.
- Finally, we propose countermeasures that resist this analysis method.
1.2. Organization
2. Related Work
3. Motivation to Deal with Challenges Using the APNs
- Each time iOS connects to the Internet, it establishes a persistent IP connection with the APNs server [13]. We can cluster all the packets based on this connection.
- The APNs uses transport layer security (TLS) to prevent attackers from obtaining the content of a message. Thus, we cannot directly use the payload of the packet. Instead, we use its side-channel information, e.g., time interval and packet length.
- iOS uses port 5223 or 443 to communicate with the APNs server. The patterns of APNs traffic do not differ between these two ports [7]. When the APNs uses port 5223, we can identify APNs traffic by this port number. When the APNs uses port 443, port-based traffic classification fails because 443 is the well-known port of the HTTPS protocol.
- The application on iOS has three states: running in the foreground, running in the background, and not running [14]. Different applications can use different remote message notification services when the application is in the foreground. For example, Apple’s iMessage still uses the APNs. In contrast, WeChat uses its own remote message notification service. When the application is in the other states, only the APNs can be used to notify the user of new messages at the system level. In these two cases, we can only use APNs traffic to identify the message types.
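The port-related observation above can be sketched as a simple pre-filter. This is a minimal illustration, not the authors' implementation: the `Packet` record and the function names are hypothetical, and only the port numbers 5223 and 443 come from the text. Traffic on port 5223 can be attributed to the APNs directly, whereas traffic on port 443 is only a candidate and has to be separated from ordinary HTTPS traffic by other means (here it is merely flagged).

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Packet:
    """Hypothetical packet record holding only header-level metadata."""
    timestamp: float   # capture time in seconds
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    payload_len: int   # TCP payload length in bytes


def port_verdict(pkt: Packet) -> str:
    """Classify a packet by port number alone.

    "apns"      -> port 5223 is involved, so the port identifies APNs traffic
    "ambiguous" -> port 443 is involved; could be APNs or ordinary HTTPS
    "other"     -> neither port is involved
    """
    ports = {pkt.src_port, pkt.dst_port}
    if 5223 in ports:
        return "apns"
    if 443 in ports:
        return "ambiguous"
    return "other"


def split_by_port(packets: Iterable[Packet]) -> Dict[str, List[Packet]]:
    """Group packets by their port-based verdict."""
    groups: Dict[str, List[Packet]] = {"apns": [], "ambiguous": [], "other": []}
    for pkt in packets:
        groups[port_verdict(pkt)].append(pkt)
    return groups
```

Only the "ambiguous" group needs the side-channel features (packet length and time interval) to decide whether a connection belongs to the APNs.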
4. Proposed Classification System
4.1. Traffic Collection
4.2. Traffic Segmentation
- Tagging data: To improve the performance of our approach, we compared several widely used supervised machine learning algorithms. Supervised machine learning algorithms generally work in three steps: feature extraction, training, and testing. In the first step, features that are expected to be sensitive to the given labels are extracted from the bursts. It should be noted that the extraction step is applied to all the bursts in the training set; that is, those for which the message type is known. In the training step, the classifier is built from the feature vectors produced in the first step. In the testing step, the features extracted from the testing set are fed into the trained classifier to complete the identification task. Because these algorithms require labeled data, we labeled the data as APNs or non-APNs traffic. Since one of the two ports used by the APNs is the well-known port of the HTTPS protocol, we cannot distinguish APNs from non-APNs traffic by port number alone. Therefore, we manually mark APNs and non-APNs traffic based on the time of message arrival.
- Time series transformation: Data pre-processing is a data-mining technique that transforms raw data into a useful and efficient format. If the data contain much irrelevant and redundant information, i.e., noisy and unreliable data, it is more difficult to extract valuable information during the training phase. The traffic we collect contains encrypted information and some control fields of the IP protocol. The encrypted information and some of the control fields (flags, checksums, etc.) are approximately random strings. These random strings cannot help us distinguish APNs traffic from background traffic, yet they still occupy storage space. Therefore, we leverage the side-channel information of the traffic, i.e., packet length and timestamp, to identify APNs traffic. In this way, we obtain the fixed pattern of APNs traffic, as shown in Figure 4, which depicts the APNs traffic that carries one WeChat text notification. The horizontal axis represents time, while the vertical axis represents the length of the TCP payload. A positive value represents the length of a packet received by iOS, while a negative value represents the length of a packet sent by iOS. The WeChat message type is related only to the length of the first APNs packet; the lengths of the other four packets do not change. After the APNs server sends a 53-byte packet to iOS, iOS responds with a 53-byte packet. After about 5 s, iOS sends a 69-byte packet to the APNs server, and the APNs server replies with a 53-byte packet. At that point, a complete message transmission ends.
- Segmenting traffic into bursts: We define a flow as a sequence of packets that share the same quintuple (source IP, destination IP, source port, destination port, and protocol) and a session as the flows in both directions (i.e., with the source and destination exchanged). As mentioned in Section 3, the APNs uses a persistent connection. Thus, we cluster all packets into sessions according to their quintuple. After that, each session is divided into bursts. According to [15], each action produces a traffic burst, so segmenting a flow into bursts is the first step toward measuring the similarity between bursts. A burst is defined as a set of consecutive packets in which the time interval between adjacent packets does not exceed a given threshold.
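The time series transformation and burst segmentation described above can be sketched as follows. This is a minimal sketch under stated assumptions rather than the authors' code: the `Packet` record, the function names, and the `device_ip` parameter are hypothetical. It groups packets into bidirectional sessions by quintuple, converts each session to a series of signed payload lengths (positive when received by the device, negative when sent, following the sign convention of Figure 4), and starts a new burst whenever the gap between adjacent packets exceeds the threshold.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Packet(NamedTuple):
    """Hypothetical packet record; only header-level metadata is kept."""
    timestamp: float   # capture time in seconds
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str
    payload_len: int   # TCP payload length in bytes


Point = Tuple[float, int]  # (timestamp, signed payload length)


def session_key(pkt: Packet) -> tuple:
    """Direction-independent quintuple: both flows map to one session."""
    a = (pkt.src_ip, pkt.src_port)
    b = (pkt.dst_ip, pkt.dst_port)
    return (min(a, b), max(a, b), pkt.protocol)


def to_time_series(packets: List[Packet], device_ip: str) -> Dict[tuple, List[Point]]:
    """Per session, keep only timestamps and signed payload lengths."""
    sessions: Dict[tuple, List[Point]] = defaultdict(list)
    for pkt in sorted(packets, key=lambda p: p.timestamp):
        sign = 1 if pkt.dst_ip == device_ip else -1  # received: +, sent: -
        sessions[session_key(pkt)].append((pkt.timestamp, sign * pkt.payload_len))
    return dict(sessions)


def split_bursts(series: List[Point], threshold: float) -> List[List[Point]]:
    """Split a time-ordered series wherever the inter-packet gap exceeds threshold."""
    bursts: List[List[Point]] = []
    for point in series:
        if bursts and point[0] - bursts[-1][-1][0] <= threshold:
            bursts[-1].append(point)
        else:
            bursts.append([point])
    return bursts
```

The threshold is left as a parameter because its value is chosen empirically (see Section 5.3). Note that the roughly 5 s pause inside one notification exchange means that a threshold below that gap splits a single message into two bursts.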
4.3. Feature Extraction
4.4. APNs Traffic Identification
4.5. Message Type Identification
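As stated in the contributions, message types are identified with a hash table lookup, and Section 4.2 notes that the message type is related only to the length of the first APNs packet. A minimal sketch of such a lookup is given below. The function names are hypothetical, and the packet lengths in the usage example are placeholders chosen for illustration only; they are not measurements from this work. The table is built from labeled bursts rather than hard-coded, so that no specific length-to-type mapping is assumed.

```python
from typing import Dict, Iterable, Optional, Sequence, Tuple


def build_length_table(samples: Iterable[Tuple[int, str]]) -> Dict[int, str]:
    """Build a hash table from (first-packet length, message type) pairs.

    Several lengths may map to the same type (e.g., text notifications
    whose length grows with the text), but one length must not map to
    two different types, otherwise the lookup would be ambiguous.
    """
    table: Dict[int, str] = {}
    for first_len, msg_type in samples:
        known = table.setdefault(first_len, msg_type)
        if known != msg_type:
            raise ValueError(
                f"length {first_len} maps to both {known!r} and {msg_type!r}"
            )
    return table


def identify_message_type(
    burst_lengths: Sequence[int], table: Dict[int, str]
) -> Optional[str]:
    """Look up the message type from the first packet length of a burst.

    Returns None for an empty burst or an unknown first-packet length.
    """
    if not burst_lengths:
        return None
    return table.get(burst_lengths[0])


# Usage with PLACEHOLDER lengths (illustrative values, not measured data).
example_table = build_length_table(
    [(1001, "text"), (1002, "text"), (2001, "picture")]
)
example_type = identify_message_type([1002, 53, -53, -69, 53], example_table)
```

A dictionary gives constant-time lookups on average, which is consistent with the efficiency goal of the approach; lengths absent from the table are reported as unknown rather than guessed.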
5. Evaluation
5.1. Data Description
- Text: A WeChat text message has only one variable: the length of the text. We found that the length of the APNs TCP segment increases staircase-wise as the length of the text message increases from 1 to 91. When the length of the text increases from 92 to 16,354, the payload length of the segment does not change.
- Red packet: A red packet message has two variables: the first is the amount, ranging from 0.01 to 200 RMB, and the second is the greeting that explains the purpose of the red packet.
- Fund transfer: Similar to red packet messages, a fund transfer message has two variables: amount and note. The amount ranges from 0.01 to 200,000 RMB, within the daily transaction amount. The note uses 1 to 20 letters to describe the purpose of the fund transfer.
- Picture: A picture message has only one variable: the image size.
- Video: Similar to picture messages, a video message has only one variable: the video length, which ranges from 1 to 10 s.
- Voice: The length of a voice message ranges from 1 to 60 s.
5.2. Evaluation Metrics
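- Accuracy: defined as the percentage of instances correctly classified among all samples. It is defined by the following formula: $\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$, where $TP$, $TN$, $FP$, and $FN$ denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.
- Recall: also called true positive rate or sensitivity in this context. It is given by $\mathrm{Recall} = \frac{TP}{TP + FN}$.
- Precision: also called positive predictive value, expressed as $\mathrm{Precision} = \frac{TP}{TP + FP}$.
- F1: takes into account both precision and recall, and is given by the following formula: $F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$.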
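The four metrics used in this evaluation (accuracy, recall, precision, and F1) can be computed from confusion-matrix counts as sketched below. The function name `binary_metrics` is hypothetical, the formulas are the standard textbook definitions, and the counts in the usage example are made up for illustration; they are not results from this work.

```python
from typing import Dict


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 from confusion counts.

    tp/fp/tn/fn are the numbers of true positives, false positives,
    true negatives, and false negatives. A metric whose denominator
    is zero is reported as 0.0.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Illustrative counts only (not taken from the experiments).
example = binary_metrics(tp=8, fp=2, tn=9, fn=1)
```

With APNs traffic treated as the positive class, a false positive is a background burst labeled as APNs and a false negative is an APNs burst that was missed, so precision and recall capture the two kinds of error separately.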
5.3. Selection of Burst Threshold
5.4. Classification Performance
5.5. Efficiency
5.6. Countermeasures
6. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
- Wang, Y.; Zheng, N.; Xu, M.; Qiao, T.; Zhang, Q.; Yan, F.; Xu, J. Hierarchical Identifier: Application to User Privacy Eavesdropping on Mobile Payment App. Sensors 2019, 19, 3052.
- Wang, Q.; Yahyavi, A.; Kemme, B.; He, W. I know what you did on your smartphone: Inferring app usage over encrypted data traffic. In Proceedings of the 2015 IEEE Conference on Communications and Network Security (CNS), Florence, Italy, 28–30 September 2015.
- Park, K.; Kim, H. Encryption Is Not Enough: Inferring user activities on KakaoTalk with traffic analysis. In Proceedings of the International Workshop on Information Security Applications, Jeju Island, Korea, 20–22 August 2015; pp. 254–265.
- Conti, M.; Mancini, L.V.; Spolaor, R.; Verde, N.V. Analyzing android encrypted network traffic to identify user actions. IEEE Trans. Inf. Forensics Secur. 2016, 11, 114–125.
- Fu, Y.; Hui, X.; Lu, X.; Jin, Y.; Chen, C. Service Usage Classification with Encrypted Internet Traffic in Mobile Messaging Apps. IEEE Trans. Mob. Comput. 2016, 15, 2851–2864.
- Shafiq, M.; Yu, X.; Laghari, A.A. WeChat Text Messages Service Flow Traffic Classification Using Machine Learning Technique. In Proceedings of the IEEE International Conference on High-performance Computing & Communications, IEEE International Conference on Smart City, Sydney, NSW, Australia, 12–14 December 2016.
- Coull, S.E.; Dyer, K.P. Traffic analysis of encrypted messaging services: Apple imessage and beyond. ACM SIGCOMM Comput. Commun. Rev. 2014, 44, 5–11.
- Conti, M.; Li, Q.Q.; Maragno, A.; Spolaor, R. The dark side (-channel) of mobile devices: A survey on network traffic analysis. IEEE Commun. Surv. Tutor. 2018, 20, 2658–2713.
- Guo, W.; Liu, H. The analysis of push technology based on iphone operating system. In Proceedings of the 2013 2nd International Conference on Measurement, Information and Control, Harbin, China, 16–18 August 2013; Volume 1, pp. 570–574.
- Wang, Y.; Ke, W.; Tao, X. A feature selection method for large-scale network traffic classification based on spark. Information 2016, 7, 6.
- Sultan, K.; Ali, H.; Ahmad, A.; Zhang, Z. Call Details Record Analysis: A Spatiotemporal Exploration toward Mobile Traffic Classification and Optimization. Information 2019, 10, 192.
- Gusgård, O. Application Development for the Apple Watch. Available online: https://www.theseus.fi/bitstream/handle/10024/147350/Gusgard_Thesis.pdf?sequence=1 (accessed on 25 December 2019).
- Lee, D. Designing the multimedia push framework for mobile applications. Int. J. Adv. Sci. Technol. 2011, 32, 117–124.
- Brüstel, J.; Preuss, T. A universal push service for mobile devices. In Proceedings of the 2012 Sixth International Conference on Complex, Intelligent, and Software Intensive Systems, Palermo, Italy, 4–6 July 2012; pp. 40–45.
- Stöber, T.; Frank, M.; Schmitt, J.; Martinovic, I. Who do you sync you are? Smartphone fingerprinting via application behaviour. In Proceedings of the Sixth ACM Conference on Security and Privacy in Wireless and Mobile Networks, New York, NY, USA, 17–19 April 2013; pp. 7–12.
- Yan, F.; Xu, M.; Qiao, T.; Wu, T.; Yang, X.; Zheng, N.; Choo, K.K.R. Identifying WeChat Red Packets and Fund Transfers Via Analyzing Encrypted Network Traffic. In Proceedings of the 2018 17th IEEE International Conference On Trust, Security And Privacy in Computing And Communications/12th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE), New York, NY, USA, 1–3 August 2018; pp. 1426–1432.
- Erdem, E.; Sandıkkaya, M.T. OTPaaS—One time password as a service. IEEE Trans. Inf. Forensics Secur. 2018, 14, 743–756.
- Wang, D.; Wang, P. Two birds with one stone: Two-factor authentication with security beyond conventional bound. IEEE Trans. Dependable Secur. Comput. 2016, 15, 708–722.
- Wang, D.; Cheng, H.; He, D.; Wang, P. On the challenges in designing identity-based privacy-preserving authentication schemes for mobile devices. IEEE Syst. J. 2016, 12, 916–925.
- Jiang, Q.; Qian, Y.; Ma, J.; Ma, X.; Cheng, Q.; Wei, F. User centric three-factor authentication protocol for cloud-assisted wearable devices. Int. J. Commun. Syst. 2019, 32, e3900.
Information | Sender | Receiver |
---|---|---|
Device | Mumu simulator | iPhone 8 Plus |
OS version | Android 4.4.4 | iOS 11.3 |
WeChat version | 6.6.6 | 6.7.1 |
Account name | Bob | Alice |
Message Types | Variable | Number of Messages |
---|---|---|
Text | length | 100 |
Red packet | amount and greeting | 100 |
Fund transfer | amount and note | 100 |
Picture | size | 100 |
Video | size | 100 |
Voice | size | 100 |
Algorithms | Accuracy | F1 | Precision | Recall |
---|---|---|---|---|
KNN | 0.9924 | 0.9930 | 0.9911 | 0.9951 |
SVM | 0.9689 | 0.9683 | 1.0000 | 0.9430 |
RF | 0.9984 | 0.9986 | 0.9979 | 0.9993 |
GNB | 0.9996 | 0.9996 | 1.0000 | 0.9993 |
Procedure | Time (Seconds) | Time in [16] (Seconds) |
---|---|---|
Time series transformation | 1.48 | / |
Traffic segmentation (burst) | 0.03 | 4.0 |
Feature extraction | 1.70 | 2.06 |
10-fold cross validation (GNB) | 0.04 | / |
Training and testing (RF) | / | 0.22 |
Total | 3.25 | 6.28 |
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Zhang, Q.; Xu, M.; Zheng, N.; Qiao, T.; Wang, Y. Identifying WeChat Message Types without Using Traditional Traffic. Information 2020, 11, 18. https://doi.org/10.3390/info11010018