1. Introduction
Modern artificial intelligence algorithms make machines better at decision-making applications [1], as well as at interaction-related applications, becoming more and more intuitive for a human user [2,3,4,5]. Computers, with the increase of their computing power, are able to perform many activities much faster than humans [6,7,8,9,10]. The world’s leading research centers are working on algorithms that would allow the machine to compete with the human brain [11,12,13,14]. Adopting (at least partially) intelligent human behavior enables the analysis and appropriate interpretation of data, and thus progress on some of the current challenges, including decision-making processes [10,15,16].
Machine learning relies on the analysis of a set of collected inputs to detect patterns, similarities or characteristics [5,17,18,19]. This makes the machine able to ‘learn’ (as the term implies) from the data received [5,17,20]. Training performed iteratively increases the efficiency and accuracy of predictions [21], because the algorithms the machine uses gather experience in each repetition, which is then used to make adjustments [21,22]. Samples of categorized data are also provided for training, so that the expected result for each sample is known. This enables the computer to calculate output values that come closer to the expected values with each iteration. During the learning process, an algorithm is created that enables the machine to process data and, on its basis, predict values of a specific type [3,19,23,24,25].
The variety of applications is constantly being expanded in specialized research centers, both public and private [15,19], and the rapid development of computers enables the processing of ever more data and the use of increasingly complex algorithms. Thanks to machine learning, it has become possible to solve complex problems [19,20,21,24,25,26].
The implementation of self-learning algorithms has allowed machine learning to be used for various tasks [19,20,27,28]. Machine learning models trained on images and videos have made it possible to extract and identify specific objects, faces and/or persons [19,24,29,30,31]. It is also possible to infer within the time domain, i.e., to generate predictions [31].
1.1. Motivation for This Study
Scientists all over the globe are interested in increasing the computational capability of their laptops and desktops in order to increase their productivity and comfort [32,33,34,35]. However, hardware manufacturers provide only raw parameters (while the computational capability remains strongly dependent on the software frameworks and the operating system), and comparing frameworks is also difficult due to differences in recommended hardware [36,37].
The most inconvenient aspect of buying new hardware for research work is the fact that the products are often advertised using nonprofessional (‘foggy’) marketing phrases that are not related to the actual comfort and computational efficiency of everyday work [38,39].
This is particularly true and important in the case of computers equipped with Apple silicon chips, as this platform is one of the most tempting options for researchers who want hardware acceleration support for their machine learning projects [40,41]. Apple’s M1 chip is equipped with integrated neural cores, which makes the platform worth consideration when ‘updating’ computational hardware [42,43]. However, it is very difficult to find any information on the actual benefits of using a CPU with neural cores, and it is even more difficult to find comparisons between existing versions of the chip, or between the M1 and the Intel-based laptops of previous editions.
The authors decided that such a comparison may be very useful, if done with real-world modern machine learning examples, and that the results may be of great value for researchers who are considering the investment in buying a new (possibly neural-core-enabled) laptop.
1.2. Scope and Limitations of Study
The research presented within this paper is targeted toward the researcher (laptop buyer) who is trying to make a reasonable choice among Apple’s laptops. The authors’ approach is to clarify and substantiate concrete arguments that present the advantages of particular laptop models in numbers (as opposed to the foggy world of marketing), enabling a fact-based choice of which Apple laptop to buy. The decision to benchmark only Apple laptops was based on the fact that the M1 chips contain a neural processing unit (NPU), and therefore many researchers (laptop buyers) may be interested in measuring the real benefits of NPU acceleration in the M1 family.
This research compares all M1 models against one Intel i5 processor. The i5 is not the most efficient member of its architecture family, but it is the only Intel option available on the market in new Apple laptops at the moment. Although Apple used to sell i9-based laptops in the past, they failed to fit the preferences of users, possibly due to increased energy consumption (shorter battery life) and cooling requirements (Apple users are not used to active cooling systems and prefer CPU fans to be quiet or off), and the use of i9s was discontinued.
The comparison was made only among Apple laptops, and it would also be interesting to see a comparison beyond that limitation. However, it was decided to limit the number of variables to make the research as precise and trustworthy as possible. The authors did not mix operating systems; there was also no disturbance resulting from switching toolboxes, libraries, frameworks and programming languages. The uniform testing conditions were provided on purpose, so that the research could not be accused of being influenced by a poor library choice or a questionable framework implementation. It was decided to compare only the current laptops sold by Apple, to assess the real benefits of the NPU and the real differences between M1 versions. Of course, benchmarks of other architectures would be interesting to see in this comparison (including CUDA-enabled platforms), but that would be a completely new challenge (with the difficulty of making the comparison a just and honest test across different operating systems and environments), which was not in the scope of this paper.
1.3. Benchmarking
The benchmarking of computer hardware is performed using specialized software designed to stress-test the whole system or its specific components and to reveal their computational limits, usually presenting the results in the form of a computation time [44,45,46]. It is very important that the tasks are as close to reality as possible, that is, that they correspond to the real use cases in which the equipment will be used. The use of realistic tasks ensures that the measured performance corresponds to ‘real’ applications [47]. ML-related benchmarking is attracting increasing attention [48,49,50,51].
It is very important to properly analyze and interpret the results after the end of the measurement phase. After collecting the data, the differences between comparative runs of the benchmark executed on the competing machines are analyzed. The results can be converted into an easy-to-interpret reference scale such as points or ratings, visualized and/or compared with other published results of the same benchmark [52].
2. Materials and Methods
According to the title of this paper, the main objective of the research was to assess the usability of M1-based platforms for basic machine learning research tasks (for which a laptop-based environment would be sufficient). The benchmarks and comparisons were made using a Swift-based ‘Create ML’ machine learning ‘Playground’ project, developed and run in the Xcode environment.
Swift is an open-source programming language developed by Apple. Since its release in 2014, the language has continuously evolved to make mobile and desktop programming easy and fast. The syntax is intuitive and allows an interactive programming process. Swift is optimized to perform best on its target devices, i.e., the platforms developed by Apple. The achieved efficiency is high because the compiled code is adjusted to the hardware [53].
Xcode is an integrated development environment (IDE) created by Apple for the macOS operating system. It supports various programming languages and includes many tools for the software development process. Using Xcode, the user is able to design user interfaces that dynamically adjust to the screen size, write code that takes advantage of the many frameworks shared by Apple, run the code on physical or virtual Apple devices and much more. Finished applications can be beta-tested by real users using the ‘TestFlight’ program or sent directly to the App Store [54,55].
Xcode Playgrounds are a solution for quick and interactive program development. These simplified IDE versions allow testing code almost in real time. At the same time, Playgrounds enable the user to take full advantage of the Xcode IDE and its frameworks without creating large projects. Easy project sharing and the ability to include all datasets in the created playground are the reasons why Xcode Playgrounds were selected for this research.
The framework ‘Create ML’ was introduced by Apple at the Worldwide Developers Conference in 2018. Its aim is the development of custom machine learning models in a simple way. To train the models, Create ML can analyze various types of inputs, including images, text, tabular data and more. The created models can be implemented in Core ML applications for the Apple platform [23,56]. The inputs and outputs of Create ML are presented in Figure 1.
‘Create ML’ is optimized to perform fast model training on computers made by Apple. The model creation process is meant to popularize machine learning technologies in applications for Apple devices [57].
The workflow with ‘Create ML’ involves the following steps: (1) identifying the problem, i.e., what kind of action is required from the model, and (2) choosing a training template. ‘Create ML’ ships with a collection of templates including an image classifier, object detection and more, and Apple extends this list regularly [58,59].
The templates are needed to determine the right training algorithm. The training techniques developed by Apple make it possible to create an efficient machine learning model even with a small quantity of data. The process performance and memory usage are hardware-optimized. After training, a ‘.mlmodel’ file is created in which the machine learning model is saved. The file can be implemented in Xcode projects using Core ML [57]. The process of creating a model with Create ML is shown in Figure 2.
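As a minimal illustration of this workflow, the sketch below trains an image classifier from a labeled directory and saves the resulting ‘.mlmodel’ file. This is not the authors’ code; the paths are hypothetical placeholders, and it must run on macOS with the Create ML framework available.

```swift
import CreateML
import Foundation

// Hypothetical path: a directory whose subdirectories are named after the classes.
let trainingDir = URL(fileURLWithPath: "/path/to/TrainingData")

// Train an image classifier from the labeled directories (one folder per class).
let classifier = try MLImageClassifier(
    trainingData: .labeledDirectories(at: trainingDir)
)

// Inspect the training/validation metrics reported by Create ML.
print("Training error:", classifier.trainingMetrics.classificationError)
print("Validation error:", classifier.validationMetrics.classificationError)

// Save the trained model as a '.mlmodel' file, ready for use with Core ML.
try classifier.write(to: URL(fileURLWithPath: "/path/to/Classifier.mlmodel"))
```

The saved file can then be dragged into an Xcode project and used through the generated Core ML interface.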
‘Create ML’ can be used in two ways: by writing code in the Swift programming language or by using a graphical application (graphical user interface, GUI) bundled with Xcode. In both cases, the user is able to create and train custom machine learning models that are ready for implementation in applications [56,59]. In the created project, ‘Create ML’ was used without the GUI.
2.1. Model Creation
To create a machine learning model using ‘Create ML’, it is necessary to know what the expected output is and what data will allow it to be achieved. This enables the selection of an appropriate ‘Create ML’ ‘template’ (which determines the training type) and the preparation of a relevant dataset [57]. The data should be differentiated, in a similar quantity for each of the considered classes, and obtained in a similar format, by similar methods and/or sensors. The performance of the model depends on the size of the datasets: the more data used for training, the greater the likelihood of obtaining good results [57,60].
The collected data must be properly segregated and labeled. When processing images, text or audio files, Create ML suggests preparing folders (directories) named after the classes existing within the dataset and placing the files in the correct folders. Tabular data can be provided in the form of tabular data files, e.g., ‘.csv’ or ‘.json’, which are then converted by Create ML to the MLDataTable type. It is necessary to indicate which columns will be used as ‘features’ and which ones as the ‘target’ columns to be returned by the model [57,60].
The prepared data should also be divided into two subsets: ‘training data’, on which the training is performed, and ‘test data’, used for the evaluation of the created model. Apple recommends randomly splitting the whole dataset into these two subsets at a ratio of 80:20. This makes it possible to test the model on data that have not been used for training purposes, which makes the evaluation more accurate. The process of image data preparation for ‘Create ML’ is presented in Figure 3. In the case of tabular data, ‘Create ML’ has a ready-made function that performs the data division automatically [60,61].
2.2. Model Training
The process of creating a machine learning model in ‘Create ML’ consists of several stages, depending mainly on the selected template. In most cases, the process is similar to the image classifier example shown in Figure 4.
The first step in creating an image classifier is extracting the features from the provided data. The system first sets aside a separate ‘validation data’ subset from the ‘training data’ used in the training process. Next, Create ML analyzes the rest of the training data to determine the image features. The same process is repeated on the validation data.
In the next stage, the training takes place. ‘Create ML’ uses transfer learning [57], which is capable of applying an existing model (trained with a dataset relevant to one problem) to a completely new problem. The macOS operating system already ships with extensive machine learning models [57,62] created by Apple. ‘Create ML’ uses their patterns to perform the new training with the previously extracted features. This allows the entire process to be sped up, the accuracy to be increased and the size of the models to be reduced [57].
The case of training a tabular data model is presented in Figure 5. At the beginning, the data are read and the validation subset is extracted. If the user has chosen a specific training algorithm, it is used; however, there is also the possibility of automatic algorithm selection. In this case, ‘Create ML’ trains the model using various algorithms with self-selected parameters and then compares their results. The process continues until the optimal solution is found.
2.3. Model Evaluation
After completing the training, it is possible to evaluate the model. For this purpose, the test subset of the dataset is used. ‘Create ML’ first repeats the process of analyzing the received data and then passes the data to the model. In the next step, the responses from the model are confronted with the ground truth (the labeled testing dataset). Evaluation results are usually lower than training and validation results because the tests are performed on the part of the dataset that was not involved at the training stage. This gives a better picture of how the model would behave in a real application [60].
2.4. Model Export
After training, a ‘.mlmodel’ file is created containing the saved machine learning model. This file can be imported directly into Xcode projects and implemented using ‘Core ML’. It is also possible to save the file in a selected location on the disk or to share it using file sharing [57].
2.5. Datasets Used within This Research
To perform this research on machine learning model creation time, three model types were selected:
Image classifier;
Tabular classifier;
Tabular regressor.
Two image datasets and two tabular datasets were used in this work.
2.5.1. ClassifierData Dataset (Photos)
The dataset ‘ClassifierData’ contains images of electronic devices divided into 4 categories: ‘iPhone’, ‘MacBook’, ‘Apple Watch’ and ‘AirPods’. Each category contains 25 photos captured with an iPhone X. The pictures present the devices from different angles and in various light conditions. A sample view of the dataset is shown in Figure 6.
The dataset is divided into two parts, training data and testing data, split at a ratio of 4:1. The directory structure of the dataset complies with Apple’s ‘Create ML’ guidelines [60] and is presented in Figure 7. The size of the ‘ClassifierData’ dataset is MB.
2.5.2. Animals Dataset (Kaggle)
An image dataset called ‘Animals’ is available on the Kaggle platform, within the project named ‘Animals Detection Images Dataset. Collection of wild animal species with annotations’ [63]. The collection was made available to the public domain under a Creative Commons license.
The ‘Animals’ dataset is divided into two subsets, for training and for testing. Each of them contains photos in 80 different categories, covering various types of animals such as dogs, cats, lions, tigers and many others. A sample view of the dataset is presented in Figure 8. The size of the entire dataset is GB. This research used the following quantities of images for training, validation and testing: 21,468, 1098 and 6505, respectively. Label files were removed from each category, as they are not used by the ‘Create ML’ image classifier.
2.5.3. PaymentFraud Dataset (Kaggle)
The ‘PaymentFraud’ dataset refers to the ‘Online Payments Fraud Detection dataset. Online payment fraud big dataset for testing and practice purpose’ from the Kaggle platform [64]. The dataset includes 5,080,807 entries divided into two classes, each entry described by eight features. Its author is the user ‘Rupak Roy’. The collection was made available under a Creative Commons attribution noncommercial license.
The set consists of a ‘.csv’ file containing data about online transactions, including, inter alia, the sender’s and recipient’s account balances and the amount and type of transaction. Each entry also contains an ‘isFraud’ label, which describes whether the transaction was normal (0) or a fraud (1). A view of the dataset is shown in Figure 9. The size of the ‘PaymentFraud.csv’ file is MB.
2.5.4. SteamReviews Dataset (Kaggle)
The ‘SteamReviews’ dataset contains approximately 21 million user reviews of games available on the Steam platform. It was uploaded to the Kaggle platform by ‘Marko M.’ under the name ‘Steam Reviews dataset 2021. Large collection of reviews of Steam games’ [65]. The dataset is available under the GNU GPL 2 license.
In order to adapt the dataset for use in ‘Create ML’, a cleanup was performed. The Python programming language was chosen for its ease of programming and its suitability for working with data. Using the ‘Pandas’ package, the contents of the ‘.csv’ file were loaded into the ‘DataFrame’ format. Then, the data was cleaned by removing some of the columns, including the comment text, author, creation date and others. Finally, the cleaned dataset was saved to a GB file, ‘SteamReviewsCleaned.csv’. A sample view of the file is presented in Figure 10.
2.6. Choosing Appropriate Classifiers for a Particular Dataset
As part of the study, the training and evaluation of 4 machine learning models were carried out. The following classifiers were used for the four datasets:
ClassifierData—image classifier;
Animals—image classifier;
PaymentFraud—tabular classifier;
SteamReviews—tabular regressor.
2.7. Creation of the Models
The models were created using functions written in the Swift programming language. The functions are described below.
Each image classifier was created in a function called ‘createImageClassifier’. The training dataset was passed using the ‘MLImageClassifier.DataSource’ type. At the beginning, the actual timestamp (with nanosecond accuracy) was saved. Next, the ‘Create ML’ process of training an image classifier was started using the ‘MLImageClassifier’ function. The whole process was documented/logged in the console. When the training was done, another timestamp was saved and the total training time was calculated. The function returned the created model and time value in seconds.
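The timing wrapper described above might look roughly like this (a sketch reconstructed from the description, not the authors’ verbatim code):

```swift
import CreateML
import Foundation

/// Trains an image classifier and measures the wall-clock training time.
/// A sketch of the 'createImageClassifier' function described in the text.
func createImageClassifier(
    trainingData: MLImageClassifier.DataSource
) throws -> (model: MLImageClassifier, seconds: Double) {
    // Timestamp with nanosecond resolution before training starts.
    let start = DispatchTime.now()

    // Run the Create ML training process; progress is logged to the console.
    let model = try MLImageClassifier(trainingData: trainingData)

    // Second timestamp after training; the difference is the training time.
    let end = DispatchTime.now()
    let seconds = Double(end.uptimeNanoseconds - start.uptimeNanoseconds) / 1e9
    return (model, seconds)
}
```
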
The implementation of the ‘createTabularClassifier’ function was similar to that of the image classifier. The only differences were that the data type was ‘MLDataTable’ and that the Create ML function ‘MLClassifier’ took not only the training data but also required indicating the target column for the classification. The time of the training was measured and returned (in seconds), as well as the created model.
A function called ‘createTabularRegressor’ was used to create a tabular regressor. It took the same arguments as the ‘createTabularClassifier’ function; however, the missing data first needed to be removed from the ‘MLDataTable’ using the ‘data.dropMissing()’ command. The ‘MLLinearRegressor’ function then performed the Create ML training of a tabular regressor model.
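The two tabular model creators can be sketched analogously (again an approximation of the described functions, not the authors’ exact code; whether the `dropMissing()` call falls inside or outside the timed region is an assumption here):

```swift
import CreateML
import Foundation

/// Trains a tabular classifier on an MLDataTable and times the process.
func createTabularClassifier(
    data: MLDataTable, targetColumn: String
) throws -> (model: MLClassifier, seconds: Double) {
    let start = DispatchTime.now()
    let model = try MLClassifier(trainingData: data, targetColumn: targetColumn)
    let seconds = Double(DispatchTime.now().uptimeNanoseconds - start.uptimeNanoseconds) / 1e9
    return (model, seconds)
}

/// Trains a linear regressor; rows with missing values are dropped first.
func createTabularRegressor(
    data: MLDataTable, targetColumn: String
) throws -> (model: MLLinearRegressor, seconds: Double) {
    let cleaned = data.dropMissing()   // remove entries with missing data
    let start = DispatchTime.now()
    let model = try MLLinearRegressor(trainingData: cleaned, targetColumn: targetColumn)
    let seconds = Double(DispatchTime.now().uptimeNanoseconds - start.uptimeNanoseconds) / 1e9
    return (model, seconds)
}
```
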
2.8. Model Testing
Each model was tested using a dedicated function. The time of the Create ML testing process was also measured and the results acquired.
The ‘testImageClassifier’ function took the testing dataset as an argument, in the form of an ‘MLImageClassifier.DataSource’ type, as well as the created machine learning model. Firstly, the initial timestamp was acquired. Then, the Create ML testing process was performed, and the percentage of correctness was calculated based on the returned result. The function calculated the total testing time by subtracting the initial timestamp from the final one. The evaluation of a tabular classifier was almost the same, except for the model type (‘MLClassifier’) and the testing data type (‘MLDataTable’).
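A sketch of this evaluation wrapper is given below (assuming Create ML’s metrics API; the accuracy formula converts the reported classification error into a percentage):

```swift
import CreateML
import Foundation

/// Evaluates an image classifier on a test dataset and times the process.
func testImageClassifier(
    model: MLImageClassifier,
    testingData: MLImageClassifier.DataSource
) -> (accuracyPercent: Double, seconds: Double) {
    let start = DispatchTime.now()

    // Run the Create ML evaluation on the labeled test data.
    let metrics = model.evaluation(on: testingData)

    // Convert the classification error into a percentage of correct answers.
    let accuracy = (1.0 - metrics.classificationError) * 100.0

    let seconds = Double(DispatchTime.now().uptimeNanoseconds - start.uptimeNanoseconds) / 1e9
    return (accuracy, seconds)
}
```
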
To perform a test of a trained tabular regressor, the ‘testTabularRegressor’ function was used. The model was of the ‘MLLinearRegressor’ type, and the testing dataset took the form of an ‘MLDataTable’. After the evaluation, the maximum error and RMSE (root-mean-square error) were returned. The function also measured the time of the testing process.
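The regressor test can be sketched in the same style (a sketch under the assumption that Create ML’s `MLRegressorMetrics` supplies the two error figures named in the text):

```swift
import CreateML
import Foundation

/// Evaluates a linear regressor on a test MLDataTable, returning the
/// maximum error, the RMSE and the measured evaluation time.
func testTabularRegressor(
    model: MLLinearRegressor, testingData: MLDataTable
) -> (maxError: Double, rmse: Double, seconds: Double) {
    let start = DispatchTime.now()
    let metrics = model.evaluation(on: testingData)
    let seconds = Double(DispatchTime.now().uptimeNanoseconds - start.uptimeNanoseconds) / 1e9
    return (metrics.maximumError, metrics.rootMeanSquaredError, seconds)
}
```
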
2.9. Parent Functions for Creating the Models
The process of training and testing the machine learning models was controlled by special parent functions, created for each ‘Create ML’ template type. These functions loaded the datasets, called the subfunctions described above and presented the results in the console. This reduced the required amount of code because, in the case of the image classifier, the functions were universal: they could be used with various datasets.
In the ‘createImageModel’ function, the image classifier was created and evaluated. Firstly, the paths to the training and testing data were created, based on the dataset name passed. Then, the data was loaded into the ‘MLImageClassifier.DataSource’ type, and the directory structure of ‘labeledDirectories’ type was declared. The model was created using the ‘createImageClassifier’ subfunction. The training and validation accuracies were calculated from the result. Then, the model evaluation was performed using the ‘testImageClassifier’ subfunction. All results were printed to the console.
The tabular data was handled in a similar way. To assess the tabular classifier, a function ‘createTabularClassifierModel’ was used. At the beginning, a ‘.csv’ file with the dataset was opened and loaded into the ‘MLDataTable’ type. Next, the feature columns were declared and a new ‘MLDataTable’ was created using the feature columns only. The new table was then randomly split into training and testing datasets, in compliance with the recommended 8:2 proportion. To create a tabular classifier, the ‘createTabularClassifier’ subfunction was used, which took the training data and a target column as arguments. The created model and the testing dataset were passed to the ‘testTabularClassifier’ subfunction. The obtained results were presented in the console. The tabular regressor research process differed from the classifier only in the data types involved and the datasets used.
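Putting the pieces together, the parent function for the tabular classifier could be sketched as follows (the signature, column handling and seed are hypothetical; the column subscript on `MLDataTable` is assumed from Apple’s API):

```swift
import CreateML
import Foundation

/// Loads a tabular dataset, trains a classifier and evaluates it,
/// printing the results to the console. A sketch of the described
/// 'createTabularClassifierModel' parent function.
func createTabularClassifierModel(
    csvPath: String, featureColumns: [String], targetColumn: String
) throws {
    // Load the '.csv' file into an MLDataTable.
    let fullTable = try MLDataTable(contentsOf: URL(fileURLWithPath: csvPath))

    // Keep only the declared feature columns (plus the target).
    let table = fullTable[featureColumns + [targetColumn]]

    // Random 8:2 split into training and testing subsets.
    let (trainingData, testingData) = table.randomSplit(by: 0.8, seed: 42)

    // Train and time the classifier.
    let start = DispatchTime.now()
    let model = try MLClassifier(trainingData: trainingData, targetColumn: targetColumn)
    let trainSeconds = Double(DispatchTime.now().uptimeNanoseconds - start.uptimeNanoseconds) / 1e9

    // Evaluate on the held-out subset and report the results.
    let metrics = model.evaluation(on: testingData)
    print("Training time: \(trainSeconds) s")
    print("Evaluation accuracy: \((1.0 - metrics.classificationError) * 100)%")
}
```
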
2.10. Measurement of Execution Time
In order to obtain greater reliability of the results when measuring model creation times, the study was performed in a loop, and all measurements were made three times. The total run time of each loop iteration, of the model creation process and of the model testing process was measured. All information about the script’s running status was reported in the console on an ongoing basis.
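The repetition loop can be sketched as below, using plain Foundation timing; ‘runBenchmark’ is a hypothetical stand-in for one full train-and-test pass:

```swift
import Foundation

/// Hypothetical stand-in for one full training-and-testing pass.
func runBenchmark() {
    // ... train and test one model here ...
}

// Repeat each measurement three times for greater reliability.
for iteration in 1...3 {
    let start = DispatchTime.now()
    runBenchmark()
    let seconds = Double(DispatchTime.now().uptimeNanoseconds - start.uptimeNanoseconds) / 1e9
    print("Iteration \(iteration) took \(seconds) s")
}
```
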
2.11. Hardware Used
The research was carried out on 4 Apple MacBook Pro computers. Each device was running macOS Monterey version 12.4 [66], with Xcode version installed.
The four above-mentioned notebooks were chosen to reflect the buyer’s options when buying a new laptop. The main difference was the CPU version (and architecture): Intel i5 (8 GB), Apple M1 (8 GB), Apple M1 Pro (16 GB) and Apple M1 Max (32 GB).
The first system compared was the Apple MacBook Pro 2016. It is the only one of the compared computers operating on the Intel architecture. It has a dual-core Intel Core i5 processor, 8 GB of RAM and an integrated Intel Iris Graphics 500 chip equipped with 1536 MB of memory. This computer is referred to as ‘i5’ in this work.
The next tested device was the 2020 MacBook Pro. It is a computer with the first Apple M1 integrated chip, based on the ARM architecture. It has 8 computing cores: 4 high-performance and 4 energy-saving. The chipset also includes an 8-core GPU (graphical processing unit) and a 16-core Neural Engine, Apple’s custom NPU (neural processing unit), accelerating machine learning computations [67]. The computer also has 8 GB of RAM. It is referred to as ‘M1’ in this research.
Another computer on which the tests were performed was the MacBook Pro 2021 with the M1 Pro chip. This is a higher option in the Apple chipset range, featuring an 8-core CPU and a 14-core GPU. The main difference between the M1 and the M1 Pro is the division of the functionality of the processor cores: the M1 Pro uses 6 cores as high-performance and 2 as energy-saving ones. The memory interface is also improved [68]. The M1 Pro also features a 16-core Neural Engine [68]. The computer is equipped with 16 GB of RAM. It is referred to as ‘M1 Pro’ for easy recognition.
The fourth computer tested was the MacBook Pro 2021 with the M1 Max chip. This processor is equipped with 10 CPU cores (8 high-performance and 2 energy-saving), a 24-core GPU and a 16-core Neural Engine, and the computer has 32 GB of RAM. According to [68], the bandwidth of the system is improved compared to the M1 Pro, as is the built-in memory. It is referred to as ‘M1 Max’.
4. Discussion
The research included a comparative analysis of the computational performance of four currently available Apple laptop processor models in machine learning applications performed using ‘Create ML’, a machine learning tool designed by Apple. As Create ML is able to create the same machine learning models regardless of the hardware it runs on, all models obtained satisfactory results, which confirmed their usability and applicability in the prepared use cases.
The obtained processing time measurements clearly showed that the ‘M1 Pro’ and ‘M1 Max’ computers were best suited for ‘Create ML’ machine learning applications. The older computers, although technically weaker, were still capable of producing useful machine learning models, although it took considerably more time. The use of three publicly available datasets in this work enables the research to be repeated and the results to be compared.
On the basis of the obtained results the following conclusions can be drawn:
The computer equipped with the ‘M1’ chipset was the best one at creating models from smaller datasets;
The ‘M1 Pro’ and ‘M1 Max’ systems usually achieved similar results;
The ‘M1 Max’ was the best one at processing tabular data;
The advantage of the ‘M1 Max’ system over the ‘M1 Pro’ grew with the size of the tabular dataset;
Despite the longest training and validation times, the ‘i5’ computer created similar or identical machine learning models, and in the case of the ‘ClassifierData’ dataset, it even achieved the highest accuracy;
The number of iterations of the training process in ‘Create ML’ did not translate directly into the correctness obtained;
The ‘training data—analysis time’ and the validation analysis time had no direct effect on the ‘model training—total time’ of the model;
Despite the use of the same Neural Engine accelerator (NPU) in the ‘M1’, ‘M1 Pro’ and ‘M1 Max’ systems, their training times were different, which suggests that the training process was also influenced by other parameters of the microcircuit and/or system;
In all of the analysis and data processing time results, Apple’s M1 based chipsets achieved similar values;
The evaluation confirmed that the models prepared on the different computers were almost identical or, in the case of the tabular datasets, the same;
The ‘evaluation accuracy’ was usually lower than the ‘training accuracy’ and ‘validation accuracy’.
It is possible to extend the ‘Benchmarker’ program with the ability to use new datasets; its methods and functions are already prepared to allow for the easy use of new data. This would allow us to check how the processors/systems deal with datasets of different sizes and properties. Another research opportunity is to perform the benchmarks on other computers. This would allow the results to be confronted with other macOS-compatible hardware options (including the iMac or Mac Pro) and their machine learning performance to be checked.
It would also be interesting to analyze DL-based image processing models with image outputs, as they require much larger computational resources; however, the graphical output would make it much more challenging to determine the correctness and compare the efficiency among the tested platforms.