For my bachelor’s thesis, I evaluated the applicability of artificial intelligence within a mobile web application. To this end, I implemented a prototype for classifying handwritten digits on a mobile touch-screen device using React and several machine learning libraries. This prototype was then used to measure training and prediction runtimes on different mobile devices.

Motivation

Mobile devices, especially smartphones, are becoming more and more important as digitalization progresses. At the same time, artificial intelligence is developing into an increasingly important branch of computer science and is now on everyone’s lips. The topic “AI in the mobile web” arose from the idea of linking these two areas, i.e. using artificial intelligence in the context of a mobile web application. Before doing so, however, the applicability and usability of such an approach must be examined. The thesis therefore focused on implementing a specific prototype, running a performance analysis, and evaluating the results to decide whether it is reasonable to use or train a model in a web application on a mobile device.

Prototype

As mentioned in the introduction, a prototype had to be implemented for the evaluation. It was built with Facebook’s open-source JavaScript library React and three JavaScript machine learning libraries for comparison: TensorFlow.js, Brain.js, and Keras.js. The prototype itself was designed as a simple single-page application for mobile web browsers, in which each of the three libraries was given its own testing ground for trying out classification and/or training. Since TensorFlow.js is the only one of the three libraries that supports both training models and classifying with pre-trained models, only this library and its results are discussed below.

The TensorFlow.js part of the prototype uses two different types of feed-forward neural networks so that the runtimes they need for classification and training can be compared. The first type is a Fully Connected Network (a multilayer perceptron, MLP), in which every neuron of a layer is connected to every neuron of the preceding layer. The second type is a Convolutional Neural Network (CNN). Both networks receive a 28×28-pixel grayscale image from the MNIST dataset as input. For the Fully Connected Network, the input image is flattened into a one-dimensional array of length 784 for the input layer. This is followed by one hidden layer with 32 neurons and an output layer of ten neurons, one for each of the ten digit classes. The Convolutional Neural Network consists of one Convolutional Layer and three Fully Connected Layers. The Convolutional Layer, which extracts interesting features, is followed by a Max Pooling Layer that downsamples the resulting feature maps.
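To make the sizes of the Fully Connected Network concrete, the following minimal sketch flattens a 28×28 image and counts the trainable parameters of the two dense layers. It is plain JavaScript rather than TensorFlow.js code, the helper names `flattenImage` and `denseParamCount` are hypothetical, and it assumes that every dense layer has one bias per neuron, which the thesis summary does not state.

```javascript
// Flatten a 2-D image (array of rows) into a 1-D array, row by row.
function flattenImage(image) {
  return image.reduce((flat, row) => flat.concat(row), []);
}

// Trainable parameters of a dense layer: one weight per input/output
// pair plus one bias per output neuron (assumption: biases are used).
function denseParamCount(inputs, outputs) {
  return inputs * outputs + outputs;
}

// A blank 28x28 image standing in for an MNIST sample.
const image = Array.from({ length: 28 }, () => new Array(28).fill(0));
const inputVector = flattenImage(image);

const hiddenParams = denseParamCount(inputVector.length, 32); // 784 -> 32
const outputParams = denseParamCount(32, 10);                 // 32 -> 10
const totalParams = hiddenParams + outputParams;

console.log(inputVector.length); // 784
console.log(totalParams);        // 25450
```

Under these assumptions the network has only about 25,000 parameters, which helps explain why it trains so much faster than the CNN.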

To evaluate the impact of the model architecture, the web application lets the user choose between the Fully Connected Network (MLP) and the CNN. Once training has completed, the application displays the total time needed for training as well as the achieved accuracy and loss. The self-trained model can then be used to classify a handwritten digit to check that everything worked as intended.
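One simple way to obtain such a total runtime is to take a timestamp before and after an asynchronous task. The sketch below is not the prototype's code: `measureRuntime` and `fakeTraining` are hypothetical names, and the timer-based stand-in only imitates a training call that returns a promise.

```javascript
// Run an asynchronous task and report how long it took in milliseconds.
async function measureRuntime(task) {
  const start = Date.now();
  const result = await task();
  return { result, elapsedMs: Date.now() - start };
}

// Hypothetical stand-in for a training run: resolves after a short delay.
function fakeTraining() {
  return new Promise((resolve) => setTimeout(() => resolve("done"), 50));
}

measureRuntime(fakeTraining).then(({ result, elapsedMs }) => {
  console.log(`Training ${result} after ${elapsedMs} ms`);
});
```

In a real application the stand-in would be replaced by the library's own training call, and the measured value would be shown in the user interface instead of being logged.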

Results

The prototype was tested on a variety of devices (both smartphones and tablets) and mobile web browsers. Since the runtimes were identical across all browsers, only the device hardware needs to be considered.

Using a pre-trained model for classification only requires a single forward pass to get a prediction: the input data is fed into the network, processed step by step by each hidden layer and its activation function, and the output layer returns the prediction for the sample. Forward propagation is far less computationally expensive than the backpropagation needed for training a neural network. As a result, the measured classification runtime was only about one-tenth of a second for both model architectures.
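The forward pass described above can be illustrated with a few lines of plain JavaScript. This is a toy sketch, not the prototype's TensorFlow.js code: the network has only two inputs, two hidden neurons, and three classes, the weights are hand-picked, and the choice of ReLU and softmax as activation functions is an assumption, since the activations used in the thesis are not named here.

```javascript
// Dense layer: output[j] = activation(sum_i input[i] * weights[i][j] + biases[j]).
function dense(input, weights, biases, activation) {
  const preActivations = biases.map((bias, j) =>
    input.reduce((sum, x, i) => sum + x * weights[i][j], bias)
  );
  return activation(preActivations);
}

// ReLU, applied element-wise.
const relu = (values) => values.map((v) => Math.max(0, v));

// Softmax turns raw scores into probabilities that sum to 1.
function softmax(values) {
  const max = Math.max(...values); // subtract the max for numerical stability
  const exps = values.map((v) => Math.exp(v - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

// Forward pass: feed the input through each layer in order.
function predict(input, layers) {
  return layers.reduce(
    (activations, layer) =>
      dense(activations, layer.weights, layer.biases, layer.activation),
    input
  );
}

// Index of the largest probability, i.e. the predicted class.
const argmax = (values) => values.indexOf(Math.max(...values));

// Toy network with hand-picked weights (illustrative values only).
const toyLayers = [
  { weights: [[1, 0], [0, 1]], biases: [0, 0], activation: relu },
  { weights: [[2, 0, 0], [0, 2, 0]], biases: [0, 0, 1], activation: softmax },
];

const probabilities = predict([0.2, 0.9], toyLayers);
console.log(argmax(probabilities)); // 1
```

Each layer costs only one pass over its weights, and no gradients have to be computed or stored, which is why prediction is so much cheaper than training.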

Training a neural network is much more computationally expensive than making a prediction, since the performance-intensive backpropagation is needed in addition to the forward pass. Much higher runtimes are therefore to be expected. Both models were trained with a total of 2,000 images, a batch size of 32, and 10 epochs. While training the MLP took only around five seconds on the most recent device (iPhone 8), training the CNN took about four minutes. Older devices were slower still: the oldest device (Sony Xperia Z2) needed 20 seconds to train the MLP and nearly 19 minutes to train the CNN. From a user perspective, such long runtimes are unacceptable.
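A short back-of-the-envelope calculation puts these numbers into perspective. The sketch below derives the number of weight updates from the training settings and compares the approximate runtimes quoted above. It assumes that all 2,000 images are used for training and that the last, smaller batch of each epoch is also processed. The runtimes are rounded values ("around", "about", "nearly"), so the derived figures are rough estimates only.

```javascript
// Gradient update steps performed during training, assuming the last
// (smaller) batch of each epoch is also used.
function trainingSteps(numSamples, batchSize, epochs) {
  const batchesPerEpoch = Math.ceil(numSamples / batchSize);
  return { batchesPerEpoch, totalSteps: batchesPerEpoch * epochs };
}

const { batchesPerEpoch, totalSteps } = trainingSteps(2000, 32, 10);
console.log(batchesPerEpoch); // 63
console.log(totalSteps);      // 630

// Approximate runtimes from the text, in seconds.
const runtimes = {
  "iPhone 8": { mlp: 5, cnn: 4 * 60 },
  "Sony Xperia Z2": { mlp: 20, cnn: 19 * 60 },
};

for (const [device, { mlp, cnn }] of Object.entries(runtimes)) {
  const slowdown = cnn / mlp;                  // how much slower the CNN trains
  const msPerStep = (cnn * 1000) / totalSteps; // average time per CNN update
  console.log(
    `${device}: CNN about ${slowdown.toFixed(0)}x slower than MLP, ` +
      `about ${msPerStep.toFixed(0)} ms per CNN update step`
  );
}
```

On both devices the CNN takes roughly fifty times as long to train as the MLP, so the gap comes from the architecture rather than from one particularly slow device.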

Conclusion

As the results show, artificial intelligence is currently only of limited use in the context of a mobile web application. For more complex network architectures, the runtime required for training very quickly exceeds an acceptable value, even with a small amount of training data. However, smaller models that produce consistently correct results with only a small amount of training data can already be used today for simple use cases. Furthermore, pre-trained models that have been trained with the well-known Python library TensorFlow can be downloaded and used in the web application without any problems. The computationally expensive training process can thus be moved out of the web application, and a prediction with the downloaded network takes only about a tenth of a second.