- How many websites are using Grid / Flex / Float
- How many websites are using React / Angular / jQuery
- How do websites score on Lighthouse on average
- How do these numbers change over time
These statistics were calculated for the whole set of websites as well as for each 25% quantile. Further enhancements to the scraping performance and the analysis accuracy are still necessary.
The frontend ecosystem is evolving very fast. Many of today's most popular frameworks and languages were introduced within the past five years, and it is hard to keep up with all of them.
The figure above shows examples of new libraries that have gained popularity very quickly, making it difficult for developers to keep up with the state of the art. Deciding which technology to use is often hard due to the large number of new and possibly unfamiliar options. As a result, it is interesting to know what is currently in use. This information can serve as an additional indicator when making decisions about technologies and approaches, as well as a way to compare one's own work to that of others.
The goal of this thesis was the development of an automated pipeline that scrapes, analyzes, and visualizes multiple websites to find out what is actually used. The focus is on the technical implementation of such a pipeline, providing a base that can be extended with further analyses. Consequently, the core theme of this thesis is not to analyze as much and in as much detail as possible; instead, the goal was to develop the whole process chain, from scraping to visualization, which can then be enhanced.
The main parts of the application were written in Python using the following architecture:
As can be seen, the pipeline consists of three steps. The first scrapes the raw data and stores it in a MongoDB database. The second takes the raw information and analyzes it with regard to the defined evaluation goals. Finally, the results are visualized and plotted; the graphs can be accessed on a website.
As mentioned above, the majority of the program is written in Python. Since Lighthouse is currently only available as a Node package, a separate server is necessary that is responsible for creating the Lighthouse report. This leads to the following architecture:
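The interaction between the Python pipeline and the separate Lighthouse server can be sketched as follows. This is a minimal illustration, not the thesis implementation: the server address, endpoint, and query format are assumptions, while the shape of the parsed report follows the Lighthouse JSON format (category scores between 0 and 1).

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical address of the separate Node server that runs Lighthouse;
# the /report endpoint and ?url=... query format are assumed for illustration.
LIGHTHOUSE_SERVER = "http://localhost:3000/report"

def build_report_url(target_url: str) -> str:
    """Build the request URL for the assumed Lighthouse server API."""
    return LIGHTHOUSE_SERVER + "?" + urlencode({"url": target_url})

def fetch_report(target_url: str) -> dict:
    """Request a Lighthouse JSON report for one website from the Node server."""
    with urlopen(build_report_url(target_url)) as response:
        return json.load(response)

def extract_kpis(report: dict) -> dict:
    """Pull the category scores (0-1) out of a Lighthouse JSON report:
    performance, accessibility, best practices, SEO, and PWA."""
    return {name: category["score"] for name, category in report["categories"].items()}
```

The Python side only needs to build the request, fetch the JSON, and keep the five category scores; everything else in the report can be discarded before storage.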
Finally, we aggregate and store the collected information, which includes, among other things:
- page source of the website
- Lighthouse KPIs
- internally and externally loaded CSS
Each scraping iteration stores the data in a new collection with a timestamp in its name. A complete run of the scraping step takes about 5-6 hours. The hardware used for this task is described at the end.
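The per-run storage scheme can be sketched as follows. The collection naming pattern and the database name are assumptions for illustration; `client` would be a `pymongo.MongoClient` in the real pipeline.

```python
from datetime import datetime

def collection_name(prefix: str = "scrape") -> str:
    """Name a new collection for this scraping iteration, embedding the
    run's timestamp (the exact naming scheme is an assumption)."""
    return f"{prefix}_{datetime.now():%Y%m%d_%H%M%S}"

def store_run(client, documents, prefix: str = "scrape") -> str:
    """Store one scraping run in its own timestamped MongoDB collection.

    `client` is expected to be a pymongo.MongoClient; the database name
    'websites' is hypothetical. Returns the collection name so later
    steps can find this run's data.
    """
    collection = client["websites"][collection_name(prefix)]
    if documents:
        collection.insert_many(documents)
    return collection.name
```

Keeping one collection per run is what makes the timelines in the visualization step possible: every historical run remains addressable by its timestamp.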
At the beginning, a poll was created in which the experts at inovex GmbH had the chance to suggest ideas for possible analysis goals. Three of the suggestions were chosen for further examination. Those three key results concern the following topics:
- CSS layout
- Average Lighthouse KPIs
- JavaScript framework usage (React, Angular, jQuery)
The input for the analysis process is the output of the scraping step which consists of the raw data from the website and the KPIs from the Lighthouse report, namely accessibility, best practices, performance, progressive web app, and SEO.
There are three common ways for a website to lay out its DOM elements: flex, float, and grid. Having scraped all of a website's CSS, the usage of these three approaches is detectable by a keyword search: grid can only appear as a value of the display property, float can only appear as a property, and flex can appear as both. By searching for these keywords, it is possible to detect the usage of each layout style.
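The keyword search described above can be sketched roughly like this. The regular expressions are an illustrative simplification, not the thesis implementation; they encode the rule that grid and flex are matched as display values, while float and flex are also matched as properties.

```python
import re

def detect_layout_styles(css: str) -> dict:
    """Detect grid/flex/float usage in raw CSS via keyword search.

    - grid  is searched only as a value of the display property
    - float is searched only as a property
    - flex  is searched both as a display value and as a property (flex: ...)
    """
    return {
        "grid": bool(re.search(r"display\s*:\s*(?:inline-)?grid\b", css)),
        "flex": bool(re.search(r"display\s*:\s*(?:inline-)?flex\b|(?<![-\w])flex\s*:", css)),
        "float": bool(re.search(r"(?<![-\w])float\s*:", css)),
    }
```

Running this over the internally and externally loaded CSS of each scraped website yields the per-site layout classification that the later plots aggregate.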
The last step aims to plot the results of the analysis. I used the visualization framework Bokeh to create interactive diagrams, along with the possibility to publish those plots on a separate server.
The first visible element is the drop-down list for selecting the analysis document. By default, the latest results are used. When the selection changes, the following three plots are updated automatically.
The first plot shows the total number of websites using the grid, flex, and float layouts. By default, all quantiles are selected. If the user changes this, the y-axis values are updated to show only the results of the selected quantiles. As the graphic indicates, the vast majority of websites were classified as using float and/or flex as their layout. Only about 20 websites were categorized as using grid.
This is followed by the Lighthouse results, as shown in the figure above. It visualizes the average values for each topic, either for the whole collection or for a selected quantile. Interestingly, most websites seem to care more about ranking highly in Google search than about providing a speedy user experience, with the average performance score below 50 out of a maximum of 100.
The next section contains the timelines. These plots are static and not influenced by the selected analysis document.
The plot above shows the history of the total number of websites using the grid, flex, or float layout. Currently, the results are not reliable enough to make statements about trends, as the number of analyzed websites varies between runs; the second collection, for example, contains over 100 additional sites. Reasons for this include timeouts and differently fired on-load events. This noise has to be taken into account and can only be smoothed out by a large number of analyzed websites. As a result, the analysis will become more reliable once a considerable number has been evaluated.
Further Research and Enhancement Opportunities
Better keyword detection
One way to improve performance is AWS Lambda. Currently, the scraping is implemented via multiprocessing. Even though this is already faster than a simplistic single-core approach, it is still too slow to analyze a large number of websites, and a large set is necessary to get reliable results. Thus, either the number of processors has to be increased or the application has to be migrated to AWS Lambda. Because costs grow massively with higher processor counts, migrating the application to AWS is the recommended option. A feasibility concept still needs to be created, covering costs, performance, and the different AWS services, but the advantage of processing any chosen number of websites in parallel might outweigh this effort.
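The current single-machine multiprocessing approach can be sketched as follows. The function names are illustrative; the real `scrape` function would fetch the page source, the CSS, and the Lighthouse report, and throughput is capped by the number of local cores, which is what motivates the move to AWS Lambda.

```python
from multiprocessing import Pool

def scrape(url: str) -> dict:
    """Placeholder for the real scraping logic (page source, CSS,
    Lighthouse report); here it only returns a stub document."""
    return {"url": url, "status": "scraped"}

def scrape_all(urls, workers: int = 8):
    """Scrape URLs in parallel on one machine using a process pool.
    Pool.map preserves the input order of the URLs in its results."""
    with Pool(processes=workers) as pool:
        return pool.map(scrape, urls)
```

With Lambda, the per-URL `scrape` function would instead be invoked as an independent cloud function, removing the local core limit at the price of per-invocation costs.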
Another idea for improving the analysis is to break down the results from the whole set of websites to each individual one. Since the scraping process already stores a list of all processed websites, this is also realizable for all data that has already been collected. This idea comes along with the suggestion to completely refactor the visualization: instead of running a Bokeh server, a Node server combined with a charting framework such as D3 is recommended. The reason is that the visualization should gain more components, for example a search field to retrieve the results of a specific website, and separate views for the timelines, the single-website results, and the overall statistics. These requirements exceed the possibilities of Bokeh.
- Determine the respective keywords.
- Analyze the stored values.
- Visualize the analysis results.
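The three steps above can be sketched for a hypothetical new analysis goal. Detecting jQuery script includes is used here purely as an example; the keyword pattern and function names are illustrative, and the plain-text output stands in for a real Bokeh plot.

```python
import re
from collections import Counter

# Step 1: determine the respective keywords (hypothetical example: jQuery
# script filenames such as jquery.js or jquery-3.6.0.min.js).
KEYWORDS = {"jquery": re.compile(r"jquery(?:[.-][\w.]+)?\.js", re.IGNORECASE)}

# Step 2: analyze the stored values (here: the raw page sources).
def analyze(pages) -> Counter:
    """Count how many of the scraped pages match each keyword pattern."""
    counts = Counter()
    for source in pages:
        for name, pattern in KEYWORDS.items():
            if pattern.search(source):
                counts[name] += 1
    return counts

# Step 3: visualize the analysis results (text stand-in for a plot).
def visualize(counts: Counter, total: int) -> str:
    """Render the counts as simple 'name: n/total sites' lines."""
    return "\n".join(f"{name}: {n}/{total} sites" for name, n in counts.items())
```

Because every scraping run is already stored as raw data, a new analysis like this can also be applied retroactively to all previously collected runs.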
To sum up, this thesis provides a base that can be further enhanced in order to obtain data about the usage of technologies and various KPIs. By collecting the information periodically, trends can be identified and used as part of the decision process about which technology to use in your own project.
Device: MacBook Pro (Retina, 13-inch, Early 2015)
Processor: 2.9 GHz Intel Core i5
Memory: 16 GB 1867 MHz DDR3
Download speed: 200 Mb/s