Development of an Automated Scraper to Identify Trends in Web Development

Posted on: October 31, 2018

TL;DR

As part of my Bachelor’s Thesis, I implemented a scraper which collects information about websites’ HTML, CSS, and JavaScript. Furthermore, the Lighthouse KPIs are generated for every website. This information is then used to plot the following statistics:

  • How many websites are using Grid / Flex / Float
  • How many websites are using React / Angular / jQuery
  • How do websites score on Lighthouse on average
  • How do these numbers change over time

These statistics were calculated for the whole set of websites, as well as for each 25% quantile. Further enhancements are necessary to improve the scraping performance and the analysis accuracy.

Motivation

The frontend segment is evolving very fast. In the past five years, many now very popular frameworks and languages have been introduced, and it is hard to keep up with all of them.

Figure: Stack Overflow trends for frontend frameworks, with Angular leading the statistics

Source: https://insights.stackoverflow.com/trends?tags=jquery%2Cangular%2Creactjs%2Cbackbone.js%2Cember.js%2Cvue.js%2Caurelia

The figure above shows an example of new libraries which have gained popularity very quickly, making it difficult for developers to keep up with the state of the art. A decision about which technology to use is often hard to make due to the large number of new and possibly unfamiliar options. As a result, it is interesting to know what is currently in use. This information can serve as an additional indicator when making decisions about technologies and approaches, as well as a way to compare one’s own work to that of others.

Objective

The goal of this thesis was the development of an automated pipeline which scrapes, analyzes, and visualizes multiple websites to find out what is actually used. The focus is on the technical implementation of such a pipeline, providing a base that can be expanded to offer more analyses. Consequently, the core theme of this thesis is not to analyze as much and as detailed data as possible; instead, the goal was to develop the whole process chain from scraping to visualization, which can be enhanced afterwards.

Architecture

The main parts of the application were written in Python using the following architecture:

As can be seen, the pipeline consists of three steps. The first one scrapes the raw data and stores it in a MongoDB. The next step takes the raw information and analyzes it with regard to the defined evaluation goals. Afterwards, the results are visualized and plotted; the graphs can be accessed on a website.

Scraping

As mentioned above, the majority of the program is written in Python. Since Lighthouse is currently only available as a Node package, a separate server is necessary, which is responsible for creating the Lighthouse report. This leads to the following architecture:

The required input is a list of URLs stored in a CSV file. For the purpose of the thesis, a part of the Alexa top websites list was used. The list of URLs is processed via multiprocessing, which makes the scraping step vertically scalable. Pyppeteer, a Python port of the Puppeteer API, is the library used to scrape the data from each website. Puppeteer controls a headless Chrome and can be used, among other things, to open a webpage, get the page source, and get the HTML after it was modified by the browser and JavaScript. The next step is calling the Lighthouse server: a request containing the URL is sent, the corresponding report is created, and the result is sent back as a response to the main script. In order to get the styles and scripts which are integrated externally, the HTML is searched for links pointing to a CSS or JavaScript file. Those links are then used to download the respective resources.
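The link extraction could be sketched with Python’s built-in HTML parser. The thesis does not state which parsing approach was used, so the class and the exact attribute handling below are assumptions:

```python
from html.parser import HTMLParser

class AssetLinkExtractor(HTMLParser):
    """Collects the URLs of externally loaded CSS and JavaScript files."""

    def __init__(self):
        super().__init__()
        self.css_links = []
        self.js_links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # <link rel="stylesheet" href="..."> points to an external stylesheet
        if tag == "link" and attrs.get("rel") == "stylesheet" and "href" in attrs:
            self.css_links.append(attrs["href"])
        # <script src="..."> points to an external script
        elif tag == "script" and "src" in attrs:
            self.js_links.append(attrs["src"])

extractor = AssetLinkExtractor()
extractor.feed('<link rel="stylesheet" href="/main.css">'
               '<script src="/app.js"></script>')
```

The collected links would then be resolved against the page URL and downloaded.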

Finally, we aggregate and store the collected information, which consists, among other things, of:

  • the HTML of the website after being edited by the browser and JavaScript
  • page source of the website
  • Lighthouse KPIs
  • internally and externally loaded JavaScript
  • internally and externally loaded CSS

Each scraping iteration stores the data in a new collection with a timestamp in its name. A complete scraping run takes about five to six hours. The hardware used for this task is described at the end.
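The timestamped naming could be implemented as follows. The prefix, timestamp format, and database name are assumptions, and the storage helper requires the third-party pymongo driver plus a running MongoDB instance:

```python
from datetime import datetime, timezone

def collection_name(prefix="scrape"):
    """Builds a unique collection name so each run is stored separately."""
    ts = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    return f"{prefix}_{ts}"

def store_results(documents, mongo_uri="mongodb://localhost:27017"):
    """Writes one document per scraped website into a fresh collection.
    Needs a reachable MongoDB instance; nothing is connected at import time."""
    from pymongo import MongoClient  # third-party driver, assumed installed
    client = MongoClient(mongo_uri)
    client["scraper"][collection_name()].insert_many(documents)
```

Because every run gets its own collection, the later timeline plots can simply iterate over all collections in chronological order.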

Analysis

In the beginning, a poll was created in which the experts of inovex GmbH had the chance to suggest ideas for possible analysis goals. Three of the suggestions were chosen for further examination. Those three key results concern the following topics:

  • CSS layout
  • JavaScript technology
  • Average Lighthouse KPIs

The input for the analysis process is the output of the scraping step which consists of the raw data from the website and the KPIs from the Lighthouse report, namely accessibility, best practices, performance, progressive web app, and SEO.
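Reducing the report to those five scores could look like this, assuming the Lighthouse v3 report layout (a categories object holding one 0–1 score per category; the category IDs below come from that layout and are an assumption):

```python
# The five category scores used in the analysis.
CATEGORY_IDS = ["performance", "accessibility", "best-practices", "seo", "pwa"]

def extract_kpis(report):
    """Reduces a full Lighthouse report (thousands of JSON lines)
    to the five category scores, scaled from 0-1 to 0-100."""
    return {cid: report["categories"][cid]["score"] * 100
            for cid in CATEGORY_IDS}

# Minimal fabricated report illustrating the assumed shape.
sample = {"categories": {cid: {"score": 0.5} for cid in CATEGORY_IDS}}
kpis = extract_kpis(sample)
```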

There are three ways for a website to lay out its DOM elements: flex, float, and grid. Having scraped all the CSS of a website, the usage of those three techniques is detectable by a keyword search. Grid can only be used as a value of the display property, whereas float can only be used as a property name, and flex in both ways. By searching for those keywords, it is possible to detect the usage of each layout style.
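A minimal sketch of this keyword search might look as follows; the exact matching rules (for example, how whitespace around the colon is handled) are assumptions:

```python
def detect_layouts(css_text):
    """Flags which of the three layout techniques appear in a stylesheet."""
    # Normalize so that "display : flex" and "display:flex" match alike.
    css = css_text.lower().replace(" ", "")
    return {
        "grid": "display:grid" in css,                    # grid exists only as a display value
        "flex": "display:flex" in css or "flex:" in css,  # display value or property
        "float": "float:" in css,                         # float exists only as a property
    }

result = detect_layouts("div { display: flex; } .col { float: left; }")
```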

The same approach was used to determine the JavaScript library in use; in my thesis, I focused on React, Angular, and jQuery. As for the CSS evaluation, a keyword search is used. Websites built with React use the react keyword to access the components of the library; for Angular it is ng-, and for jQuery the jQuery keyword.
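A simplified version of this library detection, using the keyword table stated above (the counting logic itself is a sketch, not the thesis code):

```python
def detect_libraries(js_text):
    """Counts the characteristic keyword of each library in the collected
    JavaScript; a non-zero count marks the library as (potentially) used."""
    keywords = {"React": "react", "Angular": "ng-", "jQuery": "jQuery"}
    return {lib: js_text.count(kw) for lib, kw in keywords.items()}

counts = detect_libraries("jQuery('#app'); jQuery.ajax('/api');")
```

As discussed under “Better keyword detection” below, raw counts alone can misclassify; this is the starting point that the proposed thresholds refine.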

With this simple keyword search, we count the total number of websites using each layout style and the number of websites using one of the three JavaScript libraries. The Lighthouse KPIs of all websites are further processed to calculate the average values over all websites. Furthermore, those numbers are broken down for each 25% quantile of the analyzed URLs. Those steps are visualized in the following figure.
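The per-quantile averaging could be sketched like this, assuming the quantile boundaries follow the ranked URL order (e.g. Alexa position):

```python
def average_kpis(scores):
    """Averages one KPI over the whole set and over each 25% quantile.
    `scores` must be ordered by website rank."""
    n = len(scores)
    result = {"all": sum(scores) / n}
    q = n // 4
    for i in range(4):
        # The last quantile absorbs any remainder when n is not divisible by 4.
        chunk = scores[i * q:(i + 1) * q] if i < 3 else scores[3 * q:]
        result[f"q{i + 1}"] = sum(chunk) / len(chunk)
    return result

averages = average_kpis([80, 60, 40, 20])
```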

Visualization

The last step aims to plot the results of the analysis. I used the visualization framework Bokeh to create interactive diagrams, along with the possibility to publish those plots on a separate server.

The first visible element is a drop-down list to select the analysis document. By default, the latest results are used. In case of a change, the following three plots are updated automatically.

The first plot shows the total number of websites using the grid, flex, and float layout. By default, all quantiles are selected. If the user changes this, the values on the y-axis are updated to always show only the results of the selected quantiles. As indicated by this graphic, the vast majority of websites were classified as using float and/or flex for their layout. Only about 20 websites were categorized as using grid.

The next plot visualizes the results of the JavaScript library analysis. As can be seen, jQuery was detected on most websites.

This is followed by the Lighthouse results, as shown in the figure above. It visualizes the average values for each topic, either for the whole collection or for a selected quantile. It is interesting that most websites seem to care more about being ranked highly in Google search than about providing a speedy user experience, with the average performance score being below 50 out of a maximum of 100.

The next section contains the timelines. Those plots are static and not influenced by the selected analysis document.

The plot above shows the history of the total number of websites using the grid, flex, or float layout. Currently, the results are not reliable enough to make statements about trends, as the number of analyzed websites varies between runs: the total count of scraped websites differs with each process, and the second collection contains over 100 additional sites. Reasons for that can be, among others, timeouts or differently fired on-load events. This noise needs to be taken into account and can only be mitigated by analyzing a vast number of websites. As a result, the analysis process will become more reliable once a considerable number of websites has been evaluated.

Further Research and Enhancement Opportunities

Better keyword detection

The first improvement concerns the classification of the CSS and JavaScript technology. A reason for a possible misclassification is that a keyword can appear often enough without being used as a framework invocation, for example :case"missing-glyph":return!1; , which is a JavaScript snippet from a website written with React. As can be seen, the Angular keyword ng- appears inside missing-glyph. If this happens often enough, the website will be classified as an Angular website. One way to avoid this is to determine a minimum value of how often a framework’s keyword appears on websites known to be written with that specific library. By parsing the websites listed at Made with Angular, this number, as well as the maximum number of keyword appearances of a framework which is not used (as shown in the example above), can be determined. Another possibility is to parse the JavaScript in order to check whether the ng- is at the beginning of the expression.
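The proposed threshold-based filtering might look like this; the concrete threshold values are placeholders that would have to be calibrated as described above:

```python
# Placeholder thresholds; in practice they would be calibrated on websites
# known to use each framework (e.g. those listed at Made with Angular).
MIN_TRUE_POSITIVE = {"React": 20, "Angular": 20, "jQuery": 20}

def classify(keyword_counts):
    """Keeps only those libraries whose keyword count reaches the calibrated
    minimum, filtering out accidental matches such as the 'ng-' that occurs
    inside 'missing-glyph'."""
    return [lib for lib, count in keyword_counts.items()
            if count >= MIN_TRUE_POSITIVE.get(lib, 1)]

# A React site where 'ng-' appeared three times by accident.
libs = classify({"React": 150, "Angular": 3, "jQuery": 0})
```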

AWS Lambda

One way to improve the performance is AWS Lambda. Currently, the scraping is implemented via multiprocessing. Even though this is already faster than a simplistic single-core approach, it is still too slow to analyze a high number of websites, and a large set is necessary in order to get more reliable results. Thus, either the number of processors has to be increased or the application has to be migrated to AWS Lambda. Because costs grow massively with higher numbers of processors, it is advisable to migrate the application to AWS. Even though a concept concerning the feasibility of using AWS still needs to be created, considering costs, performance, and the different AWS solutions, the advantage of processing any chosen number of websites in parallel might outweigh the effort.
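Independent of the concrete AWS setup, such a fan-out would require splitting the URL list into per-invocation batches. A sketch, in which the batch size and the payload shape are pure assumptions:

```python
import json

def chunk_urls(urls, chunk_size=25):
    """Splits the URL list into batches; each batch would become the payload
    of one Lambda invocation, so all batches can be scraped in parallel."""
    return [urls[i:i + chunk_size] for i in range(0, len(urls), chunk_size)]

def build_payloads(urls, chunk_size=25):
    """Serializes each batch as a JSON event for a hypothetical
    scraper Lambda function."""
    return [json.dumps({"urls": batch})
            for batch in chunk_urls(urls, chunk_size)]

payloads = build_payloads(["https://example.com"] * 60, chunk_size=25)
```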

Refactoring

Another idea to improve the analysis is to break down the analysis results from the whole set of websites to each single one. Since a list of all processed websites is already stored by the scraping process, this is also realizable for all data that has already been collected. This idea comes along with the suggestion to refactor the visualization completely. Instead of running a Bokeh server, a Node server in combination with a charting framework like d3 is recommended. The reason is that the visualization should include more components, for example a search field to get the results for a specific website, and separate views for the timelines, the single-website results, and the overall statistics. Those requirements exceed the possibilities of Bokeh.

More Lighthouse

The next idea is to use the Lighthouse report even more. The whole document comprises over 12,000 lines of JSON, while currently only five of them are processed. Information like unminified CSS and JavaScript, the usage of responsive images, the inclusion of JavaScript libraries with known security issues, etc. is neither stored nor handled any further, even though it is generated with each scraping iteration.

More Analysis

The last idea worth discussing is the addition of more analyses suggested by inovex developers. Two suggestions stand out particularly, namely the check for the delivered JavaScript version and the inspection of the JavaScript APIs in use. The proposed implementation for both ideas is a keyword search. Since this approach is already used to determine the CSS and JavaScript statistics, the additional effort is far lower than for the other analysis ideas. To add the analysis of the delivered ECMAScript version and the used JavaScript APIs, the following steps need to be performed:

  1. Determine the respective keywords.
  2. Check the collected JavaScript for those indicators.
  3. Analyze the stored values.
  4. Visualize the analysis results.
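Steps 1 and 2 could be sketched as follows for the ECMAScript-version analysis; the indicator keywords are assumptions and only a rough heuristic, not a parser:

```python
def detect_es6(js_text):
    """Heuristically flags ES2015+ source by searching for syntax
    that does not exist in ES5."""
    indicators = ["=>", "const ", "let ", "class ", "`"]
    return any(ind in js_text for ind in indicators)

modern = detect_es6("const add = (a, b) => a + b;")
legacy = detect_es6("var add = function(a, b) { return a + b; };")
```

Step 3 and 4 would then reuse the existing aggregation and Bokeh plotting.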

Conclusion

Using a headless Chrome, a list of websites is scraped. For each website, the HTML (after being modified by the browser and JavaScript), the internally and externally loaded JavaScript, the page source, the Lighthouse KPIs, and the internally and externally loaded CSS are stored. Afterwards, this information is used during the analysis step to create statistics about how many websites use grid, flex, or float, how many use jQuery, Angular, or React, and how they perform in Lighthouse on average. Finally, the calculated numbers are plotted.

Considerable enhancements concern the scraping performance and the accuracy of the JavaScript/CSS classification. An approach to solve the former is AWS Lambda. For the latter, calibrating the minimum number of true-positive and the maximum number of false-positive keyword occurrences provides a way to enhance the precision.

To sum up, this thesis provides a base which can be further enhanced in order to gather data about the usage of technologies and various KPIs. By collecting the information periodically, trends can be identified and used as part of the decision process about which technology to use in your own project.

Laptop specs:

Device: MacBook Pro (Retina, 13-inch, Early 2015)
Processor: 2.9 GHz Intel Core i5
Memory: 16 GB 1867 MHz DDR3
Download speed: 200 Mb/s
