Developing a Streaming-based Architecture for Demand Prediction of Taxi Trips in the Presence of Concept Drift

Master's thesis by Marcel Hofmann, 20 May 2019

Machine learning models are widely used for predictions on data streams. One challenge for deployed models is that data changes over time, a phenomenon called concept drift. If it is not considered throughout the entire process, from model design to deployment, concept drift can lead to significant mispredictions. This thesis explores the effects of concept drift in regression tasks. It introduces a novel approach to concept drift handling: a strategy for switching between simple and complex machine learning models. The approach leverages the individual strengths of each model, switching to the simpler model when a sudden drift occurs and switching back to the complex model in typical situations. To evaluate the approach, it is instantiated on a real-world dataset of taxi demand in New York City. This dataset is prone to multiple drifts, e.g. blizzards, which cause a sudden decrease in taxi demand, or festivals, which result in unusually high demand. An analysis over a time span of six years shows that the suggested approach significantly outperforms all baselines considered.

1. Introduction

In the last decade the amount of data available to businesses has continued to grow, and reaping the benefits of this resource has become a necessity in all industries. Machine Learning (ML) plays an important role in this context, helping to transform and (semi-)automate established business processes, spanning from marketing to operations (Chen et al., 2012). Typical applications of ML range from computer vision and speech recognition to natural language processing and the control of manufacturing robots. These techniques especially influence data-intensive tasks such as consumer services or the analysis and handling of faults in complex production systems (Jordan and Mitchell, 2015).

ML can create ongoing value when the resulting models are deployed in the information systems of the respective company and deliver recommendations and optimized decisions on continuous data streams (Dunning and Friedman, 2017). However, data streams usually change over time, and with them their underlying probability distribution or their data structure (Wang and Abraham, 2015). For example, many data streams describing human behaviour are bound to shift over time, as influencing factors like trends and preferences change (Zliobaite et al., 2016). The challenge of changing data streams for supervised ML tasks has been described with the term concept drift (Widmer and Kubat, 1996). Most research on concept drift has focused on classification tasks, and only a minority of published papers consider regression tasks (Cavalcante et al., 2016; Baier et al., 2019). Additionally, most of the work on concept drift uses artificial datasets to evaluate new approaches or focuses only on a small subset of publicly available datasets, like the Australian Electricity Market or the Airplane dataset (Gama et al., 2014). While this allows for objective comparisons, the emphasis on simulated problems means that many real-world challenges are ignored. "Real" data is noisy, contains errors, and it is often unclear what to look for (L’Heureux et al., 2017). Therefore, this work focuses on the application of concept drift strategies for regression tasks on an unexplored real-world dataset, which leads to the first research question of this work:

RQ1: How can we address concept drift in regression problems in a real-world context?

To answer this question, the application domain of demand forecasting is considered, using the publicly available dataset of every taxi trip performed in New York City since 2009 (TLC, 2019). By predicting the number of taxis needed in certain regions of the city at a specific time, waiting times can be reduced, increasing efficiency and customer satisfaction. With increasing competition in the mobility market (e.g. Uber, Lyft, Lime), being able to anticipate demand may be a decisive factor (Zhao et al., 2016). Given the long time span covered by the dataset and the many hidden variables that can influence taxi demand, different types of drift can be observed, which makes the dataset suitable for the task at hand. The first part of this research focuses on detecting and describing drifting concepts and on evaluating existing approaches on this regression task.
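To make the prediction target concrete, the following minimal sketch shows one way raw trip records could be turned into a demand signal, namely the number of pickups per region and hour. It is an illustration under assumptions, not the preprocessing used in the thesis: the record layout (pickup timestamp string, zone label), the zone names, and the function name `hourly_demand` are all hypothetical.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (pickup timestamp, zone label).
Trip = Tuple[str, str]


def hourly_demand(trips: Iterable[Trip]) -> Dict[Tuple[str, datetime], int]:
    """Count pickups per (zone, hour) bucket."""
    counts: Counter = Counter()
    for pickup_time, zone in trips:
        ts = datetime.strptime(pickup_time, "%Y-%m-%d %H:%M:%S")
        # Truncate the timestamp to the start of its hour.
        hour = ts.replace(minute=0, second=0, microsecond=0)
        counts[(zone, hour)] += 1
    return dict(counts)


# Made-up sample records, for illustration only.
sample_trips = [
    ("2015-01-26 08:05:00", "zone_a"),
    ("2015-01-26 08:40:12", "zone_a"),
    ("2015-01-26 09:01:30", "zone_a"),
    ("2015-01-26 08:15:00", "zone_b"),
]
demand = hourly_demand(sample_trips)
```

Each resulting count is one observation of the regression target; a demand predictor would then estimate the count for a future (zone, hour) bucket.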

An initial analysis of several predictive models shows that during certain short periods a simple baseline model was significantly more accurate than more sophisticated approaches. In their proposal for a machine learning model management architecture, called "Rendezvous-Architecture", Dunning and Friedman (2017) include a so-called "Canary Model". Named after the canary bird used by miners to detect harmful gases, this model is used to "detect shifts in the input data and as a comparison benchmark for other models" (Dunning and Friedman, 2017, p.38). Inspired by this approach of leveraging an older or simpler model, a second research question emerged:

RQ2: How can we leverage models with different degrees of complexity to make the prediction more robust?

To answer this question, a new approach is proposed, named the error intersection approach (EIA). It uses static prediction models and alternates between them based on the development of the error curve. Static models have the advantage that they need to be implemented only once and can be verified and tested extensively before they are deployed in production for ongoing predictions. The approach is evaluated against detectors for sudden drift, both on its overall accuracy and on its behaviour during specific time frames.

The thesis is organized as follows: Chapter 2 analyzes the current state of relevant research in the areas of concept drift, time series prediction and traffic demand prediction. Chapter 3 describes the design and implementation of the predictors and drift detectors that are used to answer the research questions. Chapter 4 evaluates the accuracy of these predictors and detectors over a time span of six years. Finally, Chapter 5 draws a conclusion on the implications of this work and outlines areas of future research.
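The idea of alternating between static models based on their error curves can be illustrated with a small sketch. This excerpt does not give the exact switching rule of the EIA, so the rule below is an assumption: use the simple model whenever its mean absolute error over a recent window is lower than that of the complex model, and the complex model otherwise. The names `naive_last_value`, `make_seasonal_profile` and `switch_by_recent_error` are hypothetical, the fixed seasonal profile is only a stand-in for a more sophisticated static model, and the series is synthetic.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence, Tuple

Predictor = Callable[[Sequence[float], int], float]


def naive_last_value(history: Sequence[float], t: int) -> float:
    """Simple static model: repeat the most recent observation."""
    return history[-1]


def make_seasonal_profile(profile: Sequence[float]) -> Predictor:
    """Stand-in for a complex static model: a fixed seasonal profile."""
    def predict(history: Sequence[float], t: int) -> float:
        return profile[t % len(profile)]
    return predict


def switch_by_recent_error(
    series: Sequence[float],
    simple: Predictor,
    complex_: Predictor,
    window: int = 3,
) -> Tuple[List[float], List[str]]:
    """Predict each step with the model whose recent error is lower.

    Assumed rule (not taken from the thesis): compare the mean absolute
    error of both models over the last `window` steps; ties and the
    initial steps fall back to the complex model.
    """
    simple_err: Deque[float] = deque(maxlen=window)
    complex_err: Deque[float] = deque(maxlen=window)
    predictions: List[float] = []
    chosen: List[str] = []
    for t in range(1, len(series)):
        history = series[:t]
        p_simple = simple(history, t)
        p_complex = complex_(history, t)
        use_simple = bool(simple_err) and (
            sum(simple_err) / len(simple_err)
            < sum(complex_err) / len(complex_err)
        )
        predictions.append(p_simple if use_simple else p_complex)
        chosen.append("simple" if use_simple else "complex")
        # Both error curves are updated after the true value arrives.
        simple_err.append(abs(series[t] - p_simple))
        complex_err.append(abs(series[t] - p_complex))
    return predictions, chosen


# Synthetic demand: regular pattern, a sudden drop, then recovery.
profile = [10.0, 20.0, 30.0, 20.0]
series = profile * 2 + [2.0] * 8 + profile * 2
predictions, chosen = switch_by_recent_error(
    series, naive_last_value, make_seasonal_profile(profile)
)
```

On this toy series the selector starts with the complex model, moves to the simple model a few steps into the sudden drop once the error curves cross, and returns to the complex model after the regular pattern resumes. The window length controls how quickly the switch happens and is a free parameter of this sketch.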

[…]
