Abstract
Dynamic O-D flow estimation is the basis of metro network operation, such as transit resource allocation, emergency coordination, strategy formulation in urban rail system. It aims to estimate the destination distribution of current inflow of each origin station. However, it is a challenging task due to its limitation of available data and multiple affecting factors. In this paper, we propose a practical method to estimate dynamic OD passenger flows based on long-term AFC data and weather data. We first extract the travel patterns of each individual passenger based on AFC data. Then the passengers of current inflows based on these patterns are classified into fixed passengers and stochastic passengers by judging whether the destination can be inferred. Finally, we design a K Nearest Neighbors (KNN) and Gaussian Process Regression (GPR) combined hybrid approach to dynamically predict stochastic passengers’ destination distribution based on the observation that the distribution has obvious periodicity and randomicity. We validate our method based on extensive experiments, using AFC data and weather data in Shenzhen, China over two years. The evaluation results show that our approach with 85% accuracy surpasses the results of baseline methods and the estimation precision reaches 85%.
You have full access to this open access chapter, Download conference paper PDF
Similar content being viewed by others
Keywords
1 Introduction
Dynamic OD flows reflect the traffic demand in a metro system. They describe the number of passengers of all O-D (origin station - destination station) pairs of the metro system, and they are the basis of traffic control and management, such as capacity allocation, emergency coordination. OD traffic flows have time delay problem, which refers that current inflow of a station will leave the system in the future time from other stations. In this paper, we aim to predict the destination distribution of current inflow at each station.
Many works have been done to predict individual user’s destination based on spatial sequence patterns, which mine the sequential frequent patterns from the user’s historical trajectories [1,2,3,4]. Such patterns can reflect co- occurrence of locations. These are the places that the user always visit after visiting somewhere. For example,
Morzy et al. [1] used an Apriori-based algorithm to generate association rules from the moving object database and used it for user’s location prediction. Morzy et al. [1] used a modified version of PrefixSpan algorithm to discover frequent movements of users [2]. Zheng et al. [4] used a HITS-based inference model to mine users’ interesting location and detected users’ travel sequence to make locations prediction. Overall, the destination prediction of the existing approaches relies on transition probabilities between different locations based on historical trajectories using various Markov chain-based models. It is hard to directly apply these methods to our problem as follow: One of the key prerequisites is that there should be enough historical data of individuals to learn the transition probabilities. However, in our context, there are two third passengers only have limited historical data, which makes it hard to directly apply these methods mainly designed for individuals instead of group.
Recently, thanks to the upgrades of urban infrastructures, we can obtain long-term data collected by AFC system, where all passengers need to tap their smart cards when they enter or leave the metro system, which brings us opportunity to understand passenger OD flows from the raw data. By now, AFC data has been used to inform transit agency from planning, to operations, to measuring and monitoring performance [5,6,7,8], and understanding user spatiotemporal travel patterns [9,10,11,12]. Different from them, in this paper, we focus on a basic problem, dynamic OD flow estimation.
In this paper, we propose a novel method of estimating the dynamic OD flow in metro systems based on long-term AFC data and weather data. Our intuition is that the passengers in metro system can be categorized into two groups, fixed passengers and stochastic passengers. The fixed passengers are spatiotemporal regular, such as commuters, students. Their travel behaviors are relatively stable. They often travel from one fixed station to another during relatively fixed time period. Their destination can be uniquely inferred based on their travel patterns mined from their own long-term smart card records. While stochastic passengers are spatial temporal irregular, based on the observations (detailed in following section) that their destination distribution has obvious regularity (e.g. 24-h periodicity, and difference between weekday and weekend) and randomicity (affected by some factors, e.g., weather, past traffic flows), we propose a KNN and GPR hybrid model to estimate their destination distribution. The major contribution of this paper is threefold:
-
1.
We propose a novel method to estimate the dynamic OD flows in a metro system. The method first uses a kernel density-based clustering algorithm to extract each passengers’ travel pattern from long-term AFC data. Then we classify all passengers of current inflow into two groups, fixed passengers and stochastic passengers according to whether their destination can be uniquely inferred based on their travel patterns.
-
2.
For efficiently and robustly predicting the destination distributions of stochastic passengers, we design a dynamic KNN and GPR combined approach with two types of data, the AFC data and weather data, based on the insight that their destination distribution has obvious periodicity and randomicity.
-
3.
We evaluate our method using real-world data of smart card data and weather data over two years in Shenzhen, China. The experimental results show the effectiveness of our method.
The rest of this paper is organized as follows. Section 2 presents the overview of our problem and solution. Section 3 details our solution. The empirical evaluation of the method on real-world data is discussed in Sect. 4, and we conclude the study in Sect. 5.
2 Overview
2.1 Notations and Assumptions
We first present several useful definitions:
Definition 1
(Metro System): A metro system is composed of multiple lines \( L\, = \,\left\{ {l_{1} ,\,l_{2} ,\, \ldots ,\,l_{\left| L \right|} } \right\} \) and multiple stations \( S\, = \,\left\{ {s_{1} ,\,s_{2} ,\, \ldots ,\,s_{\left| S \right|} } \right\} \), where \( \left| L \right| \) and \( \left| S \right| \) is the number of metro lines and metro stations. A metro line \( l_{i} \) passes by some ordered stations.
Definition 2
(Time slots): We divide a day into multiple periods \( T\, = \,\left\{ {T_{1} ,\,T_{2} ,\, \ldots ,\,T_{\left| T \right|} } \right\} \) by a fixed interval \( \tau \). The \( k \) period contains a time range of \( \left\{ {\left( {k - 1} \right)\tau ,k\tau } \right\} \).
Definition 3
(Trip): A trip of a passenger, denoted as \( Tr \), is related to four attributes \( s_{o} ,s_{d} ,t_{o} ,t_{d} \), representing tap-in station, tap-out station, tap-in time and tap-out time. A trip is obtained by joining the tap-in and tap-out records collected by AFC system through matching card ID and time.
Definition 4
(Inflow): We use \( O_{i,k} \) to represent the passengers flows entering metro system from station \( s_{i} \) during time period \( T_{k} \).
Definition 5
(OD flow): The passengers who enter a metro network from station \( s_{i} \) during time interval \( T_{k} \) and destine to station \( s_{j} \), denoted as \( OD_{i,k}^{j} \).
2.2 Problem Definition
In this paper we use two types of data: smart card transaction data and weather data.
Smart card data: Metro passengers need to tap their smart card when they enter and exit metro system. Each record includes card ID, stations, transaction time and transaction type (tap-in and tap-out).
Weather data: We use weather data collected by district-level weather stations. Each record consists of time of day, rainfall, temperature, humidity, wind speed and wind direction.
Dynamic OD flow estimation: Given real-time smart card data, weather data, and current time slot \( T_{c} \), this paper aims to estimate current OD flow matrix \( M_{\left| S \right| \times \left| S \right|} \), where the item of \( m_{ij} \) is the ODflow \( OD_{i,c}^{j} \), representing the number of the passengers of \( O_{i,c} \) who will destine to station \( s_{j} \).
2.3 Data-Driven Insight
By now, there has been a lot of researches focusing on understanding the travel patterns of individual users in urban cities [12,13,14]. We conclude from these studies that the trips of passengers can be classified into two types, fixed trips and stochastic trips. The fixed trips of a user refer that the user often travels from one fixed place to another fixed place at fixed time interval of a day. Thus, passengers can be regard as fixed and we can infer the destination if given the origin station. Aside from fixed trips, we find that the destination distribution of stochastic trips has some regularity and randomicity as follows.
First, historical data has high reference for real-time OD flow estimation. We take 30 min as granularity and calculate the correlation coefficient of OD flows between weekdays, between weekends. The result is shown in Fig. 1. We found that there are strong correlations since the average correlation coefficient are greater than 0.5. The finding indicates that historical data has high reference for real-time dynamic OD flow estimation. In addition, we calculate the similarities of the destination distributions by cosine distance, between rainy day and sunny day, between rainy days, and between sunny days at the same time slot of day. They are 0.51, 0.601, 0.634 respectively. It means that the more similar the weather conditions between two days are the more similar the destination distributions are.
Second, the stochastic passengers’ destination distribution also has randomicity by these external factors (e.g. weather, real-time traffic flows), which is dynamic and we can only give a reliable prediction by using the most updated model.
2.4 Analysis Framework
Figure 2 presents the framework for dynamic OD-Matrix predictive method. First, based on the long-term history smart card transaction data, we extract the spatial-temporal travel patterns of individual passengers. Based on these patterns, the passengers of inflows are divided into two categories according to whether their destination can be uniquely inferred: fixed passengers and stochastic passengers. Then we use a KNN and GPR hybrid model to estimate the proportion that the stochastic passengers will destine to other stations. Finally the OD matrix is obtained by combining the destination estimation results of fixed passengers and random passengers.
The reason why we use KNN and GPR hybrid model is as follow. The OD flow of stochastic passengers is dynamic and has randomicity, we can only give a reliable prediction by using the most updated model for prediction. This implies that frequent retraining is necessary. To fulfill this requirement, we propose a lazy learning approach that integrates the KNN and GPR (Gaussian process regression) for efficient and robust OD flow prediction. GPR is known to be efficient in dealing with relationships among data. However it cannot be scaled up easily for its cubic learning computation and quadratic space requirement. Given the Gaussian process computation barrier, we have to solve a small-scale problem by selecting the limited influential data for GPR, so KNN is combined based on the regularity of destinations distribution, to select the most informative data for Gaussian process Regression.
3 Solution
3.1 Passenger Classification
Individual’s Travel Patterns Extraction.
The process of classifying the passengers who have entered a metro system at station \( s_{i} \) during current time period \( I_{c} \) can be divided into two stages: individual passenger’s travel patterns extraction, dynamic passenger classification.
Definition (Travel pattern): a travel pattern \( p \) of individual user is used to describe the travel’s spatio-temporal regularity, which refers that the passenger often goes from one fixed station to another fixed destination station during a fixed periods of days, and the proportion of these trips exceeds a specified percentage threshold \( \lambda \) (e.g., 60%). For example, in 60% of the days, Bruce goes from the University Town to Shenzhen North Station between 8: 30 and 9: 30. A travel pattern \( P = \left\{ {s_{o} ,s_{d} ,t_{o} ,t_{d} ,r} \right\} \) contains the origin station \( s_{o} \), destination station \( s_{d} \), start time \( t_{o} \), end time \( t_{d} \) and proportion \( r \) respectively.
Given all trips \( TR = \{ tr_{1} ,tr_{2} , \ldots ,tr_{|} N|\} \) of a passenger, we use two steps to extract the passenger’s travel patterns. First, we use a kernel density-based clustering algorithm to cluster passengers’ history trips according the time points of a day. Then, for each cluster, we find whether there exists a subset of trips with similar OD pair, and the proportion of such trips to the total days is greater than the threshold \( \lambda \). If such subset does not exist, no travel pattern is constructed; Otherwise a travel pattern is constructed. In the following, we mainly focus on the first step.
Due to the fact that different passengers have different number of travel patterns, for example, some passengers have one fixed travel period during a day, some passengers have two or more fixed travel periods during a day. Therefore, the number of clusters can’t be determined at the beginning of clustering. Here we use the kernel density-based cluster algorithm which can automatically determine the number of clusters.
The basic principle of the kernel density-based cluster algorithm is that an ideal cluster center has two characteristics: 1) compared with the neighboring data points, the center point of a cluster has greater local density \( \rho \); 2) compared with other data points, the distance \( \sigma \) between the centers of clusters is relatively large. This algorithm is not only suitable for cluster analysis of large-scale data, but also can quickly reduce noise points.
Given a trip set \( TR \), we use the middle time \( t_{m} = \frac{1}{2} \times \left( {tr.t_{o} + tr.t_{d} } \right) \) of a trip \( tr \) to represent the trip’s time for clustering. We assume the time point \( x \) of a trip is an independently distributed random variable, the kernel density estimation function is as follows.
where \( K\left( {\frac{{x - x_{i} }}{h}} \right) \) is the kernel function; \( h \) is the bandwidth, which determines the size of the local range affected by the kernel function.
For the travels in each cluster, if there is a travel subset that satisfies the conditions mentioned earlier, the starting station and destination station of this subset are considered as the of the travel pattern and the time periods involved in these travels are considered as the of the travel pattern. Based on this method, we extract the travel pattern of each passenger.
Through the above process, we can extract the travel pattern of each passenger, denoted as \( P = \left\{ {p_{1} ,p_{2} , \cdots ,p_{\left| P \right|} } \right\} \).
Online Passenger Classification.
Given a passenger of \( O_{i,c} \) of a station \( s_{i} \), who enters metro system in a station \( s_{i} \) at current time slot \( T_{c} \), we first check the passenger’s travel pattern \( M \). If there exists a travel pattern \( m \) satisfying \( \left[ {m.t_{o} \,m.t_{d} } \right] \cap T_{k} \ne null \), we set \( m_{i} .d \) as the destination for the passenger, and classify the passenger as a fixed passenger, otherwise the passenger is a stochastic passenger.
3.2 Destination Distribution Estimation for Stochastic Passengers
For stochastic passenger, we use a hybrid model of KNN and GPR to model the linear and nonlinear characteristics based on different input features respectively. For the stochastic passengers who enter metro system at station \( s_{i} \) during current time period \( T_{c} \), we predict the proportion of these stochastic passengers which will destine to other stations, denoted as a vector \( Y_{i,c} = \left\{ {Y_{i,c}^{1} ,Y_{i,c}^{2} , \ldots ,Y_{i,c}^{\left| S \right|} } \right\} \).
Input of Model.
Given a station \( s_{i} \) and a time slot \( T_{c} \) of a day, we explore three sets of key features, namely time features, weather features, passenger flow features for predicting the proportion of the stochastic passengers travelling to other stations.
Time Features: The time of day and day of the week are significant in OD flow prediction as shown in Fig. 1. We use \( F_{t} = \left( {dy,Ts} \right) \) to represent the time features, where \( dy \in \left[ {1,7} \right] \) represents the day of week, which distinguishes between weekdays from Monday to Sunday. The \( Ts \in \left[ {1,\left| T \right|} \right] \) represents the time slot of day.
Weather Features: We use two weather features, tm (temperature) and hm (humidity), which are studied being correlated with the estimated target. Given current time slot \( T_{c} \) needed to estimate, we use the weather condition \( F_{w} = \left( {tm_{c} ,hm_{c} } \right) \) to represent the weather features.
Traffic Flow Features: We extract the inflow and outflow of each metro station during past time \( M \) intervals, which refers to the total number of passengers who enter and exit metro system at each station during past time interval, \( F_{l} = \left\{ {F_{c - M} , \ldots ,F_{c - 1} } \right\} \), where \( F_{c - i} = \left\{ {O_{c - i} ,D_{c - i} } \right\} \) represents the traffic flow at past time interval \( T_{c - i} \).
K-nearest Neighbor.
The KNN is used to find the data close to current condition. An efficient prediction is based on how we can catch the spatial temporal patterns from the historical data.
The condition is represented as a feature vector x which includes the three types of features: time, weather, traffic flow, \( x = \left\{ {F_{t} ,F_{w} ,F_{l} } \right\} \). The target value Y is probabilities of the stochastic passengers destining to other stations.
Since destination distribution has obvious temporal characteristics, such as 24 h periodicity, different characteristics between weekday and weekend, and strong correlation between near time slots. We only search on the historical data with same day indicator (weekday and weekend) and near time slots.
After that, we can define the metric for KNN by (1) for each pair of historical data \( F \) and test data \( F ' \). In the following equation, \( C\left( {f,f '} \right) \) is the correlation coefficients of OD flows during the time features of \( F \) and \( F ' \).
Gaussian Process Regression.
The K neighbors selected by KNN algorithm are used as the input of GPR computation for passengers’ destination distribution. Given the K nearest neighbors \( H \equiv \left\{ {X = \left( {x_{i} , \cdots ,x_{k} } \right),Y = \left( {y_{i} , \cdots ,y_{k} } \right)} \right\} \), where \( x_{i} \) denotes an input feature vector and \( y_{i} \) denotes the target value. The goal of GPR is to find a function \( f \) which can describe the relationship between \( x_{i} \) and \( y_{i} \). GPR can be completely specified by a mean function \( m\left( x \right) \) and a covariance function \( k\left( {x,x '} \right) \), as \( f = GP\left( {m\left( x \right),k\left( {x,x '} \right)} \right) \), where \( x \) and \( x ' \) represent two input feature vectors. The key point of GPR is to select an appropriate kernel function k. In this paper, we take the most commonly used method that is the squared exponential kernel function as follows:
Given current conditions \( x_{c} \), the corresponding target is predicted with GPR as:
4 Analysis of Experimental Results
4.1 Dataset
In the experiments, we use two types of data: (a) AFC data; (b) weather data over two years from Jan, 1, 2014 to Dec, 31, 2015 in Shenzhen, China. The metro system has 117 stations and 5 physical lines.
4.2 Metrics
In this paper, we evaluate the performances of our model by two kinds of evaluation metrics, which are the mean accuracy \( ACC \), mean absolute error MAE, according to following two Equations. Their definitions are as follows, where \( y \) is the ground truth of OD flow matrix, \( \tilde{y} \) is the prediction of OD flow matrix. To be noted, the absolute value of the subtraction between two matrix is defined as the sum of the absolute differences of all items in the two matrixes.
4.3 Compared Methods
We evaluate our method by comparing with the other learning models. The selected learning models are presented as follows, and each of these models is a representative of a group of method with similar base:
Empirical Estimation (EMP): represents the method based on the naive empirical knowledge, we estimate dynamic OD flow matrix by using its historical average.
Bayesian Network (Bayes): is a typical graph-based algorithm, which is a representative for the probability based models.
Artificial Neural Network (ANN): represents the model to capture the non-linearity between three type of features and OD flows.
XGBoost: represents the prediction by resemble learning method for regression problems, by producing a prediction model in the form of an ensemble of basic learning models, typically decision trees.
KGmetNC: In order to justify the necessity of the step of passenger classification in our method. We use a method called as KGmetNC that doesn’t perform such step. KGmetNC considers all passengers of current inflow as stochastic passengers and uses KNN-GPR combined method to estimate passengers’ destination distribution.
4.4 Experiment Result
Comparision to Baselines.
We compare our method with various methods in terms of metrics and highlight the best performance with bold font (as shown in Table 1). First, we can see from Table 1 that our method with more than 0.85 accuracy has the best performance. Moreover, the naive empirical baseline achieves more than 0.652 accuracy, which suggests the OD flows are relatively stable. The Ensemble method XGBoost achieves the best performances among all tradition methods including EMP, Bayes, ANN, which suggests that ensemble methods are also a good choice since they can capture the non-linear correlation between dynamic OD flow and observation. Second, we also can observe the contribution of using the step of passenger classification.
4.5 Different Time Period
We compare the predicting results during different time periods of days as shown in Table 2. Specifically, we choose two kinds of days, weekend and weekday, and partition a day into two parts, peeking hours (07:00–09:00, 17:00–19:00) and low hours (other times) for comparison. We can observe that the predicting results at low hours on weekdays are worse than those at peeking hours. It is easy to understand for the reason that there are larger amount of fixed passengers at peak hours, especially at AM peak hours. Same reasons for that the average accuracy in weekend is worse than that in weekday.
4.6 Different Data Sizes
We evaluate the performance of the proposed method using different data sizes. Basically, we want to confirm that more data can really improve the predicting performance. We test the proposed method with different sizes of the historical data, from one month to nearly two years data. Figure 3 shows the RME with data size. In Fig. 3, we can see that more training data indeed gives better performance. However, the result is going to be stable as the number of input instances goes larger than 18 months.
5 Conclusion and Discussion
In this paper, we propose KGmet, a data-driven method to predict the dynamic OD flow in a complex metro system. KGmet first divides the passengers of current inflow into two categories: fixed passengers and random passengers according to individual passengers’ travel patterns. Then KGmet uses a KNN-GPR combined method, which can keep updating based on newly acquired data, to predict the dynamic destination distribution of stochastic passengers. We evaluate KGmet with real-world datasets collected from Shenzhen, China. The experimental results show that our method works well for estimating dynamic OD flows with 85% accuracy.
References
Morzy, M.: Prediction of moving object location based on frequent trajectories. In: Levi, A., Savaş, E., Yenigün, H., Balcısoy, S., Saygın, Y. (eds.) ISCIS 2006. LNCS, vol. 4263, pp. 583–592. Springer, Heidelberg (2006). https://doi.org/10.1007/11902140_62
Morzy, M.: Mining frequent trajectories of moving objects for location prediction. In: Perner, P. (ed.) MLDM 2007. LNCS (LNAI), vol. 4571, pp. 667–680. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-73499-4_50
Jeung, H., Liu, Q., Shen, H.T., Zhou, X.: A hybrid prediction model for moving objects. In: A Hybrid Prediction Model for Moving Objects, pp. 70–79. IEEE (2008)
Zheng, Y., Zhang, L., Xie, X., Ma, W.-Y.: Mining interesting locations and travel sequences from GPS trajectories. In: Mining Interesting Locations and Travel Sequences from GPS Trajectories, pp. 791–800 (2009)
Zhang, D., Zhao, J., Zhang, F., Jiang, R., He, T., Papanikolopoulos, N.: Last-mile transit service with urban infrastructure data. ACM Trans. Cyber-Phys. Syst. 1(2), 1–26 (2016)
Cheon, S.H., Lee, C., Shin, S.: Data-driven stochastic transit assignment modeling using an automatic fare collection system. Transp. Res. Part C: Emerg. Technol. 98, 239–254 (2019)
Allahviranloo, M., Chow, J.Y.: A fractionally owned autonomous vehicle fleet sizing problem with time slot demand substitution effects. Transp. Res. Part C: Emerg. Technol. 98, 37–53 (2019)
Zhang, D., Zhao, J., Zhang, F., He, T.: coMobile: real-time human mobility modeling at urban scale using multi-view learning, pp. 1–10 (2015)
Antoniou, C., Dimitriou, L., Pereira, F.: Mobility patterns, big data and transport analytics: tools and applications for modeling (2018)
Zhang, D., Zhao, J., Zhang, F., He, T., Lee, H., Son, S.H.: Heterogeneous model integration for multi-source urban infrastructure data. ACM Trans. Cyber-Phys. Syst. 1(1), 1–26 (2016)
Zhao, J., et al.: Estimation of passenger route choice pattern using smart card data for complex metro systems. IEEE Trans. Intell. Transp. Syst. 18(4), 790–801 (2016)
Zhao, J., Qu, Q., Zhang, F., Xu, C., Liu, S.: Spatio-temporal analysis of passenger travel patterns in massive smart card data. IEEE Trans. Intell. Transp. Syst. 18(11), 3135–3146 (2017)
Kusakabe, T., Asakura, Y.: Behavioural data mining of transit smart card data: a data fusion approach. Transp. Res. Part C: Emerg. Technol. 46, 179–191 (2014)
Widhalm, P., Yang, Y., Ulm, M., Athavale, S., González, M.C.: Discovering urban activity patterns in cell phone data. Transportation 42(4), 597–623 (2015). https://doi.org/10.1007/s11116-015-9598-x
Acknowledgment
The authors would like to thank anonymous reviewers for their valuable comments. This work is supported in part by the National Key R&D Program of China (No. 2019YFB2102100), and by “National Natural Science Foundation of China” No. 61802387, and by National Natural Science Foundation of Shenzhen No. JCYJ20190812153212464, and by Shenzhen Discipline Construction Project for Urban Computing and Data Intelligence,and by China’s Post-doctoral Science Fund No. 2019M663183.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Ye, J., Zhao, J., Zhang, L., Xu, C., Zhang, J., Ye, K. (2020). A Data-Driven Method for Dynamic OD Passenger Flow Matrix Estimation in Urban Metro Systems. In: Nepal, S., Cao, W., Nasridinov, A., Bhuiyan, M.Z.A., Guo, X., Zhang, LJ. (eds) Big Data – BigData 2020. BIGDATA 2020. Lecture Notes in Computer Science(), vol 12402. Springer, Cham. https://doi.org/10.1007/978-3-030-59612-5_9
Download citation
DOI: https://doi.org/10.1007/978-3-030-59612-5_9
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-59611-8
Online ISBN: 978-3-030-59612-5
eBook Packages: Computer ScienceComputer Science (R0)