Navin Ranjan1
                     Sovit Bhandari1
                     Yeong-Chan Kim1,2
                     Hoon Kim1,2*

                     1 IoT and Big Data Research Center, Incheon National University, Yeonsu-gu, Incheon, 22012, Korea
                       {ranjannavin07, sovit198}@gmail.com

                     2 Department of Electronics Engineering, Incheon National University, Yeonsu-gu, Incheon, 22012, Korea
                       {kyc0288, hoon}@inu.ac.kr
 
            
            
            Copyright © The Institute of Electronics and Information Engineers (IEIE)
            
            
            
            
            
               
                  
Keywords
               
                Q-learning, Mobile routing, Traveling salesman problem, Dijkstra algorithm, Illegal parking, Edge computing networks
             
            
          
         
            
                  1. Introduction
               Route planning in a network is an optimization problem that has been studied in applied
                  mathematics and computer science for decades. The problem is computationally expensive,
                  and although many different algorithms have been proposed to solve it, providing fast and
                  reliable solutions remains a challenging task [1]. Route planning plays a critical role in a variety of applications, such as packet
                  transfer in wireless sensor networks [2-6], routing in 5G and 6G networks [7-9], and vehicle routing [10-14]. Researchers have developed many exact, heuristic, and metaheuristic
                  algorithms to solve the route-planning optimization problem, each with its own pros and cons.
                  For example, an exact algorithm such as the Dijkstra algorithm systematically evaluates
                  candidate solutions and returns the best one. However, even though it provides the exact
                  solution, it tends to suffer from high time complexity even when planning routes in a fairly
                  small network [14-16]. Since time complexity is a major factor for real-time route planning in
                  a network, it is often better to look for an approximate solution instead of the exact one.
                  Therefore, heuristic and metaheuristic algorithms, such as local search, ant colony
                  optimization, and particle swarm optimization, are preferred for solving the route-planning
                  optimization problem. However, these algorithms also have limitations; for instance,
                  heuristic algorithms are, in many cases, unable to generate the best solution to this
                  optimization problem.
               
               Recently, machine learning (ML) algorithms have become popular in various fields owing
                  to their ability to handle large data and data non-linearities and to the availability
                  of cheap, high computational power [17-19]. Reinforcement learning (RL) is a major class of machine learning inspired by the
                  behavior learning of animals. RL algorithms perform well in partially observed environments
                  and high-dimensional spaces, where an approximate solution to a complex problem is often
                  of greater value than an exact/best solution, which is feasible only for simple problems.
                  RL-based algorithms can solve the route-planning optimization problem in applications that
                  involve large networks, such as packet transfer in a wireless network, vehicle route planning
                  in a transportation network, signal routing in a communication network, and navigation
                  through a maze, and RL algorithms such as Q-learning provide good solutions to this problem
                  regardless of the environment.
               
               The advantages of Q-learning, such as its off-policy and temporal-difference learning,
                  inspired us to propose the DQSR algorithms (one for each stage). The DQSR is based
                  on two Q-learning agents. The first stage's agent develops the shortest path between
                  each customer and that customer's nearest neighbor customers in the network and finds
                  the length of this path. The second stage's agent finds the shortest route from any
                  customer through the other customers, virtually visiting each other customer exactly
                  once and returning to the starting customer along the route with the shortest distance.
                  Both Q-learning agents therefore share the responsibility of achieving the shortest
                  route, so the DQSR obtains this route relatively fast compared with exact, some heuristic,
                  or single-stage Q-learning based algorithms. In addition, the concept of nearest neighbor
                  customers cannot be exploited in the exact and heuristic algorithms for shortest-route
                  planning/generation. In essence, the main contributions of this paper are summarized
                  as follows:
               
               · We developed a dual-stage, Q-learning based, shortest-route generation (DQSR) method and
                  its two algorithms.
               
               · We implemented the idea of calculating the shortest path and distance between each
                  customer and this customer’s nearest neighbor customers for the shortest-route generation.
               
               · Our extensive experiments on DQSR-based, illegal-parking surveillance demonstrate
                  the efficiency of the DQSR compared to two other algorithms used for the same purpose,
                  in terms of time complexity and shortest-route generation.
               
               We described the background of our research in this section. The rest of the paper
                  is structured as follows. Section 2 presents the works related to our research. Section
                  3 presents the methodology of our research, including the background on reinforcement
                  learning, problem statement, and the DQSR algorithms. Section 4 presents the data
                  set used in the experiment, algorithms compared with the DQSR algorithms, simulation
                  parameters, the case study results, and discussion. Finally, Section 5 concludes and
                  provides the future direction for this research.
               
             
            
                  2. Related Work
               Route planning is one of the most important functions of an intelligent transportation
                  system. It is widely used in various areas, such as daily travel route planning [20-22], package delivery service or indoor warehouse management [23-25], and traffic congestion reduction [26-28]. Route planning also saves a lot of time and money for transport management
                  agencies. In recent years, many academic works have been devoted to solving the route-planning
                  optimization problem with multiple locations to be connected by the route, in a relatively
                  short time. Some recent works have also handled route planning under multiple objectives.
               
               Nowadays, many algorithms find the shortest path between nodes/locations in a network,
                  with widespread applications. For example, an exact algorithm based on traditional
                  graph-based methods, such as the Dijkstra algorithm, is used to find this shortest
                  path in a network under multiple destinations [29]. Notably, the multi-destination Dijkstra algorithm (MDDA) is a recent algorithm
                  for finding the multi-destination shortest path in a network. In particular, the MDDA
                  uses a Fibonacci heap [30,31]. Meanwhile, Lu et al. [32] proposed a node combination method to implement the Dijkstra algorithm in finding
                  the shortest path in a network by following three simple iterative steps: find the
                  nearest neighbor node of the source node, combine the nearest neighbor node with the
                  source node, and modify the weights on the edge connecting the source and nearest
                  neighbor nodes. However, the node combination method is neither as efficient nor as fast as
                  the MDDA. Likewise, Orlin et al. [33] proposed an algorithm to efficiently find the shortest path in a network, which is
                  applicable even to a very sparse network or under a small number of positive integer
                  weights. Further, Jin et al. [34] presented the modified MDDA to find the shortest path in a time-schedule network.
                  In particular, they considered the network constraints on its edges to find this shortest
                  path. Also, Asaduzzaman et al. [35] proposed a modified Dijkstra algorithm to find the shortest path to multiple destinations
                  in a network by using a minimum number of weights. Interestingly, Hart et al. [36] proposed an A* algorithm, which is an exact algorithm for incorporating information
                  from an optimization problem to a graphical analysis performed to find a minimum-cost
                  path in a network. Similarly, Ticha et al. [37] found the multi-destination, bi-objective shortest path using an A* search strategy.
                  In particular, their search strategy involves choosing some part of the path for a
                  non-dominated path, in each iteration, to arrive at one of the destination nodes.
                  However, the last two exact and heuristic algorithms and similar approaches have a serious
                  time-complexity issue, with run-times that grow polynomially with the graph size. Meanwhile,
                  Bast et al. [38] developed an algorithmic approach for fast routing in a road network with transit
                  nodes, reducing quickest-path queries in such a road network to a small number
                  of table lookups and thereby reducing the query time significantly. Their algorithm is more than
                  one million times faster than the best-known algorithm for routing in a general network.
                  Finally, Shen et al. [39] improved the shortest-path databases of a network by (i) contracting the input graph
                  before preprocessing and (ii) limiting the preprocessing step to only a selected subset
                  of the graph nodes.
               
               Vehicle routing, or vehicle-route planning, is a generalization of the traveling salesman
                  problem (TSP). A greedy and very simple strategy that considers only the nearest neighbors
                  allows the locally best node/vertex to join the solution, but this strategy
                  is inefficient for solving the TSP. Bentley [40] achieved an improved run-time in finding approximate solutions to large geometric
                  TSPs. In addition, Clarke et al. [41] presented an iterative procedure that enables the rapid selection of an optimum or
                  near optimum route for a fleet of trucks (with varying capacities) from a central
                  depot to several delivery points. Recently, ML approaches have brought new capabilities,
                  and researchers have been coupling ML features with well-known heuristic concepts to
                  introduce new hybrid algorithms. For example, Mele et al. [42] presented a new constructive heuristic driven by ML for solving the TSP. Their
                  algorithm enables reasonable generalization and strikes an efficient
                  balance between the ML and optimization techniques. Likewise, Zhang et al. [43] proposed deep reinforcement learning for the TSP, with time windows and rejections.
                  Their deep RL showed superior performance compared to the Tabu search heuristics,
                  in terms of the solution quality and inference computation time. Further, Reinelt
                  [44] provided the TSP library (TSPLIB) for researchers, with a broad set of test problems
                  from various sources and with various properties. Finally, the arc routing or Chinese
                  Postman problem [45,46] presents a process for finding a least-cost way to traverse each arc of a network
                  at least once and return to the node/vertex from which the agent starts.
               
             
            
                  3. Methodology
               
                     3.1 Reinforcement Learning
                  Reinforcement learning (RL) is a computational approach to understanding and automating
                     goal-directed learning and decision-making. It is about taking suitable actions to
                     maximize the reward in a particular situation. RL differs from other computational
                     approaches, such as supervised learning, in that the RL agent learns by interacting
                     with its environment, with or without prior knowledge of this environment and without
                     relying on labeled examples. RL is based on Markov decision processes (MDPs).
                  
                  An MDP is a discrete-time stochastic control process that provides the mathematical framework
                     to model decision-making in situations where the outcomes are partly random and
                     partly under the decision-maker's control. An MDP is defined by a finite state space ($S$),
                     an action space ($A$), a state transition probability matrix ($P$), and a reinforcement or
                     reward function ($R$); in particular, it is the tuple $\left(S,A,P_{a},R_{a}\right)$.
                     A state $s\in S$ is the representation of the environment at a specific time step.
                     An action $a\in A$ is one of the possible actions in the state $s$. The state transition
                     probability, $P_{a}\left(s,s'\right)=P\left(s_{t+1}=s'\,|\,s_{t}=s,a_{t}=a\right)$, gives the
                     probability that the action $a$ taken in the state $s$ at time $t$ will lead to
                     the state $s'$ at time $t+1$. Finally, the reinforcement or reward, $R_{a}\left(s,s'\right)$,
                     is the immediate reward received after transitioning from the state $s$ to the state
                     $s'$.
                  
                  The RL agent interacts with the environment in a sequence of steps at time $t$:
                     the agent (i) receives a representation of the environment, i.e., the state $s$;
                     (ii) selects the action $a$ using the ${\varepsilon}$-greedy selection (explained
                     below) and executes this action; (iii) receives the immediate reward signal $R_{a}\left(s,s'\right)$;
                     (iv) updates the learning matrix; and (v) observes the new state $s'$ of the environment.
                  
                  The goal of the RL agent is to learn a policy $\pi$, i.e., to learn a mapping from
                     the state space ($S$) to the action space ($A$) that maximizes the numerical reward
                     signal. The RL agent is therefore not explicitly programmed with the action to take
                     in each state. Instead, it learns, through a trial-and-error procedure, the action
                     yielding the most reward in a state. In addition, the RL agent follows exploration
                     and exploitation strategies during its learning to find the most rewarding action in
                     a state. The exploration strategy allows the RL agent to improve its current knowledge
                     about each state-action pair by taking a random action from the action list instead
                     of the best possible action in a state. On the other hand, the exploitation strategy
                     makes the RL agent choose the best action from the action list to get the most reward
                     by exploiting the agent's current state-action values (knowledge). A balance between
                     the exploration and exploitation strategies is essential and is accomplished by the
                     ${\varepsilon}$-greedy selection or the Boltzmann selection. The ${\varepsilon}$-greedy
                     selection's parameter $\varepsilon\left(0<\varepsilon<1\right)$ is defined, and the
                     decision policy $\pi\left(s\right)$ is applied according to the following equation:

                  $$\pi\left(s\right)=\begin{cases} a^{*}, & \text{with probability } 1-\varepsilon \\ a^{r}, & \text{with probability } \varepsilon, \end{cases} \qquad (1)$$

                  where $a^{*}$ and $a^{r}$ refer to the best and random actions selected (under exploitation
                     and exploration, respectively) in the current state $s$.
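
                  For concreteness, a minimal Python sketch of this $\varepsilon$-greedy action selection is given below; the dictionary-based Q-table layout and the function name `epsilon_greedy` are illustrative assumptions, not code from the paper.

```python
import random

def epsilon_greedy(q_table, state, actions, epsilon):
    """Epsilon-greedy action selection: explore a random action a^r with
    probability epsilon, otherwise exploit the best-known action a*."""
    if random.random() < epsilon:
        return random.choice(actions)                       # exploration: a^r
    return max(actions, key=lambda a: q_table[(state, a)])  # exploitation: a*
```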
                  
                  Q-learning is an off-policy RL algorithm in which the agent learns the best state-action
                     value function independently of the policy being followed. Q-learning is based on
                     temporal-difference (TD) learning, i.e., the value estimates are updated incrementally
                     as the agent interacts with the environment, without waiting for the final outcome.
                     In particular, the Q-learning state-action value is updated as

                  $$Q_{t+1}\left(s,a\right)=Q_{t}\left(s,a\right)+\alpha\left[r\left(s,a\right)+\gamma\max_{a'}Q\left(s',a'\right)-Q_{t}\left(s,a\right)\right], \qquad (2)$$

                  where $Q_{t}\left(s,a\right)$ and $Q_{t+1}\left(s,a\right)$ are the state-action value
                     (at time $t$) and the updated state-action value (at time $t+1$) of the state-action
                     pair $\left(s,a\right)$, respectively. In addition, $r\left(s,a\right)$ is the immediate
                     reward for executing the action $a$ in the state $s$, $\alpha\in\left[0,1\right]$
                     is the learning rate, $\gamma\in\left[0,1\right]$ is the discount factor, and
                     $\max_{a'}Q\left(s',a'\right)$ is the best Q-value (state-action value) in the state
                     $s'$ over the actions $a'$.
                  
                
               
                     3.2 Problem Statement
                  Let the road network be represented by a directed graph, $G=\left(V,E\right)$, where
                     $V$ is the set of nodes, $E$ is the set of edges (road segments), and each edge connects
                     two nodes in the road network. The length of an edge $e\in E$ is denoted by
                     $l_{e}\in\mathbb{R}$. Let us consider that the RL agent has to visit the respective
                     locations or edges of $n$ customers, $D=\left\{d_{1},d_{2},\ldots,d_{n}\right\}\subseteq E$,
                     exactly once before returning to the starting node $\left(d\in D\right)$. Here, an
                     edge containing a customer is treated as a node when performing the DQSR. Notably, the
                     second stage of the DQSR is a classic example of the TSP.
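
                  To make this formulation concrete, the sketch below builds a toy directed road graph with edge lengths and marks customer edges. The use of networkx and the node/edge names are illustrative assumptions; the paper does not prescribe a particular graph library.

```python
import networkx as nx

# Toy directed road network G = (V, E); edge attribute "length" plays the role of l_e (meters).
G = nx.DiGraph()
G.add_weighted_edges_from([
    ("n1", "n2", 442.1), ("n2", "n3", 426.3),
    ("n3", "n1", 261.7), ("n2", "n4", 88.8),
], weight="length")

# Customer set D: road links (edges) that must each be visited exactly once.
D = [("n1", "n2"), ("n2", "n4")]
for e in D:
    G.edges[e]["customer"] = True   # mark the edge; the DQSR treats it as a node
```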
                  
                
               
                     3.3 The Dual-stage, Q-learning based, Shortest-route generation (DQSR) Algorithms
                     
                  
                  
                        3.3.1 The Q-learning based, Shortest-path Generation Algorithm
                     The calculation of the shortest path between two nodes is one of the classic problems
                        in graph theory. The DQSR uses the Global Positioning System (GPS) location, i.e.,
                        the latitude and longitude, of all the customers to determine, for each customer,
                        that customer's $x$ nearest neighbor customers and their respective locations. It then
                        uses the Q-learning based, shortest-path generation algorithm (Algorithm 1) to generate the shortest path between each customer and this customer's nearest
                        neighbor customers and to find this path's length.
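
                     The paper does not specify how the nearest neighbor customers are computed from the GPS coordinates; the sketch below is one straightforward possibility based on the haversine (great-circle) distance.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two GPS points."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest_neighbors(customer_coords, x):
    """Return, for each customer id, the ids of its x nearest neighbor customers.

    customer_coords maps a customer id to its (latitude, longitude)."""
    neighbors = {}
    for cid, (lat, lon) in customer_coords.items():
        others = sorted((haversine_m(lat, lon, la, lo), oid)
                        for oid, (la, lo) in customer_coords.items() if oid != cid)
        neighbors[cid] = [oid for _, oid in others[:x]]
    return neighbors
```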
                     
                     The inputs to Algorithm 1 are given in line 1 of the algorithm. In particular, the RL agent requires: the road
                        network represented by a directed graph, $G\left(V,E\right)$; the road-network length
                        graph, i.e., the road network with each of its edges carrying the corresponding length;
                        the list of customers to visit (in terms of their corresponding edges); each customer's
                        nearest neighbor customers list; $n$ Q shortest-path tables $\{Q_{\colon 1},Q_{\colon 2},\ldots,Q_{\colon n}\}$
                        to store the updated state-action values for each of the $n$ customers, respectively;
                        a distance table $\left(D_{Route}\right)$ to store the length of the shortest path
                        between each customer and this customer's nearest neighbor customers; a path table
                        $\left(P_{Route}\right)$ to store the shortest path between each customer and this
                        customer's nearest neighbor customers; and the maximum number of random walks the RL
                        agent can take before terminating the 'while' loop. Notably, the Q shortest-path
                        tables each have the same state transition probability matrix as that of the road
                        network. In contrast, the distance and path tables each have a state transition
                        probability matrix covering each customer and this customer's nearest neighbor
                        customers, i.e., with size $n\times x$. The $i^{th}$ Q shortest-path table,
                        $Q_{\colon i}$, stores the Q-value updates for the paths from all $x$ nearest neighbor
                        customers of the $i^{th}$ customer to the $i^{th}$ customer. In particular, $Q_{\colon i}$
                        is used to calculate the shortest path and distance from the $i^{th}$ customer's $x$
                        nearest neighbor customers to the $i^{th}$ customer; here '$\colon$' represents the
                        list of the $i^{th}$ customer's nearest neighbor customers. The output of the algorithm
                        includes the updated path and distance tables. In addition, the algorithm requires
                        specifying the learning rate, discount factor, initial epsilon value, epsilon decay
                        rate, and minimum epsilon.
                     
                     The working of Algorithm 1 is shown in lines 4 to 24 of the algorithm. The first ‘for loop’ loops through all
                        the customers to visit, as mentioned in line 4 of the algorithm. The RL agent copies
                        the $i^{th}$ customer’s Q shortest-path table and the road network length graph as
                        the reward table with the sign of its each edge length changed to negative, in lines
                        5 and 6 of the algorithm, respectively. The second ‘for loop’ loops through the $i^{th}$
                        customer’s all ‘$x$’ nearest neighbor customers, as mentioned in line 7 of the algorithm.
                        Since the goal of the RL agent is to reach the $i^{th}$ customer’s nodes, we manipulate
                        the reward table by adding extra positive reward to each edge that is connected to
                        any of these nodes, as in line 8 of the algorithm. This extra reward enables the RL
                        agent to learn effectively. The Q shortest-path table update for the paths from the $i^{th}$
                        customer's $x$ nearest neighbor customers to the $i^{th}$ customer is presented
                        in lines 7 to 20 of the algorithm. First, the RL agent starts its functioning from
                        the assigned current state and random-walks one step at a time until its termination
                        according to the ‘while’ condition given in line 11 of the algorithm. In other words,
                        the RL agent’s random walking is terminated either when the number of agent random
                        walks completed exceeds the maximum number of random walks the agent can take (max_steps)
                        or when the agent reaches the end state. Meanwhile, the RL agent chooses the action
                        ($a$) at the current state ($s$) (as mentioned in line 12 of the algorithm), based
                        on the $\varepsilon $-greedy selection, which enables the agent to either explore
                        or exploit depending on the epsilon value. The RL agent then performs the chosen action
                        and moves to the next state ($s')$, as in line 13 of the algorithm. Subsequently,
                        the RL agent updates the previous Q-value, according to Eq. (2), as in line 14 of the algorithm. Further, the RL agent checks if the state $s'$ is
                        the end state and changes the termination parameter (goal)’s value accordingly, as
                        in lines 15 and 16 of the algorithm, respectively. Next, the RL agent replaces the
                        current state with the next state and decrements the max_steps. The epsilon value
                        is then updated after the termination of the RL agent’s path search, as in lines 18
                        and 19 of the algorithm. The entire process of Q-value update, as in lines 10 to 19
                        of the algorithm, is repeated $T$ (an user-defined number) times by the ‘for loop’
                        (in line 10 of the algorithm). These $T~ $repetitions allow the RL agent to learn
                        the correct shortest-path from the $i^{th}$ customer’s all ‘$x$’nearest neighbor customers
                        to the $i^{th}$ customer. The RL agent then calculates the path and distance from
                        the $i^{th}$ customer’s all ‘$x$’nearest neighbor customers to the $i^{th}$ customer,
                        in line 20 of the algorithm, based on the Algorithm 3 (given subsequently in this paper). Finally, the RL agent updates the $i^{th}$ component
                        of the distance and path tables (in lines 22 and 23 of the algorithm, respectively).
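
                     The following condensed sketch illustrates the structure of Algorithm 1 described above: nested loops over customers, their nearest neighbors, and training episodes, with negated edge lengths plus a goal bonus as rewards. It reuses the `epsilon_greedy` and `q_update` helpers sketched earlier and an `extract_path` helper sketched later with Algorithm 3; the bonus value and `max_steps` default are illustrative, while the episode count and epsilon schedule follow Table 3.

```python
def train_shortest_paths(G, customers, neighbors, episodes=30, max_steps=500,
                         epsilon=1.0, eps_decay=0.9, eps_min=0.05):
    """Condensed sketch of Algorithm 1: learn one Q shortest-path table per
    customer and fill the path (P_Route) and distance (D_Route) tables for
    that customer's nearest neighbor customers."""
    lengths = {(u, v): G.edges[u, v]["length"] for u, v in G.edges}
    P_route, D_route = {}, {}
    for i in customers:                                    # loop over customers (line 4)
        Q = {(s, a): 0.0 for s in G.nodes for a in G.successors(s)}
        R = {e: -l for e, l in lengths.items()}            # reward = negated edge length
        for u, v in G.in_edges(i):
            R[(u, v)] += 1000.0                            # extra goal reward (illustrative value)
        eps = epsilon
        for j in neighbors[i]:                             # loop over nearest neighbors (line 7)
            for _ in range(episodes):                      # T training repetitions (line 10)
                s, steps, done = j, max_steps, False
                while steps > 0 and not done:              # random walk (lines 11-17)
                    acts = list(G.successors(s))
                    if not acts:                           # dead end: stop this walk
                        break
                    a = epsilon_greedy(Q, s, acts, eps)
                    q_update(Q, s, a, R[(s, a)], a, list(G.successors(a)))
                    done = (a == i)                        # reached the i-th customer?
                    s, steps = a, steps - 1
                eps = max(eps * eps_decay, eps_min)        # epsilon decay (lines 18-19)
            # Exploit the learned table to recover path and distance (line 20).
            P_route[(j, i)], D_route[(j, i)] = extract_path(Q, j, i, lengths)
    return P_route, D_route
```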
                     
                     An example of the Q-learning based generation of the shortest path from each customer’s
                        nearest neighbor customers to this customer is shown in Figs. 1(a)-(d). In particular, Fig. 1(a) shows an arbitrary road network with five customers. Each
                        customer’s starting and end nodes are represented by ‘green’ and ‘light blue’ color
                        dots, respectively. The RL agent then updates the shortest path from each customer's
                        nearest neighbor customers to this customer by visiting each customer once and returning
                        to the starting customer.
                     
                     The process of updating the Q shortest-path table for each customer is shown in Fig. 1(b). Fig. 1(b) also shows the list of the respective Q shortest-path table updates for all the five
                        customers. In particular, the first image in Fig. 1(b) shows the learned Q-value for the path from the respective end node of all the customer
                        $d_{1}$’s nearest neighbor customers to $d_{1}.$ In addition, each edge’s width represents
                        this edge’s Q-value compared to other edges. Since, the Q-value grows when reaching
                        the destination, keeping the Q shortest-path table update for the path from each customer’s
                        nearest neighbor customers to this customer in one Q shortest-path table is possible
                        and doesn’t contradict when exploiting the Q shortest-path table to generate the shortest
                        path. Fig. 1(c) shows each customer to this customer’s each nearest neighbor customer transitional
                        distance-table. Similarly, Fig. 1(d) shows each customer to this customer’s each nearest neighbor customer transitional
                        path-table. Notably, the paths in this table are reversed when the states are reversed.
                     
                     
                           Fig. 1. An example of the Q-learning based generation of the shortest path from each customer's nearest neighbor customers to this customer and the shortest route connecting all the customers. Fig. 1(a) is an arbitrary road network with five customers. Fig. 1(b) shows the list of the respective Q shortest-path table updates for all five customers; each edge's width represents this edge's Q-value relative to the other edges. Fig. 1(c) shows the transitional distance table from each customer to each of this customer's nearest neighbor customers. Fig. 1(d) shows the corresponding transitional path table. Fig. 1(e) shows the Q-value transition between each customer and this customer's nearest neighbor customers. Fig. 1(f) shows the shortest route from the customer $d_{1}$, generated based on Algorithm 3 and the final Q-routing table from Fig. 1(e).
 
                     
                           Algorithm 1. Q-learning based, Shortest-path generation.
 
                   
                  
                        3.3.2 The Q-learning based, Shortest-route Generation Algorithm
                     Algorithm 2 is for the Q-learning based generation of the shortest route from any customer through
                        the other customers, visiting each other customer once and returning to the starting
                        customer. The inputs to the algorithm are given in line 1 of the algorithm. The RL
                        agent requires the Q-routing table having each customer and this customer’s nearest
                        neighbor customers’ state transition, the distance routing and path routing tables
                        from the Algorithm 1, the reward table having its state transitions same as those of the Q-routing table
                        with the initial reward value as -1, the list of customers to visit, the maximum number
                        of random walks for the RL agent, the total number of episodes, and the starting customer.
                        The output of the algorithm includes the shortest route and distance. The algorithm
                        also requires specifying the learning rate, discount factor, initial epsilon value,
                        epsilon decay rate, and minimum epsilon.
                     
                     The working of Algorithm 2 is shown in lines 4 to 38 of the algorithm. The algorithm consists of five subparts:
                        the global variable setup, episode variable setup, Q-value update, the global variable’s
                        update by comparing the episode variables, and route and route distance generation.
                        The Q-routing and reward tables are copied as the best Q-routing and reward tables,
                        as in line 4 of the algorithm. Then, the global variables for the number of visited
                        customers and the distance covered are initialized, as in line 5 of
                        the algorithm.
                     
                     
                           Algorithm 2. Q-learning based, Shortest-route generation.
 
                     
                           Algorithm 3. Path/route generation and the corresponding distance calculation.
 
                     The ‘for loop’ loops through all the available episodes, as shown in line 8 of the
                        algorithm. The episode variables are updated in each episode, as in lines 7 to 11
                        of the algorithm. The Q-value update during each loop is similar to the Q-value update
                        of Algorithm 1 and is shown in lines 12 to 26 of the algorithm. The RL agent starts its functioning
                        from the assigned current state and random-walks one step at a time until its termination
                        according to the ‘while’ condition given in line 12 of the algorithm. The RL agent
                        is terminated either when its completed number of random walks exceeds the  maximum
                        number of random walks the agent can take (max_steps) or when the agent reaches the
                        end state. The agent then chooses an action ($a$) at the current state ($s$), based
                        on the $\varepsilon $-greedy selection which enables the agent to either explore or
                        exploit based on the epsilon value, as in line 13, and selects the next state, as
                        given in line 14 of the algorithm. Subsequently, the episode reward table is updated,
                        as in lines 15 to 22 of the algorithm. A negative reward is assigned between the current
                        and next states if the RL agent visits all the customers and ends at a different customer
                        than the starting customer, as in lines 15 and 16 of the algorithm. In addition, the
                        reward table is updated with a negative reward if the RL agent visits the same customer
                        more than once, as in lines 19 and 20 of the algorithm. On the other hand, a positive
                        reward is assigned if the RL agent visits every customer other than the starting customer
                        exactly once and returns to the starting customer, as in lines 17 and 18 of the algorithm.
                        Then, the next to visit customer is removed from the to-visit customers list, as in
                        lines 21 and 22 of the algorithm. The RL agent subsequently updates the previous Q-value,
                        based on the current and next customer states and using Eq. (2), as in line 23 of the algorithm. Also, the RL agent replaces the current state with
                        the next state and decrements the max_steps, as in line 26 of the algorithm. Finally,
                        the number of customer locations visited (VD) by the RL agent is counted after the
                        Q-learning process is terminated. Further, the global route and global route distance
                        variables are calculated based on the VD, as in lines 28 to 35 of the algorithm. If
                        the VD is equal to the previous global_visited_path, no changes in the best reward
                        and the Q-routing tables are made. Otherwise, if the VD is greater than the previous
                        global_visited_path, the length of the global route distance is set to negative infinity,
                        and the best reward and Q-routing tables are replaced by the episode reward and Q-routing
                        tables, respectively, as in lines 29 to 31 of the algorithm. This step is followed
                        by assigning a global_visited_path equal to the VD. Further, if the episode route
                        distance (episode_distance) is greater than the global route distance, then the global
                        route distance is replaced by the episode route distance, and the best reward and
                        Q-routing tables are replaced by the episode reward and Q-routing tables, respectively,
                        as in lines 33 to 35 of the algorithm. The epsilon value is then updated after the
                        termination of the RL agent’s route search, as in lines 36 and 37 of the algorithm.
                        Finally, the RL agent generates the shortest route and distance after the Q-value
                        update for all the possible episodes is completed, based on Algorithm 3. An example of the shortest route generation is shown in Figs. 1(e) and 1(f). Fig. 1(e) shows the Q-value transition between each customer and this customer’s nearest neighbor
                        customers. Subsequently, the shortest route from any customer to other customers is
                        generated based on Algorithm 3 and the final Q-routing table from Fig. 1(e).
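
                     The sketch below condenses the episode loop of Algorithm 2 as described above: the agent walks over customer-to-customer transitions, receives a penalty for revisits or premature returns and a bonus for completing the tour, and the best-covering, shortest episode route is retained. The concrete reward values and `max_steps` default are illustrative, the episode count and epsilon schedule follow Table 3, and the distance table from Algorithm 1 is assumed to be symmetrized so that every customer has outgoing transitions; `epsilon_greedy` and `q_update` are the helpers sketched earlier.

```python
def train_route(dist_table, customers, start, episodes=40000, max_steps=200,
                epsilon=1.0, eps_decay=0.9999, eps_min=0.05):
    """Condensed sketch of Algorithm 2: learn a Q-routing table over the
    customer-to-customer transitions in dist_table and keep the best episode
    (most customers visited, then shortest distance)."""
    moves = {}                                           # customer -> reachable neighbor customers
    for (u, v) in dist_table:
        moves.setdefault(u, []).append(v)
    Q = {edge: 0.0 for edge in dist_table}
    best = {"visited": -1, "dist": float("inf"), "route": None}
    eps = epsilon
    for _ in range(episodes):
        to_visit = set(customers) - {start}
        s, route, dist, steps = start, [start], 0.0, max_steps
        while steps > 0:
            acts = moves.get(s, [])
            if not acts:                                 # no outgoing transition: stop
                break
            a = epsilon_greedy(Q, s, acts, eps)
            done = (a == start and not to_visit)
            if done:
                r = 1000.0                               # completed tour: positive reward (illustrative)
            elif a == start or a not in to_visit:
                r = -1000.0                              # revisit or premature return: penalty (illustrative)
            else:
                r = -dist_table[(s, a)]                  # otherwise pay the travel distance
            q_update(Q, s, a, r, a, moves.get(a, []))
            dist += dist_table[(s, a)]
            to_visit.discard(a)
            route.append(a)
            s, steps = a, steps - 1
            if done:
                break
        visited = len(customers) - len(to_visit)
        # Keep the episode that covers the most customers; break ties by shorter distance.
        if visited > best["visited"] or (visited == best["visited"] and dist < best["dist"]):
            best = {"visited": visited, "dist": dist, "route": route}
        eps = max(eps * eps_decay, eps_min)
    return best["route"], best["dist"], Q
```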
                     
                     Algorithm 3 is for the path/route generation and the corresponding distance calculation. The
                        inputs to the algorithm are the updated Q-value table, starting and ending nodes,
                        and distance table (the road length graph for the path generation and the distance
                        routing table for the route generation), as in line 1 of the algorithm. The output
                        of the algorithm is the path/route generated and its corresponding distance, as in
                        line 2 of the algorithm. The working of the algorithm is shown in lines 3 to 14 of
                        the algorithm. The current state, path, and distance are assigned their initial values
                        in lines 3 to 5 of the algorithm. In addition, the variable ‘goal’ is set to False
                        initially, as in line 6 of the algorithm. Notably, the RL agent calculates the path/route
                        according to lines 7 to 13 of the algorithm. In particular, the RL agent goes through
                        the ‘while loop’ until the terminating condition is met. The RL agent then calculates
                        the next state from the current state by using the Q-value table through exploitation,
                        as in lines 8 and 9 of the algorithm. Subsequently, the path and distance are updated,
                        as in lines 10 and 11 of the algorithm. In addition, the RL agent checks if the next
                        state is the end state during each ‘while loop’, as in line 12 of the algorithm. If
                        the next state is the end state, the variable ‘goal’ is set to ‘True’, as in line
                        13 of the algorithm, and the ‘while loop’ terminates. Finally, the path/route and
                        its corresponding distance are returned by the algorithm.
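
                     A minimal sketch of this greedy extraction is given below; it follows the highest Q-value transition from the start state until the end state is reached, mirroring lines 7 to 13 of Algorithm 3. The linear scan over the Q-table keys and the `max_len` safeguard are illustrative simplifications.

```python
def extract_path(Q, start, end, dist_table, max_len=10000):
    """Sketch of Algorithm 3: greedily follow the highest Q-value transition
    from `start` until `end` is reached, accumulating the traveled distance."""
    s, path, total, goal = start, [start], 0.0, False
    while not goal and len(path) < max_len:
        # Exploit the Q table: pick the successor of s with the largest Q-value.
        candidates = [(q, a) for (u, a), q in Q.items() if u == s]
        if not candidates:
            break                                  # no outgoing transition from s
        _, nxt = max(candidates)
        total += dist_table[(s, nxt)]
        path.append(nxt)
        if nxt == end:
            goal = True                            # terminating condition (lines 12-13)
        s = nxt
    return path, total
```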
                     
                   
                
             
            
                  4. Experiments
               
                     4.1 Dataset
                  We train the RL agent under the DQSR algorithms and evaluate the DQSR’s effectiveness
                     by conducting a case study on the DQSR-based, illegal-parking surveillance in Yeonsu-gu,
                     Incheon, South Korea. Two types of data were required for this case study: (i) the
                     road network with the road information of the Yeonsu-gu region and (ii) the illegal-parking
                     locations in this region. The road network was generated using the Quantum Geographical
                     Information System (QGIS), a free and open-source, cross-platform desktop GIS application
                     that supports the viewing, editing, and analysis of geospatial data. Further, we collected
                     the details of parking violations in the Yeonsu-gu region for the period of 2017-2019,
                     specifically between 07:00-09:00 on weekdays. The dataset used in this case study
                     is shown in Fig. 2. The ‘pink’ line shows the region of concern, i.e., the Yeonsu-gu
                     region, and the ‘black’ lines show the road network of this region. Also, the ‘green’
                     dots in Fig. 2 show the respective locations of illegal parking (by the customers)
                     that need to be visited by the RL agent to investigate the parking violation, and
                     the ‘blue’ lines show the road links with illegal parking reported in these road links.
                     In addition, a numerical value adjacent to a road link indicates this road link’s
                     length. Fig. 2 also shows that most parking violations were near a road link and not
                     on a road link. Therefore, we selected the road link nearest to each respective customer
                     (with a corresponding illegal-parking location) for surveillance. The detailed feature
                     representation of the dataset is shown in Table 1. In particular, the dataset consists of the road network information, such as the
                     link id, link starting and ending nodes, road link length, longitude and latitude
                     of each link’s starting node, and road link (i.e., customer) nearest to the illegal-parking
                     location. In addition, Table 2 shows the basic information on the dataset. Meanwhile, Yeonsu-gu, Incheon, has an
                     area of 50.8 km$^{2}$ and includes 2153 road links with a total road network length
                     of 557.5~km.
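
                  As an illustration of how such a dataset can be consumed, the sketch below loads the road-link table with pandas and collects the customer links and their GPS coordinates; the file name is hypothetical, and the column labels follow Table 1.

```python
import pandas as pd

# Hypothetical file name; column labels follow Table 1, with '-' marking empty cells.
links = pd.read_csv("yeonsu_road_links.csv", na_values="-")

# Road links flagged as customers (the link nearest to an illegal-parking location).
customer_links = links.dropna(subset=["Customer index"])
print(len(links), "road links,", len(customer_links), "customer links")

# GPS coordinates of each customer link's starting node, as used when
# computing the nearest neighbor customers in Algorithm 1.
customer_coords = {int(row["Customer index"]): (row["Latitude"], row["Longitude"])
                   for _, row in customer_links.iterrows()}
```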
                  
                  
                        Fig. 2. Dataset Visualization. The ‘pink’ line shows the Yeonsu-gu region’s boundary, the ‘black’ lines show the road network with road length information, the ‘green’ dots represent the respective illegal-parking (by the customers) locations that need to be included in the route, and the ‘blue’ lines show the road links with illegal parking reported in these links.
 
                  
                        Table 1. The feature representation of the dataset used to evaluate the DQSR’s effectiveness.
                     
                           
                              
                                 | S. No. | LINK_ID | S_NODE | E_NODE | Length (m) | Longitude | Latitude | IP | Customer index | 
                        
                        
                              
                                 | 1. | 1610309000 | 1610019100 | 1610119400 | 442.095 | 123.623 | 37.4323 | - | - | 
                           
                                 | 2. | 1610067301 | 1610019100 | 1610090800 | 426.306 | 123.623 | 37.4323 | - | - | 
                           
                                 | 3. | 1610067100 | 1610019100 | 1610019300 | 261.743 | 123.623 | 37.4323 | 13 | 0 | 
                           
                                 | 4. | 1640269200 | 1610118900 | 1640085900 | 88.833 | 126.599 | 37.4187 | - | - | 
                           
                                 | 5. | 1610294600 | 1610118900 | 1610118700 | 197.158 | 126.599 | 37.4187 | - | - | 
                           
                                 | 6. | 1640001700 | 1630002600 | 1640016500 | 230.725 | 126.697 | 37.4361 | - | - | 
                           
                                 | 7. | 1630012200 | 1630002600 | 1630003000 | 329.310 | 126.697 | 37.4361 | 24 | 1 | 
                           
                                 | 8. | 1630012000 | 1630002600 | 1630003900 | 575.696 | 126.697 | 37.4361 | - | - | 
                           
                                 | 9. | 1640002000 | 1630003100 | 1640016600 | 361.970 | 126.681 | 37.4161 | - | - | 
                           
                                 | 10. | 1640002100 | 1630003100 | 1640016700 | 243.891 | 126.681 | 37.4161 | 18 | 2 | 
                           
                                 | … | … | … | … | … | … | … | … | … | 
                        
                     
                   
                  
                        Table 2. The basic information of the dataset used to evaluate the DQSR’s effectiveness.
                     
                           
                              
                                 | S. No. | Parameter | Value | 
                        
                        
                              
                                 | 1. | Region | Yeonsu-gu, Incheon, South Korea | 
                           
                                 | 2. | Total area | 50.8 km$^{2}$ | 
                           
                                 | 3. | Total road length | 557498.47 m | 
                           
                                 | 4. | Number of road links | 2153 | 
                           
                                 | 5. | Number of illegal parking | NA | 
                           
                                 | 6. | Number of customers | 135 | 
                        
                     
                   
                
               
                     4.2 Algorithms Compared and Simulation Parameters
                  The effectiveness and superiority of the DQSR algorithms are verified by comparing
                     these algorithms with two other algorithms. The first algorithm compared (Q+TSP) is
                     based on a Q-learning algorithm that generates the shortest path between each customer
                     and all other customers and a TSP algorithm that solves for the shortest route covering
                     all the customers (without revisiting any customer) and returning to the starting customer.
                     The second algorithm compared (Dijkstra+TSP) is based on the Dijkstra and Travelling
                     Salesman Problem algorithms, where the Dijkstra algorithm generates the meta-graph
                     (the shortest paths from each customer to all other customers), and the TSP algorithm
                     finds the shortest route covering all the customers (without revisiting any customer)
                     and returning to the starting customer.
                  
                  Table 3 summarizes the simulation parameters and their values for implementing the DQSR algorithms
                     and the two algorithms compared with the DQSR algorithms. The simulation parameters
                     for the DQSR are listed in rows 1 to 9 of Table 3. Notably, the DQSR's two Q-learning based algorithms have different
                     simulation parameter values; in particular, their respective epsilon decay rates and
                     numbers of training episodes differ greatly.
                  
                  The RL agent learns the Q-value table for 50 pairs of source and destination customers
                     in the DQSR’s Algorithm 1 and goes in a loop of 30 episodes per pair. Therefore, although the epsilon decay
                     rate and total episodes are set to 0.9 and 30, respectively, the RL agent has to go
                     through 1500 cycles before calculating the shortest path and distance. In contrast, the
                     DQSR's Algorithm 2 is tasked only with calculating the shortest route covering all 135 customers.
                     Since this algorithm searches for the non-repeating route with the smallest
                     distance, the epsilon decay rate and total episodes are set to 0.9999 and 40000, respectively,
                     for better learning of the RL agent.
                  
                  The Q+TSP consists of two algorithms. The first algorithm is Q-learning based and
                     is used for the shortest-path generation from all 134 other source customers to one
                     particular destination customer. Since some of the customers are far from the destination
                     customer, the epsilon decay rate and episodes are kept larger than in the DQSR's Algorithm 1; in particular, they are set to 0.95 and 40, respectively.
                     This process is repeated for all 135 customers. In addition, we used the TwoOpt_solver
                     algorithm, a Python library function that solves the TSP, as the Q+TSP's second
                     algorithm. Similarly, the Dijkstra+TSP consists of two algorithms. The first algorithm
                     is the Dijkstra algorithm, which calculates the shortest path and distance between
                     each customer and the other customers. The second algorithm is the TSP algorithm, which
                     solves for the shortest route connecting all the customers (by visiting each customer
                     once and returning to the source customer). In particular, we used a Python library
                     implementation of the Dijkstra algorithm operating on the network graph and the
                     TwoOpt_solver algorithm for solving the TSP. Notably, we implemented the DQSR algorithms
                     and the two algorithms compared using Python 3.7 on an Ubuntu 18.04.4 machine with one
                     NVIDIA TITAN Xp graphics card with 11 GB of GPU memory. The performance of each model
                     was unaffected by the use of the GPU.
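
                  For reference, the sketch below outlines the Dijkstra+TSP baseline: networkx's Dijkstra routine builds the customer meta-graph, and a plain 2-opt improvement loop stands in for the TwoOpt_solver function used in the paper; `customer_nodes` is a hypothetical list of customer node identifiers.

```python
import itertools
import networkx as nx

def dijkstra_meta_graph(G, customer_nodes):
    """All-pairs shortest-path distances between customer nodes (the meta-graph)."""
    dist = {}
    for u in customer_nodes:
        lengths = nx.single_source_dijkstra_path_length(G, u, weight="length")
        for v in customer_nodes:
            if v != u:
                dist[(u, v)] = lengths[v]
    return dist

def two_opt(route, dist):
    """Plain 2-opt improvement of a closed tour over the meta-graph distances."""
    def tour_len(r):
        return sum(dist[(r[k], r[k + 1])] for k in range(len(r) - 1))
    best = route[:]
    improved = True
    while improved:
        improved = False
        for i, j in itertools.combinations(range(1, len(best) - 1), 2):
            cand = best[:i] + best[i:j + 1][::-1] + best[j + 1:]   # reverse one segment
            if tour_len(cand) < tour_len(best):
                best, improved = cand, True
    return best, tour_len(best)

# Usage: the tour starts and ends at the first customer node.
# dist = dijkstra_meta_graph(G, customer_nodes)
# tour, length = two_opt(customer_nodes + [customer_nodes[0]], dist)
```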
                  
                  
                        Table 3. Simulation parameters.
                     
                           
                              
                                 | S. No. | Parameter | Shortest path between customers | Best route covering all the customers | 
                        
                        
                              
                                 | Dual-stage, Q-learning based, Shortest-route generation (DQSR) | 
                           
                                 | 1. | Algorithm | Q-learning | Q-learning | 
                           
                                 | 2. | Learning rate ($\alpha$) | 0.8 | 0.8 | 
                           
                                 | 3. | Epsilon ($\varepsilon$) | 1.0 | 1.0 | 
                           
                                 | 4. | Discount factor ($\gamma$) | 0.9 | 0.9 | 
                           
                                 | 5. | Epsilon decay rate | 0.9 | 0.9999 | 
                           
                                 | 6. | Minimum epsilon | 0.05 | 0.05 | 
                           
                                 | 7. | Episodes | 30 | 40000 | 
                           
                                 | 8. | Number of customers | 135 | 135 | 
                           
                                 | 9. | Number of nearest neighbor customers | 50 | - | 
                           
                                 | Q-learning for the shortest-path generation + Travelling Salesman Problem based, shortest-route generation (Q+TSP) | 
                           
                                 | 10. | Algorithm | Q-learning | TSP: TwoOpt_solver | 
                           
                                 | 11. | Learning rate ($\alpha$) | 0.8 | - | 
                           
                                 | 12. | Epsilon ($\varepsilon$) | 1.0 | - | 
                           
                                 | 13. | Discount factor ($\gamma$) | 0.9 | - | 
                           
                                 | 14. | Epsilon decay rate | 0.95 | - | 
                           
                                 | 15. | Minimum epsilon | 0.05 | - | 
                           
                                 | 16. | Episodes | 40 | - | 
                           
                                 | 17. | Number of customers | 135 | 135 | 
                           
                                 | 18. | Number of nearest neighbor customers | 134 | - | 
                           
                                 | Dijkstra algorithm for the shortest-path generation + Travelling Salesman Problem based, shortest-route generation (Dijkstra+TSP) | 
                           
                                 | 19. | Algorithm | Dijkstra algorithm | TSP: TwoOpt_solver | 
                           
                                 | 20. | Number of customers | 135 | 135 | 
                        
                     
                   
                
               
                     4.3 Results and Discussion
                  We present and discuss the results of the case study that used the DQSR, under various
                     parameters, and analyze the DQSR’s performance by comparing this performance to those
                     of the respective other two algorithms (Q+TSP and Dijkstra+TSP).
                  
                  
                        4.3.1 Route Visualization
                     The road network and the route information are the two data sets required for route
                        visualization in the QGIS application. The road network is generated through the QGIS
                        application or can be obtained from a government portal. Notably, we generated the road
                        network from scratch for Yeonsu-gu, Incheon, South Korea, and stored it as a shape
                        (.shp) file. The route result is generated using a Python program and contains three
                        data fields: (i) the list of road links or edges through which the RL agent should
                        travel; (ii) the sequence number denoting the order in which the agent should travel
                        them; and (iii) a flag separating the agent-selected road links from the total road
                        links list. The route information is finally stored as a comma-separated values
                        (CSV) file.
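
                     A minimal sketch of producing such a route-result file is given below; the field names mirror the three data fields described above and are otherwise illustrative, and keeping only the last sequence number when a link is traversed more than once matches the visualization discussed later in this section.

```python
import csv

def write_route_csv(all_links, route_links, out_file="route_result.csv"):
    """Write the route result consumed by QGIS: every road link id, the order in
    which the agent traverses it (if any), and a flag marking selected links."""
    # Later traversals overwrite earlier ones, so only the last sequence number
    # of a repeatedly visited link is kept.
    order = {link: seq for seq, link in enumerate(route_links, start=1)}
    with open(out_file, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["LINK_ID", "sequence", "flag"])
        for link in all_links:
            writer.writerow([link, order.get(link, ""), 1 if link in order else 0])
```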
                     
                     Meanwhile, the Python program and QGIS are not embedded together in this work.
                        Rather, the route result from the Python program is manually uploaded to the QGIS
                        application for route visualization. First, the shapefile containing the road network
                        information is loaded to the QGIS through the ‘Add Vector Layer’ function. In particular,
                        the road network layer consists of fields, such as the road link id, node id, location
                        information, etc. Similarly, the route information file is uploaded to QGIS through
                        the ‘Add Vector Layer’ function. It consists of fields like road link id, sequence
                        number, and flag. Next, we join the route layer to the road network layer based on
                        a common field. Consequently, two new data fields, namely sequence number and flag,
                        are added to the road network layer. Also, the road link id in the road network layer
                        is categorized into two groups based on the flag data field: (i) the one selected
                        for the agent to travel through and (ii) the one not included in the route planning.
                        This categorization of the road link is performed through the QGIS Symbology properties,
                        and the road sequence level’s categorization is performed through the QGIS Label properties.
                     
Fig. 3 shows the visualization of the shortest routes generated by the DQSR and Q+TSP algorithms, respectively, on a QGIS map. The ‘black’ lines represent the road network, the ‘green’ dots represent the illegal-parking locations, and the dotted ‘blue’ lines represent the customer links, i.e., the customers the DQSR-based RL agent must visit. Fig. 3(a) shows the shortest route covering all the customers and returning to the starting customer (‘red’ line). The numbers on the road network show the order in which the DQSR-based RL agent had to follow the road links to visit all the customers and return to the starting customer. The DQSR-based RL agent’s shortest route is 58155.14 m long, and the agent had to visit 600 of the total 2152 road links for this route. Importantly, the DQSR-based RL agent could travel through previously traveled road links without this being counted as a revisit to an already-visited customer; the agent was allowed to make such a choice if and only if the path through the revisited customer is the shortest path connecting two other customers. Fig. 3(b) shows the zoomed-in visualization of the region within the ‘purple circle’ of Fig. 3(a), which is easier to interpret. Since the DQSR-based RL agent visited some road links multiple times, only the last sequence number of each road link in the route is displayed in the image.
                     
                     
Fig. 3. Route visualization on QGIS: (a) the shortest route generated by the DQSR algorithm; (b) the zoomed-in visualization of the region within the ‘purple circle’ of Fig. 3(a); (c) the shortest route generated by the Q+TSP algorithm; (d) the zoomed-in visualization of the region within the ‘purple circle’ of Fig. 3(c). The dark purple arrows in Fig. 3(d) highlight some of the differences in the shortest routes generated by the DQSR and Q+TSP algorithms, respectively.
 
Fig. 3(c) shows the shortest route generated by the Q+TSP algorithm, represented by the ‘purple’ line. The numbers on the road network show the sequence in which the agent had to travel the road links to complete the shortest route. The Q+TSP algorithm required 55703.39 m for the route connecting all the customers, and the Q+TSP agent visited 597 of the total 2153 road links in generating this route. As with the DQSR, the Q+TSP agent was allowed to travel through the same road links without violating the condition on revisiting an already-visited customer. Fig. 3(d) is the zoomed-in image of the region within the ‘purple circle’ of Fig. 3(c), which makes the shortest route easier to follow. The dark purple arrows in Fig. 3(d) highlight some of the differences in the shortest routes generated by the DQSR and Q+TSP algorithms, respectively.
                     
Table 4 summarizes the algorithms’ respective performances. The table includes the physical distance of the shortest route generated by each algorithm and also provides the algorithms’ respective time complexities (computation times) in generating these routes. The Q+TSP algorithm provides the shortest route in terms of travel distance, approximately 55.7 km, whereas the DQSR’s shortest route exceeds the Q+TSP algorithm’s travel distance by approximately 3 km. However, this overshoot is merely about 0.5 % of the Yeonsu-gu region’s total road length of 557.49 km, so we can argue that both the DQSR and Q+TSP algorithms generate comparably short routes. When time complexity is taken into account, however, the DQSR is preferable to the other two algorithms. Table 4 shows that the DQSR required 237 s to calculate the shortest path between each customer and that customer’s nearest-neighbor customers and another 209 s to calculate the shortest route connecting all the customers, i.e., a total of 446 s. In contrast, the Q+TSP algorithm required 1412 s to calculate the shortest path between each customer and every other customer and 0.23 s to calculate the shortest route connecting all the customers, i.e., a total of 1412.23 s. Since the road network of our case study is very large, with very high network depth, the Dijkstra+TSP’s Dijkstra algorithm could not generate the shortest paths within 6 h, so we terminated the algorithm. A short calculation reproducing these figures is given after Table 4.
                     
                     
                           Table 4. The comparison of the respective shortest-routes generated by the DQSR, Q+TSP, and Dijkstra+TSP algorithms, in terms of shortest-route distance and time complexity.
                        
                              
                                 
| S. No. | Algorithm(s) | Parameter | Value | Time complexity (Algorithm 1) | Time complexity (Algorithm 2) |
| 1. | DQSR | Route length | 58.15 km | 237 s | 209 s |
| 2. | DQSR | Number of visited links | 600 | | |
| 3. | Q+TSP | Route length | 55.70 km | 1412 s | 0.23 s |
| 4. | Q+TSP | Number of visited links | 597 | | |
| 5. | Dijkstra+TSP | Route length | - | >6 h | - |
| 6. | Dijkstra+TSP | Number of visited links | - | | |
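The totals and ratios discussed above can be reproduced from the raw entries of Table 4 with a few lines of arithmetic; the short Python check below is only an illustration of those reported figures.

```python
# Stage-wise times reported in Table 4 (seconds).
dqsr_total = 237 + 209          # Algorithm 1 + Algorithm 2 -> 446 s
qtsp_total = 1412 + 0.23        # Algorithm 1 + Algorithm 2 -> 1412.23 s

speedup = qtsp_total / dqsr_total            # ~3.2x, i.e., roughly three times faster
overshoot_km = 58.15 - 55.70                 # ~2.45 km of extra travel distance
share = overshoot_km / 557.49 * 100          # ~0.44 % of the total road length

print(f"DQSR: {dqsr_total} s, Q+TSP: {qtsp_total:.2f} s, speedup ~{speedup:.1f}x")
print(f"Extra distance: {overshoot_km:.2f} km ({share:.2f} % of 557.49 km)")
```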
                           
                        
                      
                   
                
             
            
                  5. Conclusion
This paper elaborated on all aspects of the proposed dual-stage, Q-learning-based shortest-route generation (DQSR) and presented its training process. This shortest-route generation was applied to illegal-parking surveillance in Yeonsu-gu, Incheon, South Korea. The experimental analysis of this application shows that our DQSR algorithm performs better than the two other shortest-route generation algorithms in terms of time complexity, while achieving almost the same shortest-route distance. In particular, the comparison shows that our DQSR was about three times faster than the Q+TSP algorithm and produced a shortest route approximately 5 % longer than that of the Q+TSP. The time complexity of our model can be reduced further by lowering the number of nearest-neighbor customers, which we fixed at 50 in this study, provided there is enough overlap of customers among the nearest-neighbor customer groups to generate the shortest route. In future work, we plan to increase and decrease the road network size and evaluate the DQSR’s performance in terms of time complexity and shortest-route distance.
               
             
          
         
            
                  ACKNOWLEDGMENTS
               
This work was supported by an Incheon National University Research Grant in 2017.
                  			
               
             
            
                  
                     REFERENCES
                  
                     
                        
                        Nazari M., Oroojlooy A., Takáč M., Snyder L. V., 2018, Reinforcement Learning for
                           Solving the Vehicle Routing Problem, In Proceedings of the 32nd International Conference
                           on Neural Information Processing Systems (NIPS'18), Curran Associates Inc., Red Hook,
                           NY, USA, pp. 9861-9871

 
                     
                        
                        Singh A. K., Pamula R., 2021, An Efficient and Intelligent Routing Strategy for Vehicular
                           Delay Tolerant Networks, Wireless Network, Vol. 27, pp. 383-400

 
                     
                        
Geetha M., Ganesan R., 2021, CEPRAN-Cooperative Energy Efficient and Priority Based Reliable Routing Protocol with Network Coding for WBAN, Wireless Pers. Commun., Vol. 117, pp. 3153-3171

 
                     
                        
                        Liu D., Cui J., Zhang J., Yang C., Hanzo L., 2021, Deep Reinforcement Learning Aided
                           Packet-Routing for Aeronautical Ad-Hoc Networks Formed by Passenger Planes, IEEE Transactions
                           on Vehicular Technology, Vol. 70, No. 5, pp. 5166-5171

 
                     
                        
                        Yao H., Mai T., Jiang C., Kuang L., Guo S., 2019, AI Routers & Network Mind: A Hybrid
                           Machine Learning Paradigm for Packet Routing, IEEE Computational Intelligence Magazine,
                           Vol. 14, No. 4, pp. 21-30

 
                     
                        
                        Roig P. J., Alcaraz S., Gilly K., Bernad C., Juiz C., 2022, Arithmetic Framework to
                           Optimize Packet Forwarding among End Devices in Generic Edge Computing Environments,
                           Sensors, Vol. 22, No. 421

 
                     
                        
                        Waqas A., Saeed N., Mahmood H., Almutiry M., 2022, Distributed Destination Search
                           Routing for 5G and beyond Networks, Sensors, Vol. 22, No. 472

 
                     
                        
                        Alhussein  O.,  et al., 2018, Joint VNF Placement and Multicast Traffic Routing in
                           5G Core Networks, 2018 IEEE Global Communications Conference (GLOBECOM), pp. 1-6

 
                     
                        
                        Celik A., Tetzner J., Sinha K.,  et al., 2019, 5G Device-To-Device Communication Security
                           and Multipath Routing Solutions, Appl Netw Sci, Vol. 4, No. 102

 
                     
                        
Feng L., et al., 2021, Explicit Evolutionary Multitasking for Combinatorial Optimization: A Case Study on Capacitated Vehicle Routing Problem, IEEE Transactions on Cybernetics, Vol. 51, No. 6, pp. 3143-3156

 
                     
                        
                        Zhang J., Yang F., Weng X., 2019, An Evolutionary Scatter Search Particle Swarm Optimization
                           Algorithm for the Vehicle Routing Problem With Time Windows, IEEE Access, Vol. 6,
                           pp. 63468-63485

 
                     
                        
                        Li J., Wang F., He Y., 2020, Electric Vehicle Routing Problem with Battery Swapping
                           Considering Energy Consumption and Carbon Emissions, Sustainability, Vol. 12, No.
                           10537

 
                     
                        
Konstantakopoulos G. D., Gayialis S. P., Kechagias E. P., Papadopoulos G. A., Tatsiopoulos I. P., 2020, A Multiobjective Large Neighborhood Search Metaheuristic for the Vehicle Routing Problem with Time Windows, Algorithms, Vol. 13, No. 243

 
                     
                        
                        Ngo T. G., Dao T. K., Thandapani J., Nguyen T. T., Pham D. T., Vu V. D., 2021, Analysis
                           Urban Traffic Vehicle Routing Based on Dijkstra Algorithm Optimization, In: Sharma
                           H., Gupta M. K., Tomar G. S., Lipo W. (eds) Communication and Intelligent Systems.
                           Lecture Notes in Networks and Systems, Springer, Singapore, Vol. 204

 
                     
                        
                        Zheng Y.-L., Song T.-T., Chai J.-X., Yang X.-P., Yu M.-M., Zhu Y.-C., Liu Y., Xie
                           Y.-Y., 2021, Exploring a New Adaptive Routing Based on the Dijkstra Algorithm in Optical
                           Networks-on-Chip, Micromachines, Vol. 12, No. 54

 
                     
                        
                        El-Sherbeny N. A., 2010, Vehicle Routing with Time Windows: An Overview of Exact,
                           Heuristic and Metaheuristic Methods. J. King Saud Univ. Sci., Vol. 22, pp. 123-131

 
                     
                        
Neupane D., Seok J., 2020, A Review on Deep Learning-Based Approaches for Automatic Sonar Target Recognition, Electronics, Vol. 9, No. 1972

 
                     
                        
                        Bhandari S., Ranjan N., Khan P., Kim H., Hong Y.-S., 2021, Deep Learning-Based Content
                           Caching in the Fog Access Points, Electronics, Vol. 10, No. 512

 
                     
                        
                        Bhandari S., Kim H., Ranjan N., Zhao H. P., Khan P., 2020, Best Cache Resource Based
                           on Deep Neural Network for Fog Radio Access Networks, Journal of Internet Technology,
                           Vol. 21, No. 4, pp. 967-975

 
                     
                        
                        Lazakis I., Khan S., 2021, An Optimization Framework for Daily Route Planning and
                           Scheduling of Maintenance Vessel Activities in Offshore Wind Farms, Ocean Engineering,
                           Vol. 225, No. 108752

 
                     
                        
                        Chen C., Jiang J., Lv N., Li S., 2020, An Intelligent Path Planning Scheme of Autonomous
                           Vehicles Platoon Using Deep Reinforcement Learning on Network Edge, IEEE Access, Vol.
                           8, pp. 99059-99069

 
                     
                        
                        Prianto E., Kim M., Park J.-H., Bae J.-H., Kim J.-S., 2020, Path Planning for Multi-Arm
                           Manipulators Using Deep Reinforcement Learning: Soft Actor-Critic with Hindsight Experience
                           Replay, Sensors, Vol. 20, No. 5911

 
                     
                        
Lu E. H-C., Ya-Wen Y., 2019, A Hybrid Route Planning Approach for Logistics with Pickup and Delivery, Expert Syst. Appl., Vol. 118, pp. 482-492

 
                     
                        
                        Tong B., Wang J., Wang X., Zhou F., Mao X., Zheng W., 2022, Best Route Planning for
                           Truck-Drone Delivery Using Variable Neighborhood Tabu Search Algorithm, Appl. Sci.,
                           Vol. 12, No. 529

 
                     
                        
                        Vahdanjoo M., Zhou K., Sørensen C. A. G., 2020, Route Planning for Agricultural Machines
                           with Multiple Customers: Manure Application Case Study, Agronomy, Vol. 10, No. 1608

 
                     
                        
                        Ranjan N., Bhandari S., Khan P., Hong Y.-S., Kim H., 2021, Large-Scale Road Network
                           Congestion Pattern Analysis and Prediction Using Deep Convolutional Autoencoder, Sustainability,
                           Vol. 13, No. 5108

 
                     
                        
                        Ranjan N., Bhandari S., Zhao H. P., Kim H., Khan P., 2020, City-Wide Traffic Congestion
                           Prediction Based on CNN, LSTM and Transpose CNN, IEEE Access, Vol. 8, pp. 81606-81620

 
                     
                        
                        Li Z., Huang J., 2019, How to Mitigate Traffic Congestion Based on Improved Ant Colony
                           Algorithm: A Case Study of a Congested Old Area of a Metropolis, Sustainability, Vol.
                           11, No. 1140

 
                     
                        
                        Balakrishnan A., Banciu M., Glowacka K., Mirchandani P., 2013, Hierarchical Approach
                           for Survivable Network Design, Eur. J. Oper. Res., Vol. 225, pp. 223-235

 
                     
                        
                        Fredman M. L., Tarjan R. E., 1987, Fibonacci Heaps and Their Uses in Improved Network
                           Optimization Algorithms, J. ACM, Vol. 34, pp. 596-615

 
                     
                        
Qu T., Cai Z., 2015, A Fast Isomap Algorithm Based on Fibonacci Heap, In International Conference in Swarm Intelligence; Springer: Cham, Switzerland, pp. 225-231

 
                     
                        
                        Lu X., Camitz M., 2011, Finding the Shortest Paths by Node Combination, Appl. Math.
                           Comput., Vol. 217, pp. 6401-6408

 
                     
                        
                        Orlin J. B., Madduri K., Subramani K., Williamson M. A., 2010, Faster Algorithm for
                           the Single Source Shortest Path Problem with Few Distinct Positive Lengths, J. Discret.
                           Algorithms, Vol. 8, pp. 189-198

 
                     
                        
                        Jin W., Chen S., Jiang H., 2013, Finding the K Shortest Paths in a Time-schedule Network
                           with Constraints on Arcs, Comput. Oper. Res., Vol. 40, pp. 2975-2982

 
                     
                        
                        Asaduzzaman M., Geok T. K., Hossain F., Sayeed S., Abdaziz A., Wong H.-Y., Tso C.
                           P., Ahmed S., Bari M. A., 2021, An Efficient Shortest Path Algorithm: Multi-Destinations
                           in an Indoor Environment, Symmetry, Vol. 13, No. 421

 
                     
                        
Hart P. E., Nilsson N. J., Raphael B., 1968, A Formal Basis for the Heuristic Determination of Minimum Cost Paths, IEEE Transactions on Systems Science and Cybernetics, Vol. 4, No. 2, pp. 100-107

 
                     
                        
                        Ben Ticha H., Absi N. A., 2017, Solution Method for the Multi-Destination Bi-Objectives
                           Shortest Path Problem, Elsevier: Amsterdam, The Netherlands

 
                     
                        
Bast H., Funke S., Sanders P., Schultes D., 2007, Fast Routing in Road Networks with Transit Nodes, Science, Vol. 316, No. 5824, p. 566

 
                     
                        
                        Shen B., Cheema M. A., Harabor D. D., Stuckey P. J., 2021, Contracting and Compressing
                           Shortest Path Databases, Proceedings of the International Conference on Automated
                           Planning and Scheduling, Vol. 31, No. 1, pp. 322-330

 
                     
                        
Bentley J. L., 22-24 January 1990, Experiments on Traveling Salesman Heuristics, In Proceedings of the First Annual ACM-SIAM Symposium on Discrete Algorithms, San Francisco, CA, USA, pp. 91-99

 
                     
                        
                        Clarke G., Wright J. W., 1964, Scheduling of Vehicles from a Central Customer to a
                           Number of Delivery Points, Oper. Res., Vol. 12, pp. 568-581

 
                     
                        
Mele U. J., Gambardella L. M., Montemanni R., 2021, Machine Learning Approaches for the Traveling Salesman Problem: A Survey, Proceedings of the 8th International Conference on Industrial Engineering and Applications 2021 (Europe), pp. 182-186

 
                     
                        
                        Zhang R., Prokhorchuk A., Dauwels J., 2020, Deep Reinforcement Learning for Traveling
                           Salesman Problem with Time Windows and Rejections, 2020 International Joint Conference
                           on Neural Networks (IJCNN), pp. 1-8

 
                     
                        
                        Reinelt G., 1991, TSPLIB-A Traveling Salesman Problem Library, INFORMS Journal on
                           Computing, Vol. 3, No. 4, pp. 376-384

 
                     
                        
                        Usberti F. L., Franca P. M., France A. L. M., 2011, The Open Capacitated Arc Routing
                           Problem, Computers & Operations Research, Vol. 38, No. 11, pp. 1543-1555

 
                     
                        
                        Minieka E., 1979, The Chinese Postman Problem for Mixed Networks, Management Science,
                           Vol. 25, No. 7, pp. 609-707

 
                   
                
             
            Author
            
            
               			Navin Ranjan received his Bachelor's degree from Kathmandu University, Nepal, in
               2016 and his Master's degree from Incheon National University, South Korea, in 2021.
               He worked as a Research Assistant at Industry-University Cooperation Foundation, Incheon
               National University, South Korea (2021~2022). His research interests include image
               processing, computer vision, scene understanding, and reinforcement learning.
               		
            
            
            
Sovit Bhandari received a B.E. degree from Kathmandu University, Nepal, and an M.S. degree from Incheon National University, South Korea, in 2016 and 2020, respectively. He worked as a Core Network Engineer at Huawei Technologies Nepal Pvt. Ltd. from Jan. 2018 to Aug. 2018, and since Sept. 2020, he has been working as a Research Assistant at the IoT and Big Data Research Center, Incheon National University, South Korea. His research interests include radio resource management, 5G and beyond mobile communications, machine learning, the Internet of Things, image processing, and computer vision.
               		
            
            
            
Yeong-Chan Kim received his Bachelor’s degree from Daelim University, South Korea, in 2021. He is currently pursuing his M.S. degree at Incheon National University, South Korea. His research interests include data science, deep learning, next-generation mobile communication systems, image processing, and computer vision.
               		
            
            
            
Hoon Kim has been working as an Associate Professor at the Department of Electronics Engineering, Incheon National University, South Korea, since 2008. His research interests include radio resource management, optimization techniques, 5G mobile communication systems, machine learning, and the Internet of Things. He is a Member of KICS, IEIE, IEEE, and IEICE.