Smart tourism information can be used to improve the tourism experience, provide personalized
suggestions, and improve the operational efficiency of tourism practitioners. This
section describes the technical means used in the smart tourism information search
method designed in this study.
3.1 Design of Mining Method based on Operator Data
With the rapid development of the global tourism industry, smart tourism has become
a field of considerable interest. Operator data include user communication records, location
information, and other travel-related data, providing an important data foundation for
building smart tourism systems (Nimrah and Saifullah, 2022; Huang et al., 2022). Making
use of operator data to achieve smart tourism, however, requires powerful DM and
information retrieval technologies (Maihulla et al., 2022). The Apriori
algorithm can mine frequent itemsets in a dataset and generate association
rules, and it has relatively loose requirements for data preprocessing. Therefore, the tourism
DM method in this study is designed based on Apriori. Fig. 1 presents the basic flow of the Apriori-based design.
Fig. 1. Basic flow of Apriori.
The ultimate goal of the designed algorithm is to generate frequent itemsets that
cannot be extended further (Fig. 1). The number of candidate-generation rounds is initially set to three. The
algorithm first scans the database and counts the occurrences of the current-level
candidate itemsets. After discarding the candidates that do not meet the requirements,
it generates the next-level candidate set from the current-level frequent itemsets. After
counting the occurrences of the third-level candidate itemsets, the method
determines whether new frequent itemsets can be generated. If so,
it continues to generate a new candidate set and scans the database to count the
occurrences of the candidates. If not, the algorithm terminates. Frequent itemsets
are determined by support, with a minimum support value set in advance: an itemset
whose support is greater than the minimum support value is deemed frequent.
The support calculation is expressed as Eq. (1).
$$S\left(X\rightarrow Y\right)=\frac{N\left(X,Y\right)}{N} \qquad (1)$$
where $S$ represents support; $X$ and $Y$ represent itemsets that do not intersect
with each other; $N\left(X,Y\right)$ is the number of records containing both $X$ and $Y$;
and $N$ is the total number of records. A candidate hash tree is constructed with a hash
function for selecting frequent items. The entire dataset is scanned to obtain
all possible frequent itemsets and extract the candidate frequent itemsets.
The method then compares the candidate frequent itemsets with the data in the
hash tree to calculate the confidence level of the frequent itemsets. The confidence
level calculation is shown in Eq. (2).
$$c\left(X\rightarrow Y\right)=\frac{N\left(X,Y\right)}{N\left(X\right)} \qquad (2)$$
where $c$ represents the confidence level. When constructing a dataset, each item
must be sorted according to specific rules derived from the association relationships
between the elements. When there is a subset of the $X$-term set, the subset representation
is expressed as Eq. (3).
where $X'$ represents a subset of the $X$-term set, and $K$ is an alternate item.
Eq. (4) expresses the constraint with minimum confidence.
where $\alpha $ represents the minimum confidence level. In a set containing multiple
items, the total number of rules is calculated using Eq. (5).
where $R$ is the total number of rules, and $d$ is the number of items in the set.
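The level-wise search, the support threshold, and the confidence constraint described above can be illustrated with a minimal Python sketch. It is not the implementation used in this study: it assumes the standard definitions of support and confidence, omits the candidate hash tree, and assumes that the total rule count takes the standard combinatorial form $R=3^{d}-2^{d+1}+1$. The function names `support`, `apriori`, `rules`, and `total_rules` are hypothetical.

```python
from itertools import combinations


def support(itemset, transactions):
    """Share of records that contain every item in `itemset` (N(X, Y) / N)."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def apriori(transactions, min_support):
    """Level-wise search; returns {frozenset: support} for all frequent itemsets."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]  # candidate 1-itemsets
    frequent = {}
    k = 1
    while level:
        # Scan the data, count each candidate, and keep those whose support
        # is strictly greater than the threshold (as stated in the text).
        counted = {c: support(c, transactions) for c in level}
        kept = {c: s for c, s in counted.items() if s > min_support}
        frequent.update(kept)
        # Join step: unions of two frequent k-itemsets that contain k + 1 items.
        k += 1
        candidates = {a | b for a in kept for b in kept if len(a | b) == k}
        # Prune step: every (k - 1)-subset of a candidate must be frequent.
        level = [c for c in candidates
                 if all(frozenset(s) in kept for s in combinations(c, k - 1))]
    return frequent


def rules(frequent, min_confidence):
    """Return (X, Y, confidence) triples with confidence S(X and Y) / S(X)."""
    found = []
    for itemset, s in frequent.items():
        for r in range(1, len(itemset)):
            for x in map(frozenset, combinations(itemset, r)):
                confidence = s / frequent[x]
                if confidence >= min_confidence:
                    found.append((x, itemset - x, confidence))
    return found


def total_rules(d):
    """Assumed standard count of possible rules over d items."""
    return 3 ** d - 2 ** (d + 1) + 1
```

For three items the assumed count gives 12 possible rules, which illustrates why the minimum support and minimum confidence constraints are needed to keep the rule set manageable.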
When conducting tourism DM based on operator data, it is necessary to mine the tourists’
travel routes and trends and to predict the trends for future time periods accordingly. The
data analysis requires constructing a binary attribute transaction set, as
shown in Fig. 2.
Fig. 2. Binary attribute transaction set construction procedure.
Fig. 2 shows that constructing a binary attribute transaction set requires presetting
the transaction categories and collecting information records from the system.
Accessed and non-accessed transactions are represented by 1 and 0, respectively,
which yields a binary attribute transaction set matrix. The Apriori algorithm for
DM based on operator data then performs multiple iterations with itemset filtering to mine
frequent itemsets containing binary attributes and to extract valuable association rules
and useful information from the data.
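The construction of the binary attribute transaction set matrix can be sketched as follows. This is a minimal illustration under assumed inputs: the category names and records are hypothetical placeholders, and the helper names `build_binary_matrix` and `rows_to_transactions` do not come from the study.

```python
def build_binary_matrix(records, categories):
    """Return one 0/1 row per record: 1 if the category was accessed, else 0."""
    return [[1 if c in visited else 0 for c in categories] for visited in records]


def rows_to_transactions(matrix, categories):
    """Convert 0/1 rows back into item sets that an Apriori routine can consume."""
    return [frozenset(c for c, flag in zip(categories, row) if flag)
            for row in matrix]


# Hypothetical preset transaction categories and collected records.
categories = ["museum", "lake", "old_town"]
records = [{"museum", "lake"}, {"old_town"}, {"museum", "lake", "old_town"}]

matrix = build_binary_matrix(records, categories)
# matrix == [[1, 1, 0], [0, 0, 1], [1, 1, 1]]
```

Keeping the column order fixed by `categories` means each column can be read as one binary attribute across all records.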
3.2 Optimized Design of Improved Apriori Information Mining Method
The number of tourists constantly changes, and the data volume
increases significantly on certain special dates. The Apriori algorithm constructed
above risks a high load when processing
large-scale data, which could reduce its operational efficiency. Hence, optimization
is needed (Taher et al., 2021). In addition, although some mined association rules meet
the conditions for strong association, they are unsuitable for practical application scenarios and
may mislead subsequent data analysis. Therefore, inappropriate
association rules must be excluded (Eskandari et al., 2022). To make the mined data better
fit real scenarios, the order in which tourists visit different locations is used
as a constraint, because the data from the first location visited best reflect the priority
of different travel locations. The matching conditions for the statistics are expressed
as Eq. (6).
where $V$, $S$, $lac$, and $ci$ are the tourist, tourist destination, location code,
and sector, respectively. Eq. (7) expresses the first set of tourist destinations and the frequent $k$-itemsets obtained.
where $first\_v\_sec$ is the first set of tourist destinations, and $k_{item}$ is
a frequent $k$-itemset. The constraint that the $k$-itemset will no longer be
calculated backward is expressed as Eq. (8).
where $X_{i}$ represents an itemset with fewer than $k$ items. The conditions
under which two tourist destinations are not strongly correlated are expressed as Eq. (9).
where $A$ represents the set of tourist destinations, and $A_{i}$ represents
any element of that set. If none of the items in the frequent $k$-itemset
meets the condition for the first tourist location, the conditions
for continuing to mine the $(k+1)$-itemsets are expressed as Eq. (10).
where $T$ represents the label attribute. Different scenarios place different
requirements on the original dataset and the association rules. Hence, the way
the degree of correlation is described also varies, which affects the computational
efficiency of the algorithm. This study introduces a correlation measure to reduce the number
of candidate sets and improve computational efficiency. The association
rules need to satisfy Eq. (11).
where $B$ represents any set of tourist destinations outside set $A$. Eq.
(12) expresses whether the visits to two tourist destinations are independent.
where $P$ is the probability of tourists visiting a certain tourist destination. The
association rule based on the probability correlation is expressed
as Eq. (13).
where $P\left(A\rightarrow B\right)$ is the probability correlation between two tourist
destinations. Eq. (14) expresses the correlation analysis between itemsets.
When two itemsets are positively correlated, the occurrence of one increases the
probability of the other, and the conditions for constructing strong association rules are expressed
as Eq. (15).
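As an illustration of the correlation test, the following Python sketch assumes that the probability correlation takes the standard lift form, $P\left(A,B\right)/\left(P\left(A\right)P\left(B\right)\right)$, where a value of 1 indicates independence and a value above 1 indicates positive correlation. This form is an assumption and is not necessarily the exact expression of Eqs. (12) to (15); the function names are hypothetical.

```python
def probability(itemset, transactions):
    """Share of tourists whose visit record contains every destination in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def lift(a, b, transactions):
    """Assumed correlation measure: P(A and B) / (P(A) * P(B))."""
    p_a = probability(a, transactions)
    p_b = probability(b, transactions)
    if p_a == 0 or p_b == 0:
        return 0.0
    return probability(a | b, transactions) / (p_a * p_b)


def is_strong_rule(a, b, transactions, min_support, min_confidence):
    """Keep A -> B only if it passes support, confidence, and positive correlation."""
    p_a = probability(a, transactions)
    if p_a == 0:
        return False
    p_ab = probability(a | b, transactions)
    confidence = p_ab / p_a
    return (p_ab > min_support
            and confidence >= min_confidence
            and lift(a, b, transactions) > 1.0)
```

Under this assumed measure, a rule can pass the support and confidence thresholds and still be rejected because the two destinations are negatively correlated, which is the kind of misleading rule the optimization aims to exclude.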
Rules that do not satisfy Eq. (15) are filtered out. A parallel computing framework is introduced to optimize the Apriori algorithm and
further improve its computational speed in tourism information mining.
Fig. 3 presents the parallel operation method.
Fig. 3. Parallel operation mode.
Parallel operation divides the dataset to be processed into small blocks
according to the number of CPU threads and the memory capacity (Fig. 3). The blocks are then grouped, and the blocks in each group
are fed to the CPU threads and memory allocated to them for parallel computation.
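The block-wise division can be sketched in Python with the standard library. This is only an illustration of the partition-count-merge pattern, not the framework used in this study: it sizes the blocks by the number of workers only (memory capacity is ignored), and it uses a thread pool for portability, although in CPython threads generally do not speed up CPU-bound counting. The function names are hypothetical.

```python
import os
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(transactions, n_blocks):
    """Split a list into at most n_blocks contiguous chunks of near-equal size."""
    n_blocks = max(1, min(n_blocks, len(transactions) or 1))
    size, extra = divmod(len(transactions), n_blocks)
    blocks, start = [], 0
    for i in range(n_blocks):
        end = start + size + (1 if i < extra else 0)
        blocks.append(transactions[start:end])
        start = end
    return blocks


def count_block(block):
    """Count item occurrences inside one block."""
    return Counter(item for t in block for item in t)


def parallel_item_counts(transactions, n_workers=None):
    """Count items block by block on a worker pool and merge the partial counts."""
    n_workers = n_workers or os.cpu_count() or 1
    blocks = split_into_blocks(transactions, n_workers)
    total = Counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for partial in pool.map(count_block, blocks):
            total.update(partial)
    return total
```

Because the per-block counts are simply added together, the merged result does not depend on how the records are distributed across blocks.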
Fig. 4 shows the actual running process of parallel computing based on memory.
Fig. 4. Parallel calculation of the actual running process.
During actual parallel computing, the data are first loaded from memory, and the loaded
data are used as the data source (Fig. 4). The data source is then grouped, and the grouped data are allocated to the preset nodes.
The dataset is transformed on each node to generate new in-memory variables. Variables
are then broadcast or shared to reduce the time spent on data transmission,
and the parallel computing parameters are adjusted until the data skew is eliminated.
Column action operations are then performed on the dataset to generate the calculation
results. Finally, the result sets are aggregated, and the tourism information mining
results are output. Combined with the distributed programming architecture,
the algorithm is restructured into two stages. The first stage completes
the generation of the frequent 1-itemsets, as shown in Fig. 5.
Fig. 5. First stage of the operation process.
During the first stage, the transactions are first read into the initial in-memory
dataset using the flatMap function (Fig. 5). The map function then transforms each transaction item into a pair
of the item and a value. The candidate 1-itemsets are generated and their support
is counted using reduceByKey. Pruning is then performed with the preset
minimum support, and the retained itemsets form the frequent 1-itemsets. The second stage
completes the generation of the frequent itemsets, as shown in Fig. 6.
Fig. 6. Second stage of operation.
During the second stage, the frequent itemsets are loaded and transformed
into a form that pairs each itemset with its count (Fig. 6). Candidate itemsets are then generated and broadcast to each working
node. Each node performs grouped calculations and evaluates the correlation, calculating
the support only for itemsets that meet the correlation condition. The data are pruned according
to the preset minimum support threshold to obtain the final frequent itemsets. In this way,
the tourism information search method constructed in this study mines
the historical data of tourists from the operators, achieving the collection and supplementation
of smart tourism information.
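The two-stage flow can be sketched in plain Python. The helpers `flat_map` and `reduce_by_key` are hypothetical single-machine stand-ins that only mimic the behavior of the flatMap and reduceByKey operators named above, so the example runs without a distributed framework. The `keep` argument is a hypothetical hook for the correlation test, and broadcasting is represented simply by every transaction seeing the same candidate list.

```python
def flat_map(func, rows):
    """Apply func to each row and flatten the results into one list."""
    return [y for row in rows for y in func(row)]


def reduce_by_key(func, pairs):
    """Merge the values of identical keys with func."""
    acc = {}
    for key, value in pairs:
        acc[key] = func(acc[key], value) if key in acc else value
    return acc


def stage_one(transactions, min_support):
    """Frequent 1-itemsets: flatten, pair each item with 1, sum per item, prune."""
    items = flat_map(lambda t: t, transactions)
    pairs = [(item, 1) for item in items]
    counts = reduce_by_key(lambda a, b: a + b, pairs)
    total = len(transactions)
    return {frozenset([i]): n for i, n in counts.items() if n / total > min_support}


def stage_two(transactions, frequent_1, min_support, keep=lambda itemset: True):
    """Grow frequent itemsets level by level from the stage-one result."""
    total = len(transactions)
    frequent = dict(frequent_1)
    current = list(frequent_1)
    k = 2
    while current:
        # Generate k-item candidates and apply the correlation gate before counting.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        candidates = [c for c in candidates if keep(c)]
        # Each transaction emits (candidate, 1) for every candidate it contains.
        pairs = flat_map(lambda t: [(c, 1) for c in candidates if c <= t], transactions)
        counts = reduce_by_key(lambda a, b: a + b, pairs)
        # Prune with the minimum support threshold.
        current = [c for c, n in counts.items() if n / total > min_support]
        frequent.update({c: counts[c] for c in current})
        k += 1
    return frequent
```

Applying the correlation gate before the counting step means that support is computed only for candidates that pass the test, which is where the reduction in candidate sets comes from.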