Lei Peng 1,2, Kwankamol Nongpong 1, Paitoon Porntrakoon 1
1 Vincent Mary School of Science and Technology, Assumption University, Bangkok, Thailand
amon5728@163.com, {kwan, paitoon}@scitech.au.edu
2 Library and Information Science Center, Chongqing Three Gorges Medical College, Chongqing, China
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
Website, Crawler, Topic model, Topic transfer, Development
1. Introduction
Educational data can often reflect the development trend of education, such as the
quality of teachers and teaching ability. Analyzing the website data of a college
is a very effective way to know how it develops. The website is like a gateway, and
the important decisions and major historical events are continuously published as
news announcements. In this way, these data can be mined to obtain important information.
Chongqing Three Gorges Medical College is a medical-oriented college in China that is
responsible for cultivating medical graduates for society. Therefore, analyzing the
historical trends of the college is of great social value. This paper uses a topic model
to analyze the information on the official website of this college and derives the
development trends at the decision-making level from topic transfer over time, providing
intellectual support for the comprehensive development of the college.
2. Related Work
Analyzing website data has always been a major research topic with many applications.
Cha et al. improved the topic model to analyze the relationship graph of social-network
data. They then categorized the edges and nodes in the graph based on the topic similarity
[6]. Their model was quite effective in providing relevant recommendations within a social
network. Rohani et al. ran the Latent Dirichlet Allocation (LDA) model on 90,527 social
media records in the Aviation and Airport Management domain in 2015. They detected the
topic facets and extracted their dynamics over time [7]. Fuchs et al. proposed a social media analysis model based on trending topics extracted
from Twitter and Google. Their approach used the textual context to enrich the trends,
which helped identify semantically related topics, and it outperformed several baselines,
including knowledge-graph modeling using DBPedia and directly comparing articles or terms
[8]. Wang et al. proposed a hot topic detection approach based on bursty term identification
to help people assimilate the news immediately; it considered both frequency and topicality
to detect bursty terms and hot topics [9]. Later, in 2018, LDA was used to model the correlation
of news items with stock price time series data [10]. A model was built by training on past
news items and calculating the similarity between past and current news items, and the time
points were then shifted to predict the future. Li et al. extracted web news topics and used a model
to detect the topic content evolution based on the topic clusters. A quantification
method of the topic content was proposed in the model [11]. They showed the ``increase or decrease together'' law of topic intensity evolution.
Sulova constructed models by combining structured and unstructured data from databases,
web pages, and server log files to organize the data from web applications and then
provided a summary of the data [12]. Wu et al. proposed a multi-view learning framework to incorporate news titles and
bodies as different views of news to learn unified news representations. It could
achieve good performance in news topic prediction [13]. Zhang et al. proposed a dynamic topical community detection method to detect communities
and topics, which integrated link, text, and time [14]. The method could find communities and their topics with temporal variations. Since
the Covid-19 pandemic presented a challenge to the global research community in 2020,
Zhang et al. explored the research trace of the pandemic, which changed continuously,
based on resilience theory [15]. By extracting the characteristic words from the early Covid-19 research articles,
they found that the pandemic has significantly disrupted existing research. Bide et
al. proposed a cross-event evolutionary detection framework to detect cross-events
from similar time features [16]. Segmentation clustering was conducted based on similarity computation using the
Bidirectional Encoder Representations from Transformers (BERT) model to encode tweets
into vectors. In 2022, Zhang et al. designed a Bi-directional Long Short Term Memory-Conditional
Random Field (Bi-LSTM-CRF) model for patent entity extraction. They also proposed
a topic evolution path identifying method based on knowledge outflow and inflow to
calculate the semantic relationship between topics of shared entities [17]. ZareRavasan et al. applied topic modeling to the abstracts of 2,824 articles
published between 1990 and 2020. They reported that topics such as information system
(IS) social practice, IS emerging services, and IS sustainability had gained momentum
[18]. Feng et al. used Feature Maximization (FM) measurements to select features, combined
with the contrast ratio, to perform the diachronic analysis [19]. They developed an integrated method based on the Keywords-based Text Representation
Matrix (KTRM) and Lamirel’s EC index, and their method performed well in analyzing
the diachronic topic evolution. Ding et al. proposed an enhanced latent semantic model,
which was based on user comments and regularization factors, to capture the time evolution
features of potential topics [20]. Their model could capture the changes in the users' interests and provide the evolutionary
relationship between users' potential topics and product ratings. To make the research
work more comprehensive, Churchill et al. presented a survey on the topic models,
tracing their origins back to the 1990s, comparing these models and their evaluation
metrics, and laying the foundation for the next generation of topic models [21].
3. Crawler Designing and Data Cleaning
A crawler was implemented to obtain the data on the website. First, all links were
collected. Each link was then checked, and non-standard content was filtered out, so that
only the pages satisfying the criteria in Table 1 were kept; the crawled data were cleaned
against these criteria (a minimal filtering sketch follows Table 1). Each qualified page was
transformed into a dictionary entry containing the fields required by Table 1, i.e., the
complete page data. Simultaneously, a simplified text containing only the title and content
was written to a ``.txt'' file, which is the data fed into the LDA model.
Table 1. Qualified Page Criteria.
No. | Criterion Explanation
1 | url exists
2 | url string ends with "htm" or "html"
3 | page contains title
4 | page contains content
5 | page contains publishing time
6 | page contains publishing source
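To make the filtering rules concrete, the following is a minimal sketch of how the Table 1
criteria could be checked with the BeautifulSoup library mentioned in Section 5.1. The CSS
selectors ("div.content", "span.publish-time", "span.publish-source") and the dictionary keys
are assumptions for illustration, not the paper's exact implementation.

```python
# Minimal sketch of the link-filtering step (Table 1 criteria), assuming a
# requests + BeautifulSoup setup with hypothetical selectors.
import requests
from bs4 import BeautifulSoup

def qualify_page(url):
    """Return a page dictionary if the URL meets the Table 1 criteria, else None."""
    if not url or not url.endswith((".htm", ".html")):          # criteria 1 and 2
        return None
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    title = soup.title.string.strip() if soup.title and soup.title.string else None
    content = soup.select_one("div.content")                    # hypothetical selector
    pub_time = soup.select_one("span.publish-time")             # hypothetical selector
    pub_source = soup.select_one("span.publish-source")         # hypothetical selector
    if not (title and content and pub_time and pub_source):     # criteria 3-6
        return None
    return {
        "url": url,
        "title": title,
        "content": content.get_text(strip=True),
        "time": pub_time.get_text(strip=True),
        "source": pub_source.get_text(strip=True),
    }
```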
4. Topic Modeling
LDA [1] is a classic model that can generate topics automatically. It evolved from the Unigram
Model (UM) through Latent Semantic Analysis (LSA) [2] and probabilistic Latent Semantic Analysis (PLSA) [3]. Its concept is easy to understand: the whole corpus is the result
of a term generation process. Two kinds of distributions are involved: each document has a
corresponding document-topic distribution, and each topic has a corresponding topic-term
distribution. In the generation process, for each term position in a document, a topic is
first chosen from the document-topic distribution, and a term is then
chosen from the corresponding topic-term distribution. Figs. 2 and 3 present the plate notation of LDA and the generation process, respectively.
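As an illustration of this generative view (not part of the original model description), the
following sketch samples a single document from the LDA generative process, assuming symmetric
Dirichlet priors and arbitrary values for K, V, and the document length N.

```python
# Sketch of the LDA generative process: draw a document-topic distribution,
# then, for every term position, draw a topic and a term from that topic.
import numpy as np

def generate_document(K=5, V=100, N=50, alpha=0.1, beta=0.01, seed=0):
    rng = np.random.default_rng(seed)
    phi = rng.dirichlet([beta] * V, size=K)    # topic-term distributions (one per topic)
    theta = rng.dirichlet([alpha] * K)         # document-topic distribution
    words = []
    for _ in range(N):                         # for each term position in the document
        z = rng.choice(K, p=theta)             # choose a topic
        words.append(rng.choice(V, p=phi[z]))  # choose a term from that topic
    return words
```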
There are two main ways to perform inference for LDA. The first is Variational Inference,
which involves the EM (Expectation Maximization) algorithm [4]. In the E-step, the coupling between the latent variables is removed
through the variational assumption, and the variational distribution is obtained.
In the M-step, the variational parameters are fixed, and the resulting expectation is
maximized through a series of Newton steps. The parameters are finally obtained after
several iterations. A simpler way of computing is Gibbs
sampling [5]. The idea behind it is that the Markov chain reaches its stationary state through continuous
sampling. Gibbs sampling is a special case of MCMC (Markov Chain
Monte Carlo), used mainly for multi-dimensional random variables: each variable is
repeatedly sampled from its conditional distribution given the others, so that the chain
converges to the joint distribution. In Gibbs sampling, the joint distribution is written
down first, and the Gibbs sampling formula (the full conditional of each variable) is then
derived from it for sampling; the target distribution is approximated well when the number
of sampling iterations is large enough.
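The following is a compact sketch of a collapsed Gibbs sampler for LDA, written only to
illustrate the sampling step described above; it is not the authors' implementation. It
assumes `docs` is a list of documents given as lists of integer word ids and `V` is the
vocabulary size.

```python
# Collapsed Gibbs sampling for LDA: resample each token's topic from its full
# conditional, keeping count matrices for documents, topics, and words.
import numpy as np

def gibbs_lda(docs, V, K=40, alpha=0.1, beta=0.01, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), K))            # document-topic counts
    n_kw = np.zeros((K, V))                    # topic-word counts
    n_k = np.zeros(K)                          # total tokens assigned to each topic
    z = []                                     # topic assignment of every token
    for d, doc in enumerate(docs):             # random initialization
        z_d = rng.integers(K, size=len(doc))
        z.append(z_d)
        for w, k in zip(doc, z_d):
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                    # remove the token's current assignment
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # full conditional p(z = k | everything else) for this token
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k                    # re-assign and restore the counts
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    phi = (n_kw + beta) / (n_k[:, None] + V * beta)                    # topic-term
    theta = (n_dk + alpha) / (n_dk.sum(1, keepdims=True) + K * alpha)  # document-topic
    return theta, phi
```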
LDA has proven to be an effective model for topic analysis and word clustering. The selection
of the topic number K should be determined by evaluation, which is described in the next
section.
Fig. 1. Flow chart of the proposed topic-analyzing framework.
Fig. 2. LDA plate notation.
Fig. 3. LDA generating process.
5. Experiment
5.1 Data
The Python BeautifulSoup library was used to help implement the crawler, and 13,682 links
were obtained in total. After data cleaning, 8,264 qualified links remained. The data were
transformed into two file types. The first was 8,264 ``.txt'' files ready to be fed into the
LDA model; each ``.txt'' file was composed of only a title and content. The second was a
large dictionary-form file composed of 8,264 items. Fig. 4 presents the format of each item.
For the 8,264 ``.txt'' files, that is, 8,264 documents, the title length in each document
was 10 to 20 Chinese characters, and the content was approximately 500 to 1000 Chinese
characters.
After obtaining the 8,264 cleaned documents, Chinese word segmentation must be performed
before they can be fed into LDA. Here, the popular Python Jieba library was used. The
term-index map, index-term map, and the corresponding segmented documents were obtained
after the text preprocessing.
Fig. 4. Dictionary format of items.
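A sketch of this preprocessing step is given below; it assumes the 8,264 title-and-content
``.txt'' files sit in a ``pages'' directory and uses Jieba together with the gensim Dictionary
to build the term-index and index-term maps. The directory name and the simple length-based
token filter are assumptions, not the paper's exact settings.

```python
# Segment each document with Jieba and build the term <-> index maps with gensim.
import glob
import jieba
from gensim import corpora

docs = []
for path in glob.glob("pages/*.txt"):                 # hypothetical directory of .txt files
    with open(path, encoding="utf-8") as f:
        tokens = [t.strip() for t in jieba.cut(f.read())]
        docs.append([t for t in tokens if len(t) > 1])  # drop single characters / whitespace

dictionary = corpora.Dictionary(docs)                 # term-index and index-term maps
corpus = [dictionary.doc2bow(doc) for doc in docs]    # bag-of-words input for LDA
```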
5.2 Evaluation Metric
Since perplexity cannot represent the semantic coherence, two topic coherence calculation
methods were used to perform the evaluation. The first is the topic coherence score,
and its calculation formula is
$\text{Coherence-Score}(t)=\sum_{i=1}^{N-1}\sum_{j=i+1}^{N}\log \frac{C\left(t_{i},t_{j}\right)+1}{C\left(t_{i}\right)}$
where $N$ denotes the number of terms on the top list of a specific topic; it is set by the
researcher and was set to 15 in the present study. $C\left(t_{i},t_{j}\right)$ is the count
of documents in which $t_{i}$ and $t_{j}$ both appear, and $C\left(t_{i}\right)$ is the count
of documents in which $t_{i}$ appears. The second is the point-wise mutual information (PMI)
score, which can be expressed as follows:
$\text{PMI-Score}(t)=\frac{2}{N\left(N-1\right)}\cdot \sum_{i=1}^{N-1}\sum_{j=i+1}^{N}\log \frac{p\left(t_{i},t_{j}\right)}{p\left(t_{i}\right)p\left(t_{j}\right)}$
where $N$ is the same as above; $p\left(t_{i},t_{j}\right)$ is the co-occurrence probability
of $t_{i}$ and $t_{j}$, computed as ``(the count of documents in which $t_{i}$ and $t_{j}$
both appear) / (the count of total documents in the corpus)''; and $p\left(t_{i}\right)$ is
the occurrence probability of $t_{i}$, computed as ``(the count of documents in which $t_{i}$
appears) / (the count of total documents in the corpus)''. Note that the above gives the
coherence score and PMI score for one topic; the final coherence score and PMI score are the
averages over all topics.
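Both metrics can be computed directly from document frequencies. The sketch below mirrors the
formulas above for a single topic (the final scores average over all topics); it is an
illustrative implementation, not the authors' exact code, and pairs with zero co-occurrence
are skipped in the PMI score to avoid log(0).

```python
# Coherence and PMI scores for one topic, computed from document frequencies.
import math
from itertools import combinations

def doc_freqs(docs, terms):
    """Single and pairwise document frequencies of the top terms."""
    single = {t: sum(1 for d in docs if t in d) for t in terms}
    pair = {(a, b): sum(1 for d in docs if a in d and b in d)
            for a, b in combinations(terms, 2)}
    return single, pair

def coherence_score(docs, top_terms):
    single, pair = doc_freqs(docs, top_terms)
    return sum(math.log((pair[(a, b)] + 1) / single[a])
               for a, b in combinations(top_terms, 2))

def pmi_score(docs, top_terms):
    n, N = len(docs), len(top_terms)
    single, pair = doc_freqs(docs, top_terms)
    s = sum(math.log((pair[(a, b)] / n) / ((single[a] / n) * (single[b] / n)))
            for a, b in combinations(top_terms, 2) if pair[(a, b)] > 0)
    return 2 * s / (N * (N - 1))
```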
5.3 Results of Topics
The coherence score fluctuated, and the highest value, approximately -15.9, was reached when
the topic number was 10 (Fig. 5). Topic numbers of 15, 25, 45, and 40 also yielded relatively
high scores, while the four points at topic numbers 5, 20, 30, and 50 lay at lower positions.
In Fig. 6, the PMI score increased as the topic number went from 5 to 40 and decreased
drastically after that, reaching its peak at a topic number of 40. Therefore, 40 was chosen
as the topic number: at this point, the topic coherence was relatively high with the
second-largest standard deviation, and the PMI score was the largest, also with the
second-largest standard deviation. Note that each topic should be clearly distinguished from
the others, so a large spread indicates a good division.
Overall, this paper considered K = 40 as a proper value for the topic number. All
40 topics with corresponding titles given by human interpretation were then listed,
as shown in Table 3. Owing to the limited space, eight representative topics were chosen and
categorized, with the top 10 Chinese phrases of each topic, as shown in Table 4. For example,
topic-1, topic-12, and topic-30 were categorized as ``Welcome'', but the facet each topic
focuses on differs. Topic-1 concerns the orientation programs for students, such as training
and the dean’s message. Topic-12 tells the story of student enrollment, such as sign-up and
specialty selection. Topic-30 focuses on the welcoming preparations made by the college.
Fig. 5. Topic coherence score with different topic numbers K.
Fig. 6. PMI score with different topic numbers K.
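For reference, the kind of sweep behind Figs. 5 and 6 and Table 2 could be run with gensim as
sketched below. This assumes the `corpus`, `dictionary`, and `docs` objects from the
preprocessing sketch; gensim's built-in ``u_mass'' and ``c_uci'' coherence measures roughly
correspond to the coherence and PMI scores of Section 5.2, though the paper's own
implementation may differ.

```python
# Sweep the topic number K and report the two coherence-style scores for each model.
from gensim.models import LdaModel, CoherenceModel

for k in range(5, 55, 5):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                   passes=10, random_state=0)
    umass = CoherenceModel(model=lda, corpus=corpus,
                           dictionary=dictionary, coherence="u_mass").get_coherence()
    pmi = CoherenceModel(model=lda, texts=docs,
                         dictionary=dictionary, coherence="c_uci").get_coherence()
    print(k, round(umass, 4), round(pmi, 4))
```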
Table 2. Details of Coherence Score and PMI with the Respective Standard Deviation.
Topic number | Coherence score | Coherence score std | PMI | PMI std
5 | -17.0769 | 1.700748 | 0.782046 | 0.033435
10 | -15.9396 | 3.794480 | 0.779588 | 0.056387
15 | -16.1289 | 2.706822 | 0.783537 | 0.050057
20 | -16.6534 | 4.594621 | 0.788284 | 0.061898
25 | -16.1298 | 3.916092 | 0.787955 | 0.069197
30 | -16.7025 | 3.558920 | 0.794719 | 0.062617
35 | -16.6258 | 3.692576 | 0.793037 | 0.062819
40 | -16.2859 | 4.276845 | 0.804979 | 0.068083
45 | -16.2009 | 3.541288 | 0.796885 | 0.060292
50 | -17.1378 | 3.734030 | 0.786428 | 0.052768
Table 3. Title of all topics.
Topic ID | Topic title | Topic ID | Topic title
0 | Poverty alleviation | 20 | Traditional Chinese medicine
1 | New semester starts | 21 | Party construction
2 | Cooperation | 22 | Punishment
3 | Innovation & Research | 23 | College’s mission
4 | Dean’s message | 24 | Students’ competitions
5 | Medical alliance | 25 | Meetings
6 | Staff promotion | 26 | Jobs & Employment
7 | Teacher training | 27 | Project declaration
8 | Organization construction | 28 | Academic lectures
9 | Security & Pandemics | 29 | Students’ internship
10 | Vocational education | 30 | Welcome new students
11 | Inspection of teaching | 31 | Nation development
12 | Student sign up | 32 | College activities
13 | Discipline construction | 33 | Construction of work style
14 | Specialties construction | 34 | Enrollment affairs
15 | Community & Volunteers | 35 | Examination
16 | College development | 36 | Caring for teachers
17 | Faculty & Department | 37 | Teaching ability
18 | Academic affairs | 38 | Practical education
19 | Strong rainfall | 39 | Recruitment
Table 4. Some Representative Topics with Their Top 10 Chinese Phrases.
Category | Topic ID | Top 10 Chinese phrases
Welcome | 1 | representative, classmate, health training, students, hope, graduation, ceremony, youth, scholarship, sports meeting, newborn
Welcome | 12 | sign up, students, examinee, examination, admission, time, top up exam, complete, recruit students, information, ordinary, selection
Welcome | 30 | work, students, prepare, school opens, service, welcome new students, library, parents, guarantee, scene, canteen, security
Activities | 15 | activity, healthy, volunteer, society, service, resident, science popularization, community, practice, theme
Activities | 24 | competition, contest, skill, big match, competitor, national, the first prize, the final round, won, test
Activities | 32 | activity, classmate, dorm, culture, apartment, campus, exhibition, civilization, cultural festival, the most beautiful star in campus
Teaching | 7 | train, skill, theory, participate in, assistant, practicing, operation, ability, conduct, physician
Teaching | 37 | teacher, teaching, curriculum, promote, basics, level, design, teaching quality, young teachers, develop, to open up
5.4 Topic Transfer
First, the documents were sorted in chronological order using the ``date'' feature, and the
topics of each document were sorted by probability. The earliest document dates from 2014;
the ground truth is that the website underwent a revision and upgrade in 2014 and the
previous data were cleared, so historical data were available only from 2014 onward. After
that, the topics of all documents within the same year were counted, with only the top N
topics of each document considered. Here, N was set to 5: for the top five topic positions
of every document, the topic occurrences were counted, and the five most frequent topics
were taken. Table 5 lists the topic evolution sorted by ``year''. Similarly, the documents
within the same month were grouped to compute the topic evolution sorted by ``month'', as
listed in Table 6.
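A sketch of this counting procedure is shown below; `records` is assumed to be a list of
(publication date, document-topic distribution) pairs obtained from the trained LDA model,
and the grouping key can be switched from year to month for Table 6.

```python
# Count the top-N topics of each document, grouped by year, and keep the five
# most frequent topics per year.
from collections import Counter, defaultdict

def top_topics_by_year(records, top_n_per_doc=5, top_n_per_year=5):
    counts = defaultdict(Counter)
    for date, probs in records:
        # take the document's top-N topics by probability
        top = sorted(range(len(probs)), key=lambda k: probs[k], reverse=True)[:top_n_per_doc]
        counts[date.year].update(top)
    return {year: c.most_common(top_n_per_year) for year, c in sorted(counts.items())}
```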
Table 5 shows that the news in 2014 focused on topic-12 (student sign up), topic-14 (specialty
construction), topic-18 (students’ academic affairs), topic-38 (practical education), and
topic-39 (recruitment), but this observation carries little weight because only one document
from 2014 was available. In 2015, the number of articles became substantial. The focal points
were topic-5 (medical alliance), topic-36 (caring for teachers), topic-4 (dean’s message),
topic-32 (student activities), and topic-13 (discipline construction). In 2016, the issues
moved to topic-35 (examination), topic-11 (inspection of teaching), topic-25 (faculty
meetings), topic-10 (vocational education), and topic-36, which had also appeared the
previous year. In 2017, three topics overlapped with 2016, namely topic-35, topic-36, and
topic-10, and two new topics emerged: topic-38 (practical education) and topic-37 (teaching
ability). In 2018, two new topics appeared, topic-14 (specialties construction) and topic-17
(faculty and department), along with three topics from the two previous years: topic-10,
topic-11, and topic-36. In 2019, three topics overlapped with the previous year: topic-36,
topic-11, and topic-14. Two topics, topic-36 and topic-11, remained in 2020. Note that
topic-9 (security and pandemics) climbed into the top five here, mainly because the Covid-19
pandemic broke out at the beginning of 2020. Topic-16 (construction of school running and
new development) also appeared for the first time, against the background of large financial
support received from the Chongqing city government. The topics in 2021 were quite different
from those in 2020; among the five topics, only topic-16 appeared again. The other four were
topic-21 (party construction), topic-31 (nation development), topic-14 (which had appeared
in 2014, 2018, and 2019), and topic-29 (students’ internship). There were only 331 articles
in 2022 because the data collection ended in August. Three topics from 2020 and one from
2021 were seen again in 2022: topic-35, topic-36, topic-16, and topic-29. Over the whole
period, topic-36 appeared almost every year, topic-35 and topic-11 were the next most
frequent, followed by topic-10.
Some interesting information can be obtained from the data in Table 6. (1) The months with
the fewest articles are February and August because they fall within the winter and summer
vacations, respectively. (2) Chinese colleges often start
the new academic year in September. Thus, topic-1 (new semester starts) and topic-4
(dean’s message) appear in September. Occasionally, the start would be postponed to
October, so topic-1 also appears in October. (3) January and July are often the end
of the semesters. Therefore, topic-35 (examination) appears in both months. (4) Summer
vacation is always in August, and it can last for more than a month. Teachers often
choose this period to improve themselves. Thus, topic-7 (teacher training) appears
in August.
Table 5. Topics transition over the years.
Year | Count of Articles | Top 5 Topics (count)
2014 | 1 | 12 (1), 14 (1), 18 (1), 38 (1), 39 (1)
2015 | 583 | 5 (195), 36 (132), 4 (111), 32 (111), 13 (106)
2016 | 1363 | 35 (301), 11 (289), 25 (287), 10 (284), 36 (283)
2017 | 1611 | 35 (394), 38 (355), 36 (352), 37 (317), 10 (313)
2018 | 1228 | 14 (304), 10 (287), 17 (258), 11 (257), 36 (256)
2019 | 1118 | 36 (359), 2 (270), 11 (254), 21 (254), 14 (253)
2020 | 761 | 36 (212), 16 (204), 35 (175), 25 (144), 11 (142)
2021 | 1268 | 21 (302), 16 (297), 31 (288), 14 (275), 29 (269)
2022 | 331 | 35 (87), 36 (80), 17 (70), 16 (65), 29 (64)
Table 6. Topics transition over months.
Month | Count of Articles | Top 5 Topics (count)
Jan. | 536 | 36 (181), 35 (164), 11 (142), 25 (118), 21 (114)
Feb. | 149 | 30 (58), 25 (44), 35 (42), 12 (41), 5 (31)
Mar. | 655 | 36 (194), 35 (191), 11 (137), 25 (130), 5 (128)
Apr. | 798 | 36 (215), 10 (170), 35 (167), 21 (155), 25 (155)
May | 919 | 10 (218), 24 (211), 36 (195), 11 (180), 14 (173)
Jun. | 1132 | 36 (254), 37 (243), 14 (235), 17 (232), 24 (224)
Jul. | 594 | 36 (154), 35 (151), 23 (118), 37 (115), 21 (108)
Aug. | 179 | 35 (66), 30 (55), 14 (45), 7 (40), 2 (38), 10 (36)
Sept. | 865 | 36 (205), 30 (192), 1 (172), 4 (172), 11 (168)
Oct. | 635 | 10 (143), 36 (130), 1 (121), 38 (120), 11 (119)
Nov. | 925 | 11 (201), 10 (198), 24 (184), 5 (180), 31 (170)
Dec. | 877 | 10 (198), 36 (194), 14 (180), 11 (179), 38 (177)
6. Conclusion
Manually analyzing the huge amount of information on the web and obtaining knowledge from it
is an exhausting task. To address this problem in educational data, an analysis framework was
constructed to trace topic evolution automatically. Taking Chongqing Three Gorges Medical
College as a case, a crawler was designed to obtain website data, which were then cleaned and
restructured. The data were fed into the topic-processing module, and the optimal topic
number, the term distribution in each topic, and the topic transfer over the period were
obtained. Owing to time limits, Gaussian LDA was not considered for experimental comparison,
and frequent itemset mining was not designed for this paper; both could be conducted later.
This paper can provide an administrative summary for the leadership of an educational
organization, enable better control of the development direction, and offer ideas and
management experience to other colleges.
ACKNOWLEDGMENTS
This work was supported by the Chongqing Three Gorges Medical College of China
(No. 2019XZYB13) and by the Chongqing Association of Higher Education under Chongqing
Municipal of China (No. CQGJ21B128). Here author Lei would like to express his gratitude
to his spouse, Zheng Teng.
Author
Lei Peng is a Ph.D. candidate at the Vincent Mary School of Science and Technology at
Assumption University, Thailand. He received his B.Eng. from the Computer Science and
Technology Department at Shangqiu University, China, and his M.Eng. from the Computer Science
and Technology Department at Guizhou University, China. His research areas include
information retrieval, text mining, topic modeling, and time series data mining.
Kwankamol Nongpong is a professor at the Vincent Mary School of Science and Technology,
Assumption University, Thailand. Her research interests are text processing, natural
language processing, big data processing and analysis, program analysis, and enterprise
applications. Nongpong also works closely with industry, where she has consulted for
companies and government agencies on data standards, project management, business process
improvement, and ERP system development.
Paitoon Porntrakoon received his Ph.D. in Information Technology from Assumption
University, Thailand, in 2018. He is currently the Graduate Program Director
in Information Technology of the Vincent Mary School of Science and Technology at
Assumption University, Thailand. His research interests are similarity searching,
location detection, trust and distrust, social commerce, and Thai sentiment analysis.