
1. (Digital Solution Laboratory, Korea Electric Power Research Institute (KEPRI), Korea)



Keywords: Deep Learning, Natural Language Processing, Power Generation, Diagnostic Service, Text Mining, Framework

1. Introduction

Natural language processing (NLP) is emerging as one of the most frequently used technologies across a wide range of artificial intelligence areas. With the recent advancement of NLP technology and the continuous increase in the value and amount of data, programs capable of understanding natural language, such as human speech and writing, are already widely used for clinical documents, airline reservations, and roadside vehicle support.

The global market for software, hardware, and services in the NLP sector is forecast as follows. Tractica forecast that the NLP market, estimated at $277.2 million in 2015, would grow by an average of 25% annually to reach $2.1 billion by 2024. Owing to the high demand in the NLP sector, its market size has been increasing (1).

Deep learning-based embedding, which represents words as vectors so that similar words lie close together, has been attracting particular attention in the NLP sector. However, there have been no cases of applying this technology to the electric power industry and no corresponding service frameworks. Moreover, Korea Electric Power Corporation (KEPCO) has collected thousands of knowledge documents produced by electric generator operation experts over about 20 years, but these have rarely been applied to electric generator operation.

Therefore, in this paper, we propose Gen2Vec, an NLP framework for electric generator operation knowledge services using deep learning technology. Gen2Vec includes a preprocessing function that extracts nouns from search sentences or words, a word recommendation function that recommends words related to the search words using deep learning-based word embedding, and a document recommendation function that recommends documents related to the search words. In the future, Gen2Vec can serve as the core engine of the knowledge system for electric generator operation experts and will be used in search and chatbot services for electric generator operation programs and new employee education.

2. Related Work

2.1 Word Embedding based on Deep Learning

In mathematics, an embedding is a representation of a graph on a surface such that its edges do not cross (2). Recently, embedding techniques have been widely employed in the NLP sector. Typical techniques include word embedding, in which words are represented as vectors, and term frequency–inverse document frequency (TF-IDF), in which the importance of each word in a document is quantified.

First, word embedding techniques fall into prediction-based models, such as the neural probabilistic language model, Word2Vec, and FastText, and matrix factorization-based models, such as latent semantic analysis and the global word vector (GloVe) model (3-8). In this study, we selected Word2Vec, which showed good performance in evaluating word similarity and has been widely used in various industries, as the word embedding technique. As shown in Fig. 1, Word2Vec has two training methods: continuous bag of words (CBOW) and Skip-gram. CBOW is a training method that predicts a target word from the surrounding words (Fig. 1(a)). Conversely, Skip-gram is a training method that predicts the surrounding words from a target word (Fig. 1(b)).
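As a brief illustration, the sketch below shows how the two training methods can be selected in the Gensim implementation of Word2Vec used later in this paper; the toy corpus and the gensim 4.x argument names are illustrative assumptions, not the authors' actual setup.

```python
from gensim.models import Word2Vec

# Toy tokenized corpus standing in for real training sentences (assumption).
sentences = [["gas", "turbine", "blade", "crack"],
             ["compressor", "blade", "damage", "crack"]]

# sg=0 selects CBOW (predict the target word from its context, Fig. 1(a));
# sg=1 selects Skip-gram (predict context words from the target, Fig. 1(b)).
cbow = Word2Vec(sentences, sg=0, vector_size=100, window=4, min_count=1)
skipgram = Word2Vec(sentences, sg=1, vector_size=100, window=4, min_count=1)
```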

Second, TF-IDF is a weight used in information retrieval and text mining: a statistical measure of how important a word is in a particular document (9). TF indicates how often a specific word appears in a document, and IDF is the inverse of the frequency of documents containing the word. TF-IDF is obtained by multiplying TF by IDF. It is used to rank search results in search engines and to measure the similarity between documents within a document cluster.

Minarro-Gimenez et al. studied improving the accessibility of medical knowledge by applying Word2Vec to medical documents, and Husain and Dih developed a TF-IDF content-based recommendation system for travelers in a mobile application (10-11). As these examples show, research using deep learning-based word embedding and TF-IDF has been actively underway in various industries. However, research applying these technologies in the electric power sector remains insufficient.

Fig. 1. CBOW and Skip-gram Structure

2.2 Framework

A framework is a software environment that provides reusable designs and implementations of specific software functions in a collaborative form, making the development of a software platform effective (12). A framework can be maintained through systematic code management, is highly reusable, and offers high development productivity by providing function libraries.

Regarding research on frameworks, Bedi and Toshniwal developed a deep learning framework for forecasting electricity demand using long short-term memory (13). Dowling et al. proposed an optimization framework for evaluating revenue opportunities provided by multi-scale hierarchies in the electric power market and determining optimal participation strategies for individual participants (14). In addition, Pinheiro and Davis Jr. improved user convenience by developing ThemeRise, a framework for producing volunteered geographic information applications, a type of crowdsourcing, which manages the characteristics and structure of data collection target themes, and Jack Jr. et al. proposed the National Institute on Aging and Alzheimer's Association (NIA-AA) research framework to support research on the biological definition of Alzheimer's disease (15-16). As these studies show, framework research has been gaining attention not only in the electric power industry but also in other industries. Introducing a framework lets users conveniently exploit existing technologies and enables efficient platform development.

3. Proposed Gen2Vec Framework

3.1 Framework Architecture

The framework architecture of the proposed Gen2Vec is shown in Fig. 2. Pretraining is performed with deep learning-based Word2Vec on the 1,348 expert knowledge documents for electric generators. When the user enters a sentence or word in the search box, the preprocessing function extracts only the nouns. Then, the word recommendation function applies embedding based on the extracted and pretrained words. Next, $Gen2Vec_{score}$ is calculated using the words extracted by the word recommendation function and the TF-IDF value of each document. Lastly, the document recommendation function recommends the documents related to the search word.

Fig. 2. Framework Architecture of Gen2Vec

3.2 Preprocessing Function

The preprocessing function of Gen2Vec is performed after the user enters the word or sentence to be searched in the search box. First, the word or sentence is tokenized to separate it into tokens, and part-of-speech tagging then attaches a part of speech to each token. After this, only the noun tokens are extracted and passed to the word recommendation function. KoNLPy, a Korean natural language processing package for Python, was used to implement the preprocessing function of Gen2Vec (17).
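To make this step concrete, a minimal sketch using KoNLPy is given below. The paper does not state which KoNLPy tagger was used, so the Okt tagger and the example query are assumptions.

```python
from konlpy.tag import Okt  # requires KoNLPy and a Java runtime

okt = Okt()  # one of several taggers bundled with KoNLPy (assumption)
query = "가스터빈 압축기 블레이드 균열 발생"  # example search sentence

tokens = okt.morphs(query)  # tokenization into morpheme tokens
tagged = okt.pos(query)     # part-of-speech tagging of each token
nouns = okt.nouns(query)    # noun extraction passed to word recommendation
print(nouns)
```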

3.3 Word Recommendation Function

The word recommendation function was developed by applying the Gensim framework's Word2Vec to the words extracted by the preprocessing function (18). The training parameters of the Skip-gram model used to extract embedding words are the matrices $u$ and $v$, each of size $|V| \times d$, where $|V|$ is the size of the vocabulary set and $d$ is the number of embedding dimensions. The probability that a target word ($t$) and a context word ($c$) form a positive sample is calculated using Eq. (1), and the probability that $t$ and $c$ form a negative sample is calculated using Eq. (2).

Eq. (1) Positive Sample Calculation of Skip-gram

(1)
$P(+ | t,\:c)=\dfrac{1}{1+\exp(-u_{t}v_{c})}$

Eq. (2) Negative Sample Calculation of Skip-gram

(2)
$P(- | t,\:c)=\dfrac{\exp(-u_{t}v_{c})}{1+\exp(-u_{t}v_{c})}$

The log-likelihood function of Skip-gram is given in Eq. (3). After training to optimize Eq. (3), the word recommendation function can vectorize the words in a document cluster and thereby recommend the words related to the search word.

Eq. (3) Log-likelihood Function of Skip-gram

(3)
$L(\theta)=\log P(+ | t_{p},\:c_{p})+\sum_{i=1}^{k}\log P(- | t_{n_{i}},\:c_{n_{i}})$
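A direct numerical reading of Eqs. (1)-(3) is sketched below with NumPy; the vectors and negative pairs are placeholders, and this illustrates the objective only, not Gensim's optimized training loop.

```python
import numpy as np

def p_positive(u_t, v_c):
    # Eq. (1): probability that target t and context c are a positive pair
    return 1.0 / (1.0 + np.exp(-np.dot(u_t, v_c)))

def p_negative(u_t, v_c):
    # Eq. (2): probability that t and c are a negative pair
    return 1.0 - p_positive(u_t, v_c)

def log_likelihood(pos_pair, neg_pairs):
    # Eq. (3): one positive pair plus k negative samples
    u_p, v_p = pos_pair
    ll = np.log(p_positive(u_p, v_p))
    for u_n, v_n in neg_pairs:
        ll += np.log(p_negative(u_n, v_n))
    return ll

# Placeholder 4-dimensional embeddings (d = 4 purely for illustration)
rng = np.random.default_rng(0)
pos = (rng.normal(size=4), rng.normal(size=4))
negs = [(rng.normal(size=4), rng.normal(size=4)) for _ in range(5)]  # k = 5
print(log_likelihood(pos, negs))
```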

3.4 Document Recommendation Function

The document recommendation function of the proposed Gen2Vec operates as follows. When the user enters a word or sentence in the search box, the preprocessing function extracts the noun words ($Word_{x_{i}}$). The extracted words are input into the trained Word2Vec model to extract the TopN words ($Word_{y_{i}}$) with the highest cosine similarity to each $Word_{x_{i}}$. The formula for the cosine similarity is given in Eq. (4), and the method for obtaining $Word_{y_{i}}$ is given in Eq. (5).

Eq. (4) The Formula for Obtaining the Cosine Similarity

(4)
$\cos(\theta)=\dfrac{\sum_{i=1}^{n}Word A_{i}\bullet Word B_{i}}{\sqrt{\sum_{i=1}^{n}\left(Word A_{i}\right)^{2}}\bullet\sqrt{\sum_{i=1}^{n}\left(Word B_{i}\right)^{2}}}$

Eq. (5) The Method for Obtaining $Word_{y_{i}}$

(5)
$Word_{y_{i}} = \mathrm{TopN}(\cos(Word_{x_{i}}))$
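The sketch below mirrors Eqs. (4)-(5): the cosine similarity is computed componentwise as in Eq. (4), while the TopN lookup is, in practice, provided by Gensim's most_similar. The model variable and the query word are assumptions.

```python
import numpy as np

def cosine(word_a, word_b):
    # Eq. (4): cosine similarity between two word vectors
    return np.dot(word_a, word_b) / (np.linalg.norm(word_a) * np.linalg.norm(word_b))

# Eq. (5) in practice: Gensim ranks the vocabulary by cosine similarity.
# `model` is a trained gensim Word2Vec model (assumption):
# word_y = model.wv.most_similar("가스터빈", topn=5)
```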

After defining each word ($w$), the target word ($t$), a document ($d$), the document collection ($D$), and the term frequency function $f(\cdot)$, the TF-IDF is computed using Eq. (6). Then, a data frame is extracted for the TF-IDF values, with the word list consisting of $Word_{x_{i}}$ and $Word_{y_{i}}$ as the columns and the documents as the rows. Table 1 shows an example of the extraction results.

Eq. (6) The Formula of TF-IDF

(6)
$$TFIDF(t,\:d,\:D)=TF(t,\:d)\bullet IDF(t,\:D)=\left[0.5+\dfrac{0.5\bullet f(t,\:d)}{\max\{f(w,\:d):w\in d\}}\right]\bullet\left[\log\dfrac{|D|}{|\{d\in D:t\in d\}|}\right]$$
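A minimal sketch of Eq. (6) and of building a Table 1-style data frame is shown below; the tokenized documents and word lists are placeholders, and the augmented-TF form follows the equation above rather than a library default such as scikit-learn's.

```python
import numpy as np
import pandas as pd
from collections import Counter

def tfidf(t, d, D):
    # Eq. (6): augmented TF times inverse document frequency.
    # t: term, d: tokenized document, D: list of tokenized documents.
    counts = Counter(d)
    tf = 0.5 + 0.5 * counts[t] / max(counts.values())
    df = sum(1 for doc in D if t in doc)  # assumes t occurs in at least one doc
    return tf * np.log(len(D) / df)

# Placeholder corpus and word list (Word_x followed by Word_y, as in Table 1)
docs = [["turbine", "blade", "crack"], ["boiler", "leak"], ["blade", "vane"]]
words = ["turbine", "blade"]
frame = pd.DataFrame([[tfidf(w, d, docs) for w in words] for d in docs],
                     index=[f"Doc{i+1}" for i in range(len(docs))],
                     columns=words)
```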

Table 1. Data Frame Example for Document Recommendation Function

|      | $Word_{x_{1}}$ | .. | $Word_{x_{n}}$ | $Word_{y_{1}}$ | .. | $Word_{y_{m}}$ |
|------|------|----|------|------|----|------|
| Doc1 | 0.27 | .. | 0.32 | 0.41 | .. | 0.19 |
| ..   | ..   | .. | ..   | ..   | .. | ..   |
| DocZ | 0.13 | .. | 0.11 | 0.21 | .. | 0.02 |

The above data frame contains both the words $Word_{x_{i}}$ that the user directly enters in the search box and the related words $Word_{y_{i}}$ extracted by deep learning. Because it is necessary to give different weights to $Word_{x_{i}}$ and $Word_{y_{i}}$, $Gen2Vec_{weight}$ is defined as expressed in Eq. (7), and the TF-IDF values in the data frame are updated accordingly.

Eq. (7) The Formula of $Gen2Vec_{weight}$

(7)
$Gen2Vec_{weight}=\begin{cases}TFIDF\left(Word_{x_{i}},\:d,\:D\right), & \text{if the word is } Word_{x_{i}}\\ \cos\left(Word_{x_{i}},\:Word_{y_{i}}\right)\bullet TFIDF\left(Word_{y_{i}},\:d,\:D\right), & \text{otherwise}\end{cases}$

Next, $Gen2Vec_{weight}$ is summed for each document in the data frame, and the TopK documents are extracted. The function used for this calculation is defined as $Gen2Vec_{score}$. If $T$ is defined as the total number of $Word_{x_{i}}$ and $Word_{y_{i}}$, $Gen2Vec_{score}$ can be expressed as shown in Eq. (8).

Eq. (8) The Formula of $Gen2Vec_{score}$

(8)
$Gen2Vec_{score} = \mathrm{TopK}\left(\sum_{i=1}^{T} Gen2Vec_{weight}\right)$
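Putting Eqs. (7)-(8) together, a minimal sketch is given below: TF-IDF entries for user-entered words are kept as-is, entries for embedding words are scaled by their cosine similarity, and the per-document sums are ranked. All names here are illustrative, not the authors' implementation.

```python
import pandas as pd

def gen2vec_score(frame, word_x, cos_sim, top_k=10):
    # frame: documents x words TF-IDF data frame, as in Table 1
    # word_x: words entered by the user; cos_sim: {Word_y: cosine similarity}
    weighted = frame.copy()
    for w in weighted.columns:
        if w not in word_x:                    # Word_y column: apply Eq. (7)
            weighted[w] = cos_sim[w] * weighted[w]
    scores = weighted.sum(axis=1)              # sum Gen2Vec_weight per document
    return scores.nlargest(top_k)              # TopK documents, Eq. (8)
```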

4. Experiments and Results

4.1 Expert Documents for Diagnostic Services of Power Generation Facility

KEPCO has operated electric generators at each of its power plants for about 20 years, during which experts have directly diagnosed the generators and accumulated 1,348 documents. The experts used the categories of boiler, electric generator, performance, gas turbine, and steam turbine in the diagnosis and designated subcategories such as fault diagnosis and precision diagnosis. Table 2 shows the statistics of the expert documents on electric generator operation collected from 2000 to 2018. The Gen2Vec developed in this study was trained on these documents.

Table 2. Expert Documents for Diagnostic Services of Power Generation Facility

| Category | Subcategory | Number of documents |
|---|---|---|
| Boiler | Fault diagnosis | 52 |
| Boiler | Precision diagnosis | 380 |
| Gas turbine | Fault diagnosis | 34 |
| Gas turbine | Precision diagnosis | 53 |
| Steam turbine | Fault diagnosis | 33 |
| Steam turbine | Precision diagnosis | 57 |
| Electric generator | Leak absorption | 122 |
| Electric generator | Prevention diagnosis | 37 |
| Performance | Insulation diagnosis | 546 |
| Performance | Precision diagnosis | 34 |
| Total | | 1,348 |

Table 3. Result Example of Word Recommendation Function

| $Word_{x_{1}} \sim Word_{x_{5}}$ [Korean] (cosine similarity) | Gas turbine [가스터빈] | Compressor [압축기] | Blade [블레이드] | Crack [균열] | Occurrence [발생] |
|---|---|---|---|---|---|
| $Word_{y_{1}} \sim Word_{y_{5}}$ | Trial run [시운전] (0.88) | Blade [블레이드] (0.97) | Compressor [압축기] (0.97) | Tiny [미세] (0.84) | Majority [다수] (0.86) |
| $Word_{y_{6}} \sim Word_{y_{10}}$ | Gunsan [군산] (0.85) | Bucket [버켓] (0.91) | Bucket [버켓] (0.93) | Progress [진전] (0.83) | Similarity [유사] (0.85) |
| $Word_{y_{11}} \sim Word_{y_{15}}$ | Low pressure [저압] (0.83) | Past [과거] (0.88) | Vane [베인] (0.91) | Fault [결함] (0.83) | Order [차례] (0.85) |
| $Word_{y_{16}} \sim Word_{y_{20}}$ | Turbine [터빈] (0.81) | Rotor [로터] (0.85) | Recommendation [권고] (0.88) | Expansion [확대] (0.81) | Estimation [추정] (0.85) |
| $Word_{y_{21}} \sim Word_{y_{25}}$ | Component [부품] (0.80) | Type [종류] (0.84) | Rotor [로터] (0.88) | Discovery [발견] (0.81) | Many [여러] (0.84) |

Fig. 3. An Example of the Preprocessing Function

4.2 Preprocessing and Word Recommendation

An experimental example and the results of the preprocessing function are shown in Fig. 3. After the user enters a word or sentence in the search box, tokenization separates it into tokens (Fig. 3(a), (b)). Next, a part of speech is tagged to each token, and the nouns are extracted (Fig. 3(c), (d)). The extracted nouns are input into the word recommendation function.

An experimental example and the results of the word recommendation function are shown in Table 3. In this experiment, the nouns ($Word_{x_{i}}$) extracted by the preprocessing function, as shown in Fig. 3, were input into the pretrained Word2Vec model. For pretraining, the word vector dimension was set to 1,000, the window size was set to 4, and the downsampling rate for frequently appearing words was set to 1e-3. With these parameters, the embedding words ($Word_{y_{i}}$) corresponding to TopN were extracted and output in descending order of cosine similarity. In this experiment, N was set to 5, and each extracted $Word_{y_{i}}$ had a cosine similarity value.
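Under the stated settings, the pretraining call might look like the sketch below; the corpus is a placeholder for the 1,348 tokenized expert documents, Skip-gram is assumed per Section 3.3, and gensim 4.x argument names are used.

```python
from gensim.models import Word2Vec

corpus = [["가스터빈", "압축기", "블레이드", "균열", "발생"]]  # placeholder corpus

# vector dimension 1,000, window size 4, downsampling 1e-3, Skip-gram (sg=1)
model = Word2Vec(corpus, vector_size=1000, window=4, sample=1e-3,
                 sg=1, min_count=1)
top5 = model.wv.most_similar("가스터빈", topn=5)  # N = 5, as in Table 3
```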

The experimental results show that words highly related to each $Word_{x_{i}}$ were extracted as $Word_{y_{i}}$. The extracted words included duplicates; for each duplicated word, only the instance with the highest cosine similarity was retained, and the remaining duplicates were excluded from $Word_{y_{i}}$ before being used in the document recommendation function, as sketched below.
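A simple way to realize this de-duplication (illustrative, not the authors' code):

```python
def dedup_keep_best(word_y):
    # word_y: list of (word, cosine similarity) pairs, e.g. from Table 3
    best = {}
    for word, sim in word_y:
        if word not in best or sim > best[word]:
            best[word] = sim  # keep the duplicate with the higher similarity
    return sorted(best.items(), key=lambda kv: -kv[1])

# Example with the duplicates visible in Table 3 ('버켓' and '로터'):
print(dedup_keep_best([("버켓", 0.91), ("버켓", 0.93), ("로터", 0.85), ("로터", 0.88)]))
```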

4.3 Document Recommendation Function

An experimental example and the results of the document recommendation function are presented in Table 4. The TF-IDF values pretrained for the nouns extracted by the preprocessing function ($Word_{x_{i}}$) and the words recommended by the word recommendation function ($Word_{y_{i}}$) were updated with the proposed $Gen2Vec_{weight}$. Next, the documents extracted using $Gen2Vec_{score}$ were presented to the user. With the K value used to obtain TopK defined as 10, the documents extracted for the example search words employed in this study are listed in Table 4.

Table 4. Result Example of Document Recommendation Function

| Rank | Document name | $Gen2Vec_{score}$ |
|---|---|---|
| 1 | Yeongwol natural gas power plant gas turbine report (1) | 2.44 |
| 2 | Seo-incheon 5GT OH technical support report (1) | 2.39 |
| 3 | Yeongwol natural gas power plant gas turbine report (2) | 2.35 |
| 4 | Busan 3GT report | 2.18 |
| 5 | Seo-incheon unit 1 gas turbine maintenance work technical support report | 2.17 |
| 6 | Bundang 8GT 1 blade damage report | 2.16 |
| 7 | Pyeongtaek 3GT composite report (1) | 2.10 |
| 8 | Pyeongtaek 3GT composite report (2) | 2.08 |
| 9 | Seo-incheon 5GT OH technical support report (2) | 2.07 |
| 10 | Yeongwol 2GT high temperature parts damage report | 1.98 |

The results in Table 4 confirm that documents related to the nouns extracted from the search words, such as the gas turbine reports, blade damage report, and high-temperature parts damage report, were retrieved. The evaluated performance of the document recommendation function is shown in Table 5: precision, recall, and F1 (the harmonic mean of precision and recall) were derived for Gen2Vec, Word2Vec, and TF-IDF. The F1 of Gen2Vec was about 3.9 and 10.8 percentage points higher than those of Word2Vec and TF-IDF, respectively.

Table 5. Evaluated Performance of Document Recommendation Function

| Algorithm | Precision (%) | Recall (%) | F1 (%) |
|---|---|---|---|
| Gen2Vec | 81.3 | 84.9 | 83.1 |
| Word2Vec | 78.2 | 80.1 | 79.2 |
| TF-IDF | 71.9 | 72.7 | 72.3 |

5. Conclusion

In this paper, we proposed Gen2Vec, a knowledge service framework for electric generator operation based on user search words. Gen2Vec offers three functions to provide efficient knowledge services. First, a preprocessing function separates a sentence entered by the user into tokens and extracts only the nouns. Second, a word recommendation function recommends words related to the search words by applying a model trained by deep learning. Last, a document recommendation function extracts highly related documents by applying $Gen2Vec_{weight}$ and $Gen2Vec_{score}$ to the TF-IDF values pretrained for the words extracted by the preprocessing and word recommendation functions.

By using Gen2Vec in this way, experts and new employees who operate electric generators can quickly retrieve the expert documents accumulated by KEPCO over 20 years when diagnosing generators. Consequently, operators and new employees can obtain expert knowledge even without experts present at the power plants and can easily apply this knowledge in the field.

In the future, we plan to extend the word and document recommendation functions of Gen2Vec into a personalized recommendation function by developing optimization functions related to the search words of individual users. Furthermore, we are extending Gen2Vec by training it in multiple languages such as English and Chinese, and we are working to improve user-friendliness by developing a chatbot service and a voice-based search service that use Gen2Vec as the core engine of knowledge services for electric generator operation.

Acknowledgements

This work was funded by the Korea Electric Power Corporation (KEPCO).

References

(1) R. Madhavan, 2018, Natural language processing current applications and future possibilities, Tractica Omdia.
(2) I. Abraham, Y. Bartal, O. Neiman, Sep 2011, Advances in metric embedding theory, Advances in Mathematics, Vol. 228, pp. 3026-3126.
(3) Y. Bengio, R. Ducharme, P. Vincent, C. Jauvin, Feb 2003, A neural probabilistic language model, Journal of Machine Learning Research, Vol. 3, pp. 1137-1155.
(4) T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, Dec 2013, Distributed representations of words and phrases and their compositionality, Proceedings of the 26th International Conference on Neural Information Processing Systems (NIPS), Australia, Vol. 2, pp. 3111-3119.
(5) T. Mikolov, K. Chen, G. Corrado, J. Dean, Jan 2013, Efficient estimation of word representations in vector space, Proceedings of the International Conference on Learning Representations (ICLR), USA.
(6) A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Apr 2017, Bag of tricks for efficient text classification, Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Spain, Vol. 2, pp. 427-431.
(7) S. T. Dumais, 2005, Latent semantic analysis, Annual Review of Information Science and Technology, Vol. 38, pp. 188-230.
(8) J. Pennington, R. Socher, C. D. Manning, Oct 2014, GloVe: Global vectors for word representation, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Qatar, pp. 1532-1543.
(9) G. Salton, M. J. McGill, 1983, Introduction to Modern Information Retrieval, McGraw-Hill.
(10) J. A. Minarro-Gimenez, O. Marin-Alonso, M. Samwald, 2014, Exploring the application of deep learning techniques on medical text corpora, European Federation for Medical Informatics and IOS Press, pp. 584-588.
(11) W. Husain, L. Y. Dih, Jul 2012, A framework of a personalized location-based traveler recommendation system in mobile application, International Journal of Multimedia and Ubiquitous Engineering, Vol. 7, pp. 11-18.
(12) A. Gachet, Software frameworks for developing decision support systems: A new component in the classification of DSS development tools, Journal of Decision Systems, Vol. 12, No. 3, pp. 271-281.
(13) J. Bedi, D. Toshniwal, Jan 2019, Deep learning framework to forecast electricity demand, Applied Energy, Vol. 238, pp. 1312-1326.
(14) A. W. Dowling, R. Kumar, V. M. Zavala, Jan 2017, A multi-scale optimization framework for electricity market participation, Applied Energy, Vol. 190, pp. 147-164.
(15) M. B. Pinheiro, C. A. Davis Jr., Jun 2018, ThemeRise: A theme-oriented framework for volunteered geographic information applications, Journal of Open Geospatial Data, Software and Standards, Vol. 1, pp. 3-9.
(16) C. R. Jack Jr., D. A. Bennett, K. Blennow, M. C. Carrillo, B. Dunn, S. B. Haeberlein, D. M. Holtzman, W. Jagust, F. Jessen, J. Karlawish, E. Liu, J. L. Molinuevo, T. Montine, C. Phelps, K. P. Rankin, C. C. Rowe, P. Scheltens, E. Siemers, H. M. Snyder, R. Sperling, 2018, NIA-AA research framework: Toward a biological definition of Alzheimer's disease, Alzheimer's & Dementia, Vol. 14, pp. 535-562.
(17) E. L. Park, S. Cho, 2014, KoNLPy: Korean natural language processing in Python, Proceedings of the 26th Annual Conference on Human and Cognitive Language Technology, pp. 133-136.
(18) R. Rehurek, P. Sojka, 2011, Gensim: Statistical semantics in Python, The 4th European Meeting on Python in Science (EuroSciPy).

About the Authors

황명하 (Myeong-Ha Hwang)

He received his B.S. degree in Information and Communication Engineering from Chungnam National University (CNU), Korea, in 2015 and his M.E. degree in Information and Communication Network Technology from the University of Science and Technology (UST), Korea, in 2018. He currently works for the Korea Electric Power Research Institute (KEPRI).

His current research interests include deep learning and natural language processing (NLP).

이인태 (In-Tae Lee)

He is currently working as a principal researcher at the KEPCO Research Institute (KEPRI), Daejeon, Korea.

He received his M.S. in computer science from Korea University.

채창훈 (Chang-Hun Chae)

He received his M.S. degree in Information and Mechanical Engineering from the Gwangju Institute of Science and Technology (GIST).

His major is computer science in general and, specifically, augmented reality and computer vision.

정남준 (Nam-Joon Jung)

He received his Ph.D. degree in computer engineering from Hanbat University.

His research interests are AI, VR/AR, and drone applications.