IEIESPC(IEIE Transactions on Smart Processing and Computing)

IEIESPC Vol. 12, No. 06, p.526-534

ISSN (online): 2287-5255

Received: 24 July 2023; Revised: 24 August 2023; Accepted: 14 September 2023

DOI: https://doi.org/10.5573/IEIESPC.2023.12.6.526

Regular Paper

Unveiling the Power of Deep Learning: A Comparative Study of LSTM, BERT, and GRU for Disaster Tweet Classification

Ihsan Ullah¹, Anum Jamil², Imtiaz Ul Hassan³, Byung-Seo Kim¹,*

(Department of Software and Communication Engineering, Hongik University, Korea danish1852@gmail.com, jsnbs@hongik.ac.kr )
(Department of Physics, NED University of Engineering & Technology, Karachi, Pakistan jamilanum47@gmail.com)
(Department of Computer Science and Information Technology, NED University of Engineering & Technology, Karachi, Pakistan zahidbooni1@gmail.com)

^* Corresponding Author: Byung-Seo Kim

License :

This is an Open-Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.(www.theieie.org).

Abstract

Disasters have serious effects on people's lives and buildings. Therefore, social media platforms, such as Twitter, have become more critical. They are crucial tools for responding to and managing disasters effectively. This study examined the effectiveness of various deep learning models, such as bidirectional encoder representations from transformers (BERT), gated recurrent units (GRU), and long short-term memory (LSTM) for classifying disaster-related tweets. Twitter data related to different disasters were collected using hashtags. The data were then cleaned, preprocessed, and manually annotated by a team. The annotated data were divided into training, validation, and testing sets. The data were used to train three models based on BERT, GRU, and LSTM for the categorical classification of disaster tweets. Finally, the three models were evaluated and compared using the test data. BERT achieved an accuracy of 96.2%, making it the most effective model. In contrast, the LSTM and GRU models achieved an accuracy of 93.2% and 88.4%, respectively. These findings underscore the potential effectiveness of deep learning models in classifying disaster-related tweets, offering insights that could enhance disaster management strategies, refine social media monitoring processes, bolster public safety, and provide directions for future research.

Keywords

Text mining, Text classification, Sentiment analysis, Supervised machine learning, BERT, GRU, LSTM

1. Introduction

Natural disasters have become frequent worldwide, causing significant destruction and loss of life. With the rise of social media platforms, particularly Twitter, people now have an easy and immediate way of sharing information about these disasters. Twitter's real-time nature enables people to post updates and emergency information about disasters as they occur. The information shared on Twitter can benefit first responders and disaster relief organizations because they can quickly assess the situation and allocate resources accordingly. Studies have shown that social media platforms, such as Twitter, can provide critical information to help manage natural disasters. One study examined the role of Twitter in disseminating information during Hurricane Harvey in 2017 ^[1]. They reported that Twitter was a valuable tool for sharing situational updates and emergency information, especially in the early stages of a disaster when the traditional sources of information were limited. Another study analyzed the tweets during the 2017 Mexico earthquake and reported that Twitter users effectively shared information about missing persons and relief efforts ^[2].

On the other hand, the vast amount of unstructured and noisy data on Twitter poses challenges for effective disaster response and management. Various Natural Language Processing (NLP) techniques have been employed to classify and analyze disaster-related tweets to address these challenges ^[3]. These techniques automatically categorize tweets into different types, such as informative, supportive, and observational, to enable efficient filtering and analysis of disaster-related information.

Several studies have demonstrated the effectiveness of NLP techniques in classifying disaster-related tweets; for instance, one used a deep learning approach to classify tweets related to the California wildfires ^[4]. These studies showed promising results in tweet classification tasks using deep learning models, such as recurrent neural networks (RNNs) and transformers. For example, one study used a bi-directional long short-term memory (LSTM) model with an attention mechanism to classify tweets related to natural disasters into four categories: casualty, damage, donations, and sentiment. The model achieved an accuracy of 89.7% and outperformed several baseline models. Similarly, the authors of ^[5] used a pre-trained bidirectional encoder representations from transformers (BERT) model to classify tweets related to the COVID-19 pandemic into four categories: news, opinions, advisories, and miscellaneous.

The present study compared the performance of three different NLP models, namely BERT, gated recurrent units (GRU), and LSTM, for tweet classification of disaster data. The study contributes to the field of crisis informatics, particularly in the use of NLP models for disaster detection and response. The specific contributions of this research are as follows.

1. This paper presents a unique study comparing three distinct NLP models on disaster-related tweet classification, a topic previously unexplored. A dataset of 5545 tweets was manually annotated to assess the strengths and limitations of each model and guide future research in this domain.

2. This work introduces a robust framework for extracting disaster-relevant information from Twitter, aiming to enhance the efficiency and depth of disaster management strategies by interpreting social media data more effectively.

3. This study aimed to develop a mechanism for identifying and categorizing disaster-related tweets to sift through the vast Twitter data. The goal is to provide real-time updates and emergency information during natural disasters, enabling stakeholders to gain immediate insights and respond quickly and effectively.

This research aims to demonstrate the potential of sophisticated NLP techniques in aiding disaster response and management.

2. Related Work

Natural disasters, such as earthquakes, floods, accidents, and hurricanes, have significant social and economic impacts on the affected communities. Social media platforms, such as Twitter, have emerged as valuable sources of real-time information during disasters ^[6]. Twitter users often share first-hand accounts, photographs, and videos of the disasters, as well as requests for help, information, and donations ^[7]. On the other hand, the vast amount of unstructured and noisy data on Twitter poses challenges to effective disaster response and management.

In recent years, there has been a growing interest in leveraging social media for disaster management and response. A previous study developed a system to utilize Twitter data for coordination in disaster response scenarios ^[8]. Their study focused on clustering tweets and categorizing them based on their relevance to disasters. They demonstrated how social media can serve as a real-time source of disaster-related information.

On the other hand, tweet classification has proven to be a challenge because of the short, noisy, and unstructured nature of the text. Studies have looked into this problem, examining the use of convolutional neural networks (CNNs) for text classification ^[9]. They showed that CNNs can effectively handle the short and sparse nature of tweets, paving the way for further exploration of deep learning techniques in this context.

Research has also shown that the effectiveness of different deep learning models can hinge heavily on the task at hand. For example, a notable study examined several models, including LSTM and GRU, in the context of sentiment analysis ^[10] and found that GRU models generally outperformed their LSTM counterparts. This difference may be attributed to the structural and functional characteristics of GRUs, including their simplified gating mechanism and lower computational complexity, which may offer advantages in specific scenarios, such as sentiment analysis. Such performance disparities underline the importance of choosing a deep learning model based on the requirements and nature of the task, and they motivate further task-specific research to identify the best model for each context.

The present study compared the performance of three NLP models, namely BERT, GRU, and LSTM, for tweet classification of disaster data. To the best of the authors’ knowledge, no study has compared the performance of these models on disaster-related tweets.

3. The Proposed Scheme

This section describes the dataset used for disaster tweet classification, including data collection and preprocessing information. The section also presents the proposed methodology for the classification task on the collected dataset.

3.1 Data Collection

The Tweepy library, a popular Python package, was used for data collection. Tweepy provides a convenient and easy-to-use interface for accessing the Twitter API. With Tweepy, researchers could authenticate and establish a connection with the Twitter platform, enabling them to retrieve tweets based on specific search queries and hashtags related to disasters. This study used hashtags, such as #Disaster, #Earthquake, #Floods, #Accidents, and #Disasters, to collect many disaster-related tweets for further analysis and classification. This library streamlined the data collection process and ensured the inclusion of relevant tweets on different types of disasters. A dataset of 5545 tweets was collected using Tweepy, providing a diverse and comprehensive dataset for analysis.

3.2 Data Annotation

The 5545 collected tweets were manually annotated into four disaster categories: 'Earthquake', 'Flood', 'Accident', and 'Other disaster'. The annotation process involved carefully analyzing the content of each tweet and assigning it to the appropriate category based on its context and keywords. The annotation was carried out by a team of trained annotators who followed a predefined set of guidelines to ensure consistency and accuracy in the categorization.

3.3 Data Preprocessing

In the data preprocessing phase, several steps were undertaken to prepare the collected tweets for further analysis. First, stop words, i.e., common words that carry little meaning on their own, were removed from the text. This step helped reduce noise and improve the efficiency of subsequent processes. Next, lemmatization was applied to transform words into their base or root form, consolidating the variations of the same word. This step reduced the dimensionality of the data while preserving the essence of the tweet content, which supports the classification task. These preprocessing steps refined the dataset for the subsequent analysis and classification tasks.
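
The two steps above (stop-word removal followed by lemmatization) can be sketched as follows. This is a minimal illustration rather than the code used in the study: the stop-word set and the lemma lookup table are tiny hypothetical stand-ins, and a real pipeline would rely on a full stop-word list and a dictionary-based lemmatizer from an NLP library.

```python
import re

# Hypothetical, deliberately tiny resources for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "on", "of", "and", "to"}
LEMMAS = {"floods": "flood", "houses": "house",
          "destroyed": "destroy", "roads": "road"}


def preprocess(tweet):
    """Lowercase, tokenize, drop stop words, then map words to base forms."""
    tokens = re.findall(r"[a-z0-9#]+", tweet.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    # Words missing from the lookup table are left unchanged.
    return [LEMMAS.get(t, t) for t in kept]


cleaned = preprocess("The floods destroyed houses and roads in the city")
print(cleaned)
```

Removing stop words before lemmatization keeps the lookup small; the order matters little for this toy table but can matter when the lemmatizer maps a word onto a stop word.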

3.4 Data Visualization

Data visualization is a powerful tool that provides insights, identifies patterns, and communicates complex information effectively through visual representations. A count plot was produced to visualize the distribution of different disaster types in the dataset (Fig. 1). The data showed that during the collection period, the highest number of incidents recorded was related to earthquakes, with a count of 2065. This was followed by other disasters with 1348 occurrences, floods with 1215 occurrences, and accidents with 917 occurrences. The higher count of earthquake incidents can be attributed to the data being collected when a significant earthquake event occurred in Turkey.

Another type of visualization that can be applied to textual data is the word cloud. The word cloud produced from the dataset of disaster tweets revealed the prominent terms associated with different types of disasters. The most frequent terms in the word cloud, which can be observed in Fig. 2, include "Earthquake," "Hurricane," "Accident," and "Flood Warning." These terms indicate the prevalence of these specific disaster types in the dataset and highlight the significance of these events in the context of the analyzed tweets. The word cloud provides a visual representation that quickly identifies the most commonly mentioned disaster types in the dataset.

Fig. 1. Tweet counts of each class.

Fig. 2. Word Cloud for all the tweets.

3.5 Data Transformation

This section describes the data preparation steps, namely label encoding, tokenization, text-to-sequence conversion, and padding, that ready the text data for disaster tweet classification.

Label encoding is a technique used to convert categorical labels into numerical values. In disaster tweet classification, the labels 'Accident', 'Earthquake', 'Flood', and 'Other disaster' were assigned the numerical labels 0, 1, 2, and 3, respectively, using the scikit-learn label encoder. This allows the machine learning model to process the labels effectively.

Tokenization is the process of splitting text into individual words or tokens. In this case, the vocabulary size was determined to be 14509, meaning there are 14509 unique words in the given disaster tweet dataset. Tokenization is an essential step in natural language processing tasks because it allows the model to analyze the text data at a granular level.

After tokenization, the next step involved converting the text data into sequences. This conversion is necessary to represent each word in the text as a numerical sequence that machine learning models can process. Each unique word in the vocabulary is assigned a unique integer value. The conversion of text into sequences allows the model to understand and analyze the text data numerically.

After conversion to sequences, all sequences were brought to a common length of 27 tokens, the length of the longest tweet. Shorter sequences were padded with zeros at the end (post-zero-padding) so that all inputs have the same length. This uniformity is beneficial for training machine learning models that require fixed-length input sequences. After this step, the text data are ready to be fed into a model for disaster tweet classification.
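
The four transformation steps described in this section can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the authors' pipeline: the helper names `encode_labels`, `build_vocab`, and `texts_to_padded` are hypothetical, whitespace splitting stands in for whatever tokenizer was used, and `max_len=6` replaces the paper's length of 27 only to keep the printed output short. The label encoder mirrors the alphabetical ordering that yields 0 to 3 for the four classes.

```python
def encode_labels(labels):
    """Map each distinct label to an integer, in sorted (alphabetical) order."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    return [index[name] for name in labels], classes


def build_vocab(texts):
    """Give each unique word an integer id starting at 1 (0 is reserved for padding)."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def texts_to_padded(texts, vocab, max_len):
    """Convert texts to integer sequences, truncate to max_len, and post-pad with zeros."""
    padded = []
    for text in texts:
        seq = [vocab[w] for w in text.lower().split() if w in vocab][:max_len]
        padded.append(seq + [0] * (max_len - len(seq)))
    return padded


tweets = ["earthquake hits city", "flood warning issued for city"]
labels = ["Earthquake", "Flood"]

y, classes = encode_labels(labels)
vocab = build_vocab(tweets)
X = texts_to_padded(tweets, vocab, max_len=6)
print(classes, y)
print(X)
```

Reserving id 0 for padding keeps padded positions distinguishable from real words, which is why the vocabulary ids start at 1.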

3.6 Data Splitting: Training, Validation, and Testing Sets

Data splitting is a crucial step in machine learning, where the available dataset is divided into separate subsets, such as training, validation, and testing sets, to facilitate model development, evaluation, and optimization. In the case of 5545 tweets, the data was divided into 15% for testing (831 tweets), 10% for validation (554 tweets), and 75% for training (4160 tweets).
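
The split sizes quoted above can be reproduced with a short sketch. The function name `split_indices`, the shuffling, and the seed are assumptions made for illustration; the paper does not state whether the split was random or stratified. Taking the test and validation counts by truncation and assigning the remainder to training gives 831, 554, and 4160 tweets.

```python
import random


def split_indices(n_samples, test_frac=0.15, val_frac=0.10, seed=42):
    """Shuffle sample indices and cut them into train/validation/test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)   # seed is an arbitrary choice
    n_test = int(n_samples * test_frac)    # truncates 831.75 to 831
    n_val = int(n_samples * val_frac)      # truncates 554.5 to 554
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]       # remainder goes to training
    return train, val, test


train_idx, val_idx, test_idx = split_indices(5545)
print(len(train_idx), len(val_idx), len(test_idx))
```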

3.7 Model Selection

Once the data have been prepared for training, deep learning models are trained on this data for performing disaster tweet classification. Three different models are trained and compared: GRU, LSTM, and BERT.

3.7.1 GRU (Gated Recurrent Unit)

The GRU, introduced by Cho et al. ^[11], is a type of RNN that has gained popularity in deep learning. It is designed to address the vanishing gradient problem in traditional RNNs. The GRU includes two key gates: the update gate and the reset gate. The update gate determines how much of the previous hidden state should be passed on to the current time step, while the reset gate controls how much of the previous hidden state should be ignored. These gates play a crucial role in governing the flow of information in the GRU, allowing it to capture long-term dependencies in sequential data.

Eq. (1) depicts the functioning of the Update gate, a key component in the Gated Recurrent Unit (GRU) architecture.

(1)

$ z_{t}=\sigma \left(W_{z}\cdot x_{t}+U_{z}\cdot h_{t-1}\right) $

where z$_{\mathrm{t}}$ represents the update gate activation at time step t. ${\sigma}$ denotes the sigmoid activation function. W$_{\mathrm{z}}$ and U$_{\mathrm{z}}$ are the weight matrices that control the influence of the current input x$_{\mathrm{t}}$ and the previous hidden state h$_{\mathrm{t-1}}$, respectively.

Eq. (2) captures the functionality of the Reset gate.

(2)

$ r_{t}=\sigma \left(W_{r}\cdot x_{t}+U_{r}\cdot h_{t-1}\right) $

Similarly, r$_{\mathrm{t}}$ is the reset gate activation at time step t. W$_{\mathrm{r}}$ and U$_{\mathrm{r}}$ are the weight matrices determining the impact of the current input x$_{\mathrm{t}}$ and the previous hidden state h$_{\mathrm{t-1}}$ on the reset gate activation.

These equations and subsequent calculations help the GRU model decide how much information to retain from the previous time step and how much to update with new inputs, enabling it to capture and process sequential dependencies effectively.
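
As a hedged illustration, the two gates can be computed with NumPy for a single time step, taking the update gate as σ(W_z·x_t + U_z·h_{t-1}) and the reset gate as σ(W_r·x_t + U_r·h_{t-1}). The function name `gru_gates`, the random weights, and the dimensions are assumptions chosen for the example and are unrelated to the trained model in this study.

```python
import numpy as np


def sigmoid(a):
    """Logistic function; squashes each entry into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))


def gru_gates(x_t, h_prev, W_z, U_z, W_r, U_r):
    """Update gate z_t and reset gate r_t for one time step."""
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev)
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev)
    return z_t, r_t


rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 4            # illustrative sizes only
x_t = rng.normal(size=input_dim)        # current input vector
h_prev = np.zeros(hidden_dim)           # previous hidden state
W_z, W_r = rng.normal(size=(2, hidden_dim, input_dim))
U_z, U_r = rng.normal(size=(2, hidden_dim, hidden_dim))

z_t, r_t = gru_gates(x_t, h_prev, W_z, U_z, W_r, U_r)
print(z_t.shape, r_t.shape)
```

Each gate has one entry per hidden unit, so the gating is applied element-wise to the hidden state.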

The GRU-based model initiates with an embedding layer that converts integer-encoded words into dense vectors using the given vocabulary size and embedding dimensions. This is succeeded by a Bidirectional GRU layer with 256 neurons, using a ReLU activation for adept bidirectional sequence processing. A Global Average Pooling1D layer then summarizes this temporal information. A dense layer with 64 neurons and ReLU activation is then used, followed by a 0.4 rate dropout layer to mitigate overfitting. The architecture culminates in a Dense layer with four neurons and a softmax activation, targeting the classification of distinct disaster classes in tweets.

3.7.2 LSTM (Long Short-term Memory)

LSTM is a well-established RNN architecture that effectively addresses the vanishing gradient problem, a common issue in training traditional RNNs. The model achieves this by introducing memory cells and three essential gating mechanisms: the input gate, forget gate, and output gate. These gates play a critical role in regulating the flow of information through the network, enabling the LSTM to capture and retain long-range dependencies in the input sequence. LSTM has widespread applications in various tasks, including speech recognition, language modeling, and text classification, owing to its robustness in modeling sequential data. Its exceptional ability to capture long-term dependencies makes it particularly well-suited for understanding the context and semantics of the text, which is essential for accurate classification, such as in disaster-related tweets.

Input Gate: Eq. (3) represents the functioning of the input gate (i$_{\mathrm{t}}$). The input gate controls how much the current input (x$_{\mathrm{t}}$) should be used to update the cell state (C$_{\mathrm{t}}$). It is calculated using the sigmoid activation function.

(3)

$ i_{t}=sigmoid\left(W_{i}*\left[h_{t-1},\,\,x_{t}\right]+b_{i}\right) $

where W$_{\mathrm{i}}$ is the weight matrix for the input gate and [h$_{\mathrm{t-1}}$, x$_{\mathrm{t}}$] represents the concatenation of the previous hidden state and the current input. b$_{\mathrm{i}}$ is the bias vector for the input gate. Sigmoid is the activation function, which scales the output between 0 and 1.

Forget Gate (f$_{\mathrm{t}}$): The forget gate determines the extent to which the previous cell state (C$_{\mathrm{t-1}}$) should be forgotten when processing the current input (x$_{\mathrm{t}}$) and the previous hidden state (h$_{\mathrm{t-1}}$). The gate is also calculated using the sigmoid activation function. The forget gate mathematical functioning can be explained using Eq. (4).

(4)

$ f_{t}=sigmoid\left(W_{f}*\left[h_{t-1},\,\,x_{t}\right]+b_{f}\right) $

where W$_{\mathrm{f}}$ is the weight matrix for the forget gate. [h$_{\mathrm{t-1}}$, x$_{\mathrm{t}}$] represents the concatenation of the previous hidden state and the current input. b$_{\mathrm{f}}$ is the bias vector for the forget gate. "sigmoid" is the sigmoid activation function.

Output Gate (O$_{\mathrm{t}}$): The output gate controls the extent to which the current cell state (C$_{\mathrm{t}}$) should influence the computation of the current hidden state (h$_{\mathrm{t}}$). The gate is calculated using the sigmoid activation function. Eq. (5) represents the mathematical equation of the output gate.

(5)

$ O_{t}=sigmoid\left(W_{o}*\left[h_{t-1},\,\,x_{t}\right]+b_{o}\right) $

where W$_{\mathrm{o}}$ is the weight matrix for the output gate. [h$_{\mathrm{t-1}}$, x$_{\mathrm{t}}$] represents the concatenation of the previous hidden state and the current input. b$_{\mathrm{o}}$ is the bias vector for the output gate.

These gating mechanisms in LSTM and the memory cell enable the network to update and forget information selectively, allowing it to learn long-term dependencies and effectively model sequential data. The ability to capture complex patterns and context in the input sequence makes LSTM a powerful tool for various natural language processing tasks, including disaster-related tweet classification, where an accurate understanding of the text's semantics is crucial.
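
The three gate equations share one structure: a weight matrix applied to the concatenation [h_{t-1}, x_t], plus a bias, passed through a sigmoid. The NumPy sketch below makes that structure explicit for a single time step. The function name `lstm_gates`, the random weights, and the sizes are illustrative assumptions, and the cell-state update that consumes these gates is not shown.

```python
import numpy as np


def sigmoid(a):
    """Logistic function; squashes each entry into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))


def lstm_gates(x_t, h_prev, W_i, b_i, W_f, b_f, W_o, b_o):
    """Input, forget, and output gates (Eqs. 3-5) for one time step."""
    concat = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]
    i_t = sigmoid(W_i @ concat + b_i)
    f_t = sigmoid(W_f @ concat + b_f)
    o_t = sigmoid(W_o @ concat + b_o)
    return i_t, f_t, o_t


rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 4                 # illustrative sizes only
x_t = rng.normal(size=input_dim)
h_prev = np.zeros(hidden_dim)
W_i, W_f, W_o = rng.normal(size=(3, hidden_dim, hidden_dim + input_dim))
b_i = b_f = b_o = np.zeros(hidden_dim)

i_t, f_t, o_t = lstm_gates(x_t, h_prev, W_i, b_i, W_f, b_f, W_o, b_o)
print(i_t.shape, f_t.shape, o_t.shape)
```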

The architecture of an LSTM-based model begins with an embedding layer, which transforms integer-encoded words into dense vectors. This feeds into a Bidirectional LSTM with 256 neurons, enhanced by a ReLU activation for efficient bidirectional sequence processing. A Global Average Pooling1D layer then distills this temporal data, leading to a Dense layer with 64 neurons and ReLU activation for intricate pattern recognition. A dropout layer with a 0.4 rate was employed to prevent overfitting. Concluding the architecture, a softmax-activated dense layer outputs class probabilities, making this design particularly adept at classifying disaster-related tweets.

3.7.3 BERT (Bidirectional Encoder Representations from Transformers)

BERT is a transformer-based model that has revolutionized natural language processing tasks. The BERT model follows a two-step framework: pre-training and fine-tuning ^[12]. In the pretraining phase, the model undergoes training on a vast unlabeled corpus. For the fine-tuning stage, the model starts with pre-trained parameters, which are then fine-tuned using labeled data specific to the tasks.

Unlike standard LSTM and GRU layers, which read a sequence one token at a time in a single direction (bidirectional variants run a forward and a backward pass and combine the two), BERT considers both the left and right contexts of each word jointly through self-attention, providing a more comprehensive understanding of the context. This allows BERT to capture long-range dependencies efficiently. In addition, BERT differs from LSTM and GRU regarding the training objectives. BERT uses unsupervised pretraining, learning from large amounts of unannotated text data through masked language modeling and next-sentence prediction. In contrast, LSTM and GRU typically undergo supervised training with labeled data.

BERT uses a multi-layer bidirectional transformer encoder ^[12], which consists of a stack of identical layers (N = 6 in the original Transformer encoder; 12 in the BERT-base variant used in this study), each with two sub-layers. In the first sub-layer, a multi-head self-attention mechanism captures the relationships between different words in the input sequence, allowing the model to comprehend the context effectively. The second sub-layer uses a position-wise fully connected feedforward network to process the output of the self-attention layer further. The scaled dot-product attention mechanism, represented as Eq. (6), is a fundamental building block within the self-attention layer.

(6)

$$ \operatorname{Attention}(Q, K, V)=\operatorname{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V $$

where Q, K, and V represent the queries, keys, and values, respectively, and d$_{\mathrm{k}}$ is the dimension of the keys. This mechanism calculates attention scores by measuring the relevance of the queries to the keys. The softmax function normalizes these scores, determining the importance of each value (V) with respect to the given queries and keys. By focusing on the most relevant parts of the input sequence, this mechanism captures the contextual dependencies, leading to meaningful contextualized word representations. This mechanism is a critical component in the Transformer architecture, contributing to the success of models, such as BERT, in various natural language processing tasks.
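
Scaled dot-product attention, softmax(QKᵀ/√d_k)V, is compact enough to write out directly. The NumPy sketch below is a single-head illustration with random inputs; the function names and matrix sizes are assumptions for the example, and the learned projections and multi-head splitting used inside BERT are omitted.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """Return the attended values and the attention weights."""
    d_k = K.shape[-1]                     # dimension of the keys
    scores = Q @ K.T / np.sqrt(d_k)       # relevance of each query to each key
    weights = softmax(scores)             # each row sums to 1
    return weights @ V, weights


rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))               # 3 queries of dimension 4
K = rng.normal(size=(5, 4))               # 5 keys of dimension 4
V = rng.normal(size=(5, 2))               # 5 values of dimension 2

out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.shape)
```

Dividing by √d_k keeps the scores from growing with the key dimension, which would otherwise push the softmax toward near one-hot weights.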

The "bert-base-uncased" variant of BERT, which comprises 12 layers, was used as the foundation for classifying disaster-related tweets via transfer learning. This base model is augmented with a fully connected output layer that has one neuron per class and a softmax activation. Fine-tuning uses the Adam optimizer with a learning rate of 1 ${\times}$ 10$^{-5}$ and a decay of 1${\times}$ 10$^{-7}$. Categorical cross-entropy was selected as the loss function because it measures the discrepancy between the predicted and actual class probabilities, which makes it suitable for multi-class classification.
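
To make the loss concrete, the sketch below computes categorical cross-entropy for one-hot targets and predicted class probabilities. It is a minimal NumPy illustration, not the framework implementation used for training; the function name, the clipping constant `eps`, and the example probabilities are assumptions.

```python
import numpy as np


def categorical_crossentropy(y_true, y_pred, eps=1e-12):
    """Mean over samples of -sum_k y_true[k] * log(y_pred[k])."""
    y_pred = np.clip(y_pred, eps, 1.0)    # avoid log(0)
    return float(-np.mean(np.sum(y_true * np.log(y_pred), axis=1)))


# One tweet whose true class is index 1 ('Earthquake' under the 0-3 encoding).
y_true = np.array([[0.0, 1.0, 0.0, 0.0]])
y_pred = np.array([[0.1, 0.7, 0.1, 0.1]])  # hypothetical softmax output

loss = categorical_crossentropy(y_true, y_pred)
print(round(loss, 4))
```

With one-hot targets, only the probability assigned to the true class contributes, so the loss here equals -ln(0.7) and falls to zero as that probability approaches 1.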

3.8 Evaluation Metrics

Evaluation metrics are essential for assessing the performance of machine learning models. In this study, the performance of the GRU, LSTM, and BERT models was assessed using three metrics: accuracy, precision, and recall. These metrics describe the ability of a model to make correct predictions, the proportion of its positive predictions that are correct, and its sensitivity to positive instances.

Accuracy, given by Eq. (7), measures the overall correctness of the model predictions as the proportion of correct predictions among all predictions:

(7)

$ Accuracy=\frac{T_{P}+T_{N}}{T_{P}+F_{P}+F_{N}+T_{N}} $

Precision, given by Eq. (8) and also known as the positive predictive value, quantifies the proportion of positive class predictions that are correct. A high precision means that few of the instances assigned to a class are false positives.

(8)

$ Precision=\frac{T_{P}}{T_{P}+F_{P}} $

Recall, given by Eq. (9) and also known as sensitivity, hit rate, or true positive rate, quantifies the proportion of actual positive class observations that were correctly classified. It measures completeness, i.e., how many of the true instances of a class the model identifies.

(9)

$ Recall=\frac{T_{P}}{T_{P}+F_{N}} $

where T$_{\mathrm{P}}$, T$_{\mathrm{N}}$, F$_{\mathrm{P}}$, and F$_{\mathrm{N}}$ are the true positives, true negatives, false positives, and false negatives, respectively.
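
For a four-class problem, the counts in Eqs. (7)-(9) are obtained per class by treating that class as positive and all others as negative. The sketch below builds a confusion matrix (rows as true classes, columns as predicted classes) and derives the three metrics for one class. The toy labels are invented for illustration, and how the per-class values were averaged into the single scores reported in this paper is not stated, so no averaging is shown.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def per_class_metrics(cm, k):
    """Accuracy, precision, and recall for class k, treated one-vs-rest."""
    total = sum(sum(row) for row in cm)
    tp = cm[k][k]
    fp = sum(cm[i][k] for i in range(len(cm))) - tp   # predicted k, truly other
    fn = sum(cm[k]) - tp                              # truly k, predicted other
    tn = total - tp - fp - fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Toy labels using the 0-3 class encoding (invented for illustration).
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 1, 1, 1, 2, 0, 3, 3]

cm = confusion_matrix(y_true, y_pred, n_classes=4)
acc, prec, rec = per_class_metrics(cm, k=1)
print(cm)
print(acc, prec, rec)
```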

These evaluation metrics are crucial for understanding the strengths and weaknesses of each model in different aspects of performance. This study aimed to determine which model performs best on the disaster tweet classification task by comparing these metrics across the GRU, LSTM, and BERT models.

4. Results and Discussions

This section discusses the results of training the disaster tweet classification models. The training was conducted in Google Colab, which offered GPU acceleration. In particular, training used a single Tesla T4 GPU (GPU 0) with a memory capacity of 15360 MiB (15 GiB, approximately 16 GB). This GPU acceleration, along with its memory capacity, helped improve the efficiency of the training process.

4.1 Comparison of GRU, LSTM, and BERT Models

Table 1 compares the results of the three models based on the accuracy, precision, and recall on the test data. According to the results presented in Table 1, BERT achieved the highest testing accuracy (0.962), followed by LSTM with a testing accuracy of 0.932 and GRU with a testing accuracy of 0.8847. BERT also had the highest testing precision (0.963), indicating that it correctly identifies positive instances out of all instances predicted as positive. LSTM also performed well in this aspect, with a precision of 0.952. In contrast, GRU had a lower precision of 0.8923. Regarding testing recall, BERT achieved the highest score of 0.9625, followed by LSTM with a recall of 0.917. GRU had a lower recall of 0.8811. BERT demonstrated the best performance across all metrics, achieving high accuracy, precision, and recall. LSTM also performed well, and GRU showed lower accuracy, precision, and recall.

Table 1. Comparison of three models based on accuracy, recall, and precision.

4.2 Performance Analysis of the BERT Model

BERT, a powerful language model, achieves impressive results in various natural language processing tasks, including text classification. In the specific disaster classification task, BERT showed its effectiveness in accurately predicting the class of a given text. Fig. 3 presents the confusion matrix for the performance of BERT on the disaster classification task, using the disaster classes 'Accident', 'Earthquake', 'Flood', and 'Other disaster'. The confusion matrix provides valuable insights into the performance of the model by showing the number of correct and incorrect predictions for each class. In this case, the rows represent the true classes, while the columns represent the predicted classes.

Fig. 3. Confusion matrix for BERT.

From the confusion matrix, BERT achieved high accuracy in predicting the 'Accident', 'Earthquake', and 'Other disaster' classes, with most predictions falling into these categories being correct. On the other hand, there are a few instances where the 'Accident' and 'Earthquake' classes were misclassified as 'Flood' or 'Other disaster'.

Similarly, the 'Flood' class had some misclassifications, with a few instances being predicted as 'Accident' or 'Other disaster'. The 'Other disaster' class also had a few misclassifications, with some instances being predicted as 'Accident', 'Earthquake', or 'Flood'.

Overall, BERT demonstrated its effectiveness in disaster classification, achieving high accuracy in predicting the majority of instances correctly. Nevertheless, there is still room for improvement, particularly in reducing misclassifications between certain classes.

The history plot of the BERT model validation and training accuracy over 20 epochs reveals interesting trends, as shown in Fig. 4. The validation accuracy started at 94% and jumped to 96% at the 3$^{\mathrm{rd}}$ epoch, after which it remained constant throughout the remaining epochs. The training accuracy started at 88% and increased steadily to 98% by the 3$^{\mathrm{rd}}$ epoch. Subsequently, the training accuracy continued to increase slightly, reaching 99.2%. This suggests that the BERT model performs well in terms of training and validation accuracy, with the validation accuracy showing stability after an initial improvement. The consistent increase in training accuracy indicates that the model continued to fit the training data over time.

An eight-epoch comparison of three model variants was performed using BERT for disaster tweet classification. The objective was to analyze the influence of the number of hidden layers on the model performance. The original model, featuring one hidden layer, exhibited impressive progress during training, consistently enhancing accuracy, precision, recall, and F1-score on the validation dataset. This highlighted the proficiency of the model in capturing the essential patterns from the text data.

Variant 1, designed with two hidden layers, was competitive. Despite an initial lag in the metrics compared to the original model, it converged rapidly and reached commendable evaluation scores. This suggests that the additional hidden layer facilitated nuanced pattern recognition, yielding outcomes comparable to or better than those of the original model.

Variant 2, leveraging four hidden layers, demonstrated swift pattern discernment and efficient convergence. Despite its increased complexity, this architecture achieved notable precision, recall, and F1-score values, underscoring its capability to learn intricate text features.

These observations underscore the interplay between the number of hidden layers and model performance. Both the simpler and the deeper architectures yielded promising results, the latter potentially owing to enhanced feature extraction capabilities. Nevertheless, overfitting risks must be considered carefully when adding layers. In summary, this analysis, conducted using BERT for disaster tweet classification, sheds light on the impact of hidden layer variations and offers insights for architectural decisions in natural language processing tasks.
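To make the overfitting concern concrete, the sketch below counts the trainable parameters of a dense classification head with one, two, and four hidden layers. It rests on several assumptions not stated in the text: that the hidden layers are fully connected layers placed on top of the BERT output, that the encoder output has 768 dimensions (as in BERT-base), that each hidden layer has 128 units, and that there are four output classes. The function names are hypothetical.

```python
# ASSUMED settings for illustration; none of these values are given in the study.
INPUT_DIM = 768     # BERT-base output size (assumption about the variant used)
HIDDEN_DIM = 128    # hypothetical hidden-layer width
NUM_CLASSES = 4     # 'Accident', 'Earthquake', 'Flood', 'Other disaster'


def head_layer_sizes(input_dim, hidden_dim, num_hidden_layers, num_classes):
    """List the layer widths of a dense head, from input to output."""
    return [input_dim] + [hidden_dim] * num_hidden_layers + [num_classes]


def count_parameters(layer_sizes):
    """Count weights and biases of consecutive fully connected layers."""
    return sum(
        fan_in * fan_out + fan_out
        for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:])
    )


if __name__ == "__main__":
    for name, depth in [("Original", 1), ("Variant 1", 2), ("Variant 2", 4)]:
        sizes = head_layer_sizes(INPUT_DIM, HIDDEN_DIM, depth, NUM_CLASSES)
        print(f"{name}: layers={sizes} parameters={count_parameters(sizes)}")
```

Under these assumptions, each extra hidden layer adds a fixed block of weights to the head, so deeper variants have more capacity to fit the training tweets and correspondingly more need for validation monitoring or regularization.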

Fig. 4. Plot illustrating the validation and training accuracy of the BERT model over 20 epochs.

Conclusion and Future Work

This study analyzed the efficacy of BERT, GRU, and LSTM deep learning models in classifying disaster-related tweets. The results showed that BERT outperformed the other models in precision, recall, and accuracy. This highlights the potential of BERT for improved disaster management by analyzing tweets, identifying the disaster type, and formulating appropriate response strategies. The study also highlighted the importance of location information in disaster management and the varied word usage based on the type of disaster.

Building on these promising results, future research should extend to other disaster types, such as wildfires or pandemics, to explore the adaptability of these models. In addition, future work will examine how these models can be integrated with current disaster management systems to improve efficiency. Furthermore, as these models advance, the ethical considerations of content filtration and information prioritization should be evaluated to ensure responsible and transparent use that does not infringe on ethical norms or human rights.

ACKNOWLEDGMENTS

This work was supported in part by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No.2022R1A2C1003549) and in part by the 2023 Hongik University Innovation Support Program Fund.

Author

Ihsan Ullah

Ihsan Ullah received his B.S. in Computer Systems Engineering from the University of Engineering and Technology Peshawar, Pakistan, and his M.S. in Computer and Wireless Networks from COMSATS University, Islamabad, in 2021. He was a research assistant in the Wireless and Communication lab for half a year. He is pursuing his Ph.D. in Software and Communication Engineering at Hongik University, South Korea, under Prof. Byung-Seo Kim. His research interests encompass NDN, Underwater Wireless Sensor Networks, Cloud and Fog Computing, Vehicular Networks, and aspects of Machine Learning and Artificial Intelligence.

Anum Jamil

Anum Jamil is a final-year B.S. student in Applied Physics at NED University of Engineering and Technology, Karachi. She is currently interning at the university's Smart City Lab. She is also engaged with the distinguished President's Initiative of Artificial Intelligence, demonstrating her dedication to machine learning and AI. Her research interests lie in Natural Language Processing (NLP), the Internet of Things (IoT), and Artificial Intelligence.

Imtiaz ul Hassan

Imtiaz ul Hassan holds a B.S. degree in Computer Systems Engineering from the University of Engineering and Technology Peshawar, Pakistan. Currently, he is pursuing his M.S. in Data Science from NEDUET Karachi. In addition to his studies, Imtiaz is actively engaged as a research associate in the Smart City LAB at the National Center for Artificial Intelligence. His research interests primarily involve computer vision, natural language processing (NLP), autonomous vehicles, and robotics.

Byung-Seo Kim

Byung-Seo Kim received his B.S. degree in Electrical Engineering from In-Ha University, In-Chon, Korea in 1998 and his M.S. and Ph.D. degrees in Electrical and Computer Engineering from the University of Florida in 2001 and 2004, respectively. His Ph.D. study was supervised by Dr. Yuguang Fang. Between 1997 and 1999, he worked for Motorola Korea Ltd., PaJu, Korea, as a CIM Engineer in ATR&D. From January 2005 to August 2007, he worked for Motorola Inc., Schaumburg, Illinois, as a Senior Software Engineer in Networks and Enterprises, designing the protocol and network architecture of wireless broadband mission-critical communications. He is a professor in the Department of Software and Communications Engineering at Hongik University, Korea. He is an IEEE Senior Member and an Associate Editor of IEEE Access, Telecommunication Systems, and the Journal of the Institute of Electronics and Information Engineers. His studies have appeared in approximately 260 publications and 32 patents. His research interests include designing and developing efficient wireless/wired networks, including link-adaptable/cross-layer-based protocols, multi-protocol structures, wireless CCNs/NDNs, Mobile Edge Computing, physical layer design for broadband PLC, and resource allocation algorithms for wireless networks.