1. Introduction
Every day, users of e-commerce websites such as Amazon post thousands of inquiries
about individual products. After analysing a sizable dataset scraped from Amazon's
website [1], which includes product reviews and question-answering data, we found that: (i) the
majority of inquiries have response times of several days, averaging approximately
two days per question (excluding those that receive no response at all); (ii)
product reviews are considerably more detailed and informative than the responses
given to individual questions; and (iii) over half of the questions can
be addressed, at least in part, using the reviews and description content
already posted on the webpage [2].
During a journey through these platforms, a user may find navigation difficult or
the amount of available information overwhelming, which can cause them to lose interest,
switch platforms, or even turn to local vendors and sellers. In such situations, a
user may turn to the reviews, the existing question-and-answer threads, or other
information available on the product webpage, but the sheer volume of this content
can be confusing to consume. This problem has motivated a significant amount of research,
ranging from opinion-based question answering systems to extraction-based ones; four
types of systems are prominent, namely opinion-based, retrieval-based, extraction-based,
and generation-based. Automatic answer generation for users' questions therefore appears
to be a feasible way to address this issue.
This paper proposes a solution to the above problem: a Chrome extension coupled with
a QA model. The extension automatically reads the data from the user's open webpage
by leveraging DOM manipulation techniques, then serves it to the model together with
the user's query so that the question can be answered accordingly. We take a pre-trained
BERT model and fine-tune it on the SQuAD 2.0 dataset; we then use transfer learning
to further train the resulting QA model on the downstream task of product question
answering using the ePQA dataset. Transfer learning lets us efficiently leverage knowledge
acquired on related tasks. The remainder of this study is organised as follows: Section 2
discusses the literature on question answering systems, Section 3 presents the workings
and methodology of our proposed model and framework, Section 4 presents the performance
evaluation, Section 5 discusses future directions, and Section 6 concludes the study.
2. Related Work
A Product Question Answering (PQA) system seeks to deliver an answer A to a user's
query Q based on a supporting knowledge base B (reviews, specifications, etc.).
Extraction-based PQA studies, like traditional extraction-based QA [3] (also known as Machine Reading Comprehension (MRC)), seek to extract a certain span
of a document to serve as the answer to a given product-related question. SQuAD
[3,4] is a dataset in which each question is grounded in a single document, and the
answers are spans of words taken from that context. To address the challenge of developing
QA systems capable of managing extended contexts, SearchQA [5] provides contexts that include multiple documents. However, because the supporting
documents are retrieved via information retrieval after the (question, answer) pairs
have been established, there is no guarantee that the questions require reasoning
across several documents. [6] created the first extraction-based PQA dataset, ReviewRC, based on SemEval-2016 Task
5 reviews [7]. Similarly, [8] performed significant pre-processing on the Amazon dataset [1,9] to create a dataset for extraction-based PQA named AmazonQA. Based on the available
reviews, it annotates each question as answerable or unanswerable and heuristically
generates an answer span from the reviews that most accurately answers the query.
[10] present the SubjQA dataset to explore the relationship between subjectivity and PQA
in the context of product reviews; it includes six different domains based on the
TripAdvisor [11], Yelp, and Amazon [1] datasets [12].
Due to the scarcity of training data for extraction-based PQA, [6] use two popular pretraining objectives, masked language modelling and next sentence
prediction, to post-train the BERT encoder on both a general MRC dataset, SQuAD [3], and e-commerce review datasets, such as the Amazon Review [9] and Yelp datasets. In real-world applications, there are many irrelevant reviews,
and the question may be unanswerable. To address this, [8] first uses IR approaches to extract the top review snippets for each question and then
builds an answerability classifier to detect unanswerable questions based on the
available reviews. The extraction-based PQA is then implemented using a span-based
QA model, specifically R-Net [13]. Furthermore, [14] created a subjectivity-aware QA model that conducts multi-task learning for extraction-based
PQA and subjectivity categorization. Experimental results suggest that introducing
subjectivity significantly improves performance.
The significance of fine-tuning pre-trained language models, particularly BERT
[15] and XLNet [16], for achieving cutting-edge performance in diverse natural language processing tasks
is examined in [17]. That work highlights the limitations of BERT and introduces XLNet as a promising
alternative, showcasing its superiority in several NLP applications. Its methodology
section outlines the baseline models provided by the MRQA organizers, including BERT-base
and BERT-large, and introduces XLNet as a powerful alternative, explaining its architecture
and pre-training procedure over various datasets. The fine-tuning strategy is emphasised,
highlighting XLNet's effectiveness in achieving superior results across multiple tasks,
including QA. The Attention-over-Attention (AoA) mechanism is also discussed as an
effective component in extractive QA systems, known for its capability to learn the
importance distribution over inputs.
TransTQA [18] utilizes the ALBERT [19] model, configured with the Hugging Face Transformers library, for question-answering tasks.
It addresses the challenge of retrieving proper answers in non-factoid QA scenarios
by introducing a Siamese ALBERT network. This network employs a Siamese encoder to
generate contextualized representations of tokenized elements in input questions and
candidate answers, followed by a matching layer that computes the relevance between
question-and-answer embeddings. The system optimizes response selection through multiple
negative ranking losses during training. Transfer learning is applied through a three-step
fine-tuning procedure. First, the pre-trained ALBERT model is fine-tuned on a technical
corpus using a masked language modelling objective. Second, the Siamese ALBERT is fine-tuned
on source technical QA data. Lastly, the Siamese ALBERT is further fine-tuned on target QA data.
This process enhances the system's performance by incorporating technical domain knowledge
and addressing data scarcity in emerging technical forums. TransTQA offers automatic
responses by leveraging the Siamese ALBERT network, demonstrating quick and accurate question
answering. The adoption of transfer learning enhances the system's performance on
technical domain QA, showcasing its adaptability and effectiveness across different
datasets.
3. The Proposed Methodology and Framework
We work on the extractive question answering problem, which is classically formulated
as follows: given a product-related question $q$ and a supporting document
$d = \{t_1, \ldots, t_n\} \in D$, which consists of one or more product reviews, the
goal is to discover a sequence of tokens (a text span) $a = \{t_s, \ldots, t_e\}$ in
$d$ that properly answers $q$, where $1 \le s \le n$, $1 \le e \le n$, and $s \le e$ [12]. To solve this problem, we propose our QA model and incorporate it into the user's
e-commerce journey through the framework shown in Fig. 1, giving the user a seamless way to register queries and get them resolved.
We scrape the data from the webpage, in compliance with the company's terms of use
and scraping policy, using a script that leverages DOM manipulation techniques to supply
the data to our model. We then feed this data to the model along with the user's question
and answer it accordingly. The model learning method is shown in Fig. 2.
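To make the span-extraction step concrete, the following is a minimal sketch (not the authors' exact code) of how the scraped page content and the user's question could be passed to a fine-tuned extractive QA model via the Hugging Face pipeline API; the checkpoint path and the example texts are placeholders.

```python
from transformers import pipeline

# "./bert-epqa" is a placeholder for the checkpoint fine-tuned in Section 3;
# any extractive QA checkpoint would work for illustration.
qa = pipeline("question-answering", model="./bert-epqa",
              handle_impossible_answer=True)   # allow "no answer" predictions

# Hypothetical sentences scraped from a product page (reviews/description).
scraped_sentences = [
    "The kettle has a brushed stainless steel body.",
    "It switches off automatically once the water boils.",
]
context = " ".join(scraped_sentences)

result = qa(question="Does this kettle switch off automatically?",
            context=context)
print(result["answer"], result["score"])   # extracted span a = {t_s, ..., t_e}
```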
SQuAD 2.0 [4], an extension of the Stanford Question Answering Dataset, introduces a more challenging
dimension to extractive reading comprehension by combining existing SQuAD data with
over 50,000 unanswerable questions crafted adversarially. Unlike its predecessor SQuAD
1.1, which focused solely on answerable questions, SQuAD 2.0 includes intentionally
unanswerable questions, challenging systems not only to locate answers but also to
abstain when appropriate. The unanswerable questions, generated by crowd workers to
resemble answerable ones, encompass diverse linguistic phenomena, surpassing the complexity
of previous rule-based methods. SQuAD 2.0's uniqueness lies in its demand for models
to discern when a question lacks contextual support for an answer. This dataset addresses
the limitations of SQuAD 1.1, where human accuracy was likely underestimated, and
surpasses the diversity of rule-based approaches. Human evaluation confirms the cleanliness
of the dataset, revealing that a robust neural system achieving 86% F1 Score on SQuAD
1.1 drops to 66% on SQuAD 2.0 [20], emphasizing its increased difficulty. This dataset does not address the domain of
product question answering, but it offers a vast amount of diverse data, with 130,319 QA pairs
in the training set and 11,873 in the validation set [21]. Fig. 3 below shows the ratio of answerable to unanswerable QA pairs in the training set.
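As an illustration, this ratio can be reproduced directly from the public SQuAD 2.0 release; the short sketch below (assuming the Hugging Face datasets library) counts answerable and unanswerable training questions.

```python
from datasets import load_dataset

squad = load_dataset("squad_v2")
train = squad["train"]                     # 130,319 question-answer pairs

# In SQuAD 2.0, unanswerable questions have an empty list of answer texts.
unanswerable = sum(len(ex["answers"]["text"]) == 0 for ex in train)
print(f"answerable: {len(train) - unanswerable}, unanswerable: {unanswerable}")
```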
ePQA dataset: To align our model with the domain and further train it for product
question answering, we leverage the ePQA dataset developed by Amazon Science
[22]. It contains 121,750 QA pairs in the training set, 9,770 pairs in the dev set, and 20,142
QA pairs in the test set. It has high annotation quality, achieved through multiple rounds
of checking, with an error rate below 5 percent, and it does not restrict product categories.
Fig. 5 depicts the question length distribution for answerable and non-answerable pairs.
Each label is checked against the surrounding sentences to ensure that it is correct.
The dataset includes the following fields: ASIN: the product's ASIN number; question:
the question text; qid: the question id; qa_pair_id: the question-answer id; source: the source
of the context, shown in Fig. 4 with the respective QA pair counts; context: the surrounding sentences; label:
2 signifies that the context completely answers the question, 1 that it does not fully answer
it, and 0 that it is irrelevant or unable to answer; and answer: a manually written,
natural-sounding response (provided when the context completely answers the question).
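For illustration only, a record with these fields might look as follows; all values below are invented and do not come from the actual ePQA data.

```python
# Hypothetical ePQA-style record (invented values, for illustration only).
example = {
    "ASIN": "B000000000",                  # placeholder product identifier
    "qid": "q-0001",
    "qa_pair_id": "q-0001-a-01",
    "question": "Is the body of this kettle made of stainless steel?",
    "source": "review",                    # attribute / description / bullet point / review / CQA
    "context": "The body is brushed stainless steel and feels sturdy.",
    "label": 2,                            # 2 = fully answers, 1 = partial, 0 = irrelevant/no answer
    "answer": "Yes, the body is made of stainless steel.",
}
```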
The BERT model leverages contextualized representation learning for natural language.
BERT, initially pre-trained on a large unlabelled English corpus, utilizes the transformer
architecture to enhance contextual awareness and understand searcher intent. Unlike
traditional word embedding models, BERT employs masked language modelling, enabling it
to derive word representations based on context rather than fixed identities [15]. The model works by first performing unsupervised pre-training and then supervised
fine-tuning on labelled data. The masked language modelling (MLM) objective enables
bidirectional learning, which allows BERT to capture context efficiently. Given the
continuous growth of online data, these capabilities make BERT well suited to providing
a summarized view of, and instant answers over, detailed content.
We utilise these capabilities of BERT by fine-tuning a baseline model on the SQuAD 2.0
dataset. We start with the BERT-Large model (uncased, 24-layer, 1024-hidden, 340M parameters)
and then fine-tune it using SQuAD's training set (v2.0). The model's inputs are padded
to 384 tokens, the learning rate is set to $3 \times 10^{-5}$, and all other default
settings are used. Training is done for three epochs with a batch size of 24.
We then apply transfer learning to train our model further on a question answering task
specific to product questions and answers, using the ePQA dataset, whose structure is
discussed above; the trained model is fine-tuned further on this product question answering
dataset using transfer learning techniques [23]. Transfer learning is a machine learning technique in which a model created for one
task is used as the foundation for a model for another task. In other words, the
knowledge obtained from solving one problem is applied to a related but not identical
problem. This is particularly useful when the second task has less training data
available. We adopt this approach because our product QA data is not as extensive as
the SQuAD 2.0 dataset. We train the model on the ePQA dataset in the same way, but with
a larger number of epochs, since this dataset is not as rigorous as the SQuAD
2.0 dataset. A sketch of the two-stage fine-tuning procedure is given below.
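The following is a minimal sketch of this two-stage fine-tuning, assuming the Hugging Face Transformers and Datasets libraries (the paper does not name its training framework; the ePQA file path, the conversion of ePQA records into SQuAD-style (question, context, answers) examples, and the number of ePQA epochs are assumptions made only for illustration).

```python
from datasets import load_dataset
from transformers import (AutoModelForQuestionAnswering, AutoTokenizer,
                          Trainer, TrainingArguments)

MAX_LENGTH = 384                                   # input length used in the paper

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased")

def prepare_features(batch):
    """Tokenize question/context pairs and map character-level answer spans
    to start/end token positions (standard SQuAD-style preprocessing)."""
    enc = tokenizer(batch["question"], batch["context"],
                    truncation="only_second", max_length=MAX_LENGTH,
                    padding="max_length", return_offsets_mapping=True)
    starts, ends = [], []
    for i, offsets in enumerate(enc["offset_mapping"]):
        answers = batch["answers"][i]
        cls_index = enc["input_ids"][i].index(tokenizer.cls_token_id)
        if len(answers["text"]) == 0:              # unanswerable: point at [CLS]
            starts.append(cls_index); ends.append(cls_index); continue
        start_char = answers["answer_start"][0]
        end_char = start_char + len(answers["text"][0])
        seq_ids = enc.sequence_ids(i)
        tok_start = seq_ids.index(1)               # first context token
        tok_end = len(seq_ids) - 1
        while seq_ids[tok_end] != 1:
            tok_end -= 1
        if not (offsets[tok_start][0] <= start_char and offsets[tok_end][1] >= end_char):
            starts.append(cls_index); ends.append(cls_index); continue  # answer truncated away
        while tok_start < len(offsets) and offsets[tok_start][0] <= start_char:
            tok_start += 1
        while offsets[tok_end][1] >= end_char:
            tok_end -= 1
        starts.append(tok_start - 1); ends.append(tok_end + 1)
    enc["start_positions"], enc["end_positions"] = starts, ends
    return enc

# Stage 1: fine-tune on SQuAD 2.0 with the hyperparameters reported above.
squad = load_dataset("squad_v2").map(prepare_features, batched=True)
args = TrainingArguments(output_dir="bert-squad2", learning_rate=3e-5,
                         per_device_train_batch_size=24, num_train_epochs=3)
Trainer(model=model, args=args, train_dataset=squad["train"]).train()

# Stage 2: transfer to the product-QA domain by continuing on ePQA
# (hypothetical local file, already converted to SQuAD-style fields;
# the paper only says "more epochs", so 5 is a placeholder).
epqa = load_dataset("json", data_files={"train": "epqa_train.json"})
epqa = epqa.map(prepare_features, batched=True)
args = TrainingArguments(output_dir="bert-epqa", learning_rate=3e-5,
                         per_device_train_batch_size=24, num_train_epochs=5)
Trainer(model=model, args=args, train_dataset=epqa["train"]).train()
```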
Scraping module: We conducted a rigorous analysis of the legalities and constraints
associated with web scraping the data required for our project. With this understanding,
we implemented web scraping in our scraping module, using JavaScript-based DOM manipulation
to extract the relevant information from prominent e-commerce platforms.
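The paper does not specify how the extension hands the scraped context and the user's question to the model; one plausible arrangement, sketched below purely as an assumption, is a small HTTP endpoint that the extension's scraping script can POST to (Flask and the checkpoint path are illustrative choices, not the authors' stated design).

```python
from flask import Flask, jsonify, request
from transformers import pipeline

app = Flask(__name__)
# "./bert-epqa" is a placeholder path for the fine-tuned checkpoint.
qa = pipeline("question-answering", model="./bert-epqa",
              handle_impossible_answer=True)

@app.route("/answer", methods=["POST"])
def answer():
    payload = request.get_json()          # {"question": "...", "context": "..."}
    result = qa(question=payload["question"], context=payload["context"])
    return jsonify({"answer": result["answer"], "score": result["score"]})

if __name__ == "__main__":
    app.run(port=5000)                    # the extension POSTs scraped text here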
Fig. 1. Proposed framework for the Chrome extension.
Fig. 2. Flow diagram of model learning method.
Fig. 3. Ratio of answerable and unanswerable questions in SQuAD 2.0 dataset.
Fig. 4. QA pair counts from various sources in the ePQA dataset (attributes, the seller's
description and bullet points shown on the webpage, user reviews, and community question
answers (CQA)).
Fig. 5. Question length distribution for ePQA dataset.
4. Performance Evaluation
The model's predictions are evaluated using a variety of evaluation metrics. We use
the validation set provided by [4] and its evaluation script to evaluate our model's predictions on the SQuAD
2.0 dataset; similarly, for the ePQA dataset, we use the test set of
QA pairs provided by [22].
The evaluation builds on four elementary counts: True Positives (TP), True Negatives (TN),
False Positives (FP), and False Negatives (FN). True positives are cases in
which the actual outcome is positive and the model likewise predicts it as positive.
True negatives are cases that are correctly predicted as negative. False
positives are cases that are predicted as positive but are actually negative.
False negatives are cases that are predicted as negative but are actually
positive. Precision, recall, and accuracy are then calculated, as shown below.
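$\mathrm{Precision} = \frac{TP}{TP + FP}$, $\mathrm{Recall} = \frac{TP}{TP + FN}$, $\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$.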
F1-score: The F1 score is calculated as the harmonic mean of the precision and recall
scores, as shown below.
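$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$.
In the SQuAD-style evaluation scripts we rely on, these quantities are computed at the token level for each prediction: the tokens shared between the predicted span and the gold answer play the role of true positives. A minimal sketch of this computation (simplified; the official script additionally strips punctuation and articles) is:

```python
import collections

def f1_score(prediction, ground_truth):
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    common = collections.Counter(pred_tokens) & collections.Counter(gold_tokens)
    num_same = sum(common.values())            # shared tokens act as TP
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def exact_match(prediction, ground_truth):
    return float(prediction.strip().lower() == ground_truth.strip().lower())
```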
The Exact Match (EM) and F1 scores of our model are shown below in Table 1.
Table 2 presents a comparative overview of performance with respect to other existing models.
Table 1. Our model performance.

Datasets | F1-Score | Exact Match
SQuAD 2.0 | 78.39 | 74.09
ePQA | 72.31 | 68.98
Table 2. Comparison of performance on the SQuAD 2.0 dataset with other existing models.

Model | F1 Score | EM
SAN [24] | 71.44 | 68.65
BiDAF [25] | 71.58 | 68.02
SLQA | 74.43 | 71.46
Hydra BERT | 74.58 | 71.29
BERT-base [15] | 77.81 | 74.07
AskAI | 78.76 | 74.82
5. Future Directions
The field of Question Answering Systems (QAS) continues to evolve, and there remains
substantial room for improvement in model performance. Researchers should explore
avenues to enhance existing models. This could involve fine-tuning hyperparameters,
experimenting with novel architectures, or incorporating additional pre-training data.
By pushing the boundaries of model effectiveness, we can achieve more accurate and
reliable answers. Some other directions that we find appropriate and that require attention
are listed below.
1. Multilingual QA Systems
While English has been the primary focus of QAS research, there is a significant gap
in deep QA systems for non-English languages. Most existing models lack high-quality
word embeddings and balanced annotated datasets in languages other than English. Future
work should address this limitation by developing robust multilingual QA systems.
These models should be trained on diverse language-specific data and consider the
unique linguistic characteristics of each language.
2. Resource-Scarce Languages
In addition to major languages, attention should be given to resource-scarce languages.
These languages often lack comprehensive QA datasets and well-annotated corpora. Researchers
can contribute by creating domain-specific datasets and adapting existing models to
handle these languages effectively. Such efforts would democratize access to QA technology
across linguistic diversity.
3. Contextual Web Scraping Techniques
Scraping relevant context from product webpages remains a challenge. Future research
should focus on developing adaptable techniques that can swiftly extract context from
various websites and adapt to different product categories. This involves addressing
variations in webpage layouts, dynamic content, and domain-specific information.
In summary, the future of QAS lies in improving model performance, embracing multilingualism,
addressing resource scarcity, refining scraping techniques, and maintaining ethical
standards. By addressing these challenges, we can advance the field and create more
effective and inclusive QA systems.
6. Conclusion
Even the most modern NLP methods struggle to answer questions about long and complex
texts. This paper takes a step toward addressing this challenge by proposing a solution
for question answering in the e-commerce setting. We demonstrated significant performance
across different evaluation metrics and developed a real-world solution for users.
REFERENCES
J. McAuley and A. Yang, ``Addressing complex and subjective product-related queries
with customer reviews,'' arXiv preprint arXiv:1512.06863, Dec. 2015.

M. Gupta, N. Kulkarni, R. Chanda, A. Rayasam, and Z. C. Lipton, ``AmazonQA: A review-based
question answering task,'' arXiv preprint arXiv:1908.04364, Aug. 2019.

P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, ``SQuAD: 100,000+ questions for
machine comprehension of text,'' arXiv preprint arXiv:1606.05250, Jun. 2016.

P. Rajpurkar, R. Jia, and P. Liang, ``Know what you don't know: Unanswerable questions
for SQuAD,'' arXiv preprint arXiv:1806.03822, Jun. 2018.

M. Dunn, L. Sagun, M. Higgins, V. U. Guney, V. Cirik, and K. Cho, ``SearchQA: A new
Q&A dataset augmented with context from a search engine,'' arXiv preprint arXiv:1704.05179,
Apr. 2017.

H. Xu, B. Liu, L. Shu, and P. S. Yu, ``BERT post-training for review reading comprehension
and aspect-based sentiment analysis,'' arXiv preprint arXiv:1904.02232, Apr. 2019.

M. Pontiki, D. Galanis, H. Papageorgiou et al., ``SemEval-2016 task 5: Aspect based
sentiment analysis,'' Proc. of 10th International Workshop on Semantic Evaluation
(SemEval), pp. 19-30, Jan. 2016.

M. Gupta, N. Kulkarni, R. Chanda, A. Rayasam, and Z. C. Lipton, ``AmazonQA: A review-based
question answering task,'' Proc. of the Twenty-Eighth International Joint Conference
on Artificial Intelligence (IJCAI), pp. 4996-5002, Aug. 2019.

R. He and J. McAuley, ``Ups and downs: Modeling the visual evolution of fashion trends
with one-class collaborative filtering,'' Proc. of the 25th International Conference
on World Wide Web, pp. 507-517, Apr. 2016.

J. Bjerva, N. Bhutani, B. Golshan, W.-C. Tan, and I. Augenstein, ``SUBJQA: A dataset
for subjectivity and review comprehension,'' Proc. of the Conference on Empirical
Methods in Natural Language Processing (EMNLP), pp. 5480-5494, Jan. 2020.

H. Wang, Y. Lu, and C. Zhai, ``Latent aspect rating analysis on review text data:
A rating regression approach,'' Proc. of the 16th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, pp. 783-792, Jul. 2010.

D. Yang, W. Zhang, Y. Qian, and W. Lam, ``Product question answering in E-Commerce:
A survey,'' arXiv preprint arXiv:2302.08092, Feb. 2023.

W. Wang, N. Yang, F. Wei, B. Chang, and M. Zhou, ``Gated self-matching networks for
reading comprehension and question answering,'' Proc. of the 55th Annual Meeting of
the Association for Computational Linguistics (ACL), pp. 189-198, Jan. 2017.

J. Bjerva, N. Bhutani, B. Golshan, W.-C. Tan, and I. Augenstein, ``SUBJQA: A dataset
for subjectivity and review comprehension,'' Proc. of the Conference on Empirical
Methods in Natural Language Processing (EMNLP), pp. 5480-5494, Jan. 2020.

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, ``BERT: Pre-training of deep bidirectional
transformers for language understanding,'' arXiv preprint arXiv:1810.04805, Oct. 2018.

Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le, ``XLNet: Generalized
autoregressive pretraining for language understanding,'' arXiv preprint arXiv:1906.08237,
Jun. 2019.

J. Li, ``Fine-grained sentiment analysis with a fine-tuned BERT and an improved pre-training
BERT,'' Proc. of IEEE International Conference on Image Processing and Computer Applications
(ICIPCA), pp. 1031-1034, 2023.

W. Yu et al., ``A technical question answering system with transfer learning,'' Proc.
of the 2020 Conference on Empirical Methods in Natural Language Processing: System
Demonstrations, pp. 92-99, Jan. 2020.

Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, ``ALBERT: A lite
BERT for self-supervised learning of language representations,'' arXiv preprint arXiv:1909.11942,
Sep. 2019.

C. Wu, L. Li, Z. Liu, and X. Zhang, ``Machine reading comprehension based on SpanBERT
and dynamic convolutional attention,'' Proc. of the 4th International Conference on
Advanced Information Science and System, pp. 1-5, Nov. 2022.

X. Li, Z. Cheng, Z. Shen, H. Zhang, H. Meng, X. Xu, and G. Xiao, ``Building a question
answering system for the manufacturing domain,'' IEEE Access, vol. 10, pp. 75816-75824,
2022.

X. Shen, A. Asai, B. Byrne, and A. de Gispert, ``xPQA: Cross-lingual product question
answering across 12 languages,'' arXiv preprint arXiv:2305.09249, May 2023.

D. Dashenkov, K. Smelyakov, and O. Turuta, ``Methods of multilanguage question answering,''
Proc. of IEEE 8th International Conference on Problems of Infocommunications, Science
and Technology (PIC S&T), Oct. 2021.

X. Liu, Y. Shen, K. Duh, and J. Gao, ``Stochastic answer networks for machine reading
comprehension,'' arXiv preprint arXiv:1712.03556, Dec. 2017.

W. Wang, M. Yan, and C. Wu, ``Multi-granularity hierarchical attention fusion networks
for reading comprehension and question answering,'' Proc. of the 56th Annual Meeting
of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1705-1714,
Jan. 2018.

Author
Prashant Upadhyay is an Assistant Professor in the Department of CSE, School of
Engineering and Technology, Sharda University, Greater Noida, Uttar Pradesh, India.
He received his Doctor of Philosophy in computer science and engineering from
Gautam Buddha University. He has more than 6 years of teaching and research experience.
He has published more than 16 research papers, as author or co-author, in internationally
reputed journals indexed in Web of Science, Scopus, and SCIE, and in top-ranked conferences.
His research findings have appeared in Archives of Computational Methods in Engineering
(Springer), Neural Computing and Applications (Springer), IGI Global (USA), IEEE Xplore,
and Taylor & Francis. He has served as a TPC member and reviewer for various international conferences.
He has authored seven books with various publishers, including IGI Global, Bentham
Science, and CRC Press, among others. He has served as a session chair at several international
conferences, including ICDT-2024, ICDETGT-2023, and ICDT-2023. He has authored a book for Wiley,
Introduction to Python (ISBN: 978-93-5746-219-8), and has published four
Indian patents. His research areas include Computer Vision and NLP.
Tuhina Panda is an Assistant Professor and HOD in the Department of Computer Science
and Engineering at Hi-Tech Institute of Technology, Khordha, Bhubaneswar, Odisha.
She completed her B.Tech. degree in information technology from BPUT in 2010 and
her M.Tech. degree in computer science and engineering in 2013 from SOA University,
Bhubaneswar, and she is pursuing a Ph.D. degree in the Department of Computer Science
and Engineering at SOA University, Bhubaneswar. She has more than 12 years of
UG and PG teaching experience and has published one Indian and one international patent
(granted). Her research areas include Machine Learning, Artificial Intelligence, Image
Processing, Cloud Computing, and Data Mining.
Preeti Jaidka is an Assistant Professor in the Department of Electronics and Communication
Engineering at the JSS Academy of Technical Education, Noida, where her research focuses
on machine learning algorithms and their applications in agriculture and healthcare. Her
ongoing research includes the development of convolutional neural networks (CNNs)
and generative adversarial networks (GANs) for improving diagnostic accuracy and reducing
false positives in cancer imaging. In addition to healthcare, Dr. Jaidka's deep learning
models have been used to identify plant leaf diseases, aiming to enhance precision
farming and crop yield sustainability. She has presented her research at major conferences,
including the IEEE International Conference on Data Science. In addition to her research,
she is actively involved in mentoring undergraduate students and collaborating with
experts across the fields of AI, medicine, and agriculture.
Nidhi Gupta is an Associate Professor in the School of Engineering and Technology,
the Department of Computer Science and Engineering, Sharda University, Greater Noida.
She has more than 17 years of academic experience. She received her Ph.D. degree in
computer science and engineering from Jaypee Institute of Information Technology, Noida,
India. She has broad research interests in Machine Learning, Data Analysis, Information
Retrieval and Databases. She has published more than 15 research papers in various
SCI/WOS/Scopus and UGC-Care journals.