

  1. (Department of Computer Science and Engineering, School of Computing Science and Engineering, Sharda University, Greater Noida, India, prashanttheace@gmail.com)
  2. (Department of Computer Science and Engineering, Hi-Tech Institute of Technology, Khordha, Bhubaneswar, Odisha, India, hello.tuhina@gmail.com)
  3. (Department of Electronics and Communication Engineering, JSS Academy of Technical Education, India, preetijaidka@jssaten.ac.in)
  4. (Department of Computer Science and Engineering, School of Computing Science and Engineering, Sharda University, Greater Noida, India, nidhi0208@gmail.com)



Keywords: BERT, Transfer learning, Extractive question answering, Natural language processing

1. Introduction

Every day, users of e-commerce websites such as Amazon post thousands of inquiries about individual products. After analysing a sizable dataset scraped from Amazon's website [1], which included product reviews and question-answering data, we found that: (i) the majority of questions have response times of several days, averaging approximately two days per question (aside from those that receive no response at all); (ii) product reviews are rather detailed and informative compared with the responses given to particular questions; and (iii) over half of the questions can be answered (at least in part) using the reviews and description content already posted on the webpage [2].

While navigating these platforms, a user may have difficulty finding information or may find the amount of available data overwhelming, which may result in losing interest, switching platforms, or even turning to local vendors and sellers. In such circumstances, the user may turn to the reviews, the available question-answer pairs, or other information on the product webpage, but the sheer volume of data can be confusing for any user to consume. This problem has motivated a significant amount of research, ranging from opinion-based question answering systems to extraction-based ones; four types of systems are prominent, namely opinion-based, retrieval-based, extraction-based, and generation-based. Hence, automatic answer generation for users' questions appears to be a feasible solution to this issue.

This paper proposes a solution to the above problem: an architecture comprising a Chrome extension and a QA model. The extension automatically reads the data from the user's opened webpage by leveraging DOM manipulation techniques, then serves it to the model together with the user's query so that the question can be answered accordingly. We take the pre-trained BERT model and fine-tune it on the SQuAD 2.0 dataset; we then use transfer learning to further train the resulting QA model on the downstream task of product question answering using the ePQA dataset. Transfer learning lets us efficiently leverage knowledge acquired on related tasks. This study is divided into the following sections: Section 2 discusses the literature on question answering systems, Section 3 presents the workings and methodology of our proposed model and framework, Section 4 presents the results, Section 5 outlines future directions, and Section 6 concludes the study.

2. Related Work

A Product Question Answering (PQA) system seeks to deliver an answer A to a user's query Q based on a set of supporting knowledge B (reviews, specifications, etc.). Extraction-based PQA studies, like traditional extraction-based QA [3] (also known as Machine Reading Comprehension (MRC)), seek to extract a span of a document to serve as the answer to a given product-related question. SQuAD [3,4] is a dataset in which each question is paired with a single document, and the answers consist of spans of words from the context. To address the challenge of developing QA systems capable of managing extended contexts, SearchQA [5] provides contexts that include multiple documents. Because the supporting documents are retrieved via information retrieval after the (question, answer) pairs have been established, there is no guarantee that the questions require reasoning across several documents. [6] created the first extraction-based PQA dataset, ReviewRC, based on SemEval-2016 Task 5 reviews [7]. Similarly, [8] performed significant pre-processing on the Amazon dataset [1,9] to create a dataset for extraction-based PQA named AmazonQA. Based on the available reviews, it annotates each question as answerable or unanswerable and heuristically generates an answer span from the reviews that most accurately responds to the query. [10] present the SubjQA dataset to explore the relationship between subjectivity and PQA in the context of product reviews; it includes six different domains based on the TripAdvisor [11], Yelp, and Amazon [1] datasets [12].

Due to the scarcity of training data for extraction-based PQA, [6] use two popular pretraining objectives, masked language modelling and next sentence prediction, to post-train the BERT encoder on both a general MRC dataset, SQuAD [3], and e-commerce review datasets such as the Amazon Review [9] and Yelp datasets. In real-world applications, many reviews are irrelevant, and a question may be unanswerable. To handle this, [8] first uses IR approaches to extract the top review snippets for each question and then builds an answerability classifier to detect unanswerable questions based on the available reviews. The extraction-based PQA is then implemented using a span-based QA model, specifically R-Net [13]. Furthermore, [14] created a subjectivity-aware QA model that conducts multi-task learning for extraction-based PQA and subjectivity classification. Experimental results suggest that introducing subjectivity significantly improves performance.

Prior work has demonstrated the significance of fine-tuning pre-trained language models, particularly BERT [15] and XLNet [16], for achieving cutting-edge performance in diverse natural language processing tasks [17]. It highlights the limitations of BERT and introduces XLNet as a promising alternative, showcasing its superiority in several NLP applications. The methodology outlines the baseline models provided by the MRQA organizers, including BERT-base and BERT-large, and introduces XLNet as a powerful alternative, explaining its architecture and pre-training procedure over various datasets. The fine-tuning strategy is emphasized, highlighting XLNet's effectiveness in achieving superior results across multiple tasks, including QA. The Attention-over-Attention (AoA) mechanism is discussed as an effective component in extractive QA systems, known for its capability to learn an importance distribution over the inputs.

TransTQA [18] utilizes the ALBERT [19] model, configured with the Hugging Face Transformers library, for question-answering tasks. It addresses the challenge of retrieving proper answers in non-factoid QA scenarios by introducing a Siamese ALBERT network. This network employs a Siamese encoder to generate contextualized representations of the tokenized input questions and candidate answers, followed by a matching layer that computes the relevance between question and answer embeddings. The system optimizes answer selection through a multiple-negatives ranking loss during training. Its transfer learning uses a three-step fine-tuning procedure: first, the pre-trained ALBERT model is fine-tuned on a technical corpus using a masked language modelling task; second, a Siamese ALBERT is fine-tuned with source technical QA data; lastly, the Siamese ALBERT is further fine-tuned with target QA data. This process enhances the system's performance by incorporating technical domain knowledge and addressing data scarcity in emerging technical forums. TransTQA provides automatic responses by leveraging the Siamese ALBERT, demonstrating quick and accurate question answering, and its adoption of transfer learning improves performance on technical-domain QA, showcasing its adaptability and effectiveness across different datasets.

3. The Proposed Methodology and Framework

We work on the extractive question answering problem, which is classically defined as follows: given a product-related question $q$ and a supporting document $d = \{t_1, \ldots, t_n\} \in D$, which consists of one or more product reviews, the goal is to discover a sequence of tokens (a text span) $a = \{t_s, \ldots, t_e\}$ in $d$ that answers $q$ properly, where $1 \le s \le e \le n$ [12]. To solve this problem, we propose our QA model and incorporate it into the user's journey over e-commerce via the framework shown in Fig. 1. The user thus gets a seamless way to register queries and get them resolved. We scrape the data from the webpage, following the company's terms of use and scraping policy, using a script that leverages DOM manipulation techniques, and we then feed this data to our model along with the user's question so that the model can answer it. The model learning method is shown in Fig. 2.
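To make the span-extraction formulation concrete, the following minimal Python sketch (assuming the Hugging Face transformers library and PyTorch; the public checkpoint name is only a stand-in for our fine-tuned model) shows how start/end logits are turned into the answer span $a = \{t_s, \ldots, t_e\}$ at inference time:

    import torch
    from transformers import AutoTokenizer, AutoModelForQuestionAnswering

    # Placeholder checkpoint: any BERT-style model fine-tuned for extractive QA.
    tok = AutoTokenizer.from_pretrained("deepset/bert-base-cased-squad2")
    model = AutoModelForQuestionAnswering.from_pretrained("deepset/bert-base-cased-squad2")

    question = "Is this kettle cordless?"
    context = "The kettle sits on a 360-degree swivel base, so it lifts off cordlessly."

    inputs = tok(question, context, return_tensors="pt", truncation=True, max_length=384)
    with torch.no_grad():
        out = model(**inputs)

    # Pick the most probable start position, then the best end position with s <= e.
    start = int(out.start_logits.argmax())
    end = int(out.end_logits[0, start:].argmax()) + start
    print(tok.decode(inputs["input_ids"][0, start : end + 1]))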

SQuAD 2.0 [4], an extension of the Stanford Question Answering Dataset, introduces a more challenging dimension to extractive reading comprehension by combining existing SQuAD data with over 50,000 adversarially crafted unanswerable questions. Unlike its predecessor SQuAD 1.1, which focused solely on answerable questions, SQuAD 2.0 includes intentionally unanswerable questions, challenging systems not only to locate answers but also to abstain when appropriate. The unanswerable questions, written by crowd workers to resemble answerable ones, encompass diverse linguistic phenomena, surpassing the complexity of previous rule-based methods. SQuAD 2.0's uniqueness lies in its demand that models discern when a question lacks contextual support for an answer. The dataset addresses the limitations of SQuAD 1.1, where human accuracy was likely underestimated, and surpasses the diversity of rule-based approaches. Human evaluation confirms the cleanliness of the dataset, and a robust neural system achieving an 86% F1 score on SQuAD 1.1 drops to 66% on SQuAD 2.0 [20], emphasizing its increased difficulty. This dataset does not address the domain of product question answering, but it offers a vast amount of diverse data, with 130,319 QA pairs in the training set and 11,873 in the validation set [21]. Fig. 3 shows the ratio of answerable to unanswerable QA pairs in the training dataset.
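For reference, the answerable/unanswerable split plotted in Fig. 3 can be reproduced with a short sketch, assuming the Hugging Face datasets package, in which unanswerable SQuAD 2.0 questions carry an empty list of answer texts:

    from datasets import load_dataset

    squad = load_dataset("squad_v2", split="train")

    # Unanswerable questions in SQuAD 2.0 have an empty list of answer texts.
    unanswerable = sum(1 for ex in squad if len(ex["answers"]["text"]) == 0)
    answerable = len(squad) - unanswerable
    print(f"answerable: {answerable}, unanswerable: {unanswerable}")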

ePQA dataset: To align our model with the domain and further train it for product question answering, we leverage another dataset, ePQA, developed by Amazon Science [22]. It contains 121,750 QA pairs in the training set, 9,770 pairs in the dev set, and 20,142 QA pairs in the testing set. It has high annotation quality after multiple rounds of checking, with an error rate of less than 5 percent, and it does not restrict product categories. Fig. 5 depicts the question length distribution for answerable and non-answerable pairs. Each label is checked against the surrounding sentences to ensure that it is correct. The dataset includes the following fields: ASIN (the product's ASIN number); question (the question text); qid (the question id); qa_pair_id (the question-answer id); source (the source of the context, shown in Fig. 4 with the respective QA pair counts); context (the surrounding sentences); label (2 signifies completely answering, 1 means not fully answering, and 0 means irrelevant or unable to answer); and answer (a manually written, natural-sounding response, provided when the context completely answers the question).
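Since our training pipeline expects SQuAD-style inputs, ePQA records must be mapped onto (question, context, answer-span) triples. The sketch below shows one way to do this, under the assumption that the files are stored as JSON lines with the fields listed above; the exact-match heuristic for locating the answer span in the context is ours for illustration, not part of the dataset release:

    import json

    def epqa_to_squad(path):
        """Map ePQA rows to SQuAD-style examples; label 2 = fully answered."""
        examples = []
        with open(path) as f:
            for line in f:
                row = json.loads(line)
                if row["label"] == 2:
                    # Heuristic: anchor the written answer to its span in the
                    # context; skip pairs where no exact match exists.
                    start = row["context"].find(row["answer"])
                    if start >= 0:
                        examples.append({
                            "question": row["question"],
                            "context": row["context"],
                            "answers": {"text": [row["answer"]],
                                        "answer_start": [start]},
                        })
                else:
                    # Our simplification: treat labels 0 and 1 as
                    # unanswerable, in the SQuAD 2.0 style.
                    examples.append({
                        "question": row["question"],
                        "context": row["context"],
                        "answers": {"text": [], "answer_start": []},
                    })
        return examples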

The BERT model leverages natural language processing techniques to model human language, with a particular focus on contextualized learning. BERT, initially pre-trained on an unlabelled English Wikipedia corpus, utilizes the transformer architecture to enhance contextual awareness and understand searcher intent. Unlike traditional word embedding models, BERT employs masked language modelling, enabling it to derive word representations from context rather than fixed identities [15]. The model first performs unsupervised pre-training and is then fine-tuned on labelled data in a supervised manner. The masked language model (MLM) objective enables bidirectional learning, which allows BERT to capture context efficiently. As data continues to grow, such a system can provide a summarized view of, and instant answers about, detailed content. We utilise these capabilities of BERT by fine-tuning a baseline model on the SQuAD 2.0 dataset. We start with the BERT-Large model (uncased, 24-layer, 1024-hidden, 340M parameters) and fine-tune it on SQuAD's training set (v2.0). The model's inputs are padded to 384 tokens, the learning rate is set to $3 \times 10^{-5}$, and all other default settings are used. Training is done for three epochs with a batch size of 24. We then apply transfer learning to train the model further on the specific downstream task of product question answering, using the ePQA dataset whose structure was discussed above.
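A minimal fine-tuning sketch corresponding to this setup, assuming the Hugging Face transformers and datasets packages; the preprocessing simplifies the official SQuAD recipe (for example, answers truncated out of the 384-token window are treated as unanswerable):

    from datasets import load_dataset
    from transformers import (AutoTokenizer, AutoModelForQuestionAnswering,
                              TrainingArguments, Trainer, default_data_collator)

    tok = AutoTokenizer.from_pretrained("bert-large-uncased")
    model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased")
    squad = load_dataset("squad_v2")

    def preprocess(examples):
        # Tokenize question/context pairs, padding to the 384-token budget.
        enc = tok(examples["question"], examples["context"],
                  truncation="only_second", max_length=384,
                  padding="max_length", return_offsets_mapping=True)
        starts, ends = [], []
        for i, offsets in enumerate(enc["offset_mapping"]):
            answers = examples["answers"][i]
            if not answers["text"]:          # unanswerable: point at [CLS]
                starts.append(0)
                ends.append(0)
                continue
            s_char = answers["answer_start"][0]
            e_char = s_char + len(answers["text"][0])
            seq_ids = enc.sequence_ids(i)
            tok_s = tok_e = 0                # stays 0 if truncated away
            for idx, (o_s, o_e) in enumerate(offsets):
                if seq_ids[idx] != 1:        # keep only context tokens
                    continue
                if o_s <= s_char < o_e:
                    tok_s = idx
                if o_s < e_char <= o_e:
                    tok_e = idx
            starts.append(tok_s)
            ends.append(tok_e)
        enc["start_positions"] = starts
        enc["end_positions"] = ends
        enc.pop("offset_mapping")
        return enc

    train = squad["train"].map(preprocess, batched=True,
                               remove_columns=squad["train"].column_names)

    args = TrainingArguments(
        output_dir="bert-large-squad2",
        learning_rate=3e-5,                  # settings from Section 3
        per_device_train_batch_size=24,
        num_train_epochs=3,
    )
    Trainer(model=model, args=args, train_dataset=train,
            data_collator=default_data_collator).train()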

We then train our model further on the product question answering dataset using transfer learning and fine-tuning [23]. Transfer learning is a machine learning technique in which a model developed for one task is used as the foundation for a model on another task; in other words, the knowledge obtained from solving one problem is applied to a related but not identical problem. This is particularly useful when the second task has less data available for training. We adopt this approach because our product QA data is not as rich as the SQuAD 2.0 dataset. We train the model on the ePQA dataset in the same way, but with a larger number of epochs, since this dataset is less rigorous than SQuAD 2.0.
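The transfer step itself then amounts to reloading the SQuAD-fine-tuned checkpoint and continuing training on the converted ePQA examples. A sketch under the same assumptions, reusing epqa_to_squad and preprocess from the earlier sketches (the epoch count shown is illustrative, not the exact value we used):

    from datasets import Dataset
    from transformers import (AutoModelForQuestionAnswering, TrainingArguments,
                              Trainer, default_data_collator)

    # Start from the checkpoint produced by the SQuAD 2.0 fine-tuning stage.
    model = AutoModelForQuestionAnswering.from_pretrained("bert-large-squad2")

    epqa_train = Dataset.from_list(epqa_to_squad("epqa_train.jsonl")).map(
        preprocess, batched=True,
        remove_columns=["question", "context", "answers"])

    args = TrainingArguments(
        output_dir="bert-large-squad2-epqa",
        learning_rate=3e-5,
        per_device_train_batch_size=24,
        num_train_epochs=6,   # illustrative: more epochs than the SQuAD stage
    )
    Trainer(model=model, args=args, train_dataset=epqa_train,
            data_collator=default_data_collator).train()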

Scraping module: We conducted a rigorous analysis of the legalities and constraints associated with web scraping for the data required in our project. With this understanding, we implemented web-scraping techniques that extract relevant information from prominent e-commerce platforms via DOM manipulation, leveraging JavaScript in our scraping module.
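The module itself is JavaScript running inside the extension; as a language-neutral illustration of the same idea, the following Python sketch pulls review text out of a saved product page with BeautifulSoup (the CSS selector is a hypothetical placeholder, since every platform names its review nodes differently):

    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    def extract_reviews(html: str) -> list[str]:
        """Collect review paragraphs from a saved product page.

        The selector below is a hypothetical placeholder; real pages differ.
        """
        soup = BeautifulSoup(html, "html.parser")
        return [node.get_text(" ", strip=True)
                for node in soup.select("div.review-text")]

    with open("product_page.html") as f:
        context = " ".join(extract_reviews(f.read()))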

Fig. 1. Proposed framework for the Chrome extension.

Fig. 2. Flow diagram of the model learning method.

Fig. 3. Ratio of answerable and unanswerable questions in the SQuAD 2.0 dataset.

Fig. 4. QA pair counts from various sources in the ePQA dataset (product attributes, the seller's description bullet points shown on the webpage, user reviews, and community question answers (CQA)).

Fig. 5. Question length distribution for the ePQA dataset.

4. Performance Evaluation

The model's predictions are evaluated using a variety of evaluation metrics. We leverage the validation set provided by [4] and their evaluation script to evaluate the predictions of our model on the SQuAD 2.0 dataset; similarly, for the ePQA dataset, we leverage the testing set with QA pairs provided by [22].

The evaluation builds on four elementary counts: True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). True positives are cases in which the real outcome is positive and the model likewise reports it as such. True negatives are cases that are correctly predicted as negative. False positives are cases that are predicted to be positive but are actually negative. False negatives are cases that are predicted to be negative but are actually positive. Precision and recall are then calculated as shown below.

$ \mathrm{Precision}=\frac{TP}{TP+FP}, \qquad \mathrm{Recall}=\frac{TP}{TP+FN}. $

F1-score: The F1 score is calculated as the harmonic mean of the precision and recall scores, as shown below.

$ F1\text{-}score= \frac{2\times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}. $
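For span answers, exact match and F1 are computed at the token level; a minimal sketch following the logic of the official SQuAD evaluation script (normalization is abbreviated here to lowercasing and whitespace splitting):

    from collections import Counter

    def exact_match(pred: str, gold: str) -> bool:
        return pred.lower().split() == gold.lower().split()

    def f1_score(pred: str, gold: str) -> float:
        p, g = pred.lower().split(), gold.lower().split()
        common = sum((Counter(p) & Counter(g)).values())  # overlapping tokens
        if common == 0:
            return 0.0
        precision, recall = common / len(p), common / len(g)
        return 2 * precision * recall / (precision + recall)

    print(f1_score("on a swivel base", "a 360-degree swivel base"))  # partial credit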

The Exact Match (EM) and F1 scores of our model are shown in Table 1.

Table 2 presents a comparative overview of performances with respect to other existing models.

Table 1. Our model performance.

Dataset       F1-Score   Exact Match
SQuAD 2.0     78.39      74.09
ePQA          72.31      68.98

Table 2. Comparison of performance on the SQuAD 2.0 dataset with other existing models.

Model            F1 Score   EM
SAN [24]         71.44      68.65
BiDAF [25]       71.58      68.02
SLQA             74.43      71.46
Hydra BERT       74.58      71.29
BERT-base [15]   77.81      74.07
AskAI            78.76      74.82

5. Future Directions

The field of Question Answering Systems (QAS) continues to evolve, and there remains substantial room for improvement in model performance. Researchers should explore avenues to enhance existing models, such as fine-tuning hyperparameters, experimenting with novel architectures, or incorporating additional pre-training data. By pushing the boundaries of model effectiveness, we can achieve more accurate and reliable answers. Some other directions that we find appropriate and that require attention are listed below.

1. Multilingual QA Systems

While English has been the primary focus of QAS research, there is a significant gap in deep QA systems for non-English languages. Most existing models lack high-quality word embeddings and balanced annotated datasets in languages other than English. Future work should address this limitation by developing robust multilingual QA systems. These models should be trained on diverse language-specific data and consider the unique linguistic characteristics of each language.

2. Resource-Scarce Languages

In addition to major languages, attention should be given to resource-scarce languages. These languages often lack comprehensive QA datasets and well-annotated corpora. Researchers can contribute by creating domain-specific datasets and adapting existing models to handle these languages effectively. Such efforts would democratize access to QA technology across linguistic diversity.

3. Contextual Web Scraping Techniques

Scraping relevant context from product webpages remains a challenge. Future research should focus on developing adaptable techniques that can swiftly extract context from various websites and adapt to different product categories. This involves addressing variations in webpage layouts, dynamic content, and domain-specific information.

In summary, the future of QAS lies in improving model performance, embracing multilingualism, addressing resource scarcity, refining scraping techniques, and maintaining ethical standards. By addressing these challenges, we can advance the field and create more effective and inclusive QA systems.

6. Conclusion

Even the most modern NLP methods struggle to answer questions about long and complex texts. This paper works toward that goal by proposing a solution for question answering in e-commerce. We demonstrated significant performance across different evaluation metrics and developed a real-world solution for users.

REFERENCES

[1] J. McAuley and A. Yang, ``Addressing complex and subjective product-related queries with customer reviews,'' arXiv preprint arXiv:1512.06863, Dec. 2015.
[2] M. Gupta, N. Kulkarni, R. Chanda, A. Rayasam, and Z. C. Lipton, ``AmazonQA: A review-based question answering task,'' arXiv preprint arXiv:1908.04364, Aug. 2019.
[3] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, ``SQuAD: 100,000+ questions for machine comprehension of text,'' arXiv preprint arXiv:1606.05250, Jun. 2016.
[4] P. Rajpurkar, R. Jia, and P. Liang, ``Know what you don't know: Unanswerable questions for SQuAD,'' arXiv preprint arXiv:1806.03822, Jun. 2018.
[5] M. Dunn, L. Sagun, M. Higgins, V. U. Guney, V. Cirik, and K. Cho, ``SearchQA: A new Q&A dataset augmented with context from a search engine,'' arXiv preprint arXiv:1704.05179, Apr. 2017.
[6] H. Xu, B. Liu, L. Shu, and P. S. Yu, ``BERT post-training for review reading comprehension and aspect-based sentiment analysis,'' arXiv preprint arXiv:1904.02232, Apr. 2019.
[7] M. Pontiki, D. Galanis, H. Papageorgiou et al., ``SemEval-2016 task 5: Aspect based sentiment analysis,'' Proc. of the 10th International Workshop on Semantic Evaluation (SemEval), pp. 19-30, Jan. 2016.
[8] M. Gupta, N. Kulkarni, R. Chanda, A. Rayasam, and Z. C. Lipton, ``AmazonQA: A review-based question answering task,'' Proc. of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI), pp. 4996-5002, Aug. 2019.
[9] R. He and J. McAuley, ``Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering,'' Proc. of the 25th International Conference on World Wide Web, pp. 507-517, Apr. 2016.
[10] J. Bjerva, N. Bhutani, B. Golshan, W.-C. Tan, and I. Augenstein, ``SubjQA: A dataset for subjectivity and review comprehension,'' Proc. of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 5480-5494, Jan. 2020.
[11] H. Wang, Y. Lu, and C. Zhai, ``Latent aspect rating analysis on review text data: A rating regression approach,'' Proc. of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 783-792, Jul. 2010.
[12] D. Yang, W. Zhang, Y. Qian, and W. Lam, ``Product question answering in E-commerce: A survey,'' arXiv preprint arXiv:2302.08092, Feb. 2023.
[13] W. Wang, N. Yang, F. Wei, B. Chang, and M. Zhou, ``Gated self-matching networks for reading comprehension and question answering,'' Proc. of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 189-198, Jan. 2017.
[14] J. Bjerva, N. Bhutani, B. Golshan, W.-C. Tan, and I. Augenstein, ``SubjQA: A dataset for subjectivity and review comprehension,'' Proc. of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 5480-5494, Jan. 2020.
[15] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, ``BERT: Pre-training of deep bidirectional transformers for language understanding,'' arXiv preprint arXiv:1810.04805, Oct. 2018.
[16] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le, ``XLNet: Generalized autoregressive pretraining for language understanding,'' arXiv preprint arXiv:1906.08237, Jun. 2019.
[17] J. Li, ``Fine-grained sentiment analysis with a fine-tuned BERT and an improved pre-training BERT,'' Proc. of the IEEE International Conference on Image Processing and Computer Applications (ICIPCA), pp. 1031-1034, 2023.
[18] W. Yu et al., ``A technical question answering system with transfer learning,'' Proc. of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 92-99, Jan. 2020.
[19] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, ``ALBERT: A lite BERT for self-supervised learning of language representations,'' arXiv preprint arXiv:1909.11942, Sep. 2019.
[20] C. Wu, L. Li, Z. Liu, and X. Zhang, ``Machine reading comprehension based on SpanBERT and dynamic convolutional attention,'' Proc. of the 4th International Conference on Advanced Information Science and Systems, pp. 1-5, Nov. 2022.
[21] X. Li, Z. Cheng, Z. Shen, H. Zhang, H. Meng, X. Xu, and G. Xiao, ``Building a question answering system for the manufacturing domain,'' IEEE Access, vol. 10, pp. 75816-75824, 2022.
[22] X. Shen, A. Asai, B. Byrne, and A. de Gispert, ``xPQA: Cross-lingual product question answering across 12 languages,'' arXiv preprint arXiv:2305.09249, May 2023.
[23] D. Dashenkov, K. Smelyakov, and O. Turuta, ``Methods of multilanguage question answering,'' Proc. of the IEEE 8th International Conference on Problems of Infocommunications, Science and Technology (PIC S&T), Oct. 2021.
[24] X. Liu, Y. Shen, K. Duh, and J. Gao, ``Stochastic answer networks for machine reading comprehension,'' arXiv preprint arXiv:1712.03556, Dec. 2017.
[25] W. Wang, M. Yan, and C. Wu, ``Multi-granularity hierarchical attention fusion networks for reading comprehension and question answering,'' Proc. of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1705-1714, Jan. 2018.

Author

Prashant Upadhyay

Prashant Upadhyay is an Assistant Professor in the Department of CSE, School of Engineering and Technology, Sharda University, Greater Noida, Uttar Pradesh, India. He received his Doctor of Philosophy in computer science and engineering from Gautam Buddha University. He has more than six years of teaching and research experience. He has published more than 16 research papers, as author or co-author, in internationally reputed journals indexed in Web of Science, Scopus, and SCIE, and in top-ranked conferences. His research findings have been published by Archives of Computational Methods in Engineering (Springer), Neural Computing and Applications (Springer), IGI Global, IEEE Xplore, and Taylor and Francis. He has served as a TPC member and reviewer for various international conferences. He has authored seven books with publishers including IGI Global, Bentham Science, and CRC Press, and has served as session chair at several international conferences, including ICDT-2024, ICDETGT-2023, and ICDT-2023. He has authored a book for Wiley Publication, Introduction to Python (ISBN: 978-93-5746-219-8), and has published four Indian patents. His research areas include computer vision and NLP.

Tuhina Panda

Tuhina Panda is an Assistant Professor and HOD in the Department of Computer Science and Engineering at Hi-Tech Institute of Technology, Khordha, Bhubaneswar, Odisha. She completed her B.Tech. degree in information technology from BPUT in 2010 and her M.Tech. degree in computer science and engineering from SOA University, Bhubaneswar, in 2013, and she is pursuing a Ph.D. degree in the Department of Computer Science and Engineering at SOA University, Bhubaneswar. She has more than 12 years of UG and PG teaching experience and has published one Indian and one international patent (granted). Her research areas include machine learning, artificial intelligence, image processing, cloud computing, and data mining.

Preeti Jaidka

Preeti Jaidka is an Assistant Professor in the Department of Electronics and Communication Engineering at the JSS Academy of Technical Education, Noida, where her research focuses on machine learning algorithms and their applications in agriculture and healthcare. Her ongoing research includes the development of convolutional neural networks (CNNs) and generative adversarial networks (GANs) for improving diagnostic accuracy and reducing false positives in cancer imaging. In addition to healthcare, Dr. Jaidka's deep learning models have been used to identify plant leaf diseases, aiming to enhance precision farming and crop yield sustainability. She has presented her research at major conferences, including the IEEE International Conference on Data Science. In addition to her research, she is actively involved in mentoring undergraduate students and collaborating with experts across the fields of AI, medicine, and agriculture.

Nidhi Gupta

Nidhi Gupta is an Associate Professor in the Department of Computer Science and Engineering, School of Engineering and Technology, Sharda University, Greater Noida. She has more than 17 years of academic experience. She received her Ph.D. degree in computer science engineering from Jaypee Institute of Information Technology, Noida, India. She has broad research interests in machine learning, data analysis, information retrieval, and databases. She has published more than 15 research papers in various SCI/WOS/Scopus and UGC-CARE journals.