SemEval-2021 Task 9 - Statement Verification & Evidence Finding with Tables

Taha Atahan Akyıldız

October 2020

Abstract

Information dirt is a severe problem that needs to be addressed. With the rise of social media, manually monitoring the validity of information has become infeasible, and fact verification is therefore heavily investigated in the literature. However, fact verification based on structured (table) data remains mostly unexplored. To the best of our knowledge, the first dataset for table-based fact verification was introduced only recently by (Chen et al. 2020). As with most downstream natural language processing tasks, fact verification also benefits from large-scale pre-trained language models. (Chen et al. 2020) showed how BERT can be leveraged for table-based fact verification, and (Eisenschlos, Krichene, and Müller 2020; Zhang et al. 2020) explored different ways to inject structural information into BERT, which substantially improved model performance. In this work, we judiciously analyze various sequencing strategies and find that related information must be close together in the input sequence. Furthermore, we propose a two-level approach for table-based fact verification, which paves the way for specialized models: one deciding whether the evidence contains sufficient information and one deciding whether the claim is entailed or refuted. It also makes it possible to generate more training data without disturbing the balance and the quality of the dataset.

Introduction

Tables are ubiquitous. In many domains, e.g., scientific, governmental, and business, tables are the most common way of representing data. Although tables are explicit and concise, they are still open to misinterpretation, especially when the analysis is done by people who are not competent in the respective domain. People tend to make rushed deductions instead of carefully analyzing the information that is displayed. While this is a convenient way of acquiring information in an ever-growing, information-heavy world, it usually results in misinterpretations and information dirt. With the global reach of social media, information dirt is a severe concern and is only becoming more prominent. Hence, the need for adequate fact verification services is of the utmost importance.

Fact verification aims to determine whether a claim is entailed or refuted given a piece of evidence. It is heavily investigated in the literature through various downstream tasks, e.g., textual entailment, natural language inference, and claim verification, on various datasets. For instance, the well-known FEVER dataset (Thorne et al. 2018) uses Wikipedia pages as evidence and modified sentences from the same data as claims.

Although fact verification over unstructured data is heavily investigated in the literature, fact verification over structured data is underrepresented. Structured data poses a new challenge to the fact verification problem: in addition to linguistic reasoning, symbolic reasoning is also vital for successful classification. To the best of our knowledge, the first work in this area was introduced by (Chen et al. 2020). Although it provides a good baseline, the performance of the model is far behind human performance.

In this paper, we investigate several approaches for statement verification over tables collected from scientific articles. We mainly focus on implementations based on large-scale pre-trained language models and show the effectiveness of different pre-trained models. Furthermore, we propose a two-level approach that first determines whether the evidence contains sufficient information for classification; if there is enough information, the claim is then classified as entailed or refuted.

Related Work

For structured-data-based fact verification, Jo et al. (Jo et al. 2019) presented the first end-to-end fact-checking tool, AggChecker. In AggChecker, claims are translated into database queries and validated against raw relational data. However, with the success of large-scale language models such as BERT (Devlin et al. 2019) on downstream tasks, and as shown by Soleimani et al. (Soleimani, Monz, and Worring 2019), large-scale language models are gaining popularity over classical methods.

Soleimani et al. demonstrated a BERT-based system for claim verification. The system feeds a potential evidence sentence and a claim to BERT and then applies a softmax layer to classify each claim. The system achieved state-of-the-art results on the FEVER dataset (Thorne et al. 2018).

Recently, Chen et al. (Chen et al. 2020) released a dataset for statement verification based on tables. The table data is collected from Wikipedia, and the annotations are generated with Amazon Mechanical Turk. The authors demonstrated a classical NLP approach called the Latent Program Algorithm as well as a modern approach that utilizes BERT for textual entailment. Although the former performs slightly worse, it is simpler and has a lot of room to grow. Zhang et al. (Zhang et al. 2020) demonstrated that injecting structural information into BERT improves its performance on claim verification. The authors argue that cells belonging to the same column are critical for verification; for this purpose, they change the mask of the self-attention layer in the higher layers of BERT to enable cross-row reasoning. Whereas Zhang et al. assume vertical information is critical, Herzig et al. (Herzig et al. 2020) propose TAPAS, a generalized approach to inject table structure information into the model, adding three positional token embeddings as input to BERT: row, column, and rank embeddings. Influenced by (Herzig et al. 2020), Eisenschlos et al. (Eisenschlos, Krichene, and Müller 2020) create a balanced dataset of millions of training examples and pre-train TAPAS on it before fine-tuning on TabFact, achieving the state of the art in table-based fact verification.

Large-scale pre-trained-model-based fact verification consists of two main parts: data-to-text generation for table sequencing, and textual entailment for claim verification. Data-to-text generation is another field that has seen growing interest in recent years. Wiseman et al. (Wiseman, Shieber, and Rush 2017) proposed three evaluation metrics: content selection, relation generation, and content ordering. Together with the BLEU score, these four metrics provide a solid foundation for evaluation. Recently, Rebuffel et al. (Rebuffel et al. 2019) achieved a new state of the art on all four metrics on the RotoWire dataset.¹ Their hierarchical model utilizes transformer encoders and leverages the entity-record relationship present in the table, which mitigates the information loss of a linearized approach. However, data-to-text generation requires a supervised approach and is highly domain-specific.

Proposed Approach

Current claim verification tasks are represented as a three-label classification problem. The model is expected to classify the claim as entailed, refuted, or unknown by making use of the evidence provided.

We propose a generalized approach for fact verification that can work with any large-scale pre-trained model. As shown in Figure 1, we adopt a two-level approach, which effectively transforms a three-class classification problem (refuted, entailed, unknown) into two binary classification problems. At each level, we employ a language model for sequence classification. The first model is responsible for distinguishing whether the table contains sufficient information to classify the claim as refuted or entailed; it is therefore the entry point of our high-level model. The second pre-trained model is activated only if there is sufficient information, and it decides whether the claim is refuted or entailed by the table.

Figure 1: The high-level view of the two-level model.

Our motivation for such an approach is two-fold. First, determining whether there is sufficient information in the table and determining whether the claim is refuted or entailed are inherently two different tasks; with a two-level approach, each model is specialized for the task at hand. Second, for table-based fact verification, entailed and refuted claims need to be collected manually, which is costly. Unknown claims, on the other hand, can easily be collected from the internet or copied from claims that belong to other tables in the dataset. With a single-level approach, keeping the dataset balanced across classes limits the number of unknown samples to the number of entailed or refuted samples. With a two-level approach, we relax this limitation and enable the data scientist to generate twice as many unknown samples. A sketch of the cascade is given below.
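As an illustration, the following is a minimal sketch of the two-level cascade, assuming two already fine-tuned binary sequence classifiers; the checkpoint paths, label indices, and the pre-built table sequence are hypothetical placeholders, not the exact models used in our experiments.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical checkpoint names; any binary sequence classifier can be plugged in.
SUFFICIENCY_CKPT = "path/to/level1-sufficiency-model"   # labels: 0 = unknown, 1 = sufficient (assumed)
ENTAILMENT_CKPT = "path/to/level2-entailment-model"     # labels: 0 = refuted, 1 = entailed (assumed)

tok1 = AutoTokenizer.from_pretrained(SUFFICIENCY_CKPT)
m1 = AutoModelForSequenceClassification.from_pretrained(SUFFICIENCY_CKPT)
tok2 = AutoTokenizer.from_pretrained(ENTAILMENT_CKPT)
m2 = AutoModelForSequenceClassification.from_pretrained(ENTAILMENT_CKPT)


def classify(table_sequence: str, claim: str) -> str:
    """Two-level cascade: first decide sufficiency, then entailment."""
    # Level 1: does the table contain enough information for this claim?
    enc = tok1(table_sequence, claim, truncation=True, return_tensors="pt")
    with torch.no_grad():
        sufficient = m1(**enc).logits.argmax(-1).item() == 1
    if not sufficient:
        return "unknown"
    # Level 2: given sufficient evidence, is the claim entailed or refuted?
    enc = tok2(table_sequence, claim, truncation=True, return_tensors="pt")
    with torch.no_grad():
        entailed = m2(**enc).logits.argmax(-1).item() == 1
    return "entailed" if entailed else "refuted"
```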

Experimental Setting

For statement verification and evidence finding with tables, we follow Task A as described in SemEval-2021 Task 9.² For training, we use a single Titan X Pascal GPU with 12GB of memory.

Methods Used For Evaluation

Table statement support is a classification problem in which the statement is labeled as entailed, refuted, or unknown. The models are evaluated with two different metrics. The first is the standard precision and recall of the multi-class classification task. The second is a simple (binary) classification metric, which does not penalize classifying unknown statements as entailed or refuted. The accuracy of the models is also reported. A sketch of both views is given below.
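The following sketch shows one possible implementation of the two evaluation views, assuming string labels; in the binary variant, statements whose gold label is unknown are simply dropped before scoring, which is our reading of "not penalizing" such predictions.

```python
from sklearn.metrics import precision_recall_fscore_support

LABELS_3WAY = ["entailed", "refuted", "unknown"]


def three_way_scores(gold, pred):
    """Standard multi-class precision/recall/F1 over entailed, refuted, unknown."""
    return precision_recall_fscore_support(
        gold, pred, labels=LABELS_3WAY, average="macro", zero_division=0
    )


def binary_scores(gold, pred):
    """Lenient variant: gold-unknown statements are removed, so predicting
    entailed/refuted for them is not counted as an error (our assumption)."""
    kept = [(g, p) for g, p in zip(gold, pred) if g != "unknown"]
    gold_kept, pred_kept = zip(*kept)
    return precision_recall_fscore_support(
        gold_kept, pred_kept, labels=["entailed", "refuted"],
        average="macro", zero_division=0
    )
```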

Data Set & Format

The data is collected from scientific articles using the Science Direct API. The tables contain domain-specific information, as shown in Figure 2, which makes the dataset challenging for claim verification. Annotations are collected by crowdsourcing and are also generated automatically. In total, there are 1033 tables, 6240 manually annotated statements (examples are shown in Figure 3), and 174948 auto-generated statements. The auto-generated statements are created by a random paraphraser combined with a table-understanding parser and are labeled as entailed or refuted. Unknown statements are collected from other tables. The classes are equally distributed throughout the dataset. Due to time constraints, we used only the manually annotated statements in our experiments. Lastly, we use a 90%-10% split for training and testing, respectively.

Figure 2: An example table.
Figure 3: Statements that belong to the table in Figure 2.

Algorithms Used For Evaluation

BERT

For BERT, we used the model described in (Chen et al. 2020). Four pre-trained versions of BERT are used: bert-base-uncased, bert-base-cased, bert-large-uncased, and bert-large-cased. The base models contain 12 layers, a hidden size of 768, 12 attention heads, and 110M parameters; the large models contain 24 layers, a hidden size of 1024, 16 attention heads, and 336M parameters. All models are pre-trained on English text from Wikipedia and BookCorpus. A batch size of 4 is used for the base models and a batch size of 2 for the large models. For all models, 500 warmup steps and a weight decay of 0.01 are used. The standard BERT tokenizer is used for data preprocessing. Each row is linearized by concatenating its cells; the rows are then concatenated, and the caption is prepended, resulting in the table sequence. This sequence is concatenated with the statement, with a [SEP] token in between and a [CLS] token at the beginning.
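The linearization described above can be sketched as follows, assuming the table is given as a caption plus a list of rows; the helper functions are illustrative, not the exact code of (Chen et al. 2020).

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")


def table_to_sequence(caption, rows):
    """Horizontal scan: concatenate the cells of each row, then the rows,
    with the caption prepended."""
    row_texts = [" ".join(str(cell) for cell in row) for row in rows]
    return " ".join([caption] + row_texts)


def encode(caption, rows, statement):
    # The tokenizer inserts [CLS] at the beginning and [SEP] between the
    # table sequence and the statement.
    table_seq = table_to_sequence(caption, rows)
    return tokenizer(table_seq, statement, truncation=True, return_tensors="pt")


# Toy example (hypothetical table and statement).
enc = encode(
    caption="Runtime (s) per method",
    rows=[["method", "runtime"], ["A", "1.2"], ["B", "0.8"]],
    statement="Method B is faster than method A.",
)
```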

TAPAS

The TAPAS model is built on top of bert-base and includes three additional positional embeddings for each token: column, row, and rank. The column and row embeddings represent the cell's position and are zero if the token belongs to the statement. The rank embedding defines a relationship between cell values: if the values are floats or dates, they are sorted, and the rank embedding of a cell becomes its position in the sorted order. As input, the table is flattened into a sequence of words with a horizontal scan over the table. The statement is concatenated to the beginning of the table sequence, with a [SEP] token separating the two segments. Lastly, a [CLS] token is inserted at the start of the sequence.

TAPAS-TabFact

This model is the same as the one described in Section 5.3.2, fine-tuned on the TabFact dataset. It achieves the state of the art on table-based fact verification.
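For reference, a minimal usage sketch with the HuggingFace TAPAS implementation is given below; the tokenizer derives the column, row, and rank token types from a pandas DataFrame internally. The checkpoint name is a publicly released TabFact-fine-tuned TAPAS model and is not necessarily the exact checkpoint used in our experiments.

```python
import pandas as pd
import torch
from transformers import TapasTokenizer, TapasForSequenceClassification

tokenizer = TapasTokenizer.from_pretrained("google/tapas-base-finetuned-tabfact")
model = TapasForSequenceClassification.from_pretrained("google/tapas-base-finetuned-tabfact")

# TAPAS expects the table as a DataFrame of strings; the tokenizer computes
# the row, column, and rank token-type ids from it.
table = pd.DataFrame({"method": ["A", "B"], "runtime": ["1.2", "0.8"]})
inputs = tokenizer(table=table,
                   queries=["Method B is faster than method A."],
                   padding="max_length", return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # binary decision: refuted vs. entailed
```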

2L-*

The details of the two-level algorithm are described in Section 4. We use three variants of this approach, each embodying a different pre-trained model: 2L-TAPAS-TabFact, 2L-TAPAS, and 2L-BERT.
For the TAPAS, TAPAS-TabFact, and 2L-* models, the Adam optimizer with a learning rate of 5e-5 is used.
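The training setup can be approximated with HuggingFace's Trainer as sketched below; the hyperparameter values mirror those reported above, while the model and dataset objects are placeholders supplied by the caller. Note that Trainer uses AdamW by default, which we take as a close stand-in for the Adam optimizer mentioned above.

```python
from transformers import TrainingArguments, Trainer


def make_trainer(model, train_ds, eval_ds):
    """Build a Trainer configured with the hyperparameters reported above."""
    args = TrainingArguments(
        output_dir="out",
        per_device_train_batch_size=4,  # 2 for the BERT-large models
        learning_rate=5e-5,             # AdamW by default in Trainer
        warmup_steps=500,
        weight_decay=0.01,
        num_train_epochs=9,
    )
    return Trainer(model=model, args=args,
                   train_dataset=train_ds, eval_dataset=eval_ds)
```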

Experiments & Discussion

The first key takeaway from Table 1 is that all the models tend to classify the statements as entailed. This is expected since, while creating the table sequence, we do not form full sentences. If the table contains only symbols, it is hard for BERT to interpret the symbols' meanings and their respective values. To tackle this problem, we plan to generate complete sentences for the table sequence.

We observe that the BERT-base models plateau around nine epochs and, as expected, score lower than BERT-large. However, there is still room for growth for the large models. Due to time limitations, we cannot report large-model results for more than 9 epochs.

We observe that, overall, the cased models perform better than the uncased ones, except for BERT-base-uncased with 3 and 6 epochs. Although these two runs scored the highest and second-highest results, the other runs with BERT-base-uncased are inferior. Since our data comes from scientific papers, cased characters are critical for representing symbols. We believe these two results are an anomaly that would be eliminated with a better table sequencing strategy.

Table 1: Results for the BERT models with different numbers of epochs. Unknown claims are not included in this experiment. Accuracy, F1, precision, and recall are given in percent.

Model              epochs  accuracy  F1     precision  recall
BERT-large-cased   3       61.86     76.44  61.86      100
                   6       62.08     76.61  62.08      100
                   9       63.63     77.78  63.63      100
                   3       58.98     71.05  62.62      82.10
                   6       60.86     75.64  60.82      100
                   9       61.97     76.52  61.97      100
                   15      57.21     67.06  64.53      69.80
                   3       65.96     74.57  70.53      79.09
                   6       65.74     79.33  65.74      100
                   9       59.53     70.45  66.51      68.22
                   3       62.08     76.61  62.08      100
                   6       62.53     76.94  62.53      100
                   9       62.53     76.94  62.53      100

We investigate different table sequencing strategies and report our results in Table [table:strat]. We compare the horizontal scan of the table described in Section 5.3.1 with a variant in which each cell value is prefixed with its column title. We also investigate adding the table caption and the legend to the sequence when they are provided. As expected, the table caption increases accuracy; however, the legend has a negative effect. When we analyzed the tables that have a legend, we observed that legends contain a lot of text; once the legend is added, the input exceeds BERT's recommended input length. Furthermore, we observe that attaching column titles to the cell values substantially improves performance. This indicates that BERT cannot learn the structural information on its own: related information must be close together for the model to excel. The findings of (Zhang et al. 2020; Eisenschlos, Krichene, and Müller 2020), where structural information is injected into the model via additional positional tokens (Herzig et al. 2020) or by changing the self-attention mask, also support this claim. The strategies are sketched below.
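The compared sequencing strategies can be summarized in the following sketch (value-only horizontal scan versus prefixing each cell with its column title, optionally prepending the caption and legend); the function names are ours, for illustration only.

```python
def horizontal_scan(rows):
    """Value-only horizontal scan: concatenate cell values row by row."""
    return " ".join(str(cell) for row in rows for cell in row)


def titled_scan(header, rows):
    """Prefix every cell value with its column title, which keeps related
    information close together in the flat sequence."""
    return " ".join(f"{header[j]} {cell}" for row in rows for j, cell in enumerate(row))


def build_sequence(header, rows, caption=None, legend=None):
    """Assemble the final table sequence from the chosen components."""
    parts = []
    if caption:
        parts.append(caption)   # improves accuracy in our experiments
    if legend:
        parts.append(legend)    # hurts: legends are long and exceed BERT's input limit
    parts.append(titled_scan(header, rows))
    return " ".join(parts)
```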

Table [table:models] shows the superior performance of TAPAS compared to BERT. This indicates that injecting structural information into the pre-trained model is crucial for table-based fact verification. Moreover, we observe, as emphasized by Liu et al. (Liu et al. 2019), that more data is always beneficial: when TAPAS is fine-tuned on the TabFact dataset, it scores approximately 3% higher accuracy. Although TAPAS-base slightly outperforms 2L-TAPAS-base, 2L-TAPAS-base saturates much sooner.

Conclusion & Future Work

Fact verification is an important problem in an age where information is a very powerful commodity. There is extensive research on fact verification over unstructured data, but fact verification over structured data is still a mostly unexplored area.

In this work, we analyze the effect of the size of the pre-trained model by using BERT-base and BERT-large. We also investigate different table sequencing strategies and find that attaching column titles to values and adding the caption to the sequence is beneficial. Lastly, we propose a generalized approach for fact verification that adopts a two-level model. The two-level approach transforms the three-class classification problem into two binary classification problems and allows the size of the dataset to be increased without disturbing its quality and balance.

We believe the two-level approach to the fact verification problem is promising, and we leave level-specific pre-trained model design as future work.

References

Chen, Wenhu, Hongmin Wang, Jianshu Chen, Yunkai Zhang, Hong Wang, Shiyang Li, Xiyou Zhou, and William Yang Wang. 2020. “TabFact: A Large-Scale Dataset for Table-Based Fact Verification.” http://arxiv.org/abs/1909.02164.
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” http://arxiv.org/abs/1810.04805.
Eisenschlos, Julian Martin, Syrine Krichene, and Thomas Müller. 2020. “Understanding Tables with Intermediate Pre-Training.” http://arxiv.org/abs/2010.00571.
Herzig, Jonathan, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno, and Julian Martin Eisenschlos. 2020. “Tapas: Weakly Supervised Table Parsing via Pre-Training.” In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (volume 1: Long Papers). Seattle, Washington, United States. https://www.aclweb.org/anthology/2020.acl-main.398/.
Jo, Saehan, Immanuel Trummer, Weicheng Yu, Xuezhi Wang, Cong Yu, Daniel Liu, and Niyati Mehta. 2019. “AggChecker: A Fact-Checking System for Text Summaries of Relational Data Sets.” Proc. VLDB Endow. 12 (12): 1938–41. https://doi.org/10.14778/3352063.3352104.
Liu, Yinhan, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. “RoBERTa: A Robustly Optimized BERT Pretraining Approach.” http://arxiv.org/abs/1907.11692.
Rebuffel, Clément, Laure Soulier, Geoffrey Scoutheeten, and Patrick Gallinari. 2019. “A Hierarchical Model for Data-to-Text Generation.” http://arxiv.org/abs/1912.10011.
Soleimani, Amir, Christof Monz, and Marcel Worring. 2019. “BERT for Evidence Retrieval and Claim Verification.” http://arxiv.org/abs/1910.02655.
Thorne, James, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. “FEVER: A Large-Scale Dataset for Fact Extraction and VERification.” http://arxiv.org/abs/1803.05355.
Wiseman, Sam, Stuart M. Shieber, and Alexander M. Rush. 2017. “Challenges in Data-to-Document Generation.” http://arxiv.org/abs/1707.08052.
Zhang, Hongzhi, Yingyao Wang, Sirui Wang, Xuezhi Cao, Fuzheng Zhang, and Zhongyuan Wang. 2020. “Table Fact Verification with Structure-Aware Transformer.” In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1624–29. Online: Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.126.

  1. https://github.com/harvardnlp/boxscore-data/↩︎

  2. https://sites.google.com/view/sem-tab-facts↩︎