Frustratingly Easy Performance Improvements for Low-resource Setups: A Tale on BERT and Segment Embe

29/09/2022

As the input representation for each sub-word, the original BERT architecture uses the sum of a sub-word embedding, a position embedding, and a segment embedding. Sub-word and position embeddings are well known and well studied; they encode lexical information and word position, respectively. Segment embeddings, in contrast, are less well known and have so far received no attention. Their key idea is to encode which of the two sentences (segments) a word belongs to; the intuition is to inform the model about the separation of sentences for the next sentence prediction pre-training task. However, little is known about whether the choice of segment affects downstream prediction performance. In this work, we try to fill this gap and empirically study the impact of altering the segment embedding at inference time for a variety of pre-trained embeddings and target tasks. We hypothesize that performance on single-sentence prediction tasks is not affected, in either monolingual or multilingual setups, whereas changing the segment IDs matters in paired-sentence tasks. To our surprise, this is not the case. Although no large differences are observed for classification tasks and monolingual BERT models, word-level multilingual prediction tasks in particular are heavily impacted. For low-resource syntactic tasks, both the choice of segment embedding and the choice of multilingual BERT model affect performance. We find that the default setting for the most used multilingual BERT model underperforms heavily, and that a simple swap of the segment embeddings yields an average absolute improvement of 2.5 LAS points for dependency parsing over 9 different treebanks.
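
To make the input representation concrete, the following is a minimal toy sketch of the sum described above and of what a segment swap does to it. It is not the authors' code: the table sizes, token IDs, and the helper names `make_table` and `embed` are hypothetical, the embedding values are random, and it assumes that the default assigns segment 0 to every sub-word of a single-sentence input.

```python
import random


def make_table(rows, dim, rng):
    """Toy embedding table: `rows` random vectors of size `dim` (illustrative only)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(rows)]


def embed(token_ids, segment_ids, tok_table, pos_table, seg_table):
    """BERT-style input representation: sub-word + position + segment embedding."""
    assert len(token_ids) == len(segment_ids)
    out = []
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        vec = [t + p + s
               for t, p, s in zip(tok_table[tok], pos_table[pos], seg_table[seg])]
        out.append(vec)
    return out


rng = random.Random(0)
DIM = 4                                  # hypothetical embedding size
tok_table = make_table(10, DIM, rng)     # hypothetical vocabulary of 10 sub-words
pos_table = make_table(8, DIM, rng)      # hypothetical maximum length of 8
seg_table = make_table(2, DIM, rng)      # two segments, as in BERT

token_ids = [2, 5, 7, 3]                 # made-up single-sentence input
default_ids = [0] * len(token_ids)       # assumed default: all sub-words in segment 0
swapped_ids = [1] * len(token_ids)       # the swap: all sub-words in segment 1

default_repr = embed(token_ids, default_ids, tok_table, pos_table, seg_table)
swapped_repr = embed(token_ids, swapped_ids, tok_table, pos_table, seg_table)

# Swapping the segment shifts every input vector by the same constant offset.
offset = [b - a for a, b in zip(seg_table[0], seg_table[1])]
```

In this toy setting the swap only adds the constant vector `seg_table[1] - seg_table[0]` to every position. How a pre-trained model reacts to that shift is the empirical question the abstract addresses. With a real model, the corresponding knob would be the segment IDs passed alongside the token IDs, for example the `token_type_ids` argument in Hugging Face Transformers.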
