Lexical normalization, the translation of non-canonical data to standard language, has been shown to improve the performance of many natural language processing tasks on social media. Yet the use of multiple languages in one utterance, also called code-switching (CS), is frequently overlooked by these normalization systems, despite its common occurrence in social media.
In this paper, we propose three normalization models specifically designed to handle code-switched data, which we evaluate on two language pairs: Indonesian-English (Id-En) and Turkish-German (Tr-De). For the latter, we introduce novel normalization layers and their corresponding language ID and POS tags for the dataset, and evaluate the downstream effect of normalization on POS tagging. Results show that our CS-tailored normalization models outperform the Id-En state of the art and Tr-De monolingual models, and lead to a 5.4% relative performance increase for POS tagging compared to unnormalized input.