text data - 2023 ARCADIA at Maryland

Question

Note to moderator: Read answerline carefully. This type of data is processed with the algorithm Snowball, which is an improved version of an algorithm developed by Martin Porter. This type of data may be organized into synsets, which can assist with the task of WSD that may be accomplished with the Lesk algorithm. This type of data is the focus of a tidyverse-based David Robinson and Julia Silge (“SILL-gee”) textbook that analyzes this type of data using tf-idf (“t-f i-d-f”). The Penn Treebank annotates this type of data with (*) POS tags. N-gram models may be trained on this type of data that has been processed by being stemmed or tokenized. This type of data may have its valence assessed in sentiment analysis carried out on the decontextualized forms of its corpora. For 10 points, NLP involves the “natural processing” of what type of data used to train LLMs and chatbots? ■END■

ANSWER: text data [accept natural language processing data or language data; accept words; accept WordNet; accept writing or handwriting or written text; accept strings; accept documents; accept dictionary or dictionaries; accept thesauruses or thesauri; accept text corpus or text corpora until “corpora” is read; accept tokens until “tokenized” is read; accept stems until “stemmed” is read; accept lemmas or lemmatization; accept descriptions of written or transcribed speech or language; accept Text Mining with R; prompt on topics by asking “of what?”; prompt on speech or parts of speech by asking “in what format?”; reject “voice” or descriptions of recorded noises]

<CH, Other Science: Computer Science>
