The Nordic Tweet Stream: A dynamic real-time monitor corpus of big and rich language data
Citation: Laitinen, Mikko; Lundberg, Jonas; Levin, Magnus; Martins, Rafael (2018). The Nordic Tweet Stream: A dynamic real-time monitor corpus of big and rich language data. 3rd Conference on Digital Humanities in the Nordic Countries, DHN 2018; Helsinki; Finland; 7 March 2018 through 9 March 2018, 2084, 349-362.
This article presents the Nordic Tweet Stream (NTS), a cross-disciplinary corpus project involving computer scientists and a group of sociolinguists interested in language variability and in the global spread of English. Our research integrates two types of empirical data: we rely not only on traditional structured corpus data but also on unstructured data sources that are often big and rich in metadata, such as Twitter streams. The NTS downloads tweets and associated metadata from Denmark, Finland, Iceland, Norway and Sweden. We first introduce some technical aspects of creating a dynamic real-time monitor corpus. A case study then illustrates how the corpus could be used as empirical evidence in sociolinguistic studies focusing on the global spread of English to multilingual settings. The results show that English is the most frequently used language in the material, accounting for almost a third of it. These results can be used to assess how widespread English use is in the Nordic region, and they offer a big data perspective that complements previous small-scale studies. Future objectives include annotating the material, making it available to the scholarly community, and expanding the geographic scope of the data stream beyond the Nordic region.