The Nordic tweet stream: A dynamic real-time monitor corpus of big and rich language data
Date
2018
Citation
Laitinen, Mikko; Lundberg, Jonas; Levin, Magnus; Martins, Rafael (2018). The Nordic tweet stream: A dynamic real-time monitor corpus of big and rich language data. 3rd Conference on Digital Humanities in the Nordic Countries, DHN 2018; Helsinki; Finland; 7 March 2018 through 9 March 2018, 2084, 349-362.
Abstract
This article presents the Nordic Tweet Stream (NTS), a cross-disciplinary corpus project that brings together computer scientists and a group of sociolinguists interested in language variability and in the global spread of English. Our research integrates two types of empirical data: we rely not only on traditional structured corpus data but also on unstructured data sources that are often big and rich in metadata, such as Twitter streams. The NTS downloads tweets and associated metadata from Denmark, Finland, Iceland, Norway and Sweden. We first introduce some technical aspects of creating a dynamic real-time monitor corpus. A case study then illustrates how the corpus could be used as empirical evidence in sociolinguistic studies focusing on the global spread of English to multilingual settings. The results show that English is the most frequently used language, accounting for almost a third of the material. These results can be used to assess how widespread English use is in the Nordic region, and they offer a big data perspective that complements previous small-scale studies. Future objectives include annotating the material, making it available to the scholarly community, and expanding the geographic scope of the data stream beyond the Nordic region.
Link to the original item
http://ceur-ws.org/Vol-2084/short10.pdf