Q&A

Commonly asked questions regarding the MMWAH corpus and its collection.

Below are answers to common questions about the MMWAH corpus. If you cannot find the answer to your question, please contact us directly at martti.makinen@hanken.fi or ines.frojdo@hanken.fi.

MMWAH stands for the Multilingual Multimodal WhatsApp corpus Hanken. It is a curated text collection of WhatsApp chats between Finland-Swedes aged 18 to 30.

The chats have been voluntarily donated to the corpus in connection with the language research project Instant Messaging in Multiple Languages: focus on WhatsApp in Finland-Swedish digital communication.

Corpora are collections of texts created for the purpose of research according to certain criteria, such as text type, language, or genre. They form the basis of the majority of modern linguistic research.

Corpora are also used in computer science for natural language processing (NLP), for example to train the large language models behind ChatGPT and similar applications.

Martti Mäkinen is the project manager and Ines Fröjdö is the research assistant. In the first period of the project, Leyla Shojaeifard also worked with us on the technical solutions and data management.

As speakers of a minority language, Finland-Swedes are able to use multiple languages to navigate Finland-Swedish society. In MMWAH, the linguistic skills of the speakers are reflected, for example, in switching between languages.

Digital communication on platforms such as WhatsApp combines features of written and spoken language use. In traditional written contexts, Finland-Swedes tend to follow the rules of standard Swedish, and the uniquely Finland-Swedish features are often lost. However, we often retain these features in our less formal everyday conversations. The multimodal tools used on communication platforms such as WhatsApp, e.g. emojis, audio messages or memes, distinguish this form of language from the other documented variants of Finland-Swedish language use.

MMWAH will make it possible to map naturally occurring mixing of Swedish, Finnish and English. The coexistence of languages and everyday code-switching between them is a topical issue in linguistics. The corpus will also capture changes in stylistic phenomena in digital language environments, such as the use of punctuation and emojis.

In short, we are creating material for research on Finland-Swedish identity that will be available to other researchers according to the principles of Open Science. In this way, linguistic changes and phenomena specific to digital communication among Finland-Swedes are captured.

  1. Donate a chat
    • Open the chat in the WhatsApp app
    • Click Settings > More > Export Chat > Include media
    • Email the material to mmwah@hanken.fi
  2. Consent to research participation through a form
  3. Answer a short questionnaire on your linguistic background (approx. 5 min.)
  4. Edit the donation (if need be)

The research team will anonymise the donated material once informed consent has been collected from each chat participant. The participants may remove data that they do not want to be included in the research from the donated material before the processing of the data has begun.

 

Data on the use of Swedish in Finland is needed to map and, above all, record the language as it is currently used. The language is changing rapidly and without research material it is not possible to study the changes or trends in the language. To create as complete a representation of the language as possible, we need to reach many different language users from different backgrounds.

Short answer: Yes!

Short and long WhatsApp chats are equally suitable. You do not need to worry about the content of the chats, since all language use is welcome in MMWAH. Linguistic research centres on how people express their thoughts and ideas, so the content itself can be anything at all. The conversations may be about completely ordinary things; it is precisely the simplest everyday talk that we want to capture.

The chat may contain multimodal elements such as images, videos or audio messages. These are anonymised just like the rest of the material.

Chats between friends, group chats, sports team chats and the like are all suitable for the MMWAH corpus. As long as we can contact the individual chat participants for consent, you can donate any chat you wish. The number of participants can therefore be anywhere from 2 to 20. It can be worth checking with the other chat participants before you send in your donation, as this increases the chance that the donation succeeds!

Although the aim is to capture Finland-Swedish language use, this does not mean that the chat has to be in Swedish. Language mixtures are just as linguistically valuable. Provided that the research team can carry out the anonymisation safely, all languages and language mixtures are welcome!

We primarily collect language data from people aged 18 to 30. It is not a problem if individual participants fall outside this age group. However, chat participants must be at least 15 years old to consent to taking part in the research.

The chat may contain multimodal elements such as images, videos or audio messages. These are anonymised and/or replaced with a code.

Users of the finalised corpus will not be able to identify the donors of the material in the corpus. The content will be pseudonymised (personal names replaced by code names) and anonymised (identifiable content deleted). Once the individual chats have been processed and anonymised, they will be aggregated into the corpus. Donors should feel confident that their data cannot be linked back to them.
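
For readers curious about what replacing personal names with code names can look like in practice, here is a minimal sketch in Python. It is not the research team's actual tooling: the function name `pseudonymise`, the code-name scheme (`P01`, `P02`, ...) and the sample messages are all assumptions made up for illustration, and real anonymisation also involves manual review of identifiable content.

```python
import re


def pseudonymise(messages, names):
    """Replace participant names with stable code names (P01, P02, ...).

    messages: list of (sender, text) tuples.
    names: participant names to replace wherever they occur.
    Hypothetical helper for illustration only.
    """
    # One code per name, assigned in the order the names are given.
    codes = {name: f"P{i:02d}" for i, name in enumerate(names, start=1)}
    if not codes:
        return list(messages)

    # Longest names first, so "Anna-Lena" is matched before "Anna".
    alternatives = sorted(codes, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in alternatives) + r")\b"
    )

    def swap(match):
        return codes[match.group(1)]

    # Replace the sender label and every mention inside the message text.
    return [
        (codes.get(sender, sender), pattern.sub(swap, text))
        for sender, text in messages
    ]


# Invented sample chat, not taken from the corpus.
chat = [
    ("Anna", "Hej Bo, ses vi imorgon?"),
    ("Bo", "Jo! Anna tar med kaffe."),
]
for sender, text in pseudonymise(chat, ["Anna", "Bo"]):
    print(f"{sender}: {text}")
```

The same code name is used for a person both as sender and when mentioned by others, so conversations remain readable without revealing who took part.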

The research team will collect consent and essential background information from each of the people participating in donated chats. The background data will enable the corpus to be filterable, allowing users to search for messages by, for example, age group, geographical area or the speaker's native language. Participants remain anonymous in the corpus even when carefully selected metadata is published with the corpus.
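
As a rough illustration of what filtering by background data could mean for a corpus user, the sketch below selects messages by metadata fields. The `Message` record, its field names and the sample values are hypothetical; the actual corpus format and search interface are not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    speaker: str          # pseudonymised code name, never a real name
    age_group: str        # hypothetical metadata field
    region: str           # hypothetical metadata field
    native_language: str  # hypothetical metadata field
    text: str


def filter_messages(messages, **criteria):
    """Return the messages whose metadata matches every given criterion."""
    return [
        m for m in messages
        if all(getattr(m, field) == value for field, value in criteria.items())
    ]


# Invented sample data for illustration.
corpus = [
    Message("P01", "18-24", "Uusimaa", "Swedish", "Ska vi ta en kaffe?"),
    Message("P02", "25-30", "Ostrobothnia", "Swedish", "Jo, passar bra!"),
    Message("P03", "18-24", "Uusimaa", "Finnish", "Joo, sounds good"),
]

hits = filter_messages(corpus, age_group="18-24", native_language="Swedish")
for m in hits:
    print(m.speaker, m.text)
```

Only the coarse metadata and the code name travel with each message, which is how filtering can work without exposing who the speaker is.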

It is possible to withdraw your consent to participate in the project. If you change your mind about participating, you can contact us and ask us to delete the material you donated or the messages that you wrote. The questionnaires and contact details we have collected from you will be deleted at the same time.

If you wish to withdraw your consent, please contact us by email.