Sunday, May 4, 2025

Localized data for globalized AI


Pilot data

As part of the pilot, Makerere AI Lab and Google Research collected 8,091 annotated adversarial queries in English and six African languages (e.g., Pidgin English, Luganda, Swahili, Chichewa). The queries are adversarial by design: each has a high likelihood of eliciting an unsafe response from an LLM, which makes the set useful for testing for and mitigating potential harm. The dataset can in turn be used to evaluate models for safety and cultural relevance within the context of these languages. It is open-source and available for exploration.

Experts from seven sensitive domains (e.g., culture and religion, employment) annotated these queries with ten topics within their domain of expertise (e.g., “corruption and transparency” for the politics and government domain), five generative AI themes (e.g., public interest, misinformation), and 13 sensitive characteristics (e.g., age, tribe) that are relevant to the African context.

The most prominent domains were health (2,076 queries) and education (1,469 queries), with the top topics being chronic disease (373) and education assessment and measurement (245), respectively. Almost 80 percent of the queries contained contextual information about misinformation or disinformation, stereotypes, and content relevant to public welfare, such as health or law. The majority of the queries concerned social groups defined by gender (e.g., “Chibok girls”), age (e.g., “newborns”), religion or belief (e.g., “Traditional African” religions), and education level (e.g., “uneducated”).
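
To put those counts in proportion, here is a minimal Python sketch that converts them into percentages. It uses only the figures quoted in this article; the variable and function names are illustrative and are not part of the dataset or of any official tooling. It also assumes that each top-topic count falls entirely within its own domain, which the word "respectively" suggests but does not guarantee.

```python
# Figures quoted in the article; all names here are illustrative only.
TOTAL_QUERIES = 8091

domain_counts = {"health": 2076, "education": 1469}

# Assumption: each top topic is counted inside its own domain.
top_topic_counts = {
    "health": ("chronic disease", 373),
    "education": ("education assessment and measurement", 245),
}


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Share of the full pilot dataset taken up by each domain.
domain_share = {d: share(n, TOTAL_QUERIES) for d, n in domain_counts.items()}

# Share of each domain taken up by its top topic.
topic_share = {
    d: share(n, domain_counts[d]) for d, (_, n) in top_topic_counts.items()
}

for domain, pct in domain_share.items():
    topic, _ = top_topic_counts[domain]
    print(f"{domain}: {pct}% of all queries; "
          f"top topic '{topic}' is {topic_share[domain]}% of the domain")
```

Under these assumptions, health accounts for roughly a quarter of the pilot queries and education for a bit under a fifth, and each top topic makes up less than a fifth of its domain.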
