Encoder-only foundation models for Azerbaijani
We are building BERT-class (encoder-only) foundation models for Azerbaijani. Once fine-tuned on task-specific data, such models serve as the backbone for a wide range of NLU tasks, such as text classification, named entity recognition, and question answering.
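As an illustration, a minimal fine-tuning sketch for one such downstream task (sentiment classification) using Hugging Face transformers could look as follows. The checkpoint name and the toy labeled examples are hypothetical placeholders, not released artifacts.

```python
# Minimal sketch: fine-tuning an encoder-only model for sentiment classification.
# "our-org/azerbaijani-bert-base" is a hypothetical checkpoint name.
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)
from datasets import Dataset

MODEL_NAME = "our-org/azerbaijani-bert-base"  # hypothetical placeholder

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Toy labeled examples; a real run would load a proper Azerbaijani NLU dataset.
data = Dataset.from_dict({
    "text": ["Bu film çox gözəl idi.", "Xidmət çox pis idi."],
    "label": [1, 0],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

data = data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1, per_device_train_batch_size=8),
    train_dataset=data,
)
trainer.train()
```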
Semantic embedding models for Azerbaijani
Pretraining a foundation model does not by itself yield a semantic embedding model: the pretrained encoder produces token-level representations rather than sentence vectors tuned for semantic similarity. Producing one typically requires further fine-tuning, usually contrastive training on sentence pairs, and this has proven quite tricky for Azerbaijani, partly because high-quality sentence-pair data for the language is scarce.
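One common recipe, sketched below with the sentence-transformers library, is to start from the pretrained encoder and fine-tune it contrastively on paraphrase-style pairs. The checkpoint name and the example pairs are hypothetical; this is an illustrative sketch, not our finalized training setup.

```python
# Minimal sketch: contrastive fine-tuning of a pretrained encoder into a
# sentence embedding model. The checkpoint name is a hypothetical placeholder.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("our-org/azerbaijani-bert-base")  # hypothetical placeholder

# Positive pairs (a sentence and its paraphrase); with
# MultipleNegativesRankingLoss, other pairs in the batch act as negatives.
train_examples = [
    InputExample(texts=["Hava bu gün çox istidir.", "Bu gün hava olduqca istidir."]),
    InputExample(texts=["Kitab masanın üstündədir.", "Masanın üstündə bir kitab var."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)

# After fine-tuning, sentences can be encoded into vectors for similarity search.
embeddings = model.encode(["Azərbaycan dili üçün semantik axtarış"])
```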
Web-scale text corpus for Azerbaijani
Existing Azerbaijani text corpora top out at the scale of hundreds of millions of words. We intend to push this to billions of words without sacrificing quality, which requires an automated multi-stage pipeline covering crawling, language identification, deduplication, and quality filtering.
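To give a flavor of what such a pipeline involves, here is a minimal sketch of two of its stages, heuristic quality filtering and exact deduplication, using only the Python standard library. The thresholds are illustrative assumptions, and a real pipeline would add language identification and near-duplicate detection on top.

```python
# Minimal sketch of two corpus-cleaning stages: heuristic quality filtering
# and exact deduplication. Thresholds are illustrative assumptions.
import hashlib
import re

def passes_quality_filters(doc: str) -> bool:
    """Cheap heuristics that discard obviously low-quality documents."""
    words = doc.split()
    if len(words) < 50:                      # too short to be useful
        return False
    if len(set(words)) / len(words) < 0.3:   # highly repetitive text
        return False
    letters = sum(ch.isalpha() for ch in doc)
    if letters / max(len(doc), 1) < 0.7:     # mostly symbols or markup residue
        return False
    return True

def dedup_key(doc: str) -> str:
    """Hash of a whitespace-normalized, lowercased document for exact dedup."""
    normalized = re.sub(r"\s+", " ", doc.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def clean_corpus(docs):
    """Yield documents that pass the filters and have not been seen before."""
    seen = set()
    for doc in docs:
        if not passes_quality_filters(doc):
            continue
        key = dedup_key(doc)
        if key in seen:
            continue
        seen.add(key)
        yield doc
```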