Encoder-only foundation models for Azerbaijani
We are building BERT-class (encoder-only) foundation models for Azerbaijani. Once fine-tuned on task-specific data, such models serve as the backbone for a wide range of NLU tasks, such as text classification, named entity recognition, and question answering.
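As an illustration, a minimal fine-tuning sketch for one such downstream task (sentiment classification) using Hugging Face transformers could look as follows. The checkpoint name and the toy labeled examples are hypothetical placeholders, not released artifacts.

```python
# Minimal sketch: fine-tuning an encoder-only model for sentiment classification.
# "our-org/azerbaijani-bert-base" is a hypothetical checkpoint name.
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)
from datasets import Dataset

MODEL_NAME = "our-org/azerbaijani-bert-base"  # hypothetical placeholder

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Toy labeled examples; a real run would load a proper Azerbaijani NLU dataset.
data = Dataset.from_dict({
    "text": ["Bu film çox gözəl idi.", "Xidmət çox pis idi."],
    "label": [1, 0],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

data = data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1, per_device_train_batch_size=8),
    train_dataset=data,
)
trainer.train()
```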
Semantic embedding models for Azerbaijani
Pretraining a foundation model does not by itself yield a semantic embedding model: the pretrained encoder produces token-level representations rather than sentence vectors tuned for semantic similarity. Producing one typically requires further fine-tuning, usually contrastive training on sentence pairs, and this has proven quite tricky for Azerbaijani, partly because high-quality sentence-pair data for the language is scarce.
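One common recipe, sketched below with the sentence-transformers library, is to start from the pretrained encoder and fine-tune it contrastively on paraphrase-style pairs. The checkpoint name and the example pairs are hypothetical; this is an illustrative sketch, not our finalized training setup.

```python
# Minimal sketch: contrastive fine-tuning of a pretrained encoder into a
# sentence embedding model. The checkpoint name is a hypothetical placeholder.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("our-org/azerbaijani-bert-base")  # hypothetical placeholder

# Positive pairs (a sentence and its paraphrase); with
# MultipleNegativesRankingLoss, other pairs in the batch act as negatives.
train_examples = [
    InputExample(texts=["Hava bu gün çox istidir.", "Bu gün hava olduqca istidir."]),
    InputExample(texts=["Kitab masanın üstündədir.", "Masanın üstündə bir kitab var."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)

# After fine-tuning, sentences can be encoded into vectors for similarity search.
embeddings = model.encode(["Azərbaycan dili üçün semantik axtarış"])
```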
Web-scale text corpus for Azerbaijani
Existing Azerbaijani text corpora top out at the scale of hundreds of millions of words. We intend to push this to billions of words without sacrificing quality, which requires an automated multi-stage pipeline covering crawling, language identification, deduplication, and quality filtering.
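To give a flavor of what such a pipeline involves, here is a minimal sketch of two of its stages, heuristic quality filtering and exact deduplication, using only the Python standard library. The thresholds are illustrative assumptions, and a real pipeline would add language identification and near-duplicate detection on top.

```python
# Minimal sketch of two corpus-cleaning stages: heuristic quality filtering
# and exact deduplication. Thresholds are illustrative assumptions.
import hashlib
import re

def passes_quality_filters(doc: str) -> bool:
    """Cheap heuristics that discard obviously low-quality documents."""
    words = doc.split()
    if len(words) < 50:                      # too short to be useful
        return False
    if len(set(words)) / len(words) < 0.3:   # highly repetitive text
        return False
    letters = sum(ch.isalpha() for ch in doc)
    if letters / max(len(doc), 1) < 0.7:     # mostly symbols or markup residue
        return False
    return True

def dedup_key(doc: str) -> str:
    """Hash of a whitespace-normalized, lowercased document for exact dedup."""
    normalized = re.sub(r"\s+", " ", doc.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def clean_corpus(docs):
    """Yield documents that pass the filters and have not been seen before."""
    seen = set()
    for doc in docs:
        if not passes_quality_filters(doc):
            continue
        key = dedup_key(doc)
        if key in seen:
            continue
        seen.add(key)
        yield doc
```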