Macedonian AI Data 2026: A Strategy for Digital Sovereignty

Suad Seferi
9 February 2026

Macedonian-language data available for AI training reached 3.53 billion words in January 2026. This figure sounds like a success, but the total volume is misleading. After quality filtering and deduplication, only 1.47 billion words of trainable text remain, about 42% of the raw total. We are building on a narrow foundation. Your AI performance depends on the quality and variety of these words.

The Current Inventory

The LVSTCK Macedonian Corpus provides the primary foundation for local development. This resource consolidates several years of international and local effort.

  • Total Raw Volume: 37.6 GB.
  • Cleaned and Unique Data: 16.78 GB.
  • Trainable Words: 1.47 billion.
  • Approximate Tokens: 1.9 billion.

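Cleaning and deduplication removed about 55% of the raw data by size (37.6 GB down to 16.78 GB). This high redundancy reflects heavy overlap between sources. The two largest web crawls, HPLT-v2 and FineWeb2, supply about 80% of the raw words, and the three web-crawl sources together supply about 94%. The 3.53 billion raw words break down by source as follows: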

  • HPLT-v2: 1.49 billion words (42.21%).
  • FineWeb2: 1.33 billion words (37.66%).
  • CLARIN MaCoCu-mk 2.0: 0.49 billion words (13.92%).
  • Institutional Documents: 0.14 billion words (4.07%).
  • Wikipedia: 0.07 billion words (1.96%).
  • SETimes News: 0.004 billion words (0.13%).
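
The shares and reduction rates above can be rechecked from the published figures. The following is a minimal sketch, not the corpus maintainers' own tooling. It assumes that HPLT-v2, FineWeb2, and MaCoCu-mk 2.0 all count as web crawls, and because the inputs are rounded, the computed shares differ slightly from the listed percentages.

```python
# Raw word counts per source, in billions, as listed in the breakdown.
raw_words = {
    "HPLT-v2": 1.49,
    "FineWeb2": 1.33,
    "CLARIN MaCoCu-mk 2.0": 0.49,
    "Institutional Documents": 0.14,
    "Wikipedia": 0.07,
    "SETimes News": 0.004,
}

# Assumption for this sketch: these three sources are treated as web crawls.
WEB_CRAWLS = {"HPLT-v2", "FineWeb2", "CLARIN MaCoCu-mk 2.0"}

total = sum(raw_words.values())
shares = {name: 100 * count / total for name, count in raw_words.items()}
web_share = sum(shares[name] for name in WEB_CRAWLS)

# Size reduction from raw data to cleaned and unique data, in GB.
raw_gb, clean_gb = 37.6, 16.78
removed_pct = 100 * (1 - clean_gb / raw_gb)

# Share of raw words that survive as trainable text, in billions of words.
raw_total_words, trainable_words = 3.53, 1.47
retained_pct = 100 * trainable_words / raw_total_words

for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
print(f"Web crawls combined: {web_share:.1f}%")
print(f"Removed by size: {removed_pct:.1f}%")
print(f"Words retained: {retained_pct:.1f}%")
```

The per-source word counts sum to roughly the 3.53 billion raw words, which indicates that the listed percentages describe the raw corpus rather than the cleaned one.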

The Scale Gap

Macedonian remains a low-resource language. We are working with a data pool 100 to 200 times smaller than English or German corpora. This scarcity creates a performance ceiling. Models trained on 1.47 billion words struggle with reasoning and depth compared to global baselines.

Comparison to Related Languages:

  • Macedonian: 34 GB per million speakers.
  • Bulgarian: 1.5–2.3 GB per million speakers.
  • Serbian: 2–2.5 GB per million speakers.
  • Slovenian: 0.5–1 GB per million speakers.

The total volume for neighboring Slavic languages often exceeds our curated totals. We must bridge this gap to ensure digital sovereignty.

The Quality Trap

Quantity does not equal intelligence. Over-reliance on web scraping introduces noise.

  • Low Human Agreement: Human annotators agree on the usefulness of Macedonian text in the OSCAR dataset only 48.5% of the time.
  • Script Imbalance: 99% of training data uses Cyrillic, yet users often write Macedonian in Latin transliteration in digital settings. Models handle this Latin-script input poorly.
  • Syntactic Issues: Reliance on machine-translated benchmarks creates unnatural phrasing. Models learn to produce Macedonian with English or Serbian sentence structure.
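
The script imbalance can be illustrated with a small transliteration sketch that turns Cyrillic text into a Latin rendering, producing paired examples in both scripts. The mapping below is an assumption: it uses ASCII digraphs (zh, ch, sh, gj, kj) of the kind seen in informal writing, and real users vary widely, so it is a starting point rather than a standard.

```python
# Hypothetical ASCII-digraph mapping for the 31 letters of the Macedonian alphabet.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ѓ": "gj", "е": "e",
    "ж": "zh", "з": "z", "ѕ": "dz", "и": "i", "ј": "j", "к": "k", "л": "l",
    "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r",
    "с": "s", "т": "t", "ќ": "kj", "у": "u", "ф": "f", "х": "h", "ц": "c",
    "ч": "ch", "џ": "dzh", "ш": "sh",
}


def to_latin(text: str) -> str:
    """Transliterate Macedonian Cyrillic to Latin, leaving other characters as-is."""
    out = []
    for ch in text:
        lower = ch.lower()
        if lower in CYR_TO_LAT:
            latin = CYR_TO_LAT[lower]
            # Capitalize the Latin rendering when the source letter was uppercase.
            out.append(latin.capitalize() if ch != lower else latin)
        else:
            out.append(ch)
    return "".join(out)


def make_pair(text: str) -> tuple[str, str]:
    """Return a (Cyrillic, Latin) pair for use as parallel training data."""
    return text, to_latin(text)


print(make_pair("Македонија"))
print(make_pair("шеќер"))
```

The reverse direction is harder, because digraphs such as "dz" are ambiguous between one letter and two, so Latin-to-Cyrillic conversion generally needs context and cannot rely on a lookup table alone.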

Accessing the Data

These resources are available through the following portals.

  • HuggingFace: Hosts the LVSTCK raw, cleaned, and deduplicated datasets.
  • CLARIN.SI: Provides the MaCoCu-mk 2.0 web corpus and annotated Wikipedia articles.
  • OPUS: Offers parallel corpora for translation tasks.
  • Leipzig Corpora Collection: Contains 433.4 million tokens from 2019 web crawls.

Missing Information

Critical gaps exist in the current data ecosystem. Your AI lacks access to professional and historical contexts.

  • Legal and Regulatory: Coverage remains limited to the government gazette. Court decisions and specific regulations are not digitized for AI use.
  • Academic and Scientific: No unified archive exists for university research or theses. Scientific terminology is scarce.
  • Historical and Literary: Texts from before 2000 are rare. Web crawls emphasize recent content from 2019 to 2024.
  • Audiovisual: Broadcast speech transcripts and oral histories are missing from training sets.

Methodology and Verification

Research into these datasets follows a multi-stage approach.

  • Search Strategy: Sources include specialized repositories like HuggingFace and CLARIN.SI. Academic papers from 2020 to 2025 provide context.
  • Selection Criteria: Datasets must be explicitly tagged as Macedonian and publicly accessible. Summaries without original documentation are excluded.
  • Size Estimation: Reported metrics take priority. Where token counts are not reported, they are estimated at 1.3 tokens per word.
  • Verification: Direct download links are checked against source repositories. Quality metrics come from peer-reviewed studies published in 2024.
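
The size estimation step can be reproduced with simple arithmetic. This sketch applies the 1.3 tokens-per-word conversion from the methodology to the 1.47 billion trainable words. The function name is illustrative, and an actual tokenizer would yield a different ratio depending on its vocabulary.

```python
TOKENS_PER_WORD = 1.3  # conversion factor stated in the methodology


def estimate_tokens(words: float, ratio: float = TOKENS_PER_WORD) -> float:
    """Estimate a token count from a word count using a fixed ratio."""
    return words * ratio


trainable_words = 1.47e9
tokens = estimate_tokens(trainable_words)
print(f"{tokens / 1e9:.2f} billion tokens")
```

The result, about 1.91 billion, matches the approximate figure of 1.9 billion tokens reported in the inventory.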

A Strategic Roadmap for North Macedonia

We must move from passive scraping to strategic curation. North Macedonia requires a structured plan to stay relevant in the age of generative AI.

Short-Term Priorities:

  • Digitize Government Records: Systematic access to laws and administrative records could add an estimated 0.5 to 1 billion words.
  • Consolidate News Archives: A unified archive of journalism provides high-quality formal language.
  • Audit Existing Data: We must verify the quality of multilingual datasets to remove contamination.
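
One inexpensive first pass for the audit is a script-level check that flags documents containing Cyrillic letters absent from the Macedonian alphabet, such as Serbian ђ and ћ or Bulgarian щ and ъ. This is a heuristic sketch only. The function names and thresholds are assumptions, and it cannot catch contamination from text that happens to use only shared letters.

```python
# The 31 letters of the Macedonian alphabet (lowercase).
MACEDONIAN_LETTERS = set("абвгдѓежзѕијклљмнњопрстќуфхцчџш")
# Cyrillic letters used in Serbian, Bulgarian, or Russian but not in Macedonian.
FOREIGN_CYRILLIC = set("ђћщъьюяйыэё")


def script_profile(text: str) -> dict[str, float]:
    """Return the share of letters that are Macedonian, foreign Cyrillic, or other."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return {"macedonian": 0.0, "foreign_cyrillic": 0.0, "other": 0.0}
    n = len(letters)
    mk = sum(ch in MACEDONIAN_LETTERS for ch in letters)
    foreign = sum(ch in FOREIGN_CYRILLIC for ch in letters)
    return {
        "macedonian": mk / n,
        "foreign_cyrillic": foreign / n,
        "other": (n - mk - foreign) / n,
    }


def needs_review(text: str, max_foreign: float = 0.01,
                 min_macedonian: float = 0.9) -> bool:
    """Flag text with too many foreign Cyrillic letters or too few Macedonian ones.

    The thresholds are illustrative assumptions, not tuned values.
    """
    profile = script_profile(text)
    return (profile["foreign_cyrillic"] > max_foreign
            or profile["macedonian"] < min_macedonian)


print(needs_review("Ова е македонски текст."))  # Macedonian sample
print(needs_review("Това е български текст."))  # Bulgarian sample, contains ъ
```

Latin-transliterated Macedonian would also be flagged by this check, so flagged documents are better routed to manual or model-based review than discarded outright.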

Medium-Term Goals:

  • Community Digitization: Partner with the Macedonian Academy of Sciences and Arts (MANU) and Ss. Cyril and Methodius University in Skopje (UKIM) to scan historical books and literary works.
  • Build Domain Corpora: Create specialized datasets for medicine, engineering, and finance.
  • Bridge the Script Gap: Develop parallel Cyrillic-Latin data to improve model robustness.

Long-Term Strategy:

  • Establish a National Corpus: North Macedonia needs a standardized and balanced national corpus. Neighbors like Bulgaria and Serbia already maintain such infrastructure.
  • Diaspora and Heritage Corpus: Document Macedonian language use in international communities.
  • Language Tracking: Build a temporal corpus to track grammatical and lexical changes over time.

Risks and Biases

Data scarcity leads to specific risks for your AI strategy.

  • Overfitting to Web Register: Models adopt informal syntax from web comments and struggle with formal writing.
  • Vocabulary Sparsity: Technical terminology is too sparse in the data to support professional tools.
  • Representation Bias: Web data skews toward urban and educated populations.
  • Geopolitical Bias: Web crawls reflect the editorial decisions of dominant news sources on sensitive topics.

Final Assessment

The foundation for Macedonian AI exists. The LVSTCK corpus is the main milestone so far. But words alone are insufficient. We must focus on curation and institutional partnerships. Digital sovereignty requires local ownership of our linguistic assets.
