BOS Component Categorization. NEAR Protocol, chatGPT fine-tuning

BOS Component Categorization. NEAR Protocol, chatGPT fine-tuning

Optimizing Component Categorization with AI: Enhancing Discoverability in NEAR Protocol

Project Overview: AI-Powered Component Categorization

At Pagoda Inc., we tackled a complex challenge: refining the categorization of over 15,000 components within the NEAR Protocol Blockchain Operating System (BOS). The goal was to enhance searchability, improve developer experience, and optimize recommender systems for open-source components.

Leveraging AI-driven categorization and entropy-based filtering, we developed a scalable solution that integrates machine learning, human-in-the-loop validation, and dimensionality reduction techniques to improve metadata tagging and component discoverability.


Methodology: AI, Entropy Reduction, and Human-AI Collaboration

1. Automated Labeling & Entropy Reduction

A fundamental challenge in recommender systems for open-source software is maintaining label accuracy across a constantly growing dataset. We applied entropy reduction techniques to remove ambiguous or redundant labels, ensuring that only the most high-signal metadata remained. This increased the precision of search filters and semantic clustering in the gallery.

2. AI and Human Collaboration in Label Optimization

Rather than relying purely on large language models (LLMs) like GPT-4, we designed a hybrid approach:

  • Sparse AI Labeling: GPT-generated predictions were filtered for accuracy, avoiding overly generic classifications.
  • Human-in-the-Loop Validation: Crowdsourced human labels were ranked based on semantic distinctiveness, ensuring only the most relevant terms were used in training data.
  • GPT-3.5 Fine-Tuning: A custom fine-tuned model was developed to refine and reinforce accurate label assignments, improving both precision and recall.

3. Model Evaluation & Quality Assurance

To validate our approach, we used:

  • Dimensionality Reduction: Techniques such as t-SNE and PCA allowed us to visualize category separability and optimize clusters.
  • Entropy Analysis: Evaluated the information gain of each label to ensure clarity and specificity.
  • Visual Clustering: Heatmaps and embeddings helped refine and debug our classification model.

Outcome: Enhanced Developer Experience & Component Discoverability

By integrating AI-driven taxonomy optimization, we built a developer-friendly component categorization system, now deployed on near.org/applications. The system significantly improves semantic search, making it easier for developers to find BOS components based on functional attributes rather than arbitrary labels.


Future Directions: Fine-Tuned Open-Source Models for Efficient Categorization

Today, smaller, fine-tuned open-source models provide a more efficient alternative. A Mistral-based model with an expanded context window would better capture long-tail dependencies in developer-contributed metadata, surpassing our original approach, which required a large output window. As open-source AI advances, these models will enable more dynamic, adaptive categorization frameworks in blockchain and decentralized ecosystems.