From Chaos to Signal

Categorizing Heterogeneous JavaScript Components in NEAR BOS

Entropy reduction, staged inference, and the limits of LLMs in open ecosystems


Abstract

This article describes a pragmatic attempt to categorize approximately 15,000 JavaScript components in the NEAR Blockchain Operating System (BOS).
The system faced extreme heterogeneity in code structure, metadata quality, and semantic intent. Naive LLM-based classification approaches failed early due to noise, incomplete context, and framework-specific abstractions.

We instead treated categorization as a progressive signal compression problem, applying staged entropy reduction, lossy structural normalization, weak LLM inference, and human-in-the-loop signal selection.
The outcome was not a perfect taxonomy, but a materially improved developer search and recommendation experience.

Although BOS has since been retired, the underlying lessons remain relevant for open, decentralized software ecosystems.


1. System Reality: Why BOS Components Were Hard to Categorize

At first glance, the problem appeared straightforward:
a large corpus of JavaScript components, all built for the same platform, all publicly accessible.

In practice, the corpus exhibited extreme variance along multiple axes:

  • Component size ranged from “Hello World” snippets (<50 LOC) to deeply nested UI systems with state management, lists, games, and application logic.
  • Semantic intent was rarely explicit. Many components acted as glue layers, RPC wrappers, or UI fragments whose purpose only emerged at runtime.
  • Metadata was noisy and inconsistent:
    • Tags such as devcon, denver, test, or event-specific labels were common.
    • Functional labels like game, wallet, or recommender were rare, overloaded, or missing entirely.
  • BOS-specific abstractions obscured behavior:
    • RPC calls to opaque endpoints
    • Indirect data access via URLs with no semantic signal
    • Non-runnable or incomplete code fragments

Static code analysis alone was insufficient.
Metadata alone was misleading.
And sending raw code to LLMs produced confident but unreliable results.


2. Why Naive LLM Approaches Fail

Early experiments followed the obvious path:
chunk components, send them to large language models, ask for labels.

This failed for predictable reasons:

  • Context fragmentation: meaningful behavior was often split across chunks.
  • Hallucinated intent: LLMs inferred what a component should do based on surface cues, not what it actually did.
  • Framework mismatch: BOS abstractions were largely out-of-distribution.
  • Token pressure: large components required aggressive truncation, losing the very signal needed for classification.

The core issue was not model capability.
It was semantic noise density.

This reframed the problem:
categorization was not a classification task, but a signal extraction and compression task under uncertainty.


3. Guiding Principle: Categorization as Progressive Compression

Instead of attempting end-to-end classification, we applied a staged, lossy refinement pipeline.

Each stage intentionally removed information to increase downstream signal quality.

The process functioned like a hand plane:
each pass shaved off noise, not to reach perfection, but to approach an acceptable and useful categorization.


4. Stage 1 — Entropy-Based Metadata Pruning

Tags were treated as weak priors, not ground truth.

We analyzed label distributions across the corpus and removed or down-weighted labels that exhibited:

  • extremely high global frequency (low discriminative power), or
  • high conditional entropy across components (semantic ambiguity)

This step deliberately sacrificed recall in favor of separability.

The objective was not correctness, but to make wrongness less diverse.

Reducing the candidate label space early proved more valuable than attempting to correct noisy labels later.
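As a concrete illustration, here is a minimal Python sketch of this pruning step. It assumes tags are available as a component-to-tag mapping, and it uses the entropy of each tag's co-occurrence profile as a rough stand-in for the conditional-entropy criterion; the thresholds and the proxy itself are illustrative, not the values or exact method used in production.

```python
# Sketch of Stage 1: entropy-based tag pruning (illustrative, not the original code).
import math
from collections import Counter, defaultdict

def prune_tags(component_tags, max_doc_freq=0.20, max_cooc_entropy=4.0):
    """Return the subset of tags kept as weak priors."""
    n_components = len(component_tags)

    # Global frequency: how many components carry each tag.
    doc_freq = Counter(tag for tags in component_tags.values() for tag in set(tags))

    # Co-occurrence profile per tag, used here as a rough proxy for
    # "conditional entropy across components": a tag that co-occurs
    # uniformly with everything carries little discriminative signal.
    cooc = defaultdict(Counter)
    for tags in component_tags.values():
        for tag in tags:
            for other in tags:
                if other != tag:
                    cooc[tag][other] += 1

    def entropy(counter):
        total = sum(counter.values())
        if total == 0:
            return 0.0
        return -sum((c / total) * math.log2(c / total) for c in counter.values())

    kept = set()
    for tag, df in doc_freq.items():
        if df / n_components > max_doc_freq:       # low discriminative power
            continue
        if entropy(cooc[tag]) > max_cooc_entropy:  # semantic ambiguity
            continue
        kept.add(tag)
    return kept
```

In practice this is where event-style tags like devcon or denver tend to drop out, while rarer functional tags survive as weak priors.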


5. Stage 2 — Structural Compression Before Inference

Before involving LLMs, components were transformed into lossy structural representations.

Rather than preserving full code, we extracted pattern-level signals such as:

  • presence of RPC calls
  • stateful vs. stateless behavior
  • UI rendering vs. data transport
  • list iteration, game-loop–like constructs
  • nesting depth and dependency structure

This step acknowledged several realities:

  • Much of the code was not runnable.
  • BOS abstractions obscured execution semantics.
  • Pattern recognition mattered more than syntactic completeness.

The result was a compressed representation that significantly reduced token size and variance while preserving coarse semantic cues.
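A minimal sketch of what such a lossy fingerprint could look like, in Python. The specific markers (Near.call, State.init, asyncFetch, and so on) are assumptions based on common BOS and React idioms, not the original detector set, and the heuristics are deliberately coarse.

```python
# Sketch of Stage 2: lossy structural fingerprinting of a component's source.
import re

def max_brace_depth(source: str) -> int:
    """Crude nesting estimate based on brace depth."""
    depth = best = 0
    for ch in source:
        if ch == "{":
            depth += 1
            best = max(best, depth)
        elif ch == "}":
            depth = max(depth - 1, 0)
    return best

def structural_fingerprint(source: str) -> dict:
    """Compress raw component code into coarse, pattern-level signals."""
    return {
        "rpc_call":       bool(re.search(r"\bNear\.(call|view)\s*\(", source)),
        "remote_fetch":   bool(re.search(r"\b(asyncFetch|fetch)\s*\(", source)),
        "stateful":       bool(re.search(r"\b(State\.init|State\.update|useState)\b", source)),
        "renders_ui":     bool(re.search(r"return\s*\(?\s*<", source)),
        "list_iteration": bool(re.search(r"\.map\s*\(|\.forEach\s*\(", source)),
        "game_loop_like": bool(re.search(r"\b(requestAnimationFrame|setInterval)\s*\(", source)),
        "max_nesting":    max_brace_depth(source),
        "loc":            source.count("\n") + 1,
    }
```

The fingerprint, not the raw source, is what moves downstream, which is what keeps token size and variance under control.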


6. Stage 3 — Weak LLM Inference Under Hard Constraints

Only after pruning and compression were components passed to LLMs.

Key constraints shaped this stage:

  • aggressive chunking due to token limits
  • incomplete semantic context per chunk
  • high risk of overgeneralization

Mitigations included:

  • constraining outputs to short, ranked label suggestions
  • penalizing generic labels
  • treating all LLM output as probabilistic hints, never as ground truth

This stage clearly demonstrated the limits of LLMs in this setting:
they amplified existing signal, but performed poorly as primary filters.
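The sketch below illustrates the shape of this constrained inference step. Here llm_complete stands in for whatever completion client was used; the prompt wording, the generic-label set, and the penalty weights are illustrative assumptions rather than the original configuration.

```python
# Sketch of Stage 3: constrained labeling from compressed representations.
GENERIC = {"app", "component", "widget", "tool", "misc"}  # assumed generic-label set

def suggest_labels(fingerprint: dict, candidate_labels: list[str],
                   llm_complete, top_k: int = 3) -> list[tuple[str, float]]:
    """Return short, ranked label hints; never treated as ground truth."""
    allowed = {label.lower() for label in candidate_labels}
    prompt = (
        "You are labeling a BOS component from a compressed structural summary.\n"
        f"Summary: {fingerprint}\n"
        f"Allowed labels: {', '.join(sorted(allowed))}\n"
        f"Return at most {top_k} labels, one per line, most likely first. "
        "Only use labels from the allowed list."
    )
    raw = llm_complete(prompt)

    hints = []
    for rank, line in enumerate(l.strip().lower() for l in raw.splitlines() if l.strip()):
        if line not in allowed:
            continue                      # discard anything outside the pruned label space
        score = 1.0 / (rank + 1)          # rank-based weight, not a calibrated probability
        if line in GENERIC:
            score *= 0.3                  # penalize generic labels
        hints.append((line, score))
    return hints[:top_k]
```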


7. Stage 4 — Human-in-the-Loop as Signal Selection

Human input was not used to “fix” LLM mistakes.

Instead, humans were asked to evaluate semantic usefulness:

  • Which labels meaningfully separated components?
  • Which labels collapsed distinct behaviors?
  • Which labels improved downstream search and recommendation?

This reframed human annotation as information gain optimization, not correctness enforcement.

Humans defined what signal meant in practice.
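One way to make "information gain" concrete during review is to measure how much a label reduces uncertainty about which cluster a component falls into. The sketch below assumes each component already carries a cluster id from the embedding stage; it is an illustrative proxy for the reviewers' judgment, not the original review tooling.

```python
# Sketch: quantifying how much a label separates components across clusters.
import math
from collections import Counter

def label_information_gain(label: str, components: list[dict]) -> float:
    """Entropy of cluster assignments minus entropy conditioned on the label."""
    def entropy(items):
        counts = Counter(items)
        total = sum(counts.values())
        if total == 0:
            return 0.0
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    clusters   = [c["cluster"] for c in components]
    with_label = [c["cluster"] for c in components if label in c["labels"]]
    without    = [c["cluster"] for c in components if label not in c["labels"]]

    p = len(with_label) / len(clusters)
    conditional = p * entropy(with_label) + (1 - p) * entropy(without)
    return entropy(clusters) - conditional   # higher = label separates clusters better
```

A score like this can rank candidate labels for reviewers, but the final call on what counts as useful signal stayed with humans.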


8. Stage 5 — Iterative Projection and Acceptance of Imperfection

After each refinement cycle:

  • components were embedded
  • the embeddings were projected with PCA and t-SNE
  • cluster separability was visually inspected
  • label entropy was re-evaluated

These projections were used for debugging and intuition-building, not as objective metrics.
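A minimal sketch of one such projection pass, assuming the embeddings are available as a NumPy array from whichever embedding model was in use; the PCA and t-SNE parameters below are typical defaults rather than the original settings.

```python
# Sketch of Stage 5: projection pass used for visual debugging, not as a metric.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def project_and_plot(embeddings: np.ndarray, labels: list[str]) -> np.ndarray:
    """Project component embeddings to 2D and color by current label."""
    reduced = PCA(n_components=50).fit_transform(embeddings)   # denoise before t-SNE
    coords = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(reduced)

    for label in sorted(set(labels)):
        mask = np.array([l == label for l in labels])
        plt.scatter(coords[mask, 0], coords[mask, 1], s=5, label=label)
    plt.legend(markerscale=3, fontsize=6)
    plt.title("Component embedding projection (debugging aid)")
    plt.show()
    return coords
```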

We explicitly accepted:

  • fuzzy category boundaries
  • partial misclassification
  • non-exhaustive labeling

as long as discoverability and recommendation quality improved.

Ontological purity was never the goal.


9. Outcomes

The resulting system improved:

  • semantic search precision
  • stability of component recommendations
  • overall developer discoverability on near.org/applications

It did not produce a perfect taxonomy.
It did not eliminate ambiguity.
It did not fully automate categorization.

What it did provide was a pragmatically useful structure in a fundamentally noisy, open ecosystem.


10. Retrospective: What We’d Do Differently Today

In hindsight, several conclusions stand out:

  • Large LLMs were overkill for the core problem.
  • The primary bottleneck was semantic noise, not reasoning depth.
  • Smaller, fine-tuned open-source models would likely outperform our original setup today.
  • The system itself mattered less than the lessons it revealed.

Most importantly:

In open, decentralized ecosystems, categorization is not a classification problem.
It is a compression problem under uncertainty.

Although BOS has since been retired, these lessons generalize to many real-world systems where structure, intent, and metadata drift continuously over time.


This article is a field report, not a product announcement. The system described was imperfect, temporary, and constrained by its context—but the insights remain durable.
