
On the idea behind Induction Labs

Constantin Seibold (Induction Labs)

Dimensionality reduction used to be something we did to “make plots” — a quick PCA or UMAP for a paper figure, or a t-SNE for a curious notebook. Over the last few years that mechanical step quietly turned into a workflow: a way to navigate, tag, clean, and ultimately decide what to trust in our data.

This post walks the thread I lived through: the h-NNE experiments that made clusters obvious, the Spacewalker demo that turned exploration into annotation at scale, and the product we’re building now at Induction Labs to make that flow robust, collaborative, and (yes) a little bit fun.


The spark — why we got annoyed with “pretty plots”

Early on I was firmly in the “application” camp: running experiments, integrating methods, and trying to make researchers’ lives easier. The problem that kept coming back was that most dimensionality reduction tools hide as much as they reveal when you don’t have ground-truth labels.

With t-SNE or UMAP you can get beautiful islands and splashes of color, but you rarely know where one semantic group ends and another starts. Worse, projecting new points into those visualizations is expensive or impractical — often you’re stuck rerunning the whole pipeline.

That practical friction is what pushed me toward h-NNE, first for the paper and then in follow-up experiments. The method wasn’t just faster; it came with properties that changed how we used an embedding.


What h-NNE unlocked in practice

In our experiments, h-NNE showed three properties that mattered immediately for real workflows (a minimal usage sketch follows the list):

  • Speed at scale. On ResNet50 embeddings of ImageNet, h-NNE produced an embedding in 5 min 11 s, while UMAP took 49 min 10 s and t-SNE took far longer. That difference alone moves a method from “offline experiment” to “interactive tool.”
  • Hierarchy and explicit cluster separation. By leveraging the FINCH clustering hierarchy, h-NNE makes cluster boundaries visible and explorable — even without labels.
  • Efficient projection of new points. New samples can be embedded directly into an existing visualization, which is essential for iterative exploration and annotation.
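
In code, the basic loop is short. Below is a minimal sketch, assuming the open-source hnne package and synthetic stand-in features; the fitted object also exposes the cluster hierarchy and new-point projection, but the exact attribute and method names should be checked against the hnne docs for your version.

```python
# Minimal h-NNE workflow sketch. Assumes the open-source `hnne` package
# (`pip install hnne`); the random data below stands in for real features.
import numpy as np
from hnne import HNNE

# Stand-in for real feature vectors, e.g. ResNet50 embeddings of your images.
data = np.random.random(size=(10_000, 256)).astype(np.float32)

# Fit a 2D embedding. h-NNE builds a FINCH-style cluster hierarchy internally
# and uses it for the layout, which is what keeps clusters visibly separated.
hnne = HNNE(dim=2)
projection = hnne.fit_transform(data)  # shape: (10_000, 2)

# The fitted object also carries the hierarchy it built and supports projecting
# new points into the existing map; see the hnne documentation for the exact
# attribute and method names in your installed version.
print(projection.shape)
```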

Placeholder: h-NNE hierarchical embedding showing clearly separated semantic clusters

Caption: Example h-NNE embedding with hierarchical cluster structure. Even without labels, semantic regions are clearly separated and navigable via the hierarchy.

One moment that stuck with me came from the initial experiments on the Google News dataset with roughly 3 million samples. We could traverse topic clusters top-down, zooming from broad themes into specific subtopics — and every cluster still retained strong semantic coherence. That’s when it became clear that this wasn’t just a faster embedding; it was a better interface to the data.


From embedding to interaction — Spacewalker

The next step felt natural: if the map is useful, let people act on it.

Spacewalker started as a research demo and group project, with Lukas Heine driving much of the implementation. The goal was simple but ambitious: combine embedding generation, visualization, querying, and annotation into a single interactive environment.

Spacewalker connected multimodal embeddings (images, text, video, full documents) to synchronized 2D and 3D views, text-to-image queries, region selection, and bulk annotation.

Placeholder: Spacewalker interface with 2D/3D embedding views and selection tools

Caption: Spacewalker UI showing interactive navigation of an embedding space, with selection and annotation tools directly integrated.

Two features made the biggest difference in practice (a rough sketch of both follows the list):

  • Multimodal text → image queries, which let users anchor exploration with natural language.
  • Bulk annotation via cluster or region selection, enabling entire semantic islands to be labeled at once.
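
Spacewalker’s internals aren’t reproduced here, but the core of both features is conceptually small. Below is a rough sketch, assuming image embeddings and the text query already live in a shared space (as with CLIP-style encoders) and that cluster labels exist; every array, threshold, and class id is an illustrative stand-in rather than Spacewalker’s actual code.

```python
# Rough sketch of a text -> image query plus cluster-level bulk annotation.
# Assumes image and text embeddings share one space (e.g. CLIP-style encoders);
# all arrays below are illustrative stand-ins, not Spacewalker's actual code.
import numpy as np

rng = np.random.default_rng(0)
image_emb = rng.normal(size=(10_000, 256))       # precomputed image embeddings
cluster_id = rng.integers(0, 200, size=10_000)   # cluster label per image (e.g. from FINCH)
query_emb = rng.normal(size=256)                 # embedding of a text query like "a photo of a cat"

# Cosine similarity between the text query and every image.
image_norm = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
query_norm = query_emb / np.linalg.norm(query_emb)
similarity = image_norm @ query_norm

# Text -> image query: jump to the best-matching region of the map.
top_hits = np.argsort(-similarity)[:20]

# Bulk annotation: label every point in the cluster containing the best hit.
target_cluster = cluster_id[top_hits[0]]
labels = np.full(len(image_emb), -1)             # -1 = unlabeled
labels[cluster_id == target_cluster] = 7         # hypothetical class id for "cat"

print(f"labeled {(labels == 7).sum()} points in one action")
```

In the interface these two steps are a typed query and a region selection rather than code, but the underlying operations are the same.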

To evaluate the impact, we ran a user study with 20 participants of mixed IT backgrounds. The result was striking: in a 10-minute window, users annotated more than 15,000 datapoints with Spacewalker, compared to roughly 100 datapoints using Label Studio — a speedup of over 100×.

Placeholder: Bar chart comparing annotation throughput between Spacewalker and Label Studio

Caption: Annotation throughput comparison. Spacewalker enables cluster-based labeling, leading to orders-of-magnitude speedups over traditional sample-by-sample tools.

One unexpected outcome was how people felt while using it. Exploration became playful — users described the process as game-like. That sense of engagement turned out to be tightly coupled with productivity.


Why this changes decision-making

Once embeddings are fast, projectable, and structured by clusters, the workflow changes fundamentally (a simplified example follows the list):

  • Dataset issues (corrupted files, mislabeled regions, duplicates) become visible as outliers or malformed clusters.
  • Large, semantically consistent regions can be labeled in bulk, saving time where perfect per-sample accuracy isn’t required.
  • Edge cases can be isolated and reviewed deliberately instead of being buried in noise.
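
As a simplified illustration of the first two points, cluster structure alone already gives you useful review queues: tiny clusters and points far from their cluster center are natural candidates for manual inspection. The thresholds and data below are illustrative assumptions, not values from our experiments.

```python
# Simplified sketch: use cluster statistics to surface likely data issues.
# The thresholds and synthetic arrays are illustrative, not tuned values.
import numpy as np

rng = np.random.default_rng(1)
emb = rng.normal(size=(20_000, 128))          # embeddings of the dataset
clusters = rng.integers(0, 300, size=20_000)  # cluster label per point

review_queue = []
for c in np.unique(clusters):
    members = np.where(clusters == c)[0]
    center = emb[members].mean(axis=0)
    spread = np.linalg.norm(emb[members] - center, axis=1)

    # Very small clusters often hide corrupted files, odd formats, or rare edge cases.
    if len(members) < 5:
        review_queue.extend(members.tolist())
        continue

    # Points far from their cluster center are candidates for mislabeled or noisy samples.
    review_queue.extend(members[spread > spread.mean() + 3 * spread.std()].tolist())

print(f"{len(review_queue)} points queued for manual review")
```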

At that point, dimensionality reduction stops being a visualization step and becomes a decision surface for data quality, curation, and downstream modeling.


From Spacewalker to Induction Labs

Spacewalker validated the idea. Induction Labs is the result of taking those lessons and building a system that works reliably at scale and across teams.

Key evolutions include:

  • Automated tagging. Instead of purely manual labeling, Induction Labs suggests tags using sophisticated clustering and chunking methods inspired by TW-FINCH (a toy sketch of the idea follows the list).
  • Dedicated cluster investigation. Clusters are first-class objects, not accidental artifacts of a plot.
  • Smoother UI and workflows. Exploration, annotation, and analysis are tightly integrated.
  • Collaboration and reproducibility. Projects can be shared across users globally, making exploration sessions reproducible and collaborative.
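
The product’s tagging relies on more sophisticated clustering and chunking than this, but the underlying idea of letting cluster structure propose labels fits in a few lines. Below is a toy sketch; the majority-vote rule, the 0.8 agreement threshold, and the tag names are assumptions for illustration, not Induction Labs’ actual logic.

```python
# Toy sketch of cluster-based tag suggestion: propagate a cluster's dominant tag
# to its untagged members when agreement is high. The majority-vote rule, the
# 0.8 threshold, and the tag names are illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(2)
clusters = rng.integers(0, 50, size=5_000)     # cluster label per point
tags = np.full(5_000, None, dtype=object)      # None = untagged

# Sparse manual tags (made cluster-consistent here so the demo has signal).
manual_idx = rng.choice(5_000, size=500, replace=False)
tags[manual_idx] = np.where(clusters[manual_idx] % 2 == 0, "invoice", "receipt")

suggestions = {}
for c in np.unique(clusters):
    members = np.where(clusters == c)[0]
    tagged = [tags[i] for i in members if tags[i] is not None]
    if not tagged:
        continue
    values, counts = np.unique(tagged, return_counts=True)
    best, support = values[np.argmax(counts)], counts.max() / len(tagged)
    if support >= 0.8:                         # only suggest on strong agreement
        for i in members:
            if tags[i] is None:
                suggestions[i] = best

print(f"suggested tags for {len(suggestions)} untagged points")
```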

Placeholder: Induction Labs interface showing clusters, tags, and collaborative features

Caption: Induction Labs project dashboard with automated tags, cluster views, and collaborative exploration.

The idea is simple: make working in latent space feel as natural as working in a shared document — but with the full semantic structure of your data exposed.