Ignacio Soubelet
August 8, 2025

How Kubling Started and Why It Exists

Before diving into the problems it set out to solve, I want to share some context on the journey that led me to spend countless hours building and releasing Kubling.

It all started back in 2017, when the tech industry was in the middle of a major shift: moving away from traditional virtual machines and embracing containers, with Kubernetes leading the way. Containers themselves weren’t new, but—as is often the case in tech—transitioning from a battle-tested, stable solution to something newer takes time. Change is rarely instant; it’s a slow, careful process.

I vividly remember endless conversations with infrastructure colleagues. Many agreed that Kubernetes had potential, but they also found it tricky to work with. And, as always, introducing a new layer to the stack meant more to monitor, more to troubleshoot, more to adopt. In short, more work.

Then came a turning point: cloud providers started offering managed Kubernetes services. Suddenly, you could run private clusters on shared infrastructure without having to manage master nodes yourself. That alone removed one of the biggest pain points for teams and opened the door for broader adoption.

Still, something felt missing. There was no simple, serverless-like way to quickly prototype and test container images—especially in scenarios where a deployment required multiple containers. In those cases, working locally wasn’t just inconvenient; it was often impractical.

That’s when I began experimenting with a private fork of the brilliant k3s. My vision was to create an ultra-simple platform for scheduling containerized workloads from a single, easy-to-write definition. And instead of complex billing systems, I imagined a prepaid “wallet” model where users topped up their balance and the platform deducted usage as they went—much like how OpenAI operates today.

Fast forward, the startup never took off for several reasons, primarily because building a true tech startup in Europe is an uphill battle. And by “true tech startup” I mean one built on original products and IP, not just wrappers around someone else’s API.

Still, looking back, that early exploration planted the seed that would eventually grow into Kubling.

Design Before Operate

I have always believed that investing in solid design (and even running simulations on it, similar to how unit and integration tests work in software) before actual operations begin is a golden, yet often forgotten, practice. The fail-fast approach has its place, and in my case it applied to the business model itself, but not to the operational side.

The design was built around a few key principles:

  • The data model should fully explain the entire business operation.
  • All entities in the model must be part of a connected structure (there should always be a path from any node to any other, even if it requires passing through multiple connections).
  • The model must be flexible enough to integrate with other systems’ models without having to conform to them. In other words, we define our own internal layer and map it to the other system’s data structures.
  • The model must operate in real time, avoiding unnecessary data movement whenever possible (this is about the operational layer, not the analytical one).
  • The model must be easy to query and modify using a well-known, widely accessible language, ensuring that both extracting insights and updating data are straightforward.

Let’s start with a concrete example. Imagine we have a set of physical resources: data centers, network switches, bare-metal servers (along with their network interfaces), virtual machines, and k3s nodes.

Based on these principles, we can model them as entities. For example:

    Entity: DataCenter
    - code: String
    - region: String
    - country: String
    - provider: String
    Relationships:
    - contains Switch
    - contains BareMetal
 
    Entity: Switch
    - model: String
    - ports: Integer
    - speed: String
    Relationships:
    - belongsTo DataCenter
 
    Entity: BareMetal
    - id: String
    - cpu: String
    - gpu: String
    - memory: String
    - storage: String
    Relationships:
    - belongsTo DataCenter
    - hosts VirtualMachine
 
    Entity: BareMetalIF
    - id: String
    - mac: String
    - speed: String
    - enabled: Boolean
    Relationships:
    - belongsTo BareMetal
    - connectedTo Switch
 
    Entity: VirtualMachine
    - id: String
    - cpu: Double
    - availableCpu: Double
    - memory: Double
    - availableMemory: Double
    - storage: Double
    - availableStorage: Double
    - os: String
    Relationships:
    - runsOn BareMetal
    - hosts K3sNode
 
    Entity: K3sNode
    - id: String
    - version: String
    - role: String
    - cpu: String
    - memory: String
    - storage: String
    Relationships:
    - runsOn VirtualMachine

From an operational perspective, this would be fantastic, since we could query our infrastructure without even accessing the physical resources. For example:

Retrieve all K3sNodes whose VirtualMachine has less than 25% available memory, and also retrieve their associated BareMetal and DataCenter

We could then link this to more abstract parts of the model related to clients, usage, billing, and so on.

Benefits

Although this approach might be seen as complex, the benefits once operations begin are countless. In particular, I was interested in the following:

  • Easy onboarding and training for people to operate the system without requiring deep technical knowledge of each component.
  • A common query language for the entire setup that speeds up internal communication and standardizes terminology, while creating an internal DSL from day one.
  • The ability to link daily customer and infrastructure operations with heterogeneous areas (such as system monitoring) in a single virtual, queryable environment.

Please note that this was years before OpenAI introduced an accessible LLM, which ultimately changed (and even reinforced) many of these ideas. More on that below.

Implementing It

So far, I have not mentioned any specific technology because, at that stage, I was not sure how to implement the idea in practice.
Elaborating on the decision-making process could easily take up an entire article, so I will jump straight to the conclusion: the query language I chose was SQL.

Some of the reasons were:

  • Most SQL-based engines support defining schemas at the level of data types and relationships we required.
  • SQL is expressive enough to handle more than 90% of the queries we needed.
  • SQL is easy to learn, even for non-technical people.
  • SQL can be generated from natural language using NLP (this was before the rise of LLMs).
  • It remains the de facto standard more than 50 years after its creation, and no technology seemed strong enough to replace it.
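
To make the earlier example concrete, the natural-language request about K3sNodes with low available memory maps to a single SQL statement. The sketch below is illustrative only: table and column names mirror the entity model above, and the join keys (vm_id, bare_metal_id, dc_code) are assumptions rather than an actual Kubling schema.

    -- Illustrative sketch; join keys are assumed, not part of any real schema
    SELECT k.id    AS k3s_node,
           vm.id   AS virtual_machine,
           bm.id   AS bare_metal,
           dc.code AS data_center
    FROM K3sNode k
      JOIN VirtualMachine vm ON vm.id = k.vm_id
      JOIN BareMetal bm      ON bm.id = vm.bare_metal_id
      JOIN DataCenter dc     ON dc.code = bm.dc_code
    WHERE vm.availableMemory / vm.memory < 0.25;

A statement like this stays readable for non-specialists, yet it is precise enough to drive troubleshooting and automation.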

Having decided on the query language, I still faced a fundamental challenge: it had to act as an aggregation layer rather than a single data repository (such as Postgres or Cassandra). This requirement significantly increased the complexity of the solution.

After thoroughly researching frameworks and products in different languages, I chose an older Red Hat project called Teiid.
Although the project was somewhat abandoned, its internal architecture was excellent. It offered well-decoupled components for the DQP (distributed query processor), AST processing, language traversal, adapters, buffers, and more, making it a perfect fit for both our current and future requirements.

To be completely honest, the work required on the latest stable version was enormous. It was like buying a house that appears to be in good condition, only to discover—once you start fixing minor issues—that the situation is worse than you initially thought.

Some of the most significant changes we had to make included:

  • Updating dependencies that were severely outdated and full of vulnerabilities. Upgrading to newer versions often meant dealing with completely different APIs, which forced us to refactor more than 30% of the original code.
  • Expanding data source definitions: the original method for defining data sources was limited, so we built a mechanism that uses separate configuration schemas with flexibility as a priority.
  • Custom connection managers: we implemented our own connection managers for each data source type, using observable connection pools.
  • Adding observability support, since the core was not designed for it. We built a wrapper around the engine to implement this feature without interfering with the most sensitive flows, but with a strong enough link to modify the engine’s behavior when needed.
  • Improving security, as the engine was not particularly designed with security in mind.
  • Modernizing caching: we updated the cache implementation to use an efficient admission policy known as TinyLFU.

Treat APIs as Data Sources

Once the base technology was decided and the core engine had been modernized and stabilized, a major problem became clear: adding one data source adapter for every system we wanted to interact with, and shipping it as part of the engine, simply did not scale. As more systems were incorporated into the stack, releasing a new version each time would have created unnecessary stress and made the whole process error-prone.

At first, I considered taking advantage of the Plugin/Addon mechanism. However, I discarded this idea because it conflicted with the way I wanted to distribute the engine (I explain this in the next section).

That left me with one option: enabling the engine to run arbitrary code. In other words, executing scripts that would act not as extensions, but as real connectors for data sources. The engine would follow the standard process used for any regular data source, but in the final step—when the protocol and data exchange with the target system occurs—the adapter would delegate that task to user-defined external scripts.

This is how the special data source called “Script Document” came to life. The name may sound unusual, but it reflects its purpose. The “Script” part indicates that the implementation is written in a scripting language, interpreted by the engine at runtime. The “Document” part refers to something worth explaining: the “Scripting” family of data sources is divided into two categories.

Fully Delegated

In this mode, the engine has no knowledge of the type of information the script data source is exchanging with the origin. It simply expects to receive tuples ready to be pushed as results.

Document

Since I expected that more than 75% of the cases requiring this type of data source would involve interacting with APIs—most of which exchange JSON documents—I created a special data source that handles the heavy lifting of JSON transformation. In this mode, the script only needs to return the raw JSON document.

By introducing the Script Document approach, we were able to treat APIs as first-class data sources without embedding their integrations directly into the engine or releasing a new version for each one. This design gave us the flexibility to connect to virtually any system, while keeping the core engine lean, stable, and focused on what it does best: processing and querying data in a consistent way.
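
For instance, once a ticketing API is wrapped by a Script Document data source, it behaves like any other table and can be joined against a relational source in the same query. The names below (tickets, customers, customer_id) are hypothetical and only meant to illustrate the idea:

    -- Hypothetical schema: "tickets" is backed by a Script Document data source
    -- wrapping a JSON API, while "customers" comes from a relational database;
    -- the engine federates the join transparently
    SELECT c.name, t.id AS ticket_id, t.status
    FROM tickets t
      JOIN customers c ON c.id = t.customer_id
    WHERE t.status = 'open';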

The Most Complex Challenge: Distributing a Small, Self-Contained Binary

I was completely sure about how I wanted to distribute the engine: as a single binary.
Running large and complex systems on the JVM is not a problem in itself. In fact—and this is something many critics tend to overlook—some of the most critical infrastructure in the industry runs on it, such as Kafka and many Scala-based systems.

However, my motivation was different and tied to my vision for Kubling: to create a federation. Since data federation is a big topic, I will cover it in the next article, but for context, the federation is based on the following principles:

  • It can consist of hundreds of instances.
  • Because they form a graph, if one instance goes down, parts of the graph (clusters) may become inaccessible.
  • Kubling instances must run as close as possible to the systems they connect to—even in environments where the data source is a physical appliance. This means we had to consider deploying Kubling instances on small, low-powered devices.

With these restrictions in mind, each instance needed to meet the following requirements:

  • Be fully self-contained in a single binary, with no external requirements such as the JVM.
  • Be as small as possible so it can be distributed as part of a lightweight, headless image.
  • Start up quickly.
  • Be volatile (stateless), so that it can be easily rescheduled to other nodes.

GraalVM to the Rescue

The only real alternative I had was GraalVM. However, when I began working on the first releases, it was still an immature technology with many limitations.
The journey from a multi-module engine to a single binary deserves an article of its own, which I will certainly publish soon.

In short, the process was extremely challenging, mainly due to the heavy use of reflection in the codebase and the lack of native build support for many dependencies.
Still, the effort was worth it. After nearly six months of stabilization, I was able to produce a fully functional Kubling binary that started 10 times faster and weighed in at only ~120MB when compressed.

Building such a complex Java application as a native binary is a one-way journey. Once you experience the benefits and performance improvements, you never want to go back to running it on the JVM. Looking back, this was one of the best technical decisions I ever made for Kubling.

The First Comprehensive Field Test: Managing Tens of Kubernetes Clusters

After all the work, it was finally time to test Kubling in a real environment.
Since the most critical component of the platform I mentioned earlier was k3s—and in our model it represented about 50% of the total complexity—the first test we performed was federating Kubernetes clusters and managing them via pure SQL.

The basic setup was:

  • One Kubling instance per Kubernetes cluster (each cluster had up to 50 agent nodes).
  • One Kubling instance aggregating cluster instances.
  • One parent Kubling instance managing the entire federation of Kubernetes clusters.

The goals we set were:

  • Access Kubernetes resources through their model.
  • Simulate various problems and investigate/troubleshoot using only Kubling instances and transactions.
  • Create, update, and delete Kubernetes workloads using only Kubling.
  • Test complex data joins by aggregating Kubernetes information with customer information.
  • Monitor Kubernetes instances exclusively with Kubling.

Graphically, the architecture looked like this:

Each Kubling instance named with a numeric suffix was a terminal instance, meaning it only had a k3s cluster as a data source.
Instances with the suffix “Ax” acted as aggregators per data center. Finally, the parent Kubling instance had visibility over the entire setup.
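
To give a flavor of the third and fourth goals, the parent instance could, for example, join Kubernetes state coming from a downstream aggregator with customer data held in a CRM database, or modify a workload directly. The schema and table names below (dc1_k8s, crm) are assumptions made for illustration, not Kubling's actual Kubernetes model:

    -- Illustrative sketch only; schema and table names are assumptions
    SELECT c.name AS customer,
           p.namespace,
           p.name AS pod,
           p.phase
    FROM dc1_k8s.PODS p
      JOIN crm.CUSTOMERS c ON c.namespace = p.namespace
    WHERE p.phase <> 'Running';

    -- Workloads could be modified through the same model, for example:
    UPDATE dc1_k8s.DEPLOYMENTS
    SET replicas = 3
    WHERE namespace = 'demo' AND name = 'api-server';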

To simulate troubleshooting, we used a complex RBAC configuration. In general, the users responsible for investigations had read-only access to aggregators.
Problems were detected using both historical data (time-series values) and real-time monitoring.

  • Historical data collection was configured through Scheduled Scripts inside the Script Data Source module (later, we added support for bundle-level scheduled scripts via scheduledScripts in bundle-info.yaml). In practice, these scripts ran on a regular basis and wrote values to a global TimescaleDB; a sketch of the kind of statement such a script might issue follows this list.
  • For real-time monitoring, we used dashboards in Superset (our preferred visualization platform), which ran direct queries against Kubling instances.
  • For complex problems, we relied on Jupyter notebooks, which provided the flexibility to run custom analyses while also serving as a way to preserve investigation and solution steps as reusable knowledge.
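
As mentioned above, the scheduled scripts boiled down to periodic statements that sampled the federated model and appended the values to TimescaleDB. The sketch below is only an assumption of what such a statement could look like (the tsdb schema and its column names are made up for illustration):

    -- Hypothetical periodic sample written by a scheduled script
    INSERT INTO tsdb.vm_memory_samples (ts, vm_id, available_memory)
    SELECT NOW(), vm.id, vm.availableMemory
    FROM VirtualMachine vm;

Both the scheduled collection and the dashboard queries ultimately translate into SELECT statements against downstream instances.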

However, this posed a real challenge. As we moved higher in the Kubling federation graph, each SELECT statement involving downstream Kubling instances could trigger a large workload. If executed too frequently, this risked putting the k3s API under heavy stress.

To better understand how Kubling (and a real data federation deployment) behaves under these conditions, we experimented with different topologies and reached some interesting conclusions. I will share those in another article.

Some of the concepts described here are illustrated in this example.

Conclusion

In this article, I tried to illustrate—briefly—what led me to create Kubling during my work on a startup.
Even though that startup never took off, as I mentioned earlier, I believe our industry still faces serious challenges in operations. This gave me the right motivation to keep working on what was originally conceived as an internal operations tool.

Running the operations of a tech-driven business is often seen by professionals as boring, problematic, and not very challenging. This perception has turned operations into a niche area that is attractive to neither investors nor entrepreneurs, which in turn slows down innovation across the industry.

On top of that, designing serious data models has become a neglected practice, often dismissed as “just a chapter in an old software book.” In reality, strong data models can save significant costs, prevent countless headaches, and provide both faster time-to-market and greater reliability.

That belief—that operations and thoughtful design truly matter—is what continues to drive the purpose of Kubling today.

Bonus: AI Made It More Relevant Than Ever

Adopting a tool like Kubling—and getting the most out of it—does require some effort.
This is mainly because it starts to shine when you invest in designing a solid model that truly reflects the entire operation.

I have been involved in Advanced Analytics and Data Science projects for more than a decade, but it was not until OpenAI demonstrated the power of a well-trained LLM—capable of retaining context and reasoning over it—that I began to believe designing effective operations could once again be possible and accessible to almost everyone.

An LLM can now do the heavy lifting: creating the entire model for your specific domain (including the systems you want to interact with), generating custom Script Data Sources for your APIs, designing the right federation topology, and deploying Kubling instances.
Most importantly, it can interact with your operational tasks—such as configuration management and troubleshooting—through natural language, ultimately translating requests into structured queries with no ambiguity.

This convergence between Kubling and modern AI means that what once required deep expertise and countless hours of manual effort can now be achieved faster, more reliably, and by a much broader audience. In many ways, AI has made the purpose of Kubling clearer and more relevant than ever.