Discussing 10 years of orchestration challenges with Erik Munson, founding engineer at Day AI and former HubSpot and Netflix engineer

Erik Munson is the founding engineer of Day AI, an AI-native CRM that uses customer conversations to automatically maintain an always-up-to-date view of your business.

This interview is an opportunity to reflect on Erik's wealth of experience solving orchestration challenges over a span of 10 years, from building HubSpot's first sales product to working on Netflix's Platform Engineering team, and now building Day AI.

You worked at HubSpot back in 2014, and 10 years later, you're back to building a CRM. How did you originally end up in the CRM space?

Back in 2014, I had the incredible luck of getting recruited to HubSpot at just the right time to end up on a new team that was being kept separate from the rest of the company — we called it a “startup within a startup”. HubSpot was about to IPO on the success of its marketing automation suite, and they wanted to plant the seeds of their next growth engine. We operated on a distinct tech stack, with a new business model, and for a long time didn't even use HubSpot branding. The early product footprint was a Chrome extension that integrated into your Gmail inbox, with an Outlook version shortly after. It tracked email opens and link clicks, notified users of that activity via push notifications, and surfaced useful information about your relationships when you were composing or reading email. Over time, that early minimal footprint grew into HubSpot's CRM and Sales Hub product lines which today make up a huge portion of the company's revenue and growth.

That experience was super formative for me, and gave me a strong sense for how important business and productivity software is for many people's day to day lives. The tools we built were a real, tangible part of helping people make a living and play their role in the economy. It's also where I met Christopher and Michael, the founders at Day AI.

Sales and marketing tools are known for being data-intensive products. What was it like building and operating the systems behind the scenes?

We handled a surprising volume of data — it turns out people send a ton of email! Each email contained a pixel for tracking opens, and potentially many wrapped link URLs for tracking link clicks. These often triggered work in our system many times per email, even in normal cases, but became especially high volume on long email chains or messages that were forwarded to large groups of people. Our goal was to provide real-time desktop notifications via the Chrome extension to email senders whenever an email was opened or a link was clicked, requiring us to quickly and accurately manage incoming data. We even did this for test emails we'd have new users send when signing up for the product, so timely, reliable processing was often the difference between new signups converting to daily users or churning.

Importantly, while we always wanted to notify quickly and reliably, we also had to address the noise in the data stream, deduplicating events for the same individual opener and filtering out or aggregating bursts of open and click events that users would find fatiguing and unnecessary. We also wanted to do our best to include some info about the email recipients on each notification, to make them more useful at a glance. This involved real-time processing to join various data sources, and pull in relevant contact information rather than simply displaying the subject line of the emails. All of these detailed user experience touches were in service of making the product nicer to use day to day, and more deeply integrated with the broader CRM product that ultimately took shape around it.

“

One of the big challenges that we all had to contend with all the time was supporting the sheer scale and volume of incoming data streams, while simultaneously doing all of the work on top that makes the user experience better — like deduplication and identification and routing events into all the right places — without shipping invalid or confusing output to downstream systems.

”

How did you end up solving these real-time processing constraints and scale challenges?

At the time Kafka was still emerging and wasn't the ubiquitous tool it is today. Some folks on the team had become familiar with it, and found a lot of success using it as a high-throughput queue with some event ingestion API endpoints in front to accept requests from tracking URLs. Events like email opens and clicks flowed through a collection of topics. We also had Redis on the side of the topics to store complementary state, mimicking the functionality many modern streaming systems have today. The state stored in Redis helped do things like manage deduplication via a Bloom filter — we recognized that complete determinism was not feasible given the volume of data. Despite the technical complexities, our goal was to create a seamless experience for users. When that part of the system was working reliably, it was invisible in the best way. But if it was laggy, or down, the whole product experience would grind to a halt, and we could even lose incoming data due to ingestion failures. Keeping availability high and latency low required significant, continuously available operational support, including a dedicated pager duty rotation. Outages could be very challenging to remediate. While Kafka recently transitioned away from ZooKeeper as its metadata store, and ZooKeeper has gotten more reliable over the years, at the time we often ran into trouble with it and had to scramble to get it back up.
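As a rough illustration of the Bloom filter deduplication mentioned above, here is a minimal in-memory sketch in TypeScript. It is not the code HubSpot ran, which the interview does not show. The class name, the seeded FNV-1a hash, the filter size, and the key format are all assumptions made for the example, and the system described kept this state in Redis rather than in process memory.

```typescript
// Minimal Bloom filter sketch for approximate event deduplication.
// Every name and parameter here is hypothetical, chosen for illustration.
class BloomFilter {
  private readonly bits: Uint8Array;
  private readonly sizeInBits: number;
  private readonly hashCount: number;

  constructor(sizeInBits: number, hashCount: number) {
    this.sizeInBits = sizeInBits;
    this.hashCount = hashCount;
    this.bits = new Uint8Array(Math.ceil(sizeInBits / 8));
  }

  // 32-bit FNV-1a with a seed mixed into the offset basis, so a single
  // function can produce several different bit positions per value.
  private hash(value: string, seed: number): number {
    let h = (0x811c9dc5 ^ seed) >>> 0;
    for (let i = 0; i < value.length; i++) {
      h ^= value.charCodeAt(i);
      h = Math.imul(h, 0x01000193) >>> 0;
    }
    return h % this.sizeInBits;
  }

  // Returns true if the value was possibly seen before (false positives can
  // happen), false if it is definitely new. Either way, the value is recorded.
  checkAndAdd(value: string): boolean {
    let possiblySeen = true;
    for (let seed = 0; seed < this.hashCount; seed++) {
      const index = this.hash(value, seed);
      const byteIndex = index >> 3;
      const mask = 1 << (index & 7);
      const current = this.bits[byteIndex] ?? 0;
      if ((current & mask) === 0) {
        possiblySeen = false;
        this.bits[byteIndex] = current | mask;
      }
    }
    return possiblySeen;
  }
}

// Usage: key each open event by email and recipient (a hypothetical format).
const seenOpens = new BloomFilter(1024, 3);
const openKey = "email-123:recipient-a";
const firstSighting = seenOpens.checkAndAdd(openKey);  // false: new, so notify
const secondSighting = seenOpens.checkAndAdd(openKey); // true: probable duplicate, so suppress
```

The trade-off matches the point about determinism: the filter can occasionally report a new event as already seen, but it never misses a true duplicate, and its memory use stays fixed no matter how many events flow through.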

“

It was extremely challenging, right? Really, really hard stuff. And that was the part of the system with the most difficult to remediate outages. It was the part that was always the most time consuming to change. It was the part that needed the most operational backing behind it from the team. Again, this is in the 2014 timeframe.

”

The whole team put a lot of work and discipline into making data processing resilient, low latency, and consistently available. It was never easy.

Did you feel like the challenges you were working through were unique to your team?

Building the early HubSpot sales products was certainly intense, but I came to realize it was not just our team or product fighting hard to keep these kinds of systems running. Building features that required data processing often followed a familiar pattern across the entire organization. Each new feature necessitated forming a dedicated team responsible for managing components like Kafka, which created versions of the same experience across product teams. Later in my tenure at HubSpot, I transitioned to working on infrastructure and SDKs that enabled other product teams to integrate their features into email clients. Working with that enablement focus exposed me to what more groups around the company were doing, and I was struck by how common the pain around event processing and orchestration was.

When I moved to Netflix, it felt like a natural progression to continue exploring what it was like to work on developer-facing tools. There, I contributed to Spinnaker, a widely adopted CI/CD and cloud infrastructure management tool open sourced by Netflix and Google. Spinnaker is used by engineers across Netflix to deploy and manage services, and as an open source project it's also used to power a long list of other large-scale products many of us use every day. It's a perfect example of how complex work orchestration in the cloud can be.

Seeing yet another example of this pattern reinforced my belief that while the specifics of each challenge may differ, the underlying complexities of building and maintaining robust orchestration systems are fundamental, and translate across companies and industries.

What did you learn from those Platform Engineering years at Netflix?

I gained a lot of respect for the complexities — the depth — of building orchestration systems. I worked with a homegrown system that was configured using a JSON-based DAG (Directed Acyclic Graph) for CI/CD workflows. This setup, while powerful, relied on a Java service atop Redis that managed work execution, and took a lot of manual management and operational overhead. I quickly learned that reliably executing DAG-style workflows based on events is a complex challenge, comparable to database transaction management issues like isolation and concurrency. We consumed event triggers from Git, build completions, pipeline completions, and a variety of other kinds of events via SNS topics and SQS queues with a dedicated service to manage inflow from external systems and outflow to the homegrown execution service. Our migration to Temporal was a significant turning point. Temporal's architecture allowed us to offload much of the operational complexity associated with state management and task orchestration, which eliminated a significant portion of the work in our custom orchestrators. However, a lot of the fundamental cognitive and operational challenges remained the same. It was a big help, but didn't redefine the problem or solution space at a deeper level.
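To make the idea of a JSON-configured DAG concrete, here is a small TypeScript sketch that computes a valid execution order for a set of stages using Kahn's topological sort. The `Stage` shape, the stage names, and the `executionOrder` function are hypothetical and are not taken from Spinnaker or from the homegrown Netflix system. A real executor also has to handle retries, persistence, and concurrent runs, which is where most of the difficulty described above lives.

```typescript
// Hypothetical stage shape: an id plus the ids of stages it depends on.
interface Stage {
  id: string;
  dependsOn: string[];
}

// Kahn's algorithm: repeatedly run stages whose dependencies are all done.
// Throws if a dependency is unknown or if the graph contains a cycle.
function executionOrder(stages: Stage[]): string[] {
  const unmet = new Map<string, number>();        // stage id -> unmet dependency count
  const dependents = new Map<string, string[]>(); // stage id -> stages waiting on it

  for (const stage of stages) {
    unmet.set(stage.id, stage.dependsOn.length);
    dependents.set(stage.id, []);
  }
  for (const stage of stages) {
    for (const dep of stage.dependsOn) {
      const waiting = dependents.get(dep);
      if (waiting === undefined) {
        throw new Error(`Unknown dependency: ${dep}`);
      }
      waiting.push(stage.id);
    }
  }

  const ready = stages.filter((s) => s.dependsOn.length === 0).map((s) => s.id);
  const order: string[] = [];
  while (ready.length > 0) {
    const id = ready.shift();
    if (id === undefined) break;
    order.push(id);
    for (const next of dependents.get(id) ?? []) {
      const count = (unmet.get(next) ?? 0) - 1;
      unmet.set(next, count);
      if (count === 0) ready.push(next);
    }
  }

  if (order.length !== stages.length) {
    throw new Error("Cycle detected: not a valid DAG");
  }
  return order;
}

// Usage with a hypothetical delivery pipeline, as it might appear in JSON.
const pipeline: Stage[] = [
  { id: "build", dependsOn: [] },
  { id: "test", dependsOn: ["build"] },
  { id: "bake", dependsOn: ["build"] },
  { id: "deploy", dependsOn: ["test", "bake"] },
];
const pipelineOrder = executionOrder(pipeline); // build, test, bake, deploy
```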

Overall, my experience at Netflix highlighted that the hardest part of work orchestration was cognitive, and that in some ways, even with modern tools, we had plateaued in how much more productive and efficient we could be at building these kinds of systems. There's no question that both Kafka and Temporal would go on my “Mount Rushmore” of orchestration technology: they both introduced huge productivity gains for teams doing orchestration and real-time processing. But just as the wave of NoSQL databases in the 2000s, with minimal or nonexistent concurrency control abstractions, led to a major productivity drag on teams who had to model their storage at a very low level, I felt convinced that we had not yet gotten to a high enough level of abstraction for work execution to truly offload that part of the cognitive challenge to an infrastructure component.

“

I got even more of an appreciation from that of just how hard it is to build systems that execute work reliably, in some useful order. Partway into my tenure there it started really dawning on me that this is one of the hardest things to do in software. Having seen it in two different flavors, I had started building this deep appreciation and respect for the difficulty of it, right up there with database transaction isolation and durability. When I think about some of the hardest things out there, there's database concurrency control, and then there's workflow orchestration.

”

In 2023, you joined your former HubSpot colleagues to launch Day AI. There, you've faced familiar challenges but tackled them with new perspectives. How did you choose your technical stack?

When I joined Day AI, I immediately recognized that we were facing data stream processing challenges at an unprecedented scale. Our goal was to connect to and process real-time streams of data from every platform where you might talk to your customers: every Gmail message, every Google Calendar event, every meeting recording, every Slack channel and thread, for every person at your company. But we didn't just want to consume and store that huge volume of input data — we wanted to feed it all to LLMs to reason through it, transform it into structured, clean CRM data, and then fan that output data out to other LLM-powered workloads downstream.

Given my previous experience, I was initially overwhelmed by the complexity of the task. I understood that we might need to expand our team purely to manage the scale of the orchestration work ahead. As I began brainstorming the technical stack, I considered using Temporal, but quickly realized it wouldn't align with our needs. First, we were focused on leveraging TypeScript and running on AWS Lambda to simplify operations early on, especially with a small team. I didn't want to manage container clusters or spin up other complicated infrastructure, as that would almost immediately trigger the need to grow the team. Second, I felt strongly that to tackle the scale of orchestration complexity we had ahead of us, we needed to find a way to write workflow code at a much higher level of abstraction than what I had seen and done in the past. In my search for alternatives, I discovered Inngest, which stood out because it allowed me to utilize Lambda without the need for extensive infrastructure setup. The built-in event-driven architecture and flow control features were particularly appealing. Inngest enabled us to fire events that could translate into multi-step workflows without the complexity of running SQS, SNS, or Redis on the side to manage state. Its concurrency control was impressive and felt like one of the most important cognitive unlocks to make managing immense volumes of streaming data an order of magnitude simpler.
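For readers unfamiliar with keyed concurrency control, the idea can be sketched in plain TypeScript as a limiter that caps how many tasks run at once for each key, for example per customer account. This is not Inngest's API or implementation. It is an in-process illustration with a hypothetical `KeyedLimiter` class, and it ignores the durability and cross-machine coordination that a hosted orchestration service has to provide.

```typescript
// In-process sketch of per-key concurrency limiting. All names are hypothetical.
class KeyedLimiter {
  private readonly limit: number;
  private readonly active = new Map<string, number>();
  private readonly waiting = new Map<string, Array<() => void>>();

  constructor(limit: number) {
    this.limit = limit;
  }

  // Runs the task once a slot for this key is free, then releases the slot,
  // even if the task throws.
  async run<T>(key: string, task: () => Promise<T>): Promise<T> {
    await this.acquire(key);
    try {
      return await task();
    } finally {
      this.release(key);
    }
  }

  private acquire(key: string): Promise<void> {
    const running = this.active.get(key) ?? 0;
    if (running < this.limit) {
      this.active.set(key, running + 1);
      return Promise.resolve();
    }
    // No free slot: park the caller until a running task hands its slot over.
    return new Promise<void>((resolve) => {
      const queue = this.waiting.get(key) ?? [];
      queue.push(resolve);
      this.waiting.set(key, queue);
    });
  }

  private release(key: string): void {
    const next = this.waiting.get(key)?.shift();
    if (next !== undefined) {
      // Hand the slot directly to the next waiter; the active count is unchanged.
      next();
      return;
    }
    const running = (this.active.get(key) ?? 1) - 1;
    if (running === 0) {
      this.active.delete(key);
    } else {
      this.active.set(key, running);
    }
  }
}

// Usage: at most two messages are processed at a time per account key,
// while different accounts do not block each other.
const limiter = new KeyedLimiter(2);
async function handleMessage(accountId: string, messageId: string): Promise<string> {
  return limiter.run(accountId, async () => `processed ${messageId}`);
}
```

Keying the limit is the design choice that matters here: one busy account cannot starve the others, because each key has its own slots and its own queue.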

Ultimately, choosing Inngest allowed us to radically simplify our architecture, reduce operational complexity, and focus on delivering our product while still meeting the demands of real-time processing at unusual scale. I count this decision as among the most pivotal in the technical choices we made early on that kept us moving quickly — and kept reliability high — with a very small and nimble team.

“

When I came to Day AI, I knew that I was going to have this problem at an epic scale because the whole design of the product is like what I was doing at HubSpot and Netflix, but times 10 million. Literally connect up to streams of everything everybody's doing everywhere and get 100% of that data into the system all day long with relatively low latency, then do all of these downstream LLM processing steps that feed into each other and don't mess it up. Operate it correctly. Stay very reliable. Set a new bar for scale.

”

In retrospect, would you say finding Inngest was serendipitous given your previous experiences?

Discovering Inngest felt like a wonderful moment of serendipity, for sure. I had spent so many years of my career exposed to orchestration challenges, and right at the moment when I was about to dive back into solving them in the same way, the perfect solution appeared. The universe was looking out for me!

Before, managing concurrency across units of work typically required some kind of state store (like Redis) and a series of complex locking mechanisms that were very tough to reason about. Realizing Inngest could handle these issues effortlessly was a game changer; it removed the need for the cumbersome solutions I had relied on for years. I feel strongly that the problem Inngest solves is universal — the vast majority of systems processing incoming data and scheduling work must maintain database-like consistency and reliability properties in some way. Until I found Inngest, I wasn't sure if an equally generalized solution could exist, akin to how transaction isolation and concurrency control are solved generally in databases.

It was incredibly satisfying to see that Inngest could be the realization of that generalized solution. The team behind it tackled concurrency, queuing, and reliability with the same care as a database, offering a strong, consistent interface while managing the difficult parts underneath. As we've continued to scale and grow the product at Day AI, Inngest's abstractions have grown with us, and kept our team remarkably small and nimble despite managing a very large data processing footprint. We still spend most of our time focused on building the product, not operating scheduling systems, just as it should be, and as I hope will become more the norm than the exception in the industry.

“

Discovering that it was possible to solve orchestration challenges in a generalized way, and that I could hand off one of the hardest things in my career to a team that treated it like a database, made it all the more satisfying. They give you an interface and handle the difficult parts of concurrency, queuing, fairness and reliability.

”

Curious to know how Day AI leverages Inngest? Read their case study: How Day AI built a CRM that can reason using Inngest.