Root Cause Analysis
Tommy + the discipline of Root Cause Analysis
Root Cause Analysis Examples

Castle Risk Online
Personal Project, 2025 - present
Castle Risk Online is an online multiplayer board game with chat, animations, and AI players. It supports social login, mobile, dark mode, and is a blast to play with family and friends.
The game is built with React, using jotai for atomic state management on the frontend and optimistic state synchronization via WebSockets. Traffic is proxied through a K8s (Kubernetes) ingress controller equipped with Cert Manager to the underlying Express.js servers, which autoscale based on TCP connection rules and use RxJS for functional stream processing of game events.
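As a rough illustration of optimistic state synchronization, here is a minimal sketch in plain TypeScript. It is not the game's actual code: `GameState`, `Action`, `OptimisticStore` and the `castle` territory are hypothetical names, and the jotai atoms and WebSocket transport are left out so the example stays self-contained. The idea shown is that the UI renders confirmed server state plus any local actions the server has not yet acknowledged, so a rejected action simply drops out of the view.

```typescript
// Hypothetical shapes: the real game's state and messages are not shown in the source.
type GameState = { armies: Record<string, number> };
type Action = { id: number; territory: string; delta: number };

// Pure reducer: applies one action to a state without mutating it.
function applyAction(state: GameState, action: Action): GameState {
  const current = state.armies[action.territory] ?? 0;
  return { armies: { ...state.armies, [action.territory]: current + action.delta } };
}

class OptimisticStore {
  private confirmed: GameState;
  private pending: Action[] = [];

  constructor(initial: GameState) {
    this.confirmed = initial;
  }

  // What the UI renders: confirmed state plus every unacknowledged local action.
  get view(): GameState {
    return this.pending.reduce(applyAction, this.confirmed);
  }

  // Local player acts; a real client would also send the action over the socket here.
  dispatch(action: Action): void {
    this.pending.push(action);
  }

  // Server accepted the action: fold it into the confirmed state.
  ack(id: number): void {
    const action = this.pending.find((a) => a.id === id);
    if (!action) return;
    this.confirmed = applyAction(this.confirmed, action);
    this.pending = this.pending.filter((a) => a.id !== id);
  }

  // Server rejected the action: dropping it rolls the view back.
  reject(id: number): void {
    this.pending = this.pending.filter((a) => a.id !== id);
  }
}

// Usage: two optimistic moves, one accepted and one rejected.
const store = new OptimisticStore({ armies: { castle: 3 } });
store.dispatch({ id: 1, territory: "castle", delta: 2 });
store.dispatch({ id: 2, territory: "castle", delta: -1 });
store.ack(1);
store.reject(2);
console.log(store.view.armies["castle"]); // 5
```

Deriving the view from confirmed state plus pending actions avoids hand-written undo logic. The sketch assumes actions are simple additive deltas; a real game would also need to reconcile moves from other players.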
Key Results
- Launched fully functional multiplayer game with real-time chat, social login, mobile + desktop support, dark mode
- Achieved <200ms latency for real-time game state synchronization across all players
- Kubernetes + Skaffold used for cloud-agnostic deployments

pull.systems
EV Observability + Analytics, 2023 - 2024
Upon joining, I came up to speed quickly on the stack behind the early version of Pull Workbench. It was very buggy, but it demonstrated the initial ideas and had the latest technologies and patterns established in the codebase, which made for a solid starting point.
I was entrusted to help our CTO hire several additional employees, so I conducted interviews for the first several months while working with the existing AI + Full Stack team to deliver features and solidify the system. After some initial catch-up to fix the early bugs that worried our business partners, we aimed to keep the system fully working with each merge, which gave those partners confidence that our team could deliver.
From there, I developed full stack features solo or by pairing with team members, and ultimately led a squad of 5 team members alongside a second squad that together comprised our engineering team.
Much of my time went into authoring complex analytics SQL queries using the impressive Kysely library, a fluent, typesafe query builder that we used for our Postgres and Redshift databases. Given the nature of the product, we needed to decide which queries could run in real time and which queries and subqueries had to be computed offline as part of a network of Airflow DAGs.
On the ML Ops side I advocated for traceability and reproducibility / determinism of all models and artifacts, and integrated with systems that implemented that, such as Airflow to coordinate DAGs of ML training jobs and SageMaker's metadata API. We controlled these via model lifecycle automations that produced and stored models, artifacts and metadata, which were in turn consumed at runtime or in batch by our analytics stack.
On the frontend, I helped us deliver an initial version of the Pattern Editor, a UI and set of APIs that users could use to put together their own patterns of interest, such as looking for certain anomalous ranges of quantities that themselves may be derived from other user-defined patterns. This entailed not only a UI that was DAG-aware but also a layer that converted the JSON representation of these patterns from the frontend into typesafe Kysely queries to be executed against Redshift.
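To make the JSON-to-query step concrete, here is a minimal sketch of how patterns that build on other patterns could be ordered and compiled. It is an illustration under stated assumptions rather than the Pattern Editor's actual implementation: the `Pattern` shape, the `telemetry` table and the column names are hypothetical, and it emits plain SQL text with CTEs instead of using Kysely so that it runs without dependencies.

```typescript
// Hypothetical JSON shape for a user-defined pattern; the real schema is not shown in the source.
type Pattern = {
  name: string;   // becomes the CTE name
  source: string; // a base table, or the name of another pattern
  column: string; // quantity to range-check
  min: number;
  max: number;
};

// Depth-first topological sort: each pattern is emitted after the pattern it reads from.
function topoSort(patterns: Pattern[]): Pattern[] {
  const byName = new Map(patterns.map((p) => [p.name, p] as const));
  const done = new Set<string>();
  const inProgress = new Set<string>();
  const ordered: Pattern[] = [];

  const visit = (p: Pattern): void => {
    if (done.has(p.name)) return;
    if (inProgress.has(p.name)) throw new Error(`cycle at pattern "${p.name}"`);
    inProgress.add(p.name);
    const dep = byName.get(p.source); // undefined when the source is a base table
    if (dep) visit(dep);
    inProgress.delete(p.name);
    done.add(p.name);
    ordered.push(p);
  };
  patterns.forEach((p) => visit(p));
  return ordered;
}

// Names are interpolated into SQL text, so only simple identifiers are allowed.
function ident(name: string): string {
  if (!/^[A-Za-z_][A-Za-z0-9_]*$/.test(name)) throw new Error(`bad identifier: ${name}`);
  return name;
}

// Compile every pattern into a CTE, then select from the requested one.
function toSql(patterns: Pattern[], target: string): string {
  const ctes = topoSort(patterns).map((p) => {
    if (!Number.isFinite(p.min) || !Number.isFinite(p.max)) throw new Error("bad range");
    return (
      `${ident(p.name)} AS (SELECT * FROM ${ident(p.source)} ` +
      `WHERE ${ident(p.column)} BETWEEN ${p.min} AND ${p.max})`
    );
  });
  return `WITH ${ctes.join(", ")} SELECT * FROM ${ident(target)}`;
}

// Usage: "hot_and_low" is derived from "hot_cells", which reads a base table.
const sql = toSql(
  [
    { name: "hot_and_low", source: "hot_cells", column: "voltage", min: 0, max: 3.2 },
    { name: "hot_cells", source: "telemetry", column: "temp_c", min: 45, max: 200 },
  ],
  "hot_and_low"
);
console.log(sql);
```

Sorting first means a derived pattern can always refer to its parent by name, and a cyclic definition fails fast instead of producing an invalid query. A typesafe builder would replace the string templates in a production version.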
Key Results
- Led 5-person squad delivering Pattern Editor enabling custom anomaly detection workflows
- Processed 10M+ daily records with type-safe SQL queries using Kysely
- Improved hiring velocity by conducting 30+ technical interviews while building product

Intertru.ai
AI-assisted Hiring, 2023 - 2024
The candidate summary page summarized a candidate's performance during multiple interview stages by presenting radar charts showing degree of fit against the values and attributes being evaluated for their position, as defined in the Interview Builder.
I built the frontend in React and TypeScript and integrated it with the backend, which I partially built. The backend leveraged RAG and ran several Machine Learning models to produce scores and explainable AI: for example, models that broke interview transcripts into quotable fragments, evaluated their relevance against configured company values, and called ChatGPT APIs to obtain summaries and scores for that content.
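The fragment-and-relevance step can be sketched in a few lines. This is a toy stand-in, not the models the product used: it splits a transcript into sentences and scores each against a company value with bag-of-words cosine similarity, where a real system would more likely use learned embeddings. The function names and the sample transcript are hypothetical.

```typescript
// Split a transcript into sentence-level, quotable fragments.
function toFragments(transcript: string): string[] {
  const parts = transcript.match(/[^.!?]+[.!?]*/g) ?? [];
  return parts.map((s) => s.trim()).filter((s) => s.length > 0);
}

// Lowercased word counts: a crude substitute for an embedding vector.
function wordCounts(text: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const word of text.toLowerCase().match(/[a-z']+/g) ?? []) {
    counts.set(word, (counts.get(word) ?? 0) + 1);
  }
  return counts;
}

// Cosine similarity between two sparse count vectors, in [0, 1].
function cosine(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((n, word) => {
    dot += n * (b.get(word) ?? 0);
    normA += n * n;
  });
  b.forEach((n) => {
    normB += n * n;
  });
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Rank fragments by relevance to one configured value, most relevant first.
function rankFragments(transcript: string, value: string): { fragment: string; score: number }[] {
  const target = wordCounts(value);
  return toFragments(transcript)
    .map((fragment) => ({ fragment, score: cosine(wordCounts(fragment), target) }))
    .sort((x, y) => y.score - x.score);
}

// Usage with a made-up transcript and value description.
const ranked = rankFragments(
  "I paired with a junior engineer every week. We shipped the release on time. " +
    "I enjoy mentoring a junior teammate.",
  "mentoring junior engineers"
);
console.log(ranked[0]?.fragment); // "I enjoy mentoring a junior teammate."
```

Keeping the fragment text next to its score is what makes the result explainable: the UI can quote the sentence that drove a rating rather than showing only a number.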
Key Results
- Built AI-powered candidate evaluation dashboard enabling data-driven hiring decisions
- Integrated 3 ML models to support explainable AI
- Prototyped quickly with product and design to reach product-market fit cheaply

Appen AI
Formerly Figure Eight, 2022 - 2023
Rather than keeping DevOps, infrastructure and tests completely separate from development teams, I moved the needle so that product development teams could own more of their own infrastructure and tests, creating less back-and-forth and empowering teams to deliver.
We used DevSpace, which meant any dev or team could stand up a reproducible, isolated stack in the cloud with multiple services and frontends running, and could modify the definitions of the infrastructure and code themselves, directly, without permission or external team tickets.
This let product engineers experiment and test more through declarative infrastructure and configuration management while still protecting our production environments, unshackling them and unlocking their potential as the experts in the software.
At the same time I worked to reduce the outsized role our amazing DevOps team was playing in the day to day management as well as enhancement of environments, which unfairly impeded expert developers by introducing red tape and inter-team processes that didn't add value.
I ran Appen's ML Platform, which FAANG companies and many startups and enterprises used to automate and scale their ML practices, including both supervised and unsupervised workloads. It also covered the global annotation workforce, which let customers draw elastically on our crowdsourced professionals for labelling and quality-checking services across text, voice, image, video and LIDAR annotation, training and validation use cases.
I reported to the CTO and directed multiple full stack teams, each with their own tech leads and range of engineering skills, to do both regular maintenance and product enhancements using technologies like SageMaker, React, K8s (Kubernetes), Spark, Kafka, Airflow, Spring (Boot), Ruby, Python, Java, TypeScript and SQL.
Maintenance included regular updates to infrastructure, bug fixes, and performance optimizations across the platform. We migrated more and more services to K8s (Kubernetes) and Ambassador as our API gateway, where we could consolidate cross-cutting logic like auth and versioning.
Enhancements included changes to simplify the UX, kill redundant or unused features, add measurement to inform our choices, and larger efforts like Enterprise OAuth.
2021 - 2022
Four websites built on different technologies and acquired from different companies, plus some APIs, all needed unified sign up, sign in, and sign out. Each had its own user store, and some 3rd party vendor users logged in with their vendor and then authenticated to us with a hidden token.
It was a stalled project, so I started with missing requirements, incomplete designs and misleading progress indicators. I refocused other leaders and teams on delivery through tested, working software, treating tested user stories and on-the-ground learnings as the units of progress instead of large, outdated, waterfall-style PRDs.
Contributed directly in React / TypeScript, Node.js / Express, Ruby on Rails and custom gems, OAuth configuration, and Java Spring with runtime-loaded SPI implementations from across separate application domains.
There was a complex architecture at play and teams that did not know each other and weren't working as a single unit, so the landscape was difficult and rife with demoralized team members.
Although my team was to play but one part in many on the project, I realized quickly that there was no single leader or coherent plan, and so there was lots of blame game and treading water.
With permission from our VP of Engineering, I took charge of the teams and worked with product to firm up requirements. I also replaced the initially conceived solution architecture, which had been created in a bit of a vacuum and would not have worked, with one that would. I got there by digging in, running all the services and web apps myself, and understanding the multiple data stores and existing auth mechanisms, including auth via 3rd party vendors to some parts of the system.
I delivered the project within 5 months and for my efforts was rewarded not long after with a promotion.
Key Results
- Reduced deployment lead time by 75% enabling product teams to self-serve infrastructure
- Ran ML platform to support 100K+ annotation jobs daily across FAANG clients
- Unified authentication across 4 legacy systems reducing login friction by 85%

Progressive Insurance
Auto Insurer, 2011 - 2012
Consulted and advised Enterprise Steering Committee on adoption of Agile processes within the broader SDLC, which was being standardized away from waterfall.
Conceived of, designed, prototyped, developed, tested and rolled out various solutions for improved modelling and automation of several development-related business processes, spanning the software development lifecycle from requirements to system retirement. Created bidirectional traceability between requirements, code, test, and defect data and metadata from various disparate systems and technologies including Quality Center, SharePoint, Visual Studio and TFS using REST, JavaScript, .NET and WCF services and an Enterprise Service Bus model.
2008 - 2011
Quoting (F3) was a success, but to roll out to all 50 states we needed an engine that could handle the complexity of the system, render quickly, and be easier for our engineers to build and test with.
I took the most complex page of the Direct Auto Quoting app - the "Buy Page" - and drastically improved performance, reducing render time from 28 seconds to 2, by isolating the page and building it with a prototype of what would become REF2.
Seeing such a drastic improvement in performance gave the business the confidence they needed to convert Quoting (F3) to REF2 and use it to roll out to the remaining states. (Sidenote: a code-oriented framework, FlashQuoting, which did away with REF2's markup and code bindings, superseded REF2 before all 50 states were rolled out. The later popularity of and similarities with React confirmed REF2 was onto something.)
Technical Details: REF2 targeted both Flash and HTML during the pre-WebKit era. Created technology-independent language and APIs for describing UI hierarchies, cascading styles, business logic and arbitrary data structures in a way that abstracted the developer away from the details of the client/server event communications, marshalling between multiple client technologies, persistence of state concerns, or the details of rendering engines / APIs in the various supported environments. Aided in development of dev tools such as code hinting and code generation to facilitate quick onboarding, compile time checks, type safety and debugging.
2007 - 2008
Progressive had two sites for Direct Auto Quoting - one used by customers at home and the other used by our call-center reps. Given the complex state-by-state variance in the insurance laws, this made for a huge maintenance cost, and doing it twice in two codebases didn't make sense.
The premise was that a single web 2.0 Direct Quoting Application could replace these while also yielding a much more modern and customer-delighting application.
As the team's ActionScript expert, I joined and quickly helped deliver feature after feature. I loved the XP discipline and grew to appreciate TDD, especially after the site started hitting performance problems that warranted significant refactors, which would have been much riskier without test coverage!
Our pilot included 14 of the 50 states and was a complete success. However, the growing volume of client-side rules and assets started slowing the app down, and I was asked to replace the REF framework, which predated my joining, with something much faster and more developer friendly. This led to my promotion to lead engineer, where I began work on REF 2.0 (UI Framework).
Key Results
- Reduced requirements-to-deployment cycle time by ~35% on key projects through Agile adoption
- Reduced complex quote page render time by 93% (from 28 seconds to 2 seconds)
- Successfully launched modern web app replacing 2 legacy systems in 14 states
Core.com
T1, ISDN, Dial-Up + Web Hosting, 1997 - 2000
At the time, dial-up was still the most prevalent way of connecting to the internet. Our users had 28.8k, 26k and 56k modems running PPP; some had static IPs, whereas others used DHCP to assign settings for the length of the connection session.
Users had Linux home directories, FTP accounts, and the option to host Apache-based vhosts. My job was to leverage our support base and my growing knowledge of the protocols and technology to do Root Cause Analysis and solve their problems. I always took pride in my job and went above and beyond to provide the best customer service.
Key Results
- Maintained 95%+ customer satisfaction score resolving 50+ support tickets weekly
- Reduced average ticket resolution time by 40% through systematic troubleshooting
- Enabled 50+ customers to successfully host websites on Apache virtual hosts