﻿<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Data, Chai & Dialogue - The Leadership Lounge]]></title><description><![CDATA[Data, Chai & Dialogue is a space where insights meet storytelling, brewing conversations at the intersection of tech, culture, and personal growth from Krakow, Poland.]]></description><link>https://tshasankda.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!cMG_!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ae5d6ee-8e59-41a5-ae04-a55ee9c5b03e_1024x1024.png</url><title>Data, Chai &amp; Dialogue - The Leadership Lounge</title><link>https://tshasankda.substack.com</link></image><generator>Substack</generator><lastBuildDate>Sat, 20 Jun 2026 13:26:16 GMT</lastBuildDate><atom:link href="https://tshasankda.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Tharashasank Davuluru]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[tshasankda@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[tshasankda@substack.com]]></itunes:email><itunes:name><![CDATA[Tharashasank Davuluru]]></itunes:name></itunes:owner><itunes:author><![CDATA[Tharashasank Davuluru]]></itunes:author><googleplay:owner><![CDATA[tshasankda@substack.com]]></googleplay:owner><googleplay:email><![CDATA[tshasankda@substack.com]]></googleplay:email><googleplay:author><![CDATA[Tharashasank Davuluru]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[STARBURST ENTERPRISE PERFORMANCE TUNING — A PRACTITIONER'S SERIES]]></title><description><![CDATA[Part 2: Configuration Deep Dive &#8212; Memory, JVM & 
Concurrency]]></description><link>https://tshasankda.substack.com/p/starburst-enterprise-performance-0df</link><guid isPermaLink="false">https://tshasankda.substack.com/p/starburst-enterprise-performance-0df</guid><dc:creator><![CDATA[Tharashasank Davuluru]]></dc:creator><pubDate>Wed, 15 Apr 2026 09:01:50 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!g8cj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa77a3e33-14e9-476b-ae4b-e5f47bbe2fb1_846x861.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://tshasankda.substack.com/p/starburst-enterprise-performance">Part 1</a> covered the naming history, MPP architecture, and cluster sizing across AWS, Azure, and GCP. Part 2 of 3 in this series goes deep on configuration: the memory hierarchy, correct current property names (validated against SEP 477-e LTS), JVM tuning, spilling, and resource group setup</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;3edc425a-2677-436e-9e12-cb0bc5cf65c0&quot;,&quot;caption&quot;:&quot;What sparked this series: A junior engineer on our team noticed that RAM utilization on our Starburst cluster was consistently above 80%, with all 25 worker nodes running at full capacity. Rather tha&#8230;&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;showDescription&quot;:true,&quot;showImage&quot;:true,&quot;size&quot;:&quot;md&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;STARBURST ENTERPRISE PERFORMANCE TUNING &#8212; A PRACTITIONER'S SERIES&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:28659751,&quot;name&quot;:&quot;Tharashasank Davuluru&quot;,&quot;bio&quot;:&quot;I&#8217;ve been living in Krakow, Poland, for the past eight years, with a background in engineering and technology. 
&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bf2a8721-d298-4f9f-b6a7-bc77ba443e9f_1200x1200.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2026-04-01T19:55:03.631Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9d512fff-e4f6-4625-99ee-80d41a5d57f6_1536x1024.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://tshasankda.substack.com/p/starburst-enterprise-performance&quot;,&quot;section_name&quot;:&quot;Journeys &amp; Lessons&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:192879054,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:0,&quot;comment_count&quot;:0,&quot;publication_id&quot;:4523895,&quot;publication_name&quot;:&quot;Data, Chai &amp; Dialogue - The Leadership Lounge&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!cMG_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ae5d6ee-8e59-41a5-ae04-a55ee9c5b03e_1024x1024.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h1>Chapter 6: Memory Configuration &#8212; Getting It Right</h1><p style="text-align: justify;">Memory configuration is the single highest-impact tuning dimension for Starburst Enterprise. Get it wrong and you get either constant OOM failures (configured too tight) or wasted capacity (configured too loose with too-generous per-query limits). 
Get it right and the cluster runs predictably under heavy concurrent load.</p><p style="text-align: justify;">Before configuring anything, you need to understand what is actually consuming memory on each node.</p><h2>6.1 The Memory Hierarchy &#8212; Every Layer Accounted For</h2><p>A common assumption is that assigning 128GB of RAM to a Trino worker means 128GB is available for queries. The reality is more layered:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!g8cj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa77a3e33-14e9-476b-ae4b-e5f47bbe2fb1_846x861.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!g8cj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa77a3e33-14e9-476b-ae4b-e5f47bbe2fb1_846x861.png 424w, https://substackcdn.com/image/fetch/$s_!g8cj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa77a3e33-14e9-476b-ae4b-e5f47bbe2fb1_846x861.png 848w, https://substackcdn.com/image/fetch/$s_!g8cj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa77a3e33-14e9-476b-ae4b-e5f47bbe2fb1_846x861.png 1272w, https://substackcdn.com/image/fetch/$s_!g8cj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa77a3e33-14e9-476b-ae4b-e5f47bbe2fb1_846x861.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!g8cj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa77a3e33-14e9-476b-ae4b-e5f47bbe2fb1_846x861.png" 
width="846" height="861" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a77a3e33-14e9-476b-ae4b-e5f47bbe2fb1_846x861.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:861,&quot;width&quot;:846,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:154497,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://tshasankda.substack.com/i/193482004?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa77a3e33-14e9-476b-ae4b-e5f47bbe2fb1_846x861.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!g8cj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa77a3e33-14e9-476b-ae4b-e5f47bbe2fb1_846x861.png 424w, https://substackcdn.com/image/fetch/$s_!g8cj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa77a3e33-14e9-476b-ae4b-e5f47bbe2fb1_846x861.png 848w, https://substackcdn.com/image/fetch/$s_!g8cj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa77a3e33-14e9-476b-ae4b-e5f47bbe2fb1_846x861.png 1272w, https://substackcdn.com/image/fetch/$s_!g8cj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa77a3e33-14e9-476b-ae4b-e5f47bbe2fb1_846x861.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h2>6.2 The Memory Allocation Formula</h2><p>The official Starburst documentation states: &#8216;Typically, values representing 70 to 85 percent of the total available memory is recommended&#8217; for the JVM heap. For larger nodes, the percentage can be lower because absolute headroom grows. 
Here is how I think about allocation on a 256 GB worker:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vq8c!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1262c05-d743-44c8-bb04-cffd2ff8e329_844x324.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vq8c!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1262c05-d743-44c8-bb04-cffd2ff8e329_844x324.png 424w, https://substackcdn.com/image/fetch/$s_!vq8c!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1262c05-d743-44c8-bb04-cffd2ff8e329_844x324.png 848w, https://substackcdn.com/image/fetch/$s_!vq8c!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1262c05-d743-44c8-bb04-cffd2ff8e329_844x324.png 1272w, https://substackcdn.com/image/fetch/$s_!vq8c!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1262c05-d743-44c8-bb04-cffd2ff8e329_844x324.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vq8c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1262c05-d743-44c8-bb04-cffd2ff8e329_844x324.png" width="844" height="324" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b1262c05-d743-44c8-bb04-cffd2ff8e329_844x324.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:324,&quot;width&quot;:844,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:49585,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://tshasankda.substack.com/i/193482004?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1262c05-d743-44c8-bb04-cffd2ff8e329_844x324.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vq8c!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1262c05-d743-44c8-bb04-cffd2ff8e329_844x324.png 424w, https://substackcdn.com/image/fetch/$s_!vq8c!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1262c05-d743-44c8-bb04-cffd2ff8e329_844x324.png 848w, https://substackcdn.com/image/fetch/$s_!vq8c!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1262c05-d743-44c8-bb04-cffd2ff8e329_844x324.png 1272w, https://substackcdn.com/image/fetch/$s_!vq8c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1262c05-d743-44c8-bb04-cffd2ff8e329_844x324.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div class="callout-block" data-callout="true"><p style="text-align: justify;"><em><strong>Memory arithmetic check</strong>: The official Starburst documentation states that the sum of query.max-memory-per-node and memory.heap-headroom-per-node MUST be less than the JVM heap size (-Xmx). If these two values sum to more than -Xmx, SEP will fail to start. This is validated at startup.</em></p></div><h2>6.3 Correct Configuration Properties (SEP 453-e and Later)</h2><p>This is where I want to be very precise, because several property names have changed across SEP versions and incorrect properties cause the cluster to fail to start. 
The following are correct for SEP 453-e LTS and all later versions including the current 477-e LTS.</p><h3>config.properties on All Nodes</h3><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;56ee2317-39a2-40b6-88e7-bce6edd3dbc7&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext"># === MEMORY MANAGEMENT ===

# Maximum distributed (cluster-wide) USER memory a single query can use
# User memory = data directly attributable to query execution (hash tables, sort buffers)
# Default: 20GB &#8212; almost always needs tuning upward for production workloads
query.max-memory=500GB

# Maximum user memory a single query can use on a single worker node
# The sum of (query.max-memory-per-node + memory.heap-headroom-per-node)
# MUST be less than the JVM heap (-Xmx). SEP validates this at startup.
query.max-memory-per-node=150GB

# Memory reserved as headroom for untracked JVM allocations
# Default: JVM max memory * 0.3
# Increase if workers are experiencing unexpected OOM errors despite query memory limits looking fine
memory.heap-headroom-per-node=50GB
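# Worked check for the values above (a sketch; assumes the -Xmx205G heap from the jvm.config template in Chapter 7):
#   query.max-memory-per-node + memory.heap-headroom-per-node = 150GB + 50GB = 200GB
#   200GB is below the 205GB heap, so the startup validation passes
#   query.max-memory spread evenly over 25 workers = 500GB / 25 = 20GB per worker on average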
</code></pre></div><h2>6.4 Understanding query.max-memory &#8212; The Cluster-Wide Cap</h2><p style="text-align: justify;">This is one of the most commonly misunderstood settings in Starburst/Trino deployments. query.max-memory is NOT a per-node limit. It is the total user memory a single query is allowed to consume across all worker nodes combined.</p><p style="text-align: justify;">Example: If you have 25 workers and set query.max-memory=500GB, a single query can use up to 500GB distributed across all 25 machines, or approximately 20GB per worker on average. In practice, distribution is uneven (JOIN build sides concentrate memory on specific nodes), which is why per-node limits also matter.</p><p>How to set query.max-memory:</p><ul><li><p>Start with a value that covers your largest realistic single-query memory consumption (measure with EXPLAIN ANALYZE on your heaviest queries)</p></li><li><p>Set it high enough that legitimate large queries succeed, but low enough that a runaway query cannot consume the entire cluster</p></li><li><p>Apply different memory limits to different user groups, so that interactive users get lower limits than ETL jobs. Resource groups cap the memory of a whole group (softMemoryLimit); a per-query cap for a group is typically applied through the query_max_memory session property, for example with a session property manager</p></li></ul><h2>6.5 Memory and Data Skew</h2><p style="text-align: justify;">Even with correct memory configuration, data skew can cause EXCEEDED_LOCAL_MEMORY_LIMIT errors on individual workers while the cluster-wide memory usage looks fine. This happens because the per-node limit (query.max-memory-per-node) is enforced independently on each worker, and a skewed key causes one worker to accumulate disproportionately large hash table data.</p><p style="text-align: justify;">Example from production: A JOIN on user_id where user_id=0 represents 30% of all rows. The worker assigned hash(0) % 25 must build a hash table roughly 7.5x larger than that of the average worker (30% of the rows versus the 4% that an even split across 25 workers would give). 
If query.max-memory-per-node is set to the average expected usage, this worker fails with EXCEEDED_LOCAL_MEMORY_LIMIT while the others have abundant headroom.</p><p>Remediation options:</p><ul><li><p>Pre-filter known skewed values before the JOIN. If user_id=0 (anonymous) is not analytically relevant, filter it out with WHERE user_id != 0</p></li><li><p>Increase query.max-memory-per-node to handle the skewed case, but this reduces concurrent query capacity</p></li><li><p>Restructure the query to avoid the skewed join key being the shuffle key</p></li><li><p>For GROUP BY skew: GROUP BY with skewed keys benefits from partial pre-aggregation on workers (Trino does this automatically) before the final shuffle</p></li></ul><h1>Chapter 7: JVM Configuration &#8212; The Current Official Template</h1><p style="text-align: justify;">Trino is a Java application and JVM configuration directly affects query throughput, GC pause times, and stability. The JVM configuration has evolved significantly as SEP has moved through Java versions. What follows is the current official template from the Starburst Kubernetes configuration documentation, validated against SEP 477-e LTS.</p><h2>7.1 Java Version Requirements by SEP Version</h2><p style="text-align: justify;">This is operationally critical. SEP has strict Java version requirements that change with each LTS release, and using the wrong Java version causes SEP to refuse to start. The JDK is now bundled: since SEP 462-e, SEP ships with the correct JDK version. You do not need to separately install Java on nodes. The bundled JDK is Eclipse Temurin OpenJDK from Adoptium. 
If you set JAVA_HOME to override the bundled JDK, ensure it meets the minimum required version for your SEP release; using an older JDK causes startup failure.</p><h2>7.2 The Official jvm.config Template (SEP 477-e LTS)</h2><p>This is taken directly from the official Starburst Kubernetes configuration examples at <a href="http://docs.starburst.io/latest/k8s/sep-config-examples.html">docs.starburst.io/latest/k8s/sep-config-examples.html</a>. For bare-metal or VM deployments, the same flags apply in etc/jvm.config, with -Xmx set explicitly instead of using RAM percentage flags:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;920d2dac-6762-4d38-af82-70093a67157b&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">-server
# Keep comments on their own lines: each non-comment line is passed to the JVM as a single option
# Bare metal / VMs: set an explicit heap of 70-85% of total node RAM and omit the two RAMPercentage flags
-Xmx205G
# Kubernetes: omit -Xmx and size the heap from the container memory limit instead
-XX:InitialRAMPercentage=80
-XX:MaxRAMPercentage=80
# G1GC region size; 32M recommended for large heaps
-XX:G1HeapRegionSize=32M
# Run System.gc() calls as concurrent cycles instead of stop-the-world collections
-XX:+ExplicitGCInvokesConcurrent
# CRITICAL: exit cleanly on OOM; let the process manager restart
-XX:+ExitOnOutOfMemoryError
# Write heap dump for post-mortem analysis
-XX:+HeapDumpOnOutOfMemoryError
# Preserve stack traces even for frequently thrown exceptions
-XX:-OmitStackTraceInFastThrow
# Code cache for JIT-compiled query operators
-XX:ReservedCodeCacheSize=512M
-XX:PerMethodRecompilationCutoff=10000
-XX:PerBytecodeRecompilationCutoff=10000
-Djdk.attach.allowAttachSelf=true
-Djdk.nio.maxCachedBufferSize=2000000
-Dfile.encoding=UTF-8
# Required for JOL (Java Object Layout) agent support
-XX:+EnableDynamicAgentLoading
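# Sizing check for the explicit heap above (a sketch, assuming the 256 GB worker from section 6.2):
#   0.80 x 256 GB = 204.8 GB, so -Xmx205G is about 80% of node RAM, inside the documented 70-85% range
#   256 GB - 205 GB leaves roughly 51 GB outside the heap for the OS and non-heap JVM memory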
</code></pre></div><div class="callout-block" data-callout="true"><p><em><strong>Note on -Xms</strong>: The official Starburst jvm.config template does NOT include -Xms (initial heap size). Setting -Xms equal to -Xmx (the pattern sometimes recommended to avoid heap resizing) is not in the current official template. The -XX:InitialRAMPercentage flag achieves a similar goal in containerised deployments. On bare metal, -Xmx alone is the current official guidance.</em></p></div><h2>7.3 Key JVM Principles Explained</h2><h3>Why -XX:+ExitOnOutOfMemoryError Is Critical</h3><p style="text-align: justify;">When a JVM hits an OutOfMemoryError, the JVM is in an inconsistent state. Objects may be partially initialised, locks may be held, and the garbage collector may be unable to recover. Attempting to continue running after an OOM typically leads to cascading failures, corrupted state, and queries that appear to hang rather than fail cleanly.</p><p style="text-align: justify;">With -XX:+ExitOnOutOfMemoryError, the JVM exits immediately when OOM occurs. The process manager (systemd on VMs, Kubernetes on containers) restarts the Trino process in a clean state. Other workers that were not affected continue serving queries. The coordinator detects the worker departure and marks any in-flight queries on that worker as failed, allowing clients to retry.</p><h3>Why OS Swap Must Be Disabled</h3><p style="text-align: justify;">Trino&#8217;s JVM uses G1GC (Garbage First Garbage Collector), a compacting garbage collector. G1GC works by moving objects between memory regions and continuously compacting the heap to eliminate fragmentation. Because objects move frequently, there are no stable &#8216;cold&#8217; memory regions that the OS could safely page to swap disk.</p><p style="text-align: justify;">If the OS swaps any part of the JVM heap, G1GC may need to touch that swapped page during a collection cycle, triggering a page fault and dramatically extending GC pause times. 
A GC pause that should take 200ms can extend to many seconds when swap is involved, manifesting as erratic query latency spikes.</p><p style="text-align: justify;">The official Starburst documentation explicitly states: &#8216;Allocation of all memory to the JVM or using swap space is not supported, and disabling swap space on the operating system level is recommended.&#8217;</p><h3>The Code Cache &#8212; An Often Overlooked Limit</h3><p style="text-align: justify;">Trino uses runtime code generation for query operators: it generates optimised Java bytecode at query execution time based on the specific types and operations in each query. Once the JIT compiler turns that bytecode into native code, the result is stored in the JVM code cache.</p><p style="text-align: justify;">If the code cache (sized by -XX:ReservedCodeCacheSize) fills up, the JVM stops compiling new code and falls back to interpreted execution. This causes a progressive and difficult-to-diagnose performance degradation: queries get slower over time as the cache fills, but there may be no obvious error. The 512MB setting in the official template is appropriate for production SEP workloads.</p><h2>7.4 Kubernetes-Specific JVM Considerations</h2><p>When running SEP on Kubernetes, several JVM behaviours require specific attention:</p><ul><li><p>CPU limits vs. requests: Do NOT set CPU limits on SEP coordinator or worker pods. The official Starburst Kubernetes documentation states this explicitly: hard CPU limits cause JVM CPU throttling even when the node has headroom, degrading performance. Set CPU requests only.</p></li><li><p>Container memory limits: Set memory limits equal to memory requests to prevent OOM eviction. The JVM heap (-Xmx or MaxRAMPercentage) plus non-heap overhead must fit within the container memory limit.</p></li><li><p>JVM cgroup awareness: Modern JDK versions (17+) are cgroup-aware, so InitialRAMPercentage and MaxRAMPercentage calculate percentages of the container&#8217;s memory limit, not the node&#8217;s total RAM. 
This makes K8s deployments cleaner than specifying absolute -Xmx values.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;dfe667c3-bfc9-465e-82d8-dee8bd0a877f&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext"># Recommended K8s resource configuration (values.yaml)
coordinator:
  resources:
    requests:
      memory: '256Gi'
      cpu: 32
    limits:
      memory: '256Gi'
      # NO cpu limit &#8212; critical for JVM performance
worker:
  resources:
    requests:
      memory: '256Gi'
      cpu: 32
    limits:
      memory: '256Gi'
      # NO cpu limit &#8212; critical for JVM performance
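# Heap arithmetic for the limits above (a sketch, assuming -XX:MaxRAMPercentage=80 from the jvm.config template):
#   0.80 x 256Gi = 204.8Gi of heap per pod
#   256Gi - 204.8Gi leaves about 51Gi of the container limit for non-heap memory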
</code></pre></div></li></ul><h1>Chapter 8: Disk Spilling, and Why I Recommend Against It</h1><p>Spilling is Trino&#8217;s mechanism for reducing memory consumption during JOIN and aggregation operations by writing intermediate data to local disk and reading it back in batches. It is a legitimate feature that can save queries that would otherwise fail with out-of-memory errors. My recommendation after years of production experience: design your cluster so you never need it.</p><h2>8.1 How Spilling Works</h2><p>When spilling is enabled and a query runs short of memory during execution:</p><ul><li><p>JOIN spill: The build-side hash table and the matching probe rows are serialised to disk in sorted partitions. Each partition is later read back from disk and the join continues partition by partition. The join completes, but requires multiple passes of disk I/O.</p></li><li><p>Aggregation spill: Partial aggregation accumulators are written to disk in sorted order and then merged in a second pass, similar to an external sort.</p></li></ul><h2>8.2 The Real Cost of Spilling</h2><p>The mechanics sound manageable. 
The production reality is less friendly:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3KsR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9184aba-d467-43f2-9cc4-0d68e151a204_849x627.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3KsR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9184aba-d467-43f2-9cc4-0d68e151a204_849x627.png 424w, https://substackcdn.com/image/fetch/$s_!3KsR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9184aba-d467-43f2-9cc4-0d68e151a204_849x627.png 848w, https://substackcdn.com/image/fetch/$s_!3KsR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9184aba-d467-43f2-9cc4-0d68e151a204_849x627.png 1272w, https://substackcdn.com/image/fetch/$s_!3KsR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9184aba-d467-43f2-9cc4-0d68e151a204_849x627.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3KsR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9184aba-d467-43f2-9cc4-0d68e151a204_849x627.png" width="849" height="627" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c9184aba-d467-43f2-9cc4-0d68e151a204_849x627.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:627,&quot;width&quot;:849,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:115402,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://tshasankda.substack.com/i/193482004?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9184aba-d467-43f2-9cc4-0d68e151a204_849x627.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3KsR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9184aba-d467-43f2-9cc4-0d68e151a204_849x627.png 424w, https://substackcdn.com/image/fetch/$s_!3KsR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9184aba-d467-43f2-9cc4-0d68e151a204_849x627.png 848w, https://substackcdn.com/image/fetch/$s_!3KsR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9184aba-d467-43f2-9cc4-0d68e151a204_849x627.png 1272w, https://substackcdn.com/image/fetch/$s_!3KsR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9184aba-d467-43f2-9cc4-0d68e151a204_849x627.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>8.3 Configuration</h2><p>Spilling is disabled by default in Trino/SEP. If you are on an older version where it was previously enabled or are considering enabling it, the current configuration properties are:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;98da1266-8e15-4c0f-b1f2-21999989fc3c&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext"># Default: false &#8212; spilling disabled
# Only enable if you have a specific, accepted use case
spill-enabled=false

# If enabling spill, configure the spill path to a fast local disk
# Do not use object storage or network filesystems for spill
spiller-spill-path=/var/trino/spill
max-spill-per-node=500GB
query-max-spill-per-node=100GB
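
# Sketch, not from the original config: open source Trino documents a
# spill_enabled session property. Assuming your SEP version exposes it too,
# spill can stay off cluster-wide and be switched on for one accepted query
# from the client session (spiller-spill-path must still be set on workers):
#   SET SESSION spill_enabled = true;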
</code></pre></div><p>What to do instead of spilling:</p><ol><li><p>Add memory to worker nodes. This is the right long-term fix.</p></li><li><p>Rewrite the query to use approximate functions (approx_distinct, approx_percentile), which dramatically lower memory requirements.</p></li><li><p>Pre-aggregate or pre-sort the data to reduce JOIN memory requirements.</p></li><li><p>Use resource groups to prevent multiple memory-heavy queries from running concurrently.</p></li><li><p>Use bucketing to enable node-local joins that eliminate the distributed hash table pattern.</p></li></ol><p>All of these are better than relying on spill.</p><h1>Chapter 9: Task Concurrency and Worker Scheduling</h1><p style="text-align: justify;">After memory and JVM, the next configuration layer is how Trino schedules work within and across workers. Trino uses a cooperative, fair-share task scheduler: all queries share the same thread pool on each worker, with preemption to prevent long queries from starving short ones.</p><h2>9.1 Core Scheduling Properties</h2><p style="text-align: justify;">These go in config.properties on all nodes. Verify property names against docs.starburst.io/latest/admin/properties.html for your specific SEP version, as property names occasionally change across major releases:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;f09b98c5-084a-42f9-b7a2-9a709ce8c044&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext"># === TASK AND SCHEDULING ===

# Number of concurrent processing threads per task on each worker
# Rule of thumb: 2x physical core count, which on hyperthreaded x86 instances equals the vCPU count
# Example: m7i.16xlarge has 32 physical cores (64 vCPUs) &#8594; task.concurrency=64
# Example: Graviton ARM has no hyperthreading &#8594; task.concurrency = vCPU count
# Trino requires a power of two; the value below is a placeholder to adjust per node type
task.concurrency=32

# Threads handling HTTP data exchange between workers (shuffle network traffic)
# Increase for high-concurrency clusters with many simultaneous queries
task.http-response-threads=100

# Timeout for HTTP data exchange between workers
# Increase if network latency between nodes is high or cluster spans availability zones
task.http-timeout=2m
</code></pre></div><div class="callout-block" data-callout="true"><p style="text-align: justify;"><em><strong>Setting task.concurrency correctly matters</strong>: If set too low, CPUs sit idle even when work is available. If set too high, too many threads compete for the same CPU cores, increasing context switching overhead. The recommended starting value is 2x physical cores. On ARM instances without hyperthreading (Graviton, Ampere), use the vCPU count directly.</em></p></div><h2>9.2 Split Assignment Configuration</h2><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;1355d414-fd71-4e34-9007-58ecf043861a&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext"># Maximum number of splits queued per worker node
# Increase for workloads with many small splits (many small files)
node-scheduler.max-splits-per-node=100

# Maximum pending (not yet started) splits per active task on a worker
node-scheduler.max-pending-splits-per-task=10
</code></pre></div><h2>9.3 Resource Groups &#8212; Sharing the Cluster</h2><p style="text-align: justify;">Resource groups are the mechanism for controlling how concurrent queries share cluster resources. They define limits on the number of running and queued queries per group, and they implement a psychology-based approach to fairness: the design goal is to maximise user happiness, not just mathematical resource utilisation.</p><p style="text-align: justify;">From the original Presto Training Series (Dain Sundstrom, Starburst 2020): &#8216;Focus on maximizing user happiness &#8212; psychology not computer science. User experience should match expectations. Small, fast, trivial queries should run immediately. People hate waiting in line.&#8217;</p><h3>SEP Resource Group Systems</h3><p>SEP 477-e LTS offers two resource group implementations:</p><ul><li><p>Built-in Resource Groups: The current and preferred approach. Configured via the Starburst Enterprise Web UI under the Workload Management section, or via the Starburst REST API. Enabled by setting resource-groups.configuration-manager=starburst in config.properties.</p></li><li><p>Legacy SEP Resource Groups: JSON-file-based configuration, compatible with the Trino open source resource group format. Still functional but superseded by built-in resource groups in current SEP.</p></li></ul><div class="callout-block" data-callout="true"><p style="text-align: justify;"><em>If you are coming from older Starburst or Trino documentation that shows YAML or JSON resource group files, that approach still works in current SEP. The built-in resource groups via the Web UI are the newer, recommended path for new deployments. 
You can migrate from the file-based approach to built-in resource groups &#8212; see <a href="http://docs.starburst.io/latest/admin/workload-management.html">SEP DOCS</a> for migration guidance.</em></p></div><h3>Designing Effective Resource Groups</h3><p>The principles I apply in practice:</p><ul><li><p>Separate interactive from batch: Interactive queries (dashboards, ad-hoc exploration) have latency expectations measured in seconds. Batch ETL jobs have latency expectations measured in minutes. Give each group enough capacity that interactive queries are never blocked by batch jobs.</p></li><li><p>Allow every user at least one concurrent query: No one should hit a queue waiting to start even a single query during normal operation. Queuing should only occur when a user or team is genuinely overloading their allocation.</p></li><li><p>Group by organisational unit: Teams regulate each other better than administrators can regulate everyone. Let team leads see what their team is running and kill queries in their group.</p></li><li><p>Set conservative initial limits and expand: It is easier to relax a limit that is too tight than to tighten a limit that users have come to expect as available capacity.</p></li></ul><h3>Example Resource Group Structure</h3><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;81cecafe-7d85-4ee0-9de0-4ca2a88d3ea1&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext"># config.properties &#8212; enable built&#8209;in resource groups
resource-groups.configuration-manager=starburst
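# Sketch of the file-based alternative, assuming open source Trino conventions
# (verify file locations for your SEP version). In Trino these two lines live
# in etc/resource-groups.properties rather than config.properties:
#   resource-groups.configuration-manager=file
#   resource-groups.config-file=etc/resource-groups.json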

# Example hierarchy (configure via Web UI, REST API, or resource-groups.json)
# Semantics only, not a literal config file

Global (root group)
  softMemoryLimit: 80%      # Soft memory ceiling (queries can briefly exceed)
  hardConcurrencyLimit: 150 # Total concurrent running queries cluster&#8209;wide
  maxQueued: 1000           # Max queued before rejecting new queries

  Interactive (subgroup)
    softMemoryLimit: 50%
    hardConcurrencyLimit: 75
    maxQueued: 300
    schedulingPolicy: fair     # FIFO queue; subgroups take turns starting queries

  Batch-ETL (subgroup)
    softMemoryLimit: 30%
    hardConcurrencyLimit: 20
    maxQueued: 500
    schedulingPolicy: weighted # Priority&#8209;weighted execution

  System (subgroup for internal jobs)
    softMemoryLimit: 10%
    hardConcurrencyLimit: 10
    maxQueued: 100</code></pre></div><h2>9.4 Autoscaling on Kubernetes</h2><p style="text-align: justify;">When running SEP on Kubernetes (EKS, AKS, GKE), you can configure Kubernetes autoscaling to add worker pods under load and remove them when idle. This is particularly useful for:</p><ul><li><p>Bursty analytical workloads where peak demand is 3&#8211;5x average demand and paying for peak capacity 24/7 is wasteful</p></li><li><p>Cost optimisation when workloads follow business-hours patterns</p></li><li><p>Scaling for large one-off jobs without permanently increasing cluster size</p></li></ul><p style="text-align: justify;">SEP integrates with the Kubernetes Horizontal Pod Autoscaler (HPA) via KEDA (Kubernetes Event-Driven Autoscaling), which allows scaling based on Trino-specific metrics like queue depth or running query count. Configure KEDA integration through the SEP Helm chart values.yaml. We&#8217;ll explore this concept in more detail in a later part of this series.</p><div class="callout-block" data-callout="true"><p style="text-align: justify;"><em>Autoscaling caveat: New worker nodes take time to register with the coordinator and warm up (JIT compilation of hot code paths). For latency-sensitive interactive workloads, always maintain a baseline number of warm workers and scale around that baseline. Scaling from zero is appropriate only for batch workloads where startup latency is acceptable.</em></p></div><h1>Chapter 10: Hive Caching &#8212; What It Does and What It Does Not Do</h1><p style="text-align: justify;">Starburst Enterprise provides several caching layers for Hive connector operations. Caching can help with specific bottlenecks, but each layer has trade-offs that matter in production.</p><h2>10.1 Metastore Metadata Caching</h2><p style="text-align: justify;">The Hive Metastore Service (HMS) is a metadata service, backed by a relational database, that Trino queries at the start of every query to resolve table schemas and list partitions. 
A slow HMS causes queries to sit in the planning phase with CPUs idle.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;606f12be-aeab-4c34-88b4-e8cee9618789&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext"># Cache HMS table and partition metadata in Trino&#8217;s memory
hive.metastore-cache-ttl=10m
hive.metastore-refresh-interval=5m
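# Sketch, assuming a Hive catalog named "hive" (hypothetical name): open source
# Trino provides a procedure to flush this cache on demand, so freshly added
# partitions can be picked up without waiting for the TTL:
#   CALL hive.system.flush_metadata_cache();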
hive.metastore-cache-maximum-size=10000</code></pre></div><div class="callout-block" data-callout="true"><p style="text-align: justify;">Trade-off: cached metadata means new partitions or schema changes may not be visible until the cache expires. For rapidly-changing tables, reduce the TTL or disable caching. For stable tables, a longer TTL improves query planning performance significantly.</p></div><h2>10.2 File Listing Cache</h2><p style="text-align: justify;">Before reading data from HDFS or object storage, Trino must list files in the relevant partition directories. With millions of small files in S3, this can take seconds to minutes.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;ef09f7b7-8de5-4a19-9d46-54abb31be1a8&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext"># Cache file listing results to avoid repeated S3/HDFS list operations
hive.file-status-cache-expire-time=5m
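# Assumption to verify for your SEP version: in open source Trino the listing
# cache only applies to tables named in hive.file-status-cache-tables, and the
# value * covers all tables:
#   hive.file-status-cache-tables=*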
hive.file-status-cache-size=1000000</code></pre></div><div class="callout-block" data-callout="true"><p>Trade-off: new files written to existing partitions will not be visible until the cache expires. In streaming or near-real-time workloads where fresh data matters, reduce the TTL or disable the cache. For batch workloads with daily or hourly partitions, this cache provides significant speedups.</p></div><h2>10.3 Native File System Data Cache</h2><p>SEP&#8217;s native file system cache stores actual file data blocks on Trino worker local disks, enabling repeated access to the same data without hitting object storage each time.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;193713df-52bc-4d58-934b-a567910fe65a&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext"># CURRENT property name 
fs.cache.enabled=true
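# Sketch, assuming the open source Trino file system cache properties apply to
# your SEP version: cap disk usage and entry lifetime explicitly. The values
# below are illustrative, not recommendations.
#   fs.cache.max-sizes=500GB
#   fs.cache.ttl=7d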
# Configure the cache storage path (fast local SSD recommended)
fs.cache.directories=/mnt/cache/trino</code></pre></div><p>Reality check on the native data cache: It helps with repeated access to the same data by the same worker, a pattern that occurs in interactive dashboards where users repeatedly query the same date range. It does not significantly help for:</p><ul><li><p>First-time access to data (cold cache always misses)</p></li><li><p>Queries that access different data ranges on each run (most ETL/batch patterns)</p></li><li><p>Object storage (S3, GCS, ADLS) performance: the cache helps, but network round-trip latency to S3 is already low, so the benefit is less pronounced than for HDFS</p></li></ul><p>In the next part, <strong>Part 3: Query &amp; Workload Tuning</strong>, we&#8217;ll dive into EXPLAIN ANALYZE, Cost-Based Optimization (CBO), and effective data organization strategies.</p><h3>Further Reading</h3><p>The concepts covered in this part are grounded in official Starburst documentation and Trino fundamentals. The following references are recommended for deeper exploration and version-specific details.</p><ul><li><p><strong><a href="https://docs.starburst.io/latest/admin/properties-resource-management.html">Starburst Enterprise Documentation &#8211; Resource Management Properties</a></strong></p></li><li><p><strong><a href="https://docs.starburst.io/latest/installation/deployment.html">Starburst Enterprise Documentation &#8211; Deployment Basics</a></strong></p></li><li><p><strong><a href="https://docs.starburst.io/latest/k8s/sep-config-examples.html">Starburst Enterprise Kubernetes Configuration Examples</a></strong></p></li><li><p><strong><a href="https://docs.starburst.io/latest/release/lts/breaking-changes.html">SEP Breaking Changes (LTS Compilation)</a></strong></p></li><li><p><strong><a href="https://docs.starburst.io/latest/admin/workload-management.html">Starburst Workload Management Documentation</a></strong></p></li><li><p><strong><a 
href="https://www.oreilly.com/library/view/trino-the-definitive/9781098137229/">Trino: The Definitive Guide (2nd Edition), Fuller, Moser, Traverso, O&#8217;Reilly</a></strong></p></li></ul>]]></content:encoded></item><item><title><![CDATA[STARBURST ENTERPRISE PERFORMANCE TUNING — A PRACTITIONER'S SERIES]]></title><description><![CDATA[Part 1: Foundations &#8212; Architecture, Naming & Cluster Sizing]]></description><link>https://tshasankda.substack.com/p/starburst-enterprise-performance</link><guid isPermaLink="false">https://tshasankda.substack.com/p/starburst-enterprise-performance</guid><dc:creator><![CDATA[Tharashasank Davuluru]]></dc:creator><pubDate>Wed, 01 Apr 2026 19:55:03 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/9d512fff-e4f6-4625-99ee-80d41a5d57f6_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="pullquote"><p style="text-align: justify;">What sparked this series: A junior engineer on our team noticed that RAM utilization on our Starburst cluster was consistently above 80%, with all 25 worker nodes running at full capacity. Rather than raising an incident, I realised this was a teaching moment about how MPP query engines actually work, and it became the foundation for this entire series of posts.</p></div><p style="text-align: center;"><strong>About This Series</strong></p><p style="text-align: justify;">This is a three-part practitioner series on Starburst Enterprise performance tuning. Each part is standalone, but they build on one another. 
The series covers what I have learned deploying and tuning Starburst Enterprise in production, cross-referenced against official Starburst documentation (SEP 477-e LTS, November 2025) and the Presto/Trino Training Series.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zGbX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6f9b267-a871-41b1-a11a-e2f954a5fdda_954x310.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zGbX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6f9b267-a871-41b1-a11a-e2f954a5fdda_954x310.png 424w, https://substackcdn.com/image/fetch/$s_!zGbX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6f9b267-a871-41b1-a11a-e2f954a5fdda_954x310.png 848w, https://substackcdn.com/image/fetch/$s_!zGbX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6f9b267-a871-41b1-a11a-e2f954a5fdda_954x310.png 1272w, https://substackcdn.com/image/fetch/$s_!zGbX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6f9b267-a871-41b1-a11a-e2f954a5fdda_954x310.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zGbX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6f9b267-a871-41b1-a11a-e2f954a5fdda_954x310.png" width="954" height="310" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f6f9b267-a871-41b1-a11a-e2f954a5fdda_954x310.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:310,&quot;width&quot;:954,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:49202,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://tshasankda.substack.com/i/192879054?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6f9b267-a871-41b1-a11a-e2f954a5fdda_954x310.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zGbX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6f9b267-a871-41b1-a11a-e2f954a5fdda_954x310.png 424w, https://substackcdn.com/image/fetch/$s_!zGbX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6f9b267-a871-41b1-a11a-e2f954a5fdda_954x310.png 848w, https://substackcdn.com/image/fetch/$s_!zGbX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6f9b267-a871-41b1-a11a-e2f954a5fdda_954x310.png 1272w, https://substackcdn.com/image/fetch/$s_!zGbX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6f9b267-a871-41b1-a11a-e2f954a5fdda_954x310.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><div class="pullquote"><p style="text-align: justify;"><strong>Note:</strong> Everything in this series is based on my own production experience, combined with learnings from official Starburst and Trino documentation, along with other trusted resources. Wherever specific sources have influenced the content, I&#8217;ve credited them explicitly. This is not a copy of any documentation; it&#8217;s my own interpretation, synthesis, and practical understanding of how these systems behave in real-world scenarios. If you notice anything that seems off or outdated, I genuinely welcome the feedback, as this space evolves quickly and there&#8217;s always more to learn. I really enjoyed putting this series together from start to finish. 
After completing the writing, I used ChatGPT to help refine the grammar and improve readability, while keeping my original thoughts and intent intact.</p></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yM5w!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeb745ae-f2cc-45f6-8f74-f207320884ef_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yM5w!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeb745ae-f2cc-45f6-8f74-f207320884ef_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!yM5w!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeb745ae-f2cc-45f6-8f74-f207320884ef_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!yM5w!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeb745ae-f2cc-45f6-8f74-f207320884ef_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!yM5w!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeb745ae-f2cc-45f6-8f74-f207320884ef_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yM5w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeb745ae-f2cc-45f6-8f74-f207320884ef_1536x1024.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/beb745ae-f2cc-45f6-8f74-f207320884ef_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2368142,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://tshasankda.substack.com/i/192879054?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeb745ae-f2cc-45f6-8f74-f207320884ef_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yM5w!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeb745ae-f2cc-45f6-8f74-f207320884ef_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!yM5w!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeb745ae-f2cc-45f6-8f74-f207320884ef_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!yM5w!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeb745ae-f2cc-45f6-8f74-f207320884ef_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!yM5w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeb745ae-f2cc-45f6-8f74-f207320884ef_1536x1024.png 1456w" sizes="100vw"></picture></div></a></figure></div><h1>Chapter 1: The Naming Story &#8212; Presto, PrestoSQL, and Trino</h1><p style="text-align: justify;">Before any performance discussion, you need to understand the naming landscape. If you have read documentation, training slides, blog posts, and academic papers about this technology, you will have encountered &#8216;Presto&#8217;, &#8216;PrestoSQL&#8217;, &#8216;PrestoDB&#8217;, and &#8216;Trino&#8217; used in ways that seem interchangeable but are not. This causes real confusion, so let me set the record straight.</p><h2>1.1 Origins at Facebook (2012&#8211;2018)</h2><p style="text-align: justify;">Presto was created at Facebook in 2012 by four engineers, <strong>Martin Traverso, Dain Sundstrom, David Phillips, and Eric Hwang</strong>, as a replacement for Apache Hive. 
The problem they were solving was straightforward: Hive was too slow for the interactive analytics Facebook needed over its multi-petabyte Hadoop data warehouse. Facebook open-sourced the project in November 2013. Inside Facebook the project was called PrestoDB to distinguish it from unrelated projects.</p><h2>1.2 The 2018&#8211;2019 Governance Disagreement and Fork</h2><p style="text-align: justify;">In late 2018, Facebook made a governance decision that the original creators considered incompatible with healthy open source community management: granting Facebook engineers commit access without the standard community merit process. Three of the four creators (Traverso, Sundstrom, and Phillips) left Facebook and in January 2019 forked the codebase, creating the Presto Software Foundation as an independent non-profit. Their fork was initially named PrestoSQL: the same software, same team, different governance.</p><h2>1.3 The December 2020 Rename to Trino</h2><p style="text-align: justify;">In September 2019, Facebook transferred PrestoDB to The Linux Foundation, which also applied for and received a trademark on the name &#8216;Presto&#8217;. After extended negotiations between the communities, the PrestoSQL team was forced to rename. On December 27, 2020, PrestoSQL was officially renamed Trino. The community, codebase, and leadership remained identical; only the name changed. 
Since the fork, Trino has seen approximately three times the development velocity of PrestoDB.</p><h2>1.4 Where Starburst Fits</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!B9QI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78a2849c-eb6e-4f43-82f2-aee5d2187f15_858x466.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!B9QI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78a2849c-eb6e-4f43-82f2-aee5d2187f15_858x466.png 424w, https://substackcdn.com/image/fetch/$s_!B9QI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78a2849c-eb6e-4f43-82f2-aee5d2187f15_858x466.png 848w, https://substackcdn.com/image/fetch/$s_!B9QI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78a2849c-eb6e-4f43-82f2-aee5d2187f15_858x466.png 1272w, https://substackcdn.com/image/fetch/$s_!B9QI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78a2849c-eb6e-4f43-82f2-aee5d2187f15_858x466.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!B9QI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78a2849c-eb6e-4f43-82f2-aee5d2187f15_858x466.png" width="858" height="466" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/78a2849c-eb6e-4f43-82f2-aee5d2187f15_858x466.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:466,&quot;width&quot;:858,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:56526,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://tshasankda.substack.com/i/192879054?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78a2849c-eb6e-4f43-82f2-aee5d2187f15_858x466.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!B9QI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78a2849c-eb6e-4f43-82f2-aee5d2187f15_858x466.png 424w, https://substackcdn.com/image/fetch/$s_!B9QI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78a2849c-eb6e-4f43-82f2-aee5d2187f15_858x466.png 848w, https://substackcdn.com/image/fetch/$s_!B9QI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78a2849c-eb6e-4f43-82f2-aee5d2187f15_858x466.png 1272w, https://substackcdn.com/image/fetch/$s_!B9QI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78a2849c-eb6e-4f43-82f2-aee5d2187f15_858x466.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p style="text-align: justify;"><em>Note: When you see &#8216;Presto&#8217; in Starburst training materials from 2020 (including the training series that informs much of this guide), it refers to what is now called Trino: the same engine, same original authors, different name. The architecture, concepts, and configuration principles documented in those 2020 sessions remain fully valid for current Trino/Starburst Enterprise.</em></p><h2>1.5 Why Starburst Documentation Still References &#8216;Presto&#8217;</h2><p style="text-align: justify;">You will still encounter &#8216;Presto&#8217; references in Starburst&#8217;s own documentation, especially in older blog posts, some connector code, and historical training materials. 
This is because:</p><ul><li><p style="text-align: justify;">Content written before December 2020 legitimately referred to the engine as Presto (PrestoSQL).</p></li><li><p style="text-align: justify;">The academic research paper &#8216;Presto: SQL on Everything&#8217; (published by Facebook engineers, 2019) predates the rename and is still cited by its original title.</p></li><li><p style="text-align: justify;">Some third-party tools use the Presto JDBC driver or Presto SQL parser for compatibility; the SQL dialects of Trino and PrestoDB remain largely compatible.</p></li><li><p style="text-align: justify;">Some GCP marketplace scripts and older Helm chart references still show &#8216;starburst-enterprise-presto&#8217; in artifact names, but this is a naming legacy, not a different product.</p></li></ul><p style="text-align: justify;">In this series: &#8216;Trino&#8217; means the open-source engine. &#8216;SEP&#8217; or &#8216;Starburst Enterprise&#8217; means the commercial product. &#8216;Presto&#8217; in quoted training materials means Trino as it was known before December 2020.</p><h1>Chapter 2: MPP Architecture &#8212; How the Engine Actually Works</h1><p style="text-align: justify;">The most important step in understanding Starburst performance is understanding what kind of system you are actually dealing with. Starburst Enterprise is not a database, not a microservice, and not a batch processing framework. It is a Massively Parallel Processing (MPP) federated SQL query engine, and that distinction matters enormously for how you interpret metrics.</p><h2>2.1 What MPP Actually Means</h2><p style="text-align: justify;">Massively Parallel Processing means that a single SQL query is executed simultaneously across every available worker node in the cluster. 
This is the defining characteristic:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eisx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd70d82d-e81c-496b-9217-bc080f1b10d6_843x454.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eisx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd70d82d-e81c-496b-9217-bc080f1b10d6_843x454.png 424w, https://substackcdn.com/image/fetch/$s_!eisx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd70d82d-e81c-496b-9217-bc080f1b10d6_843x454.png 848w, https://substackcdn.com/image/fetch/$s_!eisx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd70d82d-e81c-496b-9217-bc080f1b10d6_843x454.png 1272w, https://substackcdn.com/image/fetch/$s_!eisx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd70d82d-e81c-496b-9217-bc080f1b10d6_843x454.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eisx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd70d82d-e81c-496b-9217-bc080f1b10d6_843x454.png" width="843" height="454" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cd70d82d-e81c-496b-9217-bc080f1b10d6_843x454.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:454,&quot;width&quot;:843,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:73833,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://tshasankda.substack.com/i/192879054?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd70d82d-e81c-496b-9217-bc080f1b10d6_843x454.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eisx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd70d82d-e81c-496b-9217-bc080f1b10d6_843x454.png 424w, https://substackcdn.com/image/fetch/$s_!eisx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd70d82d-e81c-496b-9217-bc080f1b10d6_843x454.png 848w, https://substackcdn.com/image/fetch/$s_!eisx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd70d82d-e81c-496b-9217-bc080f1b10d6_843x454.png 1272w, https://substackcdn.com/image/fetch/$s_!eisx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd70d82d-e81c-496b-9217-bc080f1b10d6_843x454.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p style="text-align: justify;">This is the single most important fact to internalise from this series. When a junior engineer sees 80% RAM across 25 workers and all CPUs busy, the correct interpretation is: the cluster is working exactly as designed. The questions to ask are not &#8216;why is utilization high?&#8217; but rather &#8216;are queries completing successfully?&#8217; and &#8216;are query latencies within acceptable ranges?&#8217;</p><h2>2.2 Coordinator and Worker Architecture</h2><p style="text-align: justify;">A Trino/SEP cluster has two node types. The official Starburst documentation describes the coordinator as &#8216;the brain of a Trino installation&#8217;. 
Here is what each does in practice:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Fi5i!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb725aab1-3671-4f09-b23f-1bc4e2754185_861x522.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Fi5i!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb725aab1-3671-4f09-b23f-1bc4e2754185_861x522.png 424w, https://substackcdn.com/image/fetch/$s_!Fi5i!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb725aab1-3671-4f09-b23f-1bc4e2754185_861x522.png 848w, https://substackcdn.com/image/fetch/$s_!Fi5i!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb725aab1-3671-4f09-b23f-1bc4e2754185_861x522.png 1272w, https://substackcdn.com/image/fetch/$s_!Fi5i!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb725aab1-3671-4f09-b23f-1bc4e2754185_861x522.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Fi5i!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb725aab1-3671-4f09-b23f-1bc4e2754185_861x522.png" width="861" height="522" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b725aab1-3671-4f09-b23f-1bc4e2754185_861x522.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:522,&quot;width&quot;:861,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:100963,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://tshasankda.substack.com/i/192879054?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb725aab1-3671-4f09-b23f-1bc4e2754185_861x522.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Fi5i!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb725aab1-3671-4f09-b23f-1bc4e2754185_861x522.png 424w, https://substackcdn.com/image/fetch/$s_!Fi5i!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb725aab1-3671-4f09-b23f-1bc4e2754185_861x522.png 848w, https://substackcdn.com/image/fetch/$s_!Fi5i!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb725aab1-3671-4f09-b23f-1bc4e2754185_861x522.png 1272w, https://substackcdn.com/image/fetch/$s_!Fi5i!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb725aab1-3671-4f09-b23f-1bc4e2754185_861x522.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!plX8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf466175-674b-49e6-a736-be27bcfeaded_889x154.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!plX8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf466175-674b-49e6-a736-be27bcfeaded_889x154.png 424w, https://substackcdn.com/image/fetch/$s_!plX8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf466175-674b-49e6-a736-be27bcfeaded_889x154.png 848w, 
https://substackcdn.com/image/fetch/$s_!plX8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf466175-674b-49e6-a736-be27bcfeaded_889x154.png 1272w, https://substackcdn.com/image/fetch/$s_!plX8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf466175-674b-49e6-a736-be27bcfeaded_889x154.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!plX8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf466175-674b-49e6-a736-be27bcfeaded_889x154.png" width="889" height="154" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/df466175-674b-49e6-a736-be27bcfeaded_889x154.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:154,&quot;width&quot;:889,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:36506,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://tshasankda.substack.com/i/192879054?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf466175-674b-49e6-a736-be27bcfeaded_889x154.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!plX8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf466175-674b-49e6-a736-be27bcfeaded_889x154.png 424w, https://substackcdn.com/image/fetch/$s_!plX8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf466175-674b-49e6-a736-be27bcfeaded_889x154.png 
848w, https://substackcdn.com/image/fetch/$s_!plX8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf466175-674b-49e6-a736-be27bcfeaded_889x154.png 1272w, https://substackcdn.com/image/fetch/$s_!plX8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf466175-674b-49e6-a736-be27bcfeaded_889x154.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h2>2.3 The Split &#8212; Trino&#8217;s Unit of Parallelism</h2><p style="text-align: justify;">The smallest unit of parallel work in Trino is called a split. For the Hive and Iceberg connectors reading from object storage, a split typically corresponds to one file or a portion of a large file.</p><p style="text-align: justify;">The number of splits available determines the maximum parallelism achievable for a scan:</p><ul><li><p style="text-align: justify;">1,000 splits across 25 workers &#215; 32 threads each (800 threads) = every thread stays busy and the scan finishes in roughly two waves.</p></li><li><p style="text-align: justify;">10 splits across 25 workers = most workers sit idle during the scan stage regardless of cluster size.</p></li><li><p style="text-align: justify;">More splits are better up to the limit of available worker threads; beyond that, you get scheduling overhead without benefit.</p></li></ul><blockquote><p style="text-align: justify;"><em>The implication for file management: Files smaller than roughly 8MB create very small splits with high scheduling overhead. Well-sized files (128MB to 1GB) produce appropriately sized splits that keep all worker threads busy. This is why file compaction is one of the highest-leverage optimisations in Part 3 of this series.</em></p></blockquote><h2>2.4 Query Lifecycle &#8212; SQL to Results</h2><p style="text-align: justify;">Understanding the full execution path helps identify where slowdowns occur. 
A query travels through these stages:</p><p style="text-align: justify;">1. The client submits SQL via HTTP POST to the Coordinator&#8217;s REST endpoint.</p><p style="text-align: justify;">2. The Coordinator parses the SQL and resolves table and column names against the Hive Metastore (HMS) or Iceberg catalog; this is where HMS slowness manifests.</p><p style="text-align: justify;">3. The Cost-Based Optimizer (CBO) reads table statistics and builds a logical query plan, choosing join order, join distribution type, and aggregation strategy.</p><p style="text-align: justify;">4. The physical planner converts the logical plan into a distributed execution plan composed of stages; each stage is a set of tasks that can run on multiple workers in parallel.</p><p style="text-align: justify;">5. The scheduler assigns available splits to workers and begins stage execution. Stages are pipelined: a worker does not wait for stage N to complete fully before starting stage N+1.</p><p style="text-align: justify;">6. Workers exchange intermediate data with each other via HTTP shuffle. This is the &#8216;network shuffle&#8217; step that redistributes rows for GROUP BY keys and JOIN keys across workers.</p><p style="text-align: justify;">7. The final output stage streams results back to the Coordinator, which returns them to the client.</p><p style="text-align: justify;">By default, Trino does NOT write intermediate results to disk between stages. All data flows through memory. This is what makes it fast, and why memory management is so consequential.</p><h2>2.5 The Shared Hash Table &#8212; Why More CPU Helps Memory-Heavy Queries</h2><p style="text-align: justify;">During a JOIN operation, Trino builds a hash table from the build-side table (typically the smaller one) and streams the probe-side table through it, looking up each row. 
This hash table is built once on each worker and shared across all probe threads on that worker; it is not duplicated per thread.</p><p style="text-align: justify;">The implication: adding more CPU cores to a JOIN-heavy query allows the probe phase to run faster, completing the query sooner and freeing the hash table memory. This is why high CPU utilization during JOIN operations is a sign the cluster is working efficiently, not a sign of resource waste.</p><h1>Chapter 3: Cluster Sizing &#8212; The Right Foundation</h1><p style="text-align: justify;">Before any tuning is worth doing, the cluster needs to be correctly sized. Tuning a cluster that lacks sufficient headroom just moves the bottleneck around without solving it. Both the guidance from the 2020 Presto Training Series (Dain Sundstrom, Starburst) and production experience point to the same starting strategy: start larger than you think you need, verify stability, then rightsize.</p><h2>3.1 General Sizing Strategy</h2><p style="text-align: justify;">1. Build a cluster that has more capacity than your initial workload estimate. You need headroom to differentiate configuration problems from resource problems.</p><p style="text-align: justify;">2. Verify stability first: make sure queries complete without OOM errors and cluster behaviour is predictable.</p><p style="text-align: justify;">3. Do NOT try to stabilize and performance-tune simultaneously; they require different diagnostic approaches.</p><p style="text-align: justify;">4. Measure the actual workload over time: peak memory per query, concurrent query count, CPU utilisation patterns.</p><p style="text-align: justify;">5. Then rightsize downward: reduce nodes or machine size until utilisation targets are met.</p><blockquote><p style="text-align: justify;"><em>The most common failure pattern I&#8217;ve seen in Starburst deployments: teams provision a small cluster and immediately try to optimize it. 
Without headroom, you cannot tell whether slow queries are caused by configuration problems or by the cluster simply being too small. Always start bigger than you need.</em></p></blockquote><h2 style="text-align: justify;">3.2 CPU vs. Memory &#8212; Two Independent Dimensions</h2><p style="text-align: justify;">Trino has two independently scaling resource dimensions that behave very differently and require different responses when they become bottlenecks:</p><h3>CPU &#8212; Controls Query Speed</h3><ul><li><p style="text-align: justify;">More CPU cores &#8594; queries run faster. The relationship is approximately linear: doubling CPU roughly halves query execution time.</p></li><li><p style="text-align: justify;">CPU does NOT determine whether a query can run, only how fast it completes.</p></li><li><p style="text-align: justify;">Running two identical queries concurrently with the same total CPU means each takes roughly 2x longer.</p></li><li><p style="text-align: justify;">If queries are slow but completing without errors: the first thing to consider is whether you are CPU-constrained.</p></li></ul><h3>Memory &#8212; Controls Query Viability and Concurrency</h3><ul><li><p style="text-align: justify;">Memory determines whether a query CAN run, not how fast it runs.</p></li><li><p style="text-align: justify;">If a query requires 50GB of distributed hash table memory and your cluster only has 40GB available, the query fails; there is no &#8216;slow but succeeds&#8217; for memory.</p></li><li><p style="text-align: justify;">Memory is consumed by: JOIN hash tables, GROUP BY aggregation accumulators, ORDER BY sort buffers, window function frame data.</p></li><li><p style="text-align: justify;">More memory increases the number of concurrent queries you can run and enables larger individual queries.</p></li><li><p style="text-align: justify;">If queries are failing with EXCEEDED_LOCAL_MEMORY_LIMIT or EXCEEDED_GLOBAL_MEMORY_LIMIT errors: this is a memory problem, not a CPU 
problem</p></li></ul><h2>3.3 Fewer Bigger Machines vs. Many Smaller Machines</h2><p style="text-align: justify;">This is one of the most consequential sizing decisions. The guidance from both the Presto Training Series and production experience is clear: fewer, larger machines outperform many smaller ones for Trino workloads.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!40FT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F928dbd97-3dc0-4c84-b86d-d7bd995cdc2b_945x682.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!40FT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F928dbd97-3dc0-4c84-b86d-d7bd995cdc2b_945x682.png 424w, https://substackcdn.com/image/fetch/$s_!40FT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F928dbd97-3dc0-4c84-b86d-d7bd995cdc2b_945x682.png 848w, https://substackcdn.com/image/fetch/$s_!40FT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F928dbd97-3dc0-4c84-b86d-d7bd995cdc2b_945x682.png 1272w, https://substackcdn.com/image/fetch/$s_!40FT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F928dbd97-3dc0-4c84-b86d-d7bd995cdc2b_945x682.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!40FT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F928dbd97-3dc0-4c84-b86d-d7bd995cdc2b_945x682.png" width="945" height="682" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/928dbd97-3dc0-4c84-b86d-d7bd995cdc2b_945x682.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:682,&quot;width&quot;:945,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:136584,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://tshasankda.substack.com/i/192879054?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F928dbd97-3dc0-4c84-b86d-d7bd995cdc2b_945x682.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!40FT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F928dbd97-3dc0-4c84-b86d-d7bd995cdc2b_945x682.png 424w, https://substackcdn.com/image/fetch/$s_!40FT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F928dbd97-3dc0-4c84-b86d-d7bd995cdc2b_945x682.png 848w, https://substackcdn.com/image/fetch/$s_!40FT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F928dbd97-3dc0-4c84-b86d-d7bd995cdc2b_945x682.png 1272w, https://substackcdn.com/image/fetch/$s_!40FT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F928dbd97-3dc0-4c84-b86d-d7bd995cdc2b_945x682.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="pullquote"><p style="text-align: justify;"><em><strong>Production recommendation from experience</strong>: If choosing between 50 nodes with 64GB RAM each vs 25 nodes with 128GB RAM each for the same total cluster memory, choose the 25 larger nodes every time. The reduction in scheduling complexity, skew impact, and network overhead is significant and compounds over time.</em></p></div><h2>3.4 Sizing for Concurrent Workloads</h2><p>The hardest sizing question is not &#8216;how much does one query need?&#8217; but &#8216;what is the peak concurrent demand?&#8217;. 
Some things I have learned:</p><ul><li><p style="text-align: justify;">Memory concurrency is additive: if query A needs 30GB cluster-wide and query B simultaneously needs 40GB, your cluster needs at least 70GB of query-accessible memory to run both at the same time.</p></li><li><p style="text-align: justify;">CPU concurrency is time-sliced: the scheduler shares CPU time across concurrent queries. Adding queries increases latency for all queries rather than causing outright failures.</p></li><li><p style="text-align: justify;">Peak concurrency, not average, determines minimum cluster size: size for your realistic P95 concurrent query memory demand.</p></li><li><p style="text-align: justify;">Bigger clusters see a mix of query sizes that statistically even out. Small clusters are more vulnerable to a single heavy query consuming most available resources and causing other queries to queue.</p></li></ul><h2>3.5 Plan for Growth</h2><p style="text-align: justify;">Every Starburst deployment I have been involved with saw demand grow faster than expected. The platform is easy to use for SQL-familiar users, fast results encourage more queries, and federated access enables use cases teams never previously attempted. It is not unusual for query volume to double or even triple year over year once the platform is well adopted. Build this into your sizing assumptions and your budget.</p><h1>Chapter 4: Instance Type Selection &#8212; All Three Clouds</h1><p style="text-align: justify;">Starburst Enterprise is cloud-agnostic and fully supported on AWS, Azure, and GCP. You can deploy via native Kubernetes (EKS, AKS, GKE), directly on virtual machines using Starburst Admin, or via the respective cloud marketplaces. 
The right instance type for Trino/SEP workloads shares the same characteristics regardless of cloud: high memory-to-core ratio, high network bandwidth, and modern CPU platforms.</p><p style="text-align: justify;">The guidance below is based on my experience selecting instance types for SEP workloads and on published instance specifications from each cloud provider. As cloud instance pricing and availability change frequently, always verify current pricing and availability in your region before committing.</p><h2>4.1 General Principles Across All Clouds</h2><ul><li><p style="text-align: justify;">Memory-optimized families are the first choice for JOIN-heavy and aggregation-heavy analytical workloads; they give the highest memory per vCPU.</p></li><li><p style="text-align: justify;">General-purpose balanced families work well when workloads are more scan-heavy or when memory per query is moderate.</p></li><li><p style="text-align: justify;">Compute-optimized families are appropriate when workloads are CPU-bound with relatively small hash table sizes.</p></li><li><p style="text-align: justify;">Network bandwidth matters: Starburst recommends a minimum of 10 Gbps within the cluster; 25 Gbps is recommended for object storage access.</p></li><li><p style="text-align: justify;">ARM-based instances (Graviton on AWS, Ampere Altra on GCP and Azure) can offer meaningful cost savings for the same workload and are worth evaluating for production after benchmarking.</p></li></ul><h2>4.2 Amazon Web Services (AWS)</h2><p style="text-align: justify;">The Presto Training Series (Starburst, 2020) originally provided AWS-centric guidance because that was the dominant cloud for Starburst workloads at the time. 
The recommendations remain relevant, though newer instance generations (m6i, r6i, m7i, r7i) offer better price-performance than the m5/r5 series shown in the original 2020 training.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!17k2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a6b85f-5675-41c4-88ba-5b647086034f_843x748.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!17k2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a6b85f-5675-41c4-88ba-5b647086034f_843x748.png 424w, https://substackcdn.com/image/fetch/$s_!17k2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a6b85f-5675-41c4-88ba-5b647086034f_843x748.png 848w, https://substackcdn.com/image/fetch/$s_!17k2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a6b85f-5675-41c4-88ba-5b647086034f_843x748.png 1272w, https://substackcdn.com/image/fetch/$s_!17k2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a6b85f-5675-41c4-88ba-5b647086034f_843x748.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!17k2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a6b85f-5675-41c4-88ba-5b647086034f_843x748.png" width="843" height="748" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/39a6b85f-5675-41c4-88ba-5b647086034f_843x748.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:748,&quot;width&quot;:843,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:104018,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://tshasankda.substack.com/i/192879054?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a6b85f-5675-41c4-88ba-5b647086034f_843x748.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!17k2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a6b85f-5675-41c4-88ba-5b647086034f_843x748.png 424w, https://substackcdn.com/image/fetch/$s_!17k2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a6b85f-5675-41c4-88ba-5b647086034f_843x748.png 848w, https://substackcdn.com/image/fetch/$s_!17k2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a6b85f-5675-41c4-88ba-5b647086034f_843x748.png 1272w, https://substackcdn.com/image/fetch/$s_!17k2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39a6b85f-5675-41c4-88ba-5b647086034f_843x748.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>AWS note on vCPU vs physical cores: most x86-based AWS instances use hyperthreading (2 threads per core), so a 64-vCPU instance has 32 physical cores. Trino&#8217;s task.concurrency should be set to roughly 2x the physical core count, so 64 for a 32-core machine. On Graviton instances, each vCPU is a full physical core, so task.concurrency can be set to the vCPU count directly.</em></p><h2>4.3 Microsoft Azure</h2><p>SEP is supported on Azure Kubernetes Service (AKS) and directly on Azure Virtual Machines via Starburst Admin. 
Azure&#8217;s instance families that match the workload profile for Trino are primarily in the E-series (memory-optimised) and D-series (general-purpose balanced).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!T4-Q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2742fa2-e1aa-44e1-ac49-36a8b4fb014e_838x880.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!T4-Q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2742fa2-e1aa-44e1-ac49-36a8b4fb014e_838x880.png 424w, https://substackcdn.com/image/fetch/$s_!T4-Q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2742fa2-e1aa-44e1-ac49-36a8b4fb014e_838x880.png 848w, https://substackcdn.com/image/fetch/$s_!T4-Q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2742fa2-e1aa-44e1-ac49-36a8b4fb014e_838x880.png 1272w, https://substackcdn.com/image/fetch/$s_!T4-Q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2742fa2-e1aa-44e1-ac49-36a8b4fb014e_838x880.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!T4-Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2742fa2-e1aa-44e1-ac49-36a8b4fb014e_838x880.png" width="838" height="880" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a2742fa2-e1aa-44e1-ac49-36a8b4fb014e_838x880.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:880,&quot;width&quot;:838,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:131663,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://tshasankda.substack.com/i/192879054?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2742fa2-e1aa-44e1-ac49-36a8b4fb014e_838x880.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!T4-Q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2742fa2-e1aa-44e1-ac49-36a8b4fb014e_838x880.png 424w, https://substackcdn.com/image/fetch/$s_!T4-Q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2742fa2-e1aa-44e1-ac49-36a8b4fb014e_838x880.png 848w, https://substackcdn.com/image/fetch/$s_!T4-Q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2742fa2-e1aa-44e1-ac49-36a8b4fb014e_838x880.png 1272w, https://substackcdn.com/image/fetch/$s_!T4-Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2742fa2-e1aa-44e1-ac49-36a8b4fb014e_838x880.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Azure-specific note: Azure&#8217;s E-series Esv5 instances (without &#8216;d&#8217; in the name) do not have local temporary storage. The Edsv5 variants include local NVMe SSD. For SEP deployments where you might enable spilling (which I recommend against &#8212; see Part 2), you need the &#8216;d&#8217; variants with local storage. For standard SEP use, either works, though local storage is useful for HMS and other auxiliary services.</em></p><p>Azure-specific considerations for SEP deployments:</p><ul><li><p style="text-align: justify;">Azure Data Lake Storage Gen2 (ADLS Gen2): SEP supports ADLS Gen2 natively as object storage via the Hive and Iceberg connectors. 
Use fs.native-azure.enabled=true in the catalog configuration to select the current native file system implementation.</p></li><li><p style="text-align: justify;">Azure Kubernetes Service (AKS): The primary deployment mechanism for SEP on Azure. Use node pools with the instance types above, dedicated to the SEP coordinator and workers.</p></li><li><p style="text-align: justify;">AKS Accelerated Networking: Ensure Accelerated Networking is enabled on all SEP node pool VMs; this is critical for achieving the advertised network bandwidth figures.</p></li><li><p style="text-align: justify;">Azure Marketplace: SEP is available directly from the Azure Marketplace with pay-as-you-go (PAYG) licensing.</p></li></ul><h2>4.4 Google Cloud Platform (GCP)</h2><p style="text-align: justify;">SEP is supported on Google Kubernetes Engine (GKE) and Google Compute Engine VMs. The official Starburst documentation lists GKE as the primary and fully certified deployment method on GCP. GCP&#8217;s machine naming convention differs from that of AWS and Azure: machines are described as &#8216;machine series&#8217; (e.g., n2) + &#8216;machine type suffix&#8217; (e.g., standard, highmem, highcpu).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!G4vB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4229ef59-21f7-4d9b-b458-a374ae22d439_837x916.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!G4vB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4229ef59-21f7-4d9b-b458-a374ae22d439_837x916.png 424w, 
https://substackcdn.com/image/fetch/$s_!G4vB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4229ef59-21f7-4d9b-b458-a374ae22d439_837x916.png 848w, https://substackcdn.com/image/fetch/$s_!G4vB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4229ef59-21f7-4d9b-b458-a374ae22d439_837x916.png 1272w, https://substackcdn.com/image/fetch/$s_!G4vB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4229ef59-21f7-4d9b-b458-a374ae22d439_837x916.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!G4vB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4229ef59-21f7-4d9b-b458-a374ae22d439_837x916.png" width="837" height="916" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4229ef59-21f7-4d9b-b458-a374ae22d439_837x916.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:916,&quot;width&quot;:837,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:147170,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://tshasankda.substack.com/i/192879054?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4229ef59-21f7-4d9b-b458-a374ae22d439_837x916.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!G4vB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4229ef59-21f7-4d9b-b458-a374ae22d439_837x916.png 
424w, https://substackcdn.com/image/fetch/$s_!G4vB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4229ef59-21f7-4d9b-b458-a374ae22d439_837x916.png 848w, https://substackcdn.com/image/fetch/$s_!G4vB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4229ef59-21f7-4d9b-b458-a374ae22d439_837x916.png 1272w, https://substackcdn.com/image/fetch/$s_!G4vB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4229ef59-21f7-4d9b-b458-a374ae22d439_837x916.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="pullquote"><p style="text-align: justify;"><em>GCP-specific note on vCPUs: Like x86 instances on AWS, GCP&#8217;s standard (Intel/AMD) instances use hyperthreading, so each vCPU is one hardware thread rather than a full core. The Tau T2A (ARM/Ampere) is the exception: SMT (hyperthreading) is disabled, so each vCPU represents a full physical core. This matters for setting task.concurrency in Trino config (discussed in Part 2).</em></p></div><p style="text-align: justify;">GCP-specific considerations for SEP deployments:</p><ul><li><p style="text-align: justify;">Google Cloud Storage (GCS): SEP reads from GCS via the native file system implementation. Use fs.native-gcs.enabled=true in your catalog configuration. The legacy GCS support is deprecated and will be removed in a future SEP version.</p></li><li><p style="text-align: justify;">GKE Standard clusters: Recommended over Autopilot for SEP deployments that need precise control over node pool configuration and affinity rules.</p></li><li><p style="text-align: justify;">Dataproc Metastore: Can serve as the Hive Metastore Service (HMS) for SEP catalogs on GCP, which eliminates the operational overhead of managing your own HMS.</p></li><li><p style="text-align: justify;">Sustained Use Discounts: GCP automatically applies sustained use discounts for N2 and N2D instances that run for most of the month; no upfront commitment is required.</p></li><li><p style="text-align: justify;">Committed Use Discounts (CUDs): Available for N2 and N2D, with up to 55% savings for 3-year commitments.</p></li></ul><h2>4.5 Multi-Cloud Comparison &#8212; Equivalent Instance Families</h2><p>For teams operating across multiple clouds or evaluating which cloud to use for a new SEP deployment, the table below maps roughly equivalent instance families across all three:</p><div class="captioned-image-container"><figure><a class="image-link image2 
is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Fbtt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F233ad0cb-818f-4eba-8f3e-7a5277d3699d_940x759.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Fbtt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F233ad0cb-818f-4eba-8f3e-7a5277d3699d_940x759.png 424w, https://substackcdn.com/image/fetch/$s_!Fbtt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F233ad0cb-818f-4eba-8f3e-7a5277d3699d_940x759.png 848w, https://substackcdn.com/image/fetch/$s_!Fbtt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F233ad0cb-818f-4eba-8f3e-7a5277d3699d_940x759.png 1272w, https://substackcdn.com/image/fetch/$s_!Fbtt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F233ad0cb-818f-4eba-8f3e-7a5277d3699d_940x759.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Fbtt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F233ad0cb-818f-4eba-8f3e-7a5277d3699d_940x759.png" width="940" height="759" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/233ad0cb-818f-4eba-8f3e-7a5277d3699d_940x759.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:759,&quot;width&quot;:940,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:118072,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://tshasankda.substack.com/i/192879054?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F233ad0cb-818f-4eba-8f3e-7a5277d3699d_940x759.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Fbtt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F233ad0cb-818f-4eba-8f3e-7a5277d3699d_940x759.png 424w, https://substackcdn.com/image/fetch/$s_!Fbtt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F233ad0cb-818f-4eba-8f3e-7a5277d3699d_940x759.png 848w, https://substackcdn.com/image/fetch/$s_!Fbtt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F233ad0cb-818f-4eba-8f3e-7a5277d3699d_940x759.png 1272w, https://substackcdn.com/image/fetch/$s_!Fbtt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F233ad0cb-818f-4eba-8f3e-7a5277d3699d_940x759.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="pullquote"><p style="text-align: justify;"><em><strong>Important caveat on instance comparisons</strong>: Cloud pricing and availability change frequently. The instances listed above represent the general family recommendations as of early 2026. Always check current pricing in your specific region before making deployment decisions. For production workloads, benchmark your actual query mix on candidate instance types before committing &#8212; paper specs do not always translate to real-world Trino performance.</em></p></div><h2>4.6 Kubernetes vs. VM Deployments</h2><p>SEP supports both Kubernetes (K8s) and VM-based deployments. 
From a sizing perspective:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RJcz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e02f8d5-a3a2-4f57-8e3b-a09fb705ba93_937x447.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RJcz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e02f8d5-a3a2-4f57-8e3b-a09fb705ba93_937x447.png 424w, https://substackcdn.com/image/fetch/$s_!RJcz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e02f8d5-a3a2-4f57-8e3b-a09fb705ba93_937x447.png 848w, https://substackcdn.com/image/fetch/$s_!RJcz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e02f8d5-a3a2-4f57-8e3b-a09fb705ba93_937x447.png 1272w, https://substackcdn.com/image/fetch/$s_!RJcz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e02f8d5-a3a2-4f57-8e3b-a09fb705ba93_937x447.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RJcz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e02f8d5-a3a2-4f57-8e3b-a09fb705ba93_937x447.png" width="937" height="447" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0e02f8d5-a3a2-4f57-8e3b-a09fb705ba93_937x447.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:447,&quot;width&quot;:937,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:77817,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://tshasankda.substack.com/i/192879054?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e02f8d5-a3a2-4f57-8e3b-a09fb705ba93_937x447.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RJcz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e02f8d5-a3a2-4f57-8e3b-a09fb705ba93_937x447.png 424w, https://substackcdn.com/image/fetch/$s_!RJcz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e02f8d5-a3a2-4f57-8e3b-a09fb705ba93_937x447.png 848w, https://substackcdn.com/image/fetch/$s_!RJcz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e02f8d5-a3a2-4f57-8e3b-a09fb705ba93_937x447.png 1272w, https://substackcdn.com/image/fetch/$s_!RJcz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e02f8d5-a3a2-4f57-8e3b-a09fb705ba93_937x447.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="pullquote"><p style="text-align: justify;"><em><strong>Kubernetes CPU limit note</strong>: The official Starburst Kubernetes configuration documentation explicitly warns about a known issue with Kubernetes CPU throttling. For SEP coordinator and worker pods, set CPU requests but do NOT set CPU limits. A hard CPU limit in Kubernetes causes JVM performance degradation due to CPU throttling even when the node has headroom. See docs.starburst.io/latest/k8s/sep-config-examples.html for the recommended values.yaml configuration.</em></p></div><h1>Chapter 5: What High Utilisation Actually Means</h1><p style="text-align: justify;">Returning to where we started: the junior engineer&#8217;s observation of 80% RAM and fully utilised worker nodes. 
With the architectural understanding from this document, let me now give a proper answer.</p><h2>5.1 The Correct Diagnostic Framework</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VPSw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F715684d8-9342-45f5-8729-40a491a9f7bb_841x796.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VPSw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F715684d8-9342-45f5-8729-40a491a9f7bb_841x796.png 424w, https://substackcdn.com/image/fetch/$s_!VPSw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F715684d8-9342-45f5-8729-40a491a9f7bb_841x796.png 848w, https://substackcdn.com/image/fetch/$s_!VPSw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F715684d8-9342-45f5-8729-40a491a9f7bb_841x796.png 1272w, https://substackcdn.com/image/fetch/$s_!VPSw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F715684d8-9342-45f5-8729-40a491a9f7bb_841x796.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VPSw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F715684d8-9342-45f5-8729-40a491a9f7bb_841x796.png" width="841" height="796" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/715684d8-9342-45f5-8729-40a491a9f7bb_841x796.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:796,&quot;width&quot;:841,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:126482,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://tshasankda.substack.com/i/192879054?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F715684d8-9342-45f5-8729-40a491a9f7bb_841x796.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VPSw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F715684d8-9342-45f5-8729-40a491a9f7bb_841x796.png 424w, https://substackcdn.com/image/fetch/$s_!VPSw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F715684d8-9342-45f5-8729-40a491a9f7bb_841x796.png 848w, https://substackcdn.com/image/fetch/$s_!VPSw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F715684d8-9342-45f5-8729-40a491a9f7bb_841x796.png 1272w, https://substackcdn.com/image/fetch/$s_!VPSw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F715684d8-9342-45f5-8729-40a491a9f7bb_841x796.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>5.2 Metrics That Actually Matter</h2><p style="text-align: justify;">Rather than watching raw CPU or RAM percentages, monitor these metrics for a true picture of cluster health:</p><ul><li><p style="text-align: justify;">Query success rate: What percentage of submitted queries complete successfully? OOM errors show up here.</p></li><li><p style="text-align: justify;">Query wait time (queue time): How long does a query spend waiting to be scheduled before execution begins? 
High queue time means the cluster is at capacity for concurrent queries.</p></li><li><p style="text-align: justify;">Query execution time percentiles (P50, P95, P99): Are these stable or trending upward over time?</p></li><li><p style="text-align: justify;">EXCEEDED_LOCAL_MEMORY_LIMIT error count: Queries failing with this error exceeded their per-node memory limit, a direct sign of memory pressure.</p></li><li><p>GC time ratio (via JMX): If GC time exceeds 10% of wall-clock time on workers, memory pressure is affecting JVM performance.</p></li></ul><div class="pullquote"><p style="text-align: justify;"><em><strong>Closing thought for Part 1</strong>: 80% RAM with 25 busy workers is exactly what a healthy, well-loaded MPP cluster looks like. The goal of a query engine is to efficiently use the resources you have provisioned. If utilisation were consistently low, that would mean you are over-provisioned and paying for idle resources. Monitor outcomes (query success, latency), not raw utilisation numbers.</em></p></div><h1>References</h1><ul><li><p><a href="https://www.youtube.com/watch?v=Pu80FkBRP-k">Presto Training Series, Session 4: Cluster Sizing &amp; Performance Tuning</a> </p></li><li><p><a href="http://starburst.io/info/oreilly-trino-guide/">Trino: The Definitive Guide, 2nd Edition &#8212; Matt Fuller, Manfred Moser, Martin Traverso. O&#8217;Reilly Media, October 2022</a>. 
</p></li><li><p><a href="http://docs.starburst.io/latest/">Starburst Enterprise Documentation, SEP 477-e LTS (November 2025)</a></p></li><li><p><a href="https://docs.starburst.io/latest/ecosystems/microsoft/index.html">Starburst Enterprise on Azure</a></p></li><li><p><a href="http://docs.starburst.io/latest/ecosystems/google/">Starburst Enterprise on Google Cloud</a></p></li><li><p><a href="http://docs.starburst.io/latest/k8s/sep-config-examples.html">Starburst Kubernetes Configuration Examples</a></p></li><li><p><a href="http://learn.microsoft.com/en-us/azure/virtual-machines/sizes/memory-optimized/e-family">Microsoft Azure: E Family VM Size Series</a></p></li><li><p><a href="http://cloud.google.com/compute/docs/general-purpose-machines">Google Cloud: General-Purpose Machine Family</a></p></li><li><p><a href="http://trino.io/blog/2020/12/27/announcing-trino.html">Announcing Trino (PrestoSQL rename)</a></p></li><li><p><a href="http://starburst.io/blog/prestosql-becomes-trino/">Starburst: PrestoSQL becomes Trino</a></p></li></ul><p></p><p style="text-align: center;"><em>Series: Starburst Enterprise Performance Tuning | Part 1 of 3 | March 2026 | Validated against SEP 477-e LTS</em></p><p style="text-align: center;"><em><strong>Continued in Part 2: Configuration Deep Dive: Memory, JVM &amp; Concurrency</strong></em></p>]]></content:encoded></item><item><title><![CDATA[What Starburst Knows About Your Query Before It Runs a Single Row]]></title><description><![CDATA[A practitioner&#8217;s guide to reading distributed query plans, spotting skew, confirming pushdown, and diagnosing failures &#8212; with six real TPC-DS examples from Starburst Galaxy]]></description><link>https://tshasankda.substack.com/p/what-starburst-knows-about-your-query</link><guid isPermaLink="false">https://tshasankda.substack.com/p/what-starburst-knows-about-your-query</guid><dc:creator><![CDATA[Tharashasank Davuluru]]></dc:creator><pubDate>Wed, 18 Mar 2026 20:04:48 
GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!14lz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa16eeb02-7dc6-42b8-b577-8c8ad3cd0c9d_1490x973.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Working with Starburst clusters in production, I have been asked a version of the same question more times than I can count:</p><div class="pullquote"><p>Why is this query slow?</p></div><p>Sometimes it comes from a teammate who has been staring at a SQL statement for an hour. Sometimes it comes from a stakeholder who just wants to know why their dashboard is taking so long to load. Occasionally it comes from a senior engineer who knows something is wrong but cannot pinpoint where.</p><blockquote><p>My answer is almost always the same: let us look at the plan.</p></blockquote><p>Reading query plans has become the most reliable diagnostic habit I have built working on this platform. Not because it solves every problem on the first read, but because it consistently points me in the right direction faster than anything else. This post walks through how to use Starburst&#8217;s EXPLAIN tooling and how to interpret what you find. Every example uses real queries run against the TPC-DS benchmark dataset on Starburst Galaxy at scale factor 1 (tpcds.sf1). The plan outputs shown are taken directly from those runs, with key lines highlighted.</p><p>Whether you are an analyst trying to understand why a query is crawling, an engineer chasing a cluster error, or a stakeholder wanting a clearer picture of what this platform actually does when you hit run, this is written for you.</p><h2>How Starburst processes a query</h2><p>Before EXPLAIN makes sense, it helps to understand the two-tier model Starburst uses. There is a single coordinator node responsible for receiving your SQL, parsing it, building a query plan, and scheduling work. 
Then there are one or more worker nodes that do the actual work of retrieving and processing data in parallel.</p><p>When the coordinator receives your query, it builds a distributed query plan: a structured description of how work will be divided into stages and tasks across workers. The EXPLAIN command lets you see that plan before any data is touched. It is a way of asking the coordinator to show its thinking before committing to execution.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!14lz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa16eeb02-7dc6-42b8-b577-8c8ad3cd0c9d_1490x973.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!14lz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa16eeb02-7dc6-42b8-b577-8c8ad3cd0c9d_1490x973.png 424w, https://substackcdn.com/image/fetch/$s_!14lz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa16eeb02-7dc6-42b8-b577-8c8ad3cd0c9d_1490x973.png 848w, https://substackcdn.com/image/fetch/$s_!14lz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa16eeb02-7dc6-42b8-b577-8c8ad3cd0c9d_1490x973.png 1272w, https://substackcdn.com/image/fetch/$s_!14lz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa16eeb02-7dc6-42b8-b577-8c8ad3cd0c9d_1490x973.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!14lz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa16eeb02-7dc6-42b8-b577-8c8ad3cd0c9d_1490x973.png" width="1456" height="951" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a16eeb02-7dc6-42b8-b577-8c8ad3cd0c9d_1490x973.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:951,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:248570,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://tshasankda.substack.com/i/191310727?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa16eeb02-7dc6-42b8-b577-8c8ad3cd0c9d_1490x973.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!14lz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa16eeb02-7dc6-42b8-b577-8c8ad3cd0c9d_1490x973.png 424w, https://substackcdn.com/image/fetch/$s_!14lz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa16eeb02-7dc6-42b8-b577-8c8ad3cd0c9d_1490x973.png 848w, https://substackcdn.com/image/fetch/$s_!14lz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa16eeb02-7dc6-42b8-b577-8c8ad3cd0c9d_1490x973.png 1272w, https://substackcdn.com/image/fetch/$s_!14lz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa16eeb02-7dc6-42b8-b577-8c8ad3cd0c9d_1490x973.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Source: Martin Traverso presentation</figcaption></figure></div><div class="pullquote"><p><em>&#8220;When Trino parses a statement, it converts it into a query and creates a distributed query plan, which is then realized as a series of interconnected stages running on Trino workers.&#8221; &#8212; Starburst architecture documentation</em></p></div><p>That pre-execution window is the point. 
You can read the plan, spot problems, and rethink the query before a single byte of data is processed.</p><h2>How Starburst&#8217;s EXPLAIN compares to other tools</h2><p>Query plan inspection exists in most modern query engines, so it is worth being precise about what is different here.</p><p>Snowflake has an EXPLAIN command, but it returns a logical plan only. Snowflake&#8217;s own documentation states that the actual physical execution order may differ, and there are no tuning knobs to act on the output. The real distributed picture, with actual row counts and timing, is only available post-execution via the Query Profile UI.</p><p>Redshift&#8217;s EXPLAIN is explicitly limited. Amazon&#8217;s documentation states that it does not illustrate the details of parallel query processing. You have to run the query first and inspect system tables afterwards.</p><p>BigQuery has no SQL-level EXPLAIN command at all. Plan information is post-execution only, via the console or API.</p><p>Starburst gives you the full physical fragment graph before execution, as a standard SQL command, with fragment types, data distribution strategy, cost estimates, and pushdown confirmation all visible. Because a single Starburst query can span multiple data sources, the plan shows the full execution graph across all connectors in one view. The plan also tells you with certainty whether optimizations like predicate pushdown actually fired. With that context, here is how to use it.</p><h1>1. The command to start with</h1><p>The recommended command is EXPLAIN (TYPE DISTRIBUTED), which shows the full physical execution graph without running the query. EXPLAIN (TYPE LOGICAL) also exists but is deprecated in current Starburst releases.</p><blockquote><p><strong>Important:</strong> <em>EXPLAIN (TYPE LOGICAL) is officially deprecated in current Starburst releases and will be removed in a future version. 
If you have existing scripts or workflows using TYPE LOGICAL, migrate them to EXPLAIN (TYPE DISTRIBUTED). All examples in this post use the distributed type, which is the recommended command going forward.</em></p></blockquote><p>All examples below use tpcds.sf1 on Starburst Galaxy. The TPC-DS table sizes at this scale factor are: store_sales 2.88 million rows, date_dim 73,049 rows, item 18,000 rows, store 12 rows, customer 100,000 rows. These sizes drive the optimizer&#8217;s decisions about how to distribute joins.</p><p><strong>Example 1: Simple scan</strong></p><p><strong>Query:</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;sql&quot;,&quot;nodeId&quot;:&quot;f0046da9-431a-42da-b566-36d6dd22cc39&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-sql">EXPLAIN (TYPE DISTRIBUTED)
SELECT * FROM tpcds.sf1.store_sales LIMIT 10;</code></pre></div><p><strong>Key plan output:</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;34c7c47b-d10b-4c4f-b441-57e7a5261e52&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">Fragment 0 [SINGLE]
    Output partitioning: SINGLE []
    Limit[count = 10]
    &#9492;&#9472; LocalExchange[partitioning = SINGLE]
       &#9492;&#9472; RemoteSource[sourceFragmentIds = [1]]
Fragment 1 [SOURCE]
    Output partitioning: SINGLE []
    LimitPartial[count = 10]
    &#9492;&#9472; TableScan[table = tpcds:store_sales:sf1.0]
           Estimates: {rows: 2880404 (537.32MB), cpu: 537.32M, memory: 0B, network: 0B}</code></pre></div><blockquote><p><em>Even for LIMIT 10, the TableScan estimates reading all 2.88 million rows (537MB) from the connector. The LimitPartial in Fragment 1 stops sending rows once 10 are found, but the scan starts against the full table. This is why SELECT * with a LIMIT is not as cheap as it looks.</em></p></blockquote><p>Read the plan from bottom to top. Fragment 1 executes first, reads from the connector, and feeds rows upward to Fragment 0 via the RemoteSource. Fragment 0 applies the final limit and returns results to the client.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6jhw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6583c558-909e-4f5d-891b-5a59cd2dcda5_1174x571.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6jhw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6583c558-909e-4f5d-891b-5a59cd2dcda5_1174x571.png 424w, https://substackcdn.com/image/fetch/$s_!6jhw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6583c558-909e-4f5d-891b-5a59cd2dcda5_1174x571.png 848w, https://substackcdn.com/image/fetch/$s_!6jhw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6583c558-909e-4f5d-891b-5a59cd2dcda5_1174x571.png 1272w, https://substackcdn.com/image/fetch/$s_!6jhw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6583c558-909e-4f5d-891b-5a59cd2dcda5_1174x571.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!6jhw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6583c558-909e-4f5d-891b-5a59cd2dcda5_1174x571.png" width="1174" height="571" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6583c558-909e-4f5d-891b-5a59cd2dcda5_1174x571.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:571,&quot;width&quot;:1174,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:659617,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://tshasankda.substack.com/i/191310727?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6583c558-909e-4f5d-891b-5a59cd2dcda5_1174x571.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6jhw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6583c558-909e-4f5d-891b-5a59cd2dcda5_1174x571.png 424w, https://substackcdn.com/image/fetch/$s_!6jhw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6583c558-909e-4f5d-891b-5a59cd2dcda5_1174x571.png 848w, https://substackcdn.com/image/fetch/$s_!6jhw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6583c558-909e-4f5d-891b-5a59cd2dcda5_1174x571.png 1272w, https://substackcdn.com/image/fetch/$s_!6jhw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6583c558-909e-4f5d-891b-5a59cd2dcda5_1174x571.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Fig 1 &#8212; Galaxy output: EXPLAIN for store_sales LIMIT 10. Fragment 0 [SINGLE] at top, Fragment 1 [SOURCE] at bottom. Read bottom to top.</figcaption></figure></div><p><strong>Example 2: Join with a small dimension table to see BROADCAST in action</strong></p><p>Query:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;sql&quot;,&quot;nodeId&quot;:&quot;fa978342-ad9f-4674-b1e0-18a85dca48fd&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-sql">EXPLAIN (TYPE DISTRIBUTED)
SELECT
  ss.ss_store_sk, s.s_store_name, COUNT(*) AS transaction_count
FROM tpcds.sf1.store_sales ss
JOIN tpcds.sf1.store s ON ss.ss_store_sk = s.s_store_sk
GROUP BY ss.ss_store_sk, s.s_store_name;</code></pre></div><p>Key plan output:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;259563ac-b578-472a-a31c-e2fe60a21d6b&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">Fragment 0 [HASH]
    Output partitioning: SINGLE []
    Aggregate[type = FINAL, keys = [ss_store_sk, s_store_name]]
Fragment 1 [SOURCE]
    Output partitioning: HASH [ss_store_sk, s_store_name]
    InnerJoin[criteria = (ss_store_sk = s_store_sk), distribution = REPLICATED]
    &#9500;&#9472; ScanFilter[table = tpcds:store_sales:sf1.0]
    &#9492;&#9472; RemoteSource[sourceFragmentIds = [2]]
Fragment 2 [SOURCE]
    Output partitioning: BROADCAST []
    TableScan[table = tpcds:store:sf1.0]
        Estimates: {rows: 12 (201B), cpu: 201, memory: 0B, network: 0B}</code></pre></div><blockquote><p>Fragment 2 uses BROADCAST partitioning because the store table has only 12 rows at 201 bytes. Every worker node receives a complete copy of it, and the join runs locally on each worker without any network shuffle. A BROADCAST on a table with millions of rows would be a serious warning sign.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vBNY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5422f275-2476-462e-9759-0996c5cc7084_1174x640.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vBNY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5422f275-2476-462e-9759-0996c5cc7084_1174x640.png 424w, https://substackcdn.com/image/fetch/$s_!vBNY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5422f275-2476-462e-9759-0996c5cc7084_1174x640.png 848w, https://substackcdn.com/image/fetch/$s_!vBNY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5422f275-2476-462e-9759-0996c5cc7084_1174x640.png 1272w, https://substackcdn.com/image/fetch/$s_!vBNY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5422f275-2476-462e-9759-0996c5cc7084_1174x640.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vBNY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5422f275-2476-462e-9759-0996c5cc7084_1174x640.png" 
width="1174" height="640" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5422f275-2476-462e-9759-0996c5cc7084_1174x640.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:640,&quot;width&quot;:1174,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:563630,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://tshasankda.substack.com/i/191310727?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5422f275-2476-462e-9759-0996c5cc7084_1174x640.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vBNY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5422f275-2476-462e-9759-0996c5cc7084_1174x640.png 424w, https://substackcdn.com/image/fetch/$s_!vBNY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5422f275-2476-462e-9759-0996c5cc7084_1174x640.png 848w, https://substackcdn.com/image/fetch/$s_!vBNY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5422f275-2476-462e-9759-0996c5cc7084_1174x640.png 1272w, https://substackcdn.com/image/fetch/$s_!vBNY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5422f275-2476-462e-9759-0996c5cc7084_1174x640.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Fig 2 &#8212; Galaxy output: store_sales JOIN store. Fragment 2 [SOURCE] output partitioning: BROADCAST []. 12 rows, 201 bytes. Fragment 0 [HASH] handles the final aggregation.</figcaption></figure></div><p><strong>Example 3: Three-table join with all fragment types together</strong></p><p>Query:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;sql&quot;,&quot;nodeId&quot;:&quot;6ed6855a-410d-4200-a446-2837b26fe6ef&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-sql">EXPLAIN (TYPE DISTRIBUTED)
SELECT i.i_category, d.d_year, SUM(ss.ss_net_paid) AS total_revenue
FROM tpcds.sf1.store_sales ss
JOIN tpcds.sf1.item i ON ss.ss_item_sk = i.i_item_sk
JOIN tpcds.sf1.date_dim d ON ss.ss_sold_date_sk = d.d_date_sk
WHERE d.d_year = 2001
GROUP BY i.i_category, d.d_year
ORDER BY total_revenue DESC;</code></pre></div><p>Key plan output (6 fragments):</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;c435e67d-439a-4127-ba95-23d943c141ea&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">Fragment 0 [SINGLE]      &#8592; final output
Fragment 1 [ROUND_ROBIN] &#8592; distributed sort
Fragment 2 [HASH]        &#8592; final aggregation
Fragment 3 [SOURCE]      &#8592; store_sales scan + both joins
    dynamicFilterAssignments = {i_item_sk -&gt; #df_633, d_date_sk -&gt; #df_634}
Fragment 4 [SOURCE]  Output partitioning: BROADCAST []
    ScanFilter[date_dim]  filterPredicate = (d_year = 2001)
    Estimates: {rows: 73049}/{rows: 363}   &#8592; 73K rows read, 363 survive
Fragment 5 [SOURCE]  Output partitioning: BROADCAST []
    TableScan[item]  Estimates: {rows: 18000 (246.15kB)}</code></pre></div><blockquote><p>Fragments 4 and 5 are both BROADCAST SOURCE fragments. date_dim is filtered to 363 rows (from 73,049) before being broadcast. item has 18,000 rows and is broadcast in full. The optimizer also assigns dynamic filters from these dimension scans to the store_sales scan in Fragment 3, further narrowing the rows processed before the joins.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ssuc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ce3159-5eb0-4c77-9b46-b0e65e1307f3_1179x649.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ssuc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ce3159-5eb0-4c77-9b46-b0e65e1307f3_1179x649.png 424w, https://substackcdn.com/image/fetch/$s_!Ssuc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ce3159-5eb0-4c77-9b46-b0e65e1307f3_1179x649.png 848w, https://substackcdn.com/image/fetch/$s_!Ssuc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ce3159-5eb0-4c77-9b46-b0e65e1307f3_1179x649.png 1272w, https://substackcdn.com/image/fetch/$s_!Ssuc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ce3159-5eb0-4c77-9b46-b0e65e1307f3_1179x649.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Ssuc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ce3159-5eb0-4c77-9b46-b0e65e1307f3_1179x649.png" width="1179" height="649" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/42ce3159-5eb0-4c77-9b46-b0e65e1307f3_1179x649.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:649,&quot;width&quot;:1179,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:502547,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://tshasankda.substack.com/i/191310727?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ce3159-5eb0-4c77-9b46-b0e65e1307f3_1179x649.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ssuc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ce3159-5eb0-4c77-9b46-b0e65e1307f3_1179x649.png 424w, https://substackcdn.com/image/fetch/$s_!Ssuc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ce3159-5eb0-4c77-9b46-b0e65e1307f3_1179x649.png 848w, https://substackcdn.com/image/fetch/$s_!Ssuc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ce3159-5eb0-4c77-9b46-b0e65e1307f3_1179x649.png 1272w, https://substackcdn.com/image/fetch/$s_!Ssuc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42ce3159-5eb0-4c77-9b46-b0e65e1307f3_1179x649.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Fig 3 &#8212; Galaxy output: three-table join plan. Fragment 0 [SINGLE], Fragment 1 [ROUND_ROBIN], Fragment 2 [HASH], Fragment 3 [SOURCE] visible. Fragments 4 and 5 (BROADCAST) are below.</figcaption></figure></div><p><strong>2. Layout, Estimates, and what question marks mean</strong></p><p>Inside each fragment, every operator shows two sub-sections.</p><p><strong>Layout</strong> describes the data shape at that point: column names and types. In the store_sales scan from Example 1, the Layout lists all 23 columns.
A SELECT * carries all of them through every exchange, making downstream operations more expensive than necessary.</p><p><strong>Estimates</strong> show the optimizer&#8217;s predicted resource consumption. When table statistics exist, they are reasonably accurate:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;de18618d-8108-461f-8aa8-d4179e8e5888&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">Estimates: {rows: 2880404 (537.32MB), cpu: 537.32M, memory: 0B, network: 0B}</code></pre></div><p>When statistics are missing or the optimizer cannot derive costs from available information, you see question marks:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;33f4cb05-e12d-4480-8d62-64c293aeef33&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">Estimates: {rows: 50 (1.79kB), cpu: ?, memory: ?, network: ?}</code></pre></div><p><strong>Example 4: Question marks and ScanFilterProject</strong></p><p>Query:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;sql&quot;,&quot;nodeId&quot;:&quot;15f17873-14c2-4d3a-b20d-3a5988da4cb6&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-sql">EXPLAIN (TYPE DISTRIBUTED)
SELECT c.c_customer_sk, c.c_first_name, c.c_last_name,
       SUM(ss.ss_net_paid) AS lifetime_value
FROM tpcds.sf1.store_sales ss
JOIN tpcds.sf1.customer c ON ss.ss_customer_sk = c.c_customer_sk
JOIN tpcds.sf1.date_dim d ON ss.ss_sold_date_sk = d.d_date_sk
WHERE d.d_year BETWEEN 2000 AND 2002
  AND c.c_birth_country = 'UNITED STATES'
GROUP BY c.c_customer_sk, c.c_first_name, c.c_last_name
ORDER BY lifetime_value DESC LIMIT 50;</code></pre></div><p>Key plan output:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;4c75fa08-25ab-4570-9202-f547ee58aef7&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">Fragment 1 [HASH]
    TopNPartial[count = 50]
    &#9474;   Estimates: {rows: 50 (1.79kB), cpu: ?, memory: ?, network: ?}
Fragment 3 [SOURCE]  Output partitioning: BROADCAST []
    ScanFilterProject[table = tpcds:customer:sf1.0,
        filterPredicate = (c_birth_country = varchar(20) 'UNITED STATES')]
        Estimates: {rows: 100000 (1.87MB)}/{rows: 458 (8.75kB)}
Fragment 4 [SOURCE]  Output partitioning: BROADCAST []
    ScanFilterProject[table = tpcds:date_dim:sf1.0,
        filterPredicate = (d_year BETWEEN 2000 AND 2002)]
        Estimates: {rows: 73049}/{rows: 730}</code></pre></div><blockquote><p>Two things to notice. First, the TopN operator in Fragment 1 shows cpu: ? and memory: ? because the optimizer cannot predict the cost of ranking without reliable row estimates from the join below it. Second, <strong>ScanFilterProject</strong> on customer reads 100,000 rows from the connector, and the filter in Trino is estimated to discard about 99.5% of them. If the customer connector supported predicate pushdown for this filter, only the roughly 458 matching rows would be read. A filterPredicate that remains on the scan operator (<strong>ScanFilterProject</strong> or <strong>ScanFilter</strong>), rather than being absorbed into the table scan itself, is how you confirm pushdown did not fire.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!k9fl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9084f89-d0a1-4e66-a435-b998c1553f33_1177x643.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!k9fl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9084f89-d0a1-4e66-a435-b998c1553f33_1177x643.png 424w, https://substackcdn.com/image/fetch/$s_!k9fl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9084f89-d0a1-4e66-a435-b998c1553f33_1177x643.png 848w, https://substackcdn.com/image/fetch/$s_!k9fl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9084f89-d0a1-4e66-a435-b998c1553f33_1177x643.png 1272w, https://substackcdn.com/image/fetch/$s_!k9fl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9084f89-d0a1-4e66-a435-b998c1553f33_1177x643.png
1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!k9fl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9084f89-d0a1-4e66-a435-b998c1553f33_1177x643.png" width="1177" height="643" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d9084f89-d0a1-4e66-a435-b998c1553f33_1177x643.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:643,&quot;width&quot;:1177,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:624961,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://tshasankda.substack.com/i/191310727?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9084f89-d0a1-4e66-a435-b998c1553f33_1177x643.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!k9fl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9084f89-d0a1-4e66-a435-b998c1553f33_1177x643.png 424w, https://substackcdn.com/image/fetch/$s_!k9fl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9084f89-d0a1-4e66-a435-b998c1553f33_1177x643.png 848w, https://substackcdn.com/image/fetch/$s_!k9fl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9084f89-d0a1-4e66-a435-b998c1553f33_1177x643.png 1272w, 
https://substackcdn.com/image/fetch/$s_!k9fl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9084f89-d0a1-4e66-a435-b998c1553f33_1177x643.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Fig 4 &#8212; Galaxy output: customer query plan. Fragment 1 [HASH] showing cpu: ?, memory: ? on TopN. Fragment 3 [SOURCE] showing ScanFilterProject on customer table.</figcaption></figure></div><p><strong>3.
EXPLAIN ANALYZE: what actually happened</strong></p><p><strong>EXPLAIN ANALYZE</strong> executes the query and reports measured values alongside the plan: actual CPU time, scheduled time, blocked time, and row counts per fragment. It also reports the standard deviation of input rows across tasks, which is the signal for data skew.</p><p><strong>Example 5: Finding where time went</strong></p><p>Query:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;sql&quot;,&quot;nodeId&quot;:&quot;47762f39-a695-4949-b522-c4caf3760f3c&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-sql">EXPLAIN ANALYZE
SELECT i.i_category, SUM(ss.ss_net_paid) AS revenue
FROM tpcds.sf1.store_sales ss
JOIN tpcds.sf1.item i ON ss.ss_item_sk = i.i_item_sk
GROUP BY i.i_category ORDER BY revenue DESC;</code></pre></div><p>Key plan output (Fragment 4 &#8212; where the work happened):</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;9fd4efa3-b041-449b-931a-432b4b97cbcb&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">Fragment 4 [SOURCE]
    CPU: 21.59s, Scheduled: 24.07s, Input: 2898404 rows (49.79MB)
    InnerJoin[criteria = (ss_item_sk = i_item_sk), distribution = REPLICATED]
        Left (probe) Input avg.: 66799.50 rows, Input std.dev.: 100.00%
        Right (build) Input avg.: 18000.00 rows, Input std.dev.: 0.00%
    &#9492;&#9472; ScanFilter[table = tpcds:store_sales:sf1.0,
           dynamicFilters = {ss_item_sk = #df_452 await}]
           CPU: 20.70s (90.15% of total query CPU)
           Input avg.: 1440202.00 rows, Input std.dev.: 100.00%
           Dynamic filters:
               df_452, [18000 item key ranges], collection time=3.62s</code></pre></div><blockquote><p>Fragment 4 consumed over 90% of total CPU time. The Input std.dev.: 100.00% on the probe side of the join means one worker task processed all the rows while the other processed none; that is data skew. The dynamic filter collected 18,000 valid item key ranges from the item scan and applied them to the store_sales scan before rows reached the join, reducing unnecessary work. Without this filter, all 2.88 million store_sales rows would have been evaluated against the join condition.</p></blockquote><div class="pullquote"><p>Note: execution times vary between runs. The screenshot below is from a separate run of the same query on the same cluster and shows 18.75s. Both are real results from the same query; run-to-run variation is normal.</p></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KdO3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5605a901-0cbe-4d90-abd7-312a8f8aaebc_1174x589.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KdO3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5605a901-0cbe-4d90-abd7-312a8f8aaebc_1174x589.png 424w, https://substackcdn.com/image/fetch/$s_!KdO3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5605a901-0cbe-4d90-abd7-312a8f8aaebc_1174x589.png 848w, https://substackcdn.com/image/fetch/$s_!KdO3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5605a901-0cbe-4d90-abd7-312a8f8aaebc_1174x589.png 1272w,
https://substackcdn.com/image/fetch/$s_!KdO3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5605a901-0cbe-4d90-abd7-312a8f8aaebc_1174x589.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KdO3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5605a901-0cbe-4d90-abd7-312a8f8aaebc_1174x589.png" width="1174" height="589" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5605a901-0cbe-4d90-abd7-312a8f8aaebc_1174x589.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:589,&quot;width&quot;:1174,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:598162,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://tshasankda.substack.com/i/191310727?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5605a901-0cbe-4d90-abd7-312a8f8aaebc_1174x589.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KdO3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5605a901-0cbe-4d90-abd7-312a8f8aaebc_1174x589.png 424w, https://substackcdn.com/image/fetch/$s_!KdO3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5605a901-0cbe-4d90-abd7-312a8f8aaebc_1174x589.png 848w, 
https://substackcdn.com/image/fetch/$s_!KdO3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5605a901-0cbe-4d90-abd7-312a8f8aaebc_1174x589.png 1272w, https://substackcdn.com/image/fetch/$s_!KdO3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5605a901-0cbe-4d90-abd7-312a8f8aaebc_1174x589.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Fig 5 &#8212; Galaxy output: EXPLAIN ANALYZE for store_sales JOIN item. Fragment 4 [SOURCE] CPU: 18.75s.
Input std.dev.: 100.00% on probe side confirms data skew. Dynamic filter df_452 applied to scan.</figcaption></figure></div><p><strong>4. Operators to watch in complex plans</strong></p><p>When a query joins multiple fact tables using CTEs, the plan grows significantly. The following query compares revenue across store, catalog, and web sales channels for 2001:</p><p><strong>Example 6: Multi-CTE cross-channel query</strong></p><p>Query:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;sql&quot;,&quot;nodeId&quot;:&quot;f9900a7d-61be-4334-96f2-9fe6434e8970&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-sql">EXPLAIN (TYPE DISTRIBUTED)
WITH store_channel AS (
  SELECT ss.ss_store_sk, ss.ss_item_sk, ss.ss_customer_sk,
         SUM(ss.ss_net_paid) AS store_revenue
  FROM tpcds.sf1.store_sales ss
  JOIN tpcds.sf1.date_dim d ON ss.ss_sold_date_sk = d.d_date_sk
  WHERE d.d_year = 2001
  GROUP BY ss.ss_store_sk, ss.ss_item_sk, ss.ss_customer_sk
),

catalog_channel AS (
  SELECT cs.cs_item_sk, cs.cs_bill_customer_sk,
         SUM(cs.cs_net_paid) AS catalog_revenue
  FROM tpcds.sf1.catalog_sales cs
  JOIN tpcds.sf1.date_dim d ON cs.cs_sold_date_sk = d.d_date_sk
  WHERE d.d_year = 2001
  GROUP BY cs.cs_item_sk, cs.cs_bill_customer_sk
)

SELECT i.i_item_sk, i.i_item_desc, i.i_category,
       sc.store_revenue, cc.catalog_revenue
FROM tpcds.sf1.item i
LEFT JOIN store_channel sc ON i.i_item_sk = sc.ss_item_sk
LEFT JOIN catalog_channel cc ON i.i_item_sk = cc.cs_item_sk
ORDER BY (COALESCE(sc.store_revenue,0) + COALESCE(cc.catalog_revenue,0)) DESC
LIMIT 100;</code></pre></div><p>Key plan output (11 fragments total):</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;2a47bc09-60c6-4fb3-ba51-44417305a587&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">Fragment 1 [SOURCE]
    LeftJoin[criteria = (i_item_sk = ws_item_sk), distribution = REPLICATED]
        Estimates: {rows: 77018361 (11.65GB), cpu: 11.65G, memory: 3.56MB}
Fragment 4 [SOURCE]  Output partitioning: BROADCAST []
    ScanFilterProject[table = tpcds:date_dim:sf1.0,
        filterPredicate = (d_year = integer '2001')]
        Estimates: {rows: 73049}/{rows: 363}  &#8592; 73K in, 363 survive
Fragment 7 [SOURCE]  Output partitioning: BROADCAST []   &#8592; catalog_sales date_dim
    ScanFilterProject  filterPredicate = (d_year = 2001)
        Estimates: {rows: 73049}/{rows: 363}  &#8592; same pattern
Fragment 10 [SOURCE] Output partitioning: BROADCAST []   &#8592; web_sales date_dim
    ScanFilterProject  filterPredicate = (d_year = 2001)
        Estimates: {rows: 73049}/{rows: 363}  &#8592; same pattern, third time</code></pre></div><blockquote><p>Three things to notice. First, the intermediate LeftJoin in Fragment 1 has an estimated size of 11.65GB before the TopN reduces the result to 100 rows. This is an expensive intermediate result. Second, ScanFilterProject on date_dim appears three times, once per sales channel CTE, each scanning all 73,049 date_dim rows and discarding 99.5% of them. Third, the repetition is a direct consequence of the CTE structure: the optimizer inlines each CTE independently rather than scanning date_dim once and reusing the result.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WJGI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ae4d72e-90f4-4b8d-bbda-54b9f8808779_1171x609.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WJGI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ae4d72e-90f4-4b8d-bbda-54b9f8808779_1171x609.png 424w, https://substackcdn.com/image/fetch/$s_!WJGI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ae4d72e-90f4-4b8d-bbda-54b9f8808779_1171x609.png 848w, https://substackcdn.com/image/fetch/$s_!WJGI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ae4d72e-90f4-4b8d-bbda-54b9f8808779_1171x609.png 1272w, https://substackcdn.com/image/fetch/$s_!WJGI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ae4d72e-90f4-4b8d-bbda-54b9f8808779_1171x609.png 1456w"
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WJGI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ae4d72e-90f4-4b8d-bbda-54b9f8808779_1171x609.png" width="1171" height="609" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9ae4d72e-90f4-4b8d-bbda-54b9f8808779_1171x609.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:609,&quot;width&quot;:1171,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:542132,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://tshasankda.substack.com/i/191310727?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ae4d72e-90f4-4b8d-bbda-54b9f8808779_1171x609.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WJGI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ae4d72e-90f4-4b8d-bbda-54b9f8808779_1171x609.png 424w, https://substackcdn.com/image/fetch/$s_!WJGI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ae4d72e-90f4-4b8d-bbda-54b9f8808779_1171x609.png 848w, https://substackcdn.com/image/fetch/$s_!WJGI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ae4d72e-90f4-4b8d-bbda-54b9f8808779_1171x609.png 1272w, 
https://substackcdn.com/image/fetch/$s_!WJGI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ae4d72e-90f4-4b8d-bbda-54b9f8808779_1171x609.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Fig 6 &#8212; Galaxy output: multi-CTE plan. Fragment 3 and 4 [SOURCE] visible. Fragment 4 shows ScanFilterProject on date_dim: 73,049 rows read, 363 survive. BROADCAST partitioning.
Pattern repeats in fragments 7 and 10.</figcaption></figure></div><p>The operators worth stopping on in any complex plan:</p><ul><li><p><strong>Exchange</strong>: data moving between workers. Each one carries a network cost. An unexpected exchange usually means the data was not already partitioned the way a join or aggregation needs it.</p></li><li><p><strong>ScanFilterProject</strong>: a filter evaluated in Trino after reading from the connector rather than pushed to the source. The two-stage estimates ({rows in}/{rows out}) tell you how many rows the optimizer expects to read and then discard.</p></li><li><p><strong>Join distribution</strong> = <strong>REPLICATED</strong>: confirms a broadcast join. Right for small build-side tables, wrong for large ones.</p></li><li><p><strong>Dynamic filter</strong> <strong>assignments</strong>: the optimizer&#8217;s plan to pre-filter a scan using runtime values from the other side of a join. Confirm it fired using EXPLAIN ANALYZE.</p></li></ul><p><strong>5. When a query fails before it reaches the workers</strong></p><p>At some point you will likely receive this error before any data is processed:</p><blockquote><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;99a03608-852f-4dbc-bc92-f5eb8194ba0d&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">Number of stages in the query (126) exceeds the allowed maximum (100)</code></pre></div></blockquote><p>This is the <strong>QUERY_HAS_TOO_MANY_STAGES</strong> error. The coordinator built the distributed plan, counted the stages, and refused to proceed. The query never reached the workers.</p><p>The multi-CTE plan in Example 6 already spans 11 fragments (0 through 10). That output is from the full version of the query, which adds web_sales as a third channel to the two CTEs printed above, each with its own scan of the date dimension; the third channel alone accounts for four of those fragments.
In production, queries built on views that reference multiple other views can reach the stage limit without anyone realizing how deep the dependency chain runs.</p><p>The limit is controlled by <strong>query.max-stage-count</strong>, which was 100 on this cluster, as the error message shows. The official Starburst documentation notes that setting it too high can destabilize the cluster and cause unrelated queries to fail with <strong>REMOTE_TASK_ERROR</strong> due to request queue pressure. The fix is simplification: materializing intermediate results in temporary or staging tables splits the work into smaller queries, each with fewer stages. CTEs alone do not help here, because the optimizer inlines them, as Example 6 showed. If simplification is not possible, the limit can be raised in <strong>config.properties</strong> or overridden per session using <strong>query_max_stage_count</strong>.</p><blockquote><p>The stage count in the error is exact, not an estimate. If it says 126, that is precisely how many stages the coordinator determined the query requires.</p></blockquote><p>One important limit of EXPLAIN: it shows the plan the optimizer intends based on statistics and configuration at planning time. It cannot predict runtime behaviour if data distribution is uneven. EXPLAIN ANALYZE is the ground truth. For any query that runs but performs worse than expected, EXPLAIN ANALYZE is the next step after EXPLAIN.</p><p><strong>Final thoughts</strong></p><p>EXPLAIN is the first command I run when something unexpected happens on the cluster. Not because it answers every question immediately, but because it answers the most important one first: what is the engine actually planning to do?</p><p>The practical workflow: run EXPLAIN (TYPE DISTRIBUTED) before submitting any uncertain query. Read for unexpected fragment types, unnecessary exchanges, and question marks in cost estimates. Check for ScanFilterProject and use the two-stage row estimates to quantify what is being wasted. Follow up with EXPLAIN ANALYZE for real CPU time, row counts, and the standard deviation figure that signals skew.
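</p><p>As a rough sketch, the steps above look like this in SQL. It assumes the same tpcds.sf1 catalog used in the examples; the value 150 in the last statement is purely illustrative and not a recommendation:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;sql&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-sql">-- Step 1: inspect the distributed plan without executing the query.
-- Look for fragment types, exchanges, ScanFilterProject, and "?" estimates.
EXPLAIN (TYPE DISTRIBUTED)
SELECT i.i_category, SUM(ss.ss_net_paid) AS revenue
FROM tpcds.sf1.store_sales ss
JOIN tpcds.sf1.item i ON ss.ss_item_sk = i.i_item_sk
GROUP BY i.i_category
ORDER BY revenue DESC;

-- Step 2: execute the query and collect measured CPU time, row counts,
-- and the per-task input std.dev. that signals skew.
EXPLAIN ANALYZE
SELECT i.i_category, SUM(ss.ss_net_paid) AS revenue
FROM tpcds.sf1.store_sales ss
JOIN tpcds.sf1.item i ON ss.ss_item_sk = i.i_item_sk
GROUP BY i.i_category
ORDER BY revenue DESC;

-- Step 3 (last resort): raise the stage limit for the current session only.
-- 150 is a hypothetical value chosen for illustration.
SET SESSION query_max_stage_count = 150;</code></pre></div><p>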
If a query fails at planning with a stage-count error, simplify the SQL before raising the configuration limit.</p><p>For stakeholders: Starburst gives you a level of pre-execution transparency that is genuinely distinctive in the context of federated multi-source queries. You can see the full distributed plan before running anything, confirm whether optimisations fired, and trace exactly which connector handled which part of the execution. That visibility is available as a standard SQL command and it compounds in value the more complex your data landscape becomes.</p><div class="pullquote"><p>Across more than sixteen years working in data, the tools have changed constantly. What has not changed is the value of being able to see what a system is actually doing rather than guessing at it. Starburst&#8217;s query plan tooling is a serious capability, and it is worth investing the time to read it properly.</p></div><h2>Suggested Reads &amp; References</h2><p>To deepen your understanding of query planning, execution, and performance tuning across Trino and modern data platforms, explore the following resources:</p><h3>Starburst &amp; Trino</h3><ul><li><p><a href="https://docs.starburst.io/latest/sql/explain.html">EXPLAIN documentation</a></p></li><li><p><a href="https://docs.starburst.io/latest/sql/explain-analyze.html">EXPLAIN ANALYZE documentation</a></p></li><li><p><a href="https://docs.starburst.io/latest/admin/properties-query-management.html#query-max-stage-count">Query management properties (query.max-stage-count)</a></p></li><li><p><a href="https://trino.io/docs/current/overview/concepts.html">Concepts: stages, tasks, and splits</a></p></li><li><p><a href="https://www.starburst.io/info/trino-query-plan-analysis-webinar-series/">Query plan analysis</a></p></li><li><p><a href="https://trino.io/docs/current/optimizer/pushdown.html">Pushdown optimizations (predicate, aggregation, Top-N)</a></p></li><li><p><a 
href="https://docs.starburst.io/latest/overview/concepts.html">Architecture overview: Coordinator, Workers, and Cost-Based Optimizer (CBO)</a></p></li></ul><div><hr></div><h3>Other Data Platforms (for Comparison)</h3><ul><li><p>Snowflake &#8211; <a href="https://docs.snowflake.com/en/sql-reference/sql/explain">EXPLAIN documentation</a></p></li><li><p>Amazon Redshift &#8211; <a href="https://docs.aws.amazon.com/redshift/latest/dg/r_EXPLAIN.html">EXPLAIN documentation</a></p></li><li><p>Google BigQuery &#8211; <a href="https://cloud.google.com/bigquery/docs/query-plan-explanation">Query plan and execution timeline</a></p></li></ul><div><hr></div><h3>Why These Matter</h3><p>These resources will help you:</p><ul><li><p>Understand how query planners work across engines</p></li><li><p>Compare execution strategies between distributed systems</p></li><li><p>Learn optimization techniques like pushdown and cost-based planning</p></li><li><p>Troubleshoot performance issues more effectively</p></li></ul><div id="youtube2-GcS02yTNwC0" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;GcS02yTNwC0&quot;,&quot;startTime&quot;:&quot;491s&quot;,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/GcS02yTNwC0?start=491s&amp;rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h4 style="text-align: center;"></h4><p></p><p></p><p></p>]]></content:encoded></item><item><title><![CDATA[The Odd Number Rule: A Principle That Has Followed Me From Tableau Server to Kubernetes]]></title><description><![CDATA[How a Tableau Server deployment in 2012 taught me something about distributed systems that I still apply every day in 
Kubernetes]]></description><link>https://tshasankda.substack.com/p/the-odd-number-rule-a-principle-that</link><guid isPermaLink="false">https://tshasankda.substack.com/p/the-odd-number-rule-a-principle-that</guid><dc:creator><![CDATA[Tharashasank Davuluru]]></dc:creator><pubDate>Tue, 17 Mar 2026 09:01:20 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ZjkW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1dd86436-f6d9-43e2-bc15-4c58dd135ee9_715x724.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The first time I genuinely understood why distributed systems prefer odd numbers, I wasn&#8217;t working with Kubernetes. I wasn&#8217;t even working in containers. It was around 2012, and I was deploying Tableau Server for an organization that needed it to stay up. Three nodes minimum, the documentation said, for high availability. Specifically three, not two and not four. At the time I read that as a product requirement and moved on. It was only when I started digging into why the system behaved the way it did under failure that I realized it wasn&#8217;t an arbitrary number at all.</p><p>What I found underneath that requirement was a piece of mathematics that shows up in almost every distributed system I&#8217;ve worked with since: quorum. The idea that a cluster of nodes needs to reach a majority agreement before it can make a decision, and that this majority requirement makes odd-numbered clusters fundamentally more efficient than even-numbered ones. Once I understood that, a lot of architectural decisions that had previously felt like conventions started to feel like consequences, things that had to be true given how consensus works rather than rules someone had decided on.</p><p>This post traces that understanding from Tableau Server in 2012 through to how I think about Kubernetes replica counts today. 
The tools have changed significantly over the past decade, but the underlying reasoning has not moved at all, and I find that worth writing about.</p><h3><strong>Where it started: Tableau Server and the three-node minimum</strong></h3><p>Tableau Server&#8217;s high availability requirement of a minimum of three nodes has been part of its architecture for a long time. The documentation frames it clearly: with fewer than three nodes, you cannot eliminate single points of failure. With two nodes, at least one critical component (the repository, the file store, or the service responsible for cluster coordination) has only a single instance running. A single instance is a single point of failure, regardless of how many other components are redundant.</p><p>What made this click for me was spending time with what Tableau actually needed at the coordination layer. The service responsible for maintaining cluster state, tracking which nodes were available, and making sure the cluster could reach agreement during failures required an odd number of instances to function correctly. You could run one instance, three instances, or five instances. Running two was not a supported configuration, and understanding why is what opened up the quorum concept for me.</p><p>At the time of my 2012 deployments, Tableau&#8217;s internal coordination mechanism worked on the same majority-agreement principle, even if the specific ZooKeeper-based Coordination Service component that Tableau documents today was formalized in later versions of the product. The three-node minimum and the odd-instance requirement were present and enforced from those early versions. 
The underlying reason, that a coordination service needs to be able to reach a majority decision and that two instances cannot break a tie, was the same then as it is now.</p><blockquote><p><em>Tableau Server documentation: &#8220;A minimum of three Tableau Server nodes are required for high availability.&#8221; The Coordination Service, which is built on Apache ZooKeeper, is deployed in ensembles of one, three, or five instances (never two or four) to guarantee quorum. Source: Tableau Server documentation</em></p></blockquote><p>It took me a while to fully internalize why the even numbers were excluded. The short answer is that two instances can disagree and have no way to resolve the disagreement. With three instances, two can disagree and the third breaks the tie. That&#8217;s the entire principle, and it scales: five instances can lose two and still form a majority of three. Four instances need a majority of three, so they can afford to lose only one, which is the same fault tolerance as a three-node cluster at greater cost. The even numbers are not just suboptimal; they provide no additional safety over the odd number below them.</p><h3><strong>The mathematics of quorum: why odd numbers are structurally different</strong></h3><p>To understand why even-numbered clusters are wasteful from a resilience perspective, it helps to work through the quorum formula directly. In any distributed system that requires nodes to agree before making a decision, whether that is electing a leader, committing a write, or changing cluster membership, quorum is the minimum number of nodes that must participate in that agreement for it to be valid.</p><p>The formula is simple and worth knowing:</p><pre><code>quorum = floor(N / 2) + 1
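
# Follows from the line above (a derivation added for illustration, not quoted
# from the cited docs): how many nodes can fail while a majority still remains.
tolerated_failures = N - quorum = floor((N - 1) / 2)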
</code></pre><p>For a three-node cluster, quorum is 2. For a five-node cluster, quorum is 3. This means a three-node cluster can lose one node and still reach quorum with its remaining two. A five-node cluster can lose two nodes and still reach quorum with its remaining three. Now look at what happens when you apply the same formula to even-numbered clusters:</p><pre><code>Cluster  Quorum    Can Fail  Gain vs previous
------   -------   --------  ----------------
2        2         0         &#8212;
3        2         1         +1 node, +1 failure tolerance
4        3         1         +1 node, no extra tolerance
5        3         2         +1 node, +1 failure tolerance
6        4         2         +1 node, no extra tolerance</code></pre><p>The pattern in that table is the core of the odd-number argument. Every time you move from an odd cluster size to the even number above it, you add a node and gain nothing in fault tolerance. A four-node cluster tolerates exactly one failure, the same as a three-node cluster. A six-node cluster tolerates exactly two failures, the same as a five-node cluster. The etcd documentation makes this explicit, noting that adding a member to bring a cluster to an even number does not buy additional fault tolerance. You have simply paid for a node that makes the quorum requirement stricter without giving you any more room to absorb failures.</p><blockquote><p><em><strong>etcd FAQ</strong>: &#8220;An odd-size cluster tolerates the same number of failures as an even-size cluster but with fewer nodes. Adding a member to bring the size of cluster up to an even number doesn&#8217;t buy additional fault tolerance.&#8221; Source: etcd official documentation</em></p></blockquote><p>There is a second problem with even-numbered clusters that is worth understanding: the evenly split partition. If a four-node cluster experiences a network partition that isolates two nodes from the other two, neither half has quorum. Both sides see exactly half the cluster, and neither can elect a leader or process writes. Quorum is doing its job here, because it prevents a true split-brain in which both halves accept conflicting writes, but the price is that the cluster becomes unavailable until the partition heals. With a three-node cluster experiencing the same type of partition, one side has two nodes and the other has one. The side with two nodes has quorum and continues operating normally. The minority side correctly stops accepting writes. The system degrades gracefully rather than becoming completely unavailable.</p><p>This is not a theoretical edge case. Network partitions happen in production, and the behaviour of a cluster during a partition is determined by its topology at the time the partition occurs. Odd-numbered clusters are designed to handle this scenario gracefully. 
Even-numbered clusters are structurally exposed to it.</p><p>ZooKeeper, which as we&#8217;ve seen underpins Tableau&#8217;s Coordination Service, has always recommended clusters of 3 or 5 nodes for exactly these reasons. The Raft consensus algorithm, which etcd uses and which is the foundation of Kubernetes&#8217; control plane, works on the same majority-quorum principle. These are not independent design choices by different teams; they are the same mathematical conclusion, arrived at separately because the mathematics does not leave much room for alternatives.</p><h3><strong>From coordination services to Kubernetes: the same principle, a different layer</strong></h3><p>When I started working seriously with Kubernetes, the architecture felt immediately recognisable. Kubernetes uses etcd as its cluster state store, the same consensus-based coordination layer with the same Raft algorithm and the same odd-number requirement. Production Kubernetes control planes run three or five etcd nodes. Running two etcd nodes gives you zero fault tolerance, because any single failure drops you below quorum. Running four gives you the same fault tolerance as three, at higher cost. The recommendation of three or five is not a convention that the Kubernetes community adopted by habit; it is a direct consequence of how Raft works.</p><p>But etcd is the infrastructure layer, and most engineers working with Kubernetes day-to-day are not making decisions about the control plane topology. The decisions that come up more frequently are about application workloads: how many replicas should this deployment run? And here the odd-number principle applies, but in a different way that is worth explaining carefully, because the mechanism is not the same as quorum consensus.</p><p>Stateless pods do not vote. When a request arrives at a service, any available pod can handle it. There is no leader election, no write quorum, no consensus protocol. 
The reason three replicas is the right answer for stateless services is not that three pods can form a majority; it is that three pods give you enough redundancy to absorb the operational events that happen routinely on a live cluster without ever dropping to a state where you have no redundancy at all. The connection to the odd-number principle is structural rather than mathematical: in both cases, three is the number where the system can lose one instance and still function safely, and in both cases, two is the number where any single loss leaves you without a safety net.</p><p>With consensus systems, that safety net is quorum: the ability to continue making decisions. With stateless pod replicas, the safety net is operational continuity: the ability to continue serving traffic while the cluster does the things it routinely does, draining nodes, rolling out deployments, rescheduling pods. The minimum that delivers that continuity, under realistic operating conditions, is three.</p><h3><strong>What the numbers look like under real operating conditions</strong></h3><h3><strong>The probability view</strong></h3><p>Let&#8217;s make the availability argument concrete. Assume each pod has a 99% availability rate, which is a reasonable figure for a healthy pod in a well-run cluster that accounts for restarts, rescheduling events, and brief periods of unavailability. With two replicas, you need both pods to be healthy simultaneously to maintain full redundancy. With three replicas, you need any two of the three to be healthy. 
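</p><p>As a rough sketch of the arithmetic behind that comparison, assume each pod fails independently of the others. That is an idealisation, since real pods can share nodes, zones, and downstream dependencies, so treat the figures as indicative rather than exact:</p><pre><code>p = 0.99            per-pod availability (the assumption stated above)
q = 1 - p = 0.01    per-pod unavailability

Two replicas: P(fewer than 2 healthy)
  = 1 - p^2
  = 1 - 0.9801
  = 0.0199                      (about 2% of the time)

Three replicas: P(fewer than 2 healthy)
  = 3 * p * q^2 + q^3
  = 0.000297 + 0.000001
  = 0.000298                    (about 0.03% of the time)

Ratio: 0.0199 / 0.000298 = roughly 67</code></pre><p>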
The difference in degraded exposure across a working year is significant:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZjkW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1dd86436-f6d9-43e2-bc15-4c58dd135ee9_715x724.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZjkW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1dd86436-f6d9-43e2-bc15-4c58dd135ee9_715x724.png 424w, https://substackcdn.com/image/fetch/$s_!ZjkW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1dd86436-f6d9-43e2-bc15-4c58dd135ee9_715x724.png 848w, https://substackcdn.com/image/fetch/$s_!ZjkW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1dd86436-f6d9-43e2-bc15-4c58dd135ee9_715x724.png 1272w, https://substackcdn.com/image/fetch/$s_!ZjkW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1dd86436-f6d9-43e2-bc15-4c58dd135ee9_715x724.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZjkW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1dd86436-f6d9-43e2-bc15-4c58dd135ee9_715x724.png" width="715" height="724" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1dd86436-f6d9-43e2-bc15-4c58dd135ee9_715x724.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:724,&quot;width&quot;:715,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!ZjkW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1dd86436-f6d9-43e2-bc15-4c58dd135ee9_715x724.png 424w, https://substackcdn.com/image/fetch/$s_!ZjkW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1dd86436-f6d9-43e2-bc15-4c58dd135ee9_715x724.png 848w, https://substackcdn.com/image/fetch/$s_!ZjkW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1dd86436-f6d9-43e2-bc15-4c58dd135ee9_715x724.png 1272w, https://substackcdn.com/image/fetch/$s_!ZjkW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1dd86436-f6d9-43e2-bc15-4c58dd135ee9_715x724.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The &#8216;degraded state&#8217; referred to above means running on a single pod with no redundancy. The service is technically up, but any restart of that pod during that period becomes a user-facing outage. What&#8217;s worth noting is that the better your individual pods are, the larger the proportional benefit of the third replica. At 99.9% per-pod availability, the risk gap between two and three replicas is around 600 times. The third replica is not giving you marginally more reliability; it is placing you in a fundamentally different risk category.</p><h3><strong>The node drain: the scenario that makes this practical</strong></h3><p>Beyond the probability framing, the operational scenario that most clearly illustrates why two replicas are not enough is the routine node drain. A kernel security patch, a Kubernetes version upgrade, or a cloud provider recycling the underlying hardware all require taking a node offline. 
The standard mechanism is kubectl drain, which gracefully evicts all pods from the target node before it goes offline.</p><p>With two replicas, consider what happens during a drain where one of your pods happens to be on the affected node. That pod is evicted. A replacement is scheduled on another node and begins its startup sequence. Between the moment of eviction and the moment the replacement passes its readiness probe and is marked healthy, your service is running on a single pod. That window can be seconds or it can be several minutes, depending on image pull time, initialization duration, and readiness check configuration. During that window, if the surviving pod restarts for any reason, whether that is an OOM kill, a liveness probe timeout, or a brief spike in a downstream dependency, you have an outage. And this is not an edge case scenario; node drains happen on every production cluster, regularly, as part of normal maintenance.</p><p>With three replicas, the same drain plays out with two pods continuing to serve traffic throughout the entire process. The evicted pod&#8217;s replacement starts up in the background, and by the time it is healthy, the cluster is back to full redundancy. The on-call engineer does not need to think about whether the drain timing is safe, or whether there might be a risk window. The service is continuously protected throughout the operation.</p><blockquote><p><em>The official Kubernetes documentation notes that kubectl drain respects Pod Disruption Budgets, but PDBs only protect against voluntary disruptions. If the surviving pod is lost to a hardware failure or an OOM kill while the replacement is starting up, the PDB offers no protection at that point. 
The replica count itself is what determines how much margin you have.</em></p></blockquote><h3><strong>Rolling deployments and what happens when a new pod gets stuck</strong></h3><p>Kubernetes&#8217; default rolling update strategy sets both maxUnavailable and maxSurge to 25% of the desired replica count. For small deployments, the rounding behaviour is important to understand: maxUnavailable rounds down and maxSurge rounds up, so with 2 replicas maxUnavailable becomes 0 and maxSurge becomes 1. Kubernetes starts one extra pod and will not terminate any existing pod until its replacement is healthy and ready. That sounds like a sensible safeguard, and in the normal case it is. But consider what happens when the new pod fails its readiness probe.</p><p>With two replicas, a new pod that gets stuck in a failing state freezes the rollout with both old pods carrying all traffic and no margin to spare. To resolve it, someone needs to manually intervene, diagnose the issue, and trigger a rollback. If one of the old pods is lost during the time that takes, whether to an OOM kill or a node failure, your service is running on a single pod without any redundancy while the deployment pipeline is still blocked. The exposure is worse where a template overrides maxUnavailable to 1, because Kubernetes then removes an old pod before its replacement is ready, and a stuck rollout leaves one old pod carrying all traffic from the start. With three replicas in the same scenario, two healthy old pods continue serving traffic throughout the investigation. The deployment failure is a problem to be solved, but it is not simultaneously degrading your service. That distinction matters considerably when the failure happens at an inconvenient time.</p><h3><strong>Translating this into configuration</strong></h3><p>The principles above translate into a small set of configuration decisions that work together. None of them are complicated in isolation, but they are most valuable when applied as a set rather than individually.</p><h3><strong>Set a minimum of three replicas</strong></h3><p>The starting point is simply setting the replica count explicitly at three, rather than leaving it at whatever default was in the deployment template:</p><pre><code>spec:
  replicas: 3</code></pre><h3><strong>Add a Pod Disruption Budget</strong></h3><p>A Pod Disruption Budget (PDB) tells Kubernetes how many pods your service needs to remain available during voluntary operations such as node drains, cluster upgrades, and maintenance windows. Without a PDB, Kubernetes may evict multiple pods from the same service simultaneously during a drain. With minAvailable set to 2, Kubernetes refuses any eviction that would leave fewer than two healthy pods, so a drain waits until a replacement is ready before evicting the next pod. It is important to understand that PDBs only cover voluntary disruptions: hardware failures, OOM kills, and pod crashes are involuntary and bypass PDB enforcement entirely. The replica count itself is your protection against those.</p><pre><code>apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: your-service-pdb
spec:
  minAvailable: 2   # with replicas: 3, this permits at most one voluntary eviction at a time
  selector:
    matchLabels:
      app: your-service</code></pre><h3><strong>Configure the rolling update strategy explicitly</strong></h3><p>Rather than relying on the 25% default behaviour, whose percentage rounding is easy to misread at small replica counts, it is better to state your intent directly. Setting maxUnavailable to 0 means the full desired number of pods must be available at all times during a rollout. A new pod must pass its readiness probe before any old pod is removed. This eliminates the risk of a stuck deployment silently reducing your available pod count:</p><pre><code>strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1        # start at most one extra pod above the desired count
    maxUnavailable: 0  # never remove an old pod before its replacement is ready</code></pre><h3><strong>Distribute pods across availability zones</strong></h3><p>Three replicas running on the same node or in the same availability zone give you pod-level redundancy but not infrastructure-level redundancy. A single node failure or a zonal outage can take all three pods down simultaneously. Topology spread constraints tell the Kubernetes scheduler to distribute pods across failure domains, ensuring that infrastructure-level events can only affect a subset of your replicas at once:</p><pre><code>topologySpreadConstraints:
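  # Illustrative addition, not part of the original example: a second
  # constraint that also spreads replicas across nodes, using the well-known
  # kubernetes.io/hostname node label. ScheduleAnyway makes it a soft
  # preference; that choice is an assumption, so adjust it to your cluster.
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: your-service
  # Zone-level constraint: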
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: your-service</code></pre><h3><strong>Apply the same logic to etcd</strong></h3><p>If you are responsible for the Kubernetes control plane itself, the same odd-number reasoning applies directly to etcd. A production cluster should run three or five etcd nodes. Two etcd nodes provide zero fault tolerance, because any single failure drops below quorum. Four etcd nodes provide the same fault tolerance as three, at greater infrastructure cost. Losing etcd quorum does not degrade your workloads gradually; it makes the entire cluster unable to process any changes until quorum is restored, which makes etcd topology one of the most consequential architectural decisions in a Kubernetes deployment.</p><h3><strong>When running fewer than three replicas is the right call</strong></h3><p>Three replicas is the appropriate default for stateless services running in production environments where availability matters. But not every workload and not every environment meets that description, and it is worth being clear about where the tradeoff genuinely falls in the other direction.</p><p>Development and staging environments are the most obvious case. If a pod restart causes a brief disruption in a development environment, the cost is a developer noticing a timeout and refreshing their browser. The risk calculus is completely different from production, and running two replicas or even one in those environments is entirely reasonable. Similarly, low-traffic internal tools where the business impact of a brief outage is negligible may not justify the infrastructure cost of a third replica, particularly if the team responsible for them has explicitly weighed that tradeoff.</p><p>Async workers and batch processing jobs represent another legitimate exception. If a worker pod restarts mid-job and the job retries from the beginning or resumes from a checkpoint, the impact is a delay rather than a failure. 
For those workloads, replica count is more about throughput than availability, and the right number depends on the processing requirements rather than the resilience argument made in this post.</p><p>The underlying principle is that these should be deliberate decisions made with the tradeoffs in mind, not defaults that were never reconsidered. Two replicas in a staging environment is a sensible choice. Two replicas in a production service because the task to increase the count never made it to the top of the backlog is a different situation entirely, and one that tends to surface at the worst possible moments.</p><h3><strong>The principle outlasts the tools</strong></h3><p>Looking back across more than a decade of infrastructure work, what strikes me most about the odd-number rule is how stable it has been. The tools I have worked with have changed substantially, from Tableau&#8217;s internal coordination mechanism in 2012, through ZooKeeper-based systems in the mid-2010s, to etcd and Kubernetes today, but the mathematical reason for preferring odd cluster sizes has not changed at all. Quorum requires a majority. A majority decision cannot be made by an even number of participants without the risk of a tie. The minimum odd number that provides both redundancy and majority decision-making is three, and that is why three appears as the minimum HA configuration in so many different systems across so many different years.</p><p>What I find useful about understanding the origin of the rule rather than just following it is that it tells you when the rule applies and when it does not. For consensus-based systems like etcd or ZooKeeper, the odd-number requirement is mathematically strict: two nodes provide no fault tolerance, and four nodes provide no more fault tolerance than three. For stateless pod replicas, the argument is structurally similar but not mathematically identical. 
Three replicas is not required for quorum, but it is the minimum that keeps your service in a safe state across the routine operational events that every production cluster experiences regularly.</p><p>When you configure replicas: 3 for a Kubernetes deployment, you are not following a convention that someone decided on and everyone else copied. You are making a decision that is consistent with the same reasoning that Tableau encoded into its HA architecture in the early 2010s, that ZooKeeper encoded into its ensemble configuration, and that the Raft algorithm formalizes in its design. The context is different each time, but the underlying logic is the same one, and it has held up across every distributed system I&#8217;ve encountered since I first came across it in a Tableau Server deployment guide over a decade ago.</p><p><strong>References :</strong></p><ul><li><p><a href="https://help.tableau.com/current/server/en-us/server_process_coordination.htm">Tableau Server Coordination Service &#8212; quorum, ensemble, three-node HA requirement</a></p></li><li><p><a href="https://help.tableau.com/current/server/en-us/distrib_ha_zk.htm">Deploy a Coordination Service Ensemble &#8212; odd-only instances (1, 3, or 5)</a></p></li><li><p><a href="https://help.tableau.com/current/server/en-us/distrib_ha.htm">Tableau Server high availability installation &#8212; minimum three nodes</a></p></li><li><p><a href="https://etcd.io/docs/v3.3/faq/">etcd FAQ &#8212; odd cluster size, even-numbered clusters provide no extra fault tolerance</a></p></li><li><p><a href="https://sre.google/sre-book/managing-critical-state/">Google SRE Book: Managing Critical State &#8212; distributed consensus and quorum</a></p></li><li><p><a href="https://kubernetes.io/docs/concepts/workloads/controllers/deployment/">Kubernetes Deployments &#8212; rolling update defaults, maxUnavailable and maxSurge (25%)</a></p></li><li><p><a href="https://kubernetes.io/docs/concepts/workloads/pods/disruptions/">Kubernetes 
Pod Disruptions &#8212; voluntary vs involuntary, PDB scope and limits</a></p></li><li><p><a href="https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/">Safely drain a Kubernetes node &#8212; kubectl drain, PDB enforcement</a></p></li><li><p><a href="https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/">Pod Topology Spread Constraints &#8212; zone and node distribution</a></p></li></ul>]]></content:encoded></item><item><title><![CDATA[RPO and RTO in Distributed Systems: Concepts, Architecture, and Practical Design Considerations]]></title><description><![CDATA[Understanding Recovery Objectives for Reliable and Resilient Infrastructure]]></description><link>https://tshasankda.substack.com/p/rpo-and-rto-in-distributed-systems</link><guid isPermaLink="false">https://tshasankda.substack.com/p/rpo-and-rto-in-distributed-systems</guid><dc:creator><![CDATA[Tharashasank Davuluru]]></dc:creator><pubDate>Sat, 07 Mar 2026 22:40:39 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!BHHr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F051d792b-bdec-406c-bee2-7bdd2c974edb_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Modern software systems operate in environments where infrastructure failures are unavoidable. Hardware components can fail, network connectivity may become unstable, storage devices can degrade, and entire availability zones or regions can experience outages. As digital services have become central to business operations, the tolerance for downtime and data loss has decreased significantly. Consequently, system architects and engineering teams must design platforms that continue operating even when failures occur.</p><p>Two fundamental metrics are commonly used to define recovery expectations for systems that must maintain reliability in the presence of failures. 
These metrics are <strong>Recovery Point Objective (RPO)</strong> and <strong>Recovery Time Objective (RTO)</strong>. Together they provide a framework for determining how much data loss is acceptable and how quickly a system must recover when disruptions occur.</p><p>Understanding these concepts is essential for designing distributed systems that preserve both <strong>data durability and service availability</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BHHr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F051d792b-bdec-406c-bee2-7bdd2c974edb_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BHHr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F051d792b-bdec-406c-bee2-7bdd2c974edb_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!BHHr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F051d792b-bdec-406c-bee2-7bdd2c974edb_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!BHHr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F051d792b-bdec-406c-bee2-7bdd2c974edb_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!BHHr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F051d792b-bdec-406c-bee2-7bdd2c974edb_1536x1024.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!BHHr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F051d792b-bdec-406c-bee2-7bdd2c974edb_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/051d792b-bdec-406c-bee2-7bdd2c974edb_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2334286,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://tshasankda.substack.com/i/190234734?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F051d792b-bdec-406c-bee2-7bdd2c974edb_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!BHHr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F051d792b-bdec-406c-bee2-7bdd2c974edb_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!BHHr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F051d792b-bdec-406c-bee2-7bdd2c974edb_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!BHHr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F051d792b-bdec-406c-bee2-7bdd2c974edb_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!BHHr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F051d792b-bdec-406c-bee2-7bdd2c974edb_1536x1024.png 
1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h1>Recovery Point Objective (RPO)</h1><p>Recovery Point Objective represents the <strong>maximum amount of data that a system can afford to lose after a failure occurs</strong>. The objective is typically defined as a time interval that describes how far back the system may need to revert during recovery.</p><p>For example, consider a system with an RPO of ten minutes. In this situation, if a failure occurs, the organization accepts that up to ten minutes of recently generated data may be lost during recovery. 
This lost information could include user transactions, database updates, application events, or other forms of system state that were generated after the last successful recovery point.</p><h3>RPO Timeline Illustration</h3><pre><code>Time  &#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&gt;

Last Backup          Failure Occurs
     &#9474;                     &#9474;
     &#9474;&lt;------ RPO -------&gt; &#9474;
     &#9474;                     &#9474;
Data that may potentially be lost</code></pre><p>RPO is closely related to the mechanisms used for data replication and backup. Systems with larger RPO values often rely on periodic backups or asynchronous replication strategies. In contrast, systems that require very low RPO values must replicate data more frequently or maintain near real-time copies across multiple storage nodes.</p><h3>Zero RPO Architecture</h3><pre><code>          Client
             &#9474;
             &#9660;
       Primary Database
             &#9474;
    Synchronous Replication
             &#9474;
     &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;
     &#9660;               &#9660;
Replica Database  Replica Database</code></pre><p>In this model, transactions are committed only after they are written to multiple replicas, ensuring that data is preserved even if the primary node fails.</p><div><hr></div><h1>Recovery Time Objective (RTO)</h1><p>Recovery Time Objective defines the <strong>maximum acceptable duration that a system may remain unavailable after a failure occurs</strong>. While RPO addresses data durability, RTO focuses on service availability.</p><p>When an outage occurs, several processes must take place before normal system operation resumes. These processes include detecting the failure, activating backup resources, restarting services, and redirecting traffic toward healthy components.</p><h3>RTO Recovery Timeline</h3><pre><code>Time &#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&gt;

Failure      Detection      Failover      Recovery
  &#9474;              &#9474;             &#9474;             &#9474;
  &#9474;&lt;------------------ RTO -----------------&gt;&#9474;</code></pre><p>If a system has an RTO of five minutes, the entire recovery process&#8212;from failure detection to service restoration&#8212;must complete within that five-minute window.</p><p>Achieving low RTO values requires automated monitoring, load balancing, and redundant infrastructure.</p><div><hr></div><h1>Relationship Between RPO and RTO</h1><p>Although RPO and RTO are closely related, they address different dimensions of system reliability.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;dd53d8e8-1bed-4c07-bb80-1b9f3197d0ab&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">| Metric | Focus           | What It Measures                |
| ------ | --------------- | ------------------------------- |
| RPO    | Data durability | Maximum data loss after failure |
| RTO    | Availability    | Maximum downtime after failure  |
</code></pre></div><h3>Combined RPO and RTO Illustration</h3><pre><code>Time &#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&gt;

Last Safe Data        Failure            Service Restored
      &#9474;                  &#9474;                      &#9474;
      &#9474;&lt;---- RPO -----&gt;  &#9474;&lt;-------- RTO --------&gt;&#9474;</code></pre><p>The section before the failure represents potential data loss, while the section after the failure represents system downtime.</p><h1>Failure Scenario in a Distributed System</h1><p>Consider a distributed application consisting of multiple service nodes positioned behind a load balancer.</p><h3>Normal System Architecture</h3><pre><code>                 Load Balancer
               /       |        \
             Node1   Node2    Node3
               &#9474;       &#9474;        &#9474;
               &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9532;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;
                       &#9474;
                Distributed Database</code></pre><p>During normal operation:</p><ul><li><p>incoming requests are distributed across nodes</p></li><li><p>each node processes requests independently</p></li><li><p>the database replicates data across storage nodes</p></li></ul><div><hr></div><h1>Node Failure Scenario</h1><p>Suppose <strong>Node2 experiences a hardware failure</strong>.</p><pre><code>                 Load Balancer
               /       X        \
             Node1   Node2    Node3
                        DOWN
               &#9474;                &#9474;
               &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;
                       &#9474;
                Distributed Database</code></pre><p>Monitoring systems detect that Node2 is no longer responding.</p><h1>Traffic Rerouting After Failure</h1><p>The load balancer removes the failed node and redirects requests to the remaining nodes.</p><pre><code>                 Load Balancer
                 /          \
               Node1      Node3
                 &#9474;           &#9474;
                 &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9496;
                         &#9474;
                Distributed Database</code></pre><p>The system continues operating while infrastructure automation replaces the failed node.</p><h1>Node Recovery and Rejoin</h1><pre><code>           Replacement Node
                  &#9474;
                  &#9660;
           Data Synchronization
                  &#9474;
                  &#9660;
        Node Rejoins Service Cluster</code></pre><p>Once the node synchronizes with the cluster, it resumes handling requests.</p><h1>Measuring Recovery Objectives in the Scenario</h1><p>A typical incident timeline might look like this:</p><pre><code>00:00  Node failure occurs
00:03  Monitoring detects failure
00:05  Load balancer reroutes traffic
00:30  Replacement node joins cluster</code></pre><p>From a user&#8217;s perspective:</p><ul><li><p><strong>Recovery time &#8776; 5 seconds</strong>, measured against the RTO: service is restored once the load balancer reroutes traffic at 00:05, while the replacement node joining at 00:30 only restores capacity</p></li><li><p><strong>Data loss = 0</strong>, measured against the RPO, if synchronous replication preserved all writes</p></li></ul><div><hr></div><h1>Tiered Recovery Objectives Across Systems</h1><p>Organizations often assign recovery objectives based on system importance.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;6f1cf1af-0bef-4a9c-8516-b395e4fad48f&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">| Tier   | Example Systems              | Typical RPO | Typical RTO |
| ------ | ---------------------------- | ----------- | ----------- |
| Tier 1 | Payments, healthcare systems | near-zero   | seconds     |
| Tier 2 | Core business applications   | minutes     | minutes     |
| Tier 3 | Internal services            | hours       | hours       |
| Tier 4 | Analytics and reporting      | hours&#8211;days  | hours&#8211;days  |
</code></pre></div><p>This classification ensures that mission-critical systems receive the highest level of resilience.</p><h1>Architectural Techniques for Reducing RPO and RTO</h1><h3>Distributed Infrastructure</h3><pre><code>         Service Cluster
      /        |        \
   Node A    Node B    Node C</code></pre><p>Multiple nodes prevent single points of failure.</p><div><hr></div><h3>Multi-Region Deployment</h3><pre><code>             Global Traffic Router
               /              \
           Region A          Region B
         /    |    \        /    |    \
      Node  Node  Node   Node  Node  Node</code></pre><p>If one region fails, traffic can shift to the other region.</p><h3>Replication Strategies</h3><h4>Synchronous Replication</h4><pre><code>Client &#8594; Primary DB &#8594; Replica1 &#8594; Replica2</code></pre><p>Ensures no data loss but increases write latency.</p><h4>Asynchronous Replication</h4><pre><code>Client &#8594; Primary DB
              &#9474;
              &#9660;
         Replica DB</code></pre><p>Improves performance but introduces a small window of possible data loss.</p><h1>DevOps Practices That Improve Recovery</h1><p>Modern engineering teams integrate reliability into deployment processes. Organizations such as Netflix and Stripe rely heavily on automation to maintain operational resilience.</p><p>Important practices include:</p><ul><li><p>Infrastructure as Code for reproducible environments</p></li><li><p>automated deployment pipelines for rapid recovery</p></li><li><p>immutable infrastructure that replaces failed instances rather than repairing them</p></li></ul><p>These practices help reduce recovery times and improve system stability during failures.</p><div><hr></div><h1>Business Impact of Downtime</h1><p>System outages have measurable financial and operational consequences. Research from Gartner suggests that the cost of downtime can reach thousands of dollars per minute for many organizations.</p><p>Beyond financial losses, outages may lead to:</p><ul><li><p>failed customer transactions</p></li><li><p>inconsistent data records</p></li><li><p>loss of customer trust</p></li><li><p>operational disruptions for internal teams</p></li></ul><p>For this reason, recovery objectives are an essential part of infrastructure planning.</p><h1>Practical Limits of Zero Recovery Objectives</h1><p>Although organizations often aim for zero RPO and zero RTO, achieving absolute zero is extremely difficult due to network latency, failure detection delays, and coordination overhead in distributed systems.</p><p>Most production systems instead target <strong>near-zero recovery objectives</strong>, ensuring that failures cause minimal disruption and that services remain effectively available from a user&#8217;s perspective.</p><h1>Conclusion</h1><p>Reliable system design requires acknowledging that infrastructure failures are inevitable. 
Recovery Point Objective and Recovery Time Objective provide a structured framework for defining how systems should respond to these failures.</p><p>RPO determines how much data loss is acceptable, while RTO defines how quickly services must recover. By combining distributed architectures, automated failover mechanisms, robust replication strategies, and modern DevOps practices, engineering teams can design systems capable of maintaining reliability even in the presence of infrastructure failures.</p>]]></content:encoded></item><item><title><![CDATA[Leading Big Bets: Lessons from Executive Engineering Leadership]]></title><description><![CDATA[A Leadership Reflection While Actively Preparing for Executive Interviews]]></description><link>https://tshasankda.substack.com/p/leading-big-bets-lessons-from-executive</link><guid isPermaLink="false">https://tshasankda.substack.com/p/leading-big-bets-lessons-from-executive</guid><dc:creator><![CDATA[Tharashasank Davuluru]]></dc:creator><pubDate>Sun, 01 Mar 2026 11:05:31 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!b_Hi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01390170-2f18-4192-a832-5d91f10b3a82_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Leading large engineering teams&#8212;100+ engineers&#8212;or driving strategic &#8220;big bet&#8221; projects changes the nature of leadership. At the executive level, success is no longer just about execution&#8212;it&#8217;s about strategy, influence, people, and measurable business impact.</p><p>I&#8217;m currently preparing for executive-level interviews, and reflecting on my leadership experiences has been invaluable. As I attend these interviews, I&#8217;ve realized how critical it is to structure your stories, show measurable impact, and communicate your leadership thinking effectively. 
In this post, I share lessons from my experience, combined with insights from resources like Austen McDonald&#8217;s thebehavioral.tech, Byte-by-Byte, LeetCode, and Gayle Laakmann McDowell.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!b_Hi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01390170-2f18-4192-a832-5d91f10b3a82_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!b_Hi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01390170-2f18-4192-a832-5d91f10b3a82_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!b_Hi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01390170-2f18-4192-a832-5d91f10b3a82_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!b_Hi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01390170-2f18-4192-a832-5d91f10b3a82_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!b_Hi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01390170-2f18-4192-a832-5d91f10b3a82_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!b_Hi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01390170-2f18-4192-a832-5d91f10b3a82_1536x1024.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/01390170-2f18-4192-a832-5d91f10b3a82_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2112540,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://tshasankda.substack.com/i/189537950?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01390170-2f18-4192-a832-5d91f10b3a82_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!b_Hi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01390170-2f18-4192-a832-5d91f10b3a82_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!b_Hi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01390170-2f18-4192-a832-5d91f10b3a82_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!b_Hi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01390170-2f18-4192-a832-5d91f10b3a82_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!b_Hi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01390170-2f18-4192-a832-5d91f10b3a82_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Leading Big Bets (AI generated image)</figcaption></figure></div><h2>1. Resolving Cross-Team Conflict at Scale (CARL)</h2><ul><li><p>Context: During a rollout of a new analytics platform, Data Engineering and Product Infrastructure teams disagreed on delivery timelines.</p></li><li><p>Action: I facilitated an alignment workshop with team leads and product owners, mapping dependencies and creating a shared delivery plan.</p></li><li><p>Result: Both teams aligned and delivered the project on time.</p></li><li><p>Learning: Structured dialogue, empathy, and focus on organizational goals are essential at scale.</p></li></ul><h2>2. 
Making High-Stakes Decisions Under Uncertainty (CARL)</h2><ul><li><p>Context: Two hours before a major platform release, a critical issue was discovered.</p></li><li><p>Action: I evaluated risks, consulted key leads, and approved a targeted hotfix.</p></li><li><p>Result: The release went live successfully without user impact.</p></li><li><p>Learning: Calculated risks, rapid evaluation, and clear escalation paths define executive leadership.</p></li></ul><h2>3. Delivering Tough Feedback to Leaders (CARL)</h2><ul><li><p>Context: A newly promoted engineering manager was missing key deadlines.</p></li><li><p>Action: I shared specific examples, proposed a clear improvement plan, and offered support.</p></li><li><p>Result: Six months later, their team&#8217;s delivery improved by 40%.</p></li><li><p>Learning: Feedback must be clear, empathetic, and measurable.</p></li></ul><h2>4. Owning Big Bets or Strategic Projects &#8211; Feature Y (CARL)</h2><ul><li><p>Context: At BigCo, the company struggled to close additional clients due to a lack of differentiation from competitors. I suggested building Feature Y after exploring technical feasibility in a proof-of-concept.</p></li><li><p>Action: I worked with design, PM, data science, and a senior engineer to plan and build Feature Y over a quarter.</p></li><li><p>Result: The feature brought in 50% more revenue.</p></li><li><p>Learning: Proofs of concept are extremely helpful for convincing leaders of what&#8217;s possible and securing buy-in for strategic initiatives.</p></li></ul><h2>5. Learning from Failures (CARL)</h2><ul><li><p>Context: A migration of our legacy data pipeline was delayed due to underestimated complexity.</p></li><li><p>Action: I redesigned the timeline with phased delivery and introduced stronger risk assessment protocols.</p></li><li><p>Result: Future projects now face 60% fewer delays.</p></li><li><p>Learning: Ownership, learning, and process improvement define great leadership.</p></li></ul><h2>6. 
Balancing Vision and Execution (CARL)</h2><ul><li><p>Context: While focusing on our AI platform roadmap, we had to deliver quarterly business-critical enhancements.</p></li><li><p>Action: I implemented prioritization frameworks and aligned executives and engineers through clear communication.</p></li><li><p>Result: Both long-term and short-term goals were delivered successfully.</p></li><li><p>Learning: Balancing trade-offs and keeping teams aligned is key at the executive level.</p></li></ul><h2>7. Vendor Negotiation at Scale (CARL)</h2><ul><li><p>Context: During a cloud migration, we needed to negotiate terms with two enterprise vendors.</p></li><li><p>Action: I highlighted long-term partnership value, strategic importance, and key project constraints.</p></li><li><p>Result: Secured a 20% discount, additional support hours, and guaranteed SLA improvements, while keeping the project on schedule.</p></li><li><p>Learning: Vendor negotiation requires data, leverage, and clear communication, not just price haggling.</p></li></ul><h2>8. Preparing for Executive-Level Behavioral Interviews</h2><p>As I prepare for executive-level interviews, I&#8217;ve invested significant time studying behavioral interview preparation resources that are especially relevant for leadership roles.</p><p>These helped me:</p><ul><li><p>Structure stories around big bets, cross-team impact, and influence</p></li><li><p>Map experiences to commonly assessed signal areas</p></li><li><p>Communicate not just outcomes, but leadership thinking</p></li></ul><p>Frameworks like STARR (Situation, Task, Action, Result, Reflection) and CARL (Context, Action, Result, Learning) have been particularly effective in helping me articulate lessons from conflict, risk-taking, and vendor negotiations.</p><p>Key takeaway: Behavioral interview preparation is essential for executive roles. It&#8217;s about reflection and clarity, not memorization.</p><h2>9. 
Key Considerations for Executive Behavioral Interviews</h2><p>When preparing for executive-level interviews, it&#8217;s important to go beyond stories and frameworks. Here are the critical areas I focus on:</p><ul><li><p>The Importance of Behavioral Interviews: Assess how you think, lead, and influence, not just what you did.</p></li><li><p>Evaluating Candidates in Behavioral Interviews: Interviewers look for clarity, consistency, and evidence of leadership impact.</p></li><li><p>Understanding Interviewer Signals: Watch verbal and non-verbal cues&#8212;they reveal what qualities are valued.</p></li><li><p>Presenting Your Experience Effectively: Show context, action, result, and learning. Link technical work to business impact.</p></li><li><p>The Importance of Authenticity: Executive interviewers spot rehearsed answers&#8212;be genuine while professional.</p></li><li><p>Common Candidate Mistakes and Red Flags: Overloading details, avoiding accountability, or failing to show cross-functional influence.</p></li><li><p>Making a Strong First Impression: Confidence, clarity, and humility matter.</p></li><li><p>Team Match and Asking Questions: Thoughtful questions about culture, strategy, and leadership expectations show alignment and curiosity.</p></li></ul><p>Key takeaway: Understanding these fundamentals ensures you communicate experience effectively, demonstrate leadership qualities, and make a lasting impression.</p><h2>Key Lessons for Executive Leaders</h2><ol><li><p>Conflict resolution: Align teams strategically.</p></li><li><p>Decision-making: Take calculated risks under uncertainty.</p></li><li><p>Feedback: Deliver clear, empathetic guidance.</p></li><li><p>Big bets: Lead with vision, execution, and influence.</p></li><li><p>Failure: Own it, learn, and improve processes.</p></li><li><p>Strategy vs execution: Balance long-term goals with short-term needs.</p></li><li><p>Vendor negotiation: Influence partnerships strategically.</p></li><li><p>Behavioral prep: 
Reflect on experiences, structure them effectively using CARL or STARR, and communicate impact confidently.</p></li></ol><p>Behavioral interviews and leadership challenges go hand in hand. Preparing thoughtfully, reflecting on leadership experience, and framing stories with CARL or STARR ensures that when it comes to big bets, tough decisions, and high-stakes projects, you can communicate both your impact and your thought process effectively.</p><h2>References:</h2><p>The following resources played an important role in shaping my understanding of behavioral interviews, executive storytelling, and leadership reflection while preparing for executive-level interviews. I highly recommend them to anyone seriously investing in interview preparation and long-term leadership growth.</p><ul><li><p><a href="https://thebehavioral.tech/">thebehavioral.tech</a>: This is the primary site for <a href="https://www.google.com/search?q=Austen+McDonald&amp;kgmid=/g/11gbhw608r&amp;sa=X&amp;ved=2ahUKEwiElIHPw_6SAxWrSvEDHW88MlkQ3egRegYIAQgDEAM">Austen McDonald</a>, where he hosts his complete behavioral preparation system and <a href="https://thebehavioral.substack.com/">Substack newsletter</a>.</p></li><li><p>Byte-by-Byte: Note that Byte-by-Byte was recently acquired by <a href="https://interviewing.io/">interviewing.io</a>, and much of its content is currently being integrated there.</p></li><li><p><a href="https://leetcode.com/">LeetCode</a>: The industry-standard platform for practicing technical coding problems and participating in competitive contests.</p></li><li><p><a href="https://www.google.com/search?q=Gayle+Laakmann+McDowell&amp;kgmid=/m/0klx28l&amp;sa=X&amp;ved=2ahUKEwiElIHPw_6SAxWrSvEDHW88MlkQ3egRegYIAQgDEAw">Gayle Laakmann McDowell</a>: The author&#8217;s personal site. 
You can also find her interview database and forum at <a href="https://github.com/careercup">CareerCup</a> and book-specific resources at <a href="https://www.crackingthecodinginterview.com/">Cracking the Coding Interview</a>.<br><br></p></li></ul>]]></content:encoded></item><item><title><![CDATA[Leading Without a Title: Influencing Stakeholders When You Don’t Own the Decision]]></title><description><![CDATA[Why informal power, clarity, and credibility matter more than authority in modern engineering teams]]></description><link>https://tshasankda.substack.com/p/leading-without-a-title-influencing</link><guid isPermaLink="false">https://tshasankda.substack.com/p/leading-without-a-title-influencing</guid><dc:creator><![CDATA[Tharashasank Davuluru]]></dc:creator><pubDate>Fri, 20 Feb 2026 21:27:31 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!AxfN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bed34c7-af15-497a-9749-9b281f7e5da9_1050x700.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In many engineering organizations, responsibility often extends beyond formal authority. Some of the most consequential decisions are shaped by individuals who do not officially own them. This reflection explores leadership that operates through influence rather than designation.</p><p>Leadership, in its earliest and most enduring form, predates formal titles, organizational hierarchies, and defined reporting lines. Across time, leadership has emerged not through designation but through action, particularly in situations where uncertainty demanded judgment and responsibility extended beyond defined roles. Groups have consistently gravitated toward individuals who could interpret ambiguity, manage risk, and guide collective direction when outcomes mattered. 
Authority, as it is commonly understood today, evolved later as a mechanism for structure and scale, but influence has always remained the foundation upon which leadership rests.</p><p>Modern engineering organizations often assume that leadership accompanies position or mandate. Agile and platform-driven environments repeatedly expose the limitations of this assumption. As systems become more interconnected and delivery more distributed, authority fragments while accountability persists. This tension is especially visible in platform and infrastructure engineering, where teams operate at the intersection of reliability, security, compliance, and delivery. The consequences of technical decisions often extend far beyond the boundaries of any single team, yet ownership of those decisions is rarely centralized.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AxfN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bed34c7-af15-497a-9749-9b281f7e5da9_1050x700.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AxfN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bed34c7-af15-497a-9749-9b281f7e5da9_1050x700.png 424w, https://substackcdn.com/image/fetch/$s_!AxfN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bed34c7-af15-497a-9749-9b281f7e5da9_1050x700.png 848w, https://substackcdn.com/image/fetch/$s_!AxfN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bed34c7-af15-497a-9749-9b281f7e5da9_1050x700.png 1272w, 
https://substackcdn.com/image/fetch/$s_!AxfN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bed34c7-af15-497a-9749-9b281f7e5da9_1050x700.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AxfN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bed34c7-af15-497a-9749-9b281f7e5da9_1050x700.png" width="1050" height="700" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2bed34c7-af15-497a-9749-9b281f7e5da9_1050x700.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:700,&quot;width&quot;:1050,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!AxfN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bed34c7-af15-497a-9749-9b281f7e5da9_1050x700.png 424w, https://substackcdn.com/image/fetch/$s_!AxfN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bed34c7-af15-497a-9749-9b281f7e5da9_1050x700.png 848w, https://substackcdn.com/image/fetch/$s_!AxfN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bed34c7-af15-497a-9749-9b281f7e5da9_1050x700.png 1272w, 
https://substackcdn.com/image/fetch/$s_!AxfN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bed34c7-af15-497a-9749-9b281f7e5da9_1050x700.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Informal Leadership in Tech collaboration (AI generated image)</figcaption></figure></div><p>In such environments, platform engineers and senior individual contributors are routinely drawn into decisions during incidents, releases, and architectural transitions. 
They may influence infrastructure design, operational controls, and risk posture without formally owning application architecture, product priorities, or delivery roadmaps. Decisions cannot always wait for clarity of ownership. Systems demand timely judgment, and leadership emerges through those who provide direction when authority is unclear.</p><p>Influence in these contexts does not come from control but from the ability to create alignment under constraint. Product teams operate under delivery pressure. Application teams focus on maintainability and velocity. Security and compliance functions prioritize risk mitigation and regulatory adherence. Effective influence requires understanding these differing priorities and framing technical considerations in terms of shared outcomes rather than isolated constraints. When discussions begin with common objectives instead of positional mandates, collaboration becomes more natural and resistance decreases.</p><p>Clarity plays an equally critical role. Platform and infrastructure work is inherently complex and often invisible until something fails. During moments of uncertainty, stakeholders are rarely seeking exhaustive technical detail. They are seeking orientation and direction. Leaders operating without formal authority add the most value by clearly articulating the decision at hand, outlining viable options, explaining trade-offs, and making consequences visible. The ability to reduce complexity without oversimplifying reality is a defining characteristic of informal leadership in technical environments.</p><p>Credibility remains the enduring currency of influence. It is built gradually through consistency, fairness, and presence, particularly during high-pressure situations such as production incidents or critical releases. Credibility strengthens when individuals demonstrate sound judgment, acknowledge uncertainty when it exists, and share credit across teams. 
It erodes quickly when ego, defensiveness, or misattribution enters the equation. In environments where authority is ambiguous, trust becomes the mechanism through which influence is sustained.</p><p>Leading without owning the decision requires deliberate attention to behaviour and conduct. Psychological safety and clarity form the foundation of such leadership. When individuals feel able to question assumptions, raise concerns, and express dissent without fear of dismissal, decision quality improves and systemic risks surface earlier. This is not merely a cultural aspiration; it is an operational necessity in complex, interdependent systems.</p><p>Equally important is the ability to subordinate personal ego to collective outcomes. Informal leadership is not concerned with being seen as correct but with enabling effective decisions. Those who influence most successfully tend to ask questions that sharpen understanding rather than assert conclusions prematurely. They recognize that formal ownership of decisions matters less than shared responsibility for outcomes. When outcomes improve, leadership has already been exercised, regardless of attribution.</p><p>Visibility and recognition also play a meaningful role. Platform and infrastructure contributions often operate in the background, and success is frequently measured by the absence of failure rather than visible achievements. Leaders who consciously acknowledge contributions and ensure recognition is distributed fairly strengthen trust across organizational boundaries. Over time, this practice becomes a powerful source of informal authority.</p><p>The capacity to influence others is closely tied to personal effectiveness. Sustained leadership without authority requires emotional and cognitive clarity, and these resources are finite. Continuous fatigue, excessive context switching, and reactive working patterns reduce one&#8217;s ability to engage constructively with stakeholders. 
Reflection on personal effectiveness therefore becomes essential rather than optional. The quality of influence one brings to an organization often reflects the quality of attention, composure, and preparation brought to each interaction.</p><p>Time offers a revealing signal of priorities. Environments dominated by perpetual firefighting and unstructured decision-making leave little space for thoughtful influence. Protecting time for preparation, reflection, and synthesis enables leaders to frame issues more effectively and engage stakeholders with clarity. In complex systems, disciplined use of time is inseparable from effective leadership practice.</p><p>Leadership without a title ultimately begins with self-examination. Before attempting to influence stakeholders or shape decisions, it is worth considering whether one is acting from clarity or anxiety, whether trust is being built or consumed, and whether actions strengthen the system as a whole rather than advance narrow interests. Influence may be external in its expression, but leadership remains deeply personal in its foundation.</p><p>The ability to lead without authority is not an innate trait reserved for a few. It is a capability developed through experience, observation, and deliberate practice. Structured learning can help, but the most meaningful growth occurs through consistent application in real organizational contexts.</p><p>In contemporary engineering organizations, authority is often distributed, situational, or unclear, while responsibility remains constant. Platform and infrastructure leaders operate within this tension daily. Leadership without a title is not a secondary form of leadership. In complex engineering systems, it is often the primary way leadership is exercised. 
Authority may be defined by role, but influence is shaped by conduct, consistency, and trust built over time.</p><p>For readers interested in exploring this further, the following resources offer useful perspectives on influencing without authority:</p><ul><li><p><a href="https://www.youtube.com/watch?v=26YI1ctY6eY">Leadership and Persuasion: Influencing Without Authority</a></p></li><li><p><a href="https://www.youtube.com/watch?v=ZhK9Ibezq1o">Influence Without Authority: Practical Strategies</a></p></li><li><p><a href="https://www.youtube.com/watch?v=mcCEkcnULh4">How to Influence Without Authority at Work</a></p></li><li><p><a href="https://www.classcentral.com/course/edx-leadership-and-influence-11672">Leadership and Influence (edX)</a></p></li><li><p><a href="https://www.codecademy.com/learn/ext-courses/influencing-without-authority">Influencing Without Authority (Codecademy)</a></p></li></ul>]]></content:encoded></item><item><title><![CDATA[ADR : Document Architecture Decisions Effectively]]></title><description><![CDATA[How to Capture, Track, and Learn from Your 
System Architecture Choices]]></description><link>https://tshasankda.substack.com/p/series-2-adr-document-architecture</link><guid isPermaLink="false">https://tshasankda.substack.com/p/series-2-adr-document-architecture</guid><dc:creator><![CDATA[Tharashasank Davuluru]]></dc:creator><pubDate>Tue, 10 Feb 2026 09:00:35 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!s0Ek!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F293e5cca-cada-4554-8de9-1606974690c7_1024x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Architecture Decision Records (ADRs) are concise documents that capture the important architectural decisions made for a system, why they were made, and the consequences. They provide a transparent history of the architecture and help teams understand past choices, make better decisions in the future, and onboard new members faster.</p><p>Without ADRs, teams often repeat debates, lose institutional knowledge, or struggle to understand why a system is designed a certain way. 
ADRs solve this by making decisions explicit, traceable, and easily accessible.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!s0Ek!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F293e5cca-cada-4554-8de9-1606974690c7_1024x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!s0Ek!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F293e5cca-cada-4554-8de9-1606974690c7_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!s0Ek!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F293e5cca-cada-4554-8de9-1606974690c7_1024x1536.png 848w, https://substackcdn.com/image/fetch/$s_!s0Ek!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F293e5cca-cada-4554-8de9-1606974690c7_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!s0Ek!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F293e5cca-cada-4554-8de9-1606974690c7_1024x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!s0Ek!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F293e5cca-cada-4554-8de9-1606974690c7_1024x1536.png" width="1024" height="1536" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/293e5cca-cada-4554-8de9-1606974690c7_1024x1536.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b266ffdc-a108-4a18-b21f-122047bf02a8_1024x1536.png&quot;,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1536,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1045518,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://tshasankda.substack.com/i/186159083?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb266ffdc-a108-4a18-b21f-122047bf02a8_1024x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!s0Ek!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F293e5cca-cada-4554-8de9-1606974690c7_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!s0Ek!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F293e5cca-cada-4554-8de9-1606974690c7_1024x1536.png 848w, https://substackcdn.com/image/fetch/$s_!s0Ek!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F293e5cca-cada-4554-8de9-1606974690c7_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!s0Ek!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F293e5cca-cada-4554-8de9-1606974690c7_1024x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h2>Why ADRs Matter</h2><p>ADRs are not just bureaucratic overhead; they are a critical tool for sustainable architecture. They help teams align on choices, track the evolution of systems, and support continuous improvement. For new team members, ADRs serve as a practical learning resource that explains the &#8220;why&#8221; behind decisions rather than just the &#8220;what.&#8221;</p><p>Consistently maintaining ADRs ensures that architectural knowledge is not lost over time, helps prevent repeated mistakes, and enables confident decision-making in complex systems. 
For more details, see <a href="https://www.techtarget.com/searchapparchitecture/tip/4-best-practices-for-creating-architecture-decision-records">TechTarget ADR Best Practices</a>.</p><div class="pullquote"><p>Want to dive deeper into ADRs? <a href="https://adr.github.io/">This is one of the go-to resources</a>.</p></div><h2>Good Practices for Creating ADRs</h2><p>Creating ADRs effectively requires discipline and consistency. Here are some best practices to follow:</p><ol><li><p><strong>Use a consistent format and template.</strong> Ensure all ADRs follow the same structure so that team members can quickly understand each record.</p></li><li><p><strong>Keep ADRs concise.</strong> Focus on the key decision points and avoid unnecessary detail that does not aid understanding.</p></li><li><p><strong>Document decisions promptly.</strong> Capture the decision as soon as possible after it is made to preserve context and reasoning.</p></li><li><p><strong>Store ADRs in an accessible repository.</strong> Use tools like GitLab or Confluence so all stakeholders can view and reference them easily.</p></li><li><p><strong>Encourage collaboration.</strong> Gather feedback from all relevant stakeholders to ensure decisions are well-informed.</p></li><li><p><strong>Track architectural evolution.</strong> Use ADRs to maintain a clear history of decisions, showing how architecture has evolved over time.</p></li><li><p><strong>Use ADRs for continuous improvement.</strong> Review past decisions to learn lessons and refine architecture practices.</p></li><li><p><strong>Leverage ADRs for onboarding.</strong> New team members can quickly understand the system and its design rationale.</p></li><li><p><strong>Review and update ADRs regularly.</strong> Ensure they remain relevant as the system and context evolve.</p></li></ol><h2>Key Considerations</h2><ul><li><p>ADRs should be maintained for all products and systems, though each team may decide the storage location or 
process.</p></li><li><p>ADRs must be searchable, auditable, and uniquely referenceable.</p></li><li><p>While teams may choose the tools that fit their workflow, Git repositories or Confluence are commonly used for ADR storage.</p></li><li><p>Retention and governance must meet the minimum organizational standards to ensure compliance and long-term usefulness.</p></li></ul><h2>Recommended ADR Template</h2><p>Each ADR should follow a consistent structure, with short, clear titles and logical sections. Titles should use present-tense imperative phrasing, similar to a Git commit message, and stay under 50 characters.</p><h3>Template Structure</h3><p><strong>Title:</strong> Short, action-oriented description of the decision</p><blockquote><p>Example: <code>Use GraphQL for internal APIs</code></p></blockquote><p><strong>1. Status</strong><br>Current state of the decision: Proposed, Accepted, Deprecated, or Superseded.</p><p><strong>2. Context</strong><br>Describe the <strong>problem, issue, or motivation</strong> for this decision. Explain why a decision is necessary and what triggered it.</p><blockquote><p>Example: &#8220;The current REST APIs are causing repeated versioning conflicts and increased overhead in frontend-backend coordination.&#8221;</p></blockquote><p><strong>3. Considered Options</strong><br>List all the options that were considered to solve the problem.</p><blockquote><p>Example: &#8220;1) Continue using REST 2) Switch to GraphQL&#8221;</p></blockquote><p><strong>4. Pros and Cons</strong><br>Provide advantages and disadvantages for each option.</p><blockquote><p>Example:</p><ul><li><p>REST: Well-known, widely supported; difficult to manage multiple versions</p></li><li><p>GraphQL: Flexible, single endpoint, reduces over-fetching; requires learning and standardization</p></li></ul></blockquote><p><strong>5. 
Decision</strong><br>State the chosen option clearly, describing the <strong>change or action being implemented</strong>.</p><blockquote><p>Example: &#8220;Adopt GraphQL for all new internal APIs.&#8221;</p></blockquote><p><strong>6. Consequences</strong><br>Explain what becomes easier or harder due to this decision. Include both intended benefits and potential drawbacks.</p><blockquote><p>Example:</p><ul><li><p>Frontend developers can fetch exactly the data they need</p></li><li><p>Backend teams require training on GraphQL patterns</p></li><li><p>Monitoring and caching patterns must be updated</p></li></ul></blockquote><div><hr></div><h2>Storage Recommendations</h2><ul><li><p>Store ADRs in a central, version-controlled repository for accessibility and traceability.</p></li><li><p>Use Git repositories for code-aligned projects or Confluence for documentation-focused projects.</p></li><li><p>ADRs should be searchable and linked to the relevant project or system.</p></li><li><p>Ensure ADRs are reviewed and updated periodically.</p></li></ul><h2>ADR Tools with Features</h2><p>A variety of ADR tools exist to help teams create, manage, and visualize architecture decisions. 
For a full list, see <a href="https://www.techtarget.com/searchapparchitecture/tip/4-best-practices-for-creating-architecture-decision-records">TechTarget ADR Tools</a>.</p><p>From personal experience, two tools stand out for practical day-to-day use:</p><h3>Backstage ADR Plugin</h3><ul><li><p><strong>Description:</strong> Integrates ADR management directly into the Backstage developer portal.</p></li><li><p><strong>Features:</strong> Visual ADR overview, linking decisions to components and services, easy access for developers.</p></li><li><p><strong>Use case:</strong> Ideal for teams already using Backstage as their internal developer portal, providing a centralized place for architectural decisions.</p></li></ul><h3>Visual Studio Code Extensions</h3><ul><li><p><strong>Description:</strong> In-editor extensions for creating ADRs directly in VS Code.</p></li><li><p><strong>Features:</strong> Predefined templates, scaffolding, and linking ADRs to related files or projects.</p></li><li><p><strong>Use case:</strong> Developers can write ADRs alongside code, enforcing template consistency and saving time.</p></li></ul><blockquote><p>For other ADR tools such as ADR Manager, dotnet-adr, Log4brains, or PyADR, see <a href="https://www.techtarget.com/searchapparchitecture/tip/4-best-practices-for-creating-architecture-decision-records">TechTarget ADR Tools</a></p></blockquote><h2>Closing Thoughts</h2><p>ADRs help teams capture decisions early, track the evolution of systems, and onboard new members efficiently. 
When used consistently, ADRs form the backbone of maintainable and transparent architecture practices.</p><p>By combining ADRs with documentation as code, teams can create architecture documentation that is always up to date, accessible, and useful, setting the foundation for confident decision-making and scalable system design.</p><div><hr></div><h2>ADR Markdown Template (Ready to Use)</h2><p>You can copy and adapt this template for any decision:</p><pre><code><code># [Short, Imperative Title]  
**Status:** Proposed / Accepted / Deprecated / Superseded  

## Context
Describe the issue, motivation, or problem that triggered this decision.

## Considered Options
1. Option A  
2. Option B  
3. Option C  

## Pros and Cons
**Option A:**  
- Pros: ...  
- Cons: ...  

**Option B:**  
- Pros: ...  
- Cons: ...  

**Option C:**  
- Pros: ...  
- Cons: ...  

## Decision
State the chosen option and the action being implemented.

## Consequences
Explain benefits, challenges, and trade-offs resulting from this decision.
</code></code></pre><p>Next blog posts in the series will cover:</p><ul><li><p><strong>Series 3: Documentation as Code</strong> &#8211; How to integrate ADRs and diagrams in Git/Markdown workflows.</p></li><li><p><strong>Series 4: AI-assisted System Documentation</strong> &#8211; Using AI to generate, update, and review architecture documentation automatically.</p></li></ul>]]></content:encoded></item><item><title><![CDATA[Speaking the Same Language: How Product Owners Can Use Gherkin to Communicate with Engineers]]></title><description><![CDATA[Making behaviour explicit, predictable, and actionable across platform and data teams]]></description><link>https://tshasankda.substack.com/p/speaking-the-same-language-how-product</link><guid isPermaLink="false">https://tshasankda.substack.com/p/speaking-the-same-language-how-product</guid><dc:creator><![CDATA[Tharashasank Davuluru]]></dc:creator><pubDate>Wed, 04 Feb 2026 03:10:10 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!H99_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee1cc5a-4516-4240-8110-32afc12fab3f_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Once upon a time, I joined a platform team where a Product Owner had started writing feature specifications in Gherkin. As engineers, we were both surprised and impressed: the clarity was almost magical. Here was someone who could translate abstract product ideas into the exact system behaviours we could test, build, and trust.</p><div class="pullquote"><p>Note: I really enjoyed crafting this story! After writing the draft, I used AI to help polish the grammar and flow while ensuring my original experiences and thoughts remained the heart of the piece.</p></div><p>Gherkin is more than a syntax for test scenarios. It is a bridge between product thinking and engineering execution. 
While Behaviour Driven Development (BDD) and Test Driven Development (TDD) have been part of engineering practices for years, surprisingly few Product Owners use Gherkin as a communication tool. That&#8217;s a missed opportunity, because when used well, it eliminates ambiguity, shortens feedback loops, and aligns everyone to the same definition of &#8220;done.&#8221;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!H99_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee1cc5a-4516-4240-8110-32afc12fab3f_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!H99_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee1cc5a-4516-4240-8110-32afc12fab3f_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!H99_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee1cc5a-4516-4240-8110-32afc12fab3f_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!H99_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee1cc5a-4516-4240-8110-32afc12fab3f_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!H99_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee1cc5a-4516-4240-8110-32afc12fab3f_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!H99_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee1cc5a-4516-4240-8110-32afc12fab3f_1536x1024.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fee1cc5a-4516-4240-8110-32afc12fab3f_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2286903,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://tshasankda.substack.com/i/185806325?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee1cc5a-4516-4240-8110-32afc12fab3f_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!H99_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee1cc5a-4516-4240-8110-32afc12fab3f_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!H99_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee1cc5a-4516-4240-8110-32afc12fab3f_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!H99_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee1cc5a-4516-4240-8110-32afc12fab3f_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!H99_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffee1cc5a-4516-4240-8110-32afc12fab3f_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h3>Why User Stories Aren&#8217;t Always Enough</h3><p>Most of us are used to writing requirements as User Stories:</p><blockquote><p>&#8220;As a user, I want to generate a daily report so I can monitor anomalies.&#8221;</p></blockquote><p>Stories are great. They capture the user, the action, and the expected result. They help us focus on value, cover expected interactions, and provide context. But when the system grows complex &#8212; multiple services, data flows, or dependent APIs &#8212; stories start to fall short. You end up stacking acceptance criteria, notes, exceptions, and examples, bloating Jira tickets with what should have been clear from the start.</p><p>The gaps show up as rework, misunderstandings, and frustrated engineers. Gherkin comes to the rescue. 
By writing behaviour as examples, Product Owners can provide explicit instructions without dictating implementation.</p><h3>What Gherkin Brings to the Table</h3><p>Gherkin introduces a simple syntax:</p><p><strong>Given</strong> &#8212; the initial state<br><strong>When</strong> &#8212; the action performed<br><strong>Then</strong> &#8212; the expected outcome</p><p>This seems trivial at first, but it&#8217;s transformative. Every scenario becomes a conversation starter. Ambiguities are exposed before coding begins. For platform and data teams, this means that pipelines, APIs, and workflows can be specified with precision while still keeping the language accessible to non-technical stakeholders.</p><h3>A Real Platform Engineering Example</h3><p>Consider a banking data platform where transaction data flows into a fraud detection pipeline. A naive requirement might read:</p><blockquote><p>&#8220;The system should process daily transaction files.&#8221;</p></blockquote><p>With Gherkin, behaviour is explicit:</p><p><strong>Scenario: Successful daily ingestion</strong><br>Given the daily transaction file has arrived before 2 AM<br>When the ingestion pipeline starts<br>Then schema validation runs successfully<br>And valid records are enriched with customer metadata<br>And fraud rules are executed<br>And clean data is stored in the warehouse<br>And alerts are triggered for suspicious activity</p><p>Now imagine an edge case:</p><p><strong>Scenario: Corrupted records in transaction file</strong><br>Given the file contains corrupted rows<br>When validation runs<br>Then corrupted records are quarantined<br>And processing continues for valid data<br>And operations receives an error summary</p><p>Every step is explicit. No assumptions. 
No hidden gaps.</p><h3>The Cheat Sheet: Given, When, Then</h3><p>Over time, teams start thinking in this flow naturally:</p><ul><li><p><strong>Given</strong> &#8212; current state of the system</p></li><li><p><strong>When</strong> &#8212; the action or event</p></li><li><p><strong>Then</strong> &#8212; what must be true afterward</p></li><li><p><strong>And / But</strong> &#8212; extend previous steps without breaking clarity</p></li></ul><p>This becomes the backbone of all behaviour specification. A cheat sheet that guides design, testing, and automation. Complex flows, like multi-stage pipelines or API orchestrations, can be broken into multiple scenarios while keeping them readable and verifiable.</p><h3>Complexity, Heuristics, and Controlled Granularity</h3><p>Not every feature needs dozens of scenarios. But as complexity grows, Gherkin exposes it early. You&#8217;ll notice when you keep adding scenarios: maybe your feature is too broad, maybe your user segmentation is unclear, or maybe you can split the feature into smaller behaviours.</p><p>The goal isn&#8217;t exhaustive detail but <strong>enough clarity to communicate intent and allow engineering to implement safely</strong>. 
Think of it as early risk detection before code even hits the repository.</p><h3>Product Owner Practices for Using Gherkin Effectively</h3><ol><li><p><strong>Start from the user goal</strong> &#8212; What outcome matters to them?</p></li><li><p><strong>Define system states and events</strong> &#8212; Given/When/Then for each behaviour.</p></li><li><p><strong>Collaborate with engineers</strong> &#8212; Review scenarios together; ensure feasibility and coverage.</p></li><li><p><strong>Keep the language business-friendly</strong> &#8212; Avoid endpoints, DB columns, or implementation details.</p></li><li><p><strong>Use scenarios to drive tests</strong> &#8212; Engineers can implement step definitions for automation.</p></li><li><p><strong>Review and refine</strong> &#8212; Complexity grows; adjust granularity as the system evolves.</p></li></ol><h3>Why It Matters</h3><p>When Product Owners use Gherkin effectively:</p><ul><li><p>Engineers understand requirements faster and more accurately</p></li><li><p>Quality improves naturally, reducing rework</p></li><li><p>Releases become predictable</p></li><li><p>The team builds trust and shared understanding</p></li></ul><p>In short, it&#8217;s not just a syntax. It&#8217;s a method for turning abstract objectives into measurable, testable outcomes &#8212; the same way OKRs turn strategy into execution for teams.</p><h3>The Takeaway</h3><p>A well-crafted Gherkin feature is like a map. It shows exactly where you are, what happens next, and what &#8220;success&#8221; looks like. 
It turns ambiguity into shared understanding and accelerates delivery without creating bureaucracy.</p><p>For Product Owners in platform and data engineering, this is a superpower: designing behaviours upfront, communicating clearly, and enabling teams to own outcomes confidently.</p>]]></content:encoded></item><item><title><![CDATA[A Practical Guide to Architecture Documentation]]></title><description><![CDATA[Where to Start, What to Document, and How to Build Maintainable Architecture Knowledge]]></description><link>https://tshasankda.substack.com/p/a-practical-guide-to-architecture</link><guid isPermaLink="false">https://tshasankda.substack.com/p/a-practical-guide-to-architecture</guid><dc:creator><![CDATA[Tharashasank Davuluru]]></dc:creator><pubDate>Fri, 30 Jan 2026 10:01:09 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!PHW5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff14cae15-1869-4727-ad18-05759ce98a25_1024x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In many teams and organizations, architecture documentation is either missing, outdated, or created only for compliance. Most engineers and architects have questions around documentation &#8212; what to document, which tools to use, where to find templates, and how to even start learning system architecture properly. 
But for various reasons, these questions are rarely raised openly.</p><p>Over time, I&#8217;ve received consistent feedback from mentees, peers, and team members:</p><ul><li><p>Where are our architecture documents?</p></li><li><p>What tools should we use to create diagrams?</p></li><li><p>Do we have any templates or standards?</p></li><li><p>How do we start learning system architecture properly?</p></li></ul><p>So in today&#8217;s post, I want to focus specifically on architecture documentation &#8212; why it matters, how to approach it, and what views you should ideally cover.</p><p>This blog series is mainly focused on standardizing architecture documentation and sharing best practices to create documentation that is easy to maintain and keep up to date.</p><p>My goal is also to help junior engineers develop interest in architecture concepts early in their careers &#8212; not just coding, but understanding systems as a whole.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PHW5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff14cae15-1869-4727-ad18-05759ce98a25_1024x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PHW5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff14cae15-1869-4727-ad18-05759ce98a25_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!PHW5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff14cae15-1869-4727-ad18-05759ce98a25_1024x1536.png 848w, 
https://substackcdn.com/image/fetch/$s_!PHW5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff14cae15-1869-4727-ad18-05759ce98a25_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!PHW5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff14cae15-1869-4727-ad18-05759ce98a25_1024x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PHW5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff14cae15-1869-4727-ad18-05759ce98a25_1024x1536.png" width="1024" height="1536" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f14cae15-1869-4727-ad18-05759ce98a25_1024x1536.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f17cb69d-02b1-47ac-83ae-10e5bf4ddb96_1024x1536.png&quot;,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1536,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:967922,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://tshasankda.substack.com/i/186151542?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff17cb69d-02b1-47ac-83ae-10e5bf4ddb96_1024x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PHW5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff14cae15-1869-4727-ad18-05759ce98a25_1024x1536.png 424w, 
https://substackcdn.com/image/fetch/$s_!PHW5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff14cae15-1869-4727-ad18-05759ce98a25_1024x1536.png 848w, https://substackcdn.com/image/fetch/$s_!PHW5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff14cae15-1869-4727-ad18-05759ce98a25_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!PHW5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff14cae15-1869-4727-ad18-05759ce98a25_1024x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h3>Core Principles of Good Architecture Documentation</h3><p>Before jumping into tools and diagrams, a few fundamentals matter more than anything else.</p><h4>1. Identify your primary audience</h4><p>Your audience could be:</p><ul><li><p>Business stakeholders</p></li><li><p>Developers</p></li><li><p>Operations teams</p></li><li><p>Security and compliance</p></li><li><p>Leadership</p></li></ul><p>Each group needs different levels of detail. One document cannot serve everyone perfectly.</p><h4>2. Create only what your audience needs</h4><p>A common mistake is documenting everything. More documentation does not mean better documentation.</p><p>Focus on:</p><ul><li><p>What decisions people need to make</p></li><li><p>What information helps them do their job</p></li></ul><h4>3. Keep documentation up to date</h4><p>Outdated architecture is dangerous. It leads to wrong assumptions, broken integrations, and poor decisions.</p><p>If you cannot maintain a diagram &#8212; don&#8217;t create it.</p><h4>4. Make documentation easy to find and accessible</h4><p>Store documentation in places people already use:</p><ul><li><p>Confluence / Notion</p></li><li><p>Git repositories</p></li><li><p>SharePoint / Wiki</p></li></ul><p>Hidden documentation is as good as no documentation.</p><h3>Tools for Creating Architecture Diagrams</h3><p>Architecture documentation is primarily represented using diagrams. There is a wide range of tools available today from simple collaboration tools to specialized enterprise modelling software. The key is choosing the right tool based on purpose, collaboration, and accessibility.</p><h4>1. 
Gliffy &#8211; great for brainstorming and early design</h4><p>In my view, Gliffy works best during the brainstorming and design phase.<br>It allows better collaboration with development teams and makes it easy to sketch ideas quickly during discussions and workshops.</p><p>Best for: early design, workshops, team collaboration.</p><h4>2. Microsoft Visio &#8211; the classic enterprise standard</h4><p>Most people are already familiar with Visio as it has been a standard Microsoft diagramming tool for years.</p><p>It is especially useful when:</p><ul><li><p>Embedding diagrams into MS Word and PowerPoint</p></li><li><p>Working in organisations with Office 365 (often part of the license)</p></li></ul><p>Best for: formal documentation in enterprise environments.</p><h4>3. PlantUML &#8211; diagrams as code</h4><p>PlantUML is an open-source tool that allows users to create diagrams using a plain text language.</p><p>It is a great choice when:</p><ul><li><p>Storing diagrams in Git repositories</p></li><li><p>Working with Markdown documentation</p></li><li><p>Creating sequence diagrams and system flows as code</p></li></ul><p>Best for: DevOps-friendly documentation.</p><p><a href="https://plantuml.com/">https://plantuml.com/</a></p><h4>4. Mermaid &#8211; lightweight open-source alternative</h4><p>Mermaid is another open-source tool similar to PlantUML; it also generates diagrams from plain text.</p><p>It works very well inside:</p><ul><li><p>Markdown files</p></li><li><p>GitHub/GitLab documentation</p></li></ul><p>Best for: quick technical documentation.</p><p><a href="https://mermaid.js.org/">https://mermaid.js.org/</a></p><h4>5. Sparx Enterprise Architect &#8211; full enterprise modelling tool</h4><p>Sparx EA is a comprehensive enterprise architecture and UML modelling platform.</p><p>It is commonly used for:</p><ul><li><p>Detailed UML analysis</p></li><li><p>Formal architecture governance</p></li><li><p>Modelling in large organisations</p></li></ul><p><a href="https://sparxsystems.com/">https://sparxsystems.com/</a></p><h4>6. 
Visual &amp; Collaborative Tools</h4><ul><li><p>Draw.io (diagrams.net) &#8211; free, simple, widely used</p></li><li><p>Lucidchart &#8211; cloud-based, very popular</p></li><li><p>Miro &#8211; excellent for workshops and brainstorming</p></li><li><p>Figma / FigJam &#8211; increasingly used for system flows</p></li></ul><h4>7. Enterprise Modelling Tools</h4><ul><li><p>Archi (ArchiMate)</p></li><li><p>BizzDesign</p></li></ul><h3>Architecture Documentation Categories</h3><p><em>(With Diagram References &amp; Mandatory Key Elements)</em></p><h4>1. Business Context View</h4><p>Reference: <a href="https://c4model.com/#SystemContextDiagram">C4 Model &#8211; System Context Diagram</a></p><h5>Key Elements</h5><ul><li><p>Business goals</p></li><li><p>Stakeholders</p></li><li><p>User personas</p></li><li><p>External partners</p></li><li><p>High-level business processes</p></li></ul><h5>Why it matters</h5><p>Aligns technology with business value.</p><h4>2. Technical Context View</h4><p>Reference: <a href="https://c4model.com/#SystemContextDiagram">C4 System Context Diagram</a></p><h5>Key Elements</h5><ul><li><p>External systems</p></li><li><p>APIs and integrations</p></li><li><p>Data flow directions</p></li><li><p>Communication protocols</p></li><li><p>Ownership of systems</p></li></ul><h5>Why it matters</h5><p>Avoids dependency surprises.</p><h4>3. Logical Solution Overview</h4><p>Reference: <a href="https://c4model.com/#ContainerDiagram">C4 Container Diagram</a></p><h5>Key Elements</h5><ul><li><p>Major services/components</p></li><li><p>Responsibilities</p></li><li><p>Communication paths</p></li><li><p>Technology type (API, DB, queue, UI)</p></li></ul><h5>Why it matters</h5><p>Foundation of the system design.</p><h4>4. 
Interface Overview</h4><p>Reference: <a href="https://www.uml-diagrams.org/component-diagrams.html">UML Component Diagrams</a></p><h5>Key Elements</h5><ul><li><p>APIs</p></li><li><p>Message brokers</p></li><li><p>Protocols</p></li><li><p>Data formats</p></li><li><p>Versioning</p></li></ul><h5>Why it matters</h5><p>Smooth integrations.</p><h4>5. Application / Service Inventory</h4><p>Reference: <a href="https://sparxsystems.com/enterprise_architect_user_guide/17.1/guide_books/ea_application_architecture.html">Application Portfolio Views</a></p><h5>Key Elements</h5><ul><li><p>Service name</p></li><li><p>Owner</p></li><li><p>Technology stack</p></li><li><p>Criticality</p></li><li><p>Lifecycle status</p></li></ul><h5>Why it matters</h5><p>Governance and modernization.</p><h4>6. Data Flow View</h4><p>Reference: <a href="https://sparxsystems.com/enterprise_architect_user_guide/17.1/guide_books/tools_ba_data_flow_diagram.html">Data Flow Diagrams (DFD)</a></p><h5>Key Elements</h5><ul><li><p>Data sources</p></li><li><p>Processing steps</p></li><li><p>Storage</p></li><li><p>Consumers</p></li><li><p>Frequency</p></li></ul><h5>Why it matters</h5><p>Analytics and compliance.</p><h4>7. Conceptual Data Model</h4><p>Reference: <a href="https://www.lucidchart.com/pages/er-diagrams">High-level ER Diagrams</a> , <a href="https://www.thoughtspot.com/data-trends/data-modeling/conceptual-vs-logical-vs-physical-data-models">Conceptual data model</a></p><h5>Key Elements</h5><ul><li><p>Business entities</p></li><li><p>Relationships</p></li><li><p>Definitions</p></li></ul><h5>Why it matters</h5><p>Common understanding.</p><h4>8. Logical Data Model</h4><p>Reference: <a href="https://www.thoughtspot.com/data-trends/data-modeling/conceptual-vs-logical-vs-physical-data-models">Logical data model</a></p><h5>Key Elements</h5><ul><li><p>Attributes</p></li><li><p>Keys</p></li><li><p>Relationships</p></li><li><p>Constraints</p></li></ul><h5>Why it matters</h5><p>Schema design.</p><h4>9. 
Physical Data Model</h4><p>Reference: <a href="https://www.visual-paradigm.com/guide/data-modeling/physical-data-model/">Database Schema Diagrams</a></p><h5>Key Elements</h5><ul><li><p>Tables</p></li><li><p>Indexes</p></li><li><p>Partitions</p></li><li><p>Storage design</p></li></ul><h5>Why it matters</h5><p>Performance.</p><h4>10. Security View</h4><p>Reference: <a href="https://owasp.org/www-community/Threat_Modeling">Threat Modeling</a></p><h5>Key Elements</h5><ul><li><p>Authentication</p></li><li><p>Authorization</p></li><li><p>Encryption</p></li><li><p>Trust boundaries</p></li><li><p>Network zones</p></li></ul><h5>Why it matters</h5><p>Risk control.</p><h4>11. Deployment View</h4><p>Reference: <a href="https://aws.amazon.com/architecture/icons/">Infrastructure Architecture Diagrams</a></p><h5>Key Elements</h5><ul><li><p>Cloud services/servers</p></li><li><p>Environments</p></li><li><p>Scaling</p></li><li><p>Load balancers</p></li><li><p>Monitoring</p></li></ul><h5>Why it matters</h5><p>Reliability.</p><h4>12. Sequence Diagrams</h4><p>Reference: <a href="https://www.uml-diagrams.org/sequence-diagrams.html">UML Sequence Diagrams</a></p><h5>Key Elements</h5><ul><li><p>Actors</p></li><li><p>Components</p></li><li><p>Message flow</p></li><li><p>Error paths</p></li></ul><h5>Why it matters</h5><p>Clear behaviour flow.</p><h4>13. 
Cloud Account / Subscription Model</h4><p>Reference: <a href="https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/ready/landing-zone/">Cloud Landing Zone Architecture</a></p><h5>Key Elements</h5><ul><li><p>Accounts/subscriptions</p></li><li><p>Network segmentation</p></li><li><p>Shared services</p></li><li><p>Access controls</p></li></ul><h5>Why it matters</h5><p>Security and cost control.</p><h4>Additional Highly Recommended Sections</h4><h5>Architecture Decision Records (ADR)</h5><p>Reference: <a href="https://adr.github.io/">https://adr.github.io/</a></p><h5>Key Elements</h5><ul><li><p>Context</p></li><li><p>Decision</p></li><li><p>Alternatives</p></li><li><p>Consequences</p></li></ul><h5>Non-Functional Requirements</h5><h5>Key Elements</h5><ul><li><p>Performance</p></li><li><p>Availability</p></li><li><p>Scalability</p></li><li><p>Disaster recovery</p></li><li><p>Security</p></li></ul><h5>Technology Standards</h5><h5>Key Elements</h5><ul><li><p>Approved tools</p></li><li><p>Frameworks</p></li><li><p>Cloud services</p></li><li><p>Design patterns</p></li></ul><h4>Blog Series Roadmap</h4><p>This post is part of a larger blog series focused on architecture documentation standardization and best practices.</p><p>Upcoming posts will cover:</p><h5>&#128204;<a href="https://tshasankda.substack.com/p/series-2-adr-document-architecture"> Series 2 &#8211; Architecture Decision Records (ADR)</a></h5><p>What ADR is, why it matters, and how to write good decisions.</p><h5>&#128204; Series 3 &#8211; Documentation as Code</h5><p>Using Git, Markdown, PlantUML, Mermaid to keep docs always updated.</p><h5>&#128204; Series 4 &#8211; Using AI for System Documentation</h5><p>How AI can generate, update, and review architecture docs.</p><h3>Final Thoughts</h3><p>Architecture documentation is not about drawing pretty diagrams.</p><p>It is about:</p><ul><li><p>Clear communication</p></li><li><p>Better decisions</p></li><li><p>Long-term system health</p></li><li><p>Faster 
onboarding</p></li><li><p>Less chaos</p></li></ul><p>Start small. Document what matters. Keep it updated. Make it accessible.</p>]]></content:encoded></item><item><title><![CDATA[OKRs in Action: Turning Platform Vision into Measurable Impact]]></title><description><![CDATA[How clarity, causality, and continuous measurement build high-performing data and engineering teams]]></description><link>https://tshasankda.substack.com/p/okrs-in-action-turning-platform-teams</link><guid isPermaLink="false">https://tshasankda.substack.com/p/okrs-in-action-turning-platform-teams</guid><dc:creator><![CDATA[Tharashasank Davuluru]]></dc:creator><pubDate>Tue, 27 Jan 2026 10:02:23 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!DN4t!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdefc3da9-123a-4c37-9503-da6d4fe510da_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>At the heart of every high-performing platform or data team is clarity &#8212; clarity about purpose, clarity about priorities, and clarity about outcomes. For years, I watched engineers work tirelessly, close tickets quickly, and ship features continuously, yet the platform often remained fragile. Incidents repeated in familiar patterns, pipelines broke the same way, and reliability improved slowly despite enormous effort. The problem was never motivation. 
It was direction.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DN4t!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdefc3da9-123a-4c37-9503-da6d4fe510da_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DN4t!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdefc3da9-123a-4c37-9503-da6d4fe510da_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!DN4t!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdefc3da9-123a-4c37-9503-da6d4fe510da_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!DN4t!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdefc3da9-123a-4c37-9503-da6d4fe510da_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!DN4t!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdefc3da9-123a-4c37-9503-da6d4fe510da_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DN4t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdefc3da9-123a-4c37-9503-da6d4fe510da_1536x1024.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/defc3da9-123a-4c37-9503-da6d4fe510da_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1073599,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://tshasankda.substack.com/i/185803984?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdefc3da9-123a-4c37-9503-da6d4fe510da_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DN4t!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdefc3da9-123a-4c37-9503-da6d4fe510da_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!DN4t!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdefc3da9-123a-4c37-9503-da6d4fe510da_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!DN4t!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdefc3da9-123a-4c37-9503-da6d4fe510da_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!DN4t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdefc3da9-123a-4c37-9503-da6d4fe510da_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Everything changed when I began using OKRs not as a reporting exercise, but as a way to deliberately redesign systems and measure meaningful outcomes. Reading <em>Measure What Matters</em> by John Doerr helped me understand that objectives without measurable outcomes are just intentions. Experience taught me something deeper: metrics alone don&#8217;t create improvement. Improvement happens when teams change the systems that produce those metrics. This insight now shapes how I run quarterly and half-yearly planning sessions with platform and data teams. We focus not on tasks alone, but on impact, learning, and the behaviour of the systems we operate.</p><h3>Why OKRs Matter in Platform and Data Engineering</h3><p>Most engineering goals sound sensible on paper: improve reliability, reduce incidents, make pipelines faster. 
Yet these are aspirations, not mechanisms for change. High-performing platform teams don&#8217;t succeed by working harder. They succeed by improving the underlying systems that naturally produce reliability, speed, and trust.</p><p>OKRs shift the focus from shipping more work to engineering better outcomes. A DevOps team succeeds when recovery becomes automatic. A data platform succeeds when failures are rare and visible. A BI team succeeds when trusted data is adopted by default. OKRs make these system-level improvements intentional, measurable, and aligned with broader business priorities.</p><h3>Objectives, Outcomes, and the Missing Link: Causality</h3><p>An objective defines direction. A key result proves that something changed. But high-impact OKRs include a third, often overlooked layer &#8212; causality.</p><p>Causality asks: <em>what actions or system changes will directly produce the desired outcome?</em> Without this, OKRs risk becoming little more than scoreboard watching or checkbox exercises. With it, OKRs become engineering experiments.</p><p>For example, &#8220;Reduce MTTR to 15 minutes&#8221; is measurable but vague. Reframed causally: &#8220;Reduce MTTR to 15 minutes by automating rollback, introducing SLO-based alerting, and removing manual recovery steps&#8221; is actionable. The metric tells you if it worked. The causal design explains why it should.</p><p>This logic applies across all domains &#8212; technical, operational, and even human productivity. Every key result must break down the objective into measurable outcomes that matter, clearly linking action &#8594; system change &#8594; benefit.</p><h3>Bridging Strategy and Daily Engineering Work</h3><p>Leadership goals often live at a high level: faster innovation, lower operational risk, better data-driven decisions. OKRs translate these into concrete engineering levers.</p><p>If the business needs faster experimentation, platform OKRs focus on deployment speed and recovery. 
If the business needs trustworthy analytics, data OKRs focus on freshness, validation, and visibility. Each key result becomes a controllable system input that drives measurable business impact.</p><p>Even behavioral OKRs &#8212; mentoring, documentation, process improvements &#8212; fit naturally. They ensure that achieving technical outcomes does not come at the cost of team health or long-term capability.</p><h3>Common Causal Mistakes Teams Make</h3><p>Even skilled teams fall into familiar traps. One is confusing activity with impact: migrating tooling, building dashboards, or writing scripts &#8212; these are tasks, not outcomes. Another is metric chasing without system change: &#8220;reduce incidents by 50%&#8221; without specifying how. The worst is mistaking busyness for progress: lots of work completed, but no improvement in reliability, speed, or quality.</p><p>Strong OKRs always connect: action &#8594; system change &#8594; measurable outcome.</p><h3>Real OKRs Across Teams</h3><h4>Platform / DevOps &#8212; Designing Reliability Into the System</h4><p>When deployments kept failing, the team stopped chasing individual incidents and reframed their direction around a single objective: make production deployments predictable, low-risk, and uneventful. Reliability was no longer something hoped for after each release; it became something engineered into the delivery system itself.</p><p>Over the quarter, Mean Time to Recovery dropped from ninety minutes to under fifteen as automated rollback mechanisms and service-level alerting replaced manual intervention. Deployment frequency increased from weekly releases to daily changes by removing approval bottlenecks and introducing automated validation gates. At the same time, pipeline success rates climbed to 99.9 percent as testing and infrastructure checks became part of every release.</p><p>What changed wasn&#8217;t just tooling. The system itself evolved. 
Engineers stopped firefighting and started trusting the platform. The objective wasn&#8217;t automation for its own sake &#8212; it was building a delivery pipeline that naturally produced reliability.</p><h4>Data Platform &#8212; Making Trust a System Property</h4><p>The data platform team faced a familiar problem: pipelines were breaking constantly, but failures were often discovered hours or days later. Rather than reacting faster, the team reframed their objective: build reliable and observable data systems.</p><p>Schema validation at ingestion, real-time freshness monitoring, and automated retries for transient failures reduced defects by more than 70% within one quarter. Data freshness SLAs consistently stayed within target windows, and failures were detected within minutes instead of hours. Trust grew organically. Analysts spent less time validating numbers and more time using insights. Reliability was no longer heroic effort &#8212; it became the system default.</p><h4>BI &amp; Analytics &#8212; Turning Usage Into a Natural Outcome</h4><p>The BI team initially believed their challenge was a lack of dashboards. More reports didn&#8217;t help adoption. The real objective: enable confident self-service analytics. Standardized certified gold datasets, optimized dashboard performance, and focused enablement sessions helped users understand data meaning and limitations. Usage of certified datasets rose over 50%, while ad-hoc requests dropped. Trust plus speed changed behaviour; adoption became the natural outcome.</p><h4>Cloud &amp; Cost Management &#8212; Efficiency Without Slowing Teams</h4><p>Cost optimization often conflicts with delivery speed. The team set an objective to run infrastructure efficiently without compromising velocity. Autoscaling, idle resource cleanup, and real-time cost visibility reduced spend by 25% while deployment frequency remained unchanged. 
Efficiency became a shared responsibility; impact was measurable, immediate, and visible.</p><h3>Causal OKR Templates Teams Can Reuse</h3><p>Objective: Improve a system outcome that matters.<br>Key Results: Each KR describes what measurable result you want and what causal action will drive it.</p><p>Reliability / Stability: Reduce downtime to &lt;30 minutes/month via automated recovery; lower incident rate by 50% with validation checks; remove manual interventions through runbook automation.</p><p>Data Quality / Accuracy: Reduce defects by 40% using schema/freshness checks; achieve 99.9% SLA with monitoring and automated retries; detect failures within 10 minutes.</p><p>Delivery / Speed: Increase deployment frequency by removing manual approvals; reduce lead time via automated testing; maintain failure rate &lt;2% using rollback automation.</p><h3>Reclaiming Time: Golden Hours for High-Impact Work</h3><p>After optimizing systems, the next frontier is time. Every minute spent on repetitive tasks is a minute lost for learning, experimentation, and platform improvement. &#8220;Golden hours&#8221; are blocks of protected, high-value time.</p><p>Objective: Reclaim engineer time for learning and high-value work.<br>Key Results: Reduce manual deployment verification by 2 hours/engineer/week; cut incident recovery from 90 to 30 minutes with automated alerts, rollback, and monitoring; ensure 2 hours/week of protected &#8220;golden hour&#8221; for experimentation and learning.</p><p>Measurement matters: time saved is observable via logs, telemetry, or calendar tracking. Linking productivity to system-level improvements turns minutes into measurable outcomes. 
Teams that embrace this see engineers planning work that actually moves the needle.</p><h3>Universal / Personal OKRs</h3><p>Even outside core engineering, causal OKRs apply:</p><p>Productivity / Time Management: Increase focused work hours to 30/week by blocking deep work; reduce context switching; complete 90% of priority tasks.</p><p>Learning / Skill Growth: Complete 3 courses + 2 projects in 6 months; apply learnings to optimize pipelines; share insights in team sessions monthly.</p><p>Health: Reduce body weight by 5 kg in 3 months through diet and exercise; cut sugar intake 50%; track weekly progress.</p><p>The principle is universal: measurable outcomes + causal actions = real impact.</p><h3>Tracking, Evolving, Celebrating, and Reflecting</h3><p>OKRs are most powerful when they are visible, tracked, and reviewed. A simple cadence works:</p><p>Track &#8594; Evolve &#8594; Celebrate &amp; Reflect<br>Quarterly or half-yearly reflection sessions ask: Are we on track? Are key results driving meaningful system change? Which objectives should feed into the QBR? Did any OKR fail, and what did we learn?</p><p>Individual OKRs should be 8&#8211;10 total, divided between performance and behavioral outcomes. Setting them 1 year in advance is fine for strategy, but remain flexible &#8212; change is expected. Leadership isn&#8217;t about perfection; it&#8217;s about learning, adapting, and improving continuously.</p><h3>Measurement is Part of Engineering Work</h3><p>You can&#8217;t improve what you can&#8217;t observe. MTTR requires incident telemetry. Pipeline reliability requires failure metrics. Time savings require logs or calendar tracking. These aren&#8217;t overhead; they&#8217;re foundational platform engineering. Every OKR cycle should improve both systems and observability.</p><h3>How OKRs Change Culture</h3><p>Over time, the shift becomes visible. 
Engineers stop asking, &#8220;What ticket should I pick up?&#8221; They start asking, &#8220;What will actually improve reliability, speed, quality, or learning?&#8221; Ownership grows, learning accelerates, and impact becomes tangible. OKRs stop being goals and become feedback loops guiding continuous improvement.</p><h3>Final Thoughts</h3><p>OKRs work when they connect vision to execution, action to outcome, and effort to system improvement. Causality is what makes that connection real. When objectives set direction, key results measure change, and teams design the system improvements that produce results, OKRs stop being management theatre. They become the operating system of high-performing platform and data teams.</p><p>Not about doing more work. About building better systems that naturally produce better outcomes. When golden hours, learning time, and measurement are included, teams don&#8217;t just work &#8212; they grow, innovate, and lead.</p>]]></content:encoded></item><item><title><![CDATA[Good to Great : Leading Teams When Everything Keeps Changing]]></title><description><![CDATA[Practical leadership lessons from financial sector and platform teams]]></description><link>https://tshasankda.substack.com/p/good-teams-to-great-teams-leading</link><guid isPermaLink="false">https://tshasankda.substack.com/p/good-teams-to-great-teams-leading</guid><dc:creator><![CDATA[Tharashasank Davuluru]]></dc:creator><pubDate>Thu, 22 Jan 2026 21:00:10 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!MPGr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2db8f9fa-6dd1-45c2-8b93-4f608dad1512_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="pullquote"><p><strong>I strongly believe that people are the driving force behind every success story.</strong></p></div><p>Working in the financial and banking domain, within the business division of technology, I have 
experienced constant organizational change. Over the last eight years, I have moved across four different teams, each with different priorities, structures, and expectations. Change was never an exception. It was the norm.</p><p>What made this journey meaningful was not systems or processes alone, but the people and leaders I worked with. I had supportive managers, matrix managers, and leaders who were cooperative and open. These experiences shaped how I think about leadership today.</p><p>In a world increasingly shaped by automation, digital transformation, and shifting business models, organizations that succeed are the ones that do not lose sight of their people.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MPGr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2db8f9fa-6dd1-45c2-8b93-4f608dad1512_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MPGr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2db8f9fa-6dd1-45c2-8b93-4f608dad1512_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!MPGr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2db8f9fa-6dd1-45c2-8b93-4f608dad1512_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!MPGr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2db8f9fa-6dd1-45c2-8b93-4f608dad1512_1024x1024.png 1272w, 
https://substackcdn.com/image/fetch/$s_!MPGr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2db8f9fa-6dd1-45c2-8b93-4f608dad1512_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MPGr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2db8f9fa-6dd1-45c2-8b93-4f608dad1512_1024x1024.png" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2db8f9fa-6dd1-45c2-8b93-4f608dad1512_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/76dbf269-7d53-4f99-985d-0641c20c73e5_1024x1024.png&quot;,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2423191,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://tshasankda.substack.com/i/185456043?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76dbf269-7d53-4f99-985d-0641c20c73e5_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MPGr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2db8f9fa-6dd1-45c2-8b93-4f608dad1512_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!MPGr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2db8f9fa-6dd1-45c2-8b93-4f608dad1512_1024x1024.png 848w, 
https://substackcdn.com/image/fetch/$s_!MPGr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2db8f9fa-6dd1-45c2-8b93-4f608dad1512_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!MPGr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2db8f9fa-6dd1-45c2-8b93-4f608dad1512_1024x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">My POD</figcaption></figure></div><div><hr></div><h2>The Foundation of High Performance</h2><p>Building 
high-performance teams is not easy, especially in high-pressure and fast-moving environments like banking, platform engineering, and shared services.</p><p>Most of what I practice today comes from leaders who guided me early in my career. Over the last five to six years, I have applied these lessons while building and running platform, service, and application teams that support multiple stakeholders.</p><p>The objective was never speed alone. It was sustainable delivery through people. These principles consistently helped move teams from good to great.</p><div><hr></div><h2>1. Prioritize a Thriving Culture</h2><blockquote><p>Culture needs active maintenance, not just well-written policies.</p></blockquote><p>While benefits and flexibility matter, culture is shaped more by <strong>daily signals</strong> than by documents. One practice that I learned from a manager, and that has stayed with me, is the idea of a <strong>team health score or mood check</strong>.</p><p>This does not need to be complex. A quick check-in during stand-ups or weekly syncs helps leaders sense energy levels, stress, and early signs of burnout. Over time, these small signals provide far more insight than any annual engagement survey.</p><p>Another practice we adopted was protecting <strong>golden hours for learning</strong>. This meant consciously carving out focused time for upskilling, knowledge sharing, or deep work without meetings. In platform and banking teams, where operational load is high, learning time is usually the first thing to disappear unless it is explicitly protected.</p><p>When organizations regularly review how teams actually work and support practices like health checks and learning time, culture stays aligned with reality, not assumptions. 
Without this, even well-intentioned policies slowly drift away from day-to-day experience.</p><h3>Actionable Practices:</h3><p><strong>Team or Pod Charter</strong><br>In my crew, we created a simple charter that defined purpose, responsibilities, and ways of working. This lightweight framework reduced ambiguity and improved focus without introducing heavy process.</p><p><strong>The Five-Minute Rule</strong><br>Starting meetings with five minutes of open discussion helped teams transition into complex technical conversations. This small habit improved context sharing and trust, especially during long or demanding sessions.</p><div><hr></div><h2>2. Embrace Open and Accessible Leadership</h2><p>At every stage of my journey, I worked with leaders who were approachable and available. That had a lasting influence on how I lead today.</p><p>People want to understand how their work connects to outcomes. When that connection is missing, engagement fades. In banking and platform environments, where work often spans multiple systems and timelines, this connection becomes critical.</p><p>Approachable leadership creates psychological safety. Teams that feel safe will raise risks early, challenge assumptions, and experiment responsibly. As a leader, you do not need to have all the answers. What matters is:</p><ul><li><p><strong>Presence</strong>: Being available when teams hit blockers</p></li><li><p><strong>Transparency</strong>: Connecting daily work to business and platform outcomes</p></li><li><p><strong>Safety</strong>: Making it easy for someone to message you with a concern or an early-stage idea</p></li></ul><div class="pullquote"><p>When teams do this naturally, trust is already in place.</p></div><h2>3. Foster Continuous Team Development</h2><p>Teams must evolve at the same pace as the platforms they build. In platform and service teams, we consciously avoided the <strong>SME trap</strong>. 
Relying on a single subject matter expert creates bottlenecks, increases risk, and limits team growth.</p><p>Instead, we focused on continuous upskilling, shared ownership, and knowledge distribution. The mindset shifted from <em>&#8220;who owns this task&#8221;</em> to <em>&#8220;how does this help our customers and stakeholders&#8221;</em>.</p><p>Growth is not only about learning new tools or frameworks. It is about learning how to think together, how to adapt under pressure, and how to respond to change without fear.</p><div><hr></div><h2>From Principles to Outcomes</h2><p>One hard lesson I learned along the way is this: If a minor infrastructure change requires detailed explanation to a junior engineer, the issue is rarely the person. It is usually the architecture, the practices, or both.</p><p>When this happens, it is not an individual failure. It is a <strong>team and system failure</strong>. Complexity has leaked into places where it should not exist. Clear paths are missing. Documentation, ownership, or automation is not doing its job.</p><p>As platform and data teams, our responsibility is to design systems that are safe to operate, easy to reason about, and forgiving by default. Teams should be able to make small changes confidently without fear or excessive hand-holding.</p><p>This shift in thinking led to tangible outcomes:</p><ul><li><p>We started saying <strong>&#8220;yes&#8221;</strong> to user and stakeholder requests with confidence &#8212; <em>&#8220;I can get that done tomorrow&#8221;</em> &#8212; because of a wonderful team that took ownership and consistently delivered.</p></li><li><p>Development environments were optimized for learning and speed: <strong>yesterday&#8217;s data with today&#8217;s code, across multiple environments</strong>. 
This was made possible through our <strong>ephemeral environments</strong>, an idea that came from the team and was executed by a junior engineer, and they give everyone the freedom to test, learn, and iterate safely.</p></li><li><p>Production remained stable and predictable: <strong>today&#8217;s data with yesterday&#8217;s code</strong>, running in a single, controlled environment.</p></li></ul><p>These outcomes were not achieved by pushing people harder. They came from fixing the platform, clarifying ownership, improving documentation, and simplifying paths.</p><div><hr></div><h2>From Good Teams to Great Teams</h2><p>Great teams do not happen by chance. They are built on:</p><ul><li><p>A culture that evolves with the organization</p></li><li><p>Trust created through approachable leadership</p></li><li><p>A shared commitment to continuous growth</p></li></ul><p>Every crew, pod, or platform team must recognize that its true strength lies in its people. When flexibility, collaboration, and development are prioritized, teams consistently deliver strong and sustainable results.</p><p>If you are a new leader, a team lead, or someone scaling teams in a changing environment, I hope these lessons help. I learned them from leaders who generously shared their frameworks with me, and I try to pass them forward with humility.</p><p>If this resonates and you want to exchange ideas on leadership, platform teams, or engineering culture, feel free to connect. We grow better by learning together.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;a97821b4-139c-444c-b93d-ea1455d3bcfe&quot;,&quot;caption&quot;:&quot;I still remember my heart racing before that fateful project meeting. The room felt smaller, the air thicker, and every eye seemed to question my every move. 
I braced myself for the storm, only to re&#8230;&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;showDescription&quot;:true,&quot;showImage&quot;:true,&quot;size&quot;:&quot;md&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Navigating Difficult Conversations &#8212; The 3 Conversations Framework&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:28659751,&quot;name&quot;:&quot;Tharashasank Davuluru&quot;,&quot;bio&quot;:&quot;I&#8217;ve been living in Krakow, Poland, for the past eight years, with a background in engineering and technology. &quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bf2a8721-d298-4f9f-b6a7-bc77ba443e9f_1200x1200.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-05-27T04:27:37.993Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/$s_!4Bke!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bb3ffdc-3a34-43bd-af08-46e1f961c3b2_1536x1024.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://tshasankda.substack.com/p/navigating-difficult-conversations&quot;,&quot;section_name&quot;:&quot;Journeys &amp; Lessons&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:164302764,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:17,&quot;comment_count&quot;:5,&quot;publication_id&quot;:4523895,&quot;publication_name&quot;:&quot;Data, Chai &amp; Dialogue - The Leadership Lounge&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!cMG_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ae5d6ee-8e59-41a5-ae04-a55ee9c5b03e_1024x1024.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div 
class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;a20abb6b-19b4-40ea-adea-a51ba5174768&quot;,&quot;caption&quot;:&quot;Leadership is about understanding people, not just telling them what to do. It&#8217;s about knowing who you're leading and connecting with them on a deeper level. When leaders understand the natural stren&#8230;&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;showDescription&quot;:true,&quot;showImage&quot;:true,&quot;size&quot;:&quot;md&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;The Colors of Leadership: A Smarter Way to Lead and Understand Your Team&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:28659751,&quot;name&quot;:&quot;Tharashasank Davuluru&quot;,&quot;bio&quot;:&quot;I&#8217;ve been living in Krakow, Poland, for the past eight years, with a background in engineering and technology. &quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bf2a8721-d298-4f9f-b6a7-bc77ba443e9f_1200x1200.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-04-19T13:22:35.603Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/$s_!Kd-A!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87d54ef9-1f30-4437-927f-7ad193dc0497_1024x1024.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://tshasankda.substack.com/p/the-colors-of-leadership-a-smarter&quot;,&quot;section_name&quot;:&quot;Journeys &amp; Lessons&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:161671386,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:37,&quot;comment_count&quot;:13,&quot;publication_id&quot;:4523895,&quot;publication_name&quot;:&quot;Data, Chai &amp; Dialogue - The Leadership 
Lounge&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!cMG_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ae5d6ee-8e59-41a5-ae04-a55ee9c5b03e_1024x1024.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;d293517b-2b8e-4dc3-9c7c-d701d5b5242a&quot;,&quot;caption&quot;:&quot;If I Had a Superpower: The Gift of Listening&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;showDescription&quot;:true,&quot;showImage&quot;:true,&quot;size&quot;:&quot;md&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;The Superpower Every Leader Needs: Deep Listening&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:28659751,&quot;name&quot;:&quot;Tharashasank Davuluru&quot;,&quot;bio&quot;:&quot;I&#8217;ve been living in Krakow, Poland, for the past eight years, with a background in engineering and technology. 
&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bf2a8721-d298-4f9f-b6a7-bc77ba443e9f_1200x1200.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-04-24T23:28:21.094Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/$s_!C3Z3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a698237-caa1-4871-b546-26de387ecf6a_1024x608.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://tshasankda.substack.com/p/the-superpower-every-leader-needs&quot;,&quot;section_name&quot;:&quot;The Growth Lab&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:161669902,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:21,&quot;comment_count&quot;:0,&quot;publication_id&quot;:4523895,&quot;publication_name&quot;:&quot;Data, Chai &amp; Dialogue - The Leadership Lounge&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!cMG_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ae5d6ee-8e59-41a5-ae04-a55ee9c5b03e_1024x1024.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div>]]></content:encoded></item><item><title><![CDATA[The Career Engineering Framework ]]></title><description><![CDATA[A practical guide for engineers, managers, and technical leaders]]></description><link>https://tshasankda.substack.com/p/the-career-engineering-framework</link><guid isPermaLink="false">https://tshasankda.substack.com/p/the-career-engineering-framework</guid><dc:creator><![CDATA[Tharashasank Davuluru]]></dc:creator><pubDate>Tue, 30 Dec 2025 22:03:19 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!cMG_!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ae5d6ee-8e59-41a5-ae04-a55ee9c5b03e_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1><strong>Introduction</strong></h1><p>During my recent job search, I noticed a common pattern. Different companies use different titles for similar engineering roles. Sometimes the same title means very different responsibilities depending on the product or team. Around this time, a few of my juniors asked me direct questions: </p><ul><li><p><em>How do I choose the right position? </em></p></li><li><p><em>How do I compare what I&#8217;m doing today with what the market is offering? </em></p></li><li><p><em>How do I understand where I actually fit? </em></p></li></ul><p>To give them a proper answer, I started building this blog. I also looked at how various top companies structure their engineering functions and how they define growth. Every organization has its own style, but the fundamentals are largely the same. Whatever your discipline (software, quality, data, ML, security, SRE, TPM, or engineering management), the core principles of scope, responsibility, autonomy, and impact do not change. You simply become a subject matter expert in your chosen area.</p><p>This blog brings everything together as a simple and practical <strong>Career Engineering Framework</strong>. It explains how engineering roles are structured, how levels grow, and how engineers and managers can use these principles to choose roles confidently and plan long-term development with clarity.</p><h1><strong>1. 
What a Career Framework Is and Is Not</strong></h1><h3><strong>What it is</strong></h3><ul><li><p>A transparent view of expectations at each level</p></li><li><p>A clear way to understand growth</p></li><li><p>A guide for impact, behaviour, and technical depth</p></li><li><p>A tool to drive better feedback and performance conversations</p></li></ul><h3><strong>What it is not</strong></h3><ul><li><p>Not a checklist for promotion</p></li><li><p>Not a fixed list of tasks for every engineer</p></li><li><p>Not a replacement for mentorship or good judgement</p></li></ul><div class="pullquote"><p>It is simply <strong>clarity</strong>: nothing more, nothing less.</p></div><h1><strong>2. The Structure of the Framework</strong></h1><p>Every engineering role and level can be understood through three common anchors:</p><h3><strong>A. Level Expectations</strong></h3><p>This defines how an engineer grows in:</p><ul><li><p><strong>Scope</strong> &#8211; size and complexity of problems</p></li><li><p><strong>Autonomy</strong> &#8211; level of guidance required</p></li><li><p><strong>Collaboration Reach</strong> &#8211; who and what your work impacts</p></li></ul><h3><strong>B. Core Responsibility Pillars</strong></h3><p>These apply to <em>all</em> engineering roles:</p><ol><li><p>Results</p></li><li><p>Direction</p></li><li><p>Talent</p></li><li><p>Culture</p></li></ol><h3><strong>C. Craft Responsibilities</strong></h3><p>This refers to role-specific technical depth such as:</p><ul><li><p>Coding</p></li><li><p>Testing</p></li><li><p>Data modelling</p></li><li><p>ML experimentation</p></li><li><p>SRE reliability patterns</p></li><li><p>Security architecture</p></li><li><p>Program leadership</p></li><li><p>People leadership</p></li></ul><div class="pullquote"><p>This structure keeps things consistent across disciplines.</p></div><h1><strong>3. Understanding Levels: Scope, Autonomy, and Collaboration</strong></h1><p>Engineering levels follow a predictable growth arc. 
To keep things simple and relatable across companies, this framework uses both naming styles:</p><ul><li><p><strong>L1&#8211;L6</strong> (commonly used in many global organizations)</p></li><li><p><strong>IC2&#8211;IC6</strong> (commonly used in tech companies)</p></li></ul><p>They map naturally, so you can understand where you stand regardless of the title used in any job description.</p><h2><strong>L1 &#8211; Entry Level (IC1/IC2 Equivalent)</strong></h2><p><strong>Focus:</strong> Learning and executing with support.<br><strong>Characteristics:</strong></p><ul><li><p>Takes clearly defined tasks</p></li><li><p>Learns tools, codebase, and processes</p></li><li><p>Requires regular guidance</p></li><li><p>Building consistency and fundamentals</p></li></ul><h2><strong>L2 &#8211; Junior Engineer (IC2/IC3 Equivalent)</strong></h2><p><strong>Focus:</strong> Executing independently on low-complexity tasks.<br><strong>Characteristics:</strong></p><ul><li><p>Works on small features end-to-end</p></li><li><p>Understands team workflows</p></li><li><p>Asks for help when blocked</p></li><li><p>Starting to recognize risks, but relies on senior engineers</p></li></ul><h2><strong>L3 &#8211; Mid-Level Engineer (IC3/IC4 Equivalent)</strong></h2><p><strong>Focus:</strong> Independent execution and steady ownership.<br><strong>Characteristics:</strong></p><ul><li><p>Designs and ships features independently</p></li><li><p>Works with limited direction</p></li><li><p>Handles moderate ambiguity</p></li><li><p>Collaborates with cross-functional partners</p></li><li><p>Begins contributing to design discussions</p></li></ul><h2><strong>L4 &#8211; Senior Engineer (IC4/IC5 Equivalent)</strong></h2><p><strong>Focus:</strong> Leading complex systems and improving team execution.<br> <strong>Characteristics:</strong></p><ul><li><p>Leads medium-to-large initiatives</p></li><li><p>Designs components or subsystems</p></li><li><p>Makes strong architectural decisions</p></li><li><p>Mentors junior 
engineers</p></li><li><p>Identifies long-term risks early</p></li><li><p>Drives cross-team conversations when needed</p></li></ul><h2><strong>L5 &#8211; Staff Engineer (IC5/IC6 Equivalent)</strong></h2><p><strong>Focus:</strong> System-wide ownership and high-impact leadership.<br><strong>Characteristics:</strong></p><ul><li><p>Owns major product areas or platforms</p></li><li><p>Leads multi-team efforts</p></li><li><p>Sets technical direction for a domain</p></li><li><p>Deep domain expertise</p></li><li><p>Strong influence across teams</p></li><li><p>Solves problems that have no predefined path</p></li></ul><h2><strong>L6 &#8211; Principal / Senior Staff (IC6+ Equivalent)</strong></h2><p><strong>Focus:</strong> Organization-wide technical strategy and long-term impact.<br><strong>Characteristics:</strong></p><ul><li><p>Shapes technical strategy for large org areas</p></li><li><p>Designs long-lived systems and frameworks</p></li><li><p>Mentors senior engineers and managers</p></li><li><p>Handles high ambiguity and high-risk decisions</p></li><li><p>Influence extends across products or business units</p></li><li><p>Trusted technical advisor for leadership</p></li></ul><h3><strong>How Levels Grow?</strong></h3><p>Across L1 &#8594; L6, four arcs evolve:</p><ol><li><p><strong>Scope<br></strong>Small tasks &#8594; features &#8594; systems &#8594; multi-team platforms &#8594; organization-wide strategy</p></li><li><p><strong>Autonomy<br></strong>Guided &#8594; Independent &#8594; Leading &#8594; Defining direction</p></li><li><p><strong>Collaboration Reach<br></strong>Immediate team &#8594; Multiple teams &#8594; Product area &#8594; Whole organization</p></li><li><p><strong>Impact<br></strong>Execution &#8594; Ownership &#8594; Influence &#8594; Strategy</p></li></ol><p>As you grow, you handle:</p><ul><li><p>bigger problems</p></li><li><p>deeper complexity</p></li><li><p>longer time horizons</p></li><li><p>broader influence</p></li></ul><div class="pullquote"><p>This is 
the natural progression of an engineering career.</p><p>Higher levels = bigger problems, longer horizons, broader influence.</p></div><h1><strong>4. What &#8220;Impact&#8221; Really Means?</strong></h1><p>Impact is the heart of career growth. It is not the number of tasks you close or the amount of code you write. Impact usually comes through five levers:</p><h3><strong>1. Deep domain expertise</strong></h3><p>Becoming the go-to person in an important product or system area.</p><h3><strong>2. Innovation</strong></h3><p>Finding faster, simpler, or more effective solutions.</p><h3><strong>3. Product thinking</strong></h3><p>Balancing user needs, engineering quality, and delivery speed.</p><h3><strong>4. Project leadership</strong></h3><p>Handling large, ambiguous, cross-team work.</p><h3><strong>5. Technical leadership and mentorship</strong></h3><p>Uplifting people around you through guidance and clarity.</p><div class="pullquote"><p>Your mix of these levers defines your growth style.</p></div><h1><strong>5. Core Responsibilities (Across All Engineering Roles)</strong></h1><h2><strong>A. Results</strong></h2><ul><li><p>Deliver measurable outcomes</p></li><li><p>Solve real problems end-to-end</p></li><li><p>Reduce risks and improve stability</p></li><li><p>Prioritize effectively</p></li></ul><h2><strong>B. Direction</strong></h2><ul><li><p>Bring clarity in ambiguous situations</p></li><li><p>Make sound technical and execution decisions</p></li><li><p>Introduce healthy change and improvements</p></li></ul><h2><strong>C. Talent</strong></h2><ul><li><p>Grow your own skills intentionally</p></li><li><p>Mentor others through examples and guidance</p></li><li><p>Help in hiring and onboarding</p></li></ul><h2><strong>D. 
Culture</strong></h2><ul><li><p>Communicate clearly</p></li><li><p>Build trust with teammates</p></li><li><p>Collaborate without ego</p></li><li><p>Support an inclusive and healthy work environment<br></p></li></ul><div class="pullquote"><p>These behaviours mature as you progress.</p></div><h1><strong>6. Craft Responsibilities by Discipline</strong></h1><h2><strong>6.1 Software Engineering</strong></h2><ul><li><p>Writing clean, maintainable code</p></li><li><p>Designing components and systems</p></li><li><p>Balancing short-term delivery with long-term stability</p></li><li><p>Driving technical decisions</p></li><li><p>Owning operational health</p></li></ul><h2><strong>6.2 Quality Engineering</strong></h2><ul><li><p>Strong test strategy</p></li><li><p>Automation mindset</p></li><li><p>Risk identification and communication</p></li><li><p>Quality metrics and measurement</p></li><li><p>Influencing teams towards quality-first thinking</p></li></ul><h2><strong>6.3 Site Reliability Engineering</strong></h2><ul><li><p>Deep systems understanding</p></li><li><p>Observability and incident response</p></li><li><p>Reliability architecture</p></li><li><p>Failure domain design</p></li><li><p>Long-term resilience improvements</p></li></ul><h2><strong>6.4 Data Engineering</strong></h2><ul><li><p>Data modelling</p></li><li><p>Pipelines and ETL architecture</p></li><li><p>Data governance and quality</p></li><li><p>Scalability and performance</p></li><li><p>Observability and lineage</p></li></ul><h2><strong>6.5 Machine Learning Engineering</strong></h2><ul><li><p>Algorithm selection</p></li><li><p>Experimentation and evaluation</p></li><li><p>ML system design</p></li><li><p>Productionising models</p></li><li><p>Monitoring model drift and behaviour</p></li></ul><h2><strong>6.6 Security Engineering</strong></h2><ul><li><p>Threat modelling</p></li><li><p>Secure architecture</p></li><li><p>Control implementation and validation</p></li><li><p>Security reviews and risk 
communication</p></li><li><p>Embedding security into design</p></li></ul><h1><strong>7. Technical Program Manager (TPM)</strong></h1><p>TPMs handle the execution backbone of a multi-team engineering organisation.</p><h3><strong>Scope</strong></h3><ul><li><p>Cross-team programs</p></li><li><p>Complex migrations</p></li><li><p>Infra and platform initiatives</p></li><li><p>Compliance and regulatory work</p></li><li><p>Multi-quarter product launches</p></li></ul><h3><strong>Responsibilities</strong></h3><ul><li><p>Breaking down large problems</p></li><li><p>Driving alignment across engineering, product, design, data, security</p></li><li><p>Managing risks, dependencies, and timelines</p></li><li><p>Creating visibility through dashboards and metrics</p></li><li><p>Keeping teams unblocked and focused</p></li><li><p>Ensuring predictable and sustainable execution</p></li></ul><h3><strong>TPM Impact Levers</strong></h3><ul><li><p>Clear execution</p></li><li><p>Strong stakeholder trust</p></li><li><p>Cross-team clarity</p></li><li><p>Reduction of delivery risk</p></li><li><p>Making the organisation move faster and safer</p></li></ul><h1><strong>8. Engineering Manager (EM)</strong></h1><h3><strong>Scope</strong></h3><p>One team, or a focused product area.</p><h3><strong>Key Responsibilities</strong></h3><ul><li><p>Build a healthy, high-performing team</p></li><li><p>Develop engineers through coaching and feedback</p></li><li><p>Guide technical direction with senior ICs</p></li><li><p>Manage execution and delivery</p></li><li><p>Partner with PMs, TPMs, and senior leaders</p></li><li><p>Maintain technical quality, reliability, and process health</p></li></ul><h3><strong>Impact Levers</strong></h3><ul><li><p>Team culture</p></li><li><p>Decision clarity</p></li><li><p>Talent development</p></li><li><p>Predictable delivery</p></li></ul><h1><strong>9. 
Senior Engineering Manager (Sr EM)</strong></h1><h3><strong>Scope</strong></h3><p>Multiple teams or a product area with broad responsibilities.</p><h3><strong>Responsibilities</strong></h3><ul><li><p>Setting long-term area strategy</p></li><li><p>Managing managers and senior ICs</p></li><li><p>Driving multi-team execution governance</p></li><li><p>Improving organizational structures</p></li><li><p>Balancing product, tech, infra, and operational work</p></li><li><p>Hiring and developing leadership talent</p></li></ul><h3><strong>Impact Levers</strong></h3><ul><li><p>Quality of leadership pipeline</p></li><li><p>Strategic decision-making</p></li><li><p>Area-level delivery and stability</p></li><li><p>Cross-functional influence</p></li></ul><h1><strong>10. How Engineers Should Use This Framework</strong></h1><p>A simple, practical approach:</p><ol><li><p>Understand your current level honestly</p></li><li><p>Identify your strongest impact levers</p></li><li><p>Pick one or two growth areas</p></li><li><p>Plan quarterly with your manager</p></li><li><p>Focus on outcomes, not activities</p></li><li><p>Document your contributions</p></li><li><p>Share your work openly</p></li></ol><div class="pullquote"><p>This makes growth intentional and transparent.</p></div><h1><strong>11. How Managers Should Use This Framework</strong></h1><p>Managers can use this framework to:</p><ul><li><p>Set transparent expectations</p></li><li><p>Give objective, behavioural feedback</p></li><li><p>Build structured growth plans</p></li><li><p>Ensure fairness in promotions</p></li><li><p>Hire better and faster</p></li><li><p>Build stronger, healthier teams</p></li></ul><div class="pullquote"><p>A consistent framework strengthens both performance and culture.</p></div><h1><strong>12. 
Closing Summary</strong></h1><p>Across all roles (ICs, managers, TPMs, and domain specialists), the fundamentals remain the same:</p><ul><li><p>Grow your scope</p></li><li><p>Increase your autonomy</p></li><li><p>Expand your collaboration reach</p></li><li><p>Strengthen your craft</p></li><li><p>Deliver meaningful impact</p></li><li><p>Support people around you</p></li></ul><p>Titles, tools, and domains change. But growth always follows the same arc: </p><div class="pullquote"><p><strong>Learn &#8594; Own &#8594; Lead &#8594; Influence &#8594; Empower</strong></p><p>That is the essence of a healthy engineering career.</p></div><h2>13. Further Reading &amp; Resources</h2><p><strong>&#128203; Data Team Handbook</strong><br><a href="https://github.com/sdg-1/data-team-handbook">https://github.com/sdg-1/data-team-handbook</a><br>A practical companion for data engineers and leaders, covering IC-to-manager growth, team culture, hiring, onboarding, and data infrastructure.</p><p><strong>&#128203; Consulting Handbook</strong><br><a href="https://github.com/sdg-1/consulting-handbook">https://github.com/sdg-1/consulting-handbook</a><br>Templates and strategies for senior engineers and TPMs exploring consulting or independent paths.</p><p><em><strong>Bonus: Compare with real-world ladders from <a href="https://handbook.gitlab.com/handbook/engineering/">GitLab</a>, <a href="https://github.com/glitchdotcom/engineering-ladder#readme-ov-file">Glitch&#8217;s Engineering Ladder</a>, and the <a href="https://github.com/posquit0/awesome-engineering-ladders">Awesome Engineering Ladders</a> repository.</strong></em></p>]]></content:encoded></item><item><title><![CDATA[When Bots Don’t Pay Taxes: The Economic Shift We Aren’t Talking Enough About]]></title><description><![CDATA[A reflection inspired by a 1:1 discussion with Michael Bharwani on automation, value creation, and the quiet tax gap in AI 
economies]]></description><link>https://tshasankda.substack.com/p/when-bots-dont-pay-taxes-the-economic</link><guid isPermaLink="false">https://tshasankda.substack.com/p/when-bots-dont-pay-taxes-the-economic</guid><dc:creator><![CDATA[Tharashasank Davuluru]]></dc:creator><pubDate>Tue, 04 Nov 2025 06:24:16 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Ezgf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b7fdeb6-cc13-4951-8cc6-d3ccc3b46dc8_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3><strong>Introduction</strong></h3><p>This blog was sparked by a thoughtful 1:1 discussion I had with <strong>Michael Bharwani (My Mentor)</strong>. We were talking about automation, how organizations are rapidly adapting to AI-led workflows, and what it truly means when machines start replacing human work. Somewhere in that conversation, one simple question came up:</p><p><strong>&#8220;If bots don&#8217;t pay taxes, what happens to the government&#8217;s income?&#8221;</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ezgf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b7fdeb6-cc13-4951-8cc6-d3ccc3b46dc8_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ezgf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b7fdeb6-cc13-4951-8cc6-d3ccc3b46dc8_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!Ezgf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b7fdeb6-cc13-4951-8cc6-d3ccc3b46dc8_1536x1024.png 848w, 
https://substackcdn.com/image/fetch/$s_!Ezgf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b7fdeb6-cc13-4951-8cc6-d3ccc3b46dc8_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!Ezgf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b7fdeb6-cc13-4951-8cc6-d3ccc3b46dc8_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ezgf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b7fdeb6-cc13-4951-8cc6-d3ccc3b46dc8_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5b7fdeb6-cc13-4951-8cc6-d3ccc3b46dc8_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2854099,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://tshasankda.substack.com/i/177959912?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b7fdeb6-cc13-4951-8cc6-d3ccc3b46dc8_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ezgf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b7fdeb6-cc13-4951-8cc6-d3ccc3b46dc8_1536x1024.png 424w, 
https://substackcdn.com/image/fetch/$s_!Ezgf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b7fdeb6-cc13-4951-8cc6-d3ccc3b46dc8_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!Ezgf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b7fdeb6-cc13-4951-8cc6-d3ccc3b46dc8_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!Ezgf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b7fdeb6-cc13-4951-8cc6-d3ccc3b46dc8_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>That question stayed with me. Because as AI grows in influence, it is not just jobs and skills that will change. Taxation systems, national budgets, and even social contracts will have to adapt.</p><p>Automation is already part of everyday operations. Across industries, bots and AI platforms are performing human tasks. But there is one big difference: <strong>they don&#8217;t pay taxes.</strong> No income tax, no social contribution, no pension.</p><p>This simple shift can have deep economic effects. So let&#8217;s unpack what it means for organizations, governments, and society.</p><h3><strong>What&#8217;s Happening Around the World</strong></h3><p>A few countries have already started experimenting with ways to handle this challenge.</p><ul><li><p><strong>South Korea</strong> introduced an automation levy on companies using robots, channeling the funds into reskilling programs.</p></li><li><p><strong>Italy</strong> debated a &#8220;robot tax&#8221; to support welfare systems for displaced workers.</p></li><li><p><strong>The European Union</strong> is exploring digital tax frameworks, recognizing that automation is now both an economic and social issue.</p></li></ul><p>These examples show one thing clearly: <strong>the world is already searching for fair ways to adapt tax systems to automation.</strong></p><p>The <strong>South Korea AI Framework Act</strong>, enacted in early 2025, focuses on AI system regulation, transparency, and responsible innovation but stops short of specifying an automation or &#8220;robot&#8221; tax. However, their broader 2025 tax reform aligns with global minimum tax principles targeting multinational enterprises&#8217; profits, reflecting a move to capture economic value from digital activities including AI. 
The government is also investing heavily in AI infrastructure and retraining programs, effectively channeling automation gains back into the workforce.</p><p>In <strong>Italy</strong>, the &#8220;robot tax&#8221; has been debated mainly in the political arena as a potential levy on companies that replace human labor with automation to fund social welfare systems for displaced workers. While a formal nationwide tax has not been enacted yet, ongoing discussions show a strong interest in linking automation to social equity. Several proposals suggest a productivity or profit-based levy instead of a simple headcount tax, reflecting the global trend of moving towards <strong>value-linked taxation</strong>.</p><p>Across all jurisdictions, similar policy challenges persist. Defining what counts as taxable automation or AI is complex because systems vary widely from assistive tools to fully autonomous agents. Measuring AI-driven value objectively is difficult without standardized metrics. Poorly designed taxes also risk slowing innovation or driving investment to regions with fewer regulations.</p><p>Data privacy and ethics add another layer of complexity. Any tax model that relies on measuring automation intensity such as query volumes or compute usage must balance transparency with privacy rights. Excessive monitoring or intrusive data collection could easily trigger resistance from both businesses and citizens.</p><h3><strong>The Big Question: What Counts as a Robot?</strong></h3><p>Defining &#8220;robot&#8221; used to be simple. In factories, robots replaced physical labor. 
But today, automation happens in code, in the cloud, through APIs, and in AI workflows.</p><p>So where do we draw the line?</p><ul><li><p>Is a chatbot a robot?</p></li><li><p>Is AI-generated content part of taxable work?</p></li><li><p>Or is automation inside data platforms like <strong>Starburst</strong> or <strong>Trino</strong>, processing millions of queries a day, also &#8220;labor&#8221; in some sense?</p></li></ul><p>Once automation becomes digital, even defining work becomes complex. Governments will have to rethink what counts as &#8220;value creation.&#8221;</p><p>This area is debatable and still evolving. There are no clear global standards yet, but these questions are now part of serious policy discussions in many countries.</p><h3><strong>Can Governments Really Do Without Taxes?</strong></h3><p>No government can run without taxes. Tax revenue funds everything from education and healthcare to infrastructure and welfare.</p><p>For now, automation seems beneficial: higher productivity and better margins can increase corporate tax contributions. But over time, as human jobs decline, <strong>income tax revenues will fall</strong>, and that is where the real gap begins.</p><p>Governments will then need to shift from <strong>taxing people</strong> to <strong>taxing productivity and automation</strong>. 
That change will not be easy, but it is necessary.</p><h3><strong>Possible Tax Models for the Future</strong></h3><p>Here are some practical models that are being explored or discussed:</p><ul><li><p><strong>Automation Offset Funds:</strong> Companies contribute a share of what they save through automation into skill development or retraining programs.</p></li><li><p><strong>Value-Based Taxes:</strong> Linked to productivity gains or automation-generated profits rather than headcount.</p></li><li><p><strong>Energy or Compute-Based Taxes:</strong> Levies tied to energy or compute resources used by large-scale AI systems.</p></li><li><p><strong>Data Monetization Taxes:</strong> Taxes on revenue derived from AI-driven data usage or analytics.</p></li></ul><p>Each approach moves away from labor and focuses on <strong>the value created by automation.</strong></p><p>Some of these ideas are still theoretical. Whether they become real policy tools depends on how governments interpret automation data and design fair frameworks.</p><h3><strong>Balancing Innovation and Fairness</strong></h3><p>This balance is delicate. Tax automation too heavily, and innovation could slow. Leave it untaxed, and inequality widens. The key lies in designing policies that encourage innovation but also ensure fairness and redistribution.</p><p>AI is global, but taxation is local. That makes policy design harder. Collaboration between economists, technologists, and governments will be crucial to keeping systems both competitive and fair.</p><h3><strong>Does It Still Make Sense to Automate If AI Is Taxed Like Humans?</strong></h3><p><em>(A thought that shaped this blog from my conversation with Michael Bharwani)</em></p><p>In that 1:1 discussion, Michael asked a powerful question: <strong>If governments start taxing AI like humans, will automation still make business sense? </strong>It made me think deeply. My answer: yes, automation will still be valuable. 
But the reason will change.</p><p><strong>Consistency and Scalability</strong><br>AI works round the clock, at scale, and with precision that humans can&#8217;t match.</p><p><strong>Lower Overheads</strong><br>Even with taxation, automation saves long-term costs, with no benefits, offices, or HR dependencies.</p><p><strong>Smarter Decisions</strong><br>AI learns continuously. Data feedback loops improve decision quality across business functions.</p><p><strong>Global Consistency</strong><br>Automation systems can operate uniformly across countries, unlike human processes that need localization.</p><p><strong>Human-AI Collaboration</strong><br>AI doesn&#8217;t replace human creativity; it amplifies it. Humans focus on strategy, judgment, and empathy.</p><p>So even if AI is taxed, it will still deliver strong operational and analytical value, not through cost-cutting but through smarter, scalable capabilities.</p><h3><strong>How Data Platforms Fit In</strong></h3><p>Platforms like <strong>Starburst</strong> and <strong>Trino</strong> are clear examples of invisible automation. They process massive data volumes, automate reporting, and support large-scale decision-making every second.</p><p>If governments start measuring automation intensity such as number of queries, data processed, or compute usage, these could become <strong>new indicators for taxation.</strong></p><p>That is how we may see <strong>platform-based tax models</strong>, where tax links to the amount of automation rather than the number of human workers.</p><p>Interestingly, we are already seeing hints of this approach. In <strong>Poland</strong>, for instance, a creative tax model was introduced that indirectly captures value from intellectual and digital work. It is not automation-specific yet, but it is a step in that direction.</p><p>Every country will eventually design its version of this model. 
It will also create new roles: economists, policy specialists, and domain experts who translate automation data into policy frameworks.</p><p>But this will not be simple. Balancing innovation with fairness will take time. It is a <em>wait and watch</em> phase right now, as the dialogue between automation and taxation continues to evolve.</p><p>This section remains debatable, as such models are not yet standardized or globally adopted. But the direction is clearly forming.</p><h3><strong>Defining Possible Automation Tax Metrics</strong></h3><p>For technically proficient audiences, such as those using platforms like Starburst or Trino, automation tax metrics could be defined through measurable system activity. For example:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mkjl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f8079b0-fa9d-44ca-a654-1a90b0cb293a_1234x157.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mkjl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f8079b0-fa9d-44ca-a654-1a90b0cb293a_1234x157.png 424w, https://substackcdn.com/image/fetch/$s_!mkjl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f8079b0-fa9d-44ca-a654-1a90b0cb293a_1234x157.png 848w, https://substackcdn.com/image/fetch/$s_!mkjl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f8079b0-fa9d-44ca-a654-1a90b0cb293a_1234x157.png 1272w, 
https://substackcdn.com/image/fetch/$s_!mkjl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f8079b0-fa9d-44ca-a654-1a90b0cb293a_1234x157.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mkjl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f8079b0-fa9d-44ca-a654-1a90b0cb293a_1234x157.png" width="1234" height="157" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0f8079b0-fa9d-44ca-a654-1a90b0cb293a_1234x157.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:157,&quot;width&quot;:1234,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:43050,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://tshasankda.substack.com/i/177959912?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f8079b0-fa9d-44ca-a654-1a90b0cb293a_1234x157.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mkjl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f8079b0-fa9d-44ca-a654-1a90b0cb293a_1234x157.png 424w, https://substackcdn.com/image/fetch/$s_!mkjl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f8079b0-fa9d-44ca-a654-1a90b0cb293a_1234x157.png 848w, 
https://substackcdn.com/image/fetch/$s_!mkjl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f8079b0-fa9d-44ca-a654-1a90b0cb293a_1234x157.png 1272w, https://substackcdn.com/image/fetch/$s_!mkjl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f8079b0-fa9d-44ca-a654-1a90b0cb293a_1234x157.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Using such metrics as inputs for value-based taxation models can help quantify the economic activity driven by AI systems rather than relying only on human labor metrics. Again, this idea is open for debate, as we are still figuring out how to measure such automation impact effectively.</p><h3><strong>Productivity vs. Equity</strong></h3><p>Automation brings productivity. But unchecked, it can also widen economic gaps. That is why taxation plays a critical balancing role, ensuring some of the gains go back into human skill development and social support.</p><p>The real success of automation will not be how many tasks it replaces, but how many people it uplifts.</p><h3><strong>Conclusion: Preparing for a Tax System That Understands AI</strong></h3><p>AI is redefining work, value, and contribution. Yet most taxation systems are still built around jobs, salaries, and payroll. The future will need a <strong>value-based taxation system</strong>, one that measures productivity, automation, and data-driven value creation.</p><p>It is a complex transformation, but a necessary one. Automation should not mean fewer opportunities for humans. 
It should mean <strong>smarter, fairer, and shared progress</strong> for all.</p><p>A collaborative effort between technologists, economists, and policymakers will be the key to shaping a fair, future-ready taxation framework that keeps innovation alive while ensuring everyone benefits from the value AI creates.</p>]]></content:encoded></item><item><title><![CDATA[Manager vs Leader: Why the Difference Matters More Than You Think]]></title><description><![CDATA[A Real Journey of Transformation : From Status Updates to True Leadership]]></description><link>https://tshasankda.substack.com/p/manager-vs-leader-why-the-difference</link><guid isPermaLink="false">https://tshasankda.substack.com/p/manager-vs-leader-why-the-difference</guid><dc:creator><![CDATA[Tharashasank Davuluru]]></dc:creator><pubDate>Mon, 15 Sep 2025 04:41:37 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Vcxa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0b40989-d394-4aba-bf99-f26609f0f733_1864x924.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2><em>The Conversation That Changed Everything</em></h2><p><em>"How are you doing? Really doing?" I looked up from my laptop, surprised. It was our weekly one-on-one, and I'd prepared my usual status update: tasks completed, blockers identified, next week's priorities. But my manager wasn't opening his notebook or checking his calendar. He was just... waiting.</em></p><div class="pullquote"><p><em>"What's on your mind?" he continued. "What direction excites you in your career?"</em></p></div><p><em>That conversation marked the beginning of my transformation from someone who managed work to someone who could lead people. 
Over 17 years in various roles, but especially in the last 8, I've learned that manager and leader aren't just different job titles&#8212;they're completely different ways of showing up.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Vcxa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0b40989-d394-4aba-bf99-f26609f0f733_1864x924.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Vcxa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0b40989-d394-4aba-bf99-f26609f0f733_1864x924.png 424w, https://substackcdn.com/image/fetch/$s_!Vcxa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0b40989-d394-4aba-bf99-f26609f0f733_1864x924.png 848w, https://substackcdn.com/image/fetch/$s_!Vcxa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0b40989-d394-4aba-bf99-f26609f0f733_1864x924.png 1272w, https://substackcdn.com/image/fetch/$s_!Vcxa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0b40989-d394-4aba-bf99-f26609f0f733_1864x924.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Vcxa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0b40989-d394-4aba-bf99-f26609f0f733_1864x924.png" width="1456" height="722" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e0b40989-d394-4aba-bf99-f26609f0f733_1864x924.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:722,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1147362,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://tshasankda.substack.com/i/161674965?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0b40989-d394-4aba-bf99-f26609f0f733_1864x924.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Vcxa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0b40989-d394-4aba-bf99-f26609f0f733_1864x924.png 424w, https://substackcdn.com/image/fetch/$s_!Vcxa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0b40989-d394-4aba-bf99-f26609f0f733_1864x924.png 848w, https://substackcdn.com/image/fetch/$s_!Vcxa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0b40989-d394-4aba-bf99-f26609f0f733_1864x924.png 1272w, https://substackcdn.com/image/fetch/$s_!Vcxa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0b40989-d394-4aba-bf99-f26609f0f733_1864x924.png 1456w" sizes="100vw" fetchpriority="high"></picture>
</div></a></figure></div><blockquote><p><em><strong>A Note from the Author:</strong> I really enjoyed crafting this story from start to finish! 
After I completed the writing, I used ChatGPT to help polish the grammar and make it even easier to read, all while keeping my original thoughts intact.</em></p></blockquote><h2><em>The Difference That Matters</em></h2><p><em>Here's what I've observed after working under both managers and leaders:</em></p><p><em><strong>Managers focus on:</strong></em></p><ul><li><p><em>Tasks and deliverables</em></p></li><li><p><em>Status updates and timelines</em></p></li><li><p><em>Compliance and process</em></p></li><li><p><em>What needs to be done</em></p></li><li><p><em>Control and oversight</em></p></li></ul><p><em><strong>Leaders focus on:</strong></em></p><ul><li><p><em>People and their growth</em></p></li><li><p><em>Purpose and meaning</em></p></li><li><p><em>Development and potential</em></p></li><li><p><em>Why it matters</em></p></li><li><p><em>Trust and empowerment</em></p></li></ul><div class="pullquote"><p><em>The manager asks: "Is it done?" The leader asks: "How can I help you succeed?"</em></p></div><h2><em>Three Leaders Who Shaped My Journey</em></h2><p><em>My transformation didn't happen overnight. It came through working with exceptional people who modeled what leadership actually looks like:</em></p><p><em><strong>Paul Banoub</strong> laid the foundation through his "House View of Leadership" sessions and genuine investment in people. He was a great mentor and an even better listener. Every one-on-one felt like he had all the time in the world for whatever was on my mind. His belief in leadership training showed me that developing others wasn't optional&#8212;it was essential.</em></p><p><em><strong>Michael Bharwani</strong> has been my guiding presence for three years. He's been consistent with feedback, generous with encouragement, and sharp in identifying improvement areas. 
What sets him apart is his accessibility&#8212;he's still someone I turn to today, long after our direct reporting relationship ended.</em></p><p><em><strong>Kevin Ruscoe</strong> helped me discover strengths I didn't know I had in product thinking, process design, and stakeholder management. He was one of the first to believe I could do more than just deliver&#8212;that I could lead. He didn't just assign me work; he assigned me purpose.</em></p><p><em>Each of these leaders shared something crucial: they didn't just manage my output; they invested in my potential.</em></p><h2><em>My Leadership Framework: CARE</em></h2><p><em>Through observing these leaders and eventually leading teams myself, I developed what I call the CARE Framework:</em></p><p><em><strong>Connect</strong> &#8212; Build real relationships beyond tasks and tickets. Know what motivates your people, what challenges them outside of work, what they aspire to become.</em></p><p><em><strong>Align</strong> &#8212; Link people's individual goals to the larger purpose. Help them see how their work creates impact beyond just completing assignments.</em></p><p><em><strong>Recognize</strong> &#8212; Celebrate effort, outcomes, and progress openly. Not just the big wins, but the daily improvements and growth moments that often go unnoticed.</em></p><p><em><strong>Empower</strong> &#8212; Give ownership and create space for people to figure things out. Trust your team's judgment and provide support when they need it.</em></p><blockquote><p><em>This isn't a checklist to complete&#8212;it's how I try to show up every day.</em></p></blockquote><h2><em>The Practical Shift</em></h2><p><em>If you're a manager wanting to become a leader, here are the shifts that made the difference for me:</em></p><p><em><strong>Listen more than you talk.</strong> Your team is constantly telling you what matters to them, what frustrates them, and what energizes them. 
Most of us just aren't tuned in.</em></p><p><em><strong>Guide instead of just assigning.</strong> When you give someone a task, help them understand why it matters and how it connects to something bigger.</em></p><p><em><strong>Make career growth a regular conversation.</strong> Don't wait for annual reviews. Ask about aspirations, create development opportunities, and advocate for your people.</em></p><p><em><strong>Create visibility for your team.</strong> Let them present to leadership, lead important calls, and get recognition for their contributions.</em></p><p><em><strong>Celebrate progress, not just outcomes.</strong> Recognition builds confidence, and confidence drives performance.</em></p><h2><em>When Leadership Really Clicks</em></h2><p><em>The shift became real for me when I started managing my own teams. Instead of focusing solely on deliverables, I began asking different questions: How is this person growing? What do they need from me? How can I help them succeed beyond this project?</em></p><p><em>The results spoke for themselves. Team engagement improved, people stayed longer, and surprisingly, our actual delivery got better too. When people feel invested in, they invest back.</em></p><h2><em>Resources That Helped My Journey</em></h2><p><em>These books gave me language for what I was experiencing and frameworks for what I was trying to become:</em></p><ul><li><p><em><strong>"Leaders Eat Last" by Simon Sinek</strong> &#8212; Showed me how empathy builds safety and trust in teams</em></p></li><li><p><em><strong>"The 5 Levels of Leadership" by John Maxwell</strong> &#8212; Helped me view leadership as a journey of influence, not just authority</em></p></li><li><p><em><strong>"The Making of a Manager" by Julie Zhuo</strong> &#8212; A refreshingly honest take on modern leadership challenges</em></p></li></ul><h2><em>The Truth About Leadership</em></h2><p><em>Here's what I've learned: Leadership doesn't begin with a promotion or a title. 
It begins with intention. It starts when you decide that your success is measured not just by what you accomplish, but by how you help others accomplish things they didn't think were possible.</em></p><p><em>If you're a manager today, pause and ask yourself: Am I managing work, or am I leading people?</em></p><p><em>The answer lies in how you show up. Do you see your role as getting tasks completed, or developing the humans who complete those tasks?</em></p><div class="pullquote"><p><em><strong>Start with one real conversation. Start with believing someone on your team can do more than they think they can, and then tell them so.</strong></em></p></div><p><em>Leadership isn't about perfection. It's about getting better. It's not about having all the answers. It's about asking better questions.</em></p><p><em>Let's not just manage projects and processes. Let's lead people. And in the process, let's lead ourselves toward the kind of leaders we wish we'd had when we were starting out.</em></p><div><hr></div><p><em>What's one shift you could make today to move from managing tasks to leading people? Sometimes the most powerful leadership development happens one conversation at a time.</em></p>]]></content:encoded></item><item><title><![CDATA[Define. Deliver. Deliver. 
Deliver.: How a Team Charter Became My Professional Compass]]></title><description><![CDATA[From Year-End Feedback to Career Foundation &#8212; Why Some Mantras Shape Everything That Follows]]></description><link>https://tshasankda.substack.com/p/define-deliver-deliver-deliver-how</link><guid isPermaLink="false">https://tshasankda.substack.com/p/define-deliver-deliver-deliver-how</guid><dc:creator><![CDATA[Tharashasank Davuluru]]></dc:creator><pubDate>Wed, 10 Sep 2025 07:00:42 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!n0K8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1b793ff-e973-4b81-b1f9-529f4df16aba_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2><em>The Review That Changed Everything</em></h2><p><em>I was sitting in my year-end review (2017), three months into joining a new company and team. My manager pulled up his notes and said something I'll never forget:</em></p><div class="pullquote"><p><em>"Keep living to the Agile BI Charter and mantra: 'Deliver. Deliver. Deliver,' and expand beyond our team."</em></p></div><p><em>I'd been hearing this phrase since I joined&#8212;it was woven into how our team operated. But hearing it in my formal review, connected to my performance and future potential, made it click differently. This wasn't just team vocabulary. This was the standard I was being measured against.</em></p><p><em>The feedback was clear: my strength lay in autonomous problem-solving and incident management. When issues arose, I didn't freeze&#8212;I acted. "If I want something addressing quickly and thoroughly," my manager wrote, "Thara is a great person to have on the team."</em></p><p><em>That review became my north star. 
The mantra wasn't just about delivery, it was about building the kind of reliability that makes managers' lives easier and teams more effective.</em></p><blockquote><p><em><strong>A Note from the Author:</strong> I really enjoyed crafting this story from start to finish! After I completed the writing, I used ChatGPT to help polish the grammar and make it even easier to read, all while keeping my original thoughts intact.</em></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!n0K8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1b793ff-e973-4b81-b1f9-529f4df16aba_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!n0K8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1b793ff-e973-4b81-b1f9-529f4df16aba_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!n0K8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1b793ff-e973-4b81-b1f9-529f4df16aba_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!n0K8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1b793ff-e973-4b81-b1f9-529f4df16aba_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!n0K8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1b793ff-e973-4b81-b1f9-529f4df16aba_1536x1024.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!n0K8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1b793ff-e973-4b81-b1f9-529f4df16aba_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c1b793ff-e973-4b81-b1f9-529f4df16aba_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1ab50890-4de3-4c46-9dbb-5e896c6a8191_1536x1024.png&quot;,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2068697,&quot;alt&quot;:&quot;Deliver. Deliver. Deliver.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://tshasankda.substack.com/i/172732375?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ab50890-4de3-4c46-9dbb-5e896c6a8191_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Deliver. Deliver. Deliver." title="Deliver. Deliver. Deliver." 
srcset="https://substackcdn.com/image/fetch/$s_!n0K8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1b793ff-e973-4b81-b1f9-529f4df16aba_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!n0K8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1b793ff-e973-4b81-b1f9-529f4df16aba_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!n0K8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1b793ff-e973-4b81-b1f9-529f4df16aba_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!n0K8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1b793ff-e973-4b81-b1f9-529f4df16aba_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture>
</div></a><figcaption class="image-caption"><em>The first delivery proves capability. The second shows consistency. The third builds reputation.</em></figcaption></figure></div><h2><em>Living the Charter Daily</em></h2><p><em>From 2017 onward, "Deliver. Deliver. Deliver." became more than words; it became operational DNA. When licensing issues erupted, I delivered solutions. When workspace problems threatened client experience, I delivered fixes. When emergency package rollouts were needed, I delivered results.</em></p><p><em>But the real lesson came through watching my manager and the Head of Analytics as a Service. They didn't just preach delivery; they modeled it. Every project started with ruthless clarity about what we were solving and why it mattered. Then came consistent execution, adjustment based on feedback, and more execution.</em></p><p><em>I realized the charter had an invisible first step: <strong>Define</strong>. Before you can deliver effectively, you must know exactly what you're delivering and why.</em></p><h2><em>The Framework in Action: Cloud Migration Reality Check</em></h2><p><em>By 2021, I was applying this evolved framework, &#8220;Define. Deliver. Deliver. Deliver.&#8221;, to increasingly complex challenges. Our Tableau cloud adoption project exemplified the approach.</em></p><p><em>The challenge: migrate Tableau Server to the Azure cloud using containers, when container support was still experimental. 
My initial approach was chaotic&#8212;scripting solutions, forcing DevOps practices I'd only read about, trying to solve everything simultaneously.</em></p><p><em>The breakthrough came when I applied the charter's hidden wisdom: <strong>Define first</strong>. Instead of diving into technical implementation, I stepped back:</em></p><ul><li><p><em>What problem were we actually solving? (Cloud adoption, not just containerization)</em></p></li><li><p><em>What capabilities did we need? (Dev+Data+Ops mindset, not just tools)</em></p></li><li><p><em>What would success look like? (Sustainable platform engineering, not just working containers)</em></p></li></ul><p><em>With clarity established, delivery became systematic. We shipped small wins: basic container deployment, then orchestration, then infrastructure as code. Each delivery taught us something and built stakeholder confidence.</em></p><p><em>The project became my crash course in platform engineering&#8212;understanding how people, process, product, and technology intersect. That foundation opened doors to leadership roles and bigger challenges.</em></p><h2><em>Passing the Charter Forward</em></h2><p><em>When I started managing teams, the charter came with me. During our Power BI automation initiative with Microsoft vendors, I found myself echoing my old manager: "Let's define what success looks like, then deliver incrementally."</em></p><p><em>The approach eliminated friction and kept everyone aligned. Instead of getting lost in vendor capabilities, we focused on our organization's specific requirements and built solutions step by step.</em></p><p><em>Later, leading our feature store project, we delivered in record time with a small team by religiously following the framework. Currently, I manage an infrastructure engineering pod where this approach is baked into our team charter. Result: 2x efficiency improvement over two years.</em></p><h2><em>Why This Becomes Career DNA</em></h2><p><em>The genius of "Deliver. 
Deliver. Deliver." isn't in the repetition; it's in the compound effect. The first delivery proves capability. The second shows consistency. By the third, you've built a reputation.</em></p><p><em>My 2017 review highlighted something crucial: when problems arose, I didn't freeze; I acted. That behavior, reinforced by the team charter, became my professional brand. Managers knew they could count on me for "quick and thorough" solutions.</em></p><p><em>This reliability opened doors. Not because I was the smartest person in the room, but because I was the one who consistently turned problems into progress.</em></p><h2><em>The Evolution Continues</em></h2><p><em>Seven years later, that year-end feedback still guides everything I do. The charter has evolved (I've added "Define" as the crucial first step), but the core remains: consistent delivery builds careers.</em></p><p><em>When I share this framework in training sessions or with struggling teams, the response is always the same: clarity. Whether you're a new manager building credibility, an individual contributor seeking visibility, or anyone trying to make an impact, the pattern works.</em></p><div class="pullquote"><p><em><strong>Define what matters. Then deliver with relentless consistency until reliability becomes your reputation.</strong></em></p></div><h2><em>The Lasting Gift</em></h2><p><em>My manager's greatest gift wasn't the mantra itself; it was showing me how simple frameworks, consistently applied, compound into extraordinary results. That 2017 review didn't just evaluate my performance; it revealed the path to sustainable success.</em></p><p><em>The best professional advice often comes disguised as everyday expectations. Sometimes the most powerful career accelerator is simply becoming the person others can count on to deliver.</em></p><div><hr></div><p><em>What standards or mantras guide your work? 
Sometimes the most transformative lessons come wrapped in the simplest expectations.<br><br>DCD suggested reads:</em></p><div class="embedded-post-wrap" data-attrs="{&quot;id&quot;:162081022,&quot;url&quot;:&quot;https://vizmasters.substack.com/p/a-dataculture-lens&quot;,&quot;publication_id&quot;:4524746,&quot;publication_name&quot;:&quot;VizMasters&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!nPLx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6721ae08-d68e-479e-ba4b-5f9ad310b0dd_1024x1024.png&quot;,&quot;title&quot;:&quot;The Data Looks Healthy. But the Culture is Cracking.&quot;,&quot;truncated_body_text&quot;:&quot;The charts say we&#8217;re fine.&quot;,&quot;date&quot;:&quot;2025-04-24T21:51:31.683Z&quot;,&quot;like_count&quot;:7,&quot;comment_count&quot;:0,&quot;bylines&quot;:[{&quot;id&quot;:28659751,&quot;name&quot;:&quot;Tharashasank Davuluru&quot;,&quot;handle&quot;:&quot;tshasankda&quot;,&quot;previous_name&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bf2a8721-d298-4f9f-b6a7-bc77ba443e9f_1200x1200.jpeg&quot;,&quot;bio&quot;:&quot;I&#8217;ve been living in Krakow, Poland, for the past eight years, with a background in engineering and technology. 
&quot;,&quot;profile_set_up_at&quot;:&quot;2025-03-27T21:47:20.835Z&quot;,&quot;reader_installed_at&quot;:&quot;2025-03-27T22:58:47.240Z&quot;,&quot;publicationUsers&quot;:[{&quot;id&quot;:4614886,&quot;user_id&quot;:28659751,&quot;publication_id&quot;:4523895,&quot;role&quot;:&quot;admin&quot;,&quot;public&quot;:true,&quot;is_primary&quot;:true,&quot;publication&quot;:{&quot;id&quot;:4523895,&quot;name&quot;:&quot;Data, Chai &amp; Dialogue - The Leadership Lounge&quot;,&quot;subdomain&quot;:&quot;tshasankda&quot;,&quot;custom_domain&quot;:null,&quot;custom_domain_optional&quot;:false,&quot;hero_text&quot;:&quot;Data, Chai &amp; Dialogue is a space where insights meet storytelling, brewing conversations at the intersection of tech, culture, and personal growth from Krakow, Poland.&quot;,&quot;logo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0ae5d6ee-8e59-41a5-ae04-a55ee9c5b03e_1024x1024.png&quot;,&quot;author_id&quot;:28659751,&quot;primary_user_id&quot;:null,&quot;theme_var_background_pop&quot;:&quot;#FF6719&quot;,&quot;created_at&quot;:&quot;2025-03-27T21:48:13.521Z&quot;,&quot;email_from_name&quot;:&quot; Data, Chai &amp; Dialogue &#8212; The Leadership Lounge&quot;,&quot;copyright&quot;:&quot;Tharashasank Davuluru&quot;,&quot;founding_plan_name&quot;:&quot;The Visionary Membership&quot;,&quot;community_enabled&quot;:true,&quot;invite_only&quot;:false,&quot;payments_state&quot;:&quot;enabled&quot;,&quot;language&quot;:null,&quot;explicit&quot;:false,&quot;homepage_type&quot;:&quot;newspaper&quot;,&quot;is_personal_mode&quot;:false}}],&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null,&quot;status&quot;:{&quot;bestsellerTier&quot;:null,&quot;subscriberTier&quot;:null,&quot;leaderboard&quot;:null,&quot;vip&quot;:false,&quot;badge&quot;:null}}],&quot;utm_campaign&quot;:null,&quot;belowTheFold&quot;:true,&quot;type&quot;:&quot;newsletter&quot;,&quot;language&quot;:&quot;en&quot;,&quot;source&quot;:null}" 
data-component-name="EmbeddedPostToDOM"><a class="embedded-post" native="true" href="https://vizmasters.substack.com/p/a-dataculture-lens?utm_source=substack&amp;utm_campaign=post_embed&amp;utm_medium=web"><div class="embedded-post-header"><img class="embedded-post-publication-logo" src="https://substackcdn.com/image/fetch/$s_!nPLx!,w_56,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6721ae08-d68e-479e-ba4b-5f9ad310b0dd_1024x1024.png" loading="lazy"><span class="embedded-post-publication-name">VizMasters</span></div><div class="embedded-post-title-wrapper"><div class="embedded-post-title">The Data Looks Healthy. But the Culture is Cracking.</div></div><div class="embedded-post-body">The charts say we&#8217;re fine&#8230;</div><div class="embedded-post-cta-wrapper"><span class="embedded-post-cta">Read more</span></div><div class="embedded-post-meta">a year ago &#183; 7 likes &#183; Tharashasank Davuluru</div></a></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;19f18703-06c6-43fd-b62b-7dec4060c049&quot;,&quot;caption&quot;:&quot;An Engineering Leadership Perspective - On the surface, everything looks good.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;showDescription&quot;:true,&quot;showImage&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;The Standups Are Fine. The Soul Isn&#8217;t.&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:28659751,&quot;name&quot;:&quot;Tharashasank Davuluru&quot;,&quot;bio&quot;:&quot;I&#8217;ve been living in Krakow, Poland, for the past eight years, with a background in engineering and technology. 
&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bf2a8721-d298-4f9f-b6a7-bc77ba443e9f_1200x1200.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-04-27T22:00:50.820Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/$s_!0CBc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13f3f2b7-04ae-4ac3-9bbe-3f188f9d6ce8_1536x1024.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://tshasankda.substack.com/p/the-standups-are-fine-the-soul-isnt&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:162081498,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:23,&quot;comment_count&quot;:5,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Data, Chai &amp; Dialogue - The Leadership Lounge&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!cMG_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ae5d6ee-8e59-41a5-ae04-a55ee9c5b03e_1024x1024.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;d63e0f6e-fd2f-4149-bc19-eb286816308c&quot;,&quot;caption&quot;:&quot;Leadership is about understanding people, not just telling them what to do. It&#8217;s about knowing who you're leading and connecting with them on a deeper level. 
When leaders understand the natural stren&#8230;&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;showDescription&quot;:true,&quot;showImage&quot;:true,&quot;size&quot;:&quot;md&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;The Colors of Leadership: A Smarter Way to Lead and Understand Your Team&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:28659751,&quot;name&quot;:&quot;Tharashasank Davuluru&quot;,&quot;bio&quot;:&quot;I&#8217;ve been living in Krakow, Poland, for the past eight years, with a background in engineering and technology. &quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bf2a8721-d298-4f9f-b6a7-bc77ba443e9f_1200x1200.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-04-19T13:22:35.603Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/$s_!Kd-A!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87d54ef9-1f30-4437-927f-7ad193dc0497_1024x1024.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://tshasankda.substack.com/p/the-colors-of-leadership-a-smarter&quot;,&quot;section_name&quot;:&quot;Journeys &amp; Lessons&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:161671386,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:36,&quot;comment_count&quot;:13,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Data, Chai &amp; Dialogue - The Leadership Lounge&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!cMG_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ae5d6ee-8e59-41a5-ae04-a55ee9c5b03e_1024x1024.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" 
data-attrs="{&quot;nodeId&quot;:&quot;563f2566-d99f-4e8a-973b-70d8c9f047d6&quot;,&quot;caption&quot;:&quot;This blog is a reflection shaped by years of working with diverse teams across domains and geographies. It&#8217;s not about calling out anyone or any project - it&#8217;s about recurring patterns I&#8217;ve noticed. &#8230;&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;showDescription&quot;:true,&quot;showImage&quot;:true,&quot;size&quot;:&quot;md&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;When Experience Becomes a Blind Spot&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:28659751,&quot;name&quot;:&quot;Tharashasank Davuluru&quot;,&quot;bio&quot;:&quot;I&#8217;ve been living in Krakow, Poland, for the past eight years, with a background in engineering and technology. &quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bf2a8721-d298-4f9f-b6a7-bc77ba443e9f_1200x1200.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-08-05T08:49:12.344Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/$s_!8l2H!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4418ea5c-0156-4026-aa47-f468bb52be3d_1024x1024.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://tshasankda.substack.com/p/when-experience-becomes-a-blind-spot&quot;,&quot;section_name&quot;:&quot;Journeys &amp; Lessons&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:168111599,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:2,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Data, Chai &amp; Dialogue - The Leadership 
Lounge&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!cMG_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ae5d6ee-8e59-41a5-ae04-a55ee9c5b03e_1024x1024.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;d970620c-16e7-4f99-a6fd-da3f42c83642&quot;,&quot;caption&quot;:&quot;From Curiosity to Career&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;showDescription&quot;:true,&quot;showImage&quot;:true,&quot;size&quot;:&quot;md&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;From Tester to Engineering Manager: My Journey Through Data, Cloud, and Growth&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:28659751,&quot;name&quot;:&quot;Tharashasank Davuluru&quot;,&quot;bio&quot;:&quot;I&#8217;ve been living in Krakow, Poland, for the past eight years, with a background in engineering and technology. 
&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bf2a8721-d298-4f9f-b6a7-bc77ba443e9f_1200x1200.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-07-31T07:24:14.223Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/98f0113e-27c8-4fdd-9f17-22f3d157b200_800x1422.jpeg&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://tshasankda.substack.com/p/from-tester-to-engineering-manager&quot;,&quot;section_name&quot;:&quot;Journeys &amp; Lessons&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:161805005,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:4,&quot;comment_count&quot;:1,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Data, Chai &amp; Dialogue - The Leadership Lounge&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!cMG_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ae5d6ee-8e59-41a5-ae04-a55ee9c5b03e_1024x1024.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;6c25ddfc-835e-4ed5-bba3-2f661695e7d1&quot;,&quot;caption&quot;:&quot;In the maze of professional growth, it&#8217;s easy to lose direction amidst the noise of constant demands and distractions. 
This is where mentorship becomes invaluable, acting as your guide to c&#8230;&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;showDescription&quot;:true,&quot;showImage&quot;:true,&quot;size&quot;:&quot;md&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;From Chaos to Clarity: How Mentorship is Your Career&#8217;s North Star&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:28659751,&quot;name&quot;:&quot;Tharashasank Davuluru&quot;,&quot;bio&quot;:&quot;I&#8217;ve been living in Krakow, Poland, for the past eight years, with a background in engineering and technology. &quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bf2a8721-d298-4f9f-b6a7-bc77ba443e9f_1200x1200.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-03-30T00:04:43.136Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/$s_!uWTv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F34366811-0656-45ed-a7e9-e5bd48d11079_1024x608.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://tshasankda.substack.com/p/from-chaos-to-clarity-how-mentorship-25-03-27&quot;,&quot;section_name&quot;:&quot;Journeys &amp; Lessons&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:160032377,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:7,&quot;comment_count&quot;:7,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Data, Chai &amp; Dialogue - The Leadership 
Lounge&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!cMG_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ae5d6ee-8e59-41a5-ae04-a55ee9c5b03e_1024x1024.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div>]]></content:encoded></item><item><title><![CDATA[Avoid Stale Data in Starburst Delta Queries: Cache Tuning Tips]]></title><description><![CDATA[Balancing query freshness and performance when querying Delta Lake on Azure Data Lake Storage with Starburst Enterprise]]></description><link>https://tshasankda.substack.com/p/avoid-stale-data-in-starburst-delta</link><guid isPermaLink="false">https://tshasankda.substack.com/p/avoid-stale-data-in-starburst-delta</guid><dc:creator><![CDATA[Tharashasank Davuluru]]></dc:creator><pubDate>Fri, 05 Sep 2025 07:48:22 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!HVL_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe52cfb49-6dbb-411c-b2d5-265191b883e9_604x523.svg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3>Introduction</h3><p>When working with <strong>Starburst Enterprise</strong> and <strong>Delta Lake</strong> on <strong>Azure Data Lake Storage (ADLS)</strong>, performance tuning often involves tweaking cache properties for Hive Metastore and Delta connectors. However, caching can sometimes do more harm than good. Recently, we experienced an <strong>outage</strong> triggered by queries not reflecting the latest file changes in ADLS. 
This blog walks through what happened, how we fixed it, and recommendations on cache management moving forward.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HVL_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe52cfb49-6dbb-411c-b2d5-265191b883e9_604x523.svg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HVL_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe52cfb49-6dbb-411c-b2d5-265191b883e9_604x523.svg 424w, https://substackcdn.com/image/fetch/$s_!HVL_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe52cfb49-6dbb-411c-b2d5-265191b883e9_604x523.svg 848w, https://substackcdn.com/image/fetch/$s_!HVL_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe52cfb49-6dbb-411c-b2d5-265191b883e9_604x523.svg 1272w, https://substackcdn.com/image/fetch/$s_!HVL_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe52cfb49-6dbb-411c-b2d5-265191b883e9_604x523.svg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HVL_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe52cfb49-6dbb-411c-b2d5-265191b883e9_604x523.svg" width="604" height="523" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e52cfb49-6dbb-411c-b2d5-265191b883e9_604x523.svg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:523,&quot;width&quot;:604,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;azure 2&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="azure 2" title="azure 2" srcset="https://substackcdn.com/image/fetch/$s_!HVL_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe52cfb49-6dbb-411c-b2d5-265191b883e9_604x523.svg 424w, https://substackcdn.com/image/fetch/$s_!HVL_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe52cfb49-6dbb-411c-b2d5-265191b883e9_604x523.svg 848w, https://substackcdn.com/image/fetch/$s_!HVL_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe52cfb49-6dbb-411c-b2d5-265191b883e9_604x523.svg 1272w, https://substackcdn.com/image/fetch/$s_!HVL_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe52cfb49-6dbb-411c-b2d5-265191b883e9_604x523.svg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image source for Starburst Media</figcaption></figure></div><h3>The Problem</h3><p>Our catalog configuration used the following defaults:</p><pre><code><code>hive.metastore-cache-ttl=20m
hive.metastore-refresh-interval=10m
</code></code></pre><p>This configuration meant Starburst cached metadata and refreshed it only every 10&#8211;20 minutes. When new files were added to ADLS through Delta Lake transactions, queries continued reading <strong>stale metadata</strong>. This caused users to see outdated or missing data, resulting in a production outage.</p><h3>Root Cause</h3><p>The combination of long-lived <strong>Hive Metastore cache</strong> and <strong>Delta metadata cache</strong> caused Starburst to serve old snapshots of Delta tables. Inserts, schema updates, or compactions applied in ADLS were invisible until the cache expired. For workloads where data correctness and freshness are critical, this lag is unacceptable.</p><h3>The Fix</h3><p>Following Starburst recommendations, we updated catalog properties to disable or minimize caching:</p><pre><code><code># Disable Hive metastore metadata cache
hive.metastore-cache-ttl=0m
hive.metastore-stats-cache-ttl=0m

# Disable caching of partition listings and missing entries
hive.metastore-cache.cache-partitions=false
hive.metastore-cache.cache-missing=false

# Disable Delta metadata caching
delta.metadata.cache-ttl=0m
delta.metadata.live-files.cache-ttl=0m

# Disable transaction log caching
delta.fs.cache.disable-transaction-log-caching=true
</code></code></pre><h3>What These Settings Do</h3><ul><li><p><strong>hive.metastore-cache-ttl=0m:</strong> Queries fetch fresh table and partition metadata every time.</p></li><li><p><strong>hive.metastore-stats-cache-ttl=0m:</strong> Avoids using stale statistics during query planning.</p></li><li><p><strong>hive.metastore-cache.cache-partitions=false:</strong> Ensures partition listings are never cached and always fetched fresh.</p></li><li><p><strong>hive.metastore-cache.cache-missing=false:</strong> Prevents caching of missing table or partition lookups.</p></li><li><p><strong>delta.metadata.cache-ttl=0m:</strong> Forces reading of the latest Delta transaction log on every query.</p></li><li><p><strong>delta.metadata.live-files.cache-ttl=0m:</strong> Makes sure queries reflect current sets of Delta data files.</p></li><li><p><strong>delta.fs.cache.disable-transaction-log-caching=true:</strong> Disables caching of transaction log listings, keeping data fresh.</p></li></ul><h3>Metadata Caching Across Different Table Formats</h3><p>Different table formats have distinct metadata caching behavior in Starburst/Trino. 
Understanding these differences helps in tuning cache settings effectively:</p><ul><li><p><strong>Hive</strong></p><ul><li><p>Properties like <code>hive.metastore-cache-ttl</code>, <code>hive.metastore-refresh-interval</code>, <code>hive.metastore-cache.cache-partitions</code>, and <code>hive.metastore-cache.cache-missing</code> control caching.</p></li><li><p>Fully documented at <a href="https://trino.io/docs/current/connector/hive.html">Metastores &#8212; Trino 476 Documentation</a>.</p></li><li><p>Starburst uses the Hive Metastore primarily for <strong>file location and table metadata</strong>.</p></li></ul></li><li><p><strong>Iceberg</strong></p><ul><li><p>The caching concepts are similar to Hive&#8217;s, but Iceberg tables store metadata differently (as metadata files on the data lake).</p></li><li><p>Iceberg connector docs at <a href="https://trino.io/docs/current/connector/iceberg.html">Iceberg connector &#8212; Trino 476 Documentation</a> list additional parameters like snapshot and manifest caching.</p></li></ul></li><li><p><strong>Delta Lake</strong></p><ul><li><p>Delta tables have their <strong>own cache TTL properties</strong>, like <code>delta.metadata.cache-ttl</code> and <code>delta.metadata.live-files.cache-ttl</code>.</p></li><li><p>Documentation at <a href="https://trino.io/docs/current/connector/delta.html">Delta Lake connector &#8212; Trino 476 Documentation</a> highlights these along with procedures to flush caches manually.</p></li><li><p>Unlike Hive/Iceberg, Delta&#8217;s metadata is tied directly to <strong>transaction logs</strong>, making it critical to tune cache settings for correctness when tables are updated frequently.</p></li></ul></li></ul><p><strong>Key takeaway:</strong> Even though the concept of metadata caching exists across all connectors, the <strong>specific properties, defaults, and behavior differ</strong> depending on whether you are using Hive, Iceberg, or Delta.</p><h3>Flushing Caches Manually</h3><p>Sometimes, despite tuned TTLs, you need an immediate cache refresh. 
Starburst provides system procedures:</p><pre><code><code>-- Flush all Hive metadata caches
CALL system.flush_metadata_cache();

-- Flush cache for a specific table
CALL system.flush_metadata_cache(schema_name =&gt; 'my_schema', table_name =&gt; 'my_table');

-- Flush cache for a specific partition
CALL system.flush_metadata_cache(
    schema_name =&gt; 'my_schema',
    table_name =&gt; 'my_table',
    partition_columns =&gt; ARRAY['dt'],
    partition_values =&gt; ARRAY['2025-08-25']
);

-- Flush filesystem cache for a table
CALL system.flush_filesystem_cache(schema_name =&gt; 'my_schema', table_name =&gt; 'my_table');
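</code></code></pre><p>The calls above leave the catalog implicit, so they run against whatever catalog the session currently uses. A minimal sketch of the table-level flush with an explicit catalog prefix follows. The catalog name <code>delta</code> is a hypothetical placeholder, not taken from our setup, and the follow-up query is just one way to confirm that newly written data is visible:</p><pre><code><code>-- 'delta' is an assumed catalog name; substitute the name of your own catalog
CALL delta.system.flush_metadata_cache(schema_name =&gt; 'my_schema', table_name =&gt; 'my_table');

-- Re-run a simple query to check that the latest data is now reflected
SELECT count(*) FROM delta.my_schema.my_table;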
</code></code></pre><p>These are essential during data ingestion, compaction, or troubleshooting when new data is not visible.</p><div><hr></div><h3>Trade-offs</h3><ul><li><p><strong>Correctness &amp; Freshness:</strong> Queries no longer read stale metadata or files.</p></li><li><p><strong>Performance Impact:</strong> Every query now hits the Hive Metastore and Delta log directly, increasing latency and load.</p></li></ul><h3>Practical Recommendation</h3><ul><li><p>For <strong>production systems</strong> where correctness is critical, disable caching as shown above.</p></li><li><p>For <strong>balanced performance</strong>, consider short TTLs (30 seconds to 1 minute) instead of completely turning off caches. This reduces load while keeping data reasonably fresh.</p></li></ul><h3>Sample Production-Ready Catalog Configuration</h3><pre><code><code>connector.name=delta_lake
hive.metastore.uri=thrift://hive:9083

# Metadata cache tuning (refresh-interval only has an effect when it is shorter than cache-ttl)
hive.metastore-cache-ttl=1m
hive.metastore-stats-cache-ttl=1m
hive.metastore-refresh-interval=1m

# Disable caching of partitions and missing entries
hive.metastore-cache.cache-partitions=false
hive.metastore-cache.cache-missing=false

# Delta-specific cache controls
delta.metadata.cache-ttl=0m
delta.metadata.live-files.cache-ttl=0m
delta.fs.cache.disable-transaction-log-caching=true

# Security and table registration
delta.security=starburst
delta.register-table-procedure.enabled=true

# Azure settings
fs.native-azure.enabled=true
azure.auth-type=OAUTH
azure.oauth.tenant-id=xxxx
azure.oauth.client-id=xxxx
azure.oauth.secret=xxxx
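</code></code></pre><p>If disabling the Delta caches outright turns out to be too expensive, the short-TTL option from the recommendation above could look like the sketch below. The <code>30s</code> values are illustrative assumptions, not settings we have validated; with them, a query may see metadata that is up to roughly 30 seconds old, so pick values that match your own freshness requirements:</p><pre><code><code># Assumed values for illustration only; replaces the corresponding TTL lines above
hive.metastore-cache-ttl=30s
hive.metastore-stats-cache-ttl=30s
delta.metadata.cache-ttl=30s
delta.metadata.live-files.cache-ttl=30s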
</code></code></pre><h3>Final Thoughts</h3><p>Caching is a double-edged sword. While it improves performance, it may silently introduce data staleness when working with Delta Lake on ADLS. Our choice to disable caching completely ensured query correctness at the expense of latency. Going forward, tuning cache TTLs to small values may strike the right balance.</p><p>If you use Starburst with Delta Lake on ADLS, review your cache settings today&#8212;your default settings might be hiding stale data issues.</p><h3>References</h3><ol><li><p><a href="https://docs.starburst.io/latest/connector/delta-lake.html">Starburst Delta Lake Connector Documentation</a></p></li><li><p><a href="https://www.starburst.io/blog/delta-lake-concurrent-writes-starburst/">Starburst Blog: Delta Lake Concurrent Writes and Consistency</a></p></li><li><p><a href="https://delta.io/">Delta Lake Official Website</a></p></li><li><p><a href="https://www.starburst.io/blog/data-lake-adls/">Starburst Blog: Moving On-Premise Data to Azure Cloud ADLS</a></p></li><li><p><a href="https://www.starburst.io/microsoft-azure/">Starburst Enterprise on Azure Overview</a></p></li></ol>]]></content:encoded></item><item><title><![CDATA[Monitoring Starburst Enterprise (Trino) Clusters with REST APIs]]></title><description><![CDATA[Practical ways to track health, state, and performance through coordinator endpoints]]></description><link>https://tshasankda.substack.com/p/monitoring-starburst-enterprisetrino</link><guid isPermaLink="false">https://tshasankda.substack.com/p/monitoring-starburst-enterprisetrino</guid><dc:creator><![CDATA[Tharashasank Davuluru]]></dc:creator><pubDate>Wed, 03 Sep 2025 07:12:16 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!52Xx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc73ae2-4205-4c0d-b6bc-323b0a474075_1050x629.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Modern data 
analytics rely on distributed SQL engines like Starburst (Trino) to process large-scale data with speed and flexibility. While these engines excel at query execution, ensuring <strong>cluster health</strong> and <strong>observability</strong> is critical to maintaining performance and reliability. The coordinator node exposes REST APIs that provide real-time insights into cluster state, workload, and system health.</em></p><p><em>This guide is for Data Engineers, DevOps, and Site Reliability Engineers who want to maximize uptime and optimize Starburst clusters using native APIs combined with monitoring tools and automation.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!52Xx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc73ae2-4205-4c0d-b6bc-323b0a474075_1050x629.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!52Xx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc73ae2-4205-4c0d-b6bc-323b0a474075_1050x629.png 424w, https://substackcdn.com/image/fetch/$s_!52Xx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc73ae2-4205-4c0d-b6bc-323b0a474075_1050x629.png 848w, https://substackcdn.com/image/fetch/$s_!52Xx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc73ae2-4205-4c0d-b6bc-323b0a474075_1050x629.png 1272w, https://substackcdn.com/image/fetch/$s_!52Xx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc73ae2-4205-4c0d-b6bc-323b0a474075_1050x629.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!52Xx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc73ae2-4205-4c0d-b6bc-323b0a474075_1050x629.png" width="1050" height="629" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7cc73ae2-4205-4c0d-b6bc-323b0a474075_1050x629.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:629,&quot;width&quot;:1050,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!52Xx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc73ae2-4205-4c0d-b6bc-323b0a474075_1050x629.png 424w, https://substackcdn.com/image/fetch/$s_!52Xx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc73ae2-4205-4c0d-b6bc-323b0a474075_1050x629.png 848w, https://substackcdn.com/image/fetch/$s_!52Xx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc73ae2-4205-4c0d-b6bc-323b0a474075_1050x629.png 1272w, https://substackcdn.com/image/fetch/$s_!52Xx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc73ae2-4205-4c0d-b6bc-323b0a474075_1050x629.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft 
icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image source from Starburst Media</figcaption></figure></div><p><em><strong>Starburst (Trino) Architecture: Coordinator and Worker Roles</strong></em></p><ul><li><p><em><strong>Coordinator</strong>: Acts as the control plane &#8212; receiving queries, planning execution, scheduling tasks, and aggregating results.</em></p></li><li><p><em><strong>Workers</strong>: Perform the actual data processing &#8212; scanning, filtering, joining, and aggregating in parallel.</em></p></li></ul><p><em>Monitoring focuses primarily on the coordinator&#8217;s APIs, which expose cluster-wide state and workload metrics.</em></p><p><em><strong>Coordinator State Endpoint: </strong></em><code>/v1/info/state</code></p><p><em><strong>Request:</strong></em></p><pre><code>curl 
http://coordinator_host:8080/v1/info/state</code></pre><p><em><strong>Response Example:</strong></em></p><pre><code>"ACTIVE"</code></pre><p><em><strong>Possible States and Operational Meaning:</strong></em></p><p><em><strong>ACTIVE</strong></em></p><ul><li><p><em>Node fully initialized and serving queries</em></p></li><li><p><em>Safe to route production queries</em></p></li></ul><p><em><strong>STARTING</strong></em></p><ul><li><p><em>Node initialization in progress</em></p></li><li><p><em>Hold traffic until ready</em></p></li></ul><p><em><strong>SHUTTING_DOWN</strong></em></p><ul><li><p><em>Node gracefully stopping</em></p></li><li><p><em>No new queries; prepare for maintenance</em></p></li></ul><p><em><strong>FAILED / INACTIVE</strong></em></p><ul><li><p><em>Node encountered errors or is down</em></p></li><li><p><em>Requires operator action or automated restart</em></p></li></ul><blockquote><p><em><strong>Best Practice:</strong> Use </em><code>/v1/info/state</code><em> as the <strong>readiness probe</strong> in Kubernetes. Unexpected transitions should trigger alerts.</em></p></blockquote><p><em><strong>Cluster Metrics Endpoint: </strong></em><code>/ui/api/stats</code></p><p><em><strong>Request:</strong></em></p><pre><code>curl http://coordinator_host:8080/ui/api/stats</code></pre><p><em><strong>Sample Response:</strong></em></p><pre><code>{
  "runningQueries": 3,
  "blockedQueries": 1,
  "queuedQueries": 2,
  "activeWorkers": 5,
  "runningDrivers": 15,
  "totalAvailableProcessors": 20,
  "reservedMemory": 5120000000,
  "totalInputRows": 1500000,
  "totalInputBytes": 6000000000,
  "totalCpuTimeSecs": 4500
}</code></pre><p><em><strong>Metric Deep Dive:</strong></em></p><ul><li><p><em><strong>runningQueries / blockedQueries / queuedQueries</strong> &#8594; Query execution pipeline.</em></p></li><li><p><em><strong>activeWorkers</strong> &#8594; Number of connected, healthy workers.</em></p></li><li><p><em><strong>runningDrivers</strong> &#8594; Execution threads currently active.</em></p></li><li><p><em><strong>totalAvailableProcessors</strong> &#8594; Total CPU capacity across cluster.</em></p></li><li><p><em><strong>reservedMemory</strong> &#8594; Memory allocated to queries (bytes).</em></p></li><li><p><em><strong>totalInputRows / totalInputBytes</strong> &#8594; Cluster throughput since startup.</em></p></li><li><p><em><strong>totalCpuTimeSecs</strong> &#8594; Total CPU time consumed by queries.</em></p></li></ul><p><em><strong>Operational Tips:</strong></em></p><ul><li><p><em>Alert when </em><code>queuedQueries</code><em> or </em><code>blockedQueries</code><em> spike.</em></p></li><li><p><em>Track </em><code>activeWorkers</code><em> to detect node failures.</em></p></li><li><p><em>Use </em><code>reservedMemory</code><em> to monitor memory pressure.</em></p></li><li><p><em>Combine </em><code>runningDrivers</code><em> with CPU metrics to spot bottlenecks.</em></p></li></ul><p><em><strong>Readiness and Liveness Probes: </strong></em><code>/v1/info</code><em><strong> &amp; </strong></em><code>/v1/status</code></p><p><em><strong>Sample </strong></em><code>/v1/info</code><em><strong> Response:</strong></em></p><pre><code>{
  "nodeVersion": { "version": "365" },
  "environment": "prod",
  "coordinator": true,
  "starting": false,
  "uptime": "3h45m"
}</code></pre><p><em><strong>Sample </strong></em><code>/v1/status</code><em><strong> Response:</strong></em></p><pre><code>{
  "status": "OK",
  "uptime": "3h45m"
}</code></pre><p><em><strong>Usage Recommendations:</strong></em></p><ul><li><p><code>/v1/info</code><em> &#8594; Best for <strong>readiness checks</strong>; ensure </em><code>coordinator=true</code><em> and </em><code>starting=false</code><em>.</em></p></li><li><p><code>/v1/status</code><em> &#8594; Ideal for <strong>liveness checks</strong>; restart pods if unresponsive.</em></p></li></ul><p><em><strong>Security Considerations</strong></em></p><ul><li><p><em>By default, Trino/Starburst APIs may be open inside the network.</em></p></li></ul><p><em><strong>Best practices:</strong></em></p><ul><li><p><em>Restrict API access to trusted IPs or VPNs.</em></p></li><li><p><em>Enable TLS at the ingress or load balancer.</em></p></li><li><p><em>Monitor API logs for unusual access.</em></p></li></ul><p><em><strong>Integrations and Tooling</strong></em></p><ul><li><p><em><strong>Prometheus Exporters:</strong> Scrape </em><code>/ui/api/stats</code><em> for long-term monitoring.</em></p></li><li><p><em><strong>Grafana Dashboards:</strong> Visualize queries, worker counts, CPU load, and memory usage.</em></p></li><li><p><em><strong>Kubernetes Probes:</strong> Automate restart and readiness checks with </em><code>/v1/info/state</code><em> and </em><code>/v1/status</code><em>.</em></p></li><li><p><em><strong>Automation Scripts:</strong> Scale clusters or trigger alerts based on metrics.</em></p></li><li><p><em><strong>Incident Response:</strong> Use APIs in playbooks to assess cluster health quickly during outages.</em></p></li></ul><p><em><strong>Example: Kubernetes Probes</strong></em></p><pre><code>livenessProbe:
  httpGet:
    path: /v1/status
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 15
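  # Assumed tuning values, not from the original: allow up to 5s per check and
  # restart the container only after 3 consecutive failed checks.
  timeoutSeconds: 5
  failureThreshold: 3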
readinessProbe:
  httpGet:
    path: /v1/info/state
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10</code></pre><p><em>This ensures pods only serve traffic when ready and restart automatically when unhealthy.</em></p><p><em><strong>Monitoring Best Practices</strong></em></p><ul><li><p><em>Poll </em><code>/ui/api/stats</code><em> at a regular interval for near-real-time metrics.</em></p></li><li><p><em>Watch </em><code>activeWorkers</code><em> as a simple cluster health indicator.</em></p></li><li><p><em>Track query distribution (</em><code>running vs queued vs blocked</code><em>).</em></p></li><li><p><em>Use </em><code>uptime</code><em> and </em><code>version</code><em> to track rolling upgrades.</em></p></li></ul><p><em><strong>Conclusion</strong></em></p><p><em>Starburst (Trino) coordinator REST APIs provide operators with a <strong>lightweight but powerful</strong> way to monitor clusters. By integrating endpoints like </em><code>/v1/info/state</code><em>, </em><code>/ui/api/stats</code><em>, </em><code>/v1/info</code><em>, and </em><code>/v1/status</code><em> into dashboards, probes, and automation pipelines, teams can:</em></p><ul><li><p><em>Detect anomalies early,</em></p></li><li><p><em>Optimize resources,</em></p></li><li><p><em>Automate scaling and failover,</em></p></li><li><p><em>Maintain service reliability.</em></p></li></ul><p><em>In short, these APIs transform Starburst clusters from black boxes into <strong>transparent, observable systems</strong>, a foundation for running modern, production-grade analytics at scale.</em></p>]]></content:encoded></item><item><title><![CDATA[Building ETL Pipelines in Starburst Enterprise Platform (SEP)]]></title><description><![CDATA[What I Learned Using Python DataFrames in Starburst Enterprise]]></description><link>https://tshasankda.substack.com/p/from-queries-to-pipelines-in-starburst</link><guid isPermaLink="false">https://tshasankda.substack.com/p/from-queries-to-pipelines-in-starburst</guid><dc:creator><![CDATA[Tharashasank Davuluru]]></dc:creator><pubDate>Wed, 27 Aug 2025 04:07:02 
GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!XnsZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa564eb8e-1fca-4632-a778-0535cc83094c_1189x565.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Starburst Enterprise Platform (SEP) was the go-to engine for federated queries and analytics at scale. We used it to query across data lakes, warehouses, and relational stores without moving data.</em></p><p><em>But ETL and pipeline building? That typically lived elsewhere: Spark clusters, Pandas scripts, or licensed ETL tools.</em></p><p><em>Recently, while going through a set of blog posts from <a href="https://www.starburst.io/blog/">Starburst</a> and <a href="https://lestermartin.blog/">Lester Martin</a>, I realised something powerful: with the new Python DataFrame APIs, even ETL and pipeline workflows can be done natively in Starburst Enterprise.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XnsZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa564eb8e-1fca-4632-a778-0535cc83094c_1189x565.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XnsZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa564eb8e-1fca-4632-a778-0535cc83094c_1189x565.png 424w, https://substackcdn.com/image/fetch/$s_!XnsZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa564eb8e-1fca-4632-a778-0535cc83094c_1189x565.png 848w, 
https://substackcdn.com/image/fetch/$s_!XnsZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa564eb8e-1fca-4632-a778-0535cc83094c_1189x565.png 1272w, https://substackcdn.com/image/fetch/$s_!XnsZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa564eb8e-1fca-4632-a778-0535cc83094c_1189x565.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XnsZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa564eb8e-1fca-4632-a778-0535cc83094c_1189x565.png" width="1189" height="565" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a564eb8e-1fca-4632-a778-0535cc83094c_1189x565.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:565,&quot;width&quot;:1189,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:81317,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://tshasankda.substack.com/i/172057261?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa564eb8e-1fca-4632-a778-0535cc83094c_1189x565.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XnsZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa564eb8e-1fca-4632-a778-0535cc83094c_1189x565.png 424w, 
https://substackcdn.com/image/fetch/$s_!XnsZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa564eb8e-1fca-4632-a778-0535cc83094c_1189x565.png 848w, https://substackcdn.com/image/fetch/$s_!XnsZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa564eb8e-1fca-4632-a778-0535cc83094c_1189x565.png 1272w, https://substackcdn.com/image/fetch/$s_!XnsZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa564eb8e-1fca-4632-a778-0535cc83094c_1189x565.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Image source: blog by Lester Martin</figcaption></figure></div><p><em><strong>What I Learned About Python DataFrames in SEP</strong></em></p><p><em>SEP now supports two major Python DataFrame APIs on top of Trino:</em></p><ul><li><p><em>PyStarburst: Starburst&#8217;s native DataFrame API designed to feel familiar to PySpark users. It leverages Starburst-native extensions for performance.</em></p></li><li><p><em>Ibis: A backend-agnostic DataFrame API that works across multiple engines, including SEP, providing portability and flexibility in hybrid data landscapes.</em></p></li></ul><p><em><strong>ETL Without Moving Data?</strong></em></p><p><em>Instead of extracting data into external tools, you can:</em></p><ul><li><p><em>Read tables directly from ADLS, S3, or your warehouse into a Python DataFrame inside SEP.</em></p></li><li><p><em>Apply transformations using intuitive Python methods like filters, joins, and group-bys.</em></p></li><li><p><em>Save the transformed results back to object stores, registered as Hive or Glue tables.</em></p></li></ul><p><em>All of this happens natively inside Starburst Enterprise, with no extra data movement or cluster hops required.</em></p><p><em><strong>Pipelines That Are Enterprise-Ready</strong></em></p><ul><li><p><em>Lazy execution: Transformations don&#8217;t run immediately; like in Spark, they only execute when you call </em><code>.collect()</code><em> or save results.</em></p></li><li><p><em>SEP&#8217;s orchestration and resource management capabilities allow scaling to production workloads.</em></p></li><li><p><em>Integration with tools like Airflow, dbt, and CI/CD pipelines lets you treat SEP as a full-fledged pipeline engine.</em></p></li></ul><p><em>This means development teams can write, test, version, and deploy ETL pipelines as regular Python projects, unlocking developer 
discipline alongside enterprise governance.</em></p><p><em><strong>PyStarburst vs Ibis in an Enterprise Context</strong></em></p><ul><li><p><em>PyStarburst is ideal if you are fully invested in SEP and want the best performance with Starburst-native features.</em></p></li><li><p><em>Ibis shines if your enterprise environment involves multiple SQL engines and you want a portable DataFrame API across platforms.</em></p></li></ul><p><em><strong>Why This Stands Out</strong></em></p><p><em>In my experience with enterprise data platforms, there&#8217;s often a fragmented ecosystem:</em></p><ul><li><p><em>SEP for federation and analytics.</em></p></li><li><p><em>Spark clusters for ETL.</em></p></li><li><p><em>Low-code tools (like Alteryx) for smaller jobs.</em></p></li></ul><blockquote><p><em>This results in complexity, silos, and governance headaches.</em></p></blockquote><p><em>SEP&#8217;s Python DataFrame APIs point to a future where:</em></p><ul><li><p><em>One engine queries, transforms, and serves pipelines.</em></p></li><li><p><em>Governance happens in a single place, backed by Hive metastores.</em></p></li><li><p><em>Operational burden and complexity drop drastically.</em></p></li></ul><p><em><strong>Closing Thought</strong></em></p><p><em>With Python DataFrames, Starburst Enterprise is evolving beyond analytics into a unified pipeline platform. 
ETL and transformations no longer need separate clusters or tools; they can run directly on the same governed, federated platform.</em></p><blockquote><p><em>For me, SEP now looks less like a query engine and more like a unified data platform, one that simplifies pipelines, governance, and analytics in a single place.</em></p></blockquote><p><em><strong>References:</strong></em></p><ul><li><p><em><a href="https://lestermartin.blog/2023/10/27/ibis-trino-dataframe-api-part-deux/">Ibis &amp; Trino (DataFrame API Part Deux)</a></em></p></li><li><p><em><a href="https://www.starburst.io/blog/ibis-trino/">Ibis and Trino</a></em></p></li><li><p><em><a href="https://www.starburst.io/blog/how-to-build-data-transformations-with-python-ibis-and-starburst-galaxy/">How to Build Data Transformations with Python, Ibis, and Starburst Galaxy</a></em></p></li><li><p><em><a href="https://www.starburst.io/blog/pystarburst-the-dataframe-api/">PyStarburst: the DataFrame API</a></em></p></li><li><p><em><a href="https://www.starburst.io/blog/introducing-python-dataframes/">Introducing Python DataFrames in Starburst Galaxy</a></em></p></li><li><p><em><a href="https://lestermartin.blog/2025/08/18/building-trino-data-pipelines-with-sql-or-python/">Building Trino Data Pipelines with SQL or Python</a></em></p></li><li><p><em><a href="https://lestermartin.blog/2023/08/10/building-a-sql-based-data-pipeline-with-trino-starburst-5-slick-videos/">Building a SQL-based data pipeline with Trino Starburst</a></em></p></li></ul>]]></content:encoded></item><item><title><![CDATA[Running Starburst Enterprise: The Complete COE & Platform Playbook]]></title><description><![CDATA[Team structures, roles, and query optimization strategies for performance, governance, and adoption]]></description><link>https://tshasankda.substack.com/p/running-starburst-enterprise-the</link><guid 
isPermaLink="false">https://tshasankda.substack.com/p/running-starburst-enterprise-the</guid><dc:creator><![CDATA[Tharashasank Davuluru]]></dc:creator><pubDate>Tue, 19 Aug 2025 08:23:35 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!EnlV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18bd4a5d-d00f-4b69-9d9a-a0240599f0b4_1456x971.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Starburst Enterprise, built on the open-source Trino engine (formerly PrestoSQL), lets you run fast, federated SQL across data lakes, warehouses, and operational systems without moving data. But performance and value don&#8217;t come from software alone; they come from operational discipline, governance, and a Center of Excellence (COE) that aligns people, process, and platform.</em></p><p><em>This handbook merges platform controls, observability, and query optimization with a complete, real-world COE operating model. 
So you can run Starburst as a reliable product, not a best-effort cluster.</em></p><p><em><strong>Part 1 &#8212; Platform Playbook: Healthy Starburst by Design</strong></em></p><p><em><strong>1.1 Core Platform Responsibilities</strong></em></p><ul><li><p><em>Cluster configuration &amp; lifecycle: coordinators, workers, catalogs/connectors, versioning.</em></p></li><li><p><em>Resource management: memory guardrails, workload groups, fair scheduling.</em></p></li><li><p><em>Reliability &amp; elasticity: HA patterns, rolling upgrades, scaling policies.</em></p></li><li><p><em>Observability: metrics, logs, dashboards, SLOs.</em></p></li></ul><blockquote><p><em>Best practice: Conduct a quarterly config review against workload patterns, compliance, and cost goals.</em></p></blockquote><p><em><strong>1.2 Stability &amp; Fair-Use Controls (non-negotiables)</strong></em></p><p><em>Some baseline configurations every enterprise should enforce:</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WHmH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F105642c1-ec9f-4de9-aa00-1fca48aaaafc_1366x274.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WHmH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F105642c1-ec9f-4de9-aa00-1fca48aaaafc_1366x274.png 424w, https://substackcdn.com/image/fetch/$s_!WHmH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F105642c1-ec9f-4de9-aa00-1fca48aaaafc_1366x274.png 848w, 
https://substackcdn.com/image/fetch/$s_!WHmH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F105642c1-ec9f-4de9-aa00-1fca48aaaafc_1366x274.png 1272w, https://substackcdn.com/image/fetch/$s_!WHmH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F105642c1-ec9f-4de9-aa00-1fca48aaaafc_1366x274.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WHmH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F105642c1-ec9f-4de9-aa00-1fca48aaaafc_1366x274.png" width="1366" height="274" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/105642c1-ec9f-4de9-aa00-1fca48aaaafc_1366x274.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:274,&quot;width&quot;:1366,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:55901,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://tshasankda.substack.com/i/171241261?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F105642c1-ec9f-4de9-aa00-1fca48aaaafc_1366x274.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WHmH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F105642c1-ec9f-4de9-aa00-1fca48aaaafc_1366x274.png 424w, 
https://substackcdn.com/image/fetch/$s_!WHmH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F105642c1-ec9f-4de9-aa00-1fca48aaaafc_1366x274.png 848w, https://substackcdn.com/image/fetch/$s_!WHmH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F105642c1-ec9f-4de9-aa00-1fca48aaaafc_1366x274.png 1272w, https://substackcdn.com/image/fetch/$s_!WHmH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F105642c1-ec9f-4de9-aa00-1fca48aaaafc_1366x274.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p><em><strong>1.3 Alerts That Matter</strong></em></p><p><em>Don&#8217;t drown in noise. Focus alerts on:</em></p><ul><li><p><em>High query failure rates</em></p></li><li><p><em>Spill-to-disk thresholds exceeded</em></p></li><li><p><em>Cluster CPU/memory saturation</em></p></li><li><p><em>Coordinator unavailability</em></p></li><li><p><em>Queue times rising abnormally</em></p></li></ul><p><em>Integrate with Prometheus + Grafana or cloud-native monitoring to ensure fast response.</em></p><p><em><strong>1.4 Starburst Enterprise Built-In Insights &amp; Logs</strong></em></p><ul><li><p><em>Query History: SQL text, duration, resource usage, errors</em></p></li><li><p><em>Usage Metrics: CPU/memory + workload patterns + query counts</em></p></li><li><p><em>Cluster History: performance and execution baselines, executions, error codes, runtime stats, plans, events</em></p></li><li><p><em>Audit Logs (BIAC): compliance-grade access events, catalog/schema/table access, DDL</em></p></li></ul><p><em>These are your rear-view mirror and dashboard for optimization and governance.</em></p><p><em>Documentation for deeper dives:</em></p><ul><li><p><em><a href="https://docs.starburst.io/latest/admin/backend-service.html">Backend 
service &#8212; Starburst Enterprise</a></em></p></li><li><p><em><a href="https://docs.starburst.io/latest/insights/usage-metrics-overview.html">Insights usage metrics &#8212; Starburst Enterprise</a></em></p></li><li><p><em><a href="https://docs.starburst.io/latest/insights/cluster-history.html">Insights cluster history &#8212; Starburst Enterprise</a></em></p></li><li><p><em><a href="https://docs.starburst.io/latest/security/biac-audit-log.html">Built-in access control audit log &#8212; Starburst Enterprise</a></em></p></li></ul><blockquote><p><em>DC&amp;D note: Logs aren&#8217;t &#8220;admin noise.&#8221; Read them right and they&#8217;ll tell you why Finance&#8217;s dashboard fails at month-end.</em></p></blockquote><p><em><strong>Part 2 &#8212; Query Optimization in Starburst</strong></em></p><p><em><strong>2.1 Common Bottlenecks</strong></em></p><ul><li><p><em>Compute: insufficient CPU/memory; overloaded workers</em></p></li><li><p><em>I/O &amp; network: slow external storage; high latency</em></p></li><li><p><em>Query design: excessive scans; suboptimal join order/type</em></p></li><li><p><em>Data layout: skewed partitions; too many small files</em></p></li></ul><p><em><strong>2.2 Diagnose with EXPLAIN / EXPLAIN ANALYZE</strong></em></p><ul><li><p><em>EXPLAIN &#8594; planned operators and join strategies</em></p></li><li><p><em>EXPLAIN ANALYZE &#8594; actual CPU, memory, I/O by operator; real hotspots</em></p></li></ul><p><em><strong>2.3 Proven Techniques</strong></em></p><ul><li><p><em>Resource tuning: </em><code>query.max-memory</code><em>, </em><code>task.concurrency</code><em>, workload groups</em></p></li><li><p><em>I/O: Parquet/ORC, compression, predicate pushdown, partition pruning</em></p></li><li><p><em>Joins: broadcast vs distributed; reorder by selectivity; dynamic filtering</em></p></li><li><p><em>Layout: sensible partitioning + bucketing; merge small files</em></p></li></ul><p><em><strong>2.4 Long-Term 
Practices</strong></em></p><ul><li><p><em>Weekly slow-query reviews; monthly partition/layout hygiene</em></p></li><li><p><em>Keep versions current to leverage optimizer improvements</em></p></li><li><p><em>Prefer star schemas or selective denormalization where appropriate</em></p></li><li><p><em>Share patterns via COE enablement (playbooks, labs)</em></p></li></ul><p><em><strong>Part 3 &#8212; The Starburst COE: Operating Model, Roles, and Runbooks (Deep Dive)</strong></em></p><p><em>The Center of Excellence makes Starburst successful organizationally: it standardizes how teams onboard, design queries, request capacity, meet SLAs, and stay compliant. Below is a complete, practical operating model.</em></p><p><em><strong>3.0 COE Charter (What/Why)</strong></em></p><p><em>Mission: Accelerate safe, performant, cost-aware Starburst adoption across the enterprise.<br>Scope:</em></p><ul><li><p><em>Governance: policies, standards, quotas, access models</em></p></li><li><p><em>Enablement: training, labs, office hours, champions network</em></p></li><li><p><em>Optimization: performance clinics, workload patterns, cost reviews</em></p></li><li><p><em>Adoption: intake, onboarding, reference architectures, use-case templates</em></p></li></ul><p><em>Non-Goals:</em></p><ul><li><p><em>Building every workload for teams</em></p></li><li><p><em>Owning line-of-business data modeling</em></p></li></ul><p><em><strong>3.1 Operating Model: Hub-and-Spoke</strong></em></p><ul><li><p><em>Hub (COE Core): sets standards, owns platform policies, runs enablement, performs complex reviews.</em></p></li><li><p><em>Spokes (Domain Teams): build solutions using COE guardrails; nominate Champions; own SLAs for their products.</em></p></li><li><p><em>Shared Services: SRE, Security, FinOps, Data Governance&#8212;embedded or dotted-line to COE.</em></p></li></ul><blockquote><p><em>DC&amp;D note: Centralize policy, decentralize execution. 
That&#8217;s how you scale without chaos.</em></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EnlV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18bd4a5d-d00f-4b69-9d9a-a0240599f0b4_1456x971.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EnlV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18bd4a5d-d00f-4b69-9d9a-a0240599f0b4_1456x971.png 424w, https://substackcdn.com/image/fetch/$s_!EnlV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18bd4a5d-d00f-4b69-9d9a-a0240599f0b4_1456x971.png 848w, https://substackcdn.com/image/fetch/$s_!EnlV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18bd4a5d-d00f-4b69-9d9a-a0240599f0b4_1456x971.png 1272w, https://substackcdn.com/image/fetch/$s_!EnlV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18bd4a5d-d00f-4b69-9d9a-a0240599f0b4_1456x971.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EnlV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18bd4a5d-d00f-4b69-9d9a-a0240599f0b4_1456x971.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/18bd4a5d-d00f-4b69-9d9a-a0240599f0b4_1456x971.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Generated image&quot;,&quot;title&quot;:&quot;Generated image&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Generated image" title="Generated image" srcset="https://substackcdn.com/image/fetch/$s_!EnlV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18bd4a5d-d00f-4b69-9d9a-a0240599f0b4_1456x971.png 424w, https://substackcdn.com/image/fetch/$s_!EnlV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18bd4a5d-d00f-4b69-9d9a-a0240599f0b4_1456x971.png 848w, https://substackcdn.com/image/fetch/$s_!EnlV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18bd4a5d-d00f-4b69-9d9a-a0240599f0b4_1456x971.png 1272w, https://substackcdn.com/image/fetch/$s_!EnlV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18bd4a5d-d00f-4b69-9d9a-a0240599f0b4_1456x971.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>3.2 Roles, Responsibilities &amp; Dependencies</strong></em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GNT6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c712195-9134-4f41-9b14-7ff3e82ac524_1687x541.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GNT6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c712195-9134-4f41-9b14-7ff3e82ac524_1687x541.png 424w, https://substackcdn.com/image/fetch/$s_!GNT6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c712195-9134-4f41-9b14-7ff3e82ac524_1687x541.png 848w, 
https://substackcdn.com/image/fetch/$s_!GNT6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c712195-9134-4f41-9b14-7ff3e82ac524_1687x541.png 1272w, https://substackcdn.com/image/fetch/$s_!GNT6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c712195-9134-4f41-9b14-7ff3e82ac524_1687x541.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GNT6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c712195-9134-4f41-9b14-7ff3e82ac524_1687x541.png" width="728" height="233.5" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2c712195-9134-4f41-9b14-7ff3e82ac524_1687x541.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:467,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:136354,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://tshasankda.substack.com/i/171241261?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c712195-9134-4f41-9b14-7ff3e82ac524_1687x541.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GNT6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c712195-9134-4f41-9b14-7ff3e82ac524_1687x541.png 424w, 
https://substackcdn.com/image/fetch/$s_!GNT6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c712195-9134-4f41-9b14-7ff3e82ac524_1687x541.png 848w, https://substackcdn.com/image/fetch/$s_!GNT6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c712195-9134-4f41-9b14-7ff3e82ac524_1687x541.png 1272w, https://substackcdn.com/image/fetch/$s_!GNT6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c712195-9134-4f41-9b14-7ff3e82ac524_1687x541.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><em><strong>Key dependencies</strong></em></p><ul><li><p><em>COE relies on Platform/SRE for telemetry and change safety.</em></p></li><li><p><em>Domains rely on COE for patterns, templates, and escalations.</em></p></li><li><p><em>Governance relies on COE for audit evidence (BIAC), access certs, and control attestation.</em></p></li><li><p><em>FinOps relies on COE + Platform for accurate tagging and consumption metrics.</em></p></li></ul><p><em><strong>3.3 COE RACI for Critical Activities</strong></em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!X9X2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F912a5554-6964-4987-9554-c8b0dbe235a4_1191x457.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!X9X2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F912a5554-6964-4987-9554-c8b0dbe235a4_1191x457.png 424w, 
https://substackcdn.com/image/fetch/$s_!X9X2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F912a5554-6964-4987-9554-c8b0dbe235a4_1191x457.png 848w, https://substackcdn.com/image/fetch/$s_!X9X2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F912a5554-6964-4987-9554-c8b0dbe235a4_1191x457.png 1272w, https://substackcdn.com/image/fetch/$s_!X9X2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F912a5554-6964-4987-9554-c8b0dbe235a4_1191x457.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!X9X2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F912a5554-6964-4987-9554-c8b0dbe235a4_1191x457.png" width="1191" height="457" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/912a5554-6964-4987-9554-c8b0dbe235a4_1191x457.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:457,&quot;width&quot;:1191,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:58375,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://tshasankda.substack.com/i/171241261?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F912a5554-6964-4987-9554-c8b0dbe235a4_1191x457.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!X9X2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F912a5554-6964-4987-9554-c8b0dbe235a4_1191x457.png 424w, https://substackcdn.com/image/fetch/$s_!X9X2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F912a5554-6964-4987-9554-c8b0dbe235a4_1191x457.png 848w, https://substackcdn.com/image/fetch/$s_!X9X2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F912a5554-6964-4987-9554-c8b0dbe235a4_1191x457.png 1272w, https://substackcdn.com/image/fetch/$s_!X9X2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F912a5554-6964-4987-9554-c8b0dbe235a4_1191x457.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p><em>A = Accountable, R = Responsible, C = Consulted, I = Informed</em></p><p><em><strong>3.4 Intake &amp; Onboarding (How Work Arrives)</strong></em></p><p><em><strong>Intake form (must-have fields)</strong></em></p><ul><li><p><em>Use case &amp; business owner</em></p></li><li><p><em>Data domains &amp; sensitivity classification</em></p></li><li><p><em>Catalogs/connectors needed</em></p></li><li><p><em>Expected concurrency &amp; SLA tier (e.g., Gold/Silver/Bronze)</em></p></li><li><p><em>Cost center tag + env (dev/test/prod)</em></p></li><li><p><em>Required timelines; dependencies</em></p></li></ul><p><em><strong>Onboarding checklist</strong></em></p><ol><li><p><em>Governance classification &amp; access patterns approved</em></p></li><li><p><em>Workload group &amp; queue assignment set (SLO tier)</em></p></li><li><p><em>Tags/labels for cost &amp; ownership applied</em></p></li><li><p><em>Observability: dashboards/alerts created</em></p></li><li><p><em>Enablement: users enrolled in Foundations course</em></p></li><li><p><em>Dry run + performance baseline captured</em></p></li><li><p><em>Go-live with rollback and comms plan</em></p></li></ol><p><em><strong>3.5 Policy &amp; Governance (What &#8220;Good&#8221; Looks Like)</strong></em></p><ul><li><p><em>Access &amp; Identity: least-privilege by role; time-bound elevations; catalog-scoped service accounts.</em></p></li><li><p><em>Resource Governance: standard memory caps; clear queue tiers (Gold = mission-critical; Silver = BI; Bronze = ad-hoc).</em></p></li><li><p><em>Change Control: CAB for connector upgrades and platform versions; maintenance windows; canary
strategy.</em></p></li><li><p><em>Data Classification: P0/P1/P2 categories; mask/deny patterns by class; lineage expectations.</em></p></li><li><p><em>Audits (BIAC): monthly review of access anomalies; quarterly attestation to compliance.</em></p></li></ul><p><em><strong>3.6 Enablement &amp; Learning (Make Good Behavior the Default)</strong></em></p><p><em><strong>Curriculum paths</strong></em></p><ul><li><p><em>Foundations (2 hrs): Starburst basics, query history, limits &amp; queues, self-service troubleshooting.</em></p></li><li><p><em>Performance 101 (3 hrs): EXPLAIN/ANALYZE, join strategies, partition pruning, pushdown patterns.</em></p></li><li><p><em>Data Modeling for Starburst (2 hrs): star schemas, denorm tradeoffs, layout hygiene.</em></p></li><li><p><em>Operator Track (2 hrs): guardrails, SLOs, incident runbooks, dashboards.</em></p></li><li><p><em>Champion Track (ongoing): monthly clinics, deep dives, early-feature pilots.</em></p></li></ul><p><em><strong>Programs</strong></em></p><ul><li><p><em>Office Hours (weekly), Performance Clinics (bi-weekly), Champions Network (per domain), Quarterly Bootcamp (new hires).</em></p></li></ul><p><em><strong>Artifacts</strong></em></p><ul><li><p><em>Runbooks: &#8220;When your query is killed,&#8221; &#8220;Spill-heavy workload mitigation,&#8221; &#8220;Broadcast vs distributed joins.&#8221;</em></p></li><li><p><em>Templates: intake form, design review checklist, exception request, performance baseline report.</em></p></li></ul><p><em><strong>3.7 Rhythm of Business (Cadence)</strong></em></p><ul><li><p><em>Weekly Ops Review: SRE + Platform + COE &#8594; incidents, SLOs, hot spots.</em></p></li><li><p><em>Bi-weekly Performance Clinic: COE + Domains &#8594; top slow queries; remediations.</em></p></li><li><p><em>Monthly Architecture Board: COE + Security/Gov + Platform &#8594; connector/version changes.</em></p></li><li><p><em>Quarterly Business Review: COE &#8594; adoption, cost, wins, 
roadmap.</em></p></li><li><p><em>Quarterly Control Review: Governance &#8594; BIAC findings, access attestations.</em></p></li></ul><p><em><strong>3.8 Metrics &amp; OKRs (Measure What Matters)</strong></em></p><p><em><strong>KPIs</strong></em></p><ul><li><p><em>Reliability: % queries meeting SLA tier; coordinator uptime</em></p></li><li><p><em>Performance: P95 runtime by workload tier; spill rate trend</em></p></li><li><p><em>Cost: queries/$; idle capacity %; right-sizing actions closed</em></p></li><li><p><em>Adoption: active users/teams; # certified users; champions coverage</em></p></li><li><p><em>Governance: BIAC anomalies resolved; access reviews on time</em></p></li></ul><p><em><strong>Sample Quarterly OKRs</strong></em></p><ul><li><p><em>O1: Improve performance without overspend.</em></p><ul><li><p><em>KR1: Reduce P95 runtime for Silver tier by 30%</em></p></li><li><p><em>KR2: Cut spill events per 1k queries by 40%</em></p></li><li><p><em>KR3: Increase queries/$ by 20%</em></p></li></ul></li><li><p><em>O2: Raise org maturity.</em></p><ul><li><p><em>KR1: Certify 100 analysts in Foundations</em></p></li><li><p><em>KR2: Establish champions in 8 domains</em></p></li><li><p><em>KR3: Close 100% of BIAC anomalies monthly</em></p></li></ul></li></ul><p><em><strong>3.9 Exceptions &amp; Escalations</strong></em></p><ul><li><p><em>Exception requests (e.g., higher memory limits, Gold tier access) follow a ticketed, time-boxed approval (COE + Security + FinOps as needed).</em></p></li><li><p><em>Escalation ladder: Domain &#8594; COE &#8594; Platform/SRE on-call &#8594; COE Head.</em></p></li><li><p><em>Sunset: exceptions expire unless renewed with evidence (utilization + value delivered).</em></p></li></ul><p><em><strong>3.10 30-60-90 Day COE Launch Plan</strong></em></p><p><em><strong>Days 0&#8211;30 (Crawl)</strong></em></p><ul><li><p><em>Form COE core; publish charter &amp; intake form</em></p></li><li><p><em>Baseline guardrails + workload 
tiers</em></p></li><li><p><em>Stand up dashboards &amp; core alerts</em></p></li><li><p><em>Run &#8220;Foundations&#8221; training; open office hours</em></p></li></ul><p><em><strong>Days 31&#8211;60 (Walk)</strong></em></p><ul><li><p><em>First Performance Clinic; remediate top 10 queries</em></p></li><li><p><em>Pilot showback with two domains</em></p></li><li><p><em>CAB process live; document connector upgrade path</em></p></li><li><p><em>Champions network established in 3&#8211;5 domains</em></p></li></ul><p><em><strong>Days 61&#8211;90 (Run)</strong></em></p><ul><li><p><em>SLA definitions per tier; publish SLO dashboards</em></p></li><li><p><em>Formal exceptions workflow; monthly BIAC review</em></p></li><li><p><em>Quarterly roadmap &amp; OKRs approved; QBR cadence live</em></p></li></ul><p><em><strong>3.11 How Teams Interlock Day-to-Day (Dependency Map)</strong></em></p><ul><li><p><em>COE &#8596; Platform: COE defines guardrails; Platform implements and operates them.</em></p></li><li><p><em>COE &#8596; SRE: COE sets SLOs and KPIs; SRE measures and drives incident/runbook improvements.</em></p></li><li><p><em>COE &#8596; Domains: COE enables (templates, clinics); Domains own delivery and first-line triage.</em></p></li><li><p><em>COE &#8596; Security/Gov: COE supplies BIAC evidence; Security signs off policies and exceptions.</em></p></li><li><p><em>COE &#8596; FinOps: COE enforces tagging/showback; FinOps tracks spend and flags anomalies.</em></p></li></ul><blockquote><p><em>DC&amp;D note: Clear ownership beats heroics. 
If two teams &#8220;sort of&#8221; own it&#8212;nobody owns it.</em></p></blockquote><p><em><strong>Part 4&#8202;&#8212;&#8202;Activity Resource Tracking (ART) in Starburst: A Unified Dashboard for Performance, Governance, and Adoption</strong></em></p><p><em>Starburst Enterprise is powerful for federated analytics, but running it at scale demands visibility: <strong>Who is using what, how much, and how efficiently?</strong></em></p><p><em>That&#8217;s where an <strong>Activity Resource Tracking (ART) Dashboard</strong> comes in. Think of ART as your &#8220;single pane of glass&#8221;&#8202;&#8212;&#8202;it shows query activity, resource consumption, and governance signals in one place. Platform teams, SREs, COEs, and business stakeholders all get what they need without context-switching across half a dozen tools.</em></p><p><em><strong>4.0 Why ART Dashboard Matters</strong></em></p><p><em>For Platform/SRE Teams: catch bottlenecks, track node health, and respond before outages.</em></p><p><em>For COE/Governance: monitor adoption, workload fairness, and compliance (who accessed what).</em></p><p><em>For Leadership/Product: quantify value&#8202;&#8212;&#8202;queries per dollar, adoption growth, SLA trends.</em></p><p><em>For Analysts/Engineers: self-service view of query performance history and tuning insights.</em></p><blockquote><p><em>ART isn&#8217;t just an ops tool. It&#8217;s a governance enabler and adoption amplifier.</em></p></blockquote><p><em><strong>4.1 What to Track in ART</strong></em></p><p><em>Here&#8217;s a structured view of the most critical dimensions:</em></p><p><em><strong>4.1.1. User &amp; Group Activity</strong></em></p><ul><li><p><em>Queries per user/group (counts, runtimes, frequency).</em></p></li><li><p><em>Top N consumers (by memory, CPU, or query count).</em></p></li><li><p><em>Active vs inactive users (adoption trendlines).</em></p></li><li><p><em>Peak usage hours (heatmaps).</em></p></li></ul><p><em><strong>4.1.2. 
Query Resource Consumption</strong></em></p><ul><li><p><em>Memory per query/per group.</em></p></li><li><p><em>CPU usage&#8202;&#8212;&#8202;query-level and aggregate.</em></p></li><li><p><em>Spill-to-disk volume per query.</em></p></li><li><p><em>Queue wait times.</em></p></li><li><p><em>Long-running vs short-running queries.</em></p></li></ul><p><em><strong>4.1.3. Workload Mix &amp; Patterns</strong></em></p><ul><li><p><em>Interactive queries vs batch pipelines.</em></p></li><li><p><em>BI dashboards vs data engineering jobs.</em></p></li><li><p><em>Success vs failure ratio.</em></p></li><li><p><em>Fault-tolerant execution retries.</em></p></li></ul><p><em><strong>4.1.4. Cluster-Wide Health Metrics</strong></em></p><ul><li><p><em>Worker utilization (CPU/memory).</em></p></li><li><p><em>Coordinator performance.</em></p></li><li><p><em>Autoscaling events (node add/remove).</em></p></li><li><p><em>Queue saturation trends.</em></p></li></ul><p><em><strong>4.1.5. Governance &amp; Compliance</strong></em></p><ul><li><p><em>Queries against sensitive catalogs/tables.</em></p></li><li><p><em>Access anomalies (unexpected patterns).</em></p></li><li><p><em>BIAC Audit Logs overlay &#8594; &#8220;who changed what permissions and when.&#8221;</em></p></li></ul><p><em><strong>4.2 ART in Starburst: Built-In Capabilities</strong></em></p><p><em>Starburst already provides much of the foundation:</em></p><p><em><strong>4.2.1. Starburst Insights &amp; Metrics Collection</strong></em></p><p><em><strong>Usage Metrics:</strong> query volume, CPU/memory usage, workload mix.</em></p><p><em><strong>Cluster History:</strong> query executions, errors, stats, execution plans, cluster events.</em></p><p><em><strong>4.2.2. 
Administrative Views</strong></em></p><p><em>Detailed system and Insights views to track:</em></p><ul><li><p><em>Query activity per user/group.</em></p></li><li><p><em>Background tasks (metadata refreshes, scheduled pipelines).</em></p></li><li><p><em>Queue/workload distribution.</em></p></li></ul><p><em><strong>4.2.3. Data Collection</strong></em></p><p><em>Telemetry includes:</em></p><ul><li><p><em>Resource usage (CPU, memory, spill-to-disk).</em></p></li><li><p><em>Session and query runtimes.</em></p></li><li><p><em>Background workloads (reports, ETL jobs, data products).</em></p></li></ul><p><em><strong>4.2.4. Audit &amp; Activity Logs</strong></em></p><p><em>The <strong>BIAC Audit Log</strong> captures:</em></p><ul><li><p><em>User logins and authentication.</em></p></li><li><p><em>Role or permission changes.</em></p></li><li><p><em>Catalog/schema/table access.</em></p></li><li><p><em>DDL operations.</em></p></li></ul><blockquote><p><em>These logs can integrate into enterprise SIEM or compliance tooling. 
In short: ART in Starburst combines Insights, system metrics, and BIAC into a holistic view of consumption, performance, and governance.</em></p></blockquote><p><em><strong>4.3 How to Build the ART Dashboard</strong></em></p><p><em><strong>4.3.1 Data Sources</strong></em></p><ul><li><p><em>Starburst Insights &#8594; Usage metrics + Cluster history.</em></p></li><li><p><em>BIAC Audit Logs &#8594; Compliance-grade event capture.</em></p></li><li><p><em>Prometheus/Grafana &#8594; Low-level node stats.</em></p></li><li><p><em>Custom Starburst SQL Reports &#8594; System tables &amp; Insights queries.</em></p></li></ul><p><em><strong>4.3.2 Visualization Options</strong></em></p><ul><li><p><em>Grafana dashboards &#8594; best for platform/SRE monitoring.</em></p></li><li><p><em>Power BI/Tableau dashboards &#8594; best for COE and business visibility.</em></p></li><li><p><em>Custom Starburst SQL reports &#8594; lightweight, embedded, shareable.</em></p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WdMd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a0f5c6f-1652-4890-96cb-ea39cfc5345b_1200x800.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WdMd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a0f5c6f-1652-4890-96cb-ea39cfc5345b_1200x800.png 424w, https://substackcdn.com/image/fetch/$s_!WdMd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a0f5c6f-1652-4890-96cb-ea39cfc5345b_1200x800.png 848w, 
https://substackcdn.com/image/fetch/$s_!WdMd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a0f5c6f-1652-4890-96cb-ea39cfc5345b_1200x800.png 1272w, https://substackcdn.com/image/fetch/$s_!WdMd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a0f5c6f-1652-4890-96cb-ea39cfc5345b_1200x800.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WdMd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a0f5c6f-1652-4890-96cb-ea39cfc5345b_1200x800.png" width="1200" height="800" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9a0f5c6f-1652-4890-96cb-ea39cfc5345b_1200x800.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:800,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WdMd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a0f5c6f-1652-4890-96cb-ea39cfc5345b_1200x800.png 424w, https://substackcdn.com/image/fetch/$s_!WdMd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a0f5c6f-1652-4890-96cb-ea39cfc5345b_1200x800.png 848w, 
https://substackcdn.com/image/fetch/$s_!WdMd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a0f5c6f-1652-4890-96cb-ea39cfc5345b_1200x800.png 1272w, https://substackcdn.com/image/fetch/$s_!WdMd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a0f5c6f-1652-4890-96cb-ea39cfc5345b_1200x800.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>4.4 Stakeholder Views (Dashboard Lenses)</strong></em></p><p><em>Different roles = different
lenses:</em></p><ul><li><p><em>Platform/SRE Teams &#8594; Node health, autoscaling, hot queries, failure hotspots.</em></p></li><li><p><em>COE/Governance &#8594; Adoption growth, compliance audits, workload fairness.</em></p></li><li><p><em>Leadership/Product &#8594; Cost-per-query, SLA adherence, business value by use case.</em></p></li><li><p><em>Data Engineers/Analysts &#8594; Self-service query history, runtime insights, tuning guides.</em></p></li></ul><p><em><strong>4.5 Example ART Metrics to Visualize</strong></em></p><ul><li><p><em>Top 10 users by memory consumption (last 7 days)</em></p></li><li><p><em>% of queries spilling to disk per workload queue</em></p></li><li><p><em>Cluster idle vs active hours (utilization efficiency)</em></p></li><li><p><em>Gold-tier queries breaching SLA this month</em></p></li><li><p><em>Quarterly adoption growth by team/domain</em></p></li></ul><p><em><strong>4.6 How the COE Uses ART</strong></em></p><p><em>The Center of Excellence (COE) turns the ART dashboard into action:</em></p><ul><li><p><em>Office Hours: Walkthroughs with analysts on failed queries or long runtimes.</em></p></li><li><p><em>Performance Clinics: Identify top 10 spill-heavy queries, remediate together.</em></p></li><li><p><em>Quarterly Adoption Reviews: Which teams are growing? 
Who&#8217;s plateauing?</em></p></li><li><p><em>Governance Evidence: Pull BIAC audit logs + usage metrics for compliance audits.</em></p></li><li><p><em>Training Loops: If a pattern keeps recurring (e.g., bad join order), convert it into a training module.</em></p></li></ul><p><em><strong>Part 5&#8202;&#8212;&#8202;Best-Practice Summary</strong></em></p><ul><li><p><em>Align resource controls with both technical and business needs.</em></p></li><li><p><em>Review queueing rules &amp; fault-tolerant execution regularly.</em></p></li><li><p><em>Use spill strategically&#8202;&#8212;&#8202;not as a crutch for poor queries.</em></p></li><li><p><em>Automate alerts on leading indicators (queue waits, spill growth).</em></p></li><li><p><em>Keep tight communication loops among Platform, SRE, COE, Enablement, and Governance.</em></p></li><li><p><em>Train stakeholders to self-diagnose common issues and read Query History/Insights.</em></p></li></ul><p><em><strong>Conclusion</strong></em></p><p><em>The ART Dashboard transforms Starburst observability into a strategic asset.</em></p><ul><li><p><em>For operators &#8594; it&#8217;s a health console.</em></p></li><li><p><em>For the COE &#8594; it&#8217;s an enablement and governance tool.</em></p></li><li><p><em>For leadership &#8594; it&#8217;s proof of adoption and ROI.</em></p></li><li><p><em>For users &#8594; it&#8217;s feedback for faster, cleaner queries.</em></p></li></ul><p><em>In DC&amp;D spirit: Don&#8217;t just monitor Starburst. Make it accountable, governed, and optimized&#8202;&#8212;&#8202;with ART as your single source of truth.</em></p><blockquote><p><em>Run Starburst Enterprise like a product: clear guardrails, strong observability, continuous optimization, and a COE that operationalizes good behavior at scale.
With the operating model above, your platform becomes reliable for operators, fast for users, and governed for leadership.</em></p></blockquote><p></p><p>This post was originally published on the <a href="https://medium.com/data-chai-dialogue-the-leadership-lounge/running-starburst-enterprise-the-complete-coe-platform-playbook-d35c3f189ac1">Medium blog</a>.</p>]]></content:encoded></item></channel></rss>