<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Lezwon Castelino]]></title><description><![CDATA[C()DE N C()FFEE]]></description><link>https://lezwon.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!w7mN!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F1a80030e-cecf-42db-9806-da1073f98d39_400x400.jpeg</url><title>Lezwon Castelino</title><link>https://lezwon.substack.com</link></image><generator>Substack</generator><lastBuildDate>Thu, 28 May 2026 20:43:47 GMT</lastBuildDate><atom:link href="https://lezwon.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Lezwon Castelino]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[lezwon@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[lezwon@substack.com]]></itunes:email><itunes:name><![CDATA[Lezwon Castelino]]></itunes:name></itunes:owner><itunes:author><![CDATA[Lezwon Castelino]]></itunes:author><googleplay:owner><![CDATA[lezwon@substack.com]]></googleplay:owner><googleplay:email><![CDATA[lezwon@substack.com]]></googleplay:email><googleplay:author><![CDATA[Lezwon Castelino]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Deconstructing Data Sketches]]></title><description><![CDATA[How 1,000 numbers can approximate millions of unique visitors in milliseconds]]></description><link>https://lezwon.substack.com/p/deconstructing-data-sketches</link><guid isPermaLink="false">https://lezwon.substack.com/p/deconstructing-data-sketches</guid><dc:creator><![CDATA[Lezwon Castelino]]></dc:creator><pubDate>Mon, 25 May 
2026 03:16:13 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/643067c3-3116-45cc-8088-34b61d25d869_1537x726.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Recently I came across the <a href="https://luminousmen.substack.com/p/the-power-of-data-sketches-a-comprehensive">luminousmen article on Data Sketches</a>. It explains how data sketches can replace expensive database count operations with quick approximations. The concept was intriguing but confusing at the same time, so I spent some time digging into why it works. This article explains what I found.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FccJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F622f2537-9aee-4f8f-a525-39f51d73e877_1537x726.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FccJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F622f2537-9aee-4f8f-a525-39f51d73e877_1537x726.png 424w, https://substackcdn.com/image/fetch/$s_!FccJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F622f2537-9aee-4f8f-a525-39f51d73e877_1537x726.png 848w, https://substackcdn.com/image/fetch/$s_!FccJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F622f2537-9aee-4f8f-a525-39f51d73e877_1537x726.png 1272w, https://substackcdn.com/image/fetch/$s_!FccJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F622f2537-9aee-4f8f-a525-39f51d73e877_1537x726.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!FccJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F622f2537-9aee-4f8f-a525-39f51d73e877_1537x726.png" width="1456" height="688" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/622f2537-9aee-4f8f-a525-39f51d73e877_1537x726.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:688,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:62114,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://lezwon.substack.com/i/198242138?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F622f2537-9aee-4f8f-a525-39f51d73e877_1537x726.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FccJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F622f2537-9aee-4f8f-a525-39f51d73e877_1537x726.png 424w, https://substackcdn.com/image/fetch/$s_!FccJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F622f2537-9aee-4f8f-a525-39f51d73e877_1537x726.png 848w, https://substackcdn.com/image/fetch/$s_!FccJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F622f2537-9aee-4f8f-a525-39f51d73e877_1537x726.png 1272w, https://substackcdn.com/image/fetch/$s_!FccJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F622f2537-9aee-4f8f-a525-39f51d73e877_1537x726.png 1456w" 
sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h3>So what exactly is a Data Sketch and how does it help?</h3><p>Just to recap the original article, imagine you are asked to find the number of unique users who visited a site today. You would ideally run a count-distinct query on the user column of the sessions table, and it would work out fine unless the number of events is in the order of millions or billions. If that&#8217;s the case, your query would have to visit every record in the table to get the exact count of unique users. This could either overwhelm the database or take a very long time, depending on the table size and the compute being used. 
But with a data sketch we could get an approximate count in under a second. How is that possible? An analogy helps here.</p><p>Imagine a large corporation like Walmart, which has millions of employees working each day. An inspector&#8217;s job is to find out how many employees checked in today. Note that an employee could check in multiple times for different shifts, and those duplicate entries must not be counted. The inspector could write down each employee in his notebook as they check in: whenever someone enters the office, he checks his list to see whether that person has already checked in for the day and makes an entry only if not. This works, but it is slow and tiresome. So he uses another method, one that gives an approximate count, since he doesn&#8217;t need the exact number. He assigns each employee a tag when they check in. The tag can be any number between 1 and 4 billion (we need a large number space to avoid tag collisions). The rules are:</p><ul><li><p>The number tag is unique and exclusive to that employee: the same employee always gets the same tag.</p></li><li><p>The number tag generator picks each number at random from a UNIFORM distribution. 
</p></li></ul><p>So when the employees start checking in, they are assigned different tags: 600, 11000, 565732, 573, 64968, 395345, and so on, from the space of 4,000,000,000.</p><p>The inspector keeps track of only the LOWEST K tags. Let&#8217;s say K is 1000. That is enough for him to get the approximate count of the employees who checked in for the day.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Yf8N!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F131d0f49-b612-46bb-8743-f2ba89c76298_2924x902.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Yf8N!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F131d0f49-b612-46bb-8743-f2ba89c76298_2924x902.png 424w, https://substackcdn.com/image/fetch/$s_!Yf8N!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F131d0f49-b612-46bb-8743-f2ba89c76298_2924x902.png 848w, https://substackcdn.com/image/fetch/$s_!Yf8N!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F131d0f49-b612-46bb-8743-f2ba89c76298_2924x902.png 1272w, https://substackcdn.com/image/fetch/$s_!Yf8N!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F131d0f49-b612-46bb-8743-f2ba89c76298_2924x902.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Yf8N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F131d0f49-b612-46bb-8743-f2ba89c76298_2924x902.png" width="1456" height="449" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/131d0f49-b612-46bb-8743-f2ba89c76298_2924x902.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:449,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:146955,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lezwon.substack.com/i/198242138?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F131d0f49-b612-46bb-8743-f2ba89c76298_2924x902.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Yf8N!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F131d0f49-b612-46bb-8743-f2ba89c76298_2924x902.png 424w, https://substackcdn.com/image/fetch/$s_!Yf8N!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F131d0f49-b612-46bb-8743-f2ba89c76298_2924x902.png 848w, https://substackcdn.com/image/fetch/$s_!Yf8N!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F131d0f49-b612-46bb-8743-f2ba89c76298_2924x902.png 1272w, https://substackcdn.com/image/fetch/$s_!Yf8N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F131d0f49-b612-46bb-8743-f2ba89c76298_2924x902.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p></p><h3>How does that happen?</h3><p>Here&#8217;s the logic: because the tags are drawn from a uniform distribution, we can assume that the gaps between neighbouring tags are roughly even. What does that mean? It means that the average gap between tags on a scale of 0 to 4 billion is around Total Space / Total Tags. It also means that the average gap among the lowest 1000 tags is about the same as the average gap among all the tags, and that gap is K&#8217;th tag / K, 
i.e V(K) / K.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!u81N!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f4ccd82-e1ef-4f4a-b86c-72af61166ca3_829x595.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!u81N!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f4ccd82-e1ef-4f4a-b86c-72af61166ca3_829x595.png 424w, https://substackcdn.com/image/fetch/$s_!u81N!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f4ccd82-e1ef-4f4a-b86c-72af61166ca3_829x595.png 848w, https://substackcdn.com/image/fetch/$s_!u81N!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f4ccd82-e1ef-4f4a-b86c-72af61166ca3_829x595.png 1272w, https://substackcdn.com/image/fetch/$s_!u81N!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f4ccd82-e1ef-4f4a-b86c-72af61166ca3_829x595.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!u81N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f4ccd82-e1ef-4f4a-b86c-72af61166ca3_829x595.png" width="829" height="595" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6f4ccd82-e1ef-4f4a-b86c-72af61166ca3_829x595.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:595,&quot;width&quot;:829,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:49937,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lezwon.substack.com/i/198242138?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f4ccd82-e1ef-4f4a-b86c-72af61166ca3_829x595.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!u81N!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f4ccd82-e1ef-4f4a-b86c-72af61166ca3_829x595.png 424w, https://substackcdn.com/image/fetch/$s_!u81N!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f4ccd82-e1ef-4f4a-b86c-72af61166ca3_829x595.png 848w, https://substackcdn.com/image/fetch/$s_!u81N!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f4ccd82-e1ef-4f4a-b86c-72af61166ca3_829x595.png 1272w, https://substackcdn.com/image/fetch/$s_!u81N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f4ccd82-e1ef-4f4a-b86c-72af61166ca3_829x595.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p></p><p>Does that make sense? If the 1000th lowest number tag is 2,779,252, then 2,779,252 / 1000 &#8776; 2779. That is the assumed average gap between tags. To get the total count, we just divide the total space by this gap, i.e. 4,000,000,000 / 2779 &#8776; 1,439,366. 
This is the approximate count of all the unique tags, which is the unique number of employees who checked in today.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Trot!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98df5f8e-06b5-405c-8a8f-538917fac814_1470x686.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Trot!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98df5f8e-06b5-405c-8a8f-538917fac814_1470x686.png 424w, https://substackcdn.com/image/fetch/$s_!Trot!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98df5f8e-06b5-405c-8a8f-538917fac814_1470x686.png 848w, https://substackcdn.com/image/fetch/$s_!Trot!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98df5f8e-06b5-405c-8a8f-538917fac814_1470x686.png 1272w, https://substackcdn.com/image/fetch/$s_!Trot!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98df5f8e-06b5-405c-8a8f-538917fac814_1470x686.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Trot!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98df5f8e-06b5-405c-8a8f-538917fac814_1470x686.png" width="1456" height="679" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/98df5f8e-06b5-405c-8a8f-538917fac814_1470x686.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:679,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:42377,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lezwon.substack.com/i/198242138?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98df5f8e-06b5-405c-8a8f-538917fac814_1470x686.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Trot!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98df5f8e-06b5-405c-8a8f-538917fac814_1470x686.png 424w, https://substackcdn.com/image/fetch/$s_!Trot!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98df5f8e-06b5-405c-8a8f-538917fac814_1470x686.png 848w, https://substackcdn.com/image/fetch/$s_!Trot!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98df5f8e-06b5-405c-8a8f-538917fac814_1470x686.png 1272w, https://substackcdn.com/image/fetch/$s_!Trot!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98df5f8e-06b5-405c-8a8f-538917fac814_1470x686.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p></p><p>It&#8217;s sort of like the diagram above. If the dots are arranged in an organized grid, we can work out an approximate population count immediately by multiplying the number of rows by the number of columns. As long as we know those two numbers, we can arrive at the total quickly, no matter how big the table gets. When the dots are left unorganized (left), we would have to count each one individually.</p><p>Hopefully you now have a good conceptual idea of what a data sketch is. We can verify it with an actual code snippet. Note that the snippet uses a tag space of 2^32 = 4,294,967,296 rather than a round 4 billion, so its estimate differs slightly from the arithmetic above.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import heapq as hq
import mmh3
import numpy as np

# &#9472;&#9472; Constants &#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;
RNG_SEED        = 25
MIN_EMPLOYEE_ID = 1
MAX_EMPLOYEE_ID = 4_000_000   # Walmart employee IDs in [1, 4M]
NUM_CHECKINS    = 2_000_000   # total check-ins today (duplicates expected)
K               = 1_000       # keep K lowest tags; error &#8776; 1/&#8730;K &#8776; 3.2%
TAG_SPACE       = 1 &lt;&lt; 32     # tags in [1, ~4 billion] &#8212; must be &gt;&gt; unique employees
                              # to avoid collisions (1M tag space &lt; 1.57M uniques = broken)

# &#9472;&#9472; Step 1: Simulate check-ins &#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;
# 2M check-ins from employees with IDs in [1, 4M] &#8212; duplicates expected
rng      = np.random.default_rng(RNG_SEED)
checkins = rng.integers(MIN_EMPLOYEE_ID, MAX_EMPLOYEE_ID + 1, size=NUM_CHECKINS)

# &#9472;&#9472; Step 2: Build KMV sketch &#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;
# Each employee gets a deterministic integer tag in [1, TAG_SPACE] via hash.
# Same employee &#8594; same tag always, so duplicate check-ins are naturally skipped.
# Inspector keeps only the K lowest tags &#8212; O(K) memory regardless of stream size.
heap     = []    # max-heap (negated) &#8212; heap[0] = -(current K-th lowest tag)
heap_set = set() # mirrors heap for O(1) lookup &#8212; size stays &#8804; K

for emp_id in checkins:
    tag = (mmh3.hash64(str(int(emp_id)), signed=False)[0] % TAG_SPACE) + 1

    if tag in heap_set:
        continue  # same tag already tracked (same employee checking in again)

    if len(heap) &lt; K:
        hq.heappush(heap, -tag)
        heap_set.add(tag)
    elif tag &lt; -heap[0]:          # smaller than current K-th tag &#8212; swap in
        evicted = -hq.heapreplace(heap, -tag)
        heap_set.discard(evicted)
        heap_set.add(tag)

# &#9472;&#9472; Step 3: Estimate unique headcount &#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;
# The K-th lowest tag sits at position K in the sorted uniform distribution.
# Average gap between tags = kth_tag / K
# Total unique employees  &#8776; TAG_SPACE / gap = TAG_SPACE * K / kth_tag

kth_tag  = -heap[0]
gap      = kth_tag / K
estimate = TAG_SPACE / gap          # = TAG_SPACE * K / kth_tag

# &#9472;&#9472; Step 4: Verify &#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;
exact      = np.unique(checkins).size
error_rate = abs(estimate - exact) / exact

print(f"Tag space         : 1 to {TAG_SPACE:,}") # 1 to 4,294,967,296
print(f"K-th tag (K={K}) : {kth_tag:,}") # 2,779,252
print(f"Avg gap           : {gap:,.0f}") # 2,779
print(f"Estimate          : {estimate:,.0f}") # 1,545,368
print(f"Exact             : {exact:,}") # 1,574,043
print(f"Error             : {error_rate:.2%}  (expected ~{1/K**0.5:.2%})") # 1.82%  (expected ~3.16%)
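
# -- Aside (illustrative sketch, not part of the original script) --------------
# Assumption: two KMV sketches built with the same hash function and the same K
# can be combined by keeping the K lowest tags of their union. That mergeability
# is the property that lets per-partition sketches be combined cheaply.
# merge_kmv and estimate_from_tags are hypothetical helper names.
def merge_kmv(tags_a, tags_b, k):
    """Return the k lowest distinct tags found in either sketch, ascending."""
    return sorted(set(tags_a) | set(tags_b))[:k]


def estimate_from_tags(tags, k, tag_space):
    """Estimate the unique count from the k lowest tags (same formula as Step 3)."""
    lowest = sorted(set(tags))[:k]
    if len(lowest) != k:  # fewer than k distinct tags seen, so the count is exact
        return float(len(lowest))
    return tag_space * k / lowest[-1]


# Toy usage with hand-picked tags: tag 40 appears in both sketches, kept once.
demo_merged = merge_kmv([12, 40, 95, 130], [7, 40, 88, 200], 4)
print(f"Merged sketch     : {demo_merged}")  # [7, 12, 40, 88]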
</code></pre></div><h3>It&#8217;s all about probability</h3><p>This can be confusing at first. How do we spread these numbers evenly and keep them unique? The algorithm works because of the hash function (MurmurHash3); that is the crux of the solution. The hash function must be well tested and vetted to produce a uniform distribution of numbers at scale, and MurmurHash3 does that job here. It places each hash independently of the others, with every position in the output space equally likely. We will dive deeper into that in another article, but in brief, every tag (hash) it generates has an equal chance of landing anywhere on the scale. By the law of large numbers, the more distinct values we hash, the closer the tags come to an even spread across the scale.</p><p>Using this method, we can quickly estimate the counts for a metric. It comes in handy when we need approximations rather than exact numbers, especially on dashboards or reports.</p><p>These functions are available in most data storage platforms. In Spark, you can use them to aggregate counts across a cluster without shuffling every row. In Postgres (with the postgresql-hll extension), you can use them to aggregate counts across days, weeks, months, and so on without rescanning the rows. So the next time you need quick aggregations, consider using Data Sketches instead of <code>count()</code>. It&#8217;ll save you a lot of compute cycles.</p>]]></content:encoded></item>
<item><title><![CDATA[A Deep Dive into Spark UDFs]]></title><description><![CDATA[The UDF Trap: Unmasking PySpark's Biggest Performance Bottleneck]]></description><link>https://lezwon.substack.com/p/a-deep-dive-into-spark-udfs</link><guid isPermaLink="false">https://lezwon.substack.com/p/a-deep-dive-into-spark-udfs</guid><dc:creator><![CDATA[Lezwon Castelino]]></dc:creator><pubDate>Mon, 11 May 2026 04:40:54 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!MTX_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61daaf88-9d18-429a-9562-71a6f527842b_1420x500.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>What is a Spark UDF?</h2><p>Anyone who has worked with PySpark has probably come across User Defined Functions (UDFs). UDFs let you run custom Python logic that might be difficult or impossible to express with native PySpark functions. Here&#8217;s a simple example:</p><p>This is a simple DataFrame with a username and a date of birth:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MTX_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61daaf88-9d18-429a-9562-71a6f527842b_1420x500.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MTX_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61daaf88-9d18-429a-9562-71a6f527842b_1420x500.png 424w, https://substackcdn.com/image/fetch/$s_!MTX_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61daaf88-9d18-429a-9562-71a6f527842b_1420x500.png 848w, https://substackcdn.com/image/fetch/$s_!MTX_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61daaf88-9d18-429a-9562-71a6f527842b_1420x500.png 1272w, https://substackcdn.com/image/fetch/$s_!MTX_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61daaf88-9d18-429a-9562-71a6f527842b_1420x500.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!MTX_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61daaf88-9d18-429a-9562-71a6f527842b_1420x500.png" width="1420" height="500" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/61daaf88-9d18-429a-9562-71a6f527842b_1420x500.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:500,&quot;width&quot;:1420,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:85506,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://lezwon.substack.com/i/190740319?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61daaf88-9d18-429a-9562-71a6f527842b_1420x500.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MTX_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61daaf88-9d18-429a-9562-71a6f527842b_1420x500.png 424w, https://substackcdn.com/image/fetch/$s_!MTX_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61daaf88-9d18-429a-9562-71a6f527842b_1420x500.png 848w, https://substackcdn.com/image/fetch/$s_!MTX_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61daaf88-9d18-429a-9562-71a6f527842b_1420x500.png 1272w, https://substackcdn.com/image/fetch/$s_!MTX_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61daaf88-9d18-429a-9562-71a6f527842b_1420x500.png 1456w" 
sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>We will create a Python UDF to infer each user&#8217;s current age from their date of birth.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">from pyspark.sql.functions import udf
from pyspark.sql.functions import col  # col() is used in the withColumn call below
import datetime as dt
from dateutil import relativedelta as rd
from pyspark.sql.types import IntegerType

@udf(IntegerType())
def infer_age_udf(date_of_birth):
    try:
        return rd.relativedelta(dt.datetime.now(), date_of_birth).years
    except Exception:
        return None


users_df_udf = users_df.withColumn("age", infer_age_udf(col("date_of_birth")))</code></pre></div><p>This is the end result:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!i5s1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafc160df-1217-47b0-9829-940042f245cb_1082x322.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!i5s1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafc160df-1217-47b0-9829-940042f245cb_1082x322.png 424w, https://substackcdn.com/image/fetch/$s_!i5s1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafc160df-1217-47b0-9829-940042f245cb_1082x322.png 848w, https://substackcdn.com/image/fetch/$s_!i5s1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafc160df-1217-47b0-9829-940042f245cb_1082x322.png 1272w, https://substackcdn.com/image/fetch/$s_!i5s1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafc160df-1217-47b0-9829-940042f245cb_1082x322.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!i5s1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafc160df-1217-47b0-9829-940042f245cb_1082x322.png" width="1082" height="322" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/afc160df-1217-47b0-9829-940042f245cb_1082x322.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:322,&quot;width&quot;:1082,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:58837,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lezwon.substack.com/i/190740319?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafc160df-1217-47b0-9829-940042f245cb_1082x322.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!i5s1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafc160df-1217-47b0-9829-940042f245cb_1082x322.png 424w, https://substackcdn.com/image/fetch/$s_!i5s1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafc160df-1217-47b0-9829-940042f245cb_1082x322.png 848w, https://substackcdn.com/image/fetch/$s_!i5s1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafc160df-1217-47b0-9829-940042f245cb_1082x322.png 1272w, https://substackcdn.com/image/fetch/$s_!i5s1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafc160df-1217-47b0-9829-940042f245cb_1082x322.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 
20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This is a common UDF implementation in PySpark. UDFs, however, come with a hidden cost. Most new developers assume that UDFs are just a quick way to implement Python logic. PySpark gives the impression that everything stays within the Python ecosystem, so a UDF should run the same way as any built-in function. That assumption is far from the truth.</p><p></p><h2>So what happens when you use a UDF in PySpark?</h2><p>Contrary to what we might expect, a PySpark application does not really run Python. Sure, it has Python syntax, but internally it is a Python wrapper communicating with a Scala application running on the JVM. When we call a built-in PySpark function, we are ultimately calling a highly optimized Scala function.</p><p>Scala is a compiled language that runs on the JVM. 
Because Spark has full control over this native code, its Catalyst Optimizer can inspect and optimize it extensively. This is a large part of what makes Spark so fast.</p><p><strong>The exception to this rule is UDFs.</strong> A custom Python UDF is not translated into Scala, because Spark has no Scala equivalent for your Python logic. To run it, each Spark executor JVM has to spin up separate Python worker processes on its node.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8gLN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ea8e449-01a6-46f2-89c2-bf7862689789_1062x674.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8gLN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ea8e449-01a6-46f2-89c2-bf7862689789_1062x674.png 424w, https://substackcdn.com/image/fetch/$s_!8gLN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ea8e449-01a6-46f2-89c2-bf7862689789_1062x674.png 848w, https://substackcdn.com/image/fetch/$s_!8gLN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ea8e449-01a6-46f2-89c2-bf7862689789_1062x674.png 1272w, https://substackcdn.com/image/fetch/$s_!8gLN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ea8e449-01a6-46f2-89c2-bf7862689789_1062x674.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!8gLN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ea8e449-01a6-46f2-89c2-bf7862689789_1062x674.png" width="1062" height="674" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5ea8e449-01a6-46f2-89c2-bf7862689789_1062x674.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:674,&quot;width&quot;:1062,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!8gLN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ea8e449-01a6-46f2-89c2-bf7862689789_1062x674.png 424w, https://substackcdn.com/image/fetch/$s_!8gLN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ea8e449-01a6-46f2-89c2-bf7862689789_1062x674.png 848w, https://substackcdn.com/image/fetch/$s_!8gLN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ea8e449-01a6-46f2-89c2-bf7862689789_1062x674.png 1272w, https://substackcdn.com/image/fetch/$s_!8gLN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ea8e449-01a6-46f2-89c2-bf7862689789_1062x674.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Spark Architecture with Python Interpreter. <a href="https://medium.com/@nguyenlecattuong93/compare-pyspark-built-in-function-pyspark-user-define-function-and-pandas-user-define-function-fab4529943b8">Credits</a></figcaption></figure></div><p>The major overhead here is <strong>data serialization</strong>. To pass data out of the JVM and into the Python worker, Spark has to serialize (pickle) the data. The Python worker executes your code, and then the results must be serialized again and sent back to the JVM.</p><p>Furthermore, because this execution happens inside a black-box Python process outside the JVM, Spark's Catalyst Optimizer cannot inspect or optimize your code. 
You are completely at the mercy of the Python worker.</p><p></p><h2>So what can we do about it?</h2><h4>Native Spark (The best approach)</h4><p>In most cases, UDFs can be avoided entirely by using Spark&#8217;s native functions. String manipulation, mathematical calculations, and map (dictionary) lookups can all be implemented in PySpark itself. The following code gives us the same results using only native PySpark functions, with roughly a 13x speedup in the benchmark below:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;d5f86a49-ccf7-4135-b385-ef78553e6afa&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">from pyspark.sql.functions import col, current_date, months_between, floor

users_df_native = users_df.withColumn(
    "age",
    floor(months_between(current_date(), col("date_of_birth")) / 12)
)</code></pre></div><p></p><h4>Pandas UDF (Vectorized operations)</h4><p>If you must use Python, you can speed up processing with Pandas UDFs. Pandas UDFs use Apache Arrow to move data between the JVM and Python in columnar batches, which avoids pickling each row individually, and they operate on whole pandas Series at a time, so the work is vectorized.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;7d614a45-7995-473f-9421-23d927e2ceeb&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import IntegerType
import pandas as pd
import datetime as dt

@pandas_udf(IntegerType())
def infer_age_pdf(dob: pd.Series) -&gt; pd.Series:
    today = dt.datetime.now()
    dob = pd.to_datetime(dob)
    
    # Age assuming this year's birthday has already happened
    age = today.year - dob.dt.year
    # True where this year's birthday has not happened yet
    birthday_pending = (today.month &lt; dob.dt.month) | ((today.month == dob.dt.month) &amp; (today.day &lt; dob.dt.day))

    # Subtract 1 where the birthday hasn't occurred yet
    return age - birthday_pending.astype(int)

users_df_pdf = users_df.withColumn("age", infer_age_pdf(col("date_of_birth")))</code></pre></div><p></p><h4>Scala UDF</h4><p>Another option is to write your UDF in Scala. These are extremely performant and fit natively into the Spark ecosystem. The JVM executes them directly, meaning there is zero Python serialization overhead.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;scala&quot;,&quot;nodeId&quot;:&quot;b14a8d49-3845-4ccd-b337-49787c7343c6&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-scala">%scala

import org.apache.spark.sql.api.java.UDF1
import java.time.LocalDate
import java.time.Period
import java.sql.Date

class InferAge extends UDF1[java.sql.Date, Integer] {
  def call(dob: java.sql.Date): Integer = {
    if (dob != null) {
      val birthDate = dob.toLocalDate
      val today = LocalDate.now()
      val age = Period.between(birthDate, today).getYears
      age
    } else null
  }
}
spark.udf.register("infer_age_scala", new InferAge(), org.apache.spark.sql.types.IntegerType)</code></pre></div><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;a9d8ac13-5cd1-4e12-a8a1-32ace8452461&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">from pyspark.sql.functions import expr

users_df_scala = users_df.withColumn("age", expr("infer_age_scala(date_of_birth)"))</code></pre></div><p></p><h2>Benchmarks</h2><p>Here are some benchmarks for this Dataframe of 1,000,000 rows. You can see the massive difference in performance among these methods. The native Scala options are extremely fast, followed by the vectorized pandas UDF implementation. The Python UDF is around 13x slower due to the Python serialization and execution overhead.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-InE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda382a52-d6a2-4450-9d67-3a980a550b7c_1190x590.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-InE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda382a52-d6a2-4450-9d67-3a980a550b7c_1190x590.png 424w, https://substackcdn.com/image/fetch/$s_!-InE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda382a52-d6a2-4450-9d67-3a980a550b7c_1190x590.png 848w, https://substackcdn.com/image/fetch/$s_!-InE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda382a52-d6a2-4450-9d67-3a980a550b7c_1190x590.png 1272w, https://substackcdn.com/image/fetch/$s_!-InE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda382a52-d6a2-4450-9d67-3a980a550b7c_1190x590.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!-InE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda382a52-d6a2-4450-9d67-3a980a550b7c_1190x590.png" width="1190" height="590" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/da382a52-d6a2-4450-9d67-3a980a550b7c_1190x590.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:590,&quot;width&quot;:1190,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:27176,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lezwon.substack.com/i/190740319?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda382a52-d6a2-4450-9d67-3a980a550b7c_1190x590.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-InE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda382a52-d6a2-4450-9d67-3a980a550b7c_1190x590.png 424w, https://substackcdn.com/image/fetch/$s_!-InE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda382a52-d6a2-4450-9d67-3a980a550b7c_1190x590.png 848w, https://substackcdn.com/image/fetch/$s_!-InE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda382a52-d6a2-4450-9d67-3a980a550b7c_1190x590.png 1272w, https://substackcdn.com/image/fetch/$s_!-InE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda382a52-d6a2-4450-9d67-3a980a550b7c_1190x590.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>These alternatives to Python UDFs will work most of the time. There are a few exceptions, such as Python machine learning libraries that don&#8217;t have Scala/Java equivalents. Those are acceptable cases where we need Python UDFs to get the job done.</p><p>When writing code, keep these things in mind as you work with your data. 
These small optimizations make a huge difference during heavy compute operations.</p><div><hr></div><p>If you&#8217;d like a tool that highlights such issues in your code, you can use <a href="https://catalystops.dev/">CatalystOps</a> within VS Code.</p><p></p><p></p>]]></content:encoded></item>
<item><title><![CDATA[Guide to Database Sharding and Partitioning Strategies]]></title><description><![CDATA[Architectural patterns for distributed data storage.]]></description><link>https://lezwon.substack.com/p/guide-to-database-sharding-and-partitioning</link><guid isPermaLink="false">https://lezwon.substack.com/p/guide-to-database-sharding-and-partitioning</guid><dc:creator><![CDATA[Lezwon Castelino]]></dc:creator><pubDate>Sun, 01 Feb 2026 10:47:47 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!hnbJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f32023a-70fb-4e39-9894-a92a6f128b84_1024x584.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I was recently reading the chapter on partitioning in the book <strong><a 
href="https://www.oreilly.com/library/view/designing-data-intensive-applications/9781491903063/">Designing Data-Intensive Applications</a></strong>. If you work on distributed systems that need to scale, I would definitely recommend reading this book.</p><p>This article journals and summarizes what I learned from the chapter on partitioning.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hnbJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f32023a-70fb-4e39-9894-a92a6f128b84_1024x584.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hnbJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f32023a-70fb-4e39-9894-a92a6f128b84_1024x584.jpeg 424w, https://substackcdn.com/image/fetch/$s_!hnbJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f32023a-70fb-4e39-9894-a92a6f128b84_1024x584.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!hnbJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f32023a-70fb-4e39-9894-a92a6f128b84_1024x584.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!hnbJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f32023a-70fb-4e39-9894-a92a6f128b84_1024x584.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hnbJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f32023a-70fb-4e39-9894-a92a6f128b84_1024x584.jpeg" width="1024" height="584" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2f32023a-70fb-4e39-9894-a92a6f128b84_1024x584.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:584,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:44422,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://lezwon.substack.com/i/182974082?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f32023a-70fb-4e39-9894-a92a6f128b84_1024x584.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hnbJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f32023a-70fb-4e39-9894-a92a6f128b84_1024x584.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!hnbJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f32023a-70fb-4e39-9894-a92a6f128b84_1024x584.jpeg 848w, https://substackcdn.com/image/fetch/$s_!hnbJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f32023a-70fb-4e39-9894-a92a6f128b84_1024x584.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!hnbJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f32023a-70fb-4e39-9894-a92a6f128b84_1024x584.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h3>What Are Sharding and Partitioning?</h3><p>Sharding is the process of dividing a database into smaller pieces and distributing them across multiple machines. Partitioning is how we divide that data into smaller pieces to improve query efficiency. Although the terms are often used interchangeably, sharding usually refers to horizontal partitioning (splitting by rows rather than columns) across different machines.</p><p>But why do we do that? Without sharding, the size of your database is limited by the storage of a single machine. When the database reaches that physical limit (e.g., 1 TB or 2 TB), you have a problem. So instead of storing all the data on a single machine, we can distribute it across, say, 10 machines, each handling its share independently.</p><p>Sharding also allows a larger number of users to access the database simultaneously. Instead of being limited by the read/write speed of one hard drive, we are limited by the combined read/write speed of multiple hard drives.</p><p>It also spreads the risk of losing data across the drives. If all the data sat on one drive, we would have to rely entirely on the health of that drive. If it failed, we would lose all the data. With a sharded database, if a drive fails, we lose only the data in that one shard.</p><p>So how do we go about distributing this data? There are various partitioning techniques. Let&#8217;s have a look at them.</p><h3>Partitioning Strategies</h3><h4>Random / Equal Partitioning</h4><p>Since we want to distribute data across the nodes, we could split the data equally and push each part to a different node. 
While this will solve our distribution problem, we will have a tough time finding a record during queries, as we would have no idea which partition it is present in.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ugDu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1450bd52-4428-4657-90be-4c1d28606e54_3179x985.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ugDu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1450bd52-4428-4657-90be-4c1d28606e54_3179x985.png 424w, https://substackcdn.com/image/fetch/$s_!ugDu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1450bd52-4428-4657-90be-4c1d28606e54_3179x985.png 848w, https://substackcdn.com/image/fetch/$s_!ugDu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1450bd52-4428-4657-90be-4c1d28606e54_3179x985.png 1272w, https://substackcdn.com/image/fetch/$s_!ugDu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1450bd52-4428-4657-90be-4c1d28606e54_3179x985.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ugDu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1450bd52-4428-4657-90be-4c1d28606e54_3179x985.png" width="1456" height="451" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1450bd52-4428-4657-90be-4c1d28606e54_3179x985.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:451,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:85098,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lezwon.substack.com/i/182974082?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1450bd52-4428-4657-90be-4c1d28606e54_3179x985.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ugDu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1450bd52-4428-4657-90be-4c1d28606e54_3179x985.png 424w, https://substackcdn.com/image/fetch/$s_!ugDu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1450bd52-4428-4657-90be-4c1d28606e54_3179x985.png 848w, https://substackcdn.com/image/fetch/$s_!ugDu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1450bd52-4428-4657-90be-4c1d28606e54_3179x985.png 1272w, https://substackcdn.com/image/fetch/$s_!ugDu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1450bd52-4428-4657-90be-4c1d28606e54_3179x985.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h4>Round Robin Partitioning</h4><p>In round-robin partitioning, we keep the shards evenly balanced by storing each incoming record in the next partition in the cycle. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AsDo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54867a62-01bb-4d8b-958e-6493a337150a_1834x1152.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AsDo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54867a62-01bb-4d8b-958e-6493a337150a_1834x1152.png 424w, https://substackcdn.com/image/fetch/$s_!AsDo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54867a62-01bb-4d8b-958e-6493a337150a_1834x1152.png 848w, https://substackcdn.com/image/fetch/$s_!AsDo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54867a62-01bb-4d8b-958e-6493a337150a_1834x1152.png 1272w, https://substackcdn.com/image/fetch/$s_!AsDo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54867a62-01bb-4d8b-958e-6493a337150a_1834x1152.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AsDo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54867a62-01bb-4d8b-958e-6493a337150a_1834x1152.png" width="1456" height="915" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/54867a62-01bb-4d8b-958e-6493a337150a_1834x1152.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:915,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:88401,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lezwon.substack.com/i/182974082?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54867a62-01bb-4d8b-958e-6493a337150a_1834x1152.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AsDo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54867a62-01bb-4d8b-958e-6493a337150a_1834x1152.png 424w, https://substackcdn.com/image/fetch/$s_!AsDo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54867a62-01bb-4d8b-958e-6493a337150a_1834x1152.png 848w, https://substackcdn.com/image/fetch/$s_!AsDo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54867a62-01bb-4d8b-958e-6493a337150a_1834x1152.png 1272w, https://substackcdn.com/image/fetch/$s_!AsDo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54867a62-01bb-4d8b-958e-6493a337150a_1834x1152.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>For example, if we have 3 nodes/shards and 4 records, then:</p><ol><li><p>Record 1 goes to partition 1</p></li><li><p>Record 2 goes to partition 2</p></li><li><p>Record 3 goes to partition 3</p></li><li><p>Record 4 goes to partition 1</p></li></ol><p>and so on. This way, all the shards hold an equal amount of data. The issue with this approach is that, at query time, the database has no idea which shard the record it seeks lives on. It has to make a scatter-gather query, where every node receives the request and searches for the record.</p><h4>List Partitioning</h4><p>A better approach is to partition based on a particular key. For example, say we have a list of people with their names and countries of origin. We could partition the list by country of origin. 
All people from India could be in one partition, all people from Korea in another, and so on. This approach lets us easily find the node that holds a particular person by looking up their country of origin. The issue with this approach, though, is that it&#8217;s prone to skew and hotspots. If 80% of the people in the data are from India, they will all end up on one node, and the remaining 20% will be spread across the other nodes. So the majority of the system load might be taken up by just one node, which is not desirable. Also, for this approach to succeed, the partitioning attribute should have high cardinality. For example, if there are people from only 4 countries and there are 10 nodes, the data will be stored on only 4 nodes. The other 6 nodes will not have any records assigned to them.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!n_2o!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbb9cee9-fc70-4c3a-8874-6cc31bceebd7_1833x1011.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!n_2o!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbb9cee9-fc70-4c3a-8874-6cc31bceebd7_1833x1011.png 424w, https://substackcdn.com/image/fetch/$s_!n_2o!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbb9cee9-fc70-4c3a-8874-6cc31bceebd7_1833x1011.png 848w, https://substackcdn.com/image/fetch/$s_!n_2o!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbb9cee9-fc70-4c3a-8874-6cc31bceebd7_1833x1011.png 1272w, 
https://substackcdn.com/image/fetch/$s_!n_2o!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbb9cee9-fc70-4c3a-8874-6cc31bceebd7_1833x1011.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!n_2o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbb9cee9-fc70-4c3a-8874-6cc31bceebd7_1833x1011.png" width="1456" height="803" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bbb9cee9-fc70-4c3a-8874-6cc31bceebd7_1833x1011.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:803,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:64066,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lezwon.substack.com/i/182974082?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbb9cee9-fc70-4c3a-8874-6cc31bceebd7_1833x1011.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!n_2o!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbb9cee9-fc70-4c3a-8874-6cc31bceebd7_1833x1011.png 424w, https://substackcdn.com/image/fetch/$s_!n_2o!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbb9cee9-fc70-4c3a-8874-6cc31bceebd7_1833x1011.png 848w, 
https://substackcdn.com/image/fetch/$s_!n_2o!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbb9cee9-fc70-4c3a-8874-6cc31bceebd7_1833x1011.png 1272w, https://substackcdn.com/image/fetch/$s_!n_2o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbb9cee9-fc70-4c3a-8874-6cc31bceebd7_1833x1011.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h4>Hash Partitioning</h4><p>In the above example, we partitioned the database by country. 
In such cases, there might be a large number of partitions but only a few nodes, e.g., 195 partitions (one per country) and 3 nodes. We could store multiple partitions on one node, but how do we find the data when a query arrives? The solution is a hash-based partitioning scheme. In hash-based partitioning, we pass the key through a hash function, which maps it to the number of the node that the record should go to. For example, if we partition by country and have 3 nodes, then India might be assigned to node 1, Korea to node 2, the UK to node 3, Denmark to node 2, Germany to node 1, and so on. When there is a query, we use the same hash function to find which node the data lives on.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!typh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2441c63-7bfd-4bd6-8d2b-520403fed57f_1223x545.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!typh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2441c63-7bfd-4bd6-8d2b-520403fed57f_1223x545.png 424w, https://substackcdn.com/image/fetch/$s_!typh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2441c63-7bfd-4bd6-8d2b-520403fed57f_1223x545.png 848w, https://substackcdn.com/image/fetch/$s_!typh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2441c63-7bfd-4bd6-8d2b-520403fed57f_1223x545.png 1272w, 
https://substackcdn.com/image/fetch/$s_!typh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2441c63-7bfd-4bd6-8d2b-520403fed57f_1223x545.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!typh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2441c63-7bfd-4bd6-8d2b-520403fed57f_1223x545.png" width="1223" height="545" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f2441c63-7bfd-4bd6-8d2b-520403fed57f_1223x545.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:545,&quot;width&quot;:1223,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:31498,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lezwon.substack.com/i/182974082?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2441c63-7bfd-4bd6-8d2b-520403fed57f_1223x545.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!typh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2441c63-7bfd-4bd6-8d2b-520403fed57f_1223x545.png 424w, https://substackcdn.com/image/fetch/$s_!typh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2441c63-7bfd-4bd6-8d2b-520403fed57f_1223x545.png 848w, https://substackcdn.com/image/fetch/$s_!typh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2441c63-7bfd-4bd6-8d2b-520403fed57f_1223x545.png 
1272w, https://substackcdn.com/image/fetch/$s_!typh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2441c63-7bfd-4bd6-8d2b-520403fed57f_1223x545.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!R3sK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F898b7838-8239-43ce-95a1-2980244ddb3d_1833x1005.png" 
data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!R3sK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F898b7838-8239-43ce-95a1-2980244ddb3d_1833x1005.png 424w, https://substackcdn.com/image/fetch/$s_!R3sK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F898b7838-8239-43ce-95a1-2980244ddb3d_1833x1005.png 848w, https://substackcdn.com/image/fetch/$s_!R3sK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F898b7838-8239-43ce-95a1-2980244ddb3d_1833x1005.png 1272w, https://substackcdn.com/image/fetch/$s_!R3sK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F898b7838-8239-43ce-95a1-2980244ddb3d_1833x1005.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!R3sK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F898b7838-8239-43ce-95a1-2980244ddb3d_1833x1005.png" width="1456" height="798" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/898b7838-8239-43ce-95a1-2980244ddb3d_1833x1005.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:798,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:77061,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lezwon.substack.com/i/182974082?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F898b7838-8239-43ce-95a1-2980244ddb3d_1833x1005.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!R3sK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F898b7838-8239-43ce-95a1-2980244ddb3d_1833x1005.png 424w, https://substackcdn.com/image/fetch/$s_!R3sK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F898b7838-8239-43ce-95a1-2980244ddb3d_1833x1005.png 848w, https://substackcdn.com/image/fetch/$s_!R3sK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F898b7838-8239-43ce-95a1-2980244ddb3d_1833x1005.png 1272w, https://substackcdn.com/image/fetch/$s_!R3sK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F898b7838-8239-43ce-95a1-2980244ddb3d_1833x1005.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>With a decent hash function, we can distribute the records fairly uniformly across the nodes. The main drawback of hash partitioning is that it destroys ordering. 
Because records are mapped to nodes in a pseudo-random order, a query such as fetching a user&#8217;s last 100 records forces the database engine to fetch the last 100 records from each node, merge and sort them on the main node, and then pick the latest 100, which is very inefficient.</p><h4>Range Partitioning</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oEOW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6d92e1a-d7ed-482a-b4e0-732906dd77f0_1833x1251.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oEOW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6d92e1a-d7ed-482a-b4e0-732906dd77f0_1833x1251.png 424w, https://substackcdn.com/image/fetch/$s_!oEOW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6d92e1a-d7ed-482a-b4e0-732906dd77f0_1833x1251.png 848w, https://substackcdn.com/image/fetch/$s_!oEOW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6d92e1a-d7ed-482a-b4e0-732906dd77f0_1833x1251.png 1272w, https://substackcdn.com/image/fetch/$s_!oEOW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6d92e1a-d7ed-482a-b4e0-732906dd77f0_1833x1251.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oEOW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6d92e1a-d7ed-482a-b4e0-732906dd77f0_1833x1251.png" width="1456" height="994" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a6d92e1a-d7ed-482a-b4e0-732906dd77f0_1833x1251.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:994,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:107910,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lezwon.substack.com/i/182974082?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6d92e1a-d7ed-482a-b4e0-732906dd77f0_1833x1251.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oEOW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6d92e1a-d7ed-482a-b4e0-732906dd77f0_1833x1251.png 424w, https://substackcdn.com/image/fetch/$s_!oEOW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6d92e1a-d7ed-482a-b4e0-732906dd77f0_1833x1251.png 848w, https://substackcdn.com/image/fetch/$s_!oEOW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6d92e1a-d7ed-482a-b4e0-732906dd77f0_1833x1251.png 1272w, https://substackcdn.com/image/fetch/$s_!oEOW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6d92e1a-d7ed-482a-b4e0-732906dd77f0_1833x1251.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In range-based partitioning, the data is divided by key range. For example, if we have a list of users and we range-partition by country name, then all the country names starting with A-C might be put in node 1, D-F in another, and G-K in another. This way, we know which node holds which range of the key, and we can read from and write to these nodes accordingly. An advantage of this method is that if one key range dominates the dataset, it can be split into smaller ranges and part of it pushed to another node. For example, if the names starting with I form a large set, you can split the range so that I-Iq stays on one node and Ir-U moves to another. We could also use this to repartition a node in case its data grows too large.</p><p>The one issue with range partitioning, however, might arise during writes. 
For example, with user activity events, if we partition the data by timestamp (e.g., one day per partition), all the writes for the current day will flow to one node, overloading it, while the others remain idle. Hence, it is very important to choose the right partition key here.</p><p></p><h4>Composite Key Partitioning</h4><p>A composite key comprises two parts: a partition key and a sort key. The partition key functions at the inter-node level (routing data to the right server) and the sort key at the intra-node level (organizing data on the disk). Composite key partitioning helps us solve the issues faced in hash and range partitioning by grouping and sorting related data together on one node. For example, for user activity events flowing in, data may first be partitioned by user and routed to the nodes. Within those nodes, the data for each user will be grouped and sorted by timestamp.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jMVO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feef7ede7-6d2d-4ed2-bbd4-1ae951117b1d_1938x1142.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jMVO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feef7ede7-6d2d-4ed2-bbd4-1ae951117b1d_1938x1142.png 424w, https://substackcdn.com/image/fetch/$s_!jMVO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feef7ede7-6d2d-4ed2-bbd4-1ae951117b1d_1938x1142.png 848w, 
https://substackcdn.com/image/fetch/$s_!jMVO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feef7ede7-6d2d-4ed2-bbd4-1ae951117b1d_1938x1142.png 1272w, https://substackcdn.com/image/fetch/$s_!jMVO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feef7ede7-6d2d-4ed2-bbd4-1ae951117b1d_1938x1142.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jMVO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feef7ede7-6d2d-4ed2-bbd4-1ae951117b1d_1938x1142.png" width="1456" height="858" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eef7ede7-6d2d-4ed2-bbd4-1ae951117b1d_1938x1142.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:858,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:176092,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://lezwon.substack.com/i/182974082?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feef7ede7-6d2d-4ed2-bbd4-1ae951117b1d_1938x1142.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jMVO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feef7ede7-6d2d-4ed2-bbd4-1ae951117b1d_1938x1142.png 424w, 
https://substackcdn.com/image/fetch/$s_!jMVO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feef7ede7-6d2d-4ed2-bbd4-1ae951117b1d_1938x1142.png 848w, https://substackcdn.com/image/fetch/$s_!jMVO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feef7ede7-6d2d-4ed2-bbd4-1ae951117b1d_1938x1142.png 1272w, https://substackcdn.com/image/fetch/$s_!jMVO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feef7ede7-6d2d-4ed2-bbd4-1ae951117b1d_1938x1142.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>For example, say we have user data that contains name, country, and timestamp. We can use the user name as the partition key, which decides the node each record lands on. Within each partition, the records are grouped by user and sorted by the timestamp field, i.e., Carol&#8217;s records will be stored together and sorted by timestamp. For a range query such as &#8220;All records for Carol on 31st December 2025 between 11:00 AM and 12:00 PM&#8221;, the database will find the node by the user name and then read her records between 11:00 AM and 12:00 PM in one contiguous scan, as they are already sorted and stored together.</p><h3>What to measure while partitioning</h3><p>While choosing partitioning and indexing schemes, it is important to take a step back and ask what exactly we are trying to achieve. We need to look at the data, the frequency and type of read and write queries, where the hotspots are, and so on. Once we know this, we can choose a partitioning scheme.</p><p>Some factors to keep in mind are:</p><ul><li><p>For a range query, the best performance is achieved when the data to be fetched sits on one node, in one partition, stored contiguously.</p></li><li><p>For multiple read queries, it&#8217;s best if they are distributed across multiple nodes and each node can serve the requested data without requiring data from other nodes.</p></li><li><p>For write queries, it&#8217;s best to have write requests spread across different nodes so that the load is distributed.</p></li></ul><p>Of course, we cannot have a perfect setup for this, and there will always be a compromise, but it&#8217;s best to keep this in mind when designing a system. 
The type of queries will give you a good idea of how to design the scheme.</p>]]></content:encoded></item></channel></rss>