WHAT ACTUALLY DRIVES SALES INSIDE A BLINKIT DARK STORE? I ANALYZED 8,500+ RECORDS TO FIND OUT.


Quick commerce is a ₹5 billion industry built on a simple promise: your groceries in 10 minutes. But inside those dark stores, what actually determines whether one outlet earns ₹500 per product and another earns ₹2,400?

I spent weeks analyzing 8,523 Blinkit sales records using Python to find out. Some answers were expected. Others genuinely surprised me.

THE PROJECT - WHAT I WAS TRYING TO ANSWER

For my Analytics project, I picked a dataset that felt real: Blinkit's product-outlet sales records.

It had 8,523 rows and 12 variables: product type, weight, MRP, fat content, outlet size, location tier, outlet format, and more. 

The target variable was Item Outlet Sales, the revenue each product generated at each outlet.

My goal was simple but important: figure out which factors actually drive sales performance, and then build a regression model to predict it.

Quick numbers:

- 8,523 sales records analyzed

- 12 variables in the dataset

- ₹1,499 average item outlet sales

- 16 product categories

STEP 1 - CLEANING THE DATA (THE MESSY PART NOBODY TALKS ABOUT)

Before any analysis, the data needed serious work. I found two variables with significant missing entries:

- Item Weight was missing in 1,448 records (17%)

- Outlet Size was missing in 2,386 records (28%)

For Item Weight, the distribution was nearly symmetric (skewness = 0.04), so I imputed using the column mean of 12.77 kg.

For Outlet Size, a categorical variable, I used the mode, which was "Medium."
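
The two imputation steps above can be sketched in pandas roughly as follows. This is a minimal illustration on a toy frame, not the project's actual code: the column names `Item_Weight` and `Outlet_Size` and all the values are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data. Column names and values are assumptions.
df = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, 12.0, np.nan],
    "Outlet_Size": ["Medium", None, "Small", "Medium", None],
})

# Check skewness first: mean imputation is only reasonable when the
# distribution is roughly symmetric (skewness close to 0).
weight_skew = df["Item_Weight"].skew()

# Numeric column: fill the gaps with the column mean.
df["Item_Weight"] = df["Item_Weight"].fillna(df["Item_Weight"].mean())

# Categorical column: fill the gaps with the most frequent label (the mode).
df["Outlet_Size"] = df["Outlet_Size"].fillna(df["Outlet_Size"].mode()[0])

print(df)
```

For a clearly skewed numeric column, the median would usually be the safer fill value than the mean.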


There was also an inconsistency in the fat content labels. 

The same "Low Fat" category had been entered as "LF", "low fat", and "Low Fat" across different rows. I standardised everything to two clean labels using a Python dictionary replacement. Small fix, but without it, the group-level analysis would have been completely misleading.
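
A dictionary replacement of this kind might look like the sketch below. The column name `Item_Fat_Content` and the second label "Regular" are assumptions for illustration; only the three "Low Fat" spellings come from the data described above.

```python
import pandas as pd

# Toy column with the three spellings mentioned above. "Regular" is a
# hypothetical stand-in for the second category.
fat = pd.Series(["LF", "low fat", "Low Fat", "Regular"], name="Item_Fat_Content")

# Map each variant spelling to its clean label. Values that are not keys
# in the dictionary pass through unchanged.
fat_clean = fat.replace({"LF": "Low Fat", "low fat": "Low Fat"})

print(fat_clean.value_counts())
```

Running `value_counts()` before and after the replacement is a quick way to confirm that no stray spellings remain.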


I also engineered two new features:

- Outlet Age (2026 minus the establishment year) to capture store maturity

- A Price Tier variable that bucketed Item MRP into Low, Medium, and High bands, making the pricing patterns much easier to model
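
Both engineered features can be sketched in a few lines of pandas. The column names and, in particular, the band edges passed to `pd.cut` are assumptions for the example; the cut-offs actually used for the Price Tier are not stated here.

```python
import pandas as pd

# Toy rows. Column names are assumed.
df = pd.DataFrame({
    "Outlet_Establishment_Year": [1985, 1999, 2009],
    "Item_MRP": [45.0, 120.0, 240.0],
})

REFERENCE_YEAR = 2026  # the reference year named above

# Outlet Age: years since the outlet was established.
df["Outlet_Age"] = REFERENCE_YEAR - df["Outlet_Establishment_Year"]

# Price Tier: bucket MRP into three ordered bands.
# The edges (70 and 140) are illustrative assumptions only.
df["Price_Tier"] = pd.cut(
    df["Item_MRP"],
    bins=[0, 70, 140, float("inf")],
    labels=["Low", "Medium", "High"],
)

print(df)
```

If equal-sized groups matter more than fixed rupee thresholds, `pd.qcut` with three quantiles is an alternative to fixed edges.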

FINDING 1 - PRICE IS THE STRONGEST NUMERICAL DRIVER, BUT NOT FOR THE REASON YOU'D THINK

The Pearson correlation between Item MRP and Item Outlet Sales was r = 0.403 (p < 0.001).

That makes price the single strongest numerical predictor in the dataset.
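
A correlation and p-value like the ones quoted above can be obtained with `scipy.stats.pearsonr`. The sketch below runs on synthetic data generated for the example (the slope and noise level are invented), so its output will not match the figures from the real dataset.

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic stand-in: a noisy linear link between price and sales.
# All numbers here are invented for illustration.
rng = np.random.default_rng(0)
mrp = rng.uniform(30, 270, size=500)
sales = 5.0 * mrp + rng.normal(0, 800, size=500)

# pearsonr returns the correlation coefficient and a two-sided p-value.
r, p_value = pearsonr(mrp, sales)

print(f"r = {r:.3f}, p = {p_value:.2e}")
```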

In the regression model, every ₹1 increase in MRP was associated with ₹5.24 more in sales revenue.

But here's the nuance: 

This doesn't mean higher prices cause more demand. It means that higher-priced products mechanically generate more revenue per transaction, because Revenue = Price × Quantity.

Blinkit's assortment appears to include a strong premium segment where consumers are relatively price-insensitive, especially in personal care and packaged goods.

Key takeaway: If you're managing inventory for a quick-commerce platform, stocking high-MRP items isn't just a margin play — it's a revenue floor strategy. A stockout in a ₹250 product costs far more than a stockout in a ₹50 one.

FINDING 2 - OUTLET FORMAT IS THE REAL REVENUE CEILING (THE BIGGEST ONE)

This was the most striking finding in the entire project. Average sales by outlet type:

- Supermarket Type 3 → ₹2,415 average sales

- Supermarket Type 2 → ₹1,830 average sales

- Supermarket Type 1 → ₹1,655 average sales

- Grocery Store      → ₹508 average sales

Supermarket Type 3 outlets earn about 4.75 times as much per product as a standard Grocery Store (₹2,415 vs ₹508).

That is not a small difference. It tells us that the structural format of an outlet essentially determines its revenue ceiling before a single product is even stocked.

Why? Larger supermarket formats support wider assortments, higher basket sizes per order, and better ability to carry high-MRP premium products. The Grocery Store format is limited by scale — both physical and assortment depth.
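
The per-format averages come from a simple group-by, and the multiple between the best and worst format is just the ratio of two means. A minimal sketch, assuming columns named `Outlet_Type` and `Item_Outlet_Sales` and using invented toy rows:

```python
import pandas as pd

# Toy rows. The sales values are invented so the arithmetic is easy to follow.
df = pd.DataFrame({
    "Outlet_Type": ["Grocery Store", "Grocery Store",
                    "Supermarket Type 3", "Supermarket Type 3"],
    "Item_Outlet_Sales": [400.0, 600.0, 2000.0, 3000.0],
})

# Average sales per outlet format, highest first.
avg_by_type = (
    df.groupby("Outlet_Type")["Item_Outlet_Sales"]
      .mean()
      .sort_values(ascending=False)
)

# Ratio between the best and worst format in the toy data.
toy_ratio = avg_by_type.max() / avg_by_type.min()

# The same ratio, computed from the averages reported above.
reported_ratio = 2415 / 508

print(avg_by_type)
print(round(toy_ratio, 2), round(reported_ratio, 2))
```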

Strategic implication: For Blinkit, this data makes a clear case for auditing the Grocery Store portfolio. Converting even a portion of underperforming outlets to a Supermarket Type 1 configuration could meaningfully lift network-level revenue.

FINDING 3 - TIER 3 CITIES OUTSELL TIER 1. YES, REALLY.

Everyone assumes that Delhi, Mumbai, and Bengaluru drive the most revenue for any consumer platform. The data disagreed.

- Tier 3 cities → ₹1,531 average sales (highest)

- Tier 2 cities → ₹1,493 average sales

- Tier 1 cities → ₹1,471 average sales (lowest)
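
To put the size of this gap in perspective, the relative difference can be worked out from the three averages listed above. This is plain arithmetic on the reported figures and says nothing about statistical significance, which would need the row-level data.

```python
# Average sales per location tier, as reported above (in rupees).
tier_avg = {"Tier 1": 1471, "Tier 2": 1493, "Tier 3": 1531}

best = max(tier_avg, key=tier_avg.get)
worst = min(tier_avg, key=tier_avg.get)

# Relative gap between the strongest and weakest tier, in percent.
gap_pct = (tier_avg[best] - tier_avg[worst]) / tier_avg[worst] * 100

print(f"{best} vs {worst}: {gap_pct:.1f}% higher")
# prints "Tier 3 vs Tier 1: 4.1% higher"
```

The ordering is the surprise here. The gap itself is small next to the differences between outlet formats.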

The most likely explanation:

In Tier 1 metros, Blinkit competes with a dense ecosystem of offline retailers, other quick-commerce platforms, and traditional kirana stores. Market share is fragmented.

In Tier 3 cities, Blinkit often has fewer direct competitors, which means higher customer stickiness and a more loyal order base per outlet.

There's also a cost angle: dark store setup in Tier 3 is cheaper, and if revenue per outlet is comparable or higher, the unit economics become very attractive.

FINDING 4 - THE VARIABLE THAT SHOULD MATTER, BUT DOESN'T

In traditional retail, shelf visibility is everything.

Planogramming, the science of placing products on shelves, is a multi-billion dollar discipline built on the idea that what gets seen, gets bought.

In this dataset, Item_Visibility had a correlation of r = 0.017 with sales. Essentially zero.
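
One way to read a correlation this small is to square it: for a simple linear relationship, r² is the share of variance in sales that the variable accounts for on its own. The sketch below applies that to the two correlations reported in this post.

```python
# Pearson correlations with Item Outlet Sales, as reported in this post.
correlations = {"Item_MRP": 0.403, "Item_Visibility": 0.017}

# Squaring r gives the share of variance accounted for, in percent.
shared_variance_pct = {name: r ** 2 * 100 for name, r in correlations.items()}

for name, pct in shared_variance_pct.items():
    print(f"{name}: {pct:.2f}% of sales variance")
```

On its own, visibility accounts for roughly 0.03% of the variation in sales, against about 16% for price.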

Why?

Blinkit is an app. There are no shelves. "Visibility" in a quick-commerce environment is determined by search algorithms, recommendation engines, and ad placement, not physical display space.

The variable in this dataset is a legacy metric from traditional retail that simply doesn't translate to digital-first grocery.

This was one of my favourite findings, not because of what it shows, but because of what it signals. The rules of traditional retail don't automatically apply to quick commerce. Analysts working in this space need to replace legacy metrics with digital-native ones: click-through rate, search position, add-to-cart rate.

THE MODEL - WHAT THE REGRESSION TOLD ME

I built a Multiple Linear Regression model using scikit-learn, with an 80/20 train-test split (6,818 training records, 1,705 test records).

Categorical variables were one-hot encoded using pd.get_dummies with drop_first=True to avoid the dummy variable trap.
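
The encoding and split described above can be sketched as follows. The frame is synthetic (same row count as the real data, invented values), the column names are assumed, and `random_state=42` is an arbitrary choice for reproducibility rather than a setting taken from the project.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 8523  # same row count as the dataset, but every value below is synthetic

outlet_types = ["Grocery Store", "Supermarket Type 1",
                "Supermarket Type 2", "Supermarket Type 3"]

df = pd.DataFrame({
    "Item_MRP": rng.uniform(30, 270, n),
    "Outlet_Type": rng.choice(outlet_types, n),
    "Item_Outlet_Sales": rng.uniform(100, 5000, n),
})

# One-hot encode the categorical column. drop_first=True removes one dummy
# per variable so the remaining dummies are not perfectly collinear.
X = pd.get_dummies(
    df.drop(columns="Item_Outlet_Sales"),
    columns=["Outlet_Type"],
    drop_first=True,
)
y = df["Item_Outlet_Sales"]

# 80/20 split. scikit-learn rounds the test share up, so 8,523 rows
# become 6,818 training rows and 1,705 test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(list(X.columns))
print(len(X_train), len(X_test))
```

The dropped category (here "Grocery Store", the first alphabetically) becomes the baseline: each remaining dummy coefficient is read as the difference from that baseline.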

Results:

- Test R² = 0.1746 (model explains ~17.5% of variance in sales)

- Mean Absolute Error = ₹661

- Train R² = 0.1596 → Test R² = 0.1746 (no sign of overfitting: the model does about as well on unseen data as on the training data)
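
These three numbers come from fitting on the training rows and scoring on both sets. The sketch below shows the mechanics on synthetic data with a single predictor, where the slope and noise level are invented, so its output will not reproduce the results above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: a weak linear signal buried in noise (invented numbers).
rng = np.random.default_rng(0)
n = 2000
mrp = rng.uniform(30, 270, n)
sales = 5.0 * mrp + rng.normal(0, 800, n)

X = mrp.reshape(-1, 1)  # scikit-learn expects a 2-D feature matrix
X_train, X_test, y_train, y_test = train_test_split(
    X, sales, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

# Score on both sets. A test R² far below the train R² would point to overfitting.
train_r2 = r2_score(y_train, model.predict(X_train))
test_r2 = r2_score(y_test, model.predict(X_test))
test_mae = mean_absolute_error(y_test, model.predict(X_test))

print(f"coef = {model.coef_[0]:.2f}")
print(f"train R2 = {train_r2:.4f}, test R2 = {test_r2:.4f}, MAE = {test_mae:.0f}")
```

A small train-test gap only says the model is not memorising the training rows. It does not make a low R² any higher.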

The outlet type dummy variables carried the most predictive weight, which aligns with everything the EDA showed. Item_MRP (β = 5.24) was the strongest continuous predictor.

The low overall R² is also informative on its own. It suggests that factors outside this dataset (promotional pricing, seasonality, competitor behaviour, individual customer patterns) are responsible for a large portion of sales variance.

No static product-outlet dataset can fully capture those dynamics with a linear model.

WHAT I'D DO DIFFERENTLY NEXT TIME

This project was a solid first pass, but it also showed me exactly where to go next:

1. Switch to ensemble methods like Random Forest or XGBoost to capture non-linear interactions between variables.

2. Introduce time-series data to model seasonality and festival demand spikes.

3. Replace Item_Visibility with digital-native metrics like search rank, CTR, or add-to-cart rate.

4. Move from product-outlet data to transaction-level data to unlock basket analysis and customer lifetime value estimates.

CLOSING THOUGHT

"Sales variation at Blinkit isn't random. It's structured around two things: 

what you charge, and

what kind of store you're selling from. 

Everything else is noise until you have better data."


If you're a data analyst, or just curious about quick commerce, I'd love to hear your thoughts. What factors do you think are missing from a dataset like this?

Written by Anshula





