Multi-Tenant IoT Filtering Every Obvious Solution Failed Org Region North Region South Region West Site A Site B Site C Site D Site E Site F DEVICES 100k+ devices 7 approaches tried 100,000+ devices per user 2–3 node IDs per query

The Setup: An IoT Platform at Uncomfortable Scale

Imagine a multi-tenant IoT SaaS: factories, power grids, telecom towers — each client has thousands of devices reporting data every second. A single enterprise user (say, a regional manager at Reliance) might have access to 1–2 lakh devices (100,000–200,000). Their dashboard: a paginated table with 20+ columns — temperature, voltage, pressure, status, firmware version — filterable on any column, any combination.

THE STACK User Dashboard: paginated table, any-column filtering API / Query Layer PostgreSQL Users, Devices Ownership MongoDB Real-time readings 20 params ClickHouse Analytics, Time-series Aggregated Elasticsearch — Metadata search THE CATCH • 1-2 lakh devices per enterprise user • Writes: millions of documents per hour • Ownership: granted and revoked dynamically • Filters: any column, any database

That combination — scale, dynamic ownership, multi-database reads, arbitrary column filters — is what makes this genuinely hard. Let's walk through every approach we tried.

Approach 1 — The Naive $in Clause

The first instinct is completely natural: load the user's device IDs from SQL, then pass them to MongoDB.

// Step 1: get device IDs from relational source
const deviceIds = await sql.query(
  `SELECT device_id FROM user_devices WHERE user_id = ?`,
  [userId]
);
// deviceIds = [D1, D2, D3, ... D100000]  ← 100k elements

// Step 2: filter MongoDB with the $in list
const readings = await db.device_data.find({
  device_id: { $in: deviceIds },
  temperature: { $gt: 30 },
  status: 'ACTIVE'
}).sort({ timestamp: -1 }).limit(50);

This works flawlessly with a test user who has 20 devices. It degrades at 1,000. It collapses at 100,000.

WHY $in COLLAPSES AT SCALE Network Payload 100,000 IDs x 10 bytes ~1 MB sent on every query Index Traversal 100,000 individual B-tree seeks per single query 50 Concurrent Users 50 x 1MB = 50MB network 50 x 100k seeks simultaneously Query Latency Comparison 10 devices 2ms 1,000 devices ~200ms 100,000 devices 4,800ms ClickHouse, Elasticsearch: same problem -- every store receives the giant list on every page load, filter change, and sort operation. Verdict: Fine under ~1,000 devices. Prototype only beyond that.

Approach 2 — Store user_ids Inside Each Document

If cross-DB round trips are expensive, why not embed ownership directly in the telemetry document?

{
  "device_id": "D1",
  "user_ids": ["user_1", "user_456", "user_789"],
  "temperature": 32.4,
  "voltage": 4.5,
  "timestamp": "2025-01-15T10:00:00Z"
}
// Query becomes a single indexed lookup
db.device_data.find({
  user_ids: "user_456",
  temperature: { $gt: 30 }
});

Reads look great. The write side is a catastrophe.

THE WRITE MATH 1 Device 1 reading/second 86,400 docs/day Shared by 50 Users 86,400 x 50 user_ids 4.3M copies/day 100k Devices 4.3M x 100,000 430 BILLION/day Revocation Nightmare Grant access Patch how many historical docs? Revoke access UPDATE millions of rows under live load Verdict: Reasonable for write-once data with stable ownership. Never use for high-frequency IoT + dynamic sharing.

Approach 3 — Redis Cache + Chunked Parallel Queries

Keep ownership out of documents, but split the giant $in into parallel chunks processed concurrently:

// Load all device IDs from Redis (fast, avoids SQL round trip)
const allDeviceIds = await redis.smembers(`user:${userId}:devices`);
// allDeviceIds.length = 100,000

const CHUNK_SIZE = 500;
const chunks = chunkArray(allDeviceIds, CHUNK_SIZE);
// chunks.length = 200

// Fire 200 parallel MongoDB queries
const results = await Promise.all(
  chunks.map(chunk =>
    db.device_data.find({
      device_id: { $in: chunk },
      temperature: { $gt: 30 }
    }).toArray()
  )
);

// Merge, sort, paginate in application memory
const merged = results
  .flat()
  .sort((a, b) => b.timestamp - a.timestamp)
  .slice(page * pageSize, (page + 1) * pageSize);
WHY CHUNKING DOESN'T WORK Fan-out Math 100,000 IDs / 500 per chunk = 200 concurrent queries per page load At 30 Concurrent Users 30 x 200 = 6,000 simultaneous queries Pool: 100-200 connections max Pool exhaustion → cascade failure Pagination Fundamentally Broken Need: global page 1, sorted by timestamp Each chunk returns its own top-50 True pagination = merge ALL in memory 100k x 1000 = 100M docs in RAM Cache Sync Problem Redis SQL Any lag = user sees wrong devices Verdict: Better than one mega-$in. Still broken at this scale.

Approach 4 — Mapping Table in Each Database

Mirror the user_devices relationship inside each store so it can join locally — avoiding the cross-DB hop entirely.

// MongoDB: $lookup against a mirrored mapping collection
db.device_data.aggregate([
  {
    $lookup: {
      from: "user_device_mapping",
      localField: "device_id",
      foreignField: "device_id",
      pipeline: [{ $match: { user_id: "user_456" } }],
      as: "access"
    }
  },
  { $match: { access: { $ne: [] }, temperature: { $gt: 30 } } }
]);
-- ClickHouse: mapping table + join
SELECT d.*
FROM device_data d
INNER JOIN user_device_mapping m ON d.device_id = m.device_id
WHERE m.user_id = 'user_456'
  AND d.temperature > 30;

This looks clean and solves the cross-DB problem. It breaks in three ways at production scale:

THREE FAILURE MODES 1 Join Cost user_device_mapping: 100,000 rows device_data collection: 500M documents MongoDB $lookup = nested loop (100k iter) P99: 8,000ms+ 2 4-Way Sync PostgreSQL MongoDB ClickHouse Elastic Any failure = wrong devices shown, potential data leak Distributed consistency is hard 3 Tree ACL Breaks It Real platforms grant: "See everything under India region" Future devices auto-inherit A static flat list of device_ids cannot do this Model is fundamentally wrong Verdict: Works for flat ownership at small scale. Breaks under tree-based ACL + production volume.

The Real Problem: Access Control Is a Tree

This is the insight that makes all the previous approaches wrong at the root level. In real IoT products, users are not assigned to individual devices. They are assigned to nodes in a hierarchy, and they inherit the entire subtree.

ORG HIERARCHY (simplified) World (N1) India (N2) ← user_456 assigned HERE India (N3) different node, same label! Maharashtra (N8) Delhi (N5) Pune (N4) Factory A (N11) D3 D1 D2 D4 user_456 can see all of these Gujarat (N9) Pune (N6) D5 user_456 CANNOT see KEY CONSTRAINTS • Names are NOT unique -- "India", "Pune" repeat. Only IDs are unique. • New devices added under N2 are auto-visible to user_456 • If N8 moves under N3, user_456 loses those devices • Tree has millions of nodes, changes multiple times/hour, must work across MongoDB, ClickHouse, Elasticsearch The hard question: given node N2, resolve all devices in the subtree efficiently?

Approach 5 — Closure Table

Store every ancestor → descendant pair in SQL. Subtree membership becomes a single indexed lookup — no recursive CTE needed at read time.

-- Closure table structure
CREATE TABLE node_closure (
  ancestor_id   VARCHAR(36) NOT NULL,
  descendant_id VARCHAR(36) NOT NULL,
  depth         INT NOT NULL,
  is_device     BOOLEAN NOT NULL DEFAULT false,
  PRIMARY KEY (ancestor_id, descendant_id),
  INDEX idx_ancestor (ancestor_id),
  INDEX idx_descendant (descendant_id)
);

-- All device IDs under node N2 (user_456's assignment)
SELECT descendant_id AS device_id
FROM node_closure
WHERE ancestor_id = 'N2'
  AND is_device = true;
CLOSURE TABLE CONTENTS (partial) ancestor descendant depth is_device N1 N1 0 false N1 N2 1 false N1 N11 4 false N1 D1 5 true ← device N2 N2 0 false N2 N8 1 false N2 D1 4 true N11 D1 1 true Subtree of N2 = 1 indexed lookup Fast reads! BUT: Move N8 from N2 to N3 Must delete ALL ancestor-descendant pairs + re-insert under new ancestors 100k+ row deletes + inserts under live load Verdict: Excellent for slow-changing trees. Too expensive when large subtrees move frequently.

Approach 6 — Materialized Path

Store a string path per node. Subtree queries become a prefix match.

idnameparent_idpath
N1World/N1
N2IndiaN1/N1/N2
N8MaharashtraN2/N1/N2/N8
N4PuneN8/N1/N2/N8/N4
N3IndiaN1/N1/N3
N6PuneN3/N1/N3/N6
-- All devices under N2 (subtree prefix match)
SELECT id FROM nodes
WHERE path LIKE '/N1/N2/%'
  AND is_device = true;
-- Works because LIKE 'prefix%' can use a B-tree index

Simpler writes than closure table. The prefix LIKE scan degrades when the matching subtree returns millions of device IDs — you still end up with a huge list to pass to MongoDB and ClickHouse. The cross-DB problem remains unsolved.

Approach 7 — Graph Database (Neo4j)

Tree traversal is exactly what graph databases are built for. Neo4j makes the query elegant:

// Find all devices accessible to user_456
MATCH (user:User { id: 'user_456' })
      -[:ASSIGNED_TO]->(n:Node)
      -[:ANCESTOR_OF*]->(d:Device)
RETURN d.id AS device_id;

This solves the traversal problem beautifully — cheap edge changes, no recursive SQL, correct handling of non-unique labels. But it doesn't solve the cross-database filtering problem.

NEO4J SOLVES TRAVERSAL, NOT FILTERING Neo4j traversal: user_456 sees [D1, D2 ... D100k] Then you still do: db.device_data.find({ device_id: { $in: [D1..D100k] } }) Back to the $in explosion problem! Neo4j is a great piece of the puzzle but not the whole answer. Verdict: Use Neo4j as graph truth + traversal engine. Pair it with a separate strategy for cross-DB queries.

The Pattern That Works: Ancestor Nodes at Ingestion Time

The breakthrough insight: stop resolving the tree at read time. Resolve it at write time. When a device emits a reading, the ingestion pipeline attaches the full ancestry — every node ID from root to the device's location.

{
  "device_id": "D1",
  "ancestor_nodes": [
    { "id": "N1", "name": "World" },
    { "id": "N2", "name": "India" },
    { "id": "N8", "name": "Maharashtra" },
    { "id": "N4", "name": "Pune" },
    { "id": "N11", "name": "Factory A" }
  ],
  "temperature": 32.4,
  "voltage": 4.5,
  "status": "ACTIVE",
  "timestamp": "2025-01-15T10:00:00Z"
}

Now the query is trivially small regardless of fleet size:

// user_456 is assigned to nodes N2 and N6
const userNodes = await getUserNodeIds(userId); // ['N2', 'N6']

// MongoDB: tiny $in of node IDs — works at any device count
const readings = await db.device_data.find({
  "ancestor_nodes.id": { $in: userNodes },
  temperature: { $gt: 30 }
}).sort({ timestamp: -1 }).limit(50);
Before vs After: The Query That Changed Everything BEFORE: Filter by device IDs device_id: { $in: [ D1, D2, D3, D4, D5 ... D100000 ] } // 100,000 IDs in every query ~1 MB payload / request 4,800ms query latency Broken at scale AFTER: Filter by ancestor node IDs ancestor_nodes.id: { $in: [ 'N2', 'N6' ] } // 2–3 node IDs, always ~50 B payload / request 2ms query latency Scales to any fleet size THE FUNDAMENTAL SHIFT BEFORE (all previous approaches) User resolve 100k IDs $in 100k AFTER (ancestor nodes at ingestion) User load 2-3 node IDs $in 2-3 The $in list is the size of USER'S ORG ASSIGNMENTS not the device fleet. $in: ['N2', 'N6'] Always 2 items, regardless of fleet 100 or 100,000 devices: Same tiny query. Scales to any fleet size. Same fast index seek on ancestor_nodes.id Query cost is O(assigned nodes), not O(devices)

Why store both id and name: Access control uses IDs (unique). The UI lets users search by name — but "Pune" and "India" repeat across branches. Storing the full path allows the frontend to show disambiguation:

// User types "Pune" in the location picker
// UI shows full paths so user can pick the right one:
// → World > India (N2) > Maharashtra > Pune (N4)
// → World > India (N3) > Gujarat > Pune (N6)
// User selects row 1 → query uses N4, not the string "Pune"

The ClickHouse Immutability Problem

ClickHouse is append-optimized. Mass ALTER ... UPDATE to rewrite ancestry on every tree change is the wrong fight — it's expensive, slow, and defeats the purpose of a columnar analytics engine.

PRACTICAL SPLIT: MongoDB + ClickHouse device_latest (MongoDB) • One document per device • Current readings WITH ancestor_nodes • Updated on every new reading • Easy to update on tree changes Used for: Real-time dashboard, filters device_history (ClickHouse) • Append-only telemetry (never update) NO ancestor_nodes in fact table • Access via dictionary lookup at query time Used for: Time-series analytics, charts
-- ClickHouse dictionary: refreshed on schedule (e.g. every 5 min)
CREATE DICTIONARY user_nodes_dict (
  user_id  String,
  node_ids Array(String)
)
SOURCE(CLICKHOUSE(
  QUERY 'SELECT user_id, groupArray(node_id) FROM user_nodes GROUP BY user_id'
))
LIFETIME(300);  -- refresh every 5 minutes

-- Analytics query: dictionary resolves ancestry at query time
SELECT
  device_id,
  avg(temperature) AS avg_temp,
  max(voltage)     AS peak_voltage,
  count()          AS reading_count
FROM device_history
WHERE hasAny(
  ancestor_node_ids,
  dictGet('user_nodes_dict', 'node_ids', 'user_456')
)
AND timestamp > now() - INTERVAL 7 DAY
GROUP BY device_id
ORDER BY avg_temp DESC;

Full Architecture: End-to-End

INGESTION PIPELINE Device emits reading Kafka: raw_telemetry Ingestion Consumer Resolve ancestors from Redis MongoDB device_latest + ancestors (upsert per device) ClickHouse device_history (append-only) Elasticsearch metadata index + ancestor_nodes QUERY TIME User opens dashboard Load node IDs → ['N2', 'N6'] MongoDB (real-time) ancestor_nodes.id IN ['N2','N6'] ClickHouse (analytics) dict lookup + hasAny() TREE CHANGE EVENT Node N8 moves from N2 to N3 Kafka: NODE_MOVED {N8, N3} Neo4j: update edges Ancestry Repair Consumers (async) Recompute ancestors Update MongoDB Update Elasticsearch Invalidate Redis cache ClickHouse dict refresh on next 5-min cycle | Staleness: seconds to minutes

Tradeoffs: The Honest Summary

ApproachRead costWrite costTree changesUse when
Naive $inO(n) at scaleLowEasy< 1,000 devices
user_ids embeddedFastCatastrophicImpossibleWrite-once, stable ownership
Redis chunkingMediumMediumCache sync neededMedium scale, no tree ACL
Mapping table per DBJoin cost4-way syncSync all storesFlat ownership, small scale
Closure tableFast (indexed)MediumExpensive (bulk rewrite)Slow-changing trees
Materialized pathPrefix scanLowPath rewriteMedium scale, simple trees
Neo4j aloneFast traversalMediumCheapGraph truth — not cross-DB
Ancestors at ingestO(log n) alwaysSlight overheadAsync repairProduction multi-DB + trees

When This Pattern Doesn't Work

Be honest about the tradeoffs. The ancestor-at-ingestion pattern has real limitations:

  • Hard real-time revocation: When a node move happens, there's a staleness window (seconds to minutes) before all documents are repaired. If your security model requires instant revocation (financial, healthcare), this needs a stricter, more expensive design — typically a token-based approach where the access check happens at query time, not write time.
  • Storage overhead: Every document now carries the full ancestry array. For a 5-level deep tree, that's 5 extra node objects per document. Acceptable for most IoT; worth measuring for extremely high-frequency devices.
  • Repair complexity: The async ancestry repair pipeline is non-trivial. You need idempotent consumers, dead-letter queues, and monitoring for repair lag.

What I'd Do Differently

If I were building this again from scratch, I'd set up the ancestor-at-ingestion pattern from day one, and I'd invest early in the repair pipeline monitoring. The hardest part isn't the ingestion enrichment — that's straightforward. The hardest part is building confidence that the async repair is working correctly under load, and having the observability to know when it's lagging.

The failed approaches weren't wrong to try. Approach 1 is the right prototype. Approach 4 is the right design for a product with flat ownership and 50k devices. You only discover that you need the full ancestry pattern when the tree ACL requirements and the fleet scale arrive together — usually later than you'd like.

The core lesson

Push complexity to write time. Pay once per document at ingestion so every read is cheap. Design your access model around node IDs, not device IDs — your fleet will grow, but your org hierarchy won't flatten. And when in doubt, measure: the right approach for your constraints might not be the same as ours.