Streaming Algorithms
Last updated: Apr 3, 2026Table of Contents
- 0) Concept
- 0-0) Why Streaming Algorithms?
- 0-1) Types
- 0-2) When to Use Each Algorithm
- 1) Algorithm Templates
- 1-1) Reservoir Sampling (LC 382, 398)
- 1-2) Boyer-Moore Majority Vote (LC 169, 229)
- 1-3) Count-Min Sketch - Frequency Estimation
- 1-4) Bloom Filter - Membership Testing
- 2) LeetCode Problems by Pattern
- 2-1) Reservoir Sampling
- 2-2) Boyer-Moore Majority Vote
- 2-3) Count-Min Sketch (Conceptual)
- 2-4) Bloom Filter (Conceptual)
- 2-5) Related Streaming Problems
- 3) Common Mistakes & Edge Cases
- 🚫 Common Mistakes
- ⚠️ Edge Cases
- 4) Interview Tips & Complexity Analysis
- 💡 Interview Strategy
- 📊 Complexity Analysis
- 🎯 Interview Talking Points
- 🔧 Optimization Techniques
- 📚 Related Topics
- 5) Complete Code Examples
- Example 1: Weighted Reservoir Sampling (LC 528)
- Example 2: Stream Median (Related Problem LC 295)
- Summary
Streaming Algorithms
- Process data streams with limited memory
- When to use: Data too large to store, online processing, random sampling, frequency estimation
- Key LeetCode problems: LC 382, 398 (Reservoir Sampling), LC 169, 229 (Majority Element), LC 528 (Random Pick with Weight)
- Core techniques: Reservoir sampling, probabilistic data structures, approximate counting
- Space-time tradeoff: Trade accuracy for memory efficiency
Time Complexity: O(1) per element processed (streaming) Space Complexity: O(k) or O(log n) typically, independent of stream size
0) Concept
0-0) Why Streaming Algorithms?
Problem: Traditional algorithms assume entire dataset fits in memory
- Streaming Context: Data arrives continuously, too large to store
- Memory Constraint: Limited space, must process elements once
- Requirements: Make decisions with incomplete information
Key Characteristics:
- Single Pass: Process each element at most once
- Sublinear Space: Memory << data size
- Approximate Results: Trade exactness for efficiency (except Reservoir Sampling)
- Online Processing: Results available anytime
0-1) Types
1. Reservoir Sampling - Uniform Random Sampling
- Purpose: Select k random elements from stream of unknown size
- Guarantee: Each element has exactly k/n probability of selection
- Space: O(k) for k samples
- Applications: Random sampling, shuffling, fairness in selection
2. Morris Counter - Approximate Counting
- Purpose: Count large numbers with small memory
- Space: O(log log n) instead of O(log n)
- Trade-off: Approximate count with expected value n
- Applications: Large-scale counting, memory-constrained systems
3. Boyer-Moore Majority Vote - Heavy Hitters
- Purpose: Find element(s) appearing > n/k times
- Space: O(k) for top-k elements
- Applications: Majority element, top-k frequent elements
- Variants: Standard (k=2), Generalized (arbitrary k)
4. Count-Min Sketch - Frequency Estimation
- Purpose: Estimate frequency of elements in stream
- Space: O(w × d) hash table, w×d << n
- Guarantee: Overestimate by at most εn with probability 1-δ
- Applications: Network traffic analysis, query frequency
5. Bloom Filter - Membership Testing
- Purpose: Test if element is in set (probabilistic)
- Space: O(m) bits, m << n
- Guarantee: No false negatives, false positives with probability p
- Applications: Cache filtering, spell checking, malware detection
0-2) When to Use Each Algorithm
| Algorithm | Use Case | Space | Accuracy | Output |
|---|---|---|---|---|
| Reservoir Sampling | Random k samples | O(k) | Exact | k elements |
| Morris Counter | Count stream size | O(log log n) | Approximate | ~count |
| Boyer-Moore | Majority element | O(k) | Exact if exists | Candidates |
| Count-Min Sketch | Frequency estimation | O(w×d) | Approximate | ~frequency |
| Bloom Filter | Set membership | O(m) | False positives | yes/no |
Recognition Keywords:
- “Random sample”, “shuffle”, “fairness” → Reservoir Sampling
- “Majority element”, “appears more than n/k times” → Boyer-Moore
- “Estimate frequency”, “approximate count” → Count-Min Sketch
- “Check membership”, “seen before?” → Bloom Filter
- “Limited memory”, “one pass”, “streaming data” → Any streaming algorithm
1) Algorithm Templates
1-1) Reservoir Sampling (LC 382, 398)
Algorithm Overview:
- Maintain reservoir of k elements
- For element at index i (i ≥ k):
- Generate random j ∈ [0, i]
- If j < k, replace reservoir[j] with current element
- Proof: Element at index i has k/i chance to enter, k/(i+1) to stay → k/n final probability
Template 1: Single Random Element (k=1)
// LC 382 - Linked List Random Node
class Solution {
private ListNode head;
private Random random;
public Solution(ListNode head) {
this.head = head;
this.random = new Random();
}
/**
* time = O(N)
* space = O(1)
*/
public int getRandom() {
ListNode cur = head;
int result = cur.val;
int count = 1;
// Process stream: for each element i, include with prob 1/i
while (cur != null) {
// Generate random in [0, count-1], replace if 0
if (random.nextInt(count) == 0) {
result = cur.val;
}
count++;
cur = cur.next;
}
return result;
}
}
# Python implementation
import random
class Solution:
def __init__(self, head: ListNode):
self.head = head
def getRandom(self) -> int:
cur = self.head
result = cur.val
count = 1
cur = cur.next
while cur:
count += 1
# Include current element with probability 1/count
if random.randint(1, count) == count:
result = cur.val
cur = cur.next
return result
Template 2: k Random Elements
// LC 398 - Random Pick Index (variant: pick from array with duplicates)
class Solution {
private int[] nums;
private Random random;
public Solution(int[] nums) {
this.nums = nums;
this.random = new Random();
}
/**
* time = O(N)
* space = O(1)
*/
public int pick(int target) {
int result = -1;
int count = 0;
for (int i = 0; i < nums.length; i++) {
if (nums[i] == target) {
count++;
// Replace with probability 1/count
if (random.nextInt(count) == 0) {
result = i;
}
}
}
return result;
}
}
Template 3: General k-Reservoir Sampling
class ReservoirSampling {
private int k;
private int[] reservoir;
private Random random;
private int count; // Elements seen so far
public ReservoirSampling(int k) {
this.k = k;
this.reservoir = new int[k];
this.random = new Random();
this.count = 0;
}
/**
* time = O(1) per element
* space = O(k)
*/
public void add(int val) {
if (count < k) {
// Fill reservoir first
reservoir[count] = val;
} else {
// Generate random index in [0, count]
int j = random.nextInt(count + 1);
if (j < k) {
// Replace element in reservoir
reservoir[j] = val;
}
}
count++;
}
public int[] getSample() {
return Arrays.copyOf(reservoir, Math.min(k, count));
}
}
# Python k-reservoir sampling
import random
class ReservoirSampling:
def __init__(self, k: int):
self.k = k
self.reservoir = []
self.count = 0
def add(self, val: int) -> None:
"""Time: O(1), Space: O(k)"""
if self.count < self.k:
self.reservoir.append(val)
else:
# Random index in [0, count]
j = random.randint(0, self.count)
if j < self.k:
self.reservoir[j] = val
self.count += 1
def get_sample(self) -> list:
return self.reservoir[:]
Key Insight: Probability math ensures uniform distribution
- Element i enters with probability k/i
- Stays with probability: (1 - k/(i+1)) × (1 - k/(i+2)) × … × (1 - k/n)
- Final probability: k/i × i/(i+1) × (i+1)/(i+2) × … × (n-1)/n = k/n ✓
1-2) Boyer-Moore Majority Vote (LC 169, 229)
Algorithm Overview:
- Find element(s) appearing more than ⌊n/k⌋ times
- Key Idea: Pair different elements and cancel them out
- Majority element will survive cancellation
- Two-phase: (1) Find candidates, (2) Verify counts
Template 1: Standard Majority Element (> n/2)
// LC 169 - Majority Element
class Solution {
/**
* time = O(N)
* space = O(1)
*/
public int majorityElement(int[] nums) {
int candidate = nums[0];
int count = 1;
// Phase 1: Find candidate
for (int i = 1; i < nums.length; i++) {
if (count == 0) {
candidate = nums[i];
count = 1;
} else if (nums[i] == candidate) {
count++;
} else {
count--; // Cancel out different element
}
}
// Phase 2: Verify (not needed if majority guaranteed)
// If not guaranteed, count candidate appearances
return candidate;
}
}
# Python implementation
def majorityElement(nums: list[int]) -> int:
"""Time: O(n), Space: O(1)"""
candidate = None
count = 0
# Phase 1: Find candidate
for num in nums:
if count == 0:
candidate = num
count = 1
elif num == candidate:
count += 1
else:
count -= 1
return candidate
Template 2: Generalized - Elements Appearing > n/3 Times (LC 229)
// LC 229 - Majority Element II (> n/3)
class Solution {
/**
* time = O(N)
* space = O(1)
*/
public List<Integer> majorityElement(int[] nums) {
// At most 2 elements can appear > n/3 times
int candidate1 = 0, candidate2 = 0;
int count1 = 0, count2 = 0;
// Phase 1: Find candidates (at most k-1 candidates for n/k)
for (int num : nums) {
if (num == candidate1) {
count1++;
} else if (num == candidate2) {
count2++;
} else if (count1 == 0) {
candidate1 = num;
count1 = 1;
} else if (count2 == 0) {
candidate2 = num;
count2 = 1;
} else {
// Different from both: cancel out
count1--;
count2--;
}
}
// Phase 2: Verify candidates
count1 = 0;
count2 = 0;
for (int num : nums) {
if (num == candidate1) count1++;
else if (num == candidate2) count2++;
}
List<Integer> result = new ArrayList<>();
if (count1 > nums.length / 3) result.add(candidate1);
if (count2 > nums.length / 3) result.add(candidate2);
return result;
}
}
# Python - Majority Element II
def majorityElement(nums: list[int]) -> list[int]:
"""Time: O(n), Space: O(1)"""
# Phase 1: Find up to 2 candidates
candidate1, candidate2 = None, None
count1, count2 = 0, 0
for num in nums:
if num == candidate1:
count1 += 1
elif num == candidate2:
count2 += 1
elif count1 == 0:
candidate1, count1 = num, 1
elif count2 == 0:
candidate2, count2 = num, 1
else:
count1 -= 1
count2 -= 1
# Phase 2: Verify
result = []
for candidate in [candidate1, candidate2]:
if nums.count(candidate) > len(nums) // 3:
result.append(candidate)
return result
Generalized Template: Top k-1 Elements (> n/k)
class BoyerMooreGeneralized {
/**
* Find elements appearing > n/k times
* time = O(N * k)
* space = O(k)
*/
public List<Integer> majorityElement(int[] nums, int k) {
// Can have at most k-1 elements appearing > n/k times
Map<Integer, Integer> candidates = new HashMap<>();
// Phase 1: Find candidates
for (int num : nums) {
if (candidates.containsKey(num)) {
candidates.put(num, candidates.get(num) + 1);
} else if (candidates.size() < k - 1) {
candidates.put(num, 1);
} else {
// Decrement all counts (cancellation)
List<Integer> toRemove = new ArrayList<>();
for (Map.Entry<Integer, Integer> entry : candidates.entrySet()) {
int count = entry.getValue() - 1;
if (count == 0) {
toRemove.add(entry.getKey());
} else {
candidates.put(entry.getKey(), count);
}
}
for (int key : toRemove) {
candidates.remove(key);
}
}
}
// Phase 2: Verify candidates
Map<Integer, Integer> counts = new HashMap<>();
for (int num : nums) {
if (candidates.containsKey(num)) {
counts.put(num, counts.getOrDefault(num, 0) + 1);
}
}
List<Integer> result = new ArrayList<>();
for (Map.Entry<Integer, Integer> entry : counts.entrySet()) {
if (entry.getValue() > nums.length / k) {
result.add(entry.getKey());
}
}
return result;
}
}
1-3) Count-Min Sketch - Frequency Estimation
Algorithm Overview:
- Probabilistic data structure for frequency estimation
- Uses d hash functions and w counters per hash
- Query: return minimum count across all hash functions
- Guarantee: Overestimates by at most εn with probability 1-δ
- Parameters: w = ⌈e/ε⌉, d = ⌈ln(1/δ)⌉
Template:
class CountMinSketch {
private int[][] count; // d × w matrix
private int d; // Number of hash functions
private int w; // Width of each row
private long[] hashA, hashB; // Hash parameters
private static final long PRIME = 2147483647L; // Large prime
/**
* Constructor
* epsilon: error tolerance (e.g., 0.01 for 1% error)
* delta: failure probability (e.g., 0.01 for 99% confidence)
*/
public CountMinSketch(double epsilon, double delta) {
this.w = (int) Math.ceil(Math.E / epsilon);
this.d = (int) Math.ceil(Math.log(1.0 / delta));
this.count = new int[d][w];
// Initialize hash functions: h(x) = ((a*x + b) % p) % w
Random random = new Random();
hashA = new long[d];
hashB = new long[d];
for (int i = 0; i < d; i++) {
hashA[i] = random.nextInt(Integer.MAX_VALUE);
hashB[i] = random.nextInt(Integer.MAX_VALUE);
}
}
private int hash(int item, int i) {
// Hash function i
long hash = ((hashA[i] * item + hashB[i]) % PRIME) % w;
return (int) hash;
}
/**
* Add item to sketch
* time = O(d)
* space = O(1)
*/
public void add(int item) {
for (int i = 0; i < d; i++) {
int index = hash(item, i);
count[i][index]++;
}
}
/**
* Estimate frequency of item
* time = O(d)
* space = O(1)
* Returns: estimated count (may overestimate, never underestimate)
*/
public int estimateCount(int item) {
int minCount = Integer.MAX_VALUE;
for (int i = 0; i < d; i++) {
int index = hash(item, i);
minCount = Math.min(minCount, count[i][index]);
}
return minCount;
}
/**
* Get space usage
*/
public int getSpace() {
return d * w; // Number of counters
}
}
# Python Count-Min Sketch
import math
import random
class CountMinSketch:
def __init__(self, epsilon: float, delta: float):
"""
epsilon: error tolerance
delta: failure probability
"""
self.w = math.ceil(math.e / epsilon)
self.d = math.ceil(math.log(1.0 / delta))
self.count = [[0] * self.w for _ in range(self.d)]
# Initialize hash parameters
PRIME = 2147483647
self.hash_a = [random.randint(1, PRIME-1) for _ in range(self.d)]
self.hash_b = [random.randint(0, PRIME-1) for _ in range(self.d)]
self.PRIME = PRIME
def _hash(self, item: int, i: int) -> int:
"""Hash function i"""
return ((self.hash_a[i] * item + self.hash_b[i]) % self.PRIME) % self.w
def add(self, item: int) -> None:
"""Time: O(d), Space: O(1)"""
for i in range(self.d):
index = self._hash(item, i)
self.count[i][index] += 1
def estimate_count(self, item: int) -> int:
"""Time: O(d), Returns estimated frequency"""
return min(self.count[i][self._hash(item, i)] for i in range(self.d))
Usage Example:
// Track top-k frequent elements in stream
CountMinSketch cms = new CountMinSketch(0.01, 0.01); // 1% error, 99% confidence
// Process stream
for (int item : stream) {
cms.add(item);
}
// Query frequencies
int freq = cms.estimateCount(42); // Estimated frequency of item 42
1-4) Bloom Filter - Membership Testing
Algorithm Overview:
- Bit array of m bits, k hash functions
- Insert: Set k bits to 1 using k hash functions
- Query: Check if all k bits are 1
- Guarantee: No false negatives, false positive rate ≈ (1 - e^(-kn/m))^k
- Optimal k: (m/n) × ln(2)
Template:
class BloomFilter {
private BitSet bitSet;
private int m; // Bit array size
private int k; // Number of hash functions
private long[] hashSeeds;
/**
* Constructor
* expectedElements: expected number of elements
* falsePositiveRate: desired false positive probability (e.g., 0.01)
*/
public BloomFilter(int expectedElements, double falsePositiveRate) {
// Optimal m: -(n * ln(p)) / (ln(2)^2)
this.m = (int) Math.ceil(
-(expectedElements * Math.log(falsePositiveRate)) / Math.pow(Math.log(2), 2)
);
// Optimal k: (m/n) * ln(2)
this.k = (int) Math.ceil((m / (double) expectedElements) * Math.log(2));
this.bitSet = new BitSet(m);
// Initialize hash seeds
Random random = new Random();
hashSeeds = new long[k];
for (int i = 0; i < k; i++) {
hashSeeds[i] = random.nextLong();
}
}
private int hash(String item, int i) {
long hash = hashSeeds[i];
for (char c : item.toCharArray()) {
hash = hash * 31 + c;
}
return Math.abs((int) (hash % m));
}
/**
* Add item to filter
* time = O(k)
* space = O(1)
*/
public void add(String item) {
for (int i = 0; i < k; i++) {
int index = hash(item, i);
bitSet.set(index);
}
}
/**
* Check if item might be in set
* time = O(k)
* space = O(1)
* Returns: true if might exist (possible false positive)
* false if definitely not in set (no false negative)
*/
public boolean mightContain(String item) {
for (int i = 0; i < k; i++) {
int index = hash(item, i);
if (!bitSet.get(index)) {
return false; // Definitely not in set
}
}
return true; // Might be in set
}
/**
* Get expected false positive rate
*/
public double getFalsePositiveRate(int insertedElements) {
return Math.pow(1 - Math.exp(-k * insertedElements / (double) m), k);
}
}
# Python Bloom Filter
import math
import mmh3 # MurmurHash3 (install: pip install mmh3)
from bitarray import bitarray # (install: pip install bitarray)
class BloomFilter:
def __init__(self, expected_elements: int, false_positive_rate: float):
"""
expected_elements: expected number of items
false_positive_rate: desired false positive probability
"""
# Optimal m
self.m = math.ceil(
-(expected_elements * math.log(false_positive_rate)) / (math.log(2) ** 2)
)
# Optimal k
self.k = math.ceil((self.m / expected_elements) * math.log(2))
self.bit_array = bitarray(self.m)
self.bit_array.setall(0)
def add(self, item: str) -> None:
"""Time: O(k)"""
for i in range(self.k):
index = mmh3.hash(item, i) % self.m
self.bit_array[index] = 1
def might_contain(self, item: str) -> bool:
"""Time: O(k), False positives possible, no false negatives"""
for i in range(self.k):
index = mmh3.hash(item, i) % self.m
if not self.bit_array[index]:
return False # Definitely not in set
return True # Might be in set
Usage Example:
// Check if URL has been visited
BloomFilter bf = new BloomFilter(10000, 0.01); // 10k URLs, 1% false positive
// Add URLs
bf.add("https://example.com");
bf.add("https://google.com");
// Query
if (bf.mightContain("https://example.com")) {
// Might be visited (99% confident)
}
if (!bf.mightContain("https://new-site.com")) {
// Definitely NOT visited (100% sure)
}
2) LeetCode Problems by Pattern
2-1) Reservoir Sampling
| Problem | Difficulty | Pattern | Notes |
|---|---|---|---|
| LC 382 | Medium | Single random (k=1) | Linked list, unknown length |
| LC 398 | Medium | Weighted sampling | Array with duplicates |
| LC 528 | Medium | Weighted random | Prefix sum + binary search |
| LC 519 | Medium | Random 2D matrix | Reset and shuffle |
LC 382 - Linked List Random Node
- Unknown list length
- O(n) time per call, O(1) space
- Classic reservoir sampling k=1
LC 398 - Random Pick Index
- Multiple occurrences of target
- Uniform selection among duplicates
- Reservoir sampling variant
2-2) Boyer-Moore Majority Vote
| Problem | Difficulty | Threshold | Candidates |
|---|---|---|---|
| LC 169 | Easy | > n/2 | 1 candidate |
| LC 229 | Medium | > n/3 | 2 candidates |
| - | - | > n/k | k-1 candidates |
LC 169 - Majority Element
- Guaranteed majority exists
- Single candidate suffices
- No verification phase needed
LC 229 - Majority Element II
- Up to 2 elements can appear > n/3 times
- Must verify candidates
- Two-phase algorithm essential
2-3) Count-Min Sketch (Conceptual)
Applications (Not direct LC problems):
- Top K Frequent Elements (LC 347) - Can use CMS + min-heap
- Kth Largest Element in Stream (LC 703) - Related streaming problem
- Data Stream as Disjoint Intervals (LC 352) - Stream processing
When to Apply:
- Large streams where HashMap exceeds memory
- Approximate answers acceptable
- Need frequency estimates, not exact counts
2-4) Bloom Filter (Conceptual)
Applications (Not direct LC problems):
- Contains Duplicate (LC 217) - Can use BF for space optimization
- First Unique Character (LC 387) - Variant: use BF as pre-filter
- Design HashMap (LC 706) - Can add BF layer for optimization
When to Apply:
- Membership queries on large sets
- False positives acceptable
- Space-critical applications (e.g., web crawlers)
2-5) Related Streaming Problems
| Problem | Difficulty | Pattern | Algorithm |
|---|---|---|---|
| LC 295 | Hard | Data stream median | Two heaps |
| LC 346 | Easy | Moving average | Sliding window + queue |
| LC 352 | Hard | Disjoint intervals | TreeMap/BST |
| LC 703 | Easy | Kth largest in stream | Min-heap |
3) Common Mistakes & Edge Cases
🚫 Common Mistakes
Reservoir Sampling:
-
Wrong Probability Calculation
// ❌ WRONG: This is NOT uniform if (random.nextInt(i + 1) < k) { // Incorrect for k=1 reservoir[...] = element; } // ✅ CORRECT: For k=1 if (random.nextInt(count) == 0) { result = element; } -
Forgetting to Update Count
// ❌ WRONG: Count not incremented while (cur != null) { if (random.nextInt(count) == 0) result = cur.val; cur = cur.next; // Forgot: count++ } -
Off-by-One in Random Range
// ❌ WRONG: random.nextInt(count + 1) for decision // ✅ CORRECT: random.nextInt(count) for k=1
Boyer-Moore:
-
Missing Verification Phase
// ❌ WRONG: Assuming candidate is always majority return candidate; // What if no majority exists? // ✅ CORRECT: Verify count int actualCount = 0; for (int num : nums) { if (num == candidate) actualCount++; } return actualCount > nums.length / 2 ? candidate : -1; -
Wrong Cancellation Logic
// ❌ WRONG: Only decrement candidate's count if (num != candidate) count--; // ✅ CORRECT: Check count == 0 to reset if (count == 0) { candidate = num; count = 1; } -
For LC 229: Not Handling Same Candidates
// ❌ WRONG: candidate1 and candidate2 can be same initially // ✅ CORRECT: Check candidate1 first, then candidate2 if (num == candidate1) count1++; else if (num == candidate2) count2++;
Count-Min Sketch:
-
Using Max Instead of Min
// ❌ WRONG: Taking maximum (gives wrong estimate) int max = Arrays.stream(counts).max().getAsInt(); // ✅ CORRECT: Taking minimum (minimizes overestimation) int min = Arrays.stream(counts).min().getAsInt(); -
Insufficient Hash Functions
- Too few hash functions → high collision rate
- Rule of thumb: d ≥ 3 for reasonable accuracy
Bloom Filter:
-
Resetting Bits on Delete
// ❌ WRONG: Bloom filters don't support deletion public void delete(String item) { for (int i = 0; i < k; i++) { bitSet.clear(hash(item, i)); // Breaks other items! } } // ✅ CORRECT: Use Counting Bloom Filter for deletions -
Ignoring False Positive Rate
- Too small bit array → high false positive rate
- Calculate optimal m and k based on expected elements
⚠️ Edge Cases
-
Empty Stream
- Reservoir: Return empty or throw exception
- Boyer-Moore: No majority element
- Bloom Filter: All queries return false
-
Single Element
- Reservoir: Always return that element
- Boyer-Moore: Single element is majority
-
All Elements Same
- Boyer-Moore: Single candidate with count = n
- Reservoir: Any element is representative
-
Stream Size Unknown
- Reservoir Sampling handles this naturally
- Count-Min Sketch: May need to adjust parameters dynamically
-
Integer Overflow
- Count-Min Sketch: Use long for large counts
- Hash functions: Use proper modulo arithmetic
4) Interview Tips & Complexity Analysis
💡 Interview Strategy
When to Use Streaming Algorithms:
Recognition Patterns:
- “Process data stream” → Streaming algorithm
- “Limited memory” or “O(1) space” → Streaming
- “Unknown size” or “infinite stream” → Reservoir Sampling
- “Majority element” → Boyer-Moore
- “Approximate count/frequency” → Count-Min Sketch
- “Have you seen before?” → Bloom Filter
Problem-Solving Framework:
1. Identify constraint: Memory vs Accuracy
├─ Exact answer needed? → Reservoir Sampling or Boyer-Moore
└─ Approximate OK? → Count-Min Sketch or Bloom Filter
2. Determine stream characteristics:
├─ Need random samples? → Reservoir Sampling
├─ Need heavy hitters? → Boyer-Moore
├─ Need frequency estimates? → Count-Min Sketch
└─ Need membership tests? → Bloom Filter
3. Consider space-time tradeoff:
├─ More space → Better accuracy
└─ Less space → More approximation error
📊 Complexity Analysis
| Algorithm | Time (per element) | Space | Accuracy |
|---|---|---|---|
| Reservoir Sampling | O(1) | O(k) | Exact |
| Boyer-Moore | O(1) | O(k) | Exact (if exists) |
| Count-Min Sketch | O(d) | O(w × d) | ε-approximate |
| Bloom Filter | O(k) | O(m) bits | False positives |
Space Comparisons:
- HashMap: O(n) - exact counts for all elements
- Count-Min Sketch: O(w × d) where w × d << n
- Bloom Filter: O(m) bits, m = O(n log(1/p))
Typical Parameters:
- Count-Min Sketch: w = 1000, d = 5 → 5KB for millions of elements
- Bloom Filter: 10 bits per element for 1% false positive rate
🎯 Interview Talking Points
-
Why Reservoir Sampling Works:
- “Each element has equal probability k/n of being selected”
- “Prove by induction: P(element i selected) = k/i × (i/(i+1)) × … × (n-1)/n = k/n”
-
Boyer-Moore Intuition:
- “Pair up different elements and cancel them”
- “Majority element survives cancellation”
- “At most k-1 elements can appear more than n/k times”
-
Count-Min Sketch Trade-off:
- “Trade space for accuracy: w and d control error bounds”
- “Never underestimates (only overestimates due to collisions)”
- “Minimum across hash functions reduces overestimation”
-
Bloom Filter Properties:
- “False positives possible, false negatives impossible”
- “Can’t delete items (unless using Counting Bloom Filter)”
- “Space-efficient: ~10 bits per element for 1% FP rate”
🔧 Optimization Techniques
-
Reservoir Sampling:
- Use
random.nextInt(count++)idiom for cleaner code - For weighted sampling: Use prefix sum + binary search
- Use
-
Boyer-Moore:
- For k=2 (majority > n/2): Skip verification if guaranteed
- For k=3 (> n/3): Track exactly 2 candidates
- For general k: Use HashMap with size k-1
-
Count-Min Sketch:
- Conservative Update: Only increment minimum count
- Use faster hash functions (MurmurHash, CityHash)
- Adjust w and d based on accuracy requirements
-
Bloom Filter:
- Choose hash functions that minimize correlation
- Consider Counting Bloom Filter for deletions
- Use bit-level operations for space efficiency
📚 Related Topics
- Sampling: Stratified sampling, importance sampling
- Approximate Algorithms: HyperLogLog (cardinality estimation)
- Probabilistic: Skip lists, treaps
- Streaming: Misra-Gries algorithm, Space Saving algorithm
- Online Algorithms: Competitive analysis, adversarial models
5) Complete Code Examples
Example 1: Weighted Reservoir Sampling (LC 528)
// LC 528 - Random Pick with Weight
class Solution {
private int[] prefixSum;
private Random random;
/**
* time = O(N) for constructor
* space = O(N)
*/
public Solution(int[] w) {
prefixSum = new int[w.length];
prefixSum[0] = w[0];
for (int i = 1; i < w.length; i++) {
prefixSum[i] = prefixSum[i - 1] + w[i];
}
random = new Random();
}
/**
* time = O(log N) using binary search
* space = O(1)
*/
public int pickIndex() {
int target = random.nextInt(prefixSum[prefixSum.length - 1]) + 1;
// Binary search for target in prefixSum
int left = 0, right = prefixSum.length - 1;
while (left < right) {
int mid = left + (right - left) / 2;
if (prefixSum[mid] < target) {
left = mid + 1;
} else {
right = mid;
}
}
return left;
}
}
Example 2: Stream Median (Related Problem LC 295)
// LC 295 - Find Median from Data Stream
class MedianFinder {
private PriorityQueue<Integer> maxHeap; // Lower half
private PriorityQueue<Integer> minHeap; // Upper half
public MedianFinder() {
maxHeap = new PriorityQueue<>((a, b) -> b - a);
minHeap = new PriorityQueue<>();
}
/**
* time = O(log N)
* space = O(N)
*/
public void addNum(int num) {
// Add to max heap first
maxHeap.offer(num);
// Balance: move largest from max to min
minHeap.offer(maxHeap.poll());
// Maintain size property: maxHeap.size >= minHeap.size
if (maxHeap.size() < minHeap.size()) {
maxHeap.offer(minHeap.poll());
}
}
/**
* time = O(1)
* space = O(1)
*/
public double findMedian() {
if (maxHeap.size() > minHeap.size()) {
return maxHeap.peek();
}
return (maxHeap.peek() + minHeap.peek()) / 2.0;
}
}
Summary
Core Streaming Algorithms:
- ✅ Reservoir Sampling - Uniform random sampling from stream
- ✅ Boyer-Moore - Find majority/heavy hitters (exact)
- ✅ Count-Min Sketch - Frequency estimation (approximate)
- ✅ Bloom Filter - Membership testing (probabilistic)
Key Takeaways:
- Streaming algorithms trade accuracy for space efficiency
- Reservoir Sampling: Exact, O(k) space for k samples
- Boyer-Moore: Exact for heavy hitters, O(k) space
- Count-Min Sketch: Approximate, tunable error bounds
- Bloom Filter: No false negatives, space-efficient
Interview Focus:
- Understand probability proofs (especially Reservoir Sampling)
- Know when to use each algorithm
- Master two-phase verification (Boyer-Moore)
- Explain space-accuracy tradeoffs
Practice Problems:
- Start with LC 382, 398 (Reservoir Sampling)
- Master LC 169, 229 (Boyer-Moore)
- Understand LC 295, 703 (Stream + Heap)
- Study hash-based designs (LC 706, 705)