Python Collections Module: Hidden Gems for Data Science
Introduction
When we think about Python for Data Science, libraries like pandas, NumPy, and scikit-learn usually steal the spotlight. And rightfully so — they are incredible tools. But sometimes, the most elegant solution to a data problem is already sitting quietly in Python’s standard library.
Today, I want to talk about one of my favorite underappreciated modules: collections.
The collections module provides specialized container datatypes that can replace Python’s general-purpose built-in containers (dict, list, tuple, set) in many common scenarios — often with cleaner code and better performance.
Let’s explore the five key data structures it offers and see how they apply to real data science tasks.
1. Counter — Counting Made Trivial
Counter is a dict subclass for counting hashable objects. If you’ve ever written a word frequency analysis or counted unique values in a column, you’ve probably reinvented this wheel.
from collections import Counter
# Count word frequencies in a text
text = "the quick brown fox jumps over the lazy dog the fox"
word_counts = Counter(text.split())
print(word_counts)
# Counter({'the': 3, 'fox': 2, 'quick': 1, 'brown': 1, ...})
# Most common words
print(word_counts.most_common(2))
# [('the', 3), ('fox', 2)]Data Science Use Case: Quickly computing value distributions, finding the mode of a dataset, or building a simple bag-of-words model for NLP tasks.
# Count occurrences in a pandas-like list
labels = ['cat', 'dog', 'cat', 'bird', 'cat', 'dog', 'cat']
label_dist = Counter(labels)
print(label_dist) # Counter({'cat': 4, 'dog': 2, 'bird': 1})2. defaultdict — No More KeyError Anxiety
defaultdict is a dict subclass that automatically initializes missing keys with a default value. This eliminates the need for tedious if key in dict checks.
from collections import defaultdict
# Group items by category
data = [
('fruit', 'apple'), ('vegetable', 'carrot'),
('fruit', 'banana'), ('vegetable', 'broccoli'),
('fruit', 'cherry')
]
grouped = defaultdict(list)
for category, item in data:
grouped[category].append(item)
print(dict(grouped))
# {'fruit': ['apple', 'banana', 'cherry'], 'vegetable': ['carrot', 'broccoli']}Data Science Use Case: Grouping records by a key, building adjacency lists for graph algorithms, or accumulating values during data preprocessing without boilerplate initialization code.
3. namedtuple — Readable, Lightweight Objects
namedtuple creates tuple subclasses with named fields. They’re memory-efficient like regular tuples but far more readable because you access fields by name instead of index.
from collections import namedtuple
# Define a data structure for a customer record
Customer = namedtuple('Customer', ['name', 'age', 'city', 'spend'])
c1 = Customer(name='Alice', age=30, city='Jakarta', spend=150.0)
c2 = Customer(name='Bob', age=25, city='Bandung', spend=200.0)
print(c1.name) # Alice
print(c2.spend) # 200.0
# Still works as a regular tuple
print(c1[0]) # AliceData Science Use Case: Representing structured records when you don’t need the full overhead of a pandas DataFrame or a dataclass. Great for intermediate data processing steps where readability matters.
4. deque — Efficient Double-Ended Queue
deque (double-ended queue) supports fast appends and pops from both ends — O(1) time complexity compared to O(n) for lists when popping from the left.
from collections import deque
# Maintain a sliding window of the last 5 values
window = deque(maxlen=5)
for i in range(10):
window.append(i)
print(f"Window: {list(window)}")
# Output:
# Window: [0]
# Window: [0, 1]
# Window: [0, 1, 2]
# Window: [0, 1, 2, 3]
# Window: [0, 1, 2, 3, 4]
# Window: [1, 2, 3, 4, 5] ← oldest value (0) automatically removed
# Window: [2, 3, 4, 5, 6] ← oldest value (1) automatically removed
# ...Data Science Use Case: Implementing sliding window calculations (moving averages), maintaining a fixed-size buffer of recent data points, or building efficient FIFO/LIFO queues for streaming data pipelines.
5. OrderedDict — When Insertion Order Matters
OrderedDict remembers the order in which items were inserted. Since Python 3.7+, regular dict objects also maintain insertion order, but OrderedDict still offers unique methods like move_to_end() and popitem(last=...).
from collections import OrderedDict
# Implement a simple LRU (Least Recently Used) cache
class LRUCache:
def __init__(self, capacity: int):
self.cache = OrderedDict()
self.capacity = capacity
def get(self, key):
if key not in self.cache:
return -1
self.cache.move_to_end(key) # Mark as recently used
return self.cache[key]
def put(self, key, value):
if key in self.cache:
self.cache.move_to_end(key)
self.cache[key] = value
if len(self.cache) > self.capacity:
self.cache.popitem(last=False) # Remove least recently used
# Usage
cache = LRUCache(3)
cache.put('a', 1)
cache.put('b', 2)
cache.put('c', 3)
print(list(cache.cache.keys())) # ['a', 'b', 'c']
cache.get('a') # Access 'a', move to end
cache.put('d', 4) # Evict 'b' (least recently used)
print(list(cache.cache.keys())) # ['c', 'a', 'd']Data Science Use Case: Building custom caching layers for expensive computations, implementing eviction policies for in-memory data stores, or managing ordered pipelines where sequence matters.
Bringing It All Together
Here’s a mini data processing pipeline that uses several collections tools in one go:
from collections import Counter, defaultdict, namedtuple, deque
# Sample: analyze a stream of user click events
ClickEvent = namedtuple('ClickEvent', ['user_id', 'page', 'timestamp'])
events = [
ClickEvent('u1', '/home', '10:00'),
ClickEvent('u2', '/products', '10:01'),
ClickEvent('u1', '/products', '10:02'),
ClickEvent('u3', '/home', '10:03'),
ClickEvent('u1', '/cart', '10:04'),
ClickEvent('u2', '/cart', '10:05'),
ClickEvent('u3', '/products', '10:06'),
]
# 1. Count page popularity
page_counts = Counter(event.page for event in events)
print("Page popularity:", page_counts.most_common())
# 2. Group events by user
user_events = defaultdict(list)
for event in events:
user_events[event.user_id].append(event.page)
print("User journeys:", dict(user_events))
# 3. Track last 3 events in a sliding window
recent = deque(maxlen=3)
for event in events:
recent.append(event.page)
print("Last 3 pages viewed:", list(recent))Output:
Page popularity: [('/home', 2), ('/products', 3), ('/cart', 2)]
User journeys: {'u1': ['/home', '/products', '/cart'], 'u2': ['/products', '/cart'], 'u3': ['/home', '/products']}
Last 3 pages viewed: ['/cart', '/home', '/products']Key Takeaways
| Data Structure | Best For | Performance Benefit |
|---|---|---|
Counter | Counting, frequency analysis | Cleaner than manual dict counting |
defaultdict | Grouping, accumulating | No KeyError handling needed |
namedtuple | Lightweight structured data | Memory-efficient, readable |
deque | Queues, sliding windows | O(1) append/pop from both ends |
OrderedDict | Ordered caches, LRU | move_to_end(), popitem() |
The beauty of the collections module is that it’s already there — no pip install required. It’s part of Python’s standard library, battle-tested, and optimized in C under the hood.
Next time you find yourself writing verbose boilerplate for counting, grouping, or managing ordered data, take a moment to check if collections already has a tool for the job. You might be surprised how much cleaner your code becomes.
Happy coding! 🐍
~ Kang Ifaz