Find out how HackerEarth can boost your tech recruiting

Learn more
piller_image

Collections and Defaultdict in Python

NSA whistleblower in exile, Edward Snowden, talks about how FBI could have reviewed 650K emails in less than 8 days!

snowden_tweet

Snowden says the FBI could have used hashing to identify emails that were not copies of ones they had already seen. Few things capture people’s interest like alleged conspiracies and political intrigue, yes? I’m no different. But what interests more is hashing. Touted by many as the “greatest idea in programming,” hashing, which involves the hash function, helps you find, say A, stored somewhere, say B. For example, the organizing and accessing of names and numbers in your “can’t bear to be parted from” smartphone.

Hashing is a technique where a data-structure called the “hash map” is implemented. This structure is an associative array where specific keys are mapped to specific values. A hash function is then used to compute an index into an array of buckets or slots from which the desired value can be found. The result is that (key, value) lookups are extremely fast and more efficient than searches based on popular trees like BST. To get in-depth knowledge about hashing, I recommend that you can go through our “Basics of Hash Tables” in our practice section.

Almost all modern languages have hashing implemented at the language level. In Python, hashing is implemented using the dictionary data structure, which is one of the basic data structures a beginner in Python learns. If you have only been using the dict module implementation in your code, I suggest you look at other implementations like defaultdicts and ordereddicts and use them more frequently in your code. Here, we will look more closely into the defaultdict module.

Defaultdicts come in the Collections internal library. Collections contains alternatives to the general purpose Python containers like dict, set, list, and tuple. Kind of like the Dark Knight is the more interesting “implementation” of Bruce Wayne.

burger

Defaultdict is subclassed from the built-in dict module. You may have encountered the following common uses cases for which you have been using the default container.

Building nested dicts or JSON type constructs:

JSON is a very popular data structure. One of the major use cases for a JSON is creating web APIs. JSON also neatly corresponds to our dict object. A sample JSON object could look like this.

{"menu":
    {"id": "file",
    "value": "File",
    "popup": {
        "menuitem": [
        {"value": "New", "onclick": "CreateNewDoc()"},
        {"value": "Open", "onclick": "OpenDoc()"},
        {"value": "Close", "onclick": "CloseDoc()"}
    ]}
}}

Source:http://json.org/example.html.

We cannot create a json file by using the following command; it will throw a KeyError.

some_dict = {}
some_dict["menu"]["popup"]["value"] = "New"

So, we will have to write complicated error handling code to handle this KeyError.

This way of writing is considered un-Pythonic. In its place, try out the following construct.

import collections
tree = lambda: collections.defaultdict(tree)
some_dict = tree()
# below will create non existent keys
some_dict["menu"]["popup"]["value"] = "New"

A defaultdict is initialized with a function (“default factory”) that takes no arguments and provides the default value for a non-existent key. A defaultdict will never raise a KeyError. Any key that does not exist gets the value returned by the default factory.

Please ensure that you pass function objects to defaultdict. Do not call the function, that is, defaultdict(func), not defaultdict(func()).

Let’s check out how it works.

ice_cream = collections.defaultdict(lambda: 'Vanilla')
ice_cream['Sarah'] = 'Chunky Monkey'
ice_cream['Abdul'] = 'Butter Pecan'
print(ice_cream['Sarah']) # out: 'Chunky Monkey'
print(ice_cream['Joe']) # out: 'Vanilla

Having cool default values:

Another fast and flexible use case is to use itertools.repeat() which can supply any constant value.

import itertools
def constant_factory(value):
    return itertools.repeat(value).next
d = collections.defaultdict(constant_factory(''))
d.update(name='John', action='ran')
print('%(name)s %(action)s to %(object)s' % d)

This should print out “John ran to.” As you can observe, the “object” variable gracefully defaulted to an empty string.

Performance:

Like you see in this stackoverflow post, we tried to do a similar benchmarking only between dicts(setdefault) and defaultdict. You can see it  here: https://github.com/infinite-Joy/hacks/blob/master/defaultdict_benchmarking.ipynb

from collections import defaultdict

try:
    t=unichr(100)
except NameError:
    unichr=chr

def f1(li):
    '''defaultdict'''
    d = defaultdict(list)
    for k, v in li:
        d[k].append(v)
    return d.items()

def f2(li):
    '''setdefault'''
    d={}
    for k, v in li:
        d.setdefault(k, []).append(v)
    return d.items()


if __name__ == '__main__':
    import timeit
    import sys
    print(sys.version)
    few=[('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]
    fmt='{:>12}: {:10.2f} micro sec/call ({:,} elements, {:,} keys)'
    for tag, m, n in [('small',5,10000), ('medium',20,1000), ('bigger',1000,100), ('large',5000,10)]:
        for f in [f1,f2]:
            s = few*m
            res=timeit.timeit("{}(s)".format(f.__name__), setup="from __main__ import {}, s".format(f.__name__), number=n)
            st=fmt.format(f.__doc__, res/n*1000000, len(s), len(f(s)))
            print(st)
            s = [(unichr(i%0x10000),i) for i in range(1,len(s)+1)]
            res=timeit.timeit("{}(s)".format(f.__name__), setup="from __main__ import {}, s".format(f.__name__), number=n)
            st=fmt.format(f.__doc__, res/n*1000000, len(s), len(f(s)))
            print(st)
        print()

Below is the output that I got on my machine using Anaconda.

3.5.2 |Anaconda 4.1.1 (32-bit)| (default, Jul  5 2016, 11:45:57) [MSC v.1900 32 bit (Intel)]
 defaultdict:       5.48 micro sec/call (25 elements, 3 keys)
 defaultdict:      11.20 micro sec/call (25 elements, 25 keys)
  setdefault:       7.80 micro sec/call (25 elements, 3 keys)
  setdefault:       8.97 micro sec/call (25 elements, 25 keys)

 defaultdict:      14.66 micro sec/call (100 elements, 3 keys)
 defaultdict:      42.19 micro sec/call (100 elements, 100 keys)
  setdefault:      26.71 micro sec/call (100 elements, 3 keys)
  setdefault:      34.78 micro sec/call (100 elements, 100 keys)

 defaultdict:     623.21 micro sec/call (5,000 elements, 3 keys)
 defaultdict:    2207.91 micro sec/call (5,000 elements, 5,000 keys)
  setdefault:    1329.99 micro sec/call (5,000 elements, 3 keys)
  setdefault:    3076.57 micro sec/call (5,000 elements, 5,000 keys)

 defaultdict:    4625.00 micro sec/call (25,000 elements, 3 keys)
 defaultdict:   15950.98 micro sec/call (25,000 elements, 25,000 keys)
  setdefault:    6907.47 micro sec/call (25,000 elements, 3 keys)
  setdefault:   17605.08 micro sec/call (25,000 elements, 25,000 keys)

Following are  the broad inferences that can be made from the data:

1. defaultdict is faster and simpler with small data sets.
2. defaultdict is faster for larger data sets with more homogenous key sets.
3. setdefault has an advantage over defaultdict if we consider more heterogeneous key sets.

Note: The results have been taken by running it on my machine with Python 3.5 implementation of Anaconda. I strongly recommend you to not follow these blindly. Do your own benchmarking tests with your own data before implementing your algorithm.

Now that we have discussed the DefaultDict module, I hope that you are already thinking of using it more and also refactoring your code base to implement this module more. Next, I’ll be coming up with a detailed discussion on the Counter module.

References:
stackoverflow, How are Python’s Built In Dictionaries Implemented
stackoverflow, Is a Python dictionary an example of a hash table?e
python.org, Dictionary in Python
python.org, Python3 docs, collections — Container datatypes
python.org, Python2 docs, collections — Container datatypes
accelebrate, Using defaultdict in Python

Hackerearth Subscribe

Get advanced recruiting insights delivered every month

Related reads

Best Interview Questions For Assessing Tech Culture Fit in 2024
Best Interview Questions For Assessing Tech Culture Fit in 2024

Best Interview Questions For Assessing Tech Culture Fit in 2024

Finding the right talent goes beyond technical skills and experience. Culture fit plays a crucial role in building successful teams and fostering long-term…

Best Hiring Platforms in 2024: Guide for All Recruiters
Best Hiring Platforms in 2024: Guide for All Recruiters

Best Hiring Platforms in 2024: Guide for All Recruiters

Looking to onboard a recruiting platform for your hiring needs/ This in-depth guide will teach you how to compare and evaluate hiring platforms…

Best Assessment Software in 2024 for Tech Recruiting
Best Assessment Software in 2024 for Tech Recruiting

Best Assessment Software in 2024 for Tech Recruiting

Assessment software has come a long way from its humble beginnings. In education, these tools are breaking down geographical barriers, enabling remote testing…

Top Video Interview Softwares for Tech and Non-Tech Recruiting in 2024: A Comprehensive Review
Top Video Interview Softwares for Tech and Non-Tech Recruiting in 2024: A Comprehensive Review

Top Video Interview Softwares for Tech and Non-Tech Recruiting in 2024: A Comprehensive Review

With a globalized workforce and the rise of remote work models, video interviews enable efficient and flexible candidate screening and evaluation. Video interviews…

8 Top Tech Skills to Hire For in 2024
8 Top Tech Skills to Hire For in 2024

8 Top Tech Skills to Hire For in 2024

Hiring is hard — no doubt. Identifying the top technical skills that you should hire for is even harder. But we’ve got your…

How HackerEarth and Olibr are Reshaping Tech Talent Discovery
How HackerEarth and Olibr are Reshaping Tech Talent Discovery

How HackerEarth and Olibr are Reshaping Tech Talent Discovery

In the fast-paced tech world, finding the right talent is paramount. For years, HackerEarth has empowered tech recruiters to identify top talent through…

Hackerearth Subscribe

Get advanced recruiting insights delivered every month

View More

Top Products

Hackathons

Engage global developers through innovation

Hackerearth Hackathons Learn more

Assessments

AI-driven advanced coding assessments

Hackerearth Assessments Learn more

FaceCode

Real-time code editor for effective coding interviews

Hackerearth FaceCode Learn more

L & D

Tailored learning paths for continuous assessments

Hackerearth Learning and Development Learn more