Python Collections : High Performing Containers For Complex Problems

Posted on Updated on


1.Introduction

Python is known for its powerful general purpose built-in data types like list, dict, tuple and set.  But Python also has collection objects like Java and C++.  These objects are developed on top of the general built-in containers with addtional functionalities which can be used in special scenarios.

The objective of this article is to introduce python collection objects and explain them with apropriate code snippets.  The collections library contains the collections objects, they are namedtuples (v2.6), deque (v2.4), ChainMap(v3.3), Counter(v2.7 ), OrderedDict(v2.7), defaultdict(v2.5) .  Python 3.x also has userDict, userList, userString to create own custom container types (not in the scope of this article), which deserves a separate article.

NOTE: Python 2.x user might aware of various releases they are objects got introduced.  All these objects are available in Python 3.x from 3.1 onwards, except that ChainMap which  got introduced in v3.3.  All the code snippets in the articles are executed in Python 3.5 environment.

2. Namedtuple

As the name suggests, namedtuple is a tuple with name.  In standard tuple, we access the elements using the index, whereas namedtuple allows user to define name for elements.  This is very handy especially processing csv (comma separated value) files and working with complex and large dataset, where the code becomes messy with the use of indices (not so pythonic).

2.1 Example 1

Namedtuples are available in collections library in python. We have to import collections library before using any of container object from this library.

>>>from collections import namedtuple
>>>saleRecord = namedtuple('saleRecord','shopId saleDate salesAmout totalCustomers')
>>>
>>>
>>>#Assign values to a named tuple 
>>>shop11=saleRecord(11,'2015-01-01',2300,150) 
>>>shop12=saleRecord(shopId=22,saleDate="2015-01-01",saleAmout=1512,totalCustomers=125)

In the above code snippet, in the first line we import namedtuple from the collections library. In the second line we create a namedtuple called “saleRecord”, which has shopId, saleDate, salesAmount and totalCustomers as fields. Note that namedtuple() takes two string arguments, first argument is the name of tuple and second argument is the list of fields names seperated by space or comma. In the above example space is used as delimeter.
We have also created two tuples here. They are shop11 and shop12.  For shop11, the values are assigned to fields based on the order of the fields and shop12, the values are assigned using the names.

2.2 Example 2

>>>#Reading as a namedtuple
>>>print("Shop Id =",shop12.shopId)
12
>>>print("Sale Date=",shop12.saleDate)
2015-01-01
>>>print("Sales Amount =",shop12.salesAmount)
1512
>>>print("Total Customers =",shop12.totalCustomers)
125

The above code is pretty much clear that tuple is accessed using the names. It is also possible to access them using indexes of the tuples which is the usual way.

2.3 Interesting Methods and Members

2.3.1 _make

The _make method is used to convert the given iteratable item (list, tuple,dictionary) into a named tuple.

>>>#Convert a list into a namedtuple
>>>aList = [101,"2015-01-02",1250,199]
>>>shop101 = saleRecord._make(aList)
>>>print(shop101)
saleRecord(shopId=101, saleDate='2015-01-02', salesAmount=1250, totalCustomers=199)

>>>#Convert a tuple into a namedtuple
>>>aTup =(108,"2015-02-28",1990,189)
>>>shop108=saleRecord._make(aTup)
>>>print(shop108)
saleRecord(shopId=108, saleDate='2015-02-28', salesAmount=1990, totalCustomers=189)
>>>

2.3.2 _fields

The _fields is a tuple, which contains the names of the tuple.

>>>print(sho108._fields)
>>>('shopId', 'saleDate', 'salesAmount', 'totalCustomers')

2.4 CSV File Processing

As we discussed namedtuple will be very handy while processing a csv data file, where we can access the data using names instead of indexes, which make the code more meaningful and efficient.

from csv import reader
from collections import namedtuple

saleRecord = namedtuple('saleRecord','shopId saleDate totalSales totalCustomers')
fileHandle = open("salesRecord.csv","r")
csvFieldsList=csv.reader(fileHandle)
for fieldsList in csvFieldsList:
    shopRec = saleRecord._make(fieldsList)
    overAllSales += shopRec.totalSales;

print("Total Sales of The Retail Chain =",overAllSales)

In the above code snippet, we have the files salesRecord.csv which contains sales records of shops of a particular retain chain. It contains the values for the fields shopId,saleDate,totalSales,totalCustomers. The fields are delimited by comma and the records are delimited by new line.
The csv.reader() read the file and provides a iterator. The iterator, “csvFieldsList” provides list of fields for every single row of the csv file. As we know the _make() converts the list into namedtuple and the rest of the code is self explanatory.

 

3.Counter

Counter is used for rapid tallies.  It is a dictionary, where the elements are stored as keys and their counts are stored as values.

3.1 Creating Counters

The Counter() class takes an iteratable object as an argument and computes the count for each element in the object and present as a key value pair.

>>>from collections import Counter
>>>listOfInts=[1,2,3,4,1,2,3,1,2,1]
>>>cnt=Counter(listOfInts)
>>>print(cnt)
Counter({1: 4, 2: 3, 3: 2, 4: 1})

In the above code snippet, listOfInts is a list which contains numbers. It is passed to Counter() and we got cnt, which is a container object. The cnt is a dictionary, which contains the unique numbers present in the given list as keys, and their respect counts as the value.

3.2 Accessing Counters

Counter is a subclass of dictionary.  So it can be accessed the same as dictionary.   The “cnt” can be handled as a regular dictionary object.

>>> cnt.items()
dict_items([(1, 4), (2, 3), (3, 2), (4, 1)])
>>> cnt.keys()
dict_keys([1, 2, 3, 4])
>>> cnt.values()
dict_values([4, 3, 2, 1])

3.3 Interesting Methods & Usecases

3.3.1 most_common

The most_common(n) of Counter class, provides most commonly occured keys. The n is used as a rank, for example, n = 2 will provide top two keys.

>>>name = "Saravanan Subramanian"
>>>letterCnt=Counter(name)
>>>letterCnt.most_common(1)
[('a', 7)]
>>>letterCnt.most_common(2)
[('a', 7), ('n', 4)]
>>>letterCnt.most_common(3)
[('a', 7), ('n', 4), ('r', 2)]

In the above code, we could see that the string is parsed as independent characters as keys and their respective count is stored as values. So, the letterCnt.most_common(1) provides the top letter which has highest occurances.

3.3.2 Operations on Counter

The Counter() subclass is also called as Multiset. It supports addition, substraction, unition and intersection operations on the Counter class.

>>> a = Counter(x=1,y=2,z=3)
>>> b = Counter(x=2,y=3,z=4)
>>> a+b
Counter({'z': 7, 'y': 5, 'x': 3})
>>> a-b       #This will result in negative values & will be omitted
Counter()    
>>> b-a
Counter({'y': 1, 'x': 1, 'z': 1})
>>> a & b    #Chooses the minimum values from their respective pair
Counter({'z': 3, 'y': 2, 'x': 1})
>>> a | b   #Chooses the maximum values from their respective pair
Counter({'z': 4, 'y': 3, 'x': 2})

4. Default Dictionary

The defaultdict() is available part of collections library. It allows the user to specify a function to be called when key is not present in the dictionary.

In a standard dictionary, accesing an element where the key is not present, will raise “Key Error”. So, this is a problem when working working with collections (list, set, etc), especially while creating them.

So, when a dictionary is queried for a key, which is not exists, the function passed as an argument to the named argument “default_dictionary” of default_dict() will called to set a value for given “key” into dictionary.

4.1 Creating Default Dictionary

The defaultdict() is available part of collections library.  The default dict takes a function without argument which returns value as an argument.

4.1.1 Example 1

>>> 
>>> booksIndex = defaultdict(lambda:'Not Available')
>>> booksIndex['a']='Arts'
>>> booksIndex['b']='Biography'
>>> booksIndex['c']='Computer'
>>> print(booksIndex)
defaultdict(<function  at 0x030EB3D8>, {'c': 'Computer', 'b': 'Biography', 'a': 'Arts'})
>>> booksIndex['z']
'Not Available'
>>> print(booksIndex)
defaultdict(<function  at 0x030EB3D8>, {'c': 'Computer', 'b': 'Biography', 'z': 'Not Available', 'a': 'Arts'})
>>> 

In the above example, the booksIndex is a defaultdict, where it set ‘Not Available” as a value if any non-existant key is accessed. We have added values for keys a, b & c into the defaultdict. The print(booksIndex) shows that the defaultdict contains values only for these keys. While trying to access the value for key ‘z’, which we have not set, it returned value as ‘Not Available‘ and updated the dictionary.

4.1.2 Example 2

>>> titleIndex = [('a','Arts'),('b','Biography'),('c','Computer'),('a','Army'),('c','Chemistry'),('d','Dogs')]
>>> rackIndices = defaultdict(list)
>>> for id,title in titleIndex:
	rackIndices[id].append(title)	
>>> rackIndices.items()
dict_items([('d', ['Dogs']), ('b', ['Biography']), ('a', ['Arts', 'Army']), ('c', ['Computer', 'Chemistry'])])
>>> 

In the above example, titleIndex contains a list of tuples. We want to aggregate this list of tuples to identify titles for each alphabets. So, we can have a dictionary where key is the alphabet and value is the list of titles. Here we used a defaultdict with “list” as a function to be called for missing elements. So for each new elements list will be called, and it will create an empty list object. The consecutive append() methods on the list will add elements to the list.

5. Ordered Dictionary

The ordered dictionary maintains the order of elements addition into the dictionary, where the standard dictionary will not maintain the order of inclusion.

5.1 Ordered Dictionary Creation

Ordered Dictionary is created using OrderedDict() from collections library. It an subsclass of regular dictionary, so it inherits all other methods and behaviours of regular dictionary.

>>> from collections import OrderedDict
>>> dOrder=OrderedDict()
>>> dOrder['a']='Alpha'
>>> dOrder['b']='Bravo'
>>> dOrder['c']='Charlie'
>>> dOrder['d']='Delta'
>>> dOrder['e']='Echo'
>>> dOrder
>>> OrderedDict([('a', 'Alpha'), ('b', 'Bravo'), ('c', 'Charlie'), ('d', 'Delta'), ('e', 'Echo')])
>>> >>> dOrder.keys()
odict_keys(['a', 'b', 'c', 'd', 'e'])
>>> dOrder.values()
odict_values(['Alpha', 'Bravo', 'Charlie', 'Delta', 'Echo'])
>>> dOrder.items()
odict_items([('a', 'Alpha'), ('b', 'Bravo'), ('c', 'Charlie'), ('d', 'Delta'), ('e', 'Echo')])
>>> 

5.2 Creating from other iteratable items

OrderedDict can also be created by passing an dictionary or a list of key, value pair tuples.

>>> from collections import OrderedDict
>>> listKeyVals = [(1,"One"),(2,"Two"),(3,"Three"),(4,"Four"),(5,"Five")]
>>> x = OrderedDict(listKeyVals)
>>> x
OrderedDict([(1, 'One'), (2, 'Two'), (3, 'Three'), (4, 'Four'), (5, 'Five')])
>>> 

5.3 Sort and Store

One of the interesting use case for OrderedDict is Rank problem. For example, consider the problem a dictionary contains students names and their marks, now we have to find out the best student and rank them according to their marks. So, OrderedDict is the right choice here. Since OrderedDict will remember the order or addition and sorted() will sort a dictionary we can combine both to created a rank list based on the student marks. Please check the example below:

>>> studentMarks={}
>>> studentMarks["Saravanan"]=100
>>> studentMarks["Subhash"]=99
>>> studentMarks["Raju"]=78
>>> studentMarks["Arun"]=85
>>> studentMarks["Hasan"]=67
>>> studentMarks
{'Arun': 85, 'Subhash': 99, 'Raju': 78, 'Hasan': 67, 'Saravanan': 100}
>>> sorted(studentMarks.items(),key=lambda t:t[0])
[('Arun', 85), ('Hasan', 67), ('Raju', 78), ('Saravanan', 100), ('Subhash', 99)]
>>> sorted(studentMarks.items(),key=lambda t:t[1])
[('Hasan', 67), ('Raju', 78), ('Arun', 85), ('Subhash', 99), ('Saravanan', 100)]
>>> sorted(studentMarks.items(), key = lambda t:-t[1])
[('Saravanan', 100), ('Subhash', 99), ('Arun', 85), ('Raju', 78), ('Hasan', 67)]
>>> rankOrder = OrderedDict(sorted(studentMarks.items(), key = lambda t:-t[1]))
>>> rankOrder
OrderedDict([('Saravanan', 100), ('Subhash', 99), ('Arun', 85), ('Raju', 78), ('Hasan', 67)])

In the above example, studentMarks is a dictionary contains the student name as a key and their mark as the value. It got sorted using its value and passed to OrderedDict and got stored in rankOrder. Now rankOrder contains the highest marked student as the first entry, and next highest as the second entry and so on. This ordered is presevered in this dictionary.

6. Deque

Deque means double ended queue and it pronounced as “deck”. It is an extention to the standard list data structure. The standard list allows the user to append or extend elements only at the end. But deque allows the user to operate on both ends, so that the user can implement both stacks and queues.

6.1 Creation & Performing Operations on Deque

The deque() is available in collections library. It takes iteratable entity as an argument and an optional maximum length. If maxlen is set, it ensure that deque length does not exceeds the size of the maxlen.

>>> from collections import deque
>>> aiao = deque([1,2,3,4,5],maxlen=5)
aiao = deque([1,2,3,4,5])
>>> aiao.append(6)
>>> aiao
deque([2, 3, 4, 5, 6], maxlen=5)
>>> aiao.appendleft(1)
>>> aiao
deque([1, 2, 3, 4, 5], maxlen=5)

In the above example, we have created a deque with maxlen 5, once we appended 6th element on the right, it pushed first element on the left.  Similarly, it pushes out the last element on the right when we append element on the left.

6.2 Operations on Right

Operations on the right are common to performing any opertions on the list.  The methods append(), extend() and pop() are operate on the rightside of the deque().

>>> aiao.append(6)
>>> aiao
deque([2, 3, 4, 5, 6], maxlen=5)
>>> aiao.extend([7,8,9])
>>> aiao
deque([5, 6, 7, 8, 9], maxlen=5)
>>> aiao.pop()
9

6.3 Operation on the Left

The special feature of performing operations on the left is supported by set of methods like appendleft(), extendleft(), popleft().

>>> aiao = deque([1,2,3,4,5],maxlen=5)
>>> aiao.appendleft(0)
>>> aiao
deque([0, 1, 2, 3, 4], maxlen=5)
>>> aiao.extendleft([-1,-2,-3])
>>> aiao
deque([-3, -2, -1, 0, 1], maxlen=5)
>>> aiao.popleft()
-3

6.4 Example 2 (without maxlen)

If the maxlen value is not set, the deque does not perform any trimming operations to maintain the size of the deque.

>>> aiao = deque([1,2,3,4,5])
>>> aiao.appendleft(0)
>>> aiao
deque([0, 1, 2, 3, 4, 5])
>>> aiao.extendleft([-1,-2,-3])
>>> aiao
deque([-3, -2, -1, 0, 1, 2, 3, 4, 5])
>>> 

From the above example, the deque aiao continues to grow for the append and extend operations performed on it.

7. ChainMap

ChainMap allows to combine multiple dictionaries into a single dictionary, so that operations can be performed on single logical entity.  The ChainMap() does not create any new dictionary, instead it maintains references to the original dictionaries, all operations are performed only on the referred dictionaries.

7.1 Creating ChainMap

>>> from collections import ChainMap
>>> x = {'a':'Alpha','b':'Beta','c':'Cat'}
>>> y = { 'c': "Charlie", 'd':"Delta", 'e':"Echo"}
>>> z = ChainMap(x,y)
>>> z
ChainMap({'c': 'Cat', 'b': 'Beta', 'a': 'Alpha'}, {'d': 'Delta', 'c': 'Charlie', 'e': 'Echo'})
>>> list(z.keys())
['b', 'd', 'c', 'e', 'a']
>>> list(z.values())
['Beta', 'Delta', 'Cat', 'Echo', 'Alpha']
>>> list(z.items())
[('b', 'Beta'), ('d', 'Delta'), ('c', 'Cat'), ('e', 'Echo'), ('a', 'Alpha')]

We have created ChainMap z from other dictionaries x & y. The ChainMap z is reference to the dictionaries x and y. ChainMap will not maintain duplicate keys, it returns presents value ‘Cat’ for key ‘c’. So, basically it skips the second occurance of the same key.

>>> x
{'c': 'Cat', 'b': 'Beta', 'a': 'Alpha'}
>>> y
{'d': 'Delta', 'c': 'Charlie', 'e': 'Echo'}
>>> x.pop('c')
'Cat'
>>> x
{'b': 'Beta', 'a': 'Alpha'}
>>> list(z.keys())
['d', 'c', 'b', 'e', 'a']
>>> list(z.values())
['Delta', 'Charlie', 'Beta', 'Echo', 'Alpha']
>>> list(z.items())
[('d', 'Delta'), ('c', 'Charlie'), ('b', 'Beta'), ('e', 'Echo'), ('a', 'Alpha')]
>>> 

In the above code, we have removed the key ‘c’ from dict x. Now the ChainMap points the value for key ‘c’ to “Charlie”, which is present in y.

8. Summary

We have seen various python collection data types and understand them with example and use cases. The official python documentation can be referred for further reading.

9. References

[1] – Python Wiki – https://docs.python.org/3.5/library/collections.html

Advertisements

One thought on “Python Collections : High Performing Containers For Complex Problems

    bharath dantuluri said:
    February 23, 2017 at 4:03 am

    Nice explanation

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s