A list of handy tips and tricks for programming in Python.
from functools import partial
from datetime import datetime
import inspect, logging, string, numpy as np, pandas as pd, sqlparse
from fastcore.all import *
from fastcore.docments import *
from IPython.display import Markdown, display, HTML
from pygments import highlight
from pygments.lexers import PythonLexer
from pygments.formatters import HtmlFormatter
def print_function_source(fn):
    "Display the source code of `fn` with syntax highlighting."
    formatter = HtmlFormatter()
    display(HTML('<style type="text/css">{}</style>{}'.format(
        formatter.get_style_defs('.highlight'),
        highlight(inspect.getsource(fn), PythonLexer(), formatter))))
The purpose of this is to introduce concepts I believe data scientists could benefit from knowing.
I am assuming that the reader knows the basics of programming. I will cover concepts that I frequently see underused or misused, regardless of how basic or advanced they may be.
Comprehensions in Python should be used where possible. They are faster than the equivalent for loops and require less code when they fit the problem.
x = [2,3,4,5]
out=[]
%timeit for i in range(1000000): out.append(i+1)
%timeit [i+1 for i in range(1000000)]
A comprehension is essentially special syntax for a for loop, and is useful in a subset of them. Any time you see the pattern where you initialize something and then modify or build it inside a for loop, you can likely use a comprehension.
out = []
for o in range(5): out.append(o**o)
out
[o**o for o in range(5)]
List comprehensions are the most common, but you can also write set comprehensions, dict comprehensions, and comprehensions for other data types; note that parentheses give you a generator expression rather than a tuple comprehension.
set(o**o for o in range(5))
{str(o):o**o for o in range(5)}
A few handy patterns are:
adict = {"a":1,"b":2}
{v:k for k,v in adict.items()}
x = [1,2,3,4]
y = [5,6,7,8]
[a+b for a,b in zip(x,y)]
unique_combos = L((a,b) for a in x for b in y)
unique_combos
Destructured assignments mean you can break up iterables when you assign. This is handy for cutting pointless lines of code.
a,b = 5,6
a,b,c = [],[],{}
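Python also supports extended unpacking with a starred name that collects whatever is left over; here's a quick sketch (the variable names are just for illustration).
first, *middle, last = [1,2,3,4,5]
print(f"{first=}, {middle=}, {last=}")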
Another use is to break up nested lists so that all the first elements go into their own list, all the second elements into their own list, and so on.
I often see this done with multiple list comprehensions, doing [o[0] for o in [x,y,z]] to get the first elements, then repeating for the other positions.
However, we can do this more easily with the help of zip and destructured assignment.
nested_list = [[1,2,3],[4,5,6],[7,8,9]]
nested_list
first_elements, second_elements, third_elements = list(zip(*nested_list))
print(f"{first_elements=}")
print(f"{second_elements=}")
print(f"{third_elements=}")
Fastcore is a great library to know. It has a lot of useful features and extensions to the Python standard library, and it's designed to be used in live environments like Jupyter notebooks.
See this blog post
docments is a nice way of documenting code concisely while keeping that info accessible from code. It's compact, easy to manipulate to display how you want, and easy to read. I much prefer it over large numpy-style docstrings that are big string blocks.
from fastcore.docments import *
def distance(pointa:tuple, # tuple representing the coordinates of the first point (x,y)
             pointb:tuple=(0,0) # tuple representing the coordinates of the second point (x,y)
            )->float: # float representing distance between pointa and pointb
    '''Calculates the distance between pointa and pointb'''
    edges = np.abs(np.subtract(pointa,pointb))
    distance = np.sqrt((edges**2).sum())
    return distance
docstring(distance)
docments(distance)
docments(distance,full=True)
Everyone agrees testing is important, but not all testing is equal. The needs for unit testing the Google code base are not the same as what a data scientist needs for building and deploying models, libraries, and most software.
Fastcore covers most of my testing needs. It's fast and simple enough that I can add tests as I explore and build models. I want testing to enhance my development workflow, not be something I have to painstakingly build at the end.
Sometimes simple assert statements are sufficient, but there are small annoyances. For example, a small change in type can mean a failed test. Sometimes that change in type should cause a failure; sometimes I'm fine with a different type as long as the values are the same.
from fastcore.test import *
test_eq([1,2],(1,2))
For floating point numbers, which are everywhere in data science, it also has handy functionality. For example, we may want .1 + .1 + .1 == .3 to count as true, because the two sides are close enough given floating point precision.
.1 + .1 + .1 == .3
test_close(.1 + .1 + .1, .3)
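If I recall the fastcore API correctly, test_close also takes an eps tolerance argument, so you can control how close is close enough (double-check the default in the docs).
test_close(.1 + .1 + .1, .3, eps=1e-9)           # passes, the difference is tiny
test_fail(lambda: test_close(.5, .3, eps=0.1))   # .2 apart is outside eps, so this raises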
We can also test that something fails, which is useful when there are particular situations we want to ensure raise errors.
def _fail(): raise Exception("foobar")
test_fail(_fail)
We can test if 2 lists have the same values, just in different orders (convenient for testing some situations with random mini-batches).
a = list(range(5))
b = a.copy()
b.reverse()
test_shuffled(a,b)
There's more of course; check out the docs.
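For example, if I remember right, fastcore.test also includes test_ne for asserting that two things are not equal (treat the exact name as something to confirm in the docs).
test_ne([1,2],[1,3])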
L is a replacement for a list, but with lots of added functionality. Some of it comes from functional programming concepts, some is numpy-like behavior, and some is just niceties (like cleaner printing).
alist = L(1,2,3,4,3)
alist.sort()
alist.sorted()
alist.unique()
alist.filter(lambda x: x < 3)
alist.map(lambda x: x * 2)
AttrDict is another nice thing from fastcore that makes dictionaries a bit nicer to use.
regdict = {'a':2,'b':3}
adict = AttrDict({'a':2,'b':3})
adict
adict.a
def _fail(): return regdict.a
test_fail(_fail)
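As far as I know, AttrDict also lets you set keys through attribute access, which keeps reads and writes symmetric (worth confirming in the fastcore docs).
adict.c = 99
adict['c'], adict.c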
Logging is super important. If you log things properly as you work, you can always look back at what was done previously; it can be hard to tell what's going on as you run and re-run different things. Logging is handy not just for debugging in production, but also as a tool while you are developing. There are many tools to help with logging and visualizing results (for example W&B or TensorBoard for deep learning), but the foundations are good to understand and use too!
logging.basicConfig(filename="./mylog.log")
logger = logging.getLogger()
logger.setLevel(logging.INFO)
def get_current_time(): return datetime.now().strftime('%Y%m%d_%H%M%S')
logger.info(f'{get_current_time()}|This is an info message')
!head -4 mylog.log
def log_stuff(msg,**kwargs):
    dt = get_current_time()
    logger.info(f"{dt}|{msg}")
    for k,v in kwargs.items(): logger.info(f"{dt}|{k}={v}")
log_stuff('this is what I want to log',
          training_set='50 records',
          validation_set='70_records')
Higher order functions take other functions as arguments (or return them), and callbacks are the functions passed in. This is a simple example of what these terms mean:
def callbackFunc1(s): print('Callback Function 1: Length of the text file is : ', s)
def callbackFunc2(s): print('Callback Function 2: Length of the text file is : ', s)
def HigherOrderFunction(path, callback):
    with open(path, "r") as f: callback(len(f.read()))
HigherOrderFunction("mylog.log", callbackFunc1)
HigherOrderFunction("mylog.log", callbackFunc2)
This is handy in a lot of situations.
Filter is a common higher order function.
L(1,2,3,4,5).filter(lambda x: x>3)
This is very flexible because we can put filtering logic of any complexity in a function and use that to filter a list of any type.
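For example, a predicate function can encode whatever rules you need; here is a small sketch using made-up record data.
records = L({'name':'a','n':10}, {'name':'b','n':0}, {'name':'c','n':7})
def keep(r): return r['n'] > 0 and r['name'] != 'c'   # arbitrary business rules
records.filter(keep)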
Map is another very common higher order function.
L(1,2,3,4,5).map(lambda x: x**2)
It is again super flexible because we can apply a function of any complexity to each element of the list.
L(1,2,3,4,5).map(lambda x: string.ascii_lowercase[x])
We could make a function for logging where we pass in the function we want to use for logging (i.e. info vs. warning).
def log_stuff(msg,fn=logger.info,**kwargs):
    dt = get_current_time()
    fn(f"{dt}|{msg}")
    for k,v in kwargs.items(): fn(f"{dt}|{k}={v}")
log_stuff('abcd',a=1,b=55)
!tail -3 mylog.log
log_stuff('something might be awry',fn=logger.critical,a=1,b=55)
!tail -3 mylog.log
You can also make a generic file processor that you can pass callbacks to. This file processor can include log statements to log what you're doing, so you can minimize repeating lots of code. For now, we'll do a simple processor, and callbacks to clean and format a messy sql file.
def process_file(fpath,callbacks):
    with open(fpath, "r") as f: contents = f.read()
    for callback in callbacks: contents = callback(contents)
    return contents

sql_formatter_cb = partial(sqlparse.format,
                           strip_comments=True, comma_first=True,
                           keyword_case='upper', identifier_case='lower',
                           reindent=True, indent_width=4,)
qrys = process_file('test.sql',[sql_formatter_cb,sqlparse.split])
def sql_pprint(sql): display(Markdown(f"```sql\n\n{sql}\n\n```"))
for qry in qrys: sql_pprint(qry)
Decorators give you a way to add the same functionality to many functions (a bit like inheritance does for classes). You typically apply a decorator with the @ syntax, which modifies the function it sits on.
def add_another(func):
    def wrapper(number):
        print("The decorator took over!")
        print(f"I could log the original number ({number}) here!")
        print(f"Or I could log the original answer ({func(number)}) here!")
        return func(number) + 1
    return wrapper
@add_another
def add_one(number): return number + 1
So when we use a decorator, the code in the wrapper function is called instead of the original function. Typically the wrapper function calls the original function (otherwise there would be no point in decorating it, as you'd just have a new unrelated function).
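One caveat worth knowing (my addition, not something the decorator above handles): because wrapper replaces the original function, metadata like its name and docstring are lost unless you use functools.wraps. Here is a minimal sketch, with a hypothetical decorator name.
from functools import wraps

def add_another_preserving(func):
    @wraps(func)                        # copy func's __name__ and __doc__ onto wrapper
    def wrapper(number): return func(number) + 1
    return wrapper

@add_another_preserving
def plus_one(number):
    "Add one to `number`"
    return number + 1

plus_one.__name__, plus_one.__doc__     # metadata survives the decoration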
For example, maybe you want to print (or log) particular function call times and the args. See this decorator that does just that (and can be used on methods too)
from datetime import datetime
def print_decorator(func):
    def wrapper(*args, **kwargs):
        print(f"{datetime.now()}:{func}:args={args}:kwargs={kwargs}")
        return func(*args, **kwargs)
    return wrapper
@print_decorator
def simple_add(a,b): return a + b
simple_add(2,4)
@print_decorator
def complex_add(a,b,*args,**kwargs):
    out = a + b
    for arg in args: out = out + arg
    for kwarg in kwargs.values(): out = out + kwarg
    return out
complex_add(5,2,3,foo=6,bar=10)
What we have seen is applying a decorator to functions we fully define, but we can also apply them to previously existing functions, like ones we import from a library. This is helpful not just for understanding one way you can extend an existing library's functionality, but also for understanding what decorators are. They aren't magical.
Let's add logging to pd.DataFrame using our existing decorator so we can see when a dataframe is constructed.
LoggingDataFrame = print_decorator(pd.DataFrame)
df = LoggingDataFrame([1,2,3])
df.head()
The key thing to notice here is that the @ syntax really isn't doing anything magical. It just passes the function into the decorator and uses the result as the function definition. It's syntactic sugar for a higher order function that takes a function and returns a function.
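Concretely, the two forms below are equivalent; the manual version is what the @ syntax expands to.
@print_decorator
def simple_add(a,b): return a + b

# ...is the same as decorating by hand:
def simple_add(a,b): return a + b
simple_add = print_decorator(simple_add)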
To understand why this works, think through what our decorator is doing: it returns wrapper. This wrapper function calls the argument passed into it, but also has other code.
print_function_source(print_decorator)
Inheritance is the idea that a class can "inherit" attributes and methods from other classes.
For example, a class could have an attribute a, and it can be used to create a new class that gets that attribute without having to specify it.
class aClass: a = 2
class bClass(aClass): pass
aClass.a == bClass.a
In many cases there are common things we want to inherit in lots of classes. One example is having access to the date. Often you want this for logging, or printing, or any number of things. By subclassing you don't have to reformat the date each time in your classes.
class DateMinuteMixin:
    date_format = '%Y%m%d_%H%M%S'
    dte = datetime.now()

    @property
    def date_str(self): return self.dte.strftime(self.date_format)
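Here's a sketch of how a class might use the mixin; the ModelRun class and its attributes are made up for illustration.
class ModelRun(DateMinuteMixin):
    def __init__(self, name): self.name = name
    @property
    def run_id(self): return f"{self.name}_{self.date_str}"   # handy for log or file names

ModelRun('baseline').run_id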
Another handy use is to have generic behavior for handling different file types. In this case, we have a mixin that opens and reads a SQL file. Rather than rewriting this code for every class that needs to read a SQL file, you can inherit from the mixin when you need that functionality.
💡 Tip
You can define an abstract property like below to let users know that after inheriting this class, they need to define that property. In this case, they define sql_filepath and get the contents of the file for free via the other methods.
import abc

class SqlFileMixin:
    @abc.abstractproperty
    def sql_filepath(self):
        pass

    @property
    def sql_file(self):
        return open(self.sql_filepath)

    @property
    def query(self):
        return self.sql_file.read()
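To make that concrete, here is a sketch of a class that inherits the mixin; the class name is made up, and it reuses the test.sql file from the callback example above.
class MonthlySalesQuery(SqlFileMixin):
    sql_filepath = 'test.sql'            # concrete classes only need to supply this

print(MonthlySalesQuery().query[:100])   # first characters of the file's contents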
Dunder (double underscore) methods let you define how your objects work with built-in Python syntax, such as printing, indexing, and the + operator.
class someClass:
    def __init__(self,a): self.a = a
    def __str__(self): return f"This object's a is : {self.a}"
    def __getitem__(self,idx): return self.a[idx-1]
    def __add__(self,some_class): return list(map(lambda x,y: x + y, self.a, some_class.a))
a = someClass(x)
a.a
a + a
a[1]
a
print(a)
Iterators are useful when you don't want to load all your data into memory at once. They are often defined with yield (which technically gives you a generator), but there are other ways.
def square(x): return x**2   # example transform to map over the items

def mapper(items,fn):
    for item in items: yield fn(item)   # lazily apply fn to each item

it = mapper([2,4,6,8],square)
it
next(it), next(it), next(it)
You can also process it sequentially in a loop.
for item in mapper([2,4,6,8],square):
    print(item)
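Generator expressions are another way to get an iterator without writing a function; it's the comprehension syntax from earlier, but with parentheses and lazy evaluation.
squares = (o**2 for o in range(1_000_000))   # nothing is computed yet
next(squares), next(squares), next(squares)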
The built-in iter can also take a callable and a sentinel value: it keeps calling the callable until it returns the sentinel. Here it reads a file 64 bytes at a time until read returns empty bytes (this assumes a test.txt file is present).
print_plus = partial(print, end='\n++++++\n')

with open('test.txt', 'rb') as f:
    iterator = iter(partial(f.read, 64), b'')   # call f.read(64) until it returns b''
    print_plus(type(iterator))
    for block in iterator: print_plus(block)