Basics and Foundation of Python

With notes from Think Python (2nd Edition)

Sam Rosen

Outline

Python Intro
Python Overview
- Basic Types (int, float, string, bool)
- Data Structures (list, tuple, set, dict)
- Other useful things (file, Exception)
Example Script from the Survey
More Python
Exercises

Python vs. R

You will use R in your early courses more
Python is generally better for professional purposes
Both have strong tools for statistical analysis
- Some areas are better done in R, others in Python
- You will be good at both by the time you graduate
Although they are different languages, getting comfortable with Python can help with getting comfortable with R

Integers

Integers are numbers with no decimal. In Python 3 they are unbounded.
Numbers are useful.

4 + 2
5 / 2
2 ** 3
2 ^ 3  # This is bitwise XOR (result is 1), not an exponent
8 // 3  # Floor division can be very useful

int("1002")  # Convert string to int
int(10.9) == 10  # Convert `float` to `int`
divmod(10, 3) == (3, 1)  # Get floor division and remainder

Floats

Floats are numbers with a decimal. In Python 3 they are bounded above and below by \(\approx \pm10^{308}\).
Numbers remain useful with a decimal component.

4.5 + 2.5  # 7.0
5.2 / 2.3  
2.2 ** 3.1
#  2.1 ^ 3.1  # This is still not an exponent and causes an error
8.4 // 3.1  # 2.0

float("-1002.101")  # Convert string to float
float(10) == 10.0  # Convert `int` to `float` (almost always unnecessary)
_, remainder = divmod(8.4, 2.05)  # (4.0, 0.20000000000000107)

print(0.2 == remainder)  # Floating point arithmetic can be confusing
print(remainder - 0.2)
print(abs(remainder - 0.2) < 1e-10)

False
1.0547118733938987e-15
True
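Instead of hand-picking a tolerance, the comparison can be delegated to math.isclose from the standard library. A minimal sketch that reuses the divmod example above and assumes the default relative tolerance (1e-09) is acceptable for the problem at hand:

```python
import math

_, remainder = divmod(8.4, 2.05)

# Exact equality fails because of rounding error in binary floating point
exactly_equal = remainder == 0.2

# math.isclose compares within a relative tolerance (default rel_tol=1e-09)
close_enough = math.isclose(remainder, 0.2)

print(exactly_equal, close_enough)
```

When comparing against 0.0, pass abs_tol explicitly, because a purely relative tolerance never accepts a non-zero value as close to zero.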

Boolean

bools are used for conditional execution

def check_num(x):
  x = float(x)
  if not x.is_integer():
    print(x, "is a float")
  elif x % 2:
    print(x, "is odd")
  else:
    print(x, "is even")

check_num(4.0)
check_num(5)
check_num(0.1)

4.0 is even
5.0 is odd
0.1 is a float

In Python, most objects can be converted to a bool. Objects that convert to True are “truthy”; objects that convert to False are “falsy”.

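bool(0)  # False
bool(1.1)  # True
bool("")  # False: empty strings are falsy
bool("Hi")  # True
bool([1,2,3])  # True
bool([]) or bool([5])  # True: empty containers are falsy, non-empty ones are truthy
bool(1) and bool(10)  # True
not bool({"hi": "there"})  # False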

Useful Functions

any: returns True if any element is “truthy”

any([False, 1, "", {}])  # True
any([False, 0, "", [], 0])  # False

all: returns True if all elements are “truthy”

all([1, "hello", "False", [1]])  # True
all([1, "hello", "False", [1], ""])  # False

filter: easy way to filter a sequence of elements

list(filter(lambda num: num > 5, [2, 3, 4, 10, 20]))

[10, 20]

int: casting bools to ints can be very useful: int(True) or int(False)
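One place where the bool-to-int conversion helps is counting how many elements satisfy a condition. A small sketch, with a made-up list of scores and a made-up passing threshold of 60:

```python
scores = [55, 72, 90, 48, 81]  # Hypothetical data

# Each comparison yields a bool; sum() treats True as 1 and False as 0
num_passing = sum(score >= 60 for score in scores)

print(num_passing)  # 3
```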

Good Practice

Do not do if x == True:, just if x:

None

Many programming languages have a concept of a “null pointer”, i.e. a value to signify no value. In Python, this is called None.
To check if something is None simply do x is None or x is not None.

Good Practice

x == None and x != None also work, but using is is considered to be correct, because there is only one None object.
Functions without an explicit return statement return None, so you will run into bugs if you do

def my_func(x):
  print(x ** 2)
  
y = my_func(4)
print(y)  # None
y * 2  # TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'

Functions whose goal is to “find” something may return None, return a sentinel such as -1, or raise an exception if it cannot be found. Check the documentation to determine which.

s = "abcdef"

try:
  print("z index:", s.index("z"))
except ValueError as e:
  print("Exception:", e)

print("z find:", s.find("z"))
#  This is a dictionary comprehension
letter_lookup = {letter: index for index, letter in enumerate(s)}
print("Index of b:", letter_lookup.get("b"))
print("Index of z:", letter_lookup.get("z"))

Exception: substring not found
z find: -1
Index of b: 1
Index of z: None
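When a lookup can return None, it is safer to test for it with is None than to rely on truthiness, because a valid result such as index 0 is falsy. A short sketch built on the same lookup dictionary; describe_position is a hypothetical helper name:

```python
letter_lookup = {letter: index for index, letter in enumerate("abcdef")}

def describe_position(letter):
  # dict.get returns None when the key is missing
  index = letter_lookup.get(letter)
  if index is None:  # "if not index" would wrongly reject index 0
    return f"{letter} not found"
  return f"{letter} is at index {index}"

print(describe_position("a"))
print(describe_position("z"))
```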

Strings

Strings are sequences of characters that are useful for many things.
Everything in Python has indexing start at 0, including strings.

my_str = "hello there"
print(my_str[1])

num = 4
print(str(num) + " is my favorite number")

e
4 is my favorite number

Useful Functions/Methods

len("abcdef")  # 6

sorted("defabc")  # ['a', 'b', 'c', 'd', 'e', 'f'] (a list, not a string)

"".join(reversed("hello world"))  # "dlrow olleh" (reversed alone returns an iterator)

"hi friend".replace("friend", "john")  # "hi john"

"HELLO FRIEND".lower()  # "hello friend"

"hello, how are you?".split()  # ["hello,", "how", "are", "you?"]

"     hello there!   ".strip()  # "hello there!"

"+".join(["good", "evening", "friend"])  # "good+evening+friend"

"hello" in "hello friend"  # True

x = input("How are you?")  # User input

Good Practice

Validate user input:

x = input("How are you?")
if x.lower() == "good":
  print("Awesome!")

Immutable:

x = "hello"
x[1] = "g"  # TypeError: 'str' object does not support item assignment

Single, Double, Triple Quotes, and f-strings

x = 'He said to me, "Hi!"'
y = "He said to me, \"Hi!\""
long_str = """Wow! that is crazy!
I can't believe it!
What on Earth?"""

color = "red"
num = 20
print(f"My favorite color is {color}. I have {num} shirts with it.")

My favorite color is red. I have 20 shirts with it.

Lists

Lists are mutable sequences of any type of element. They are good for indexing and maintaining some level of order.

my_list = [1, 2, "c", "d"]

my_list.append(5)  # Add to end of list
my_list[1] = "b"
print(my_list)

[1, 'b', 'c', 'd', 5]

Useful Methods

.extend: add another sequence to a list
.count: count the number of an element in a list
.pop: remove the last element from a list
.index: find the index of an element in a list

my_list = ["a", 2, "c"]
my_list.extend([2, "c"])
print(my_list)

print(my_list.count("c"))
print(my_list.index("c"))
print(my_list.pop())
print(my_list)

['a', 2, 'c', 2, 'c']
2
2
c
['a', 2, 'c', 2]

Good Practice

Lists need to be copied with .copy(); otherwise they are aliased

my_list = [1,2,3]
my_list2 = my_list

my_list2.append(4)
print(my_list)
print(my_list is my_list2)

[1, 2, 3, 4]
True
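For contrast with the aliasing shown above, here is a sketch of what .copy() does and does not protect against. The nested list is made up for illustration, and copy.deepcopy is the standard-library function for fully independent copies:

```python
import copy

original = [1, 2, [3, 4]]
alias = original                # Same object, not a copy
shallow = original.copy()       # New outer list, but the inner list is shared
deep = copy.deepcopy(original)  # Fully independent copy

original.append(5)
original[2].append(99)

print(alias)    # [1, 2, [3, 4, 99], 5]
print(shallow)  # [1, 2, [3, 4, 99]]
print(deep)     # [1, 2, [3, 4]]
```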

Lists, like strings, can be sliced to get various parts of them

my_list = list("abcdefgh")
print(my_list[:3], my_list[4:], my_list[2:5], my_list[:-1], my_list[::2])

['a', 'b', 'c'] ['e', 'f', 'g', 'h'] ['c', 'd', 'e'] ['a', 'b', 'c', 'd', 'e', 'f', 'g'] ['a', 'c', 'e', 'g']

Lists can go inside lists! These may be referred to as multi-dimensional arrays

[[1,2,3],
 [4,5,6]]
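Nested lists are indexed one level at a time. The sketch below uses the same 2x3 example; the variable names are only illustrative:

```python
matrix = [[1, 2, 3],
          [4, 5, 6]]

# The first index selects the row, the second selects the column
bottom_right = matrix[1][2]

# A comprehension can pull out a column
first_column = [row[0] for row in matrix]

# zip(*matrix) groups the elements column by column, which transposes it
transposed = [list(column) for column in zip(*matrix)]

print(bottom_right)  # 6
print(first_column)  # [1, 4]
print(transposed)    # [[1, 4], [2, 5], [3, 6]]
```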

When building lists, sometimes a list comprehension is easier to read and write

my_list = []
for num in range(5):
  my_list.append(num ** 2)
  
my_list2 = [num ** 2 for num in range(5)]

print([num1 - num2 for num1, num2 in zip(my_list, my_list2)])

[0, 0, 0, 0, 0]

Sets

Sets in Python operate very similarly to sets in standard mathematics. Elements must be unique, they are not in any order, and determining membership is a priority.

my_set = {"a", "b", 1, 2}
print("a" in my_set)

True

Useful Methods

.pop: Get and remove an arbitrary element from a set
.add: Add element to a set
.remove: Remove an element from a set
.union: Combine sets
and many other set operations
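A few of the other set operations, sketched with two small made-up sets. The operators shown are equivalent to the .union, .intersection, .difference, .symmetric_difference, and .issubset methods:

```python
a = {1, 2, 3, 4}
b = {3, 4, 5}

union = a | b                 # Elements in either set
intersection = a & b          # Elements in both sets
difference = a - b            # Elements in a but not in b
symmetric_difference = a ^ b  # Elements in exactly one of the sets
is_subset = {3, 4} <= a       # True if every element of {3, 4} is in a

print(union, intersection, difference, symmetric_difference, is_subset)
```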

Good Practice

Use a frozenset if you do not plan on changing it

my_set = frozenset(["a", "b", "c"])

Recognize the many ways to make a set

my_set1 = {1, 2, 3}
my_set2 = set(range(1, 4))
my_set3 = {x for x in range(1, 4)}

Sets are also aliased

my_set4 = my_set1
my_set4.add(5)
print(my_set1)
print(my_set1 is my_set4)

{1, 2, 3, 5}
True

Dictionaries

dicts are great ways to map keys to values. A simple example is a histogram:

my_str = "Welcome to Duke!"

my_dict = {}
for character in my_str:
  if character in my_dict:
    my_dict[character] = my_dict[character] + 1
  else:
    my_dict[character] = 1

print(my_dict)

{'W': 1, 'e': 3, 'l': 1, 'c': 1, 'o': 2, 'm': 1, ' ': 2, 't': 1, 'D': 1, 'u': 1, 'k': 1, '!': 1}

Useful Methods

.keys: Get an iterable of all the keys in a dictionary
.values: Get an iterable of all the values in a dictionary
.items: Get an iterable of all the key-value pairs in a dictionary
.update: Combine two dictionaries

my_dict = {"a": 1, "b": 2}
my_dict2 = {"b": 3, "c": 4}
my_dict.update(my_dict2)
print(my_dict)

{'a': 1, 'b': 3, 'c': 4}

.get: Get a value from a dictionary but specify a default

my_str = "Welcome to Duke!"

my_dict = {}
for character in my_str:
  my_dict[character] = my_dict.get(character, 0) + 1
print(my_dict)

{'W': 1, 'e': 3, 'l': 1, 'c': 1, 'o': 2, 'm': 1, ' ': 2, 't': 1, 'D': 1, 'u': 1, 'k': 1, '!': 1}

Good Practice

Use key in my_dict, not key in my_dict.keys()
collections.defaultdict is another way to handle the need for default values in a dictionary
Nest dictionaries if necessary

my_data = dict(
  item1={
    "status": "open",
    "description": "..."
  },
  item2={
    "status": "closed",
    "issue": "..."
  }
)

my_data["item1"]["status"]

'open'
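As a sketch of the collections.defaultdict suggestion above, here is the earlier histogram rebuilt without any explicit default handling. collections.Counter is included only to check that the result matches:

```python
from collections import Counter, defaultdict

my_str = "Welcome to Duke!"

# defaultdict(int) supplies 0 the first time a missing key is accessed
histogram = defaultdict(int)
for character in my_str:
  histogram[character] += 1

# Counter builds the same histogram in a single call
same_as_counter = dict(histogram) == dict(Counter(my_str))

print(histogram["e"], same_as_counter)  # 3 True
```

Note that merely reading a missing key from a defaultdict inserts it, so use key in histogram when you only want to test membership.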

Tuples

Very similar to lists, but they cannot be changed! They are immutable.

my_list = [1, 2, 3]
my_tuple = tuple(my_list)
my_list[1] = 100
print(my_tuple)

try:
  my_tuple[1] = 100
except Exception as e:
  print(e)

(1, 2, 3)
'tuple' object does not support item assignment

Good Practice

To make a tuple with one element: x = (1,)
Immutability allows tuples to be used as dict keys and be stored in sets:

my_key = (1, 2, 3, "four")
my_dict = {my_key: "my favorite numbers"}
my_set = set()
my_set.add(my_key)

print(my_dict)
print(my_set)

my_other_key = (1, ["a", "list"], 3)
try:  # All elements of the tuple need to be hashable too
  my_set.add(my_other_key)
except Exception as e:
  print(e)

{(1, 2, 3, 'four'): 'my favorite numbers'}
{(1, 2, 3, 'four')}
unhashable type: 'list'

Iteration

Programming is all about doing tasks repeatedly because they would be too annoying to do by hand. Many objects have a natural way to iterate over them.

from collections import Counter  # Makes a dictionary that counts elements
my_str = "I enjoy eating almonds"

def print_data(data):
  for item in data:
    print(item, end=" ")
  print()

print_data(my_str)
print_data(set(my_str))  # No order or repeats
print_data(list(my_str))
print_data(tuple(my_str))
print_data(Counter(my_str))  # Dictionaries are ordered by insertion, iterates over keys

I   e n j o y   e a t i n g   a l m o n d s 
o g d e l s t y j   n m i a I 
I   e n j o y   e a t i n g   a l m o n d s 
I   e n j o y   e a t i n g   a l m o n d s 
I   e n j o y a t i g l m d s

Useful Functions

len: Get the natural size of an object
enumerate: Iterate over an object, but with index, value pairs.
zip: Iterate over two or more objects at the same time
map: Map the values of an iterable using a function
itertools module: Contains many useful functions for specific kinds of iteration
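The functions listed above can be sketched together on two small made-up lists. itertools.combinations is just one example from the itertools module:

```python
from itertools import combinations

names = ["amy", "barry", "carl"]  # Hypothetical data
ages = [20, 30, 25]

# enumerate yields (index, value) pairs
numbered = [f"{index}: {name}" for index, name in enumerate(names)]

# zip pairs up elements from several iterables, stopping at the shortest
name_to_age = {name: age for name, age in zip(names, ages)}

# map applies a function to every element; wrap it in list() to see the results
capitalized = list(map(str.capitalize, names))

# combinations yields every unordered pair exactly once
pairs = list(combinations(names, 2))

print(numbered)
print(name_to_age)
print(capitalized)
print(pairs)
```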

Good Practice

Consider how you might need to nest iteration

values = [1,2,3,4,5]
pairwise_distance = []
for value1 in values:
  for value2 in values:
    pairwise_distance.append(abs(value1 - value2))

while loops are useful if you are unsure how many iterations are needed

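def is_prime(n):
  # Illustrative helper (trial division): n is prime if nothing in [2, sqrt(n)] divides it
  if n < 2:
    return False
  return all(n % d for d in range(2, int(n ** 0.5) + 1))

# Find the first 100 primes
primes_found = []
current_num = 2
while len(primes_found) < 100:
  if is_prime(current_num):
    primes_found.append(current_num)
  current_num += 1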

Use continue and break

my_str = "aabbccaa"
for letter in my_str:
  if letter == "b":
    continue  # Skip an iteration
  if letter == "c":
    break  # Exit loop, also used with while loops
  print(letter, end = "")

aa

Functions

Functions are the main way to package functionality, whether it comes from a module or from your own code. It is a good idea to split your program into reusable parts to help with debugging.

def my_func(a, b, c=5):
  return f"({a} + {b}) * {c} = {(a+b) * c}"

print(my_func(1, 1, 0))
print(my_func(10, 4))

(1 + 1) * 0 = 0
(10 + 4) * 5 = 70

Recursion is an essential concept in Computer Science and parts of Data Science:

def sum_of_first_k_nums(k):
  if k == 0:  # Base case
    return 0
  return k + sum_of_first_k_nums(k - 1)

sum_of_first_k_nums(5)

Good Practice

Functions can have default arguments and return values, as in the previous example
Functions can also take an arbitrary number of arguments

def my_func_with_inf_args(*args, **kwargs):
  print(args)
  print(kwargs)
  
my_func_with_inf_args(1, 2, 3, a=4, b=5)

(1, 2, 3)
{'a': 4, 'b': 5}

Functions can be passed around in Python

def call_twice(some_func, *args, **kwargs):
  some_func(*args, **kwargs)
  some_func(*args, **kwargs)
  
call_twice(my_func_with_inf_args, 1, 2, f=9)

(1, 2)
{'f': 9}
(1, 2)
{'f': 9}

Although it’s more typical to see this as a lambda function

call_twice(lambda: my_func_with_inf_args(1, 2, f=9))

(1, 2)
{'f': 9}
(1, 2)
{'f': 9}

Objects

Objects are everywhere and naturally you can make your own.

from random import shuffle

SUITS = ("Hearts", "Spades", "Clubs", "Diamonds")
RANKS = (2, 3, 4, 5, 6, 7, 8, 9, 10, "Jack", "Queen", "King", "Ace")

class CardDeck:
  def __init__(self, empty=False):  # Constructor
    self.cards = []
    if not empty:
      for suit in SUITS:
        for rank in RANKS:
          self.cards.append((suit, rank))
  
  def add_card(self, suit, rank):
    self.cards.append((suit, rank))
  
  def shuffle(self):
    shuffle(self.cards)
    
  def draw_card(self):
    return self.cards.pop()

my_deck = CardDeck()
my_deck.shuffle()
print(my_deck.draw_card())
print(my_deck.draw_card())

('Hearts', 9)
('Spades', 7)

Useful Methods

Your own!
Magic Methods

class CardDeck:
  # ...
  def __len__(self):
    return len(self.cards)
  
  def __contains__(self, card):
    return card in self.cards

my_deck = CardDeck()
len(my_deck)
("Hearts", "Queen") in my_deck

Good Practice

class DiscardPile(CardDeck):  # Inheritance lets you reuse code
  def __init__(self):
    super().__init__(empty=True)  # Call the parent constructor

  def add_card(self, deck):
    drawn_card = deck.draw_card()
    self.cards.append(drawn_card)
    return drawn_card

my_deck = CardDeck()
discard = DiscardPile()
discarded_card = discard.add_card(my_deck)

Terminology matters
- Function: stand-alone function (print)
- Method: function that is attached to a class (draw_card)
- Attribute: variable attached to an object (my_deck.cards)
- Constructor: the __init__ method
- Instance: a constructed object (my_deck)

Exceptions and Error Handling

If your program runs into an error, it will terminate if the resulting Exception is not caught.

my_list = [2, "hi"]
try:
  my_str = ",".join(my_list)  # TypeError: sequence item 0: expected str instance, int found
except TypeError:
  my_list = [str(element) for element in my_list]
  my_str = ",".join(my_list)

# Program continues...

Common Exceptions

Exception: Base class for almost all built-in exceptions and for user-defined ones
ArithmeticError: Base exception for OverflowError, ZeroDivisionError, FloatingPointError
AttributeError: Attempting to access an attribute that does not exist on an object
IndexError: Attempting to use an invalid index on a sequence
KeyboardInterrupt: Raised when ctrl-c is pressed during execution (inherits from BaseException, not Exception)
NameError: Using a variable that does not exist
TypeError: Applying an operation or function to an object of an inappropriate type
ValueError: Input to a function is invalid

Good Practice

Avoid catch-all exception handlers. If you use a try-except, you usually have a specific exception in mind to handle. Other exceptions should be allowed to propagate so that you see bugs when they happen.

try:
  some_complicated_function()
except Exception:  # Bad
  print("SOMETHING went wrong!")

Print them!

try:
  some_complicated_function()
except MemoryError as e:
  print("Out of memory!", e)

raise your own!

def my_complicated_function(some_matrix):
  if not some_matrix.is_square():
    raise ValueError("some_matrix is not square!")
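A runnable variation on the idea above, assuming the matrix is stored as a list of rows rather than an object with an is_square method. NotSquareError and check_square are hypothetical names; subclassing ValueError lets callers catch either the specific or the general exception:

```python
class NotSquareError(ValueError):
  """Hypothetical custom exception for non-square matrices."""

def check_square(matrix):
  # matrix is assumed to be a list of rows (a list of lists)
  num_rows = len(matrix)
  if any(len(row) != num_rows for row in matrix):
    raise NotSquareError(f"expected {num_rows} columns in every row")
  return True

try:
  check_square([[1, 2, 3], [4, 5, 6]])
except ValueError as e:  # Also catches the NotSquareError subclass
  message = str(e)
  print("Invalid input:", e)
```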

Files

One way to read input and write output for a program is by using files. Writing to a file is generally a good idea if you might want to save the results for later.

# Writing files, use the mode argument
# Careful! This will erase the file's existing contents if it is present
my_file = open("path_to_file.txt", mode="w")  
my_file.write("output I want to keep")
my_file.close()  # You must close the file, or your results may be lost

# Reading files
my_file = open("path_to_file.txt", mode="r")  # The mode is r by default
print(my_file.readline())
my_file.close()

Useful Functions/Methods

The os, os.path, and pathlib modules contain many functions for operating on the file system:
- os.path.exists: Determine if a file exists at a given path
- os.path.join: Join path components together
- os.listdir: List the entries in a directory
json.load in the json module is vital for reading .json files, and json.dump is useful for writing dictionaries to a file.
.readlines(): Read all the lines of a file into a list at once

Good Practice

You can use the with statement to automatically close files when you are done using them, even if an exception is raised inside the block.

with open("my_file.txt") as f:
  f.readline()
  # ...
# Outside the above indentation f cannot be used as it is closed

If your files contain text that is not in the ASCII codec, specify the encoding:

with open("my_file.txt", encoding="utf-8") as f:
  print(f.readline())  # Supports many unicode characters
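The file-system and json helpers mentioned earlier can be combined with the with statement. A self-contained sketch that works inside a temporary directory so nothing is left behind; the file name and the record are made up:

```python
import json
import os
import tempfile

record = {"name": "amy", "age": 20}  # Hypothetical data

with tempfile.TemporaryDirectory() as folder:
  path = os.path.join(folder, "record.json")  # Hypothetical file name

  with open(path, "w", encoding="utf-8") as f:
    json.dump(record, f)  # Write the dictionary as JSON

  file_exists = os.path.exists(path)
  files_in_folder = os.listdir(folder)

  with open(path, encoding="utf-8") as f:
    loaded = json.load(f)  # Read it back into a dictionary

print(file_exists, files_in_folder, loaded == record)
```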

Example Script

Problem Statement

A friend has a directory of 1000 files where each file has one of the following extensions: .csv, .tsv, .json. However, each file has comments throughout it delimited by ##, so they do not follow the proper format. They ask you to write a Python script which will combine all the files into 1 while removing the comments and ensuring the data is in a proper .csv format.

What is the input?

Character-delimited files and JSON

    person,age,job,favorite_color
    amy,20,waiter,blue
    barry,30,engineer,grey
    ## we have only adults in the dataset
    carl,25,None,purple ## None means unemployed
    ## this person did not understand the survey dan,29,superhero,pineapple

    person    age   job   favorite_color
    amy   20    waiter    blue
    barry   30    engineer    grey
    ## we have only adults in the dataset
    carl    25    None    purple ## None means unemployed
    ## this person did not understand the survey dan    29    superhero   pineapple

    {
      "data": [ ## List of people
        {
          "name": "amy",
          "age": 20,
          "job": "waiter",
          "favorite_color": "blue"
        },
        {
          ## Barry is friends with my Dad, Jerry
          "name": "barry", 
          "age": 30,
          "job": "engineer",
          "favorite_color": "grey"
        },
        ... omitted for brevity
      ]
    }

Pseudocode

Create a variable to store data from all files

For every file my friend has
  Read it in line-by-line
  Remove all comments in each line
  Remove all empty lines
  If it is a .csv
    Separate the commas and store it
  If it is a .tsv
    Separate the tabs and store it
  If it is a .json
    Read the .json and store it
    
Turn the variable that is storing all the data into a .csv

Step 1

Create a variable to store data from all files

HEADERS = ["name", "age", "job", "favorite_color"]

my_data = [
  # ("amy", 20, "waiter", "blue") example entry
]

For every file my friend has, read it in as a list of lines

import os

path_to_folder = "./my_friend/stored/the/files/here"

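for file_path in os.listdir(path_to_folder):
  # os.listdir returns bare file names, so join each with the folder path
  file = open(os.path.join(path_to_folder, file_path))
  file_as_lines = file.readlines()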
  # [
  #    "person,age,job,favorite_color\n",
  #    "amy,20,waiter,blue\n",
  #    "barry,30,engineer,grey\n",
  #    "## we have only adults in the dataset\n",
  #    "carl,25,None,purple ## None means unemployed\n",
  #    "## this person did not understand the survey dan,29,superhero,pineapple"
  # ]

Step 2

Remove all comments in each line then remove all empty lines

def remove_comments(line):
  no_whitespace = line.strip() #  => "## this is a comment"
  if "##" in no_whitespace:
    comment_starts_at_index = no_whitespace.index("##")
    filtered = no_whitespace[:comment_starts_at_index]
    return filtered.strip() # In case there are spaces around the comment
  return no_whitespace

print("test 1", remove_comments("  ## this is a comment  "))
print("test 2", remove_comments("person,age,job,favorite_color"))
print("test 3", remove_comments("person,age,job,favorite_color## comments   "))

test 1 
test 2 person,age,job,favorite_color
test 3 person,age,job,favorite_color

file_with_no_comments = ""
for line in file_as_lines:
  comments_removed = remove_comments(line)
  if comments_removed:
    file_with_no_comments += comments_removed + "\n"
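As an aside, the same comment stripping can be written more compactly with str.partition. This is an alternative sketch, not the version used in the rest of the script, and like the original it assumes "##" never appears inside real data:

```python
def remove_comments(line):
  # partition splits on the first "##"; element 0 is everything before it.
  # If there is no "##", element 0 is the whole line.
  return line.partition("##")[0].strip()

print("test 1", remove_comments("  ## this is a comment  "))
print("test 2", remove_comments("person,age,job,favorite_color"))
print("test 3", remove_comments("person,age,job,favorite_color## comments   "))
```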

Step 3

If it is a .csv or .tsv, separate the commas or tabs and store it

def get_delimited_entries(file_as_str, delimiter):
  to_return = []
  lines = file_as_str.splitlines()  # Unlike split("\n"), leaves no empty entry after the final newline
  for line in lines[1:]:  # Skip header
    to_return.append(line.split(delimiter))
  
  return to_return

if file_path.endswith(".csv"):
  my_data.extend(get_delimited_entries(file_with_no_comments, ","))
elif file_path.endswith(".tsv"):
  my_data.extend(get_delimited_entries(file_with_no_comments, "\t"))

Step 4

If it is a .json read the .json and store it

import json

# ... continued
elif file_path.endswith(".json"):
  as_dict = json.loads(file_with_no_comments)
  for entry in as_dict["data"]:
    my_data.append([str(entry[header]) for header in HEADERS])
else:
  print(f"Could not read file: {file_path}")

# Still inside the for loop, after the if/elif/else chain
file.close()
# end for loop over files

Step 5

Turn the variable that is storing all the data into a .csv

as_csv = ",".join(HEADERS) + "\n"
for entry in my_data:
  as_csv += ",".join(entry) + "\n"

with open("output.csv", "w") as output_file:
  output_file.write(as_csv)
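Joining with "," works for this data, but it would produce a malformed file if any value itself contained a comma. The built-in csv module handles quoting automatically. A sketch using an in-memory buffer in place of the real output file; the second entry is made up to show the quoting:

```python
import csv
import io

HEADERS = ["name", "age", "job", "favorite_color"]
my_data = [
  ["amy", "20", "waiter", "blue"],
  ["erin", "41", "writer, editor", "green"],  # Hypothetical entry containing a comma
]

buffer = io.StringIO()  # Stands in for open("output.csv", "w", newline="")
writer = csv.writer(buffer)
writer.writerow(HEADERS)
writer.writerows(my_data)

as_csv = buffer.getvalue()
print(as_csv)
```

When writing to a real file, the csv documentation recommends opening it with newline="" so that line endings are not translated twice.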

Full script here

More Python

Libraries

numpy, pandas: Common libraries for data processing
django: Framework for building websites via full-stack development
csv: Built-in library for handling csv files
SQLAlchemy: Extensive database support
requests: Make HTTP requests for API calls and web scraping
Pillow: Image processing
six: Python 2 and 3 compatibility
Polars: Pandas alternative
BeautifulSoup: Parse HTML input
pygame: Game development
nltk: Natural Language Processing
PyTorch, TensorFlow: Neural Networks 🤮
Many, many more…

New Stuff

Python is actively updated!
Performance improvements! (3.11)
Walrus Operator (3.8)
f-strings (3.6)
Pattern Matching (3.10)
Type Annotations (typing module added in 3.5, still evolving as of 3.11)
async and await (introduced in 3.5, reserved keywords since 3.7)
If you want to get better at Python, it helps to learn the new features and why they were added. Many of them address the weak points of Python.
Furthermore, Python’s most used libraries get updated frequently!
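To give a feel for a few of these features, here is a small sketch combining type annotations, the walrus operator, and an f-string. It requires Python 3.8 or later, and first_long_word is a made-up function:

```python
from typing import List, Optional

def first_long_word(words: List[str], min_length: int = 5) -> Optional[str]:
  for word in words:
    # Walrus operator: assign len(word) to length and test it in one expression
    if (length := len(word)) >= min_length:
      return f"{word} ({length} letters)"
  return None  # No word was long enough

print(first_long_word(["hi", "there", "friend"]))
print(first_long_word(["a", "bc"]))
```

The annotations are not enforced at runtime; they document intent and can be checked by external tools such as mypy.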

Testing

In a professional environment, unit tests may be written to ensure bugs are not introduced during the development process.

import unittest
from my_module import my_func

class TestMyStuff(unittest.TestCase):
    def test_my_func(self):
        my_output = my_func(1, 2, 3)
        self.assertEqual(6, my_output)
        
    def test_my_func_with_strings(self):
        my_output = my_func(1, 2, 3, "h")
        self.assertEqual(60, my_output)
    
    def test_my_func_raises_exception(self):
        with self.assertRaises(ValueError):
            my_func(1, 2, 3, "h", "")

Lower-level Language Bindings

As a high-level language, Python is relatively slow.
Python can call optimized code written in C, C++, Fortran, etc. to achieve speeds close to those languages.
This is why numpy is so much faster than equivalent pure-Python loops.
Use the right tool for the right job.

Exercises

Colab