Basics and Foundation of Python

With notes from Think Python (2nd Edition)

Sam Rosen

Outline

  • Python Intro
  • Python Overview
    • Basic Types (int, float, string, bool)
    • Data Structures (list, tuple, set, dict)
    • Other useful things (file, Exception)
  • Example Script from the Survey
  • More Python
  • Exercises

Python vs. R

  • You will use R in your early courses more
  • Python is generally better for professional purposes
  • Both have strong tools for statistical analysis
    • Some areas are better done in R, others in Python
    • You will be good at both by the time you graduate
  • Although they are different languages, getting comfortable with Python can help with getting comfortable with R

Integers

  • Integers are numbers with no decimal. In Python 3 they are unbounded.
  • Numbers are useful.
4 + 2
5 / 2
2 ** 3
2 ^ 3  # This is not an exponent
8 // 3  # Floor division can be very useful

int("1002")  # Convert string to int
int(10.9) == 10  # Convert `float` to `int`
divmod(10, 3) == (3, 1)  # Get floor division and remainder

Floats

  • Floats are numbers with a decimal. In Python 3 they are bounded above and below by \(\approx \pm10^{308}\).
  • Numbers remain useful with a decimal component.
4.5 + 2.5  # 7.0
5.2 / 2.3  
2.2 ** 3.1
#  2.1 ^ 3.1  # This is still not an exponent and causes an error
8.4 // 3.1  # 2.0

float("-1002.101")  # Convert string to int
float(10) == 10.0  # Convert `int` to `float` (almost always unnecessary)
_, remainder = divmod(8.4, 2.05)  # (4.0, 0.20000000000000107)

print(0.2 == remainder)  # Floating point arithmetic can be confusing
print(remainder - 0.2)
print(abs(remainder - 0.2) < 1e-10)
False
1.0547118733938987e-15
True

Boolean

  • bools are used for conditional execution
def check_num(x):
  x = float(x)
  if not x.is_integer():
    print(x, "is a float")
  elif x % 2:
    print(x, "is odd")
  else:
    print(x, "is even")

check_num(4.0)
check_num(5)
check_num(0.1)
4.0 is even
5.0 is odd
0.1 is a float
  • In python, most things can be cast to a bool. If they cast to True, then they are “truthy”.
bool(0)
bool(1.1)
bool("")
bool("Hi")
bool([1,2,3])
bool([]) or bool([5])
bool(1) and bool(10)
not bool({"hi": "there"})

Useful Functions

  • any: returns True if any element is “truthy”
any([False, 1, "", {}])  # True
any([False, 0, "", [], 0])  # False
  • all: returns True is all elements are “truthy”
all([1, "hello", "False", [1]])  # True
all([1, "hello", "False", [1], ""])  # False
  • filter: easy way to filter a sequence of elements
list(filter(lambda num: num > 5, [2, 3, 4, 10, 20]))
[10, 20]
  • int: casting bools to ints can be very useful: int(True) or int(False)

Good Practice

  • Do not do if x == True:, just if x:

None

  • Many programming languages have a concept of a “null pointer”, i.e. a value to signify no value. In Python, this is called None.
  • To check if something is None simply do x is None or x is not None.

Good Practice

  • x == None and x != None also work, but using is is considered to be correct, because there is only one None object.
  • By default, all functions return None, so you will run into bugs if you do
def my_func(x):
  print(x ** 2)
  
y = my_func(4)
print(y)  # None
y * 2  # TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'
  • Functions with a goal to “find” something may also return None if it cannot be found. It’s important to determine return values with documentation.
s = "abcdef"

try:
  print("z index:", s.index("z"))
except ValueError as e:
  print("Exception:", e)

print("z find:", s.find("z"))
#  This is a dictionary comprehension
letter_lookup = {letter: index for index, letter in enumerate(s)}
print("Index of b:", letter_lookup.get("b"))
print("Index of z:", letter_lookup.get("z"))
Exception: substring not found
z find: -1
Index of b: 1
Index of z: None

Strings

  • Strings are sets of characters that are useful for many things.
  • Everything in Python has indexing start at 0, including strings.
my_str = "hello there"
print(my_str[1])

num = 4
print(str(num) + " is my favorite number")
e
4 is my favorite number

Useful Functions/Methods

len("abcdef")  # 6

sorted("defabc")  # "abcdef"

reversed("hello world")  # "dlrow olleh"

"hi friend".replace("friend", "john")  # "hi john"

"HELLO FRIEND".lower()  # "hello friend"

"hello, how are you?".split()  # ["hello,", "how", "are", "you?"]

"     hello there!   ".strip()  # "hello there!"

"+".join(["good", "evening", "friend"])  # "good+evening+friend"

"hello" in "hello friend"  # True

x = input("How are you?")  # User input

Good Practice

  • Validate user input:
x = input("How are you?")
if x.lower() == "good":
  print("Awesome!")
  • Immutable:
x = "hello"
x[1] = "g"  # TypeError: 'str' object does not support item assignment
  • Single, Double, Triple Quotes, and f-strings
x = 'He said to me, "Hi!"'
y = "He said to me, \"Hi!\""
long_str = """Wow! that is crazy!
I can't believe it!
What on Earth?"""

color = "red"
num = 20
print(f"My favorite color is {color}. I have {num} shirts with it.")
My favorite color is red. I have 20 shirts with it.

Lists

  • Lists are mutable sequences of any type of element. They are good for indexing and maintaining some level of order.
my_list = [1, 2, "c", "d"]

my_list.append(5)  # Add to end of list
my_list[1] = "b"
print(my_list)
[1, 'b', 'c', 'd', 5]

Useful Methods

  • .extend: add another sequence to a list
  • .count: count the number of an element in a list
  • .pop: remove the last element from a list
  • .index: find the index of an element in a list
my_list = ["a", 2, "c"]
my_list.extend([2, "c"])
print(my_list)

print(my_list.count("c"))
print(my_list.index("c"))
print(my_list.pop())
print(my_list)
['a', 2, 'c', 2, 'c']
2
2
c
['a', 2, 'c', 2]

Good Practice

  • Lists need to be copied with .copy(); otherwise they are aliased
my_list = [1,2,3]
my_list2 = my_list

my_list2.append(4)
print(my_list)
print(my_list is my_list2)
[1, 2, 3, 4]
True
  • Lists, like strings, can be sliced to get various parts of them
my_list = list("abcdefgh")
print(my_list[:3], my_list[4:], my_list[2:5], my_list[:-1], my_list[::2])
['a', 'b', 'c'] ['e', 'f', 'g', 'h'] ['c', 'd', 'e'] ['a', 'b', 'c', 'd', 'e', 'f', 'g'] ['a', 'c', 'e', 'g']
  • Lists can go inside lists! These may be referred to as multi-dimensional arrays
[[1,2,3],
 [4,5,6]]
  • When building lists, sometimes a list comprehension is easier to read and write
my_list = []
for num in range(5):
  my_list.append(num ** 2)
  
my_list2 = [num ** 2 for num in range(5)]

print([num1 - num2 for num1, num2 in zip(my_list, my_list2)])
[0, 0, 0, 0, 0]

Sets

  • Sets in Python operate very similarly to sets in standard mathematics. Elements must be unique, they are not in any order, and determining membership is a priority.
my_set = {"a", "b", 1, 2}
print("a" in my_set)
True

Useful Methods

  • .pop: Get and remove a random element from a set
  • .add: Add element to a set
  • .remove: Remove an element from a set
  • .union: Combine sets
  • and many other set operations

Good Practice

  • Use a frozenset if you do not plan on changing it
my_set = frozenset(["a", "b", "c"])
  • Recognize the many ways to make a set
my_set1 = {1, 2, 3}
my_set2 = set(range(1, 4))
my_set3 = {x for x in range(1, 4)}
  • Sets are also aliased
my_set4 = my_set1
my_set4.add(5)
print(my_set1)
print(my_set1 is my_set4)
{1, 2, 3, 5}
True

Dictionaries

  • dicts are great ways to map keys to values. A simple example is a histogram:
my_str = "Welcome to Duke!"

my_dict = {}
for character in my_str:
  if character in my_dict:
    my_dict[character] = my_dict[character] + 1
  else:
    my_dict[character] = 1

print(my_dict)
{'W': 1, 'e': 3, 'l': 1, 'c': 1, 'o': 2, 'm': 1, ' ': 2, 't': 1, 'D': 1, 'u': 1, 'k': 1, '!': 1}

Useful Methods

  • .keys: Get an iterable of all the keys in a dictionary

  • .values: Get an iterable of all the values in a dictionary

  • .items: Get an iterable of all the key-value pairs in a dictionary

  • .update: Combine two dictionaries

my_dict = {"a": 1, "b": 2}
my_dict2 = {"b": 3, "c": 4}
my_dict.update(my_dict2)
print(my_dict)
{'a': 1, 'b': 3, 'c': 4}
  • .get: Get a value from a dictionary but specify a default
my_str = "Welcome to Duke!"

my_dict = {}
for character in my_str:
  my_dict[character] = my_dict.get(character, 0) + 1
print(my_dict)
{'W': 1, 'e': 3, 'l': 1, 'c': 1, 'o': 2, 'm': 1, ' ': 2, 't': 1, 'D': 1, 'u': 1, 'k': 1, '!': 1}

Good Practice

  • Do key in my_dict not key in my_dict.keys()

  • collections.defaultdict is another way to handle the need for default values in a dictionary

  • Nest dictionaries if necessary

my_data = dict(
  item1={
    "status": "open",
    "description": "..."
  },
  item2={
    "status": "closed",
    "issue": "..."
  }
)

my_data["item1"]["status"]
'open'

Tuples

Very similar to lists, but they cannot be changed! They are immutable.

my_list = [1, 2, 3]
my_tuple = tuple(my_list)
my_list[1] = 100
print(my_tuple)

try:
  my_tuple[1] = 100
except Exception as e:
  print(e)
(1, 2, 3)
'tuple' object does not support item assignment

Good Practice

  • To make a tuple with one element: x = (1,)

  • Immutability allows tuples to be used as dict keys and be stored in sets:

my_key = (1, 2, 3, "four")
my_dict = {my_key: "my favorite numbers"}
my_set = set()
my_set.add(my_key)

print(my_dict)
print(my_set)

my_other_key = (1, ["a", "list"], 3)
try:  # All elements of the tuple need to be immutable too
  my_set.add(my_other_key)
except Exception as e:
  print(e)
{(1, 2, 3, 'four'): 'my favorite numbers'}
{(1, 2, 3, 'four')}
unhashable type: 'list'

Iteration

  • Programming is all about doing tasks repeatedly because they would be too annoying to do by hand. Many objects have a natural way to iterate over them.
from collections import Counter  # Makes a dictionary that counts elements
my_str = "I enjoy eating almonds"

def print_data(data):
  for item in data:
    print(item, end=" ")
  print()

print_data(my_str)
print_data(set(my_str))  # No order or repeats
print_data(list(my_str))
print_data(tuple(my_str))
print_data(Counter(my_str))  # Dictionaries are ordered by insertion, iterates over keys
I   e n j o y   e a t i n g   a l m o n d s 
o g d e l s t y j   n m i a I 
I   e n j o y   e a t i n g   a l m o n d s 
I   e n j o y   e a t i n g   a l m o n d s 
I   e n j o y a t i g l m d s 

Useful Functions

  • len: Get the natural size of an object
  • enumerate: Iterate over an object, but with index, value pairs.
  • zip: Iterate over two object at the same time
  • map: Map the values of an iterable using a function
  • itertools module: Contains many useful functions for specific kinds of iteration

Good Practice

  • Consider how you might need to nest iteration
values = [1,2,3,4,5]
pairwise_distance = []
for value1 in values:
  for value2 in values:
    pairwise_distance.append(abs(value1 - value2))
  • while loops are useful if you are unsure how many iterations are needed
# Find the 101st prime
primes_found = []
current_num = 2
while len(primes_found) < 100:
  if is_prime(current_num):
    primes_found.append(current_num)
  current_num += 1
  • Use continue and break
my_str = "aabbccaa"
for letter in my_str:
  if letter == "b":
    continue  # Skip an iteration
  if letter == "c":
    break  # Exit loop, also used with while loops
  print(letter, end = "")
aa

Functions

  • Functions are an essential part of passing functionality in Python from modules or in your code. It is a good idea to split your program up into repeatable parts to help with debugging.
def my_func(a, b, c=5):
  return f"({a} + {b}) * {c} = {(a+b) * c}"

print(my_func(1, 1, 0))
print(my_func(10, 4))
(1 + 1) * 0 = 0
(10 + 4) * 5 = 70
  • Recursion is an essential concept in Computer Science and parts of Data Science:
def sum_of_first_k_nums(k):
  if k == 0:  # Base case
    return 0
  return k + sum_of_first_k_nums(k - 1)

sum_of_first_k_nums(5)
15

Good Practice

  • Functions can have default arguments and return items like the previous slide
  • Functions can also take an arbitrary amount of arguments
def my_func_with_inf_args(*args, **kwargs):
  print(args)
  print(kwargs)
  
my_func_with_inf_args(1, 2, 3, a=4, b=5)
(1, 2, 3)
{'a': 4, 'b': 5}
  • Functions can be passed around in Python
def call_twice(some_func, *args, **kwargs):
  some_func(*args, **kwargs)
  some_func(*args, **kwargs)
  
call_twice(my_func_with_inf_args, 1, 2, f=9)
(1, 2)
{'f': 9}
(1, 2)
{'f': 9}
  • Although it’s more typical to see this as a lambda function
call_twice(lambda: my_func_with_inf_args(1, 2, f=9))
(1, 2)
{'f': 9}
(1, 2)
{'f': 9}

Objects

Objects are everywhere and naturally you can make your own.

from random import shuffle

SUITES = ("Hearts", "Spades", "Clubs", "Diamonds")
RANKS = (2, 3, 4, 5, 6, 7, 8, 9, 10, "Jack", "Queen", "King", "Ace")

class CardDeck:
  def __init__(self, empty=False):  # Constructor
    self.cards = []
    if not empty:
      for suit in SUITES:
        for rank in RANKS:
          self.cards.append((suit, rank))
  
  def add_card(self, suit, rank):
    self.cards.append((suit, rank))
  
  def shuffle(self):
    shuffle(self.cards)
    
  def draw_card(self):
    return self.cards.pop()

my_deck = CardDeck()
my_deck.shuffle()
print(my_deck.draw_card())
print(my_deck.draw_card())
('Hearts', 9)
('Spades', 7)

Useful Methods

class CardDeck:
  # ...
  def __len__(self):
    return len(self.cards)
  
  def __contains__(self, card):
    return card in self.cards

my_deck = CardDeck()
len(my_deck)
("Hearts", "Queen") in my_deck

Good Practice

class DiscardPile(CardDeck):  # Inheritance let's you reuse code
  def __init__(self):
    CardDeck.__init__(self, empty=True)

  def add_card(self, deck):
    drawn_card = deck.draw_card()
    self.cards.append(drawn_card)
    return drawn_card

my_deck = CardDeck()
discard = DiscardPile()
discarded_card = discard.add_card(my_deck)
  • Terminology matters
    • Function: stand-alone function (print)
    • Method: function that is attached to a class (draw_card)
    • Attribute: variable attached to a class (my_deck.cards)
    • Constructor: the __init__ method
    • Instance: a constructed object (my_deck)

Exceptions and Error Handling

If your program runs into an error, it will terminate if the resulting Exception is not caught.

my_list = [2, "hi"]
try:
  my_str = ",".join(my_list)  # TypeError: sequence item 0: expected str instance, int found
except TypeError:
  my_list = [str(element) for element in my_list]
  my_str = ",".join(my_list)

# Program continues...

Common Exceptions

  • Exception: All exceptions fall under this class
  • ArithmeticError: Base exception for OverflowError, ZeroDivisionError, FloatingPointError
  • AttributeError: Attempting to access an attribute that does not exist on an object
  • IndexError: Attempting to use an invalid index on a sequence
  • KeyboardInterrupt: Raised when ctrl-c is pressed during execution
  • NameError: Using a variable that does not exist
  • TypeError: Operating on two objects with incompatible types
  • ValueError: Input to a function is invalid

Good Practice

  • Avoid catch-all exceptions. If you use a try-catch, usually you have a specific exception in mind to handle. Other exceptions should be raised to see bugs when they happen.
try:
  some_complicated_function()
except Exception:  # Bad
  print("SOMETHING went wrong!")
  • Print them!
try:
  some_complicated_function()
except MemoryError as e:
  print("Out of memory!", e)
  • raise your own!
def my_complicated_function(some_matrix):
  if not some_matrix.is_square():
    raise ValueError(f"some_matrix is not square!")

Files

  • One way to read input and write output for a program is by writing to a file. This is generally a good idea if you might want to save the results for later.
# Writing files, use the mode argument
# Careful! This will delete the file if it is present
my_file = open("path_to_file.txt", mode="w")  
my_file.write("output I want to keep")
my_file.close()  # You must close the file, or your results may be lost

# Reading files
my_file = open("path_to_file.txt", mode="r")  # The mode is r by default
print(my_file.readline())
my_file.close()

Useful Functions/Methods

  • The os.path module and the pathlib module contain many methods for operating on the file system:
    • os.path.exists: Determine if a file exists at a given path
    • os.path.join: Join two path components together
    • os.listdir: List files in a directory
  • json.load in the json module is vital for reading .json files. It is also useful for writing dictionaries to a file with json.dump.
  • .readlines(): Read a file in line-by-line altogether

Good Practice

  • You can use the with statement to automatically close files when you are done using them. This includes if your program terminates unexpectedly.
with open("my_file.txt") as f:
  f.readline()
  # ...
# Outside the above indentation f cannot be used as it is closed
  • If your files contain text that is not in the ASCII codec, specify the encoding:
with open("my_file.txt", encoding="utf-8") as f:
  print(f.readline())  # Supports many unicode characters

Example Script

Problem Statement

A friend has a directory of 1000 files where each file has one of the following extensions: .csv, .tsv, .json. However, each file has comments throughout it delimited by ##, so they do not follow the proper format. They ask you to write a Python script which will combine all the files into 1 while removing the comments and ensuring the data is in a proper .csv format.

What is the input?

Character-delineated files and JSON

    person,age,job,favorite_color
    amy,20,waiter,blue
    barry,30,engineer,grey
    ## we have only adults in the dataset
    carl,25,None,purple ## None means unemployed
    ## this person did not understand the survey dan,29,superhero,pineapple 
    person    age   job   favorite_color
    amy   20    waiter    blue
    barry   30    engineer    grey
    ## we have only adults in the dataset
    carl    25    None    purple ## None means unemployed
    ## this person did not understand the survey dan    29    superhero   pineapple 
    {
      "data": [ ## List of people
        {
          "name": "amy",
          "age": 20,
          "job": "waiter",
          "favorite_color": "blue"
        },
        {
          ## Barry is friends with my Dad, Jerry
          "name": "barry", 
          "age": 30,
          "job": "engineer",
          "favorite_color": "grey"
        },
        ... omitted for brevity
      ]
    }

Psuedo Code

Create a variable to store data from all files

For every file my friend has
  Read them in line-by-line
  Remove all comments in each line
  Remove all empty lines
  If it is a .csv
    Seperate the commas and store it
  If it is a .tsv
    Seperate the tabs and store it
  If it is a .json
    Read the .json and store it
    
Turn the variable that is storing all the data into a .csv

Step 1

Create a variable to store data from all files

HEADERS = ["name", "age", "job", "favorite_color"]

my_data = [
  # ("amy", 20, "waiter", "blue") example entry
]

For every file my friend has, Read it in as a string

import os

path_to_folder = "./my_friend/stored/the/files/here"

for file_path in os.listdir(path_to_folder):
  file = open(file_path)
  file_as_lines = file.readlines()
  # [
  #    "person,age,job,favorite_color\n",
  #    "amy,20,waiter,blue\n",
  #    "barry,30,engineer,grey\n",
  #    "## we have only adults in the dataset\n",
  #    "carl,25,None,purple ## None means unemployed\n",
  #    "## this person did not understand the survey dan,29,superhero,pineapple"
  # ]

Step 2

Remove all comments in each line then remove all empty lines

def remove_comments(line):
  no_whitespace = line.strip() #  => "## this is a comment"
  if "##" in no_whitespace:
    comment_starts_at_index = no_whitespace.index("##")
    filtered = no_whitespace[:comment_starts_at_index]
    return filtered.strip() # In case there are spaces around the comment
  return no_whitespace

print("test 1", remove_comments("  ## this is a comment  "))
print("test 2", remove_comments("person,age,job,favorite_color"))
print("test 3", remove_comments("person,age,job,favorite_color## comments   "))
test 1 
test 2 person,age,job,favorite_color
test 3 person,age,job,favorite_color
file_with_no_comments = ""
for line in file_as_lines:
  comments_removed = remove_comments(line)
  if comments_removed:
    file_with_no_comments += comments_removed + "\n"

Step 3

If it is a .csv or .tsv, seperate the commas or tabs and store it

def get_delimited_entries(file_as_str, delimiter):
  to_return = []
  lines = file_as_str.split("\n")
  for line in lines[1:]:  # Skip header
    to_return.append(line.split(delimiter))
  
  return to_return

if file_path.endswith(".csv"):
  my_data.extend(get_delimited_entries(file_with_no_comments, ","))
elif file_path.endswith(".tsv"):
  my_data.extend(get_delimited_entries(file_with_no_comments, "\t"))

Step 4

If it is a .json read the .json and store it

import json

# ... continued
elif file_path.endswith(".json"):
  as_dict = json.loads(file_with_no_comments)
  for entry in as_dict["data"]:
    my_data.append([str(entry[header]) for header in HEADERS])
else:
  print(f"Could not read file: {file_path}")
  
  file.close()
# end for loop over files

Step 5

Turn the variable that is storing all the data into a .csv

as_csv = ",".join(HEADERS) + "\n"
for entry in my_data:
  as_csv += ",".join(entry) + "\n"

with open("output.csv", "w") as output_file:
  output_file.write(as_csv)

Full script here

More Python

Libraries

  • numpy, pandas: Common libraries for data processing
  • django: Framework for building websites via full-stack development
  • csv: Built-in library for handling csv files
  • SQLAlchemy: Extensive database support
  • requests: Make HTTP requests for API calls and web scraping
  • Pillow: Image processing
  • six: Python 2 and 3 compatibility
  • Polars: Pandas alternative
  • BeautifulSoup: Parse HTML input
  • pygame: Game development
  • nltk: Natural Language Processing
  • PyTorch, TensorFlow: Neural Networks 🤮
  • Many, many more…

New Stuff

Testing

In a professional environment, unit tests may be written to ensure bugs are not introduced during the development process.

import unittest
from my_module import my_func

class TestMyStuff(unittest.TestCase):
    def test_my_func(self):
        my_output = my_func(1, 2, 3)
        self.assertEqual(6, my_output)
        
    def test_my_func_with_strings(self):
        my_output = my_func(1, 2, 3, "h")
        self.assertEqual(60, my_output)
    
    def test_my_func_raises_exception(self):
        with self.assertRaises(ValueError):
          my_output = my_func(1, 2, 3, "h", "")

Lower-level Language Bindings

  • As a high-level language, Python is relatively slow.
  • Python can call optimized code written in C, C++, fortran, etc. to achieve very similar speeds.
  • This is why numpy is so much faster.
  • Use the right tool for the right job.

Exercises

Colab