Image: ESA - C.Carreau (SEMPDN9OY2F)

CIS 4550/5550: Internet and Web Systems (Fall 2024)

This course focuses on the issues encountered in building Internet and Web systems, such as scalability, interoperability, consistency, replication, fault tolerance, and security. We will examine how services like Google or Amazon handle billions of requests from all over the world each day (almost) without failing or becoming unreachable. We will study how to collect massive-scale data sets, how to process them, and how to extract useful information from them, and we will look at the massive, heavily distributed infrastructure that is used to run these and similar cloud-based services today.

An important feature of the course is that we will not just discuss issues and solutions but also provide hands-on experience, using web search as our case study. There will be several substantial implementation projects throughout the semester, each of which will focus on a particular component of the search engine, such as frontend, storage, crawler, or indexer. The final project will be to build a Google-style search engine, and to deploy and run it on the cloud.

Note that this is NOT a course on web design or on web application development! Instead of learning how to use a web server such as Apache or a scalable analytics system such as Spark, we will build our own small web server and a mini-"Spark" from scratch. As a side effect, you will learn about some aspects of large-scale software development, such as working with APIs and specifications, thinking about modularity, reading other people's code, managing versions, and debugging.

CIS 5550 is now a core course for the MSE degree as well as an option for the WPE I requirement for PhD students. The Daily Pennsylvanian published a nice article about this course.

Instructor

Vincent Liu
Office hours: Wednesdays 2:00 - 3:00 pm (Levine 574)

Teaching assistants

[Head TA] Sid Sannapareddy sidsan@seas.upenn.edu OH: Fri 12:00 - 2:00 pm @ Levine 5th Floor Bump Space
Yutai Zhang yutai@seas.upenn.edu OH: Fri 10:00 am - 12:00 pm @ OHQ
Yuxuan Xiong yuxuanx@design.upenn.edu OH: Sat 9:00 - 11:00 am @ OHQ
Grace Lee glee1@seas.upenn.edu OH: Sat 6:00 - 8:00 pm @ OHQ
Xingjian Wang xwang7@seas.upenn.edu OH: Sun 10:00 am - 12:00 pm @ OHQ
Kevin Liu kliu2360@seas.upenn.edu OH: Mon 7:00 - 9:00 pm @ OHQ
Jason Ren jren2@seas.upenn.edu OH: Tue 1:00 - 3:00 pm @ Levine 3rd Floor Bump Space
Tang Gao tanggao@seas.upenn.edu OH: Tue 3:00 - 5:00 pm @ OHQ
Cyrus Singer cysinger@seas.upenn.edu OH: Wed 3:30 - 5:30 pm @ Levine 5th Floor Elevator Bump Space
Rui Xia xia7@seas.upenn.edu OH: Thu 1:00 - 3:00 pm @ OHQ
Emily Shang emshg@seas.upenn.edu OH: Thu 4:00 - 6:00 pm @ Levine 3rd Floor Bump Space

Format

The format will be two 1.5-hour lectures per week, plus assigned readings. There will be regular homework assignments, two in-class midterms, and a substantial implementation project with experimental validation and a report.

Time and location

Tuesdays and Thursdays 10:15-11:45am (COLL 200)

Prerequisites

This course expects familiarity with threads and concurrency, as well as strong Java programming skills. Those highly proficient in another programming language, such as C++ or C#, should be able to translate their skills easily. The course will require a considerable amount of programming, as well as the ability to work with your classmates in teams.

Textbooks

Distributed Systems: Principles and Paradigms, 3rd edition, by Tanenbaum and van Steen, Prentice Hall (ISBN 978-1530281756).
You can buy a physical copy (e.g., for $35 on Amazon) or download a free digital copy here.

Additional materials will be provided as handouts or in the form of light technical papers.

Grading

Homework 40%, Term project 25%, Exams 30%, Participation 5%

Policies

You can find a list of key course policies here.

Assignments

Homework assignments are available for download. Please join the discussion group as well!

Tentative schedule

Date Topic Details Reading Remarks
27-Aug Introduction [Slides] Introduction
Overview
Logistics
Policies
HW0 released
29-Aug Internet basics [Slides] The Internet
Interdomain routing; BGP; valley-free
Path properties
TCP and UDP
Socket basics; echo server
[30-Aug] HW0 due; HW1 released
3-Sep The Web [Slides] The Web; hyperlinks; history of the Web
Client-server model
HTTP/1, TLS
HTML/CSS basics
HTTP/2
Lampson: "Hints for Computer System Design"
Introduction to HTTP/2
5-Sep Scalability [Slides] Parallelization
Consistency
Mutual exclusion; locking; deadlocks
NUMA and Shared-Nothing
Frontend-backend, Sharding
Vogels: "Eventually Consistent"
10-Sep Dynamic content [Slides] Motivation: Dynamic content
Routes
Managing state; cookies; sessions
Tracking; business model of the web
Spark Framework Overview
12-Sep The Client Side [Slides] JavaScript
DOM
MDN: A reintroduction to JavaScript [13-Sep] HW1 due; HW2 released
17-Sep The Client Side (cont.) [Slides] Dynamic requests
AJAX
19-Sep Naming [Slides] Name spaces and directories
DNS architecture
Security issues with DNS
DNSSEC, DANE
Globally Distributed Content Delivery [20-Sep] HW2 due; HW3 released
24-Sep The Cloud [Slides] Data centers
Cloud computing
Types of clouds
History of Cloud Computing
Case study: EC2
Armbrust et al.: "A View of Cloud Computing"
26-Sep RPCs [Slides] Web services; APIs; API examples
Remote procedure calls
Handling RPC failures
Data interchange
XML
Chapter 4.2 in the Tanenbaum book [27-Sep] HW3 due; HW4 released
1-Oct First midterm exam
Oct 3-6 Fall Term Break
7-Oct Last day to drop
8-Oct Key-value Stores [Slides] Key-value stores
KVS on the Cloud
Sharding and coordination
Case study: S3
Case study: DynamoDB
Cooper et al.: "PNUTS to Sherpa: Lessons from Yahoo!'s Cloud Database"
10-Oct Basic fault tolerance [Slides] Faults and fault models
Primary-backup replication
Chapter 7.5 in the Tanenbaum book [11-Oct] HW4 due; HW5 released
15-Oct Basic fault tolerance (cont.) [Slides] Availability and Durability
The CAP theorem
Quorum replication
17-Oct Scalable Analytics [Slides] Introduction to scalable analytics
MapReduce
The Streams API
Apache Spark
Lambdas and serialization
Zaharia et al.: "Spark: Cluster Computing with Working Sets" [18-Oct] HW5 due; HW6 released
22-Oct Spark basics [Slides] Spark jobs
Working with files
Spark transformations
Spark actions
The Structured API
Zaharia et al.: "Resilient Distributed Datasets"
24-Oct Spark continued [Slides] HDFS
Apache Livy
Distributed shared variables
Graph algorithms in Spark
Shvachko: "Apache Hadoop: The Scalability Update" [25-Oct] HW6 due; HW7 released
25-Oct Last day to pass/fail
29-Oct Crawling [Slides] Structure of the Web
Crawling basics
SEO
Crawler etiquette
Heydon and Najork: "Mercator: A scalable, extensible Web crawler" Project handout released
31-Oct Information retrieval [Slides] Basic IR model; precision/recall
Boolean model
Vector model
TF/IDF
Stemming and lemmatization
Chapter 1 in "An Introduction to Information Retrieval" [1-Nov] HW7 due; HW8 released
4-Nov Last day to withdraw
5-Nov Authoritativeness [Slides] Motivation: off-page features
HITS
PageRank
Sinks and hogs
Brin and Page: "The PageRank Citation Ranking: Bringing Order to the Web" Team registrations due; project begins
7-Nov Search engines [Slides] Building a search engine
Case study: Google
Case study: Mercator
Project overview
Modern search
Brin and Page: "The Anatomy of a Large-Scale Hypertextual Web Search Engine" [8-Nov] HW8 due; HW9 released
12-Nov Decentralized systems [Slides] Centralization and its effects
Partly centralized systems
Unstructured overlays
Structured overlays
Druschel and Rodrigues: "Peer-to-Peer Systems"
14-Nov Key-based routing; DHTs [Slides] Consistent hashing and DHTs
Key-based routing
Basic Chord
Fault tolerance in Chord
KBR and security
Stoica et al.: "Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications" [15-Nov] HW9 due
19-Nov Advanced Fault Tolerance [Slides] Non-crash fault models
Schneider: "Implementing Fault-Tolerant Services Using the State Machine Approach"
21-Nov Advanced Fault Tolerance (cont.) State-machine replication
Paxos
The Byzantine Generals Problem
Byzantine Fault Tolerance
26-Nov Advanced Fault Tolerance (cont.) State-machine replication
Nov 28 - Dec 1 Thanksgiving Break
3-Dec Security Threat models
Crypto basics
Digital signatures
Attacks and Defenses
OWASP Top 10
5-Dec Second midterm exam
Dec 10-11 Reading days
Dec 12-19 Finals period (in-person project demos)