Image: ESA - C.Carreau (SEMPDN9OY2F)

CIS 5550: Internet and Web Systems (Fall 2023)

This course focuses on the issues encountered in building Internet and Web systems, such as scalability, interoperability, consistency, replication, fault tolerance, and security. We will examine how services like Google or Amazon handle billions of requests from all over the world each day, (almost) without failing or becoming unreachable. We will study how to collect massive-scale data sets, how to process them, and how to extract useful information from them, and we will have a look at the massive, heavily distributed infrastructure that is used to run these services and similar cloud-based services today.

An important feature of the course is that we will not just discuss issues and solutions but also provide hands-on experience, using web search as our case study. There will be several substantial implementation projects throughout the semester, each of which will focus on a particular component of the search engine, such as frontend, storage, crawler, or indexer. The final project will be to build a Google-style search engine, and to deploy and run it on the cloud.

Notice that this is NOT a course on web design, or on web application development! Instead of learning how to use a web server such as Apache or a scalable analytics system such as Spark, we will actually build our own little web server, and a little mini-"Spark", from scratch. As a side effect, you will learn about some aspects of large-scale software development, such as working with APIs and specifications, thinking about modularity, reading other people's code, managing versions, and debugging.

CIS 5550 is now a core course for the MSE degree as well as an option for the WPE I requirement for PhD students. The Daily Pennsylvanian published a nice article about this course.


Instructor: Linh Thi Xuan Phan
Office hours: Thursdays 12:00-1:00pm (Levine 576)

Teaching assistants

If location is not specified, OHQ will be used temporarily.

GuanWen Qiu (Head TA) OH: Mondays 3:30-5:00pm (Levine 501)
Crescent Xiong OH: Mondays 5:00-7:00pm (Levine 5th floor bump space)
Larry Huang OH: Tuesdays 1:30-3:00pm (Levine 3rd floor bump space)
Jinhui Luo OH: Tuesdays 3:30-5:00pm (Levine 501 bump space)
Yujuan Song OH: Wednesdays 1:00-3:00pm (Levine 5th floor bump space)
Yuanqi Wang OH: Wednesdays 3:00-5:00pm (Levine 5th floor bump space)
Xingjian Wang OH: Thursdays 2:30-4:30pm (Levine 501 bump space)
Zhiyu (Oliver) Lei OH: Thursdays 4:30-6:30pm (Levine 501 bump space)
Zhengyi Xiao OH: Fridays 1:00-3:00pm (Levine 501 bump space)
Charles Cheng OH: Fridays 4:00-6:00pm (Location: OHQ)


The format will be two 1.5-hour lectures per week, plus assigned readings. There will be regular homework assignments, two in-class midterms, and a substantial implementation project with experimental validation and a report.

Time and location

Tuesdays + Thursdays 10:15-11:45am (DRL A1)


This course expects familiarity with threads and concurrency, as well as strong Java programming skills. Those highly proficient in another programming language, such as C++ or C#, should be able to translate their skills easily. The course will require a considerable amount of programming, as well as the ability to work with your classmates in teams.


Distributed Systems: Principles and Paradigms, 3rd edition, by Tanenbaum and van Steen, Prentice Hall (ISBN 978-1530281756).
You can buy a physical copy (e.g., for $35 on Amazon) or download a free digital copy here.

Additional materials will be provided as handouts or in the form of light technical papers.


Homework 40%, Term project 25%, Exams 30%, Participation 5%


You can find a list of key course policies here.


Homework assignments are available for download. Please join the discussion group as well!

Tentative schedule

29-Aug Introduction [Slides] Introduction
HW0 released
31-Aug Internet basics [Slides] The Internet
Interdomain routing; BGP; valley-free
Path properties
Socket basics; echo server
5-Sep The Web [Slides] The Web; hyperlinks; history of the Web
Client-server model
HTML/CSS basics
Introduction to HTTP/2 HW0 due; HW1 released
7-Sep Scalability [Slides] Parallelization
Mutual exclusion; locking; deadlocks
NUMA and Shared-Nothing
Frontend-backend, Sharding
Vogels: "Eventually Consistent"
12-Sep Dynamic content [Slides] Motivation: Dynamic content
Managing state; cookies; sessions
Tracking; business model of the web
Spark Framework Overview HW1 due; HW2 released
14-Sep The Client Side [Slides] JavaScript
Dynamic requests
MDN: A reintroduction to JavaScript
19-Sep Naming [Slides] Name spaces and directories
DNS architecture
Security issues with DNS
Globally Distributed Content Delivery HW2 due; HW3 released
21-Sep The Cloud [Slides] Data centers
Cloud computing
Types of clouds
History of Cloud Computing
Case study: EC2
Armbrust et al.: "A View of Cloud Computing"
26-Sep RPCs [Slides] Web services; APIs; API examples
Remote procedure calls
Handling RPC failures
Data interchange
Chapter 4.2 in the Tanenbaum book HW3 due; HW4 released
28-Sep Storage [Slides] Key-value stores
KVS on the Cloud
Sharding and coordination
Case study: S3
Case study: DynamoDB
Cooper et al.: "PNUTS to Sherpa: Lessons from Yahoo!'s Cloud Database"
3-Oct Basic fault tolerance [Slides] Faults and fault models
Primary-backup replication
Chapter 7.5 in the Tanenbaum book HW4 due
5-Oct First midterm exam
9-Oct Last day to drop
10-Oct Basic fault tolerance (cont.) Availability and Durability
The CAP theorem
Quorum replication
HW5 released (on 10/5)
Oct 12-15 Fall Break
17-Oct Scalable Analytics [Slides] Introduction to scalable analytics
The Streams API
Apache Spark
Lambdas and serialization
Zaharia et al.: "Spark: Cluster Computing with Working Sets" HW5 due; HW6 released
19-Oct Spark basics [Slides] Spark jobs
Working with files
Spark transformations
Spark actions
The Structured API
Zaharia et al.: "Resilient Distributed Datasets"
24-Oct Spark continued [Slides] HDFS
Apache Livy
Distributed shared variables
Graph algorithms in Spark
Shvachko: "Apache Hadoop: The Scalability Update" HW6 due; HW7 released; Project handout released
31-Oct Crawling [Slides] Structure of the Web
Crawling basics
Crawler etiquette
Heydon and Najork: "Mercator: A scalable, extensible Web crawler" HW7 due; HW8 released; Team registrations due; project begins
2-Nov Information retrieval [Slides] Basic IR model; precision/recall
Boolean model
Vector model
Stemming and lemmatization
Chapter 1 in "An Introduction to Information Retrieval"
6-Nov Last day to withdraw
7-Nov Authoritativeness [Slides] Motivation: off-page features
Sinks and hogs
Brin and Page: "The PageRank Citation Ranking: Bringing Order to the Web" HW8 due; HW9 released
9-Nov Search engines [Slides] Building a search engine
Case study: Google
Case study: Mercator
Project overview
Modern search
Brin and Page: "The Anatomy of a Large-Scale Hypertextual Web Search Engine"
14-Nov Decentralized systems [Slides] Centralization and its effects
Partly centralized systems
Unstructured overlays
Structured overlays
Druschel and Rodrigues: "Peer-to-Peer Systems" HW9 due
16-Nov Key-based routing; DHTs [Slides] Consistent hashing and DHTs
Key-based routing
Basic Chord
Fault tolerance in Chord
KBR and security
Stoica et al.: "Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications"
21-Nov Advanced Fault Tolerance [Slides] Non-crash fault models
Schneider: "Implementing Fault-Tolerant Services Using the State Machine Approach"
Nov 23-26 Thanksgiving Break
Nov 28+30 Advanced Fault Tolerance (cont.) [Slides] State-machine replication
The Byzantine Generals Problem
Byzantine Fault Tolerance
Schneider: "Implementing Fault-Tolerant Services Using the State Machine Approach"
30-Nov Transactions [Slides] Introduction to transactions
Concurrency control
Log-based recovery
Two-phase commit
Distributed concurrency control
Shute et al.: "F1: A Distributed SQL Database That Scales"
5-Dec Security [Slides] Threat models
Crypto basics
Digital signatures
Attacks and Defenses
OWASP Top 10
7-Dec Second midterm exam
Dec 12-13 Reading days
Dec 14-21 Finals period (in-person project demos)