Image: ESA - C.Carreau (SEMPDN9OY2F)

CIS 5550: Internet and Web Systems (Fall 2023)

This course focuses on the issues encountered in building Internet and Web systems, such as scalability, interoperability, consistency, replication, fault tolerance, and security. We will examine how services like Google or Amazon handle billions of requests from all over the world each day, (almost) without failing or becoming unreachable. We will study how to collect massive-scale data sets, how to process them, and how to extract useful information from them, and we will have a look at the massive, heavily distributed infrastructure that is used to run these services and similar cloud-based services today.

An important feature of the course is that we will not just discuss issues and solutions but also provide hands-on experience, using web search as our case study. There will be several substantial implementation projects throughout the semester, each of which will focus on a particular component of the search engine, such as frontend, storage, crawler, or indexer. The final project will be to build a Google-style search engine, and to deploy and run it on the cloud.

Notice that this is NOT a course on web design, or on web application development! Instead of learning how to use a web server such as Apache or a scalable analytics system such as Spark, we will actually build our own little web server, and a little mini-"Spark", from scratch. As a side effect, you will learn about some aspects of large-scale software development, such as working with APIs and specifications, thinking about modularity, reading other people's code, managing versions, and debugging.

CIS 5550 is now a core course for the MSE degree as well as an option for the WPE I requirement for PhD students. The Daily Pennsylvanian published a nice article about this course.

Instructor

Linh Thi Xuan Phan
Office hours: Thursdays 12:00-1:00pm (Levine 576)

Teaching assistants

If location is not specified, OHQ will be used temporarily.

GuanWen Qiu (Head TA) guanwenq@seas.upenn.edu OH: Mondays 3:30-5:00pm (Levine 501)
Crescent Xiong zihanx3@seas.upenn.edu OH: Mondays 5:00-7:00pm (Levine 5th floor bump space)
Larry Huang huangl24@seas.upenn.edu OH: Tuesdays 1:30-3:00pm (Levine 3rd floor bump space)
Jinhui Luo jinhuil@seas.upenn.edu OH: Tuesdays 3:30-5:00pm (Levine 501 bump space)
Yujuan Song syujuan@seas.upenn.edu OH: Wednesdays 1:00-3:00pm (Levine 5th floor bump space)
Yuanqi Wang wyq@sas.upenn.edu OH: Wednesdays 3:00-5:00pm (Levine 5th floor bump space)
Xingjian Wang xwang7@seas.upenn.edu OH: Thursdays 2:30-4:30pm (Levine 501 bump space)
Zhiyu (Oliver) Lei zlei6@seas.upenn.edu OH: Thursdays 4:30-6:30pm (Levine 501 bump space)
Zhengyi Xiao zxiao98@seas.upenn.edu OH: Fridays 1:00-3:00pm (Levine 501 bump space)
Charles Cheng chacheng@seas.upenn.edu OH: Fridays 4:00-6:00pm (Location: OHQ)

Format

The format will be two 1.5-hour lectures per week, plus assigned readings. There will be regular homework assignments, two in-class midterms, and a substantial implementation project with experimental validation and a report.

Time and location

Tuesdays + Thursdays 10:15-11:45am (DRL A1)

Prerequisites

This course expects familiarity with threads and concurrency, as well as strong Java programming skills. Those highly proficient in another programming language, such as C++ or C#, should be able to translate their skills easily. The course will require a considerable amount of programming, as well as the ability to work with your classmates in teams.

Textbooks

Distributed Systems: Principles and Paradigms, 3rd edition, by Tanenbaum and van Steen, Prentice Hall (ISBN 978-1530281756).
You can buy a physical copy (e.g., for $35 on Amazon) or download a free digital copy here.

Additional materials will be provided as handouts or in the form of light technical papers.

Grading

Homework 40%, Term project 25%, Exams 30%, Participation 5%

Policies

You can find a list of key course policies here.

Assignments

Homework assignments are available for download. Please join the discussion group as well!

Tentative schedule

Date | Topic | Details | Reading | Remarks
Aug 29 Introduction [Slides] Introduction
Overview
Logistics
Policies
HW0 released
Aug 31 Internet basics [Slides] The Internet
Interdomain routing; BGP; valley-free
Path properties
TCP and UDP
Socket basics; echo server
Sep 5 The Web [Slides] The Web; hyperlinks; history of the Web
Client-server model
HTTP/1, TLS
HTML/CSS basics
HTTP/2
Introduction to HTTP/2 HW0 due; HW1 released
Sep 7 Scalability [Slides] Parallelization
Consistency
Mutual exclusion; locking; deadlocks
NUMA and Shared-Nothing
Frontend-backend, Sharding
Vogels: "Eventually Consistent"
Sep 12 Dynamic content [Slides] Motivation: Dynamic content
Routes
Managing state; cookies; sessions
Tracking; business model of the web
Spark Framework Overview HW1 due; HW2 released
Sep 14 The Client Side [Slides] JavaScript
DOM
Dynamic requests
AJAX
MDN: A reintroduction to JavaScript
Sep 19 Naming [Slides] Name spaces and directories
DNS architecture
Security issues with DNS
DNSSEC, DANE
Globally Distributed Content Delivery HW2 due; HW3 released
Sep 21 The Cloud [Slides] Data centers
Cloud computing
Types of clouds
History of Cloud Computing
Case study: EC2
Armbrust et al.: "A View of Cloud Computing"
Sep 26 RPCs [Slides] Web services; APIs; API examples
Remote procedure calls
Handling RPC failures
Data interchange
XML
Chapter 4.2 in the Tanenbaum book HW3 due; HW4 released
Sep 28 Storage [Slides] Key-value stores
KVS on the Cloud
Sharding and coordination
Case study: S3
Case study: DynamoDB
Cooper et al.: "PNUTS to Sherpa: Lessons from Yahoo!'s Cloud Database"
Oct 3 Basic fault tolerance Faults and fault models
Primary-backup replication
Availability and Durability
The CAP theorem
Quorum replication
Chapter 7.5 in the Tanenbaum book HW4 due
Oct 5 First midterm exam
Oct 9 Last day to drop
Oct 10 Indexing; GFS case study Motivation: Cost of operations
B+ tree overview
Motivation; distributed file systems; NFS
GFS architecture
GFS operation
Discussion of GFS
Comer: "The Ubiquitous B-Tree" HW5 released (on 10/5)
Oct 12-15 Fall Break
Oct 17 Scalable Analytics Introduction to scalable analytics
MapReduce
The Streams API
Apache Spark
Lambdas and serialization
Zaharia et al.: "Spark: Cluster Computing with Working Sets" HW5 due; HW6 released
Oct 19 Spark basics Spark jobs
Working with files
Spark transformations
Spark actions
The Structured API
Zaharia et al.: "Resilient Distributed Datasets"
Oct 24 Spark continued HDFS
Apache Livy
Distributed shared variables
Graph algorithms in Spark
Shvachko: "Apache Hadoop: The Scalability Update" HW6 due; HW7 released; Project handout released
Oct 26 Crawling Structure of the Web
Crawling basics
SEO
Crawler etiquette
Heydon and Najork: "Mercator: A scalable, extensible Web crawler"
Oct 27 Last day to pass/fail
Oct 31 Information retrieval Basic IR model; precision/recall
Boolean model
Vector model
TF/IDF
Stemming and lemmatization
Chapter 1 in "An Introduction to Information Retrieval" HW7 due; HW8 released; Team registrations due; project begins
Nov 2 Authoritativeness Motivation: off-page features
HITS
PageRank
Sinks and hogs
Brin and Page: "The PageRank Citation Ranking: Bringing Order to the Web"
Nov 6 Last day to withdraw
Nov 7 Search engines Building a search engine
Case study: Google
Case study: Mercator
Project overview
Modern search
Guest lecture by Raj Singh
Brin and Page: "The Anatomy of a Large-Scale Hypertextual Web Search Engine" HW8 due; HW9 released
Nov 9 Engineering software systems; virtualization Software engineering
Version control
Testing
Debugging
Effective teams
Why virtualization?
Virtualization basics
Containers
Serverless computing
The Agile Coach
Nov 14 Decentralized systems Centralization and its effects
Partly centralized systems
Unstructured overlays
Structured overlays
Druschel and Rodrigues: "Peer-to-Peer Systems" HW9 due
Nov 16 Key-based routing; DHTs Consistent hashing and DHTs
Key-based routing
Basic Chord
Fault tolerance in Chord
KBR and security
Stoica et al.: "Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications"
Nov 21 Advanced Fault Tolerance Non-crash fault models
State-machine replication
Paxos
The Byzantine Generals Problem
Byzantine Fault Tolerance
Schneider: "Implementing Fault-Tolerant Services Using the State Machine Approach"
Nov 23-26 Thanksgiving Break
Nov 28 Transactions Introduction to transactions
Concurrency control
Log-based recovery
Two-phase commit
Distributed concurrency control
Shute et al.: "F1: A Distributed SQL Database That Scales"
Nov 30 Security Threat models
Crypto basics
Digital signatures
Attacks and Defenses (part 1)
Attacks and Defenses (part 2)
OWASP Top 10
Dec 5 Special topics TBA



Dec 7 Second midterm exam
Dec 12-13 Reading days
Dec 14-21 Finals period (in-person project demos)