Image: ESA - C.Carreau (SEMPDN9OY2F)
Overview | Assignments | Policies

CIS 5550: Internet and Web Systems (Spring 2023)

This course focuses on the issues encountered in building Internet and Web systems, such as scalability, interoperability, consistency, replication, fault tolerance, and security. We will examine how services like Google or Amazon handle billions of requests from all over the world each day, (almost) without failing or becoming unreachable. We will study how to collect massive-scale data sets, how to process them, and how to extract useful information from them, and we will have a look at the massive, heavily distributed infrastructure that is used to run these services and similar cloud-based services today.

An important feature of the course is that we will not just discuss issues and solutions but also provide hands-on experience, using web search as our case study. There will be several substantial implementation projects throughout the semester, each of which will focus on a particular component of the search engine, such as frontend, storage, crawler, or indexer. The final project will be to build a Google-style search engine, and to deploy and run it on the cloud.

Notice that this is NOT a course on web design, or on web application development! Instead of learning how to use a web server such as Apache or a scalable analytics system such as Spark, we will actually build our own little web server, and a little mini-"Sparkā€", from scratch. As a side effect, you will learn about some aspects of large-scale software development, such as working with APIs and specifications, thinking about modularity, reading other people's code, managing versions, and debugging.

CIS 5550 is now a core course for the MSE degree as well as an option for the WPE I requirement for PhD students. The Daily Pennsylvanian published a nice article about this course.

Instructor

Andreas Haeberlen
Office hours: Mondays 9-10am (Levine 560)

Teaching assistants

Yixiang Xiao yx1215@seas.upenn.edu OH: Mondays 2:00-3:30pm (Levine GRW 5th floor bump space)
Zhikai Zheng zhikz@seas.upenn.edu OH: Mondays 7:00-8:30pm (Levine GRW 5th floor bump space)
Lifu Zhang leafz@seas.upenn.edu OH: Tuesdays 3:30-5:00pm (Levine GRW 5th floor bump space)
Guanwen Qiu (Head TA) guanwenq@seas.upenn.edu OH: Tuesdays 7:30-9:00pm (Levine 501 bump space)
Ziyan Liu ziyanliu@seas.upenn.edu OH: Wednesdays 2:00-3:30pm (Levine GRW 5th floor bump space)
Eric Jackson ebj29@sas.upenn.edu OH: Wednesdays 3:30-5:00pm (Levine GRW 5th floor bump space)
Yiyun Liu liuyiyun@seas.upenn.edu OH: Thursdays 3:30-5:00pm (Levine GRW 5th floor bump space)
Yihan Wang wangyh1@seas.upenn.edu OH: Saturdays 4:00-5:30pm (Virtual)

Format

The format will be two 1.5-hour lectures per week, plus assigned readings. There will be regular homework assignments, two in-class midterms, and a substantial implementation project with experimental validation and a report.

Time and location

Mondays + Wednesdays noon-1:30pm (LRSM Auditorium)

Prerequisites

This course expects familiarity with threads and concurrency, as well as strong Java programming skills. Those highly proficient in another programming language, such as C++ or C#, should be able to translate their skills easily. The course will require a considerable amount of programming, as well as the ability to work with your classmates in teams.

Textbooks

Distributed Systems: Principles and Paradigms, 3rd edition, by Tanenbaum and van Steen, Prentice Hall (ISBN 978-1530281756).
You can buy a physical copy (e.g., for $35 on Amazon) or download a free digital copy here.

Additional materials will be provided as handouts or in the form of light technical papers.

Grading

Homework 40%, Term project 25%, Exams 30%, Participation 5%

Policies

You can find a list of key course policies here.

Assignments

Homework assignments are available for download. Please join the discussion group as well!

Tentative schedule

DateTopicDetailsReadingRemarks
Jan 11 Introduction [Slides] Introduction
Overview
Logistics
Policies
HW0 released
Jan 16(MLK day - no class)
Jan 18 Internet basics [Slides] The Internet
Interdomain routing; BGP; valley-free
Path properties
TCP and UDP
Socket basics; echo server
Jan 23 The Web [Slides] The Web; hyperlinks; history of the Web
Client-server model
HTTP/1, TLS
HTML/CSS basics
HTTP/2
Introduction to HTTP/2 HW0 due; HW1 released
Jan 25 Scalability [Slides] Parallelization
Consistency
Mutual exclusion; locking; deadlocks
NUMA and Shared-Nothing
Frontend-backend, Sharding
Vogels: "Eventually Consistent"
Jan 30 Dynamic content [Slides] Motivation: Dynamic content
Routes
Managing state; cookies; sessions
Tracking; business model of the web
Spark Framework Overview HW1 due; HW2 released
Feb 1 The Client Side [Slides] JavaScript
DOM
Dynamic requests
AJAX
MDN: A reintroduction to JavaScript
Feb 6 Naming [Slides] Name spaces and directories
DNS architecture
Security issues with DNS
DNSSEC, DANE
Globally Distributed Content Delivery HW2 due; HW3 released
Feb 8 The Cloud [Slides] Data centers
Cloud computing
Types of clouds
History of Cloud Computing
Case study: EC2
Armbrust et al.: "A View of Cloud Computing"
Feb 13 RPCs [Slides] Web services; APIs; API examples
Remote procedure calls
Handling RPC failures
Data interchange
XML
Chapter 4.2 in the Tanenbaum book HW3 due; HW4 released
Feb 15First midterm exam
Feb 20Last day to drop
Feb 20 Storage [Slides] Key-value stores
KVS on the Cloud
Sharding and coordination
Case study: S3
Case study: DynamoDB
Cooper et al.: "PNUTS to Sherpa: Lessons from Yahoo!'s Cloud Database" HW4 due; HW5 released
Feb 22 Basic fault tolerance [Slides] Faults and fault models
Primary-backup replication
Availability and Durability
The CAP theorem
Quorum replication
Chapter 7.5 in the Tanenbaum book
Feb 27 Indexing; GFS case study [Slides] Motivation: Cost of operations
B+ tree overview
Motivation; distributed file systems;NFS
GFS architecture
GFS operation
Discussion of GFS
Comer: "The Ubiquitous B-Tree" HW5 due; HW6 released
Mar 1 Scalable Analytics [Slides] Introduction to scalable analytics
MapReduce
The Streams API
Apache Spark
Lambdas and serialization
Zaharai et al.: "Spark: Cluster Computing with Working Sets"
Mar 4-12Spring Break
Mar 13 Spark basics [Slides] Spark jobs
Working with files
Spark transformations
Spark actions
The Structured API
Zaharia et al.: "Resilient Distributed Datasets" HW6 due; HW7 released
Mar 15 Spark continued [Slides] HDFS
Apache Livy
Distributed shared variables
Graph algorithms in Spark
Shvachko: "Apache Hadoop: The Scalability Update" Project handout released
Mar 17Last day to pass/fail
Mar 20 Crawling [Slides] Structure of the Web
Crawling basics
SEO
Crawler etiquette
Heydon and Nayork: "Mercator: A scalable, extensible Web crawler" HW7 due; HW8 released
Mar 22 Information retrieval [Slides] Basic IR model; precision/recall
Boolean model
Vector model
TF/IDF
Stemming and lemmatization
Chapter 1 in "An Introduction to Information Retrieval" Team registrations due; project begins
Mar 27Last day to withdraw
Mar 27 Authoritativeness [Slides] Motivation: off-page features
HITS
PageRank
Sinks and hogs
Brin and Page: "The PageRank Citation Ranking: Bringing Order to the Web" HW8 due; HW9 released
Mar 29 Search engines [Slides] Building a search engine
Case study: Google
Case study: Mercator
Project overview
Modern search
Guest lecture by Raj Singh
Brin and Page: "The Anatomy of a Large-Scale Hypertextual Web Search Engine"
Apr 3 Engineering software systems; virtualization [Slides] Software engineering [Video]
Version control [Video]
Testing [Video]
Debugging [Video]
Effective teams [Video]
Why virtualization? [Video]
Virtualization basics [Video]
Containers [Video]
Serverless computing [Video]
The Agile Coach HW9 due (now on April 8th)
Apr 5 Decentralized systems [Slides] (NO IN-PERSON CLASS - ANDREAS TRAVELING)
Centralization and its effects
Partly centralized systems
Unstructured overlays
Structured overlays
Druschel and Rodrigues: "Peer-to-Peer Systems"
Apr 10 Key-based routing; DHTs [Slides] Consistent hashing and DHTs
Key-based routing
Basic Chord
Fault tolerance in Chord
KBR and security
Stoica et al.: "Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications"
Apr 12 Advanced Fault Tolerance [Slides] Non-crash fault models
State-machine replication
Paxos
The Byzantine Generals Problem
Byzantine Fault Tolerance
Schneider: Implementing Fault-Tolerant Services Using the State Machine Approach
Apr 17 Transactions [Slides] Introduction to transactions
Concurrency control
Log-based recovery
Two-phase commit
Distributed concurrency control
Shute et al.: "F1: A Distributed SQL Database That Scales"
Apr 19 Security [Slides] Threat models
Crypto basics
Digital signatures
Attacks and Defenses (part 1)
Attacks and Defenses (part 2)
OWASP Top 10
Apr 24 Privacy [Slides] Privacy overview
Differential privacy
Sensitivity analysis
Federated Analytics
Honeycrisp
Differential Privacy: The Pursuit of Protections by Default
Apr 26Second midterm exam
Apr 27-30Reading days
May 1-9Finals period (in-person project demos)