- Office hours Wednesday 1-2pm
- With very large amounts of data and heavy processing needs, a single machine cannot provide all the required capacity
- Instead, we can connect dozens of machines into a cluster, each machine with its own CPU, disks, and RAM
- Three exams: October, November, December. Not comprehensive
- Grading: 40% exams, 55% assignments, 5% quizzes
Topics
- Hadoop
  - Key-value pairs
  - MapReduce
  - Good for fixed-size (batch) data, not streaming data
- Spark
  - Developed at UC Berkeley and more widely used now
  - Can handle both streaming and non-streaming data
- Unsupervised Learning
  - Clustering
- Recommendation Systems
  - Content-based recommendation
  - Collaborative Filtering
  - Latent Factor Model
- Large Language Models, if time in the semester permits
- Types of Data
  - Structured
  - Unstructured
  - Semi-Structured
    - e.g. HTML / XML, where content is placed in tagged blocks
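
To make the key-value pair and MapReduce bullets under Hadoop more concrete, here is a minimal single-machine sketch of the idea in plain Python. It is not Hadoop's actual API: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample documents are made up for illustration, and a real cluster would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) key-value pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single (key, total) pair."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a fixed list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: each string stands in for a file stored on the cluster.
counts = word_count(["big data big cluster", "data on a cluster"])
print(counts)
```

Note that the input is a fixed list that is fully available before processing starts, which matches the batch (non-streaming) setting noted for Hadoop.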