Big Data and Large-scale Computing (95-869)

The rate and volume of data generated today by both humans and machines are unprecedented. Being able to store, manage, and analyze large-scale data has a critical impact on business intelligence, scientific discovery, and social and environmental challenges.

The goal of this course is to equip students with the understanding, knowledge, and practical skills to develop big data / machine learning solutions with state-of-the-art tools, particularly those in the Spark environment, with a focus on the programming models in MLlib, GraphX, and SparkSQL. Students will also gain hands-on experience with MapReduce and Apache Spark using real-world datasets.

This course is designed to give graduate-level students a thorough grounding in the technologies and best practices used in big data machine learning. The course assumes that students have an understanding of basic data analysis and machine learning concepts as well as basic knowledge of programming (preferably in Python or Java). Previous experience with Hadoop, Spark, or distributed computing is NOT required.

- understanding of basic machine learning concepts 
(having taken 95-791 Data Mining, 95-828 Machine Learning for Problem Solving, 10-601 Introduction to Machine Learning (Masters), 10-701 Introduction to Machine Learning (PhD), or 10-715 Advanced Introduction to Machine Learning)
- proficiency in Java or Python 

For course-related details (syllabus, assignments, etc.) see:

Learning Objectives: 

By the end of this class, students will
• gain understanding of the MapReduce paradigm and Hadoop ecosystem 
• understand scalability challenges for common ML tasks 
• study distributed machine learning algorithms 
• understand details of SparkSQL, GraphX, and MLlib (Spark's ML library)
• implement distributed pipelines in Apache Spark using real datasets
