Infrastructure for Advanced Analytics and Machine Learning
Course Overview
Location: LMU Öttingenstr 67, 80538 München
Instructors:
- Professor Dr. Dieter Kranzlmüller
- Dr. Andre Luckow, andre.luckow@ifi.lmu.de
- Maximilian Höb, hoebm@nm.ifi.lmu.de
The ongoing data deluge, driven by the increasing digitalization of science, society and industry, leads to a significant increase in demand for data storage, processing and analytics across many industrial domains. Science and industry alike are overwhelmed by the need to store large amounts of transactional and machine-generated data resulting from customer, service and manufacturing processes. Examples of machine-generated data are server logs and sensor data, which are generated at ever finer granularities and higher frequencies. In addition, datasets are often enriched with web and open data from social media, blogs or other open data sources. The Internet of Things (IoT) will further blur the boundaries between the physical and the digital world, increasing the world's digital footprint even more.
In this course, we will learn about data applications and their requirements, and discuss the core infrastructure necessary to handle large data volumes and analytical problems. As part of the exercises, students will use different frameworks, e.g. MapReduce and Spark, to implement different algorithms.
Course Topics:
This class will cover the following topics:
- Data Applications in Industry and Sciences
- Resource Management: YARN, Mesos and Kubernetes
- NoSQL and Hadoop: Bigtable and HBase
- Hadoop Processing Engines: Spark, Flink
- SQL on Hadoop: Impala, Hive, Spark, Presto
- Stream Processing: Kafka, Spark Streaming, Flink, Heron
- Data Governance and Security
- Fault Tolerance: CAP Theorem, Eventual Consistency, Quorum Protocols, Apache ZooKeeper
- Hadoop Architectures
- Data in the Cloud: Elastic MapReduce, Azure HDInsight, Google Cloud Dataflow
- Advanced Analytics and Machine Learning (Apache Mahout, MLlib)
- Graph Processing Frameworks (NetworkX)
- Natural Language Processing
- Deep Learning: Convolutional Neural Networks
The course will be offered as a block lecture.
Prerequisites:
- Linux
- Python
- HPC course or equivalent experience
Time:
The course will be given as a block lecture from April 3 to 7, 2018.
Grading
- Submission and documentation of exercises: 50%
- Written exam: 50%
Material
- Course Web Page, http://www.nm.ifi.lmu.de/teaching/Vorlesungen/2018ss/_data-analytics/
- Exercises, https://scalable-infrastructure.github.io/exercise.html
Attendance
Attendance in all classes is required. Students with more than 3 unexcused absences will not receive a passing grade in the class. An unexcused absence is any absence for which the instructor was not notified before the start of the class. If the instructor does not arrive on time for class, students are expected to wait at least 15 minutes before leaving.
Academic Integrity Statement
As members of the Clemson University community, we have inherited Thomas Green Clemson’s vision of this institution as a “high seminary of learning.” Fundamental to this vision is a mutual commitment to truthfulness, honor, and responsibility, without which we cannot earn the trust and respect of others. Furthermore, we recognize that academic dishonesty detracts from the value of a Clemson degree. Therefore, we shall not tolerate lying, cheating, or stealing in any form.