Exercises
Infrastructure
- We will use the LRZ Linux Cluster: https://www.lrz.de/services/compute/linux-cluster/
- Access via SSH (Windows users can use PuTTY)
- Anaconda/Python 2.7.14: https://www.anaconda.com/download/
- Python Documentation: http://docs.python.org/
1. Exercise 1: Data on HPC
Data/Tools:
- Use an SSH client of your choice (e.g. PuTTY on Windows or the ssh command in a Linux/macOS terminal)
- Data: http://ita.ee.lbl.gov/traces/NASA_access_log_Jul95.gz
- Please log in to the LRZ Linux Cluster!
- Set up keyless (SSH key-based) login to the LRZ cluster!
- Start an interactive job and run an interactive Jupyter Notebook!
2. Exercise 2: MapReduce
2.1 Command-Line Data Analytics
Data/Tools:
- http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html
- Command-line data tools: https://github.com/bitly/data_hacks
- Data:
cloud.luckow-hm.de:/data/NASA_access_log_Jul95
- Use the commands head, cat, uniq, wc, sort, find, xargs, and awk to evaluate the NASA log file:
- Which page was called the most?
- What was the most frequent return code?
- How many errors occurred? What is the percentage of errors?
- Implement a Python version of this Unix shell script, using this script as a template (a minimal sketch follows below)!
- Run the Python script inside a Hadoop Streaming job:
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar -info
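A minimal mapper/reducer sketch for the Hadoop Streaming job, assuming the Common Log Format of the NASA trace and counting HTTP response codes; the file names mapper.py and reducer.py are illustrative, not part of the exercise material:

#!/usr/bin/env python
# mapper.py (name assumed): emit "<response_code>\t1" for every request line read from stdin.
import sys

for line in sys.stdin:
    fields = line.split()
    if len(fields) < 2:
        continue
    code = fields[-2]  # the HTTP response code is the second-to-last field
    print("%s\t1" % code)

#!/usr/bin/env python
# reducer.py (name assumed): sum the counts per response code.
# Hadoop Streaming sorts the mapper output by key before it reaches the reducer.
import sys

current_key, current_count = None, 0
for line in sys.stdin:
    key, count = line.rstrip("\n").split("\t")
    if key != current_key:
        if current_key is not None:
            print("%s\t%d" % (current_key, current_count))
        current_key, current_count = key, 0
    current_count += int(count)
if current_key is not None:
    print("%s\t%d" % (current_key, current_count))

The two scripts can be tested locally by piping the log file through them (with sort in between) and then submitted with the streaming jar shown above via its -files, -mapper, -reducer, -input, and -output options.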
2.2 MapReduce Hello World
Data/Tools:
- MapReduce Application:
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount
Run the WordCount example of Hadoop:
- Create two test files containing text and upload them to HDFS!
- Use the MapReduce program WordCount for processing these files!
3. Exercise 3: Spark
3.1 Spark
Data/Tools:
- Spark Programming Guide: https://spark.apache.org/docs/1.1.0/programming-guide.html (using the Python API is recommended)
- Spark API: https://spark.apache.org/docs/1.1.0/api/python/index.html
- Implement a word count using Spark (see the sketch after this list). Make sure that you allocate only 1 core for the interactive Spark shell:
pyspark --total-executor-cores 1
- Implement the NASA log file analysis using Spark!
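A minimal PySpark sketch for both tasks, written for the RDD API of the interactive pyspark shell (where sc is predefined); the word-count input path input.txt is an assumption, the log path is taken from the earlier exercises, and the log analysis only shows the response-code count as one example:

# Word count: split lines into words, map each word to (word, 1), sum per word.
words = sc.textFile("input.txt") \
          .flatMap(lambda line: line.split()) \
          .map(lambda word: (word, 1)) \
          .reduceByKey(lambda a, b: a + b)
print(words.take(10))

# NASA log analysis (example): occurrences of each HTTP response code,
# which is the second-to-last field of every log line.
codes = sc.textFile("/data/NASA_access_log_Jul95") \
          .map(lambda line: line.split()) \
          .filter(lambda fields: len(fields) >= 2) \
          .map(lambda fields: (fields[-2], 1)) \
          .reduceByKey(lambda a, b: a + b)
print(codes.collect())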
4. SQL Engines
Data/Tools:
- Hive User Guide: https://cwiki.apache.org/confluence/display/Hive/GettingStarted
- Hive ORC: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/orcfile.html
- Hive Parquet: http://www.cloudera.com/content/cloudera/en/documentation/cdh5/v5-0-0/CDH5-Installation-Guide/cdh5ig_parquet.html
- Create a Hive table for the NASA log files! Use either python or awk to convert the log file to a structured format (CSV) that is manageable by Hive (a Python sketch follows at the end of this list)! Use the text format for the table definition!
cat /data/NASA_access_log_Jul95 | awk -F' ' '{print "\""$4 $5"\","$(NF-1)","$(NF)}' > nasa.csv
- Run an SQL query that outputs the number of occurrences of each HTTP response code!
- Based on the initially created table, define an ORC-based and a Parquet-based table. Repeat the query!
- Run the same query with Impala!
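A minimal Python alternative to the awk one-liner above; it writes the same three columns (quoted timestamp, HTTP response code, bytes) to nasa.csv. The script name convert_log.py and the field positions (Common Log Format) are assumptions:

#!/usr/bin/env python
# convert_log.py (name assumed): convert the NASA access log to a CSV file for Hive.
with open("/data/NASA_access_log_Jul95") as infile, open("nasa.csv", "w") as outfile:
    for line in infile:
        fields = line.split()
        if len(fields) < 7:
            continue  # skip malformed lines
        timestamp = fields[3] + fields[4]  # fields 4 and 5 of the log line, concatenated as in the awk one-liner
        code = fields[-2]                  # HTTP response code
        size = fields[-1]                  # bytes transferred ("-" if unknown)
        outfile.write('"%s",%s,%s\n' % (timestamp, code, size))

The resulting nasa.csv can then be loaded into a text-format Hive table whose columns match these three fields.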
6. Data Analytics
Data/Tools:
- Spark MLLib KMeans Example: https://spark.apache.org/docs/1.1.0/mllib-clustering.html
- Run KMeans on the provided example dataset!
- Validate the quality of the model using the sum of the squared error for each point (see the sketch below)!
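A minimal PySpark MLlib sketch along the lines of the linked clustering example, to be run in the pyspark shell (sc predefined); the dataset path kmeans_data.txt and the choice of k=2 are assumptions for the provided example data:

from numpy import array
from pyspark.mllib.clustering import KMeans

# Load and parse the data: one space-separated point per line.
data = sc.textFile("kmeans_data.txt")
parsed = data.map(lambda line: array([float(x) for x in line.split(" ")]))

# Train a KMeans model (k=2 assumed for the example data).
clusters = KMeans.train(parsed, 2, maxIterations=10, runs=10, initializationMode="random")

# Model validation: sum over all points of the squared distance to the nearest cluster center.
def squared_error(point):
    center = clusters.centers[clusters.predict(point)]
    return sum([x ** 2 for x in (point - center)])

sse = parsed.map(squared_error).reduce(lambda a, b: a + b)
print("Sum of squared errors = " + str(sse))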
7. Hadoop Benchmarking
- Run the program TeraSort on 1 GB of data. Each record that TeraGen generates is 100 bytes in size, so 1 GB corresponds to 10,000,000 records:
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen <number_of_records> <output_directory>
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar terasort <input_directory> <output_directory>
- How many containers are consumed during each phase of the application (TeraGen; TeraSort map phase and reduce phase)? Please explain! See the blog post.