Exercises
Infrastructure
- We will use the LRZ Linux Cluster: https://www.lrz.de/services/compute/linux-cluster/
- Access via SSH (Windows users can use PuTTY)
- Anaconda/Python 2.7.14: https://www.anaconda.com/download/
- Python Documentation: http://docs.python.org/
1. Exercise 1: Data on HPC
Data/Tools:
- Use an SSH client of your choice (e.g. PuTTY on Windows or the ssh command in a Linux/macOS terminal)
- Data: http://ita.ee.lbl.gov/traces/NASA_access_log_Jul95.gz
- Please log in to the LRZ Linux Cluster!
- Set up keyless (SSH key-based) login to the LRZ cluster!
- Start an interactive job and run an interactive Jupyter Notebook!
2. Exercise 2: MapReduce
2.1 Command-Line Data Analytics
Data/Tools:
- http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html
- Command-line data tools: https://github.com/bitly/data_hacks
- Data:
cloud.luckow-hm.de:/data/NASA_access_log_Jul95
- Use the commands head, cat, uniq, wc, sort, find, xargs, and awk to evaluate the NASA log file:
- Which page was called the most?
- What was the most frequent return code?
- How many errors occurred? What is the percentage of errors?
- Implement a Python version of this Unix shell script, using this script as a template (a minimal sketch follows below)!
- Run the Python script inside a Hadoop Streaming job:
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar -info
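A minimal mapper/reducer sketch for the Hadoop Streaming job, assuming the Common Log Format of the NASA trace and counting HTTP response codes; the file names mapper.py and reducer.py are illustrative, not part of the exercise material:

#!/usr/bin/env python
# mapper.py (name assumed): emit "<response_code>\t1" for every request line read from stdin.
import sys

for line in sys.stdin:
    fields = line.split()
    if len(fields) < 2:
        continue
    code = fields[-2]  # the HTTP response code is the second-to-last field
    print("%s\t1" % code)

#!/usr/bin/env python
# reducer.py (name assumed): sum the counts per response code.
# Hadoop Streaming sorts the mapper output by key before it reaches the reducer.
import sys

current_key, current_count = None, 0
for line in sys.stdin:
    key, count = line.rstrip("\n").split("\t")
    if key != current_key:
        if current_key is not None:
            print("%s\t%d" % (current_key, current_count))
        current_key, current_count = key, 0
    current_count += int(count)
if current_key is not None:
    print("%s\t%d" % (current_key, current_count))

The two scripts can be tested locally by piping the log file through them (with sort in between) and then submitted with the streaming jar shown above via its -files, -mapper, -reducer, -input, and -output options.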
2.2 MapReduce Hello World
Data/Tools:
- MapReduce Application:
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount
Run the WordCount example of Hadoop:
- Create two test files containing text and upload them to HDFS!
- Use the MapReduce program WordCount for processing these files!
3. Exercise 3: Spark
3.1 Spark
Data/Tools:
- Spark Programming Guide: https://spark.apache.org/docs/1.1.0/programming-guide.html (using the Python API is recommended)
- Spark API: https://spark.apache.org/docs/1.1.0/api/python/index.html
- Implement a word count using Spark (see the sketch after this list). Make sure that you allocate only 1 core for the interactive Spark shell:
pyspark --total-executor-cores 1
- Implement the NASA log file analysis using Spark!
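A minimal PySpark sketch for both tasks, written for the RDD API of the interactive pyspark shell (where sc is predefined); the word-count input path input.txt is an assumption, the log path is taken from the earlier exercises, and the log analysis only shows the response-code count as one example:

# Word count: split lines into words, map each word to (word, 1), sum per word.
words = sc.textFile("input.txt") \
          .flatMap(lambda line: line.split()) \
          .map(lambda word: (word, 1)) \
          .reduceByKey(lambda a, b: a + b)
print(words.take(10))

# NASA log analysis (example): occurrences of each HTTP response code,
# which is the second-to-last field of every log line.
codes = sc.textFile("/data/NASA_access_log_Jul95") \
          .map(lambda line: line.split()) \
          .filter(lambda fields: len(fields) >= 2) \
          .map(lambda fields: (fields[-2], 1)) \
          .reduceByKey(lambda a, b: a + b)
print(codes.collect())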
4. SQL Engines
Data/Tools:
- Hive User Guide: https://cwiki.apache.org/confluence/display/Hive/GettingStarted
- Hive ORC: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/orcfile.html
- Hive Parquet: http://www.cloudera.com/content/cloudera/en/documentation/cdh5/v5-0-0/CDH5-Installation-Guide/cdh5ig_parquet.html
- Create a Hive table for the NASA log files! Use either python or awk to convert the log file to a structured format (CSV) that is manageable by Hive (a Python sketch follows at the end of this list)! Use the text format for the table definition!
cat /data/NASA_access_log_Jul95 | awk -F' ' '{print "\""$4 $5"\","$(NF-1)","$(NF)}' > nasa.csv
- Run an SQL query that outputs the number of occurrences of each HTTP response code!
- Based on the initially created table, define an ORC-based and a Parquet-based table. Repeat the query!
- Run the same query with Impala!
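A minimal Python alternative to the awk one-liner above; it writes the same three columns (quoted timestamp, HTTP response code, bytes) to nasa.csv. The script name convert_log.py and the field positions (Common Log Format) are assumptions:

#!/usr/bin/env python
# convert_log.py (name assumed): convert the NASA access log to a CSV file for Hive.
with open("/data/NASA_access_log_Jul95") as infile, open("nasa.csv", "w") as outfile:
    for line in infile:
        fields = line.split()
        if len(fields) < 7:
            continue  # skip malformed lines
        timestamp = fields[3] + fields[4]  # fields 4 and 5 of the log line, concatenated as in the awk one-liner
        code = fields[-2]                  # HTTP response code
        size = fields[-1]                  # bytes transferred ("-" if unknown)
        outfile.write('"%s",%s,%s\n' % (timestamp, code, size))

The resulting nasa.csv can then be loaded into a text-format Hive table whose columns match these three fields.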
6. Data Analytics
Data/Tools:
- Spark MLLib KMeans Example: https://spark.apache.org/docs/1.1.0/mllib-clustering.html
- Run KMeans on the provided example dataset!
- Validate the quality of the model using the sum of the squared error for each point (see the sketch below)!
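A minimal PySpark MLlib sketch along the lines of the linked clustering example, to be run in the pyspark shell (sc predefined); the dataset path kmeans_data.txt and the choice of k=2 are assumptions for the provided example data:

from numpy import array
from pyspark.mllib.clustering import KMeans

# Load and parse the data: one space-separated point per line.
data = sc.textFile("kmeans_data.txt")
parsed = data.map(lambda line: array([float(x) for x in line.split(" ")]))

# Train a KMeans model (k=2 assumed for the example data).
clusters = KMeans.train(parsed, 2, maxIterations=10, runs=10, initializationMode="random")

# Model validation: sum over all points of the squared distance to the nearest cluster center.
def squared_error(point):
    center = clusters.centers[clusters.predict(point)]
    return sum([x ** 2 for x in (point - center)])

sse = parsed.map(squared_error).reduce(lambda a, b: a + b)
print("Sum of squared errors = " + str(sse))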
7. Hadoop Benchmarking
- Run the program TeraSort on 1 GB of data. Each record that TeraGen generates is 100 bytes in size, so 1 GB corresponds to 10,000,000 records:
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen <number_of_records> <output_directory>
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar terasort <input_directory> <output_directory>
- How many containers are consumed during each phase of the application (TeraGen; TeraSort map phase and reduce phase)? Please explain! See the blog post.