- We will utilize the LRZ Linux Cluster: https://www.lrz.de/services/compute/linux-cluster/
- Access via SSH (Windows users can use PuTTY)
- Anaconda/Python 2.7.14: https://www.anaconda.com/download/
- Python Documentation: http://docs.python.org/
1. Exercise 1: Data on HPC
- Use an SSH client of your choice (e.g. PuTTY on Windows, or ssh in a Linux/macOS terminal)
- Data: http://ita.ee.lbl.gov/traces/NASA_access_log_Jul95.gz
Please log in to the LRZ Linux Cluster!
Set up passwordless SSH login to the LRZ cluster!
- Start an interactive job and run an interactive Jupyter Notebook
2. Exercise 2: MapReduce
2.1 Command-Line Data Analytics
- http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html
- Command-line data tools: https://github.com/bitly/data_hacks
- Data:
Use the commands
to evaluate the NASA log file:
- Which page was called the most?
- What was the most frequent return code?
- How many errors occurred? What is the percentage of errors?
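As a cross-check for the command-line results, the three questions above can also be answered with a short Python sketch. It assumes the Common Log Format of the NASA trace (request in double quotes, status code as the second-to-last field) and treats status codes of 400 and above as errors; the exercise does not define "error", so that threshold is an assumption. The names `parse_line` and `summarize` are illustrative only.

```python
import re
from collections import Counter

# Assumed layout: host - - [timestamp] "METHOD /path PROTOCOL" status bytes
LOG_PATTERN = re.compile(r'"(?:\S+)\s+(\S+)[^"]*"\s+(\d{3})\s+\S+\s*$')


def parse_line(line):
    """Return (path, status) for a log line, or None if it does not match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def summarize(lines):
    """Count pages and status codes over an iterable of log lines."""
    pages = Counter()
    codes = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # malformed lines are skipped, not counted
        path, status = parsed
        pages[path] += 1
        codes[status] += 1
    total = sum(codes.values())
    # Assumption: a status code >= 400 counts as an error.
    errors = sum(n for code, n in codes.items() if code >= 400)
    return {
        "top_page": pages.most_common(1)[0] if pages else None,
        "top_code": codes.most_common(1)[0] if codes else None,
        "errors": errors,
        "error_percent": 100.0 * errors / total if total else 0.0,
    }
```

`summarize` accepts any iterable of text lines, so it can be fed an open file handle of the uncompressed log. Lines that do not match the assumed layout are silently dropped, which slightly changes the totals compared with a pure `wc -l` count.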
Implement a Python version of this Unix shell script using this script as a template!
Run the Python script inside a Hadoop Streaming job.
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar -info
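A Hadoop Streaming job reads records from standard input and writes tab-separated key/value pairs to standard output. The following is a minimal sketch of a mapper and reducer that count HTTP status codes, not the course's template script. It assumes the status code is the second-to-last whitespace-separated field; the function names and the `map` argument are hypothetical.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit 'status<TAB>1' for every line with a numeric status field."""
    for line in lines:
        fields = line.split()
        # Assumption: the status code is the second-to-last field.
        if len(fields) >= 2 and fields[-2].isdigit():
            yield "{}\t1".format(fields[-2])


def reducer(lines):
    """Sum the counts per key; Hadoop Streaming delivers keys sorted."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if "\t" in line)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "{}\t{}".format(key, sum(int(count) for _, count in group))


def main(argv, stdin, stdout):
    """Run the mapper if the first argument is 'map', else the reducer."""
    step = mapper if argv[1:2] == ["map"] else reducer
    for record in step(stdin):
        stdout.write(record + "\n")

# To use this file as a streaming script, end it with:
#     main(sys.argv, sys.stdin, sys.stdout)
```

The same file can then be passed to the streaming jar through its `-mapper` and `-reducer` options, once with the `map` argument and once without. Because the reducer relies on sorted input, it can be tested locally by piping the mapper output through `sort` first.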
2.2 MapReduce Hello World
- MapReduce Application:
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount
Run the WordCount example of Hadoop:
- Create two test files containing text and upload them to HDFS!
- Use the MapReduce program WordCount for processing these files!
3. Exercise 3: Spark
3.1 Spark
- Spark Programming Guide: https://spark.apache.org/docs/1.1.0/programming-guide.html (Python API recommended)
- Spark API: https://spark.apache.org/docs/1.1.0/api/python/index.html
Implement a wordcount using Spark. Make sure that you only allocate 1 core for the interactive Spark shell:
pyspark --total-executor-cores 1
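One possible shape for the word count, as a sketch rather than a reference solution: it relies on the `sc` SparkContext that the interactive `pyspark` shell provides, and the helper names and the HDFS path in the comment are hypothetical.

```python
import re


def tokenize(line):
    """Split a line into lowercase words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", line.lower())


def word_count(sc, path):
    """Return an RDD of (word, count) pairs for the text file at `path`."""
    return (sc.textFile(path)
              .flatMap(tokenize)
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))

# In the pyspark shell, where `sc` already exists (the path is hypothetical):
#     word_count(sc, "hdfs:///user/<your_user>/test1.txt").take(10)
```

Keeping `tokenize` as a plain function makes it testable without a cluster; only `word_count` needs a running Spark context.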
Implement the NASA log file analysis using Spark!
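For the log analysis, the same map and reduce pattern applies with the status code as the key. The sketch below assumes the status code is the second-to-last field of each line and that codes of 400 and above count as errors; both are assumptions, and all function names are illustrative.

```python
def status_of(line):
    """Return the HTTP status code of a log line as int, or None."""
    fields = line.split()
    # Assumption: the status code is the second-to-last field.
    if len(fields) >= 2 and fields[-2].isdigit():
        return int(fields[-2])
    return None


def status_counts(sc, path):
    """Return a dict {status: count}; the result is small enough to collect."""
    return (sc.textFile(path)
              .map(status_of)
              .filter(lambda status: status is not None)
              .map(lambda status: (status, 1))
              .reduceByKey(lambda a, b: a + b)
              .collectAsMap())


def error_percentage(counts):
    """Share of requests with status >= 400 (assumed error definition)."""
    total = sum(counts.values())
    errors = sum(n for code, n in counts.items() if code >= 400)
    return 100.0 * errors / total if total else 0.0
```

The most requested page can be found the same way by keying on the request path instead of the status code and taking the pair with the largest count.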
4. SQL Engines
- Hive User Guide: https://cwiki.apache.org/confluence/display/Hive/GettingStarted
- Hive ORC: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-
- Hive Parquet: http://www.cloudera.com/content/cloudera/en/documentation/cdh5/v5-0-0/CDH5-Installation-Guide/cdh5ig_parquet.html
Create a Hive table for the NASA Log files! Use either
to convert the log file to a structured format (CSV) that is manageable by Hive! Use the text format for the table definition!
cat /data/NASA_access_log_Jul95 | awk -F' ' '{print "\""$4 $5"\","$(NF-1)","$(NF)}' > nasa.csv
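To make the output of the awk one-liner above easier to follow, here is a rough Python equivalent. It produces three columns: the quoted timestamp (fields 4 and 5 joined without a space), the status code, and the byte count. Unlike awk, this sketch skips lines with fewer than five fields; the function names are hypothetical.

```python
def to_csv_row(line):
    """Mirror the awk command: "<field4><field5>",<second-to-last>,<last>."""
    fields = line.split()
    if len(fields) < 5:
        return None  # awk would still print something; this sketch skips
    return '"{}{}",{},{}'.format(fields[3], fields[4], fields[-2], fields[-1])


def convert(lines, out):
    """Write one CSV row per usable input line to the file-like `out`."""
    for line in lines:
        row = to_csv_row(line)
        if row is not None:
            out.write(row + "\n")
```

A matching Hive table would therefore need three columns, for example a string for the timestamp and two further columns for status and size. Note that the byte count is `-` for some requests, which matters when choosing a numeric column type.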
Run an SQL query that outputs the number of occurrences of each HTTP response code!
Based on the initially created table, define an ORC-based and a Parquet-based table. Repeat the query!
Run the same query with Impala!
6. Data Analytics
- Spark MLLib KMeans Example: https://spark.apache.org/docs/1.1.0/mllib-clustering.html
Run KMeans on the provided example dataset!
Validate the quality of the model using the sum of squared errors over all points!
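The error measure can be sketched as follows: for each point, take the squared Euclidean distance to its nearest cluster center, then sum over all points. The pure-Python helpers run without Spark; `spark_sse` is an untested outline that assumes the `pyspark.mllib` API as shown in the Spark 1.1.0 clustering guide (`KMeans.train`, `model.centers`) and a whitespace-separated input file. All function names are illustrative.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def sum_squared_error(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


def spark_sse(sc, path, k):
    """Train KMeans on a whitespace-separated file and return its SSE."""
    # Imported here so the helpers above work without Spark installed.
    from numpy import array
    from pyspark.mllib.clustering import KMeans

    data = sc.textFile(path).map(
        lambda line: array([float(x) for x in line.split()]))
    data.cache()
    model = KMeans.train(data, k, maxIterations=10)
    centers = [list(c) for c in model.centers]
    return (data.map(lambda p: min(squared_distance(p, c) for c in centers))
                .reduce(lambda a, b: a + b))
```

Comparing the value for several choices of `k` shows how the error shrinks as clusters are added; a lower sum alone does not mean a better model, since it always decreases with larger `k`.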
7. Hadoop Benchmarking
Run the program
on 1 GB of data (each record that TeraGen generates is 100 bytes in size):
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen <number_of_records> <output_directory>
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar terasort <input_directory> <output_directory>
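The value for `<number_of_records>` follows from dividing the target size by the record size. A small worked calculation, shown for both the decimal and the binary reading of "1 GB" since the exercise does not say which one is meant:

```python
RECORD_SIZE_BYTES = 100  # stated in the exercise


def records_for(total_bytes, record_size=RECORD_SIZE_BYTES):
    """Number of whole records that fit into total_bytes."""
    return total_bytes // record_size


decimal_gb = records_for(10 ** 9)  # 1 GB  = 10^9 bytes -> 10000000 records
binary_gib = records_for(2 ** 30)  # 1 GiB = 2^30 bytes -> 10737418 records
```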
How many containers are consumed during which phase of the application: teragen, terasort (map phase, reduce phase)? Please explain! See blog post.