Infrastructure


1. Exercise 1: Data on HPC


Data/Tools:

  1. Please log in to the LRZ Linux Cluster!

  2. Set up keyless (SSH key-based) login to the LRZ cluster!

  3. Start an interactive job and run an interactive Jupyter Notebook!

2. Exercise 2: MapReduce

2.1 Command-Line Data Analytics


Data/Tools:


  1. Use the commands head, cat, uniq, wc, sort, find, xargs, and awk to analyze the NASA log file:

    • Which page was called the most?
    • What was the most frequent return code?
    • How many errors occurred? What is the percentage of errors?
  2. Implement a Python version of this Unix shell script, using this script as a template (a minimal mapper/reducer sketch follows below)!

  3. Run the Python script inside a Hadoop Streaming job. The -info option prints the usage information of the streaming jar:

     hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar -info
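
A minimal sketch of what such a Python version might look like, written as a Hadoop Streaming mapper/reducer pair so the same code can be reused for step 3. The file names (mapper.py, reducer.py), the HDFS paths, and the submit command in the comments are placeholders and assumptions, not the course template; the sketch only covers the return-code counting part of the analysis.

     #!/usr/bin/env python
     # mapper.py (sketch) -- emit the HTTP return code of every request, one per line.
     # A streaming submission could look roughly like (paths are placeholders):
     #   hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
     #     -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py \
     #     -input <hdfs_input> -output <hdfs_output>
     import sys

     for line in sys.stdin:
         fields = line.split()
         if len(fields) < 2:
             continue                     # skip malformed lines
         # In the NASA access log the return code is the second-to-last field;
         # fields[6] would give the requested page instead.
         print("%s\t1" % fields[-2])

     #!/usr/bin/env python
     # reducer.py (sketch) -- sum the counts per key; Hadoop delivers the mapper
     # output sorted by key, so equal keys arrive as one contiguous run.
     import sys

     current_key, current_count = None, 0
     for line in sys.stdin:
         key, count = line.rstrip("\n").split("\t")
         if key != current_key:
             if current_key is not None:
                 print("%s\t%d" % (current_key, current_count))
             current_key, current_count = key, 0
         current_count += int(count)
     if current_key is not None:
         print("%s\t%d" % (current_key, current_count))

Because Hadoop Streaming simply pipes data through stdin/stdout, the same two scripts can also be tested locally by piping the log file through the mapper, sort, and the reducer.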
    


2.2 MapReduce Hello World


Data/Tools:

  • MapReduce Application: hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount

Run the WordCount example of Hadoop:

  1. Create two test files containing text and upload them to HDFS!
  2. Use the MapReduce program WordCount to process these files!



3. Exercise 3: Spark

3.1 Spark



Data/Tools:

  1. Implement a word count using Spark. Make sure that you allocate only 1 core for the interactive Spark shell:

     pyspark --total-executor-cores 1
    
  2. Implement the NASA log file analysis using Spark (see the PySpark sketch below)!
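
A minimal PySpark sketch for both tasks, meant to be typed into the interactive pyspark shell started as above (where the SparkContext sc is predefined); the input paths are placeholders for files you uploaded to HDFS.

     # 1. Word count over the uploaded test files (placeholder path)
     counts = (sc.textFile("test_files")
                 .flatMap(lambda line: line.split())   # split lines into words
                 .map(lambda word: (word, 1))          # emit (word, 1) pairs
                 .reduceByKey(lambda a, b: a + b))     # sum the counts per word
     print(counts.take(10))

     # 2. NASA log analysis: occurrences of each HTTP return code
     #    (the return code is the second-to-last whitespace-separated field)
     codes = (sc.textFile("NASA_access_log_Jul95")     # placeholder path
                .map(lambda line: line.split())
                .filter(lambda fields: len(fields) >= 2)
                .map(lambda fields: (fields[-2], 1))
                .reduceByKey(lambda a, b: a + b))
     print(codes.sortBy(lambda kv: kv[1], ascending=False).take(10))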



4. Exercise 4: SQL Engines



Data/Tools:

  1. Create a Hive table for the NASA log files! Use either Python or awk to convert the log file to a structured format (CSV) that is manageable by Hive (a Python sketch follows after this list)! Use the text format for the table definition!

     cat /data/NASA_access_log_Jul95 | awk -F' ' '{print "\""$4 $5"\","$(NF-1)","$(NF)}' > nasa.csv
    
  2. Run an SQL query that outputs the number of occurrences of each HTTP response code!

  3. Based on the initially created table, define an ORC-based and a Parquet-based table. Repeat the query for both!

  4. Run the same query with Impala!
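
If you prefer Python over awk for step 1, a minimal sketch of the conversion could look like this (the input and output paths are taken from the awk example above). It produces the same three columns as the awk one-liner: the quoted timestamp, the return code, and the number of bytes.

     # convert_log.py (sketch) -- convert the NASA access log to a CSV file for Hive
     with open("/data/NASA_access_log_Jul95") as log, open("nasa.csv", "w") as out:
         for line in log:
             fields = line.split()
             if len(fields) < 7:
                 continue                          # skip malformed lines
             timestamp = fields[3] + fields[4]     # date and timezone, matching the awk "$4 $5"
             out.write('"%s",%s,%s\n' % (timestamp, fields[-2], fields[-1]))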


6. Exercise 6: Data Analytics


Data/Tools:

  1. Run KMeans on the provided example dataset!

  2. Validate the quality of the model using the sum of squared errors, i.e. the squared distance of each point to its closest cluster center (see the PySpark sketch below)!
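
A minimal sketch using Spark MLlib's KMeans in the pyspark shell (sc is predefined); the dataset path and the number of clusters are placeholders, and the dataset is assumed to contain one space-separated numeric vector per line.

     from numpy import array
     from pyspark.mllib.clustering import KMeans

     # Parse the example dataset (placeholder path, assumed format: one vector per line)
     points = (sc.textFile("kmeans_data.txt")
                 .map(lambda line: array([float(x) for x in line.split()])))

     # Train a KMeans model; k and maxIterations are arbitrary choices here
     model = KMeans.train(points, 3, maxIterations=10)

     # Sum of squared errors: squared distance of each point to its closest center
     def squared_error(point):
         center = model.centers[model.predict(point)]
         return float(sum((point - center) ** 2))

     sse = points.map(squared_error).reduce(lambda a, b: a + b)
     print("Sum of squared errors = %f" % sse)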


7. Exercise 7: Hadoop Benchmarking


  1. Run the program TeraSort on 1 GB of data. Each record that TeraGen generates is 100 bytes in size, so 1 GB corresponds to 1,000,000,000 / 100 = 10,000,000 records (assuming 1 GB = 10^9 bytes):

     hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen <number_of_records> <output_directory>
    
     hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar terasort <input_directory> <output_directory>
    
  2. How many containers are consumed during each phase of the application (TeraGen; TeraSort map phase and reduce phase)? Please explain! See the blog post.