Big Data Platforms is a 5 ECTS Master's-level advanced course. It focuses on big data platforms and on the key algorithmic ideas and methods used to implement them. After completing this course you will be able to list many of the key technologies used in big data processing and to select suitable methods for solving challenging big data processing tasks using cloud computing technologies. You will also be able to compare the scalability and fault tolerance implications of the selected methods.
Main topics are:
- distributed computing,
- Warehouse-Scale Computers,
- fault tolerance in distributed systems,
- distributed file systems,
- distributed batch processing with the MapReduce and the Apache Spark (PySpark) computing frameworks, and
- distributed cloud based databases.
The course material will consist of lecture materials and exercises provided by the lecturer.
Course Target Audience
The course is suitable for those who are interested in big data platforms employed in cloud computing and have previous knowledge of programming, database systems, and command line tools. It is an optional course in the Data Science Master's Program and is also suitable for Computer Science Master's Program students and for University of Helsinki exchange students.
To attend this course, you must have:
- basic programming skills (Python),
- skills to work with command line tools in Linux, and
- basic knowledge in database systems (SQL).
Unfortunately, the lecturer is on long-term sick leave. All upcoming lectures (from Lecture 7 onwards) will use recordings from the previous year. The dates shown in those lectures and slides are out of date, so please do not follow them. There will also be no more live lectures via Zoom, so please do not try to attend them.
The lectures of the course will be Zoom-based. Slides and a video recording of each lecture will be made available within 24 hours of the live lecture session. The link to the Zoom lectures is: https://helsinki.zoom.us/j/67373457349?pwd=N3V2dzNrNjg0aWg3Ukk4Wi8yVGN2dz09
|Lecture||Lecture date||Lecture time (EEST)|
|Lecture 1||Tue 6.9.2022||10:15-11:45|
|Lecture 2||Thu 8.9.2022||12:15-13:45|
|Lecture 3||Tue 13.9.2022||10:15-11:45|
|Lecture 4||Thu 15.9.2022||12:15-13:45|
|Lecture 5||Tue 20.9.2022||10:15-11:45|
|Lecture 6||Thu 22.9.2022||12:15-13:45|
|Cancelled - No Lecture - Lecturer having a flu||Tue 27.9.2022||10:15-11:45|
|Cancelled - No Lecture - Lecturer having a flu||Thu 29.9.2022||12:15-13:45|
|Cancelled - No Lecture - Lecturer having a flu||Tue 4.10.2022||10:15-11:45|
|Lecture 7||Thu 6.10.2022||12:15-13:45|
|Lecture 8||Tue 11.10.2022||10:15-11:45|
|Lecture 9||Thu 13.10.2022||12:15-13:45|
|Lecture 10||Tue 18.10.2022||10:15-11:45|
|Lecture 11||Thu 20.10.2022||12:15-13:45|
Lecture Slides and Videos
The lecture slides contain all the material needed to pass the course. The videos go through the same material and contain no additional information needed for the quizzes.
|Lecture||Lecture Slides||Lecture Videos|
|Lecture 1||Lecture 1 slides||Lecture 1 video|
|Lecture 2||Lecture 2 slides||Lecture 2 video|
|Lecture 3||Lecture 3 slides||Lecture 3 video|
|Lecture 4||Lecture 4 slides||Lecture 4 video|
|Lecture 5||Lecture 5 slides||Lecture 5 video|
|Lecture 6||Lecture 6 slides||Lecture 6 video|
|Lecture 7||Lecture 7 slides||Lecture 7 video|
|Lecture 8||Lecture 8 slides||Lecture 8 video|
|Lecture 9||Lecture 9 slides||Lecture 9 video|
|Lecture 10||Lecture 10 slides||Lecture 10 video|
|Lecture 11||Lecture 11 slides||Lecture 11 video|
Home Exercise Schedule
The course contains programming exercises in which you will use the Spark framework to solve big data processing tasks. We will use PySpark, the Python interface to Spark, and many of the tasks are analytics queries similar to database queries. Basic Python programming skills and knowledge of database programming, especially the SQL query language, will therefore be very helpful for completing the home exercises.
The schedule for the home exercises will be announced in the first Lecture:
|Exercise||Release Date||Due Date (23:59 EEST)|
|Introduction to Spark + RDD Programming||13.9||11.10|
|Machine Learning (MLlib)||27.9||18.10|
|Extras (Optional for extra points)||18.10||1.11|
The Home Exercise System
To complete the exercises you have two options. Option 1 is to use the JupyterHub notebook platform. Option 2 is to download the assignment files and complete them locally.
Option 1: On JupyterHub Server
Note: We are currently having major server problems, so we strongly suggest using Option 2 for now.
To complete the exercises you will have to use the JupyterHub notebook platform, which is here:
A short introduction video on how to use the platform to complete and submit the exercises can be found at:
Assignment grades are not published instantly. The grading scripts run once per hour, and the system does not support instant grading on submission, so it can take a few hours for grades to be published or updated. Please allow some time before checking your assignment grades. Once the grades are assigned, you can use the "fetch feedback" feature to find out which exercises failed (if any). Note that additional tests are run on the server side, and these affect the grading scores. You can resubmit each exercise several times; the best score obtained for each exercise is recorded in the course points.
Use the assignment submission validation feature before submitting if you want more confidence in your submission's correctness and final grade. Note that the validation feature can be slow; you can also run the tests separately by running the respective test cells. If the whole notebook runs without validation errors, all the tests in the notebook have passed.
For general course-related questions and group discussions, use our Telegram group. Try to use the group to help each other out and to discuss exercise and course-related issues. We will also periodically check the group to address the more important unanswered questions.
Option 2: On Local Environment
You can download the released homework from here:
To complete the exercises locally you will have to use Jupyter Notebook. You can do the assignments in a virtual machine. We offer a virtual machine image with all the necessary libraries installed; the image and the instructions can be found here:
(Using VirtualBox is not strictly necessary. You can also complete the assignments in your own Python environment if you have experience with Jupyter Notebook and with installing PySpark and the other required libraries.)
To submit the assignments, use the submission box at the bottom of this page. Note that the submission box is only visible once you have logged in.
Course Telegram Group
The course has a Telegram group where students can help each other. The lecturer and the course assistant will also periodically join the conversation. You can join the group through the link:
Passing the Course
To pass the course, you need to pass the lecture quizzes by 1 November 2022 and the home exercises by their respective deadlines listed above. A minimum of 50% of the home exercise total and a minimum of 50% of the quiz total are both required. Points from the Extras round count towards your home exercise points, but the home exercise contribution cannot exceed 100%. The percentages obtained from the home exercises and the quizzes are then combined, each with a 50% weight. The final total percentage determines the grade as follows:
- 90%-100%: grade 5
- 80%-89%: grade 4
- 70%-79%: grade 3
- 60%-69%: grade 2
- 50%-59%: grade 1
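The grading rules above can be sketched in a few lines of Python. This is only an illustration, not the official grading script: the function name `course_grade` and its parameters are hypothetical, all inputs are assumed to be percentages of the regular maximum points, the Extras points are assumed to count towards the 50% minimum, and totals that fall between the listed bands (for example 89.5%) are assumed to belong to the lower band.

```python
from typing import Optional


def course_grade(exercise_pct: float, extra_pct: float, quiz_pct: float) -> Optional[int]:
    """Return the course grade (1-5), or None if the course is not passed.

    Hypothetical helper: all arguments are percentages (0-100) of the
    regular maximum points for that component.
    """
    # Extras count towards the home exercises, but the total is capped at 100%.
    # Assumption: the capped value is also what the 50% minimum is checked against.
    home_pct = min(100.0, exercise_pct + extra_pct)

    # Both components need at least 50% on their own.
    if home_pct < 50.0 or quiz_pct < 50.0:
        return None

    # Home exercises and quizzes are each weighted 50%.
    total = 0.5 * home_pct + 0.5 * quiz_pct

    # Assumption: the lower bound of each band is treated as a threshold.
    for threshold, grade in ((90, 5), (80, 4), (70, 3), (60, 2), (50, 1)):
        if total >= threshold:
            return grade
    return None  # not reachable when both minimums are met


if __name__ == "__main__":
    print(course_grade(exercise_pct=95, extra_pct=10, quiz_pct=70))
```

In the usage example, the home exercise percentage is capped at 100%, so the total is 0.5 * 100 + 0.5 * 70 = 85%, which falls in the band for grade 4.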