COURSE INFORMATION

COURSE
Spark for Python
DESCRIPTION
Spark gives you a single engine to explore and experiment with large amounts of data, run machine learning algorithms, and then productionize your code on the same system.
PREREQUISITES
- Knowledge of Python
- Basic knowledge of Java

CURRICULUM
INTRODUCTION TO SPARK
- What does Donald Rumsfeld have to do with data analysis?
- Why is Spark so cool?
- An introduction to RDDs - Resilient Distributed Datasets
- Built-in libraries for Spark
- Installing Spark
- The PySpark Shell
- Transformations and Actions
- See it in Action : Munging Airlines Data with PySpark - I
- [For Linux/Mac OS Shell Newbies] Path and other Environment Variables
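A minimal sketch of the transformation/action model covered in this module. The file name `flights.csv` and the filter condition are placeholders, not the course dataset; in the PySpark shell the `sc` context already exists, so a standalone script builds its own.

```python
from pyspark import SparkConf, SparkContext

# Standalone scripts create their own context; the PySpark shell provides `sc` for you.
conf = SparkConf().setMaster("local[*]").setAppName("intro")
sc = SparkContext(conf=conf)

lines = sc.textFile("flights.csv")                    # transformation: lazy, nothing runs yet
cancelled = lines.filter(lambda l: "CANCELLED" in l)  # another lazy transformation
print(cancelled.count())                              # action: this triggers the actual job

sc.stop()
```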
RESILIENT DISTRIBUTED DATASETS
- RDD Characteristics: Partitions and Immutability
- RDD Characteristics: Lineage, RDDs know where they came from
- What can you do with RDDs?
- Create your first RDD from a file
- Average distance travelled by a flight using map() and reduce() operations
- Get delayed flights using filter(), cache data using persist()
- Average flight delay in one-step using aggregate()
- Frequency histogram of delays using countByValue()
- See it in Action : Analyzing Airlines Data with PySpark - II
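A rough sketch of the RDD operations listed above, assuming a hypothetical comma-separated flights file with the distance in column 18 and the departure delay in column 15; the real course dataset and column positions may differ.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-ops")
rows = sc.textFile("flights.csv").map(lambda line: line.split(","))

# Average distance with map() and reduce()
distances = rows.map(lambda r: float(r[18]))          # assumed column index
total = distances.reduce(lambda a, b: a + b)
print(total / distances.count())

# Delayed flights, cached for reuse across several computations
delayed = rows.filter(lambda r: float(r[15]) > 0).persist()

# Average delay in a single pass with aggregate(): carry (sum, count) together
s, n = delayed.map(lambda r: float(r[15])).aggregate(
    (0.0, 0),
    lambda acc, x: (acc[0] + x, acc[1] + 1),
    lambda a, b: (a[0] + b[0], a[1] + b[1]))
print(s / n)

# Frequency histogram of delays, bucketed into 10-minute bins
print(delayed.map(lambda r: int(float(r[15]) // 10)).countByValue())

sc.stop()
```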
BASIC SEARCH & OPTIMIZATION ALGORITHMS
- Brute-force search introduction
- Brute-force search example
- Stochastic search introduction
- Stochastic search example
- Hill climbing introduction
- Hill climbing example
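A toy hill-climbing sketch in plain Python (no Spark required) to illustrate the idea behind the search strategies above: start somewhere, move to a better neighbour, stop when nothing nearby improves. The objective function is an arbitrary example.

```python
import random

def cost(x):
    # Example objective: minimise (x - 3)^2
    return (x - 3) ** 2

def hill_climb(start, step=0.1, max_iters=1000):
    current = start
    for _ in range(max_iters):
        neighbours = [current - step, current + step]
        best = min(neighbours, key=cost)
        if cost(best) >= cost(current):   # no neighbour improves: local optimum reached
            break
        current = best
    return current

# A random (stochastic) starting point; repeated restarts explore different regions
print(hill_climb(random.uniform(-10, 10)))
```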
ADVANCED RDDS: PAIR RESILIENT DISTRIBUTED DATASETS
- Special Transformations and Actions
- Average delay per airport, use reduceByKey(), mapValues() and join()
- Average delay per airport in one step using combineByKey()
- Get the top airports by delay using sortBy()
- Lookup airport descriptions using lookup(), collectAsMap(), broadcast()
- See it in Action : Analyzing Airlines Data with PySpark - III
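An illustrative sketch of the pair-RDD operations above on made-up (airport, delay) data; the actual exercises use the airlines dataset.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "pair-rdds")

# Made-up (airport, delay_minutes) pairs and airport descriptions
delays = sc.parallelize([("SFO", 12.0), ("JFK", 5.0), ("SFO", 30.0), ("JFK", 0.0)])
names = sc.parallelize([("SFO", "San Francisco"), ("JFK", "New York JFK")])

# Average delay per airport with mapValues() + reduceByKey()
avg = (delays.mapValues(lambda d: (d, 1))
             .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
             .mapValues(lambda t: t[0] / t[1]))

# The same result in one step with combineByKey()
avg2 = (delays.combineByKey(lambda d: (d, 1),
                            lambda acc, d: (acc[0] + d, acc[1] + 1),
                            lambda a, b: (a[0] + b[0], a[1] + b[1]))
              .mapValues(lambda t: t[0] / t[1]))

# Top airports by average delay, and a join with their descriptions
print(avg.sortBy(lambda kv: kv[1], ascending=False).take(3))
print(avg.join(names).collect())

# Small lookup table: ship it to every node with broadcast() instead of joining
desc = sc.broadcast(names.collectAsMap())
print(avg.map(lambda kv: (desc.value.get(kv[0]), kv[1])).collect())
print(avg.lookup("SFO"))                 # fetch the values for a single key directly

sc.stop()
```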
ADVANCED SPARK: ACCUMULATORS, SPARK SUBMIT, MAPREDUCE, BEHIND THE SCENES
- Get information from individual processing nodes using accumulators
- See it in Action : Using an Accumulator variable
- Long running programs using spark-submit
- See it in Action : Running a Python script with Spark-Submit
- Behind the scenes: What happens when a Spark script runs?
- Running MapReduce operations
- See it in Action : MapReduce with Spark
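A sketch of an accumulator plus the classic word count expressed as Spark transformations. The file names are placeholders, and the script would be launched with spark-submit rather than typed into the interactive shell.

```python
# save as flight_stats.py and run with:  spark-submit flight_stats.py
from pyspark import SparkContext

sc = SparkContext(appName="accumulator-demo")
bad_rows = sc.accumulator(0)                 # write-only counter shared across worker nodes

def parse(line):
    fields = line.split(",")
    if len(fields) < 2:                      # treat short rows as malformed
        bad_rows.add(1)
        return None
    return fields

rows = sc.textFile("flights.csv").map(parse).filter(lambda r: r is not None)
rows.count()                                 # an action must run before the accumulator has a value
print("bad rows:", bad_rows.value)

# Classic MapReduce word count expressed as Spark transformations
counts = (sc.textFile("book.txt")
            .flatMap(lambda line: line.split())
            .map(lambda w: (w, 1))
            .reduceByKey(lambda a, b: a + b))
print(counts.take(5))

sc.stop()
```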
JAVA AND SPARK
- The Java API and Function objects
- Pair RDDs in Java
- Running Java code
- Installing Maven
- See it in Action : Running a Spark Job with Java
PAGERANK: RANKING SEARCH RESULTS
- What is PageRank?
- The PageRank algorithm
- Implement PageRank in Spark
- Join optimization in PageRank using Custom Partitioning
- See it in Action : The PageRank algorithm using Spark
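A minimal iterative PageRank over a made-up three-page link graph, in the spirit of this module; partitionBy() keeps the links RDD consistently partitioned so the join in each iteration avoids a full reshuffle. Iteration count and damping values are the usual textbook choices, not the course's exact settings.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "pagerank")

# Toy link graph: page -> list of pages it links to (made-up data)
links = sc.parallelize([
    ("a", ["b", "c"]),
    ("b", ["c"]),
    ("c", ["a"]),
]).partitionBy(4).cache()              # custom partitioning: links stay put across iterations

ranks = links.mapValues(lambda _: 1.0)  # inherits the same partitioner

for _ in range(10):
    # Each page splits its rank evenly among the pages it links to
    contribs = links.join(ranks).flatMap(
        lambda kv: [(dest, kv[1][1] / len(kv[1][0])) for dest in kv[1][0]])
    ranks = (contribs.reduceByKey(lambda a, b: a + b)
                     .mapValues(lambda r: 0.15 + 0.85 * r))

print(sorted(ranks.collect(), key=lambda kv: -kv[1]))

sc.stop()
```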
SPARK SQL
- Dataframes: RDDs + Tables
- See it in Action : Dataframes and Spark SQL
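A short DataFrame and Spark SQL sketch on toy data. Depending on the Spark version the course targets, the entry point may be SQLContext rather than SparkSession; this assumes Spark 2.x or later.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-demo").getOrCreate()

# Hypothetical flight records; in practice you would load a CSV or Parquet file
df = spark.createDataFrame(
    [("SFO", 12.0), ("JFK", 5.0), ("SFO", 30.0)],
    ["origin", "delay"])

# DataFrame API: average delay per origin
df.groupBy("origin").avg("delay").show()

# The same query through SQL
df.createOrReplaceTempView("flights")
spark.sql("SELECT origin, AVG(delay) FROM flights GROUP BY origin").show()

spark.stop()
```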
MLLIB IN SPARK: BUILD A RECOMMENDATIONS ENGINE
- Collaborative filtering algorithms
- Latent Factor Analysis with the Alternating Least Squares method
- Music recommendations using the Audioscrobbler dataset
- Implement code in Spark using MLlib
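A sketch of implicit-feedback ALS using the RDD-based MLlib API, on toy (user, artist, play-count) triples standing in for the Audioscrobbler data; the rank, iteration count, and regularisation values here are arbitrary.

```python
from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext("local[*]", "als-demo")

# Toy (user, artist, play_count) triples standing in for the Audioscrobbler data
plays = sc.parallelize([
    (1, 101, 55), (1, 102, 3),
    (2, 101, 12), (2, 103, 40),
    (3, 102, 7),  (3, 103, 21),
]).map(lambda t: Rating(t[0], t[1], float(t[2])))

# Implicit-feedback ALS: play counts act as confidence, not explicit ratings
model = ALS.trainImplicit(plays, rank=10, iterations=5, lambda_=0.01, alpha=1.0)

print(model.recommendProducts(1, 2))   # top-2 artist recommendations for user 1

sc.stop()
```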
SPARK STREAMING
- Introduction to streaming
- Implement stream processing in Spark using DStreams
- Stateful transformations using sliding windows
- See it in Action : Spark Streaming
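A minimal DStream sketch: word counts over a sliding window, reading text from a local socket (fed with `nc -lk 9999`). The checkpoint directory and the window and slide durations are arbitrary choices for illustration.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "streaming-demo")   # at least 2 threads: one receiver, one processor
ssc = StreamingContext(sc, batchDuration=5)       # 5-second micro-batches
ssc.checkpoint("/tmp/spark-checkpoint")           # required for windowed/stateful operations

# Text arriving on a local socket, e.g. fed with:  nc -lk 9999
lines = ssc.socketTextStream("localhost", 9999)

# Word counts over a 30-second sliding window, recomputed every 10 seconds
counts = (lines.flatMap(lambda l: l.split())
               .map(lambda w: (w, 1))
               .reduceByKeyAndWindow(lambda a, b: a + b,   # add counts entering the window
                                     lambda a, b: a - b,   # subtract counts leaving it
                                     windowDuration=30, slideDuration=10))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```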
GRAPH LIBRARIES
- The Marvel social network using Graphs
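GraphX itself exposes a Scala/Java API; as a Python-side stand-in for the Marvel social-network exercise, here is a degree count over toy co-appearance edges using plain pair RDDs.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "marvel-graph")

# Toy edges: (hero, co-appearing hero) pairs standing in for the Marvel dataset
edges = sc.parallelize([
    ("SPIDER-MAN", "IRON MAN"),
    ("SPIDER-MAN", "HULK"),
    ("IRON MAN", "HULK"),
    ("IRON MAN", "THOR"),
])

# Degree of each hero = number of co-appearance connections, counting both directions
degrees = (edges.flatMap(lambda e: [(e[0], 1), (e[1], 1)])
                .reduceByKey(lambda a, b: a + b))

print(degrees.sortBy(lambda kv: -kv[1]).collect())

sc.stop()
```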
INTERVIEW WITH SINGAPORE EXPERT
- Background of Expert
- Information and Communication Technology in Singapore
COURSE FEES
INSTRUCTIONS FOR ENROLMENT
1. Check if you are eligible for subsidies here and follow the instructions.

2. Purchase the e-module below:

3. Purchase the face-to-face lesson below:

4. Once you have completed 90% of your e-module, the Gen Infiniti team will contact you to schedule your in-person lesson.

Drop us an email at connect@geninfinitiacademy.com if you face any issues or problems enrolling in the course.