Consider the following example problems:
One is interested in computing summary statistics (word count distributions) for sets of words that occur in the same document, across the entire Wikipedia collection (about 5 million documents). Naive techniques will run out of main memory on most computers.
One needs to train an SVM classifier for text categorization, with unigram features (typically around 10 million) for hundreds of classes. Storing the uncompressed model parameters in main memory would exhaust it on most machines (a rough estimate is sketched after these examples).
One is interested in learning a supervised model or finding unsupervised patterns, but the data is distributed over multiple machines. With communication as the bottleneck, naive attempts to adapt existing algorithms to such a distributed setting can perform extremely poorly.

In all the above situations, a simple data mining / machine learning task is made more complicated by the large scale of the input data, the output results, or both. In this course, we discuss algorithmic techniques as well as software paradigms that allow one to develop scalable algorithms and systems for common data science tasks.
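To make the text-categorization example concrete, here is a minimal back-of-the-envelope sketch in Python of why uncompressed model parameters overflow main memory. The 10 million unigram features come from the example above; the concrete class count of 500 and the dense 8-byte weights are illustrative assumptions, not figures from the course.

# Rough memory estimate for the SVM text-categorization example.
# Assumptions (illustrative only): 500 classes as a stand-in for
# "hundreds of classes", and one dense 64-bit weight per
# (feature, class) pair.

num_features = 10_000_000   # ~10 million unigram features (from the example)
num_classes = 500           # assumed concrete value for "hundreds of classes"
bytes_per_weight = 8        # assumed dense float64 parameters

total_bytes = num_features * num_classes * bytes_per_weight
print(f"Uncompressed model size: {total_bytes / 2**30:.1f} GiB")
# ~37 GiB -- far more RAM than a typical workstation has, which is why
# sparse or compressed parameter representations become necessary.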
Important For Certification/Credit Transfer:
Weekly Assignments and the Discussion Forum can be accessed ONLY by enrolling here.
Note: Content is Free!
All content, including the discussion forum and assignments, is free.
The Final Exam (in-person, invigilated, currently conducted in India) is mandatory for Certification and carries an exam fee of INR 1100.
INTENDED AUDIENCE: Computer Science and Engineering
CORE/ELECTIVE: Elective
UG/PG: PG Course
PREREQUISITES: Algorithms, Machine Learning
INDUSTRY SUPPORT: Google, Microsoft, Facebook, Amazon, Flipkart, LinkedIn, etc.