Scalable Data Science

ABOUT THE COURSE:

Consider the following example problems.

One is interested in computing summary statistics (word count distributions) for sets of words that occur in the same document across the entire Wikipedia collection (5 million documents). Naive techniques will run out of main memory on most computers.

One needs to train an SVM classifier for text categorization, with unigram features (typically ~10 million) for hundreds of classes. Storing the uncompressed model parameters would exhaust main memory.

One is interested in learning a supervised model or finding unsupervised patterns, but the data is distributed over multiple machines. Because communication is the bottleneck, naive methods of adapting existing algorithms to such a distributed setting might perform extremely poorly.

In all of the above situations, a simple data mining or machine learning task has been made more complicated by the large scale of the input data, the output results, or both. In this course, we discuss algorithmic techniques as well as software paradigms that allow one to develop scalable algorithms and systems for common data science tasks.
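To illustrate the kind of memory-bounded technique the course covers, here is a minimal Python sketch of the Misra-Gries summary (listed under Week 2) for finding frequent words in a stream with at most k - 1 counters. The function name `misra_gries` and the toy stream are assumptions made for illustration and are not taken from the course material.

```python
def misra_gries(stream, k):
    """Return approximate counts for frequent items using at most k - 1 counters.

    Each returned count underestimates the true count by at most n / k,
    where n is the length of the stream.
    """
    counters = {}
    for item in stream:
        if item in counters:
            # Item is already tracked: increment its counter.
            counters[item] += 1
        elif len(counters) < k - 1:
            # There is a free counter: start tracking this item.
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop those at zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Hypothetical toy stream: "a" is the clear heavy hitter.
words = ["a"] * 6 + ["b"] * 3 + ["c"]
summary = misra_gries(words, k=3)
```

The memory used depends only on k, not on the number of distinct words, which is why this style of summary suits inputs that do not fit in main memory.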

Important For Certification/Credit Transfer:

Weekly Assignments and Discussion Forum can be accessed ONLY by enrolling here

Note: Content is Free!

All content, including the discussion forum and assignments, is free.

The final exam (in-person, invigilated, currently conducted in India) is mandatory for certification and has an exam fee of INR 1100.

INTENDED AUDIENCE:
Computer Science and Engineering

CORE/ELECTIVE: Elective

UG/PG: PG Course

PREREQUISITES: Algorithms, Machine Learning

INDUSTRY SUPPORT: Google, Microsoft, Facebook, Amazon, Flipkart, LinkedIn etc.

5174 students have enrolled already!!

ABOUT THE INSTRUCTOR:

Anirban Dasgupta is currently an Associate Professor of Computer Science & Engineering at IIT Gandhinagar. Prior to this, he was a Senior Scientist at Yahoo! Labs Sunnyvale. Anirban works on algorithmic problems for massive data sets, large-scale machine learning, analysis of large social networks, and randomized algorithms in general. He did his undergraduate studies at IIT Kharagpur and doctoral studies at Cornell University. He has received the Google Faculty Research Award (2015), the Cisco University grant (2016), and the ICDT Best Newcomer Award (2016).

Sourangshu Bhattacharya is an Assistant Professor in the Department of Computer Science and Engineering, IIT Kharagpur. He was a Scientist at Yahoo! Labs from 2008 to 2013, where he worked on prediction of click-through rates, ad targeting to customers, etc., on the Rightmedia display ads exchange. He was a visiting scholar at the Helsinki University of Technology from January to May 2008. He received a B.Tech. in Civil Engineering from I.I.T. Roorkee in 2001, an M.Tech. in Computer Science from I.S.I. Kolkata in 2003, and a Ph.D. in Computer Science from the Department of Computer Science & Automation, IISc Bangalore in 2008. He has many publications in top conferences and journals, including ICML, NIPS, WWW, ICDM, and CIKM. His current research interests include modeling influence in social networks, distributed machine learning, and representation learning.

COURSE LAYOUT:

Week 1 : Background: Introduction (30 mins); Probability: Concentration inequalities (30 mins); Linear algebra: PCA, SVD (30 mins); Optimization: Basics, Convex, GD (30 mins); Machine Learning: Supervised, generalization, feature learning, clustering (30 mins)

Week 2 : Memory-efficient data structures: Hash functions, universal / perfect hash families (30 mins); Bloom filters (30 mins); Sketches for distinct count (1 hr); Misra-Gries sketch (30 mins)

Week 3 : Memory-efficient data structures (contd.): Count Sketch, Count-Min Sketch (1 hr); Approximate near neighbors search: Introduction, kd-trees etc. (30 mins); LSH families, MinHash for Jaccard, SimHash for L2 (1 hr)

Week 4 : Approximate near neighbors search: Extensions, e.g. multi-probe, b-bit hashing, data-dependent variants (1.5 hrs); Randomized Numerical Linear Algebra: Random projection (1 hr)

Week 5 : Randomized Numerical Linear Algebra: CUR Decomposition (1 hr); Sparse RP, Subspace RP, Kitchen Sink (1.5 hrs)

Week 6 : Map-reduce and related paradigms: Map reduce, programming examples (page rank, k-means, matrix multiplication) (1 hr); Big data: computation goes to data, Hadoop ecosystem (1.5 hrs)

Week 7 : Map-reduce and related paradigms (contd.): Scala + Spark (1 hr); Distributed Machine Learning and Optimization: Introduction (30 mins); SGD + Proof (1 hr)

Week 8 : Distributed Machine Learning and Optimization: ADMM + applications (1 hr); Clustering (1 hr); Conclusion (30 mins)

SUGGESTED READING MATERIALS:

1. J. Leskovec, A. Rajaraman and J. D. Ullman. Mining of Massive Datasets. Cambridge University Press, 2nd Ed.

2. Muthukrishnan, S. (2005). Data streams: Algorithms and applications. Foundations and Trends® in Theoretical Computer Science, 1(2), 117-236.

3. Woodruff, David P. "Sketching as a tool for numerical linear algebra." Foundations and Trends® in Theoretical Computer Science 10.1–2 (2014): 1-157.

4. Mahoney, Michael W. "Randomized algorithms for matrices and data." Foundations and Trends® in Machine Learning 3.2 (2011): 123-224.

CERTIFICATION EXAM :

The exam is optional and requires payment of a fee.
Date of Exam: October 07, 2018 (Sunday)
Time of Exam: Morning session 9am to 12 noon; Afternoon session 2pm to 5pm.
The exam for this course will be available in both the morning and afternoon sessions.
Registration URL: An announcement will be made when the registration form opens.
To register, fill in the online registration form and pay the certification exam fee. More details will be made available when the exam registration form is published.

CERTIFICATE:

The final score will be calculated as: 25% assignment score + 75% final exam score.
The assignment component is 25% of the average of the best 6 out of 8 assignments.
An e-certificate will be given to those who register for and write the exam and obtain a final score of 40% or more. The certificate will have your name, photograph, and the score in the final exam with the breakup. It will have the logos of NPTEL and IIT Kharagpur. It will be e-verifiable at http://nptel.ac.in/noc/
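The scoring rule above can be worked through with a short Python sketch. The function names `final_score` and `earns_certificate`, and the sample marks, are hypothetical. The sketch assumes that every assignment and the exam are scored on a 0 to 100 scale, which the text above does not state explicitly.

```python
def final_score(assignment_scores, exam_score):
    """Combine assignment and exam marks (each assumed to be out of 100)."""
    # Keep only the best 6 of the 8 assignment scores.
    best_six = sorted(assignment_scores, reverse=True)[:6]
    assignment_average = sum(best_six) / len(best_six)
    # 25% weight on the assignment average, 75% on the final exam.
    return 0.25 * assignment_average + 0.75 * exam_score


def earns_certificate(score):
    """A final score of 40% or more qualifies for the e-certificate."""
    return score >= 40


# Hypothetical marks: two assignments were skipped and score 0.
assignments = [80, 90, 70, 60, 100, 50, 0, 0]
score = final_score(assignments, exam_score=60)
```

With these sample marks, the best six assignments average 75, so the final score is 0.25 × 75 + 0.75 × 60 = 63.75, which is above the 40% threshold.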
