This is a blog for students enrolled in CSCI 8980 special topic course on Big Data Algorithms in Fall 2013 at University of Minnesota to share their feedback on the course, have a forum for discussion and create a repository of datasets and relevant references. All course related information will be posted here.
Here is the course description
For decades researchers across different disciplines of computer science have envisioned the need of techniques to handle dataintensive computing. With the boom of internet and the explosion of data in every socioeconomical aspect, once what was a futuristic research, has now transformed itself into a dire requirement. Big Data comes with immense opportunity, but turning this seriously high volume, high velocity, structured or unstructured, heterogeneous, often noisy and highdimensional data into something one can use is a huge challenge. This course aims at timely dissemination of foundational algorithmic developments for big data analysis and exposing students to cutting edge research in this area. The course will involve deep theoretical analysis with the goal of developing practical algorithms with variety of applications. We will explore tradeoffs among space, time and accuracy for algorithm design. The course will cover different sampling methodologies, streaming algorithms where only a small fraction of data can be stored, semistreaming model that allows a few sequential access to disk and some parallel algorithms, specifically, the mapreduce paradigm. We will study the algorithmic and complexity aspects of these frameworks. A primary focus of this course will be on designing scalable, subquadratic, often nearlinear or even sublinear algorithms. We will explore property testing methodology that allows in sublinear time to test whether a data set has certain property. We will go through the recent progress in developing fast algorithms for basic graph problems such as maxflow mincut, matching etc., sparse transformations such as sparse fast Fourier transform, and hashing methodologies involving minhash and locality sensitive hashing. The other topics will include dimensionality reduction techniques to handle highdimensional data comprising of random projection method, Johnson Lindenstrauss Lemma, metric embedding and graph sparsifiers. We will also explore crowdsourcing–the power of human assisted computing. Most of the algorithms that we will study in this course will crucially use randomization and will give an answer that is a good approximation of the optimal solution.
Prerequisites Students should have basic knowledge of algorithms: running time analysis, graphs algorithms, and must be familiar with discrete probability. Undergraduates are welcome to attend if they satisfy the requirements.
Syllabus There will be no required text book for this course, instead we will use assorted materials from the web. A tentative list of topics is as follows:
 Streaming: Sampling and Sketching
 Dimensionality Reduction

External Memory and Semistreaming Algorithms
 MapReduce Framework and Extensions

Near Linear Time Algorithm Design
 Property Testing

Metric Embedding
 Sparse Transformation
 Crowdsourcing
The class meets every Thursday 6:30 pm–9:00 pm CT at MechE 221.
Office hours: Friday 1pm to 5pm at 6198 KHKH by Appointment.
See pdf flyer of the course with contact information Algorithmic Techniques for Big Data
Relevant Other Courses Taught in Other Universities
 Sublinear Algorithms by Piotr Indyk, Ronitt Rubinfeld at MIT http://stellar.mit.edu/S/course/6/sp13/6.893/
 Algorithms for Massive Data Sets by Piotr Indyk at MIT http://people.csail.mit.edu/indyk/MASS/index.html
 Mining Big Data by Anand Rajaraman and Jeffrey D. Ullman at Stanford University http://infolab.stanford.edu/~ullman/mining/2009/index.html
 Dealing with Massive Data by Sergei Vassilvitskii from Google Research at Columbia University http://www.cs.columbia.edu/~coms699812/
 Data Stream Algorithms by Andrew Mcgregor at UMASS http://people.cs.umass.edu/~mcgregor/courses/CS711S12/index.html
 Algorithms for Big Data Management by Ashwin Machanavajjhala at Duke https://www.cs.duke.edu/courses/spring13/compsci590.2/
 Data Stream Algorithms by Amit Chakrabarti at Dartmouth http://www.cs.dartmouth.edu/~ac/Teach/CS49Fall11/
Lectures
Latex template for writing scribe scribe
Lecture notes will be uploaded unedited whenever available. They might get edited at a later date.
 09/05 Overview Slides Various models and techniques, counting distinct items in different models, no scribe.
 09/05 Introduction to data streams, finding frequent items deterministically, lower bound for deterministic computation of distinct elements, Markov inequality, Chebyshev bound, The Chernoff bound. Scribe by Vivek Mishra. Lecture 1
 09/12 Universal family of hash functions, Analysis of two algorithms for counting distinct items Slides for Lecture 2
 09/12 CountMin sketch and applications Slides from Andrew Mcgregor’s Data Streaming Course, Project discussion Project Discussion
Background on Randomized Algorithms:
 Randomized Algorithms by Motwani and Raghavan
 Probability and Computing by Mitzenmacher and Upfal
 The Probabilistic Method by Alon and Spencer
Relevant Reading
 Sketch Techniques for Approximate Query Processing by Graham Cormode