This is a blog for students enrolled in CSCI 8980 special topic course on Big Data Algorithms in Fall 2013 at University of Minnesota to share their feedback on the course, have a forum for discussion and create a repository of datasets and relevant references. All course related information will be posted here.
Here is the course description
For decades researchers across different disciplines of computer science have envisioned the need of techniques to handle dataintensive computing. With the boom of internet and the explosion of data in every socioeconomical aspect, once what was a futuristic research, has now transformed itself into a dire requirement. Big Data comes with immense opportunity, but turning this seriously high volume, high velocity, structured or unstructured, heterogeneous, often noisy and highdimensional data into something one can use is a huge challenge. This course aims at timely dissemination of foundational algorithmic developments for big data analysis and exposing students to cutting edge research in this area. The course will involve deep theoretical analysis with the goal of developing practical algorithms with variety of applications. We will explore tradeoffs among space, time and accuracy for algorithm design. The course will cover different sampling methodologies, streaming algorithms where only a small fraction of data can be stored, semistreaming model that allows a few sequential access to disk and some parallel algorithms, specifically, the mapreduce paradigm. We will study the algorithmic and complexity aspects of these frameworks. A primary focus of this course will be on designing scalable, subquadratic, often nearlinear or even sublinear algorithms. We will explore property testing methodology that allows in sublinear time to test whether a data set has certain property. We will go through the recent progress in developing fast algorithms for basic graph problems such as maxflow mincut, matching etc., sparse transformations such as sparse fast Fourier transform, and hashing methodologies involving minhash and locality sensitive hashing. The other topics will include dimensionality reduction techniques to handle highdimensional data comprising of random projection method, Johnson Lindenstrauss Lemma, metric embedding and graph sparsifiers. We will also explore crowdsourcing–the power of human assisted computing. Most of the algorithms that we will study in this course will crucially use randomization and will give an answer that is a good approximation of the optimal solution.
Prerequisites Students should have basic knowledge of algorithms: running time analysis, graphs algorithms, and must be familiar with discrete probability. Undergraduates are welcome to attend if they satisfy the requirements.
Syllabus There will be no required text book for this course, instead we will use assorted materials from the web. A tentative list of topics is as follows:
 Streaming: Sampling and Sketching
 Dimensionality Reduction

External Memory and Semistreaming Algorithms
 MapReduce Framework and Extensions

Near Linear Time Algorithm Design
 Property Testing

Metric Embedding
 Sparse Transformation
 Crowdsourcing
The class meets every Thursday 6:30 pm–9:00 pm CT at MechE 221.
Office hours: Friday 1pm to 5pm at 6198 KHKH by Appointment.
See pdf flyer of the course with contact information Algorithmic Techniques for Big Data Analysis
Grading Policy
Scribe 20% + Class and Blog Participation 20% + Project 60% (Survey 30% + Presentation and Write up 30%)
For each Thursday class,scribe is due by the following Monday. Late submission will be penalized.
* Oct 3rd : submit names of 5 survey papers
* Oct 17th: project proposal due – Max 2 pages + references, must be in latex, single column, article 11 pt.
*Nov 14th : a one page writeup describing progress on project due
* Dec 5th: project final writeup due
No late submission will be accepted unless there is compelling reason.
Relevant Other Courses Taught in Other Universities
 Sublinear Algorithms by Piotr Indyk, Ronitt Rubinfeld at MIT http://stellar.mit.edu/S/course/6/sp13/6.893/
 Algorithms for Massive Data Sets by Piotr Indyk at MIT http://people.csail.mit.edu/indyk/MASS/index.html
 Mining Big Data by Anand Rajaraman and Jeffrey D. Ullman at Stanford University http://infolab.stanford.edu/~ullman/mining/2009/index.html
 Dealing with Massive Data by Sergei Vassilvitskii from Google Research at Columbia University http://www.cs.columbia.edu/~coms699812/
 Data Stream Algorithms by Andrew Mcgregor at UMASShttp://people.cs.umass.edu/~mcgregor/courses/CS711S12/index.html
 Algorithms for Big Data Management by Ashwin Machanavajjhala at Duke https://www.cs.duke.edu/courses/spring13/compsci590.2/
 Data Stream Algorithms by Amit Chakrabarti at Dartmouth http://www.cs.dartmouth.edu/~ac/Teach/CS49Fall11/
Lectures:
Latex template for writing scribe scribe
Lecture notes will be emailed to the students unedited and will be posted here after I have time to edit them at least once.
 09/05 Overview Slides Various models and techniques, counting distinct items in different models, no scribe.
 09/05 Introduction to data streams, finding frequent items deterministically, lower bound for deterministic computation of distinct elements, Markov inequality, Chebyshev bound, The Chernoff bound. Scribe by Vivek Mishra. Lecture 1
 09/12 Universal family of hash functions, Analysis of two algorithms for counting distinct items Slides
 09/12 CountMin sketch and applications Slides from Andrew Mcgregor’s Data Streaming Course, Project discussion Project Discussion
 09/19 Frequency moment estimation, JohnsonLindenstrauss Lemma, pstable distribution. l_p norm via maxstability “High frequency moments via maxstability”. Slides Scribe
 09/19 Approximate Near Neighbor Search, Locality Sensitive Hashing Slides Scribe
 09/26 Locality Sensitive Hashing continued, Minwise Independent Hashing , note: http://cseweb.ucsd.edu/~dasgupta/254embeddings/lawrence.pdf
 09/26 Sampling: Reservoir, AMS, Priority sampling Slides
 10/03 Clustering: Small space clustering: K center, K median; Kmeans++ Slides, kmeans++
 10/03 Semistreaming: Graph connectivity, Spanners, Sparsifiers Slides by Andrew Mcgregor
 10/10 Graph Sparsifiers Continued
 10/10 Graph Sketch http://people.cs.umass.edu/~mcgregor/711S12/lec22.pdf
 10/17 Making Dynamic Programmings Fast: Amnesic Approximation. A tour through Data Quality. Data Quality Slides
 10/17 Introduction to Map Reduce, MST, Matching, Dense Subgraph Computation MapReduce Intro Slides
 10/24 Counting Triangles in Massive Graphs
Student presentation: Radius and Clustering on Map Reduce
Student presentation: Data Deduplication, Load balancing in Map reduce  10/31 Random Walks on Graphs
Student presentation: Distributed Random Walks  10/31 Recommender System: Low Rank Matrix Completion
 11/07 Student Presentation: Survey on Recommender System, Real time recommendation
 11/07 Student Presentation: Managing Uncertain Data: Ranking with uncertainty
 11/14 Crowdsourcing: Ranking and Clustering
 11/14, 11/21 Student Presentation: Topics on Social Streams
Tag Recommendation, Story Detection in Twitter  12/5, Property Testing & Sublinear Algorithms: Testing Distance Between Distributions, Counting Connected Components, MST
 Project Presentation
Background on Randomized Algorithms:
 Randomized Algorithms by Motwani and Raghavan
 Probability and Computing by Mitzenmacher and Upfal
 The Probabilistic Method by Alon and Spencer
Relevant Reading
 Sketch Techniques for Approximate Query Processing by Graham Cormode
 The space complexity of approximating the frequency moments, Noga Alon, Yossi Matias, Marios Szegedy
 Stable Distributions, Pseudorandom Generators, Embeddings and Data Stream Computation, Piotr Indyk
 An elementary proof of a theorem of Johnson and Lindenstrauss, Sanjoy Dasgupta, Anupam Gupta
 Approximate nearest neighbors: towards removing the curse of dimensionality by Piotr Indyk and Rajeev Motawani
 NearOptimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions by Alex Andoni and Piotr Indyk.
 Localitysensitive hashing scheme based on pstable distributions, Mayur Datar, Nicole Immorlica, Piotr Indyk, Vahab S. Mirrokni
 Priority sampling for estimation of arbitrary subset sums by Nick Duffield, Carsten Lund, Mikkel Thorup
 Streaming Algorithms from Precision Sampling, Alexandr Andoni, Robert Karuthgamer, Krzysztof Onak
 Tight Results for Clustering and Summarizing Data Streams, Sudipto Guha
 Better streaming algorithms for clustering problems, Moses Charikar, Liadan O’Callaghan, Rina Panigrahy
 kmeans++: The Advantage of Careful Seeding, Sergei Vassilvitskii and David Arthur
 On Graph Problems in SemiStreaming Model, Joan Feigenbaum, Sampatha Kannan, Andrew Mcgregor, Siddharth Suri.
 Graph Distances in the DataStream Model, Joan Feigenbaum, Sampatha Kannan, Andrew Mcgregor, Siddharth Suri, Jian Zhang.
 Graph sparsification in the semistreaming model, Kook Jin Ahn, Sudipto Guha
 Bahman Bahmani, Ravi Kumar, Sergei Vassilvitskii. Densest Subgraph in Streaming and MapReduce. In VLDB 2012
 Silvio Lattanzi, Benjamin Moseley, Siddharth Suri, Sergei Vassilvitskii. Filtering: A Method for Solving Graph Problems in MapReduce. In SPAA 2011
 Howard Karloff, Siddarth Suri, Sergei Vassilvitskii. A Model of Computation for MapReduce. In SODA 2010 (Austin, Texas) [ pdf]
Students, please introduce yourself in one line: name, major, degree studying, years in the degree and if you already have a thesis topic, which area you are working on.
Also please indicate which of the followings you are familiar with:
1. Algorithm running time analysis: big O, omega notations
2. Sorting algorithms and their complexity
3. Basic graph algorithms: finding MST, shortest path
4. Dynamic programming
5. Basic probability: random variables, expectation, variance
6. Tail Bounds: Markov, Chebyshev, Chernoff etc.
For example, if you know 15 but not 6, just mention familiar with 15 but not 6.
Hello everyone,
I am Kaustubh and I am a second year graduate student in computer science. I am not writing a thesis; but, planning to work on a project in distributed computing under plan B. I am familiar with points 15 with very abstract to almost none knowledge of 6th.
Best,
Kaustubh
Hi All,
My name is Matt Nohelty. I am pursuing an MS in Computer Science and am in my second year. I am not planning to write a thesis and will instead complete the plan C option. I’m familiar with items 15 but not 6.
Matt
Hello,
My name is Matthew, in terms of credits I am finishing my first year as a grad student pursuing a MS. I do not have a thesis topic and might take the plan B route as well. My area of interest has shifted from distributed robotics to distributed computing in general. I am familiar with topics 13 and somewhat with 4.
Hello Everyone,
My name is Koshy P. Rajan and I am a second year Master’s student in Computer Science. I am planning to graduate with plan C option. My area of interest is systems and I am familiar with topics 14 but not 5 and 6.
Koshy
My name is Shikher Sitoke. MS in Computer Science, 2nd year. I am also taking the plan C route as of now.
I’m familiar with items 15 but not 6.
Hello,
I am Fan and I am a second year graduate student majoring in computer science. I am not planning to write the thesis. I am familiar with topics 15 but not 6.
Hello everyone,
Myself Saurabh Verma and i am a first year master’s student in computer science. I haven’t decide upon my plan option but interested in machine learning and data mining. I am familiar with topics 13,5 and little bit of 4.
–Saurabh
My name is John Maloney. I am pursuing the plan C option for a masters degree in computer science. I am focusing on machine learning and data mining. I am a part time student and have completed the following courses: pattern recognition, robotics, artificial intelligence II, data mining and parallel programming. I am familiar with topics 15 but know very little about topic 6.
Hi folks,
My name is Shuya. I am a first year master student, major in CS. I am not going to write the thesis, but most likely to do a research project with plan B. I am familiar with topics 14, somewhat with 5, but 6 is a totally new concept for me.
And look forward to study together with you in this interesting course~
Hi everyone.
My name is Aarti. I am a second year PhD student and am familiar with 16, except for Chernoff.
Hi,
My name is Vivek Mishra. I am in my 3rd year of Ph.D in ECE. I work in design automation for circuits and algorithms for robust, variationtolerant, aging aware VLSI design. I am currently working on the Electromigration problem for VLSI circuits.
I am familiar with
15 but not familiar with 6
Hi everyone,
My name is Zhiqi Chen. I am a second year master student in computer science and I am not planning to write a thesis. I am familiar with 15 but not 6.
Zhiqi
My name is Mark Valovage. I’m a 3rd year Ph.D. student. I’m working on applying predictive learning methods to disaggregate electrical data for my WPE. I’m familiar with topics 15 (although a little rusty with topic 3), and new to #6.
I am Karthik Subbian, Computer Science major, 3rd year Ph.D, familiar with 15 but not 6, mainly interested in “Analyzing large networks with content”.
Hi All,
I am Jin Zhao, a 2nd year master student of computer science. I am familiar with topics 1 – 5 but not 6.
Hi, I am Jinfei Yin, a 2nd year master in CS, and I will choose plan C to obtain my degree. I am familiar with topics 1 to 5, but not 6.
Hi, I am Ali Mahdavi, 3rd year PhD in Information & Decision sciences (at Carlson). I am familiar with topics 15 but not so much with 6.
Hi everyone, I’m a 2nd year MS in CS student. I’m familiar with 15.
Hello ,
I am Rohit Deshpande. I am a second year Master’s student in Computer Science. My area of interest is Machine Learning. I am familiar with 15.
Hi everyone,
My name is Roman. I’m in my senior year of my BS in Computer Science. I’m familiar with 15 and had a course which covered 6 about two years ago, so I likely remember all inequalities, but without any proofs.
Hello everyone,
I am Haritha. I am a second year Masters student in computer science. I am interested in algorithms and computational geometry fields. I am familiar with 15.
Feedback:
I found the covered topics and contents interesting and useful.
You were well prepared for the lectures (indicated mastery of the offered material) and the contents were well transferred. However, I had a sense that during some of the sessions, part of the class (at least) were losing track. Of course, this is sth students themselves are accountable for as well.
Having a set of exercises would be very helpful for a better grasp of some of the discussed topics. You mentioned it, but due to some reason it was seen unnecessary (not provided) later in the course.
Regarding scribes, it would be better if they were available several days after each lecture, as they almost were in the first part of the semester. It seemed like there was a decline in this regard towards the end of the course. Rather a problem on the students’ side I would say.
Overall it was a great course, with variety of relevant and interesting material. Would recommend it to other students in my cohort if it would be offered again.
Thank you Barna!
Course feedback:
* Content was new, no other course which covers this area/topics.
* Would like to have graded programming with few derivations assignments in course.
* Good flexibility in terms of project.
@bigdataalgo Have you considered offering this course online through Coursera? In my opinion, that would be highly interesting and successful also.