25

CSCI 8980: Algorithmic Techniques for Big Data Analysis

This is a blog for students enrolled in CSCI 8980 special topic course on Big Data Algorithms in Fall 2013 at University of Minnesota to share their feedback on the course, have a forum for discussion and create a repository of datasets and relevant references. All course related information will be posted here.

Here is the course description

For decades researchers across different disciplines of computer science have envisioned the need of techniques to handle data-intensive computing. With the boom of internet and the explosion of data in every socio-economical aspect, once what was a futuristic research, has now transformed itself into a dire requirement. Big Data comes with immense opportunity, but turning this seriously high volume, high velocity, structured or unstructured, heterogeneous, often noisy and high-dimensional data into something one can use is a huge challenge. This course aims at timely dissemination of foundational algorithmic developments for big data analysis and exposing students to cutting edge research in this area. The course will involve deep theoretical analysis with the goal of developing practical algorithms with variety of applications. We will explore trade-off s among space, time and accuracy for algorithm design. The course will cover different sampling methodologies, streaming algorithms where only a small fraction of data can be stored, semi-streaming model that allows a few sequential access to disk and some parallel algorithms, specifically, the map-reduce paradigm. We will study the algorithmic and complexity aspects of these frameworks. A primary focus of this course will be on designing scalable, sub-quadratic, often near-linear or even sub-linear algorithms. We will explore property testing methodology that allows in sub-linear time to test whether a data set has certain property. We will go through the recent progress in developing fast algorithms for basic graph problems such as max-flow min-cut, matching etc., sparse transformations such as sparse fast Fourier transform, and hashing methodologies involving min-hash and locality sensitive hashing. The other topics will include dimensionality reduction techniques to handle high-dimensional data comprising of random projection method, Johnson Lindenstrauss Lemma, metric embedding and graph sparsi fiers. We will also explore crowdsourcing–the power of human assisted computing. Most of the algorithms that we will study in this course will crucially use randomization and will give an answer that is a good approximation of the optimal solution.

Prerequisites Students should have basic knowledge of algorithms: running time analysis, graphs algorithms, and must be familiar with discrete probability. Undergraduates are welcome to attend if they satisfy the requirements.

Syllabus There will be no required text book for this course, instead we will use assorted materials from the web. A tentative list of topics is as follows:

  1.  Streaming: Sampling and Sketching
  2.  Dimensionality Reduction
  3. External Memory and Semi-streaming Algorithms

  4.  Map-Reduce Framework and Extensions
  5.  Near Linear Time Algorithm Design

  6.  Property Testing
  7.  Metric Embedding

  8.  Sparse Transformation
  9. Crowdsourcing

The class meets every Thursday 6:30 pm–9:00 pm CT at MechE 221.

Office hours: Friday 1pm to 5pm at  6-198 KHKH by Appointment.

See pdf flyer of the course with contact information Algorithmic Techniques for Big Data Analysis

Grading Policy
Scribe 20% +  Class and Blog Participation 20% + Project 60% (Survey 30% + Presentation and Write up 30%)

For each Thursday class,scribe is due by the following Monday. Late submission will be penalized.
* Oct 3rd : submit names of 5 survey papers
* Oct 17th: project proposal due – Max 2 pages + references, must  be in latex, single column, article 11 pt.
*Nov 14th : a one page write-up describing progress on project due
* Dec 5th: project final write-up due

No late submission will be accepted unless there is compelling reason.

Relevant Other Courses Taught in Other Universities

  1. Sub-linear Algorithms by Piotr Indyk, Ronitt Rubinfeld at MIT http://stellar.mit.edu/S/course/6/sp13/6.893/
  2. Algorithms for Massive Data Sets by Piotr Indyk at MIT http://people.csail.mit.edu/indyk/MASS/index.html
  3. Mining Big Data by Anand Rajaraman and Jeffrey D. Ullman at Stanford University http://infolab.stanford.edu/~ullman/mining/2009/index.html
  4. Dealing with Massive Data by Sergei Vassilvitskii from Google Research at Columbia University http://www.cs.columbia.edu/~coms699812/
  5. Data Stream Algorithms by Andrew Mcgregor at UMASShttp://people.cs.umass.edu/~mcgregor/courses/CS711S12/index.html
  6. Algorithms for Big Data Management by Ashwin Machanavajjhala at Duke https://www.cs.duke.edu/courses/spring13/compsci590.2/
  7. Data Stream Algorithms by Amit Chakrabarti at Dartmouth http://www.cs.dartmouth.edu/~ac/Teach/CS49-Fall11/

Lectures:

Latex template for writing scribe scribe

Lecture notes will be emailed to the students unedited and will be posted here after I have time to edit them at least once.

  1. 09/05 Overview Slides   Various models and techniques, counting distinct items in different models, no scribe.
  2. 09/05 Introduction to data streams, finding frequent items deterministically, lower bound for deterministic computation of  distinct elements, Markov inequality, Chebyshev bound, The Chernoff bound.  Scribe by Vivek Mishra. Lecture 1
  3. 09/12 Universal family of hash functions, Analysis of two algorithms for counting distinct items  Slides
  4. 09/12 Count-Min sketch and applications   Slides from Andrew Mcgregor’s Data Streaming Course, Project discussion Project Discussion
  5. 09/19 Frequency moment estimation, Johnson-Lindenstrauss Lemma, p-stable distribution.  l_p norm via max-stability “High frequency moments via max-stability”Slides   Scribe
  6. 09/19  Approximate Near Neighbor Search, Locality Sensitive Hashing Slides Scribe
  7. 09/26 Locality Sensitive Hashing continued, Min-wise Independent Hashing ,  note: http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf
  8. 09/26 Sampling: Reservoir, AMS, Priority sampling Slides
  9. 10/03 Clustering: Small space clustering: K center,  K median; K-means++ Slidesk-means++
  10. 10/03 Semi-streaming: Graph connectivity, Spanners, Sparsifiers Slides by Andrew Mcgregor
  11. 10/10 Graph Sparsifiers Continued
  12. 10/10 Graph Sketch http://people.cs.umass.edu/~mcgregor/711S12/lec-2-2.pdf
  13. 10/17 Making Dynamic Programmings Fast: Amnesic Approximation. A tour through Data Quality. Data Quality Slides
  14. 10/17 Introduction to Map Reduce, MST, Matching,  Dense Subgraph Computation Map-Reduce Intro Slides
  15. 10/24 Counting Triangles in Massive Graphs
    Student presentation: Radius and Clustering on Map Reduce
    Student presentation: Data Deduplication, Load balancing in Map reduce
  16. 10/31  Random Walks on Graphs
    Student presentation: Distributed Random Walks
  17. 10/31 Recommender System: Low Rank Matrix Completion
  18. 11/07 Student Presentation: Survey on Recommender System, Real time recommendation
  19. 11/07 Student Presentation: Managing Uncertain Data: Ranking with uncertainty
  20. 11/14 Crowdsourcing: Ranking and Clustering
  21. 11/14, 11/21 Student Presentation: Topics on Social Streams
    Tag Recommendation, Story Detection in Twitter
  22. 12/5, Property Testing & Sublinear Algorithms: Testing Distance Between Distributions, Counting Connected Components, MST
  23. Project Presentation

Background on Randomized Algorithms:

  • Randomized Algorithms by Motwani and Raghavan
  • Probability and Computing by Mitzenmacher and Upfal
  • The Probabilistic Method by Alon and Spencer

Relevant Reading

  • Sketch Techniques for Approximate Query Processing by Graham Cormode
  • The space complexity of approximating the frequency moments, Noga Alon, Yossi Matias, Marios Szegedy
  • Stable Distributions, Pseudorandom Generators, Embeddings and Data Stream Computation, Piotr Indyk
  • An elementary proof of a theorem of Johnson and Lindenstrauss, Sanjoy Dasgupta, Anupam Gupta
  • Approximate nearest neighbors: towards removing the curse of dimensionality by Piotr Indyk and Rajeev Motawani
  • Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions by Alex Andoni and Piotr Indyk.
  • Locality-sensitive hashing scheme based on p-stable distributions, Mayur Datar, Nicole Immorlica, Piotr Indyk, Vahab S. Mirrokni
  • Priority sampling for estimation of arbitrary subset sums by Nick Duffield, Carsten Lund, Mikkel Thorup
  • Streaming Algorithms from Precision Sampling, Alexandr Andoni, Robert Karuthgamer, Krzysztof Onak
  • Tight Results for  Clustering and Summarizing  Data Streams, Sudipto Guha
  • Better streaming algorithms for clustering problems, Moses Charikar, Liadan O’Callaghan, Rina Panigrahy
  • k-means++: The Advantage of Careful Seeding, Sergei Vassilvitskii and David Arthur
  • On Graph Problems in Semi-Streaming Model, Joan Feigenbaum, Sampatha Kannan, Andrew Mcgregor, Siddharth Suri.
  • Graph Distances in the Data-Stream Model, Joan Feigenbaum, Sampatha Kannan, Andrew Mcgregor, Siddharth Suri, Jian Zhang.
  • Graph sparsification in the semi-streaming model, Kook Jin Ahn, Sudipto Guha
  • Bahman Bahmani, Ravi Kumar, Sergei Vassilvitskii. Densest Subgraph in Streaming and MapReduce. In VLDB 2012
  • Silvio Lattanzi, Benjamin Moseley, Siddharth Suri, Sergei Vassilvitskii. Filtering: A Method for Solving Graph Problems in MapReduce. In SPAA 2011
  • Howard Karloff, Siddarth Suri, Sergei Vassilvitskii. A Model of Computation for MapReduce. In SODA 2010 (Austin, Texas) [ pdf]
1

Project Groups

Group                                                                                            Topic

1. Evangelia Christakopoulou and Mohit Sharma                              Crowd-based Recommender System

2. Aarti , Ravindra, Uriah                              Tag Recommendation in Question-Answering System

3. Fan Zhang and Zhiqi Chen                                                          Data Deduplication

4. Haritha Bellam, Vivek mishra and Koshy Rajan           Random Walk Methods to solve Linear Systems in Electrical Networks

5. Rohit Deshpande,  Kaustubh Duraphe and Shikher Sitoke                 Ranking Algorithms

6. Matthew Baudino and Neelabjo Shubhashis Choudhury    Predicting and tracking news using social streams

7. Ali Mahdavi Adeli and Shuya Zhang         Real time Recommender System

8. Matt Nohelty, Roman Dovgopol      Hash-Tag Recommendation in Twitter

9. Mark Andrew Valovage                                           Pattern Mining on Energy Data

10.Huiqi Xu, Jinfei Yin and Shuo Zhou                  Graph Algorithms in Map Reduce

Some Interesting Data Sets

http://aws.amazon.com/publicdatasets/
http://snap.stanford.edu/data/
http://www.infochimps.com/datasets/twitter-census-stock-tweets
http://socialcomputing.asu.edu/datasets/Twitter
http://dumps.wikimedia.org/other/pagecounts-raw/

6

Repository of Relevant Papers

Here we will collect recent papers on  topics that are covered in the class. It will serve as a place to find relevant papers for the projects. This will get populated over the semester.

  • Streaming, Semi-streaming, Sparsity

Daniel M. Kane, Jelani Nelson,  David P. Woodruff. An Optimal Algorithm for the Distinct Elements Problem. PODS’10  [ps][pdf]

Jelani Nelson,  David P. Woodruff, Fast Manhattan Sketches in Data Streams. PODS’10 [ps][pdf]

The Johnson-Lindenstrauss Transform: An Empirical Study, Suresh Venkatasubramanian and Qiushi Wang, ALENEX’11.

Sketching Structured Matrices for Faster Nonlinear Regression, Haim Avron,  Vikas Sindhwani,  David Woodruff, NIPS’13.

A Tight Lower Bound for High Frequency Moment Estimation with Small Error
pdf , Yi Li, David Woodruff, RANDOM’13.

Low Rank Approximation and Regression in Input Sparsity Time, Ken Clarkson, David Woodruff, STOC’13. arXiv 

Faster Robust Linear Regression pdf , K.L. Clarkson, P. Drineas, M. Magdon-Ismail, M.W. Mahoney, ,X. Meng and David Woodruff. SODA’13.

Rectangle-Efficient Aggregation in Spatial Data Streams pdf  Srikanta Tirthapura, David Woodruff.

Space-Efficient Estimation of Statistics over Sub-Sampled Streams pdf  Andrew McGregor, A. Pavan, , Srikanta Tirthapura and David Woodruff.

A General Method for Estimating Correlated Aggregates over a Data Stream, Srikanta Tirthapura, David Woodruff, ICDE’12

Counting and Sampling Triangles from a Graph Stream, A. Pavan, Kanat Tangwongsan, Srikanta Tirthapura, Kun-Lung Wu, VLDB’14.

Applications of the Shannon-Hartley Theorem to Data Streams and Sparse Recovery,  Eric Price, David Woodruff, ISIT’12.

Streaming Algorithms with One-Sided Estimation pdf  Joshua Brody and David Woodruff, RANDOM’11

“Homomorphic Fingerprints under Misalignments: Sketching Edit and Shift Distances”Alexandr Adoni, Assaf Goldberger, Andrew McGregor, and Ely Porat. In STOC’13.

“Shift Finding in Sub-linear Time”, Alexandr Adoni, Haitham Hassanieh, Piotr Indyk, and Dina Katabi. In SODA’13.

“Eigenvalues of a Matrix in the Streaming Model”,  Alexandr Adoni, Huy L. Nguyen. In SODA’13. 
 “Width of Points in the Streaming Model”  , Alexandr Adoni, Huy L. Nguyen. SODA’12.  journal version.

“Near Linear Lower Bound for Dimension Reduction in L1”  , Alexandr Adoni, Moses Charikar, Ofer Neiman, and Huy L. Nguyen, FOCS’11.
conference version.

“Streaming Algorithms via Precision Sampling” , Alexandr Adoni, Robert Krauthgamer and Krzysztof Onak. In FOCS’11. Full version on arXiv:1011.1263.
arXiv versionconference version.
“Polylogarithmic Approximation for Edit Distance and the Asymmetric Query Complexity”  , Alexandr Adoni, Robert Krauthgamer and Krzysztof Onak). FOCS’10arXiv versionconference version.

Quantiles over data streams: An experimental study, L. Wang, G. Luo, K. Yi, and G. Cormode.SIGMOD’13.

Finding interesting correlations with conditional heavy hitters. G. Cormode, K. Mirylenka, T. Palpanas, and D. Srivastava. ICDE’13.

On unifying the space of l0-sampling algorithms.G. Cormode and D. Firmani. ALENEX’13.

  • Map Reduce and Distributed Computing


Max-Cover in Map Reduce
,  Flavio Chierichetti, Ravi Kumar, Andrew Tomkins, WWW’10.

‎Fast greedy algorithms in mapreduce and streaming, Ravi Kumar Benjamin Moseley, Sergei Vassilvitskii, Andrea Vattani, SPAA’13.

Densest Subgraph in Streaming and MapReduce. Bahman Bahmani, Ravi Kumar, Sergei Vassilvitskii. VLDB 2012

A Model of Computation for MapReduce. Howard Karloff, Siddarth Suri, Sergei Vassilvitskii, SODA’10.

Filtering: A Method for Solving Graph Problems in MapReduce. Silvio Lattanzi, Benjamin Moseley, Siddharth Suri, Sergei Vassilvitskii. SPAA 2011 [ pdf ]

Counting Triangles and the Curse of the Last Reducer. Siddharth Suri, Sergei Vassilvitskii. WWW 2011  [ pdf ]

Optimal Random Sampling from Distributed Streams Revisited (paper in pdf) (talk slides in pdf)
Srikanta Tirthapura and David Woodruff , DISC’11.

  • External Memory Algorithms

  • Crowdsourcing

  Leveraging Transitive Relations for Crowdsourced Joins Jiannan Wang, Guoliang Li,  Tim Kraska,  Michael Franklin, Jianhua  Feng, SIGMOD’13

Crowd Mining Yael Amsterdamer, Yael Grossman, Tova Milo, Pierre Senellart, SIGMOD’13

Human-powered Sorts and Joins. A. Marcus, E. Wu, D. Karger, S.Madden, R. Miller, VLDB’12

CrowdER: Crowdsourcing Entity Resolution. J. Wang, T. Kraska, M. J. Franklin, J. Feng, VLDB’12

Crowdsourcing with Endogenous Entry. Arpita Ghosh and Preston Mcafee, WWW’12

Answering Search Queries with CrowdSearcher. A. Bozzon, M. Brambilla and S. Ceri, WWW’12

ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking. G. Demartini, D. Eddine Difallah and P. Cudre’-Mauroux, WWW’12

CrowdScreen: Algorithms for Filtering Data with Humans, A. Parameswaran, H. Garcia-Molina, H. Park, N. Polyzotis, A. Ramesh, J. Widom, SIGMOD 2012

Max Algorithms in Crowdsourcing Environments. P. Venetis, H. Garcia-Molina, K. Huang and N. Polyzotis, WWW’12

So Who Won? Dynamic Max Discovery with the Crowd, S. Guo, Aditya Parameswaran, H Garcia-Molina, SIGMOD 2012

Pushing the Boundaries of Crowd-enabled Databases with Query-driven Schema Expansion. J. Selke, C. Lofi, W. Balke, VLDB’12

CDAS: A Crowdsourcing Data Analytics System. X. Liu, M. Lu, B. C. Ooi, Y. Shen, S. Wu, M. Zhang, VLDB’12

An Online Cost Sensitive Decision-Making Method in Crowdsourcing Systems  Jinyang Gao,  Xuan Liu,  Beng Chin Ooi,  Haixun Wang, Gang Chen, SIGMOD’13

  • Others

Big data integration, Xin Luna Dong, Divesh Srivastava. ICDE’13, VLDB’13 Tutorial.

Diverse Set Selection Over Dynamic Data,  Marina Drosou, Evaggelia Pitoura, TKDE’13

DisC Diversity: Result Diversification based on Dissimilarity and Coverage,  Marina Drosou, Evaggelia Pitoura, VLDB’13

Dense Subgraph Maintenance under Streaming Edge Weight Updates for Real-time Story Identification , Albert Angel, Nick Koudas, Nikos Sarkas, Divesh Srivastava, VLDB’12.

Streaming first story detection with application to twitter Saša Petrović, Miles Osborne, Victor Lavrenko, HLT’10

On Maximum Coverage in the Streaming Model & Application to Multi-topic Blog-Watch Barna Saha, Lise Getoor, SDM’09.

‎Set Cover Algorithms for Very large Datasets, Graham Cormode, Howard Karloff, Anthony Writh, CIKM’10.

Scalable k-Means++. Bahman Bahmani, Benjamin Moseley, Andrea Vattani, Ravi Kumar, Sergei Vassilvitskii. VLDB’12.

Sorting and Selection with Imprecise Comparisons, Miklós Ajtai, Vitaly FeldmanAvinatan Hassidim, Jelani Nelson. ICALP ’09, [ps][pdf][full_ps][full_pdf]

Sketching Techniques for Large-Scale NLP Amit Goyal, Jagadeesh Jagarlamudi, Hal Daume and Suresh Venkatasubramanian 6th Web as Corpus Workshop (in conjunction with NAACL-HLT 2010)

Streamed Learning: One-Pass SVMs Piyush Rai, Hal Daume III, and Suresh Venkatasubramanian, IJCAI-09

Streaming for large scale NLP: Language Modelling, Amit Goyal, Hal Daume and Suresh Venkatasubramanian, NAACL -HLT’09