Repository of Relevant Papers

Here we will collect recent papers on  topics that are covered in the class. It will serve as a place to find relevant papers for the projects. This will get populated over the semester.

  • Streaming, Semi-streaming, Sparsity

Daniel M. Kane, Jelani Nelson,  David P. Woodruff. An Optimal Algorithm for the Distinct Elements Problem. PODS’10  [ps][pdf]

Jelani Nelson,  David P. Woodruff, Fast Manhattan Sketches in Data Streams. PODS’10 [ps][pdf]

The Johnson-Lindenstrauss Transform: An Empirical Study, Suresh Venkatasubramanian and Qiushi Wang, ALENEX’11.

Sketching Structured Matrices for Faster Nonlinear Regression, Haim Avron,  Vikas Sindhwani,  David Woodruff, NIPS’13.

A Tight Lower Bound for High Frequency Moment Estimation with Small Error
pdf , Yi Li, David Woodruff, RANDOM’13.

Low Rank Approximation and Regression in Input Sparsity Time, Ken Clarkson, David Woodruff, STOC’13. arXiv 

Faster Robust Linear Regression pdf , K.L. Clarkson, P. Drineas, M. Magdon-Ismail, M.W. Mahoney, ,X. Meng and David Woodruff. SODA’13.

Rectangle-Efficient Aggregation in Spatial Data Streams pdf  Srikanta Tirthapura, David Woodruff.

Space-Efficient Estimation of Statistics over Sub-Sampled Streams pdf  Andrew McGregor, A. Pavan, , Srikanta Tirthapura and David Woodruff.

A General Method for Estimating Correlated Aggregates over a Data Stream, Srikanta Tirthapura, David Woodruff, ICDE’12

Counting and Sampling Triangles from a Graph Stream, A. Pavan, Kanat Tangwongsan, Srikanta Tirthapura, Kun-Lung Wu, VLDB’14.

Applications of the Shannon-Hartley Theorem to Data Streams and Sparse Recovery,  Eric Price, David Woodruff, ISIT’12.

Streaming Algorithms with One-Sided Estimation pdf  Joshua Brody and David Woodruff, RANDOM’11

“Homomorphic Fingerprints under Misalignments: Sketching Edit and Shift Distances”Alexandr Adoni, Assaf Goldberger, Andrew McGregor, and Ely Porat. In STOC’13.

“Shift Finding in Sub-linear Time”, Alexandr Adoni, Haitham Hassanieh, Piotr Indyk, and Dina Katabi. In SODA’13.

“Eigenvalues of a Matrix in the Streaming Model”,  Alexandr Adoni, Huy L. Nguyen. In SODA’13. 
 “Width of Points in the Streaming Model”  , Alexandr Adoni, Huy L. Nguyen. SODA’12.  journal version.

“Near Linear Lower Bound for Dimension Reduction in L1”  , Alexandr Adoni, Moses Charikar, Ofer Neiman, and Huy L. Nguyen, FOCS’11.
conference version.

“Streaming Algorithms via Precision Sampling” , Alexandr Adoni, Robert Krauthgamer and Krzysztof Onak. In FOCS’11. Full version on arXiv:1011.1263.
arXiv versionconference version.
“Polylogarithmic Approximation for Edit Distance and the Asymmetric Query Complexity”  , Alexandr Adoni, Robert Krauthgamer and Krzysztof Onak). FOCS’10arXiv versionconference version.

Quantiles over data streams: An experimental study, L. Wang, G. Luo, K. Yi, and G. Cormode.SIGMOD’13.

Finding interesting correlations with conditional heavy hitters. G. Cormode, K. Mirylenka, T. Palpanas, and D. Srivastava. ICDE’13.

On unifying the space of l0-sampling algorithms.G. Cormode and D. Firmani. ALENEX’13.

  • Map Reduce and Distributed Computing

Max-Cover in Map Reduce
,  Flavio Chierichetti, Ravi Kumar, Andrew Tomkins, WWW’10.

‎Fast greedy algorithms in mapreduce and streaming, Ravi Kumar Benjamin Moseley, Sergei Vassilvitskii, Andrea Vattani, SPAA’13.

Densest Subgraph in Streaming and MapReduce. Bahman Bahmani, Ravi Kumar, Sergei Vassilvitskii. VLDB 2012

A Model of Computation for MapReduce. Howard Karloff, Siddarth Suri, Sergei Vassilvitskii, SODA’10.

Filtering: A Method for Solving Graph Problems in MapReduce. Silvio Lattanzi, Benjamin Moseley, Siddharth Suri, Sergei Vassilvitskii. SPAA 2011 [ pdf ]

Counting Triangles and the Curse of the Last Reducer. Siddharth Suri, Sergei Vassilvitskii. WWW 2011  [ pdf ]

Optimal Random Sampling from Distributed Streams Revisited (paper in pdf) (talk slides in pdf)
Srikanta Tirthapura and David Woodruff , DISC’11.

  • External Memory Algorithms

  • Crowdsourcing

  Leveraging Transitive Relations for Crowdsourced Joins Jiannan Wang, Guoliang Li,  Tim Kraska,  Michael Franklin, Jianhua  Feng, SIGMOD’13

Crowd Mining Yael Amsterdamer, Yael Grossman, Tova Milo, Pierre Senellart, SIGMOD’13

Human-powered Sorts and Joins. A. Marcus, E. Wu, D. Karger, S.Madden, R. Miller, VLDB’12

CrowdER: Crowdsourcing Entity Resolution. J. Wang, T. Kraska, M. J. Franklin, J. Feng, VLDB’12

Crowdsourcing with Endogenous Entry. Arpita Ghosh and Preston Mcafee, WWW’12

Answering Search Queries with CrowdSearcher. A. Bozzon, M. Brambilla and S. Ceri, WWW’12

ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking. G. Demartini, D. Eddine Difallah and P. Cudre’-Mauroux, WWW’12

CrowdScreen: Algorithms for Filtering Data with Humans, A. Parameswaran, H. Garcia-Molina, H. Park, N. Polyzotis, A. Ramesh, J. Widom, SIGMOD 2012

Max Algorithms in Crowdsourcing Environments. P. Venetis, H. Garcia-Molina, K. Huang and N. Polyzotis, WWW’12

So Who Won? Dynamic Max Discovery with the Crowd, S. Guo, Aditya Parameswaran, H Garcia-Molina, SIGMOD 2012

Pushing the Boundaries of Crowd-enabled Databases with Query-driven Schema Expansion. J. Selke, C. Lofi, W. Balke, VLDB’12

CDAS: A Crowdsourcing Data Analytics System. X. Liu, M. Lu, B. C. Ooi, Y. Shen, S. Wu, M. Zhang, VLDB’12

An Online Cost Sensitive Decision-Making Method in Crowdsourcing Systems  Jinyang Gao,  Xuan Liu,  Beng Chin Ooi,  Haixun Wang, Gang Chen, SIGMOD’13

  • Others

Big data integration, Xin Luna Dong, Divesh Srivastava. ICDE’13, VLDB’13 Tutorial.

Diverse Set Selection Over Dynamic Data,  Marina Drosou, Evaggelia Pitoura, TKDE’13

DisC Diversity: Result Diversification based on Dissimilarity and Coverage,  Marina Drosou, Evaggelia Pitoura, VLDB’13

Dense Subgraph Maintenance under Streaming Edge Weight Updates for Real-time Story Identification , Albert Angel, Nick Koudas, Nikos Sarkas, Divesh Srivastava, VLDB’12.

Streaming first story detection with application to twitter Saša Petrović, Miles Osborne, Victor Lavrenko, HLT’10

On Maximum Coverage in the Streaming Model & Application to Multi-topic Blog-Watch Barna Saha, Lise Getoor, SDM’09.

‎Set Cover Algorithms for Very large Datasets, Graham Cormode, Howard Karloff, Anthony Writh, CIKM’10.

Scalable k-Means++. Bahman Bahmani, Benjamin Moseley, Andrea Vattani, Ravi Kumar, Sergei Vassilvitskii. VLDB’12.

Sorting and Selection with Imprecise Comparisons, Miklós Ajtai, Vitaly FeldmanAvinatan Hassidim, Jelani Nelson. ICALP ’09, [ps][pdf][full_ps][full_pdf]

Sketching Techniques for Large-Scale NLP Amit Goyal, Jagadeesh Jagarlamudi, Hal Daume and Suresh Venkatasubramanian 6th Web as Corpus Workshop (in conjunction with NAACL-HLT 2010)

Streamed Learning: One-Pass SVMs Piyush Rai, Hal Daume III, and Suresh Venkatasubramanian, IJCAI-09

Streaming for large scale NLP: Language Modelling, Amit Goyal, Hal Daume and Suresh Venkatasubramanian, NAACL -HLT’09

6 thoughts on “Repository of Relevant Papers

  1. Here is link to some other papers on crowdsourcing that could be relevant:

    To Crowdsource Or Not to Crowdsource?

    Efficient crowdsourcing for multi-class labeling

    Analytic Methods for Optimizing Realtime Crowdsourcing

    Iterative Learning for Reliable Crowdsourcing Systems

    Learning From Crowds

    Click to access raykar_JMLR_2010_crowds.pdf

    Crowdsourced Enumeration Queries

    Click to access ICDE13_conf_full_690.pdf

    Incremental Relabeling for Active Learning with Noisy Crowdsourced Annotations

    Click to access socialcom2011-rahuls.pdf

    Budget-optimal Crowdsourcing using Low-rank Matrix Approximations

    Click to access paper_crowdsourcing_allerton.pdf

    A Collaborative Mechanism for Crowdsourcing Prediction Problems

    Truthful Incentives in Crowdsourcing Tasks using Regret Minimization Mechanisms

    Click to access singla13truthful.pdf

    Crowdresearch blog

  2. As a continuation to Mohit’s post:

    A few extra papers on crowdsourcing:

    *Counting with the crowd:

    Click to access p109-marcus.pdf

    *Pairwise Ranking Aggregation in a Crowdsourced Setting

    Click to access crowd_pairwise.pdf

    *Evaluating the Crowd with Confidence

    Click to access confidence.pdf

    *Cross-Task Crowdsourcing

    Click to access kdd2013-ctc%205June-1.pdf

    *Harnessing the Wisdom of the Crowds for Accurate Web Page Clipping

    Click to access kdd210-Lei.pdf

    *Learning from Crowds in the Presence of Schools of Thought

    Click to access kdd2012-tian.pdf

  3. Below are two blog posts that provide a nice overview of the major Google Big Data related papers from the last 10 years. They cover some of the original and well know contributions like MapReduce and BigTable as well as some of the newer and lesser known ideas like Megastore. There’s also links to the actual papers if you want more detail.

  4. Open Calais (mentioned in the first presentation of tonight’s class) is an entity extraction tool built by Thomson Rueters. You can read more about it here –

    It has an API that you would use in a real application but you can test it out using a web interface here ( if you want to see how it works without having to register for an API key.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s