CSE6331 Student Presentations

Instead of a large project, a student can give a 20-minute in-class presentation on a big data system (not covered in class) or on a research topic, such as a specific data mining method for big data, a DFS indexing method, or an optimization technique for big data queries. The purpose of the talk is to give you a first taste of what it is like to do research in a particular area. You will find this experience valuable if you are planning to pursue a PhD or get involved in research in some way.

Students who wish to give a presentation must email the GTA, Soumyava Das, at soumyava.das@mavs.uta.edu before midnight on Thursday, October 13, with at least 3 paper titles, ranked from 1 to 3 (where 1 is your top choice). Please use "cse6331 paper selection" as the subject line of your email. You may select papers from the list below or papers not listed here. A paper that is not listed below must be related to Big Data, on a topic that has not been covered in class, and must be at least 16 pages single-column or 10 pages double-column. For such a paper, you should also include a full citation (author names, journal/conference, year) and a link to the PDF of the paper (or you may attach the PDF to your email). If you fail to email the GTA your paper choices by the posted deadline and you have not selected the final large project option, you will automatically be assigned a random paper from the list below to present. Otherwise, the GTA will assign you a paper based on your preferences. Each student will be assigned a different paper. Note that, before your talk, you need to upload your slides (PowerPoint or PDF) using the form at the bottom.

Research Papers

Here are some papers to choose from (in random order):

  1. Design Patterns for Efficient Graph Algorithms in MapReduce
  2. A Comparison of Approaches to Large-Scale Data Analysis
  3. Shark: fast data analysis using coarse-grained distributed memory
  4. Lustre: Building a File System for 1,000-node Clusters
  5. Distributed Approximate Spectral Clustering for Large-Scale Datasets
  6. Large-scale machine learning at Twitter
  7. Sparrow: distributed, low latency scheduling
  8. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads
  9. ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models
  10. Large-scale Incremental Processing Using Distributed Transactions and Notifications
  11. Apache Hadoop YARN: yet another resource negotiator
  12. The Google File System
  13. The Hadoop Distributed File System
  14. Cassandra: a decentralized structured storage system
  15. Adapting Microsoft SQL Server for Cloud Computing
  16. Extreme scale with full SQL language support in Microsoft SQL Azure
  17. Megastore: Providing Scalable, Highly Available Storage for Interactive Services
  18. Spanner: Google's Globally-Distributed Database
  19. Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing
  20. Mesos: a platform for fine-grained resource sharing in the data center
  21. The Chubby lock service for loosely-coupled distributed systems
  22. ZooKeeper: Wait-free coordination for Internet-scale systems
  23. Dremel: Interactive Analysis of Web-Scale Datasets
  24. Dryad: distributed data-parallel programs from sequential building blocks
  25. Druid: a real-time analytical data store
  26. Kafka: a Distributed Messaging System for Log Processing
  27. Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications
  28. Large-scale cluster management at Google with Borg
  29. Fast crash recovery in RAMCloud
  30. Gorilla: A Fast, Scalable, In-Memory Time Series Database
  31. The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing
  32. GraphX: Unifying Data-Parallel and Graph-Parallel Analytics
  33. Low Latency Analytics of Geo-distributed Data in the Wide Area
  34. Dynamo: Amazon's highly available key-value store
  35. HaLoop: Efficient Iterative Data Processing on Large Clusters
  36. DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing using a High-Level Language
  37. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud
  38. The Stratosphere (Flink) platform for big data analytics
  39. SQL-on-Hadoop: Full Circle Back to Shared-Nothing Database Architectures
  40. MLbase: A Distributed Machine Learning System

You may also find papers in "100 open source Big Data architecture papers for data professionals", by Anil Madan.

There are many places to find the latest and best papers related to your topic. The UTA Library has accounts with many digital libraries, so you can download most publications for free (from the uta.edu domain only):

The ACM Digital Library: All ACM publications in PDF.
The IEEE Xplore Digital Library: All IEEE publications in PDF.
Google Scholar: Many papers online.

Research Talk

The GTA will schedule the talks during the last 3-4 weeks of the semester. Each class lecture will have 3 student talks. Each student will present the assigned research paper in class on the scheduled date. Each student must prepare PowerPoint or PDF slides (about 15-20 slides) for the talk. Each presentation lasts 20 minutes. You may not copy any text from the paper, and you may not copy slides from the authors' web page or from any other web site. You may copy figures (but not formulas) from the paper, as long as you cite the source. The presentations will be graded based on the following criteria:

For more information on how to give a research talk, read:

Submit your Slides

You need to submit your slides (PowerPoint or PDF) before your talk.

Last modified: 09/29/2016 by Leonidas Fegaras