CSE6331 Student Presentations
Instead of a large project, a student can do a 20-minute in-class presentation related to a big data system (not covered in class) or related to a research topic, such as a specific Data Mining method for big data, a DFS indexing method, an optimization technique for big data queries, etc.
The purpose of the talk is to give you a first taste
of what it is like to do research in a particular area. You
will find this experience valuable if you are
planning to do a PhD or get involved in research in some way.
Students who wish to do a presentation must email the GTA, Soumyava Das, at email@example.com before midnight on Thursday, October 13,
with at least 3 paper titles, ranked from 1 to 3 (where 1 means your top choice).
Please use "cse6331 paper selection" in the subject line of your email.
Your selected papers can come from the list below or from elsewhere.
If the paper is not listed below, it must be related to Big Data, on a topic that
has not been covered in class, and must be at least 16 pages single-column or 10 pages double-column.
If the paper is not listed below, you should also
include a full citation (author names, journal/conference, year) AND a link to the
PDF of the paper (or you may attach the PDF to your email).
If you fail to email the GTA your paper choices by the posted deadline
and you have not selected the final large project option, you will automatically
be assigned a random paper from the list below to present.
Otherwise, the GTA will assign you a paper based on your preferences.
Each student, of course, will be assigned a different paper.
Note that, before your talk, you need to upload your slides (PowerPoint or PDF) using the form at the bottom.
You may also find papers at
100 open source Big Data architecture papers for data professionals, by Anil Madan.
Here are some papers to choose from (in random order):
- Design Patterns for Efficient Graph Algorithms in MapReduce
- A Comparison of Approaches to Large-Scale Data Analysis
- Shark: fast data analysis using coarse-grained distributed memory
- Lustre: Building a File System for 1,000-node Clusters
- Distributed Approximate Spectral Clustering for Large-Scale Datasets
- Large-scale machine learning at Twitter
- Sparrow: distributed, low latency scheduling
- HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads
- ASTERIX: Towards a Scalable, Semistructured Data Platform for Evolving-World Models
- Large-scale Incremental Processing Using Distributed Transactions and Notifications
- Apache Hadoop YARN: yet another resource negotiator
- The Google File System
- The Hadoop Distributed File System
- Cassandra: a decentralized structured storage system
- Adapting Microsoft SQL Server for Cloud Computing
- Extreme scale with full SQL language support in Microsoft SQL Azure
- Megastore: Providing Scalable, Highly Available Storage for Interactive Services
- Spanner: Google's Globally-Distributed Database
- Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing
- Mesos: a platform for fine-grained resource sharing in the data center
- The Chubby lock service for loosely-coupled distributed systems
- ZooKeeper: Wait-free coordination for Internet-scale systems
- Dremel: Interactive Analysis of Web-Scale Datasets
- Dryad: distributed data-parallel programs from sequential building blocks
- Druid: a real-time analytical data store
- Kafka: a Distributed Messaging System for Log Processing
- Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications
- Large-scale cluster management at Google with Borg
- Fast crash recovery in RAMCloud
- Gorilla: A Fast, Scalable, In-Memory Time Series Database
- The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing
- GraphX: Unifying Data-Parallel and Graph-Parallel Analytics
- Low Latency Analytics of Geo-distributed Data in the Wide Area
- Dynamo: Amazon's highly available key-value store
- HaLoop: Efficient Iterative Data Processing on Large Clusters
- DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing using a High-Level Language
- Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud
- The Stratosphere (Flink) platform for big data analytics
- SQL-on-Hadoop: Full Circle Back to Shared-Nothing Database Architectures
- MLbase: A Distributed Machine Learning System
There are many places to find the latest and best papers related to your topic.
The UTA Library has accounts/passwords for many digital libraries and you can now download
most publications for free (from the uta.edu domain only):
- The ACM Digital Library: All ACM publications in PDF.
- The IEEE Xplore Digital Library: All IEEE publications in PDF.
- Google Scholar: Many papers online.
The GTA will schedule the talks during the last 3-4 weeks of the semester.
Each class lecture will have 3 student talks.
Students will present their assigned research paper in class on the date scheduled.
Each student must prepare PowerPoint or PDF slides (about 15-20 slides) for the talk.
The duration of each presentation is 20 minutes.
It is not permitted to copy any text from the paper or copy slides from the
author's web page or from any web site.
It is OK to copy figures (not formulas) from the paper as long as you give a reference.
For more information on how to give a research talk, read:
The presentations will be graded based on the following criteria:
- Quality of slides (25%):
Your slides should be well-prepared, neat,
and understandable. They should emphasize the main concepts of the paper and illustrate the
concepts with examples. The introduction should be short and to the point,
and the conclusion should summarize problems, what has been done, and what needs to be done.
- Clarity of presentation (20%): Make the presentation clear and spend
most of the presentation time on your topic, explaining it as clearly as possible.
- Understanding of your research paper (20%): Understand your assigned paper well enough that you can
answer any questions asked during and after the presentation.
- Main points of the paper (20%): Clearly identify the main points of the paper
(what the authors are trying to do, the approach taken to accomplish this, how this
approach differs from other approaches, etc.). Try to avoid formulas and theorems.
- Time of presentation (15%): Practice your presentation so that it takes the
appropriate amount of time (20 minutes).
Submit your Slides
You need to submit your slides (PowerPoint or PDF) before your talk.
Last modified: 09/29/2016 by Leonidas Fegaras