Spring 2017


Team Lead    Jerry Lin

Ideal Skills/Qualifications    Python, R, web development experience, data visualization experience

Average time commitment    4 hrs/wk

This semester, Data Science Society at Berkeley will be partnering with FairVote, "a non-partisan, 501(c)(3) non-profit organization that seeks to make democracy fair, functional, and more representative." Many of our tasks will depend on our collective skill set, but, broadly speaking, we aim to find ways to educate the public about reforms that make our democracy mathematically sane in a way that is accessible, transparent, and engaging. Possible projects include an interactive website comparing the current first-past-the-post system with instant runoff voting, its leading alternative, and with other proposed alternatives such as score voting, approval voting, and quadratic voting. If you think our democracy is broken and you want to use your data science skills for social good this semester, this is the project for you.
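To make the comparison concrete, here is a minimal Python sketch of how first past the post and instant runoff voting can pick different winners from the same ranked ballots. The ballot format (a list of candidate names in order of preference), the function names, and the alphabetical tie-breaking rule are all assumptions made for illustration; real election rules handle ties and exhausted ballots in more detail.

```python
from collections import Counter


def plurality_winner(ballots):
    """First past the post: only each ballot's first choice is counted."""
    tally = Counter(ballot[0] for ballot in ballots if ballot)
    return tally.most_common(1)[0][0]


def instant_runoff_winner(ballots):
    """Drop the last-place candidate each round until someone has a majority.

    Assumes at least one non-empty ballot. Ties are broken alphabetically,
    which is a toy rule chosen only to keep the sketch deterministic.
    """
    remaining = sorted({c for ballot in ballots for c in ballot})
    while True:
        # Each ballot counts for its highest-ranked candidate still in the race.
        tally = {c: 0 for c in remaining}
        for ballot in ballots:
            for choice in ballot:
                if choice in tally:
                    tally[choice] += 1
                    break
        total = sum(tally.values())
        leader = max(remaining, key=lambda c: tally[c])
        if tally[leader] * 2 > total or len(remaining) == 1:
            return leader
        loser = min(remaining, key=lambda c: tally[c])
        remaining.remove(loser)


# Hypothetical electorate: A leads on first choices, but B is preferred
# by a majority once C is eliminated.
ballots = [["A", "B", "C"]] * 4 + [["B", "C", "A"]] * 3 + [["C", "B", "A"]] * 2
print(plurality_winner(ballots), instant_runoff_winner(ballots))
```

In this made-up example, A wins under first past the post with 4 of 9 first choices, while B wins under instant runoff after C's two ballots transfer to B.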

Lending Behavior on Kiva

Team Lead    Divyansh Agarwal

Ideal Skills/Qualifications    Machine Learning, Time Series Modeling, Natural Language Processing, Image Processing, Social Network Analysis (Instead of having experience in all these areas, a single individual is expected to have deep experience in ONE or TWO of these areas only)

Average time commitment    5-6 hrs/wk

This project aims to identify the factors that influence the popularity of a loan on Kiva's peer-to-peer microlending platform. The analysis will draw on several types of data: we will investigate how factors such as features of borrowers' images, lender motivations, borrowers' written descriptions of how a loan will be used, borrower characteristics, and timing affect the popularity of loans. We will also study the behavior of lending teams using social network analysis techniques.

How Zebrafish Larvae Decide to Capture Prey

Team Lead    Irene Grossrubatscher

Ideal Skills/Qualifications    Matlab, Microscopy and neural science interest, Image processing experience

Average time commitment    5-10 hrs/wk

Prey capture in zebrafish is an innate behavior where sensory information (mostly visual) is transformed into a motor output. Every capture is a stereotyped sequence of movements that starts with an eye convergence (an increase in depth perception achieved by increasing the angle between the eyes). The zebrafish model allows whole-field imaging of brain activity through the transparent head of the larva, as well as manipulation of brain activity through optogenetics, and is therefore a good model for exploring the circuit behind the decision to capture prey. Tasks for the project will be:

(1) Develop software to automatically and rapidly detect eye convergences from video files.
(2) Develop software for a prey-counting assay to assess the number of successful captures (the zebrafish's prey is a unicellular organism called a paramecium).
(3) Develop new software, or adapt existing software, to classify stages of prey capture from video files.
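As a rough illustration of task (1), the sketch below flags convergence onsets from per-frame eye angles. It assumes an upstream tracker has already extracted, for every video frame, each eye's rotation toward the midline in degrees; the function name, the 50-degree threshold, and the three-frame minimum duration are hypothetical placeholders, not values from the lab's protocol.

```python
def detect_convergence_onsets(left_deg, right_deg, threshold_deg=50.0, min_frames=3):
    """Return the frame indices where a sustained eye convergence begins.

    The vergence angle is taken as the sum of the two eye angles. A
    convergence is reported once the vergence stays at or above
    threshold_deg for min_frames consecutive frames.
    """
    onsets = []
    run_start = None  # first frame of the current above-threshold run
    for frame, (left, right) in enumerate(zip(left_deg, right_deg)):
        if left + right >= threshold_deg:
            if run_start is None:
                run_start = frame
            # Report each run exactly once, when it reaches the minimum length.
            if frame - run_start + 1 == min_frames:
                onsets.append(run_start)
        else:
            run_start = None
    return onsets


# Hypothetical angle traces: one sustained convergence and one single-frame blip.
left = [10, 10, 30, 30, 30, 30, 10, 30, 10]
right = [10, 10, 30, 30, 30, 30, 10, 30, 10]
print(detect_convergence_onsets(left, right))
```

Requiring a minimum run length is one simple way to ignore single-frame tracking noise; a real pipeline would likely need smoothing and a threshold calibrated per animal.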

You may watch a couple of sample videos here

Mental Health Crises

Team Lead    Orianna DeMasi

Ideal Skills/Qualifications    Python: Jupyter notebooks, matplotlib/seaborn, Pandas, Intro-level SQL, Patience, Creativity, Communication skills

Average time commitment    10 hrs/wk

The student leader is working with the Mental Health Crisis Team in the city of Berkeley to answer questions about mental health crises, i.e., those that require emergency response, in Berkeley. Eventually, we would like to move to broader questions, e.g., looking for patterns across cities of the Bay Area. Preliminary results showed that there are patterns in when mental health crises occur and that crises may be associated with socioeconomic status. While these results are unsurprising, they are useful to the team for better allocating resources and can lead to policy changes.

This project is exciting because it has the potential for relatively “immediate” real-world impact. However, it is challenging in the amount of creativity it requires to ask interesting questions of small, seemingly unrelated datasets. The project is also centered on real datasets, which are inevitably messy. A good undergraduate match will be interested in learning new tools and working with municipal employees to improve mental health crisis response. The majority of the work is tracking down datasets, visualizing those datasets, and communicating findings.

DSEP X DSSB | Data Science Course Mapping Website Development

Team Lead    Subhiksha Mani

Ideal Skills/Qualifications    JavaScript (and preferably some visualization framework), Web Development experience, User Interface Design/Development

Average time commitment    10 hrs/wk

With the first upper division Data Science course, CS/STAT C100, being offered, the formation of the new Division of Data Science has moved a step further. More and more students have asked for a course map, covering all prerequisites and requirements, that they could follow to chart a path through Data Science. We are therefore collaborating with the Course Mapping Team under the BIDS URAP program to develop a website offering students course and academic path guides.

We are looking for 2-3 students who are interested in contributing to the development and who meet the requirements above.

Fall 2016

Google Chatbot Metrics

Team Lead    Jerry Chen

A chatbot is a computer program designed to simulate conversation. Over time, chatbots have become more sophisticated. Call the earliest version of a chatbot Version 0 (V0). Call the next version of that chatbot V1. And so on. The difference between Vn and Vn+1 could be various things, including, but not limited to, additional features, bug fixes, personality, and knowledge. The goal of this project is to construct a function that maps Vn and Vn+1 to a number that quantifies their differences. Was there a minor bug fix or did the new version add tons of new capabilities? Such a function will contribute to metrics at Google.
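One very simple baseline for such a function, offered only as an illustration and not as the project's actual approach, is to describe each version by a set of capability labels and take the Jaccard distance between the two sets. The capability labels and the function name below are hypothetical.

```python
def version_distance(features_old, features_new):
    """Jaccard distance between two versions' capability sets.

    Returns 0.0 when the sets are identical (e.g. an internal bug fix that
    adds no capability) and 1.0 when the versions share nothing.
    """
    old, new = set(features_old), set(features_new)
    union = old | new
    if not union:
        return 0.0  # two empty versions are treated as identical
    return 1.0 - len(old & new) / len(union)


# Hypothetical capability labels for three versions of a chatbot.
v0 = {"greeting", "weather"}
v1 = {"greeting", "weather"}                   # bug-fix release
v2 = {"greeting", "weather", "jokes", "news"}  # feature release
print(version_distance(v0, v1), version_distance(v1, v2))
```

A set-based score like this ignores changes in personality or answer quality, which is part of what makes the real problem open-ended.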

Extracting Facts From Wikipedia Articles Using Deep Learning (Research)

Team Lead    Jamie Murdoch

'WikiReading' is a dataset introduced by Google last month designed to extract facts from Wikipedia articles. Each of the 18.58 million entries contains a Wikipedia document, a property, and a value. The task is to automatically read through the document while conditioning on the property, in order to extract the correct value. The skewed distribution of properties makes predicting the rare properties quite hard. However, we would expect that, for instance, the patterns associated with the "place of birth" and "date of birth" properties would be similar. One idea for tackling this problem is to use deep learning architectures that could improve fact extraction accuracy by allowing these relationships to be reflected in the learned model.

Jamie Murdoch is a second year PhD student working with Bin Yu on using deep learning for extracting facts from documents. He has ongoing collaborations with the AI research team at Facebook, and Joan Bruna at NYU, and has worked with Facebook's data science research team in the past.

Data Mining for Applications in Mental Health (Research)

Team Lead    TBD

Closed - Continuing from last semester

Data Science 100 (DS100) Course Development

Team Lead    TBD

Next spring, DS 100 will be a new upper division course directly following DATA 8 and its associated connector courses. It is a transitional course between lower division DS/CS and upper division CS/Stats, with the goal of giving students real, hands-on projects that reflect real-world challenges. The prerequisites for this course are DS8/CS61A and MATH54. It is also recommended to have taken at least one connector course.

The DS 100 organizers are looking for students interested in contributing to the development of this course who: 1) have taken 186, 188, and/or 189 and earned an A- or above; 2) are willing to commit substantial time and stay on to become TAs in the spring; 3) have prior industry experience (not required but great to have); 4) will contribute to the design of projects and design a few homework assignments; 5) are available to meet once a week.

Data-X Course Development

Team Lead    Ikhlaq Sidhu

This fall we will be developing a new course called Data-X. It is a project course at the intersection of CS Tools, Math Models, and Real Life Problems for undergraduate and masters level students. We believe the course can potentially bring some difficult math concepts to life and also make them understandable. The course is also complementary with the SCET Innovation Collider.

If you have background in CS tools and would like to help in the development of this course, in addition to submitting a brief statement of interest, please send your resume to dss.berkeley@gmail.com.

Requirements: Python and, ideally, various interfaces including SQL, NumPy/Pandas, Spark pipelines, and/or any of the popular ML frameworks, including TensorFlow. No single person needs experience with all of these tools. We are recruiting 2-3 students to work on the project over the fall. Work-study eligibility is preferred.

Ikhlaq Sidhu is the Chief Scientist and Founding Director of UC Berkeley’s Sutardja Center for Entrepreneurship & Technology. He received the IEOR Emerging Area Professor Award from his department at Berkeley. Prof. Sidhu founded the Fung Institute for Engineering Leadership. He serves as the faculty director of the Engineering Leadership Professional Program (ELPP).

Data Science 8 (DS8) Homework Party Tutor

Team Lead    Jerry Chen

We are continuing with the DS 8 HW Party this Fall semester. As a HW Party tutor you will be working with students enrolled in DS 8 and guiding them throughout their projects and assignments. Tutors are required to have taken DS 8.

Spring 2016

Analyzing User Bidding Behaviour on Yahoo Dataset

Team Lead    Joao Carreira

Yahoo has recently released a large and diverse dataset that aggregates information from many Yahoo products. We will be analysing the bids graph dataset to answer the following questions:

1. What is the seasonality/periodicity of ads bids?

2. How efficient is the bidding? Can participants bid on similar queries to "cover" a "space" of similar queries? How much of a "query space" does a participant cover?
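For the first question, one common starting point is to look for the lag at which a bid-count series correlates most strongly with itself. The sketch below assumes the bids have already been aggregated into one count per time step (for example, per day); the function names and the sample series are made up for illustration and do not come from the Yahoo dataset.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given positive lag."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    if variance == 0:
        return 0.0  # a constant series has no periodic structure to measure
    covariance = sum(
        (series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag)
    )
    return covariance / variance


def dominant_period(series, max_lag):
    """Return the lag in 1..max_lag with the highest autocorrelation."""
    return max(range(1, max_lag + 1), key=lambda lag: autocorrelation(series, lag))


# Hypothetical daily bid counts with a spike every seventh day, over 8 weeks.
daily_bids = [1, 2, 3, 4, 5, 6, 20] * 8
print(dominant_period(daily_bids, max_lag=10))
```

On real data the series would usually be detrended first, and several candidate periods (daily, weekly, yearly) may appear as separate autocorrelation peaks.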

OpenRide Data Modeling

Team Lead     Owen Scott, OpenRide

OpenRide is an app- and web-based ride share platform that connects drivers who have available seats with passengers seeking to travel. It targets long-distance travel. We will be using the anonymized user data to:

1. Develop a model for measuring proximity between each pair of ride post and ride request.

2. Effectively sync supply and demand to improve user retention.
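As one possible starting point for the proximity model in item 1, the sketch below scores a ride post against a ride request by combining the great-circle distance between their origins, the distance between their destinations, and the gap in departure time. The record fields (`origin`, `destination`, `depart_hour`) and the 50 km-per-hour weighting are assumptions for illustration, not OpenRide's actual schema or model.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))


def ride_mismatch(post, request, km_per_hour=50.0):
    """Lower is a better match; 0.0 means identical endpoints and departure time.

    km_per_hour converts a departure-time gap into an equivalent distance.
    It is an arbitrary weight that would need tuning on real data.
    """
    endpoint_gap = haversine_km(post["origin"], request["origin"]) + haversine_km(
        post["destination"], request["destination"]
    )
    time_gap_hours = abs(post["depart_hour"] - request["depart_hour"])
    return endpoint_gap + km_per_hour * time_gap_hours


# Hypothetical records: a Berkeley-to-Los Angeles post and a nearby request.
post = {"origin": (37.8715, -122.2730), "destination": (34.0522, -118.2437), "depart_hour": 9.0}
request = {"origin": (37.8044, -122.2712), "destination": (34.0522, -118.2437), "depart_hour": 10.0}
print(round(ride_mismatch(post, request), 1))
```

A score like this could be computed for every post and request pair and then used to rank candidate matches.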

Data Mining in the Energy Sector

Team Lead     Roel Dobbe PhD Researcher, Hybrid Systems Lab

The project involves using data mining methods to understand optimal inverter control settings for minimizing the impact of distributed energy resources (such as PV panels, electric vehicles, and battery storage) on the grid.

We aim to develop deep learning algorithms to learn these settings in an online setting.

Project in AMPLab

Team Lead     AMPLab Researcher

We are currently unable to disclose project information to the public. You will learn the details after you are selected for the project.

Basic Requirements: Fluency in Spark and Scala.

Kaggle Competition

Team Lead     Pending

In collaboration with the Undergraduate Statistics Association, we form teams to participate in Kaggle competitions.

We offer external guidance from a top-ranking Kaggle player.

Bay Area-Oriented Data Analysis

Team Lead     Vincent Zuo & Rose Zhao

Interested in learning more about San Francisco and taking your data science skills to a higher level? This project is a perfect match for you!

We'll explore SF using data on its economy, culture, environment, city management, and more. You may choose the topic that interests you most and do a comprehensive analysis of it. Possible topics include:

Economy and Community, City Management and Ethics, Transportation, Public Safety, Health and Social Services, Geographic Locations and Boundaries, Energy and Environment, Housing and Buildings, City Infrastructure, Culture and Recreation.

Data Mining for Applications in Mental Health

Team Lead     Orianna DeMasi

We will be exploring various datasets related to human brain activities, and looking for data driven solutions to improve mental health.