






























Cathryn Carson, History
David Culler, Electrical Engineering and Computer Sciences
Bob Jacobsen, Dean of Undergraduate Studies
University of California, Berkeley
November 2016
needs, both at the entry level and in follow-on formats. Faculty will offer three new advanced Data Science courses in Spring 2017 (Data 100 [CS/Stat C100], Stat 28, and Stat 140) and will incorporate data science into a growing number of courses and programs of study. Teams of faculty are now forming to develop proposals for major and minor programs of study in Data Science, which can be expected to be available to students once they are approved by the Academic Senate.

This report provides a pedagogically oriented overview of the program to date. It gives context for the Berkeley Data Science curriculum in terms of student needs, experiences, and course-taking patterns, so these can be integrated into the university’s ongoing planning. It addresses key lessons drawn from the first year of experience in the Data Science education program, notably:
● The pedagogical success of the entry-level offerings (Data 8 and connectors)
● The value of a broad-ranging, modular, and still integrated program
● The significant effort required to make the program connect across the campus
● The extraordinary opportunity of designing for diversity and inclusion in student populations, interests, and support mechanisms.

Taking this first-year start-up as establishing proof of concept for a significant part of the student body, the curriculum can be refined and extended to larger populations of students, its effects can be followed into their ensuing course-taking, and its integration with programs of study across campus can be moved forward, assuming adequate institutional structure and support are provided.
Berkeley faculty across many disciplines have collaboratively created a model for a comprehensive undergraduate data science curriculum. Starting from the blueprint in the January 2015 report of the Data Sciences Education Rapid Action Team (DSERAT), the curriculum is built around a modular core-and-connections structure that can serve as a platform on which many academic programs can build. The Data Science curriculum was launched at the entry level in 2015-16 with an innovative introductory course and a suite of connector courses that relate to students’ areas of interest, now ranging from neuroscience to civil engineering to demography to ethics. The entry-level courses are designed to provide the base for later classes in a broad range of departments that will be able to leverage and extend what students have learned. The upper tiers of the program are now being developed and will provide additional depth and connect across the campus with major and integrated minor offerings. As previewed in the DSERAT report, the program engages with societal and ethical issues around data science not only in course content, but also throughout the program design, incorporating best practices around diversity, equity, and inclusion so that the curriculum is welcoming to students of many backgrounds and interests.

The curriculum that is now being created aims at a comprehensively integrated program. It responds to faculty members’ experience of seeing their own fields of research and teaching transformed by the cross-cutting possibilities of data science. It also responds to fast-growing student demand for courses in computing, inference, and hands-on work with real data, as reflected in the very large numbers of students enrolling in preexisting courses that cover parts of this material separately. The curriculum aims to integrate a full appreciation of the lifecycle of working with data with the computational and mathematical knowledge that underlies it.
It follows a modular design that allows it both to leverage common teaching of exceptional quality and shared infrastructure in a highly cost-effective manner, and to create tailored offerings designed and “owned” by departments. In staying strongly coupled to student interests and diverse programs’ needs, it must operate flexibly and responsively even as it scales up fast. Through its start-up phase the data science curriculum has been very lightly staffed (~1.0 staff FTE through 2015-16 across several units, providing both technical infrastructure and programmatic support) and temporarily shepherded by the L&S Dean of Undergraduate Studies. In addition to the dean’s resources and a start-up allocation of programmatic, TAS, and capital renovation funding, it has drawn heavily on individual faculty investment, staff commitment, and provision of additional resources by multiple departments and support units. Part of the program has been driven ahead by strong
Berkeley’s data science education program starts at the introductory level, with a 4-unit foundational course, Foundations of Data Science, familiarly Data 8 (CS C8 / Info C8 / Statistics C8), that teaches core computational and inferential concepts while enabling students to work constructively with real data. The course was developed in spring and summer 2015 by a coalition of faculty across multiple disciplines and taught in two offerings in 2015-16 (a pilot in fall 2015, followed by the first regular offering in spring 2016). It has so far been collaboratively taught by Distinguished Teaching Award recipient Professor Ani Adhikari (Statistics) and Professor John DeNero (EECS), recipient of the Diane McEntyre Award for Excellence in Teaching. Other faculty have expressed interest in teaching it in the future.

The Foundations course is built on three interrelated perspectives: inferential thinking, computational thinking, and critical engagement with questions of real-world relevance. As Prof. Adhikari describes the intent of the course, “All students should have access to a course that develops data literacy, so that they can use modern data analysis as an approach to any problem or investigation that they encounter in any discipline.” At the same time, Data 8 students develop a strong conceptual understanding of the mathematical structures underlying statistical thinking and learn ways of thinking that let them work effectively with a modern programming language (Python) and modern data analysis frameworks, starting on the first day of class and continuing through each homework, project, and lab. In addition to teaching critical computing concepts, programming skills, and statistical inference, the course is based on hands-on analysis of a variety of real-world datasets, including economic and spatial data, and it delves into social and legal issues surrounding data analysis, including issues of privacy and data ownership.
From a data analysis perspective, students understand:
● Visualization for understanding and communication (graphs, histograms, bars, scatters, maps)
● Distributions and random sampling (with and without replacement)
● Properties of several statistics (median, mean, max, total variation distance)
● Testing statistical hypotheses
● Estimation, prediction, and assessing predictions and models
● Regression and correlation
● Clustering and classification
● Comparison, causality, and decisions

In the process students gain a solid understanding of classical statistical concepts:
● Probability theory, e.g., complements and multiplication rule, birthday surprise, permutations
● Distributions of data (categorical and numerical) and of probabilities
● Empirical distributions
● Law of averages, Central Limit Theorem
● Sampling variability, standard errors of estimates
● p-values and error probabilities in tests of hypotheses
● Bootstrap, permutation tests, null hypothesis
● Bayes’ Rule and the probability of false positives

Much of this they discover computationally and then codify symbolically. In that process, they master computational concepts:
● Data types and data structures (tuples, lists, arrays, tables)
● Representation, operators, interpretation
● Sequencing, conditionals, iteration, comprehensions
● Use and definition of functional abstractions
● Data parallel programming techniques
● Higher-order functions
● Database operations (select, filter, join)
● Repetition, convergences, searching, sorting
● Testing, debugging, exceptions
● Objects and modules

Data 8 pedagogy is centered on a single powerful data structure that is as natural to use as a spreadsheet, but that allows students to grow from simple manipulation and visualization through to sophisticated statistical techniques used widely today in industry and research. Rather than a traditional exposure to programming that is focused on learning syntax, dealing with files and tools, and working idealized problems, students learn how to construct sound analysis processes in a computational document, starting from acquiring data and proceeding through a series of steps in a modern programming language, to arrive at a meaningful observation. The class syllabus is available at data8.org; the online textbook authored by the instructors is available at inferentialthinking.com.
In Data 8 there is a natural back-and-forth of the mathematical concepts, the computational processes, and the experience in applying them to data. For example, students sample from the null hypothesis repeatedly through computational simulations to form and assess distributions, rather than relying on asymptotic properties. (All of the statistical operators are available as primitives in the computing environment, along with numerical operators, database operators, and visualization methods.)
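The simulation approach described above can be sketched in a few lines of Python. This is an illustrative sketch rather than code from the course. The function name `permutation_test`, the two small data lists, and the choice of the absolute difference in group means as the test statistic are all assumptions made here. It uses only the standard library, not the course's own computing environment.

```python
import random


def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


def permutation_test(group_a, group_b, repetitions=5000, seed=0):
    """Simulate the null hypothesis that group labels do not matter.

    Returns the observed absolute difference in means and the fraction of
    shuffled datasets whose difference is at least as large (an empirical
    p-value).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(repetitions):
        rng.shuffle(pooled)  # reassign labels at random
        simulated = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        # Small tolerance guards against floating-point rounding on ties.
        if simulated >= observed - 1e-9:
            at_least_as_extreme += 1
    return observed, at_least_as_extreme / repetitions


# Hypothetical measurements, invented purely for illustration.
group_a = [12.1, 14.3, 11.8, 15.2, 13.9, 14.8]
group_b = [10.2, 11.1, 9.8, 12.0, 10.9, 11.5]

observed, p_value = permutation_test(group_a, group_b)
print(f"observed difference: {observed:.3f}, empirical p-value: {p_value:.4f}")
```

The empirical distribution of the shuffled statistic stands in for a formula-based sampling distribution, so no asymptotic approximation is needed.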
Enrollments (census data, Cal Answers)

Semester              Course offering              Enrollment
Fall 2015             CS 94 / Stat 94 pilot        109
Spring 2016           CS C8 / Info C8 / Stat C8    447
Fall 2016 (mid-sem)   CS C8 / Info C8 / Stat C8    509

After the Fall 2015 pilot, Data 8 expanded to the largest available classroom in its first regular offering in Spring 2016. In Fall 2016, the Foundations course reached its short-term capacity limit (~500 seats/semester). It will be offered at the same scale in Spring
Data 8 provides a new entry point into lower-division statistics, alongside Stat 2, Stat 20, Stat 21, and courses in other departments. The Statistics Department has strongly recommended Data 8 to all majors that have relied on Stat 2, noting that for many students Data 8 is a better option. The department has likewise encouraged Data 8 plus Stat 88 as an alternative to Stat 20 for majors that require statistics based on calculus. For orientation, combined enrollment in Stat 2, Stat 20, and Stat 21 has been largely stable in recent years at roughly 2,000 students annually. (Significant numbers of Berkeley students choose to take introductory statistics at community college; these students are not included in campus enrollment counts.) Data 8 at its first-year level is roughly half the scale of Berkeley’s other introductory statistics offerings combined.

Data 8 likewise provides a new, large-capacity point of entry for introductory computing, alongside CS 10 (Beauty and Joy of Computing), CS 61A (Structure and Interpretation of Computer Programs, the prerequisite for more advanced CS classes and required for CS and EECS majors), as well as courses in other programs, notably E7 (Introduction to Computer Programming for Scientists and Engineers). Among introductory CS offerings, Data 8 is most directly targeted at providing key capacities for
for students who come in with more background.

Majors
● Students come from a broad range of undergraduate units, including
  ○ College of Letters & Science: Divisions of Arts & Humanities, Biological Sciences, Mathematical & Physical Sciences, Social Sciences, Undergraduate Studies (all 5 divisions)
  ○ College of Engineering, College of Natural Resources, College of Environmental Design, College of Chemistry
  ○ Haas School of Business, School of Public Health, School of Social Welfare
● 56 majors and intended majors are represented in fall 2016.
● The largest majors represented have existing computing or statistics requirements (computer science, economics, psychology, statistics, business administration, and cognitive science). Other majors among the top 20 include public health, molecular and cell biology, environmental economics and policy, political science, mathematics, and media studies.
● More than 80 combinations of majors (double and triple majors) are included, pointing to the multidisciplinary interests of Data 8 students.

The class serves a broad population and does not track students into particular areas of study. Instead it provides a common foundation on which other programs can build, with customization provided by departmentally-designed connectors. In addition to students in technical majors, Prof. DeNero observes, “We’ve had strong engagement from students in literature, history, ethnic studies, areas that aren’t traditionally seen as related to Data Science—since today, studying any of these fields will also involve computing with data.” Prof. Adhikari adds, “Students from non-CS and Stats majors have skills that are very important—they ask different questions of the data. What I learned is that our students had an ability to generalize, and they were able to ask the broader questions in a way that they don’t in a regular introductory stats class.”

Student learning is significant.
Students’ answers to conceptual and analytical questions have impressed instructors and observers in class Q&A, in lab settings, and on exams. Early reports from instructors of subsequent courses (e.g., Stat 134, Concepts of Probability) suggest that Data 8 students can perform at a high level in classes requiring mastery of statistical knowledge. Integration with other programs of study will take additional thought and attention as Data 8 scales. Statistics is being taught in a new way in Data 8, and instructors of follow-on classes in other programs of study should be engaged around their expectations of student preparation and learning. In some cases integration is simple, as in the Department of Economics, which has determined that Data 8 plus Stat 88 meets its needs. In other cases, as more students from different majors take Data 8, more
pedagogical dovetailing will be required, as in programs of study that draw heavily on familiar statistical methodology in their “methods” courses (for instance, in the social, behavioral, and environmental sciences; Psychology and Public Health are two current examples). In addition to discussion among instructors, mechanisms such as connectors and “translation” processes may be helpful. It will also be important to work through modes of integration and sequencing with those programs of study that draw on computing in their requirements, as for statistical computing, simulation, modeling, etc.

Support systems: Students are provided with a strong support network beyond the core staff of faculty, GSIs, and Undergraduate GSIs, including access to lab assistants, supplemental office hours, tutoring by members of student groups, and, as of Fall 2016, a new Data Scholars program for students from underrepresented groups (see below).

Student community: The peer community around Data 8 extends into supporting the next offerings of the course. The passage from Data 8 student to tutor to lab assistant to UGSI is strongly mentored by course instructors and is modeled on the pipeline approach used in EECS to scale large CS classes (mostly 8-hour-a-week appointments to fit with students’ demanding programs). There is substantial engagement by previous students in developing the Data 8 technical infrastructure. Students have made a video about their experience: https://www.youtube.com/watch?v=D5W7Zu15WjA

Student interest in Data 8 seems to be as much a viral phenomenon (word of mouth and social media) as the outcome of official circulation through formal university channels.
Student survey responses (Spring 2016)
● 85% of students said they were happy or very happy about their decision to take the course (4 or 5 on a 5-point scale, instructor survey)
● 77% of students said they learned a lot in terms of skills and ideas by taking the course (4 or 5 on a 5-point scale, instructor survey)
● 84% of students formally enrolled in the CS offering of the class, when asked how worthwhile this course was compared to others they’d taken, rated the course either a 6 or 7 on a 7-point scale (Eta Kappa Nu student survey)

Some student reactions
● “One of the things I most enjoy about data science is the diversity—my classmates range from English majors to bio majors to computer science majors—all looking at data from our different perspectives.”
● “This class puts theory into practice. I was able to use data to tell powerful visual stories about the struggles I experienced growing up in southeast LA.”
● “Out of all the classes I’ve taken, this class gave me the most practical knowledge. I’m applying it in my internship at Google already.”
model in Spring 2015. The “combined package,” in the words of another group of peers, “will soon become a model for the rest of the world.”^2
Roughly 50% of Data 8 students have chosen to take a connector so far. An overview of enrollments is given in the table below; more details are given in an appendix.

Semester              Connectors     Enrollment
Fall 2015             6     6        59     54%
Spring 2016           11    6        217    49%
Fall 2016 (mid-sem)   10    5        277    54%

Course construction
● Connectors are mostly numbered 88, though this is not a required designation.
● Multiple departments have had connector courses approved by COCI. An “incubator” function is provided by L&S 88 for first-time offerings.
● Connectors (e.g., Statistics 88) can be made part of a set of options or a required sequence in one or more programs of study.
● Connectors can offer a more focused or smaller learning setting for students. In the initial stage, some pilot connectors have been small as the student pool grows and instructors gain experience. It should be anticipated that some connectors will remain small to medium size, while others will need to become quite large.
● Connectors can have prerequisites (e.g., calculus) as appropriate.
● So far, connectors have been offered as 2-unit courses. Observation suggests that connectors of 3 (or 4) units may also be important (see below).

Connector offerings
● Connectors can be coordinated with the Data 8 syllabus in a variety of ways, as makes sense for different fields of study.
● Connector instructors typically draw on assistance from previous Data 8 students in designing exercises and supporting lab instruction.
● Connector instructors have been drawn so far from ladder faculty, visitors, lecturers, and postdoctoral fellows.
● Connector instructors have access to the standardized computing environment, lab space, and student support available to the Foundations class.
● The program offers several modes of support to pilot a connector. Some seed funding is available to stand up a course before it is regularized as part of regular departmental curriculum planning and TAS budgeting.
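The percentages in the connector enrollment table are consistent with connector enrollment divided by Data 8 enrollment for the same semester. That reading is an assumption made here, not something the report states explicitly. A minimal Python sketch of the arithmetic, using the figures from the two enrollment tables (the variable and function names are illustrative):

```python
# Enrollment figures copied from the report's Data 8 and connector tables.
data8_enrollment = {
    "Fall 2015": 109,
    "Spring 2016": 447,
    "Fall 2016 (mid-sem)": 509,
}
connector_enrollment = {
    "Fall 2015": 59,
    "Spring 2016": 217,
    "Fall 2016 (mid-sem)": 277,
}


def connector_share(semester):
    """Connector enrollment as a percentage of Data 8 enrollment."""
    return 100 * connector_enrollment[semester] / data8_enrollment[semester]


# Round to whole percentages, as in the table.
shares = {semester: round(connector_share(semester)) for semester in data8_enrollment}

for semester, share in shares.items():
    print(f"{semester}: {share}%")
```

Because this is a ratio of enrollments, a student taking two connectors in one semester would be counted twice, so the figure only approximates the share of students who took a connector.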
Many students are excited about connectors, some of them exceptionally so, making strong statements that they find them the most inspiring part of the program.^3 Some students come back for multiple classes in a broad range of subjects. They see applications and connections as an essential element in a data science curriculum and connector classes as a way to explore new areas. Combined with the high level of enrollment, qualitative feedback has been a significant confirmation as the program has gotten off the ground.

There are open questions about how to appropriately shape connectors a) for this student audience, b) in connection with the Foundations course. The present cohort of connector instructors has been working through pedagogical questions about course content and approach, including:
● creating course goals and exercises appropriate for entry-level students
● providing assistance with programming challenges
● deciding how to align with material covered in the Foundations course
● managing domain-specific customization of Data 8’s pedagogical approach

Unit value of connectors is a key open question. While there continues to be a lot of enthusiasm for 2-unit connectors, some student and faculty input suggests that they can be over-packed with content and take preparation time out of proportion to their unit value. Some departments also report that 2-unit courses fit awkwardly into faculty teaching expectations or do not integrate with breadth requirements. For some programs of study, it will be valuable to try out 3-unit (or possibly 4-unit) connectors (in the form of new classes or redesign of current classes) and see how they can be integrated into student pathways.

Faculty engagement is critical. Because the Foundations course is a novel way to approach teaching data analysis, many connector instructors look for additional preparation to get up to speed. Some sit in on Data 8 or take up self-study of Data 8 materials.
As described in the appendix, the faculty short course on data science pedagogy and practice, which offered a 30-hour program of instruction and lab work in early summer 2016, was broadly welcomed and highly valued by participants.

Coordination of connectors takes significant work. Even with standardization on a common platform, support for connector offerings takes a lot of coordination. Compared to the rest of the entry-level data science curriculum, it is considerably more staff-

(^3) Because classes are sponsored by multiple departments, it is not simple to collect and standardize course evaluations. The assessments given here come from qualitative interviews and surveys done by the program and most connector instructors in Spring 2016.
The Jupyter notebook environment is critical to the entry-level data science curriculum. Generous assistance from the Berkeley-based Jupyter team (partially housed in BIDS) has been essential to the program. In moving quickly to scale, the JupyterHub infrastructure that supports the curriculum has attracted significant interest outside of Berkeley and has immediately pressed up against the limits of the technology now available. The Tables abstraction (http://data8.org/datascience/tables.html) developed specifically for Data 8 provides a simple, pedagogically centered “dataframe” abstraction that allows students to transition easily from spreadsheets to a full programming environment. It integrates fully with the Jupyter environment and provides a stepping stone to complex dataframe environments, such as R and Pandas. The cloud infrastructure that supports the program is being actively developed by a cooperative team of students, staff, faculty, and volunteers across multiple units, in the open source community, and among industry collaborators. At this pilot stage, industry contributions equivalent to several hundred thousand dollars have been essential to allowing the program to grow. Approaches used in the data science instructional labs may be adapted for computation-intensive teaching in other areas of campus. The program’s cloud-based infrastructure allows instructional labs to be built without desktop computers and associated overhead. To provide computing capacity for students without laptops, a semester-long laptop loan program has been piloted with the Library with generous donations from supporters. Inexpensive renovations in three instructional lab spaces in Summer 2016 have provided full-time space for Data 8 and connector labs and office hours at the current level of enrollment. Layout is 30 seats each, clustered into groups around shared table space. 
In addition to dedicated instructional labs, students have created spaces for collaborative work by making extensive use of overflow space in BIDS, the Library Data Lab, and D-Lab. If other space can be provided for office hours, each instructional lab can support 375 students enrolled in Data 8 and a proportional number in connectors spaced over the week. This space allocation will need to be increased no later than 2017-18 for the curriculum to grow.
4.2. Short course for faculty on Data Science pedagogy

Faculty have asked for support learning Data 8 material and incorporating it into their teaching and practice. Beyond encouraging them to audit Data 8 (as several have each semester), the Data Science program has experimented with several mechanisms to satisfy faculty requests. In June 2016, 35 faculty and instructors from a broad range of disciplines devoted a week of summer to a 30-hour course on Pedagogy and Practice of Data Science. (An additional 35 registered but could not be served at this time due to capacity and scheduling constraints.) Participants included faculty from a range of departments across campus, including American Cultures, Demography, Economics, Haas School of Business, History, Linguistics, Math, Near Eastern Studies, Neuroscience, Optometry, Physics, Political Science, Rhetoric, and Sociology.

The faculty short course, co-taught by Foundations of Data Science instructors Adhikari and DeNero, covered the key teaching methodologies for Berkeley’s data science education program and its new way of thinking statistically, and gave participants hands-on experience programming in Python using the Jupyter notebook environment. The course also offered panel discussions with connector faculty and a group of Data 8 and connector students. Participant satisfaction in a follow-up survey was high. The material in the short course will be offered to faculty again as soon as planning and teaching capacity allows.
Selected quotes from survey respondents:
● “I loved the integration of the statistical concepts and the computing; seeing really is believing, and I think this will motivate students to study statistics in a more rigorous way as well (especially those who are more reluctant to engage the math).”
● “I definitely want to work on developing a ‘module’ for my Ethnic Studies courses to both animate the existing course material from another dimension, and also to bring into our students' lives a taste of other methodological techniques which can complement our field of study.”
● “Seeing how you guys ‘un-AP’ the students was great. I found the way in which the early lectures simultaneously motivated fundamental basic stats concepts and learning to write code illuminating. As a relatively new member of the Berkeley community, I found the interactions with instructors from all over campus incredibly beneficial.”
● “I learned a lot both from the class itself and the other faculty in the room. We as a campus should do more of this. How many other initiatives are out there that we could all build from?”