<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://voyteklab.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://voyteklab.com/" rel="alternate" type="text/html" /><updated>2026-03-26T15:42:13+00:00</updated><id>https://voyteklab.com/feed.xml</id><title type="html">VOYTEKlab</title><subtitle>VOYTEKlab</subtitle><entry><title type="html">Lab management</title><link href="https://voyteklab.com/management/git/publications/lab-management/" rel="alternate" type="text/html" title="Lab management" /><published>2024-05-23T00:00:00+00:00</published><updated>2024-05-23T00:00:00+00:00</updated><id>https://voyteklab.com/management/git/publications/lab-management</id><content type="html" xml:base="https://voyteklab.com/management/git/publications/lab-management/"><![CDATA[<p><em>May 23, 2024 by Bradley Voytek</em></p>

<p>Several months ago there was an email thread going around among some neuroscience professors asking for advice on lab management software to “help us get more organized and productive.” Quite a few people reached out to me after I wrote my response to tell me how helpful it was. So I figured it would be nice to share it more broadly. Below is an edited version of my response to that email chain regarding my lab’s processes for using GitHub to manage open-source projects, scientific research, administrative tasks, and teaching.</p>

<hr />

<p>In my lab we manage a lot of our work in GitHub. While traditionally thought of as a platform for hosting and maintaining computer code in a shareable manner, GitHub is agnostic to the kinds of files it will track. This means that you can use it for a <em>lot more</em> than just software development. Over the years we’ve become more reliant not only on its file hosting and tracking capabilities, but also its built-in project management features.</p>

<p>We have a main org – <a href="https://github.com/voytekresearch/">voytekresearch</a> – and within that org there are many repositories, one for each project. This means that every ongoing scientific research project has its own repo so that we can collaborate on the data analyses. This is separate from the independent orgs that we run for our open-source Python packages: <a href="https://github.com/fooof-tools/fooof">specparam</a>, <a href="https://github.com/bycycle-tools/bycycle">bycycle</a>, and <a href="https://github.com/neurodsp-tools/neurodsp">neurodsp</a>.</p>

<p>In addition to our lab’s project-centric repos, we have various “helper” repos, like a <a href="https://github.com/voytekresearch/VoytekLab">main repo</a> that collates basic information for the lab, a <a href="https://github.com/voytekresearch/Resources">resources repo</a> that has some helpful links for lab processes, and a <a href="https://github.com/voytekresearch/spr">tutorials repo</a> that has interactive notebooks showing how to use the lab’s data analysis tools.</p>

<p>We also have an <a href="https://github.com/voytekresearch/Resources/wiki/Onboarding">onboarding page</a> that states exactly what everyone who joins the lab needs to do: how to get a key to get into the lab, how to join the lab’s Slack, links to tutorials and human subjects training, etc. This page also includes basic instructions for writing papers and making posters, writing code, cloud computing, and so on.</p>

<p>That page also includes a link to the lab’s <a href="https://docs.google.com/document/d/1FSnPuniOpfscQxV5z012j4akxmKhzP3r70HvKBtB5FE/edit">Google Docs paper template</a> that I made to help folks overcome the “staring at a blank white screen” problem that can make writing a paper for peer-review feel overwhelming. This template is organized like a standard manuscript, broken into sections: Abstract, Introduction, Methods, Results, Discussion. These are then broken into subsections, each of which is labeled to describe what should be written there. This turns the task of writing a paper into a series of smaller, fill-in-the-blanks tasks, which are much more manageable.</p>

<p>Each project repo, like <a href="https://github.com/voytekresearch/smith_ect">this one</a> from PhD student Sydney Smith, allows anyone to step through all the code to run the analyses and create the figures for that project. Some journals, in particular <em>eLife</em>, even allow you to create an “executable code” version of a paper, like <a href="https://elifesciences.org/articles/61277/executable">this one</a> from former lab PhD student Richard Gao.</p>

<p>Within each repo, you can open Issues, which are a place to document things to look into / problems to fix. An Issues page looks like this:</p>

<p><img class="aligncenter size-large" src="/assets/images/posts/management1.jpg" alt="Voyteklab GitHub" /></p>

<p>Issues are helpful for keeping track of things you want to change and improve, for commenting back and forth about how to implement those changes, and for assigning specific tasks to specific individuals.</p>

<p>You can also create “Projects” within any given repo, which are used for project management. I’m a fan of the <a href="https://en.wikipedia.org/wiki/Kanban_board">Kanban board</a> style, which looks like this:</p>

<p><img class="aligncenter size-large" src="/assets/images/posts/management2.jpg" alt="Voyteklab GitHub project management" /></p>

<p>Here, we have columns for “to-do”, “in progress”, and “done”. When someone starts tackling one of the “to-do” cards, they drag it to the “in progress” column. Cards can also be assigned to specific individuals, and they can be commented on. Once they’re in the “done” column, they can be reviewed. The Kanban board approach requires a fair amount of planning up front in order to break a huge project down into individual, manageable components. But just like the paper template above, this act of planning ahead and breaking a huge task into smaller pieces makes everything much easier and less overwhelming.</p>

<p>To keep track of everything everyone is doing, we also have a private repo, which looks like this:</p>

<p><img class="aligncenter size-large" src="/assets/images/posts/management3.jpg" alt="Voyteklab GitHub individuals' repos" /></p>

<p>Here, each lab member maintains their own page that is broken up into sections for each project they’re working on, any grants / awards they’re applying to, etc. This includes links to individual project repos, links to google docs for paper drafts and notes, any posters they’ve presented on the research, stuff like that.</p>

<p>I also maintain one for myself, which is one of four tabs that are always open in my browser along with my emails and calendar. For my repo, I break it up into sections for notes for myself to keep track of student rotations and lab member vacations and internships, links for courses I teach, links to ongoing funding and prospective funding applications, links to google docs for grant and manuscript drafts, reminder notes for myself on who to contact for admin stuff, etc. It looks like this:</p>

<p><img class="aligncenter size-large" src="/assets/images/posts/management4.jpg" alt="Voyteklab GitHub Brad's repo" /></p>

<p>For the courses I teach, I’ve even gone so far as to keep templates within each course for the emails to send to students and TAs at specific points during the quarter. This ensures that everyone gets the right materials and instructions at the right times. It also makes it super easy for me to hand off my courses to new folks, and to let those new teachers fold in their improvements so that the next time I teach the class it’s easier for me to get reacquainted with it, and it’s slightly improved with every iteration.</p>

<p><img class="aligncenter size-large" src="/assets/images/posts/management5.jpg" alt="Voyteklab GitHub class management" /></p>

<p>This whole process works really, really well for our needs, and I hope it helps other folks out, too!</p>]]></content><author><name></name></author><category term="management" /><category term="git" /><category term="publications" /><category term="May 2024" /><summary type="html"><![CDATA[May 23, 2024 by Bradley Voytek]]></summary></entry><entry><title type="html">2024 Berkeley Cognitive Science Commencement</title><link href="https://voyteklab.com/public/berkeley-commencement/" rel="alternate" type="text/html" title="2024 Berkeley Cognitive Science Commencement" /><published>2024-05-12T00:00:00+00:00</published><updated>2024-05-12T00:00:00+00:00</updated><id>https://voyteklab.com/public/berkeley-commencement</id><content type="html" xml:base="https://voyteklab.com/public/berkeley-commencement/"><![CDATA[<p><em>May 13, 2024 by Bradley Voytek</em></p>

<p>This weekend I had the honor of giving the keynote address at the UC Berkeley Cognitive Science undergraduate commencement ceremony. There were approximately 150 students, along with their family and friends, in attendance at Zellerbach Hall. Fourteen years ago, in the same location, I was awarded my PhD in neuroscience. And, just as then, my PhD advisor, friend, and mentor, <a href="https://en.wikipedia.org/wiki/Robert_T._Knight">Bob Knight</a> was on stage with me. It was a great homecoming.</p>

<p>I’ve given a lot of talks and taught a lot of classes over the years. I’m normally a semi-extemporaneous speaker, meaning I have a clear vision of what I want to say / teach, but I come up with the exact wording on the fly.</p>

<p>This is the first time I’ve ever written out the script of what I was going to say. For those who are interested, below is the full text of my address (minus some improvising along the way).</p>

<p>Finally, I want to thank all of the students and family who came up to speak with me afterward! It was a genuine pleasure to meet so many people who are beginning their exciting careers. And a <em>huge</em> thank you to <a href="https://en.wikipedia.org/wiki/Robert_J._Glushko">Bob Glushko</a> for nominating me as speaker, <a href="https://vcresearch.berkeley.edu/faculty/david-whitney">David Whitney</a> for doing such an amazing job as master of ceremonies, and <a href="https://lclab.berkeley.edu/regier/">Terry Regier</a> for supporting the students.</p>

<hr />

<p>Congratulations, class of 2024! Cognitive Science is different at Berkeley compared to UC San Diego: you don’t have a Cognitive Science department, but rather a distributed program across neuroscience, linguistics, data science, design, computer science, statistics, and so on. That poses unique challenges for you, I’m sure, because it’s such a broad and diverse program.</p>

<p>Of course the biggest challenge is explaining to your parents exactly what Cognitive Science is…</p>

<p>But there is strength in that intellectual diversity! My first real introduction to Cognitive Science was here at Berkeley when I was working on my PhD in neuroscience. I was teaching the MCB 163 neuroanatomy lab. I was initially in WAY over my head – MCB 163 was mostly full of premeds who thought it was funny when the non-biology guy made ridiculous mistakes about basic human physiology. Like the time I confidently proclaimed that humans have two livers… apparently that’s incorrect.</p>

<p>That class also had a lot of students from the CSSA - the Cognitive Science Student Association. As I got to know them better, they began inviting me to their meetings to talk about my research and to ask about general life and academic advice. Which was funny because: A) I wasn’t a Cognitive Scientist, and, B) I for SURE was not a good person to ask academic advice of.</p>

<p>As an undergrad, I began as a physics major… I wanted to study cosmology and astrophysics. But I don’t come from a college family. I was a “whoopsie” baby born to teenaged parents. My dad worked his butt off when I was a kid to make sure I could be where I needed to be to thrive. So the fact that his son wanted to be an astrophysicist was something he delighted in. He loves to tell a story about how, when we were on a road trip between San Diego and Phoenix where I bounced around as a kid, he asked me what I wanted to be. When I told him “astrophysicist” he laughed because he didn’t know what that was.</p>

<p>Even though I don’t come from an “academic” family, I was a great student in high school. So much so that I got accepted to the University of Southern California directly out of my junior year of high school to study astrophysics. I was elated. As part of my scholarship package – USC is an expensive private university outside my family’s reach! – I received work study as a research assistant in a lab studying Bose-Einstein condensates. Don’t worry about what that is… if you don’t know it’s okay. I didn’t either and, well, YOU’RE not going to fail out of university for not knowing.</p>

<p>I DID fail out though because WOW was I terrible at physics. I mean really, truly, genuinely terrible. When I went to register for classes at the end of my second year, I was told I’d been on academic probation for too long and was no longer a student there!</p>

<p>But I wasn’t ready to give up. I still loved science, but I had to reassess myself. I basically forced my way back into university: talking to everyone I could and explaining my situation (life was complicated at the time), taking summer classes at community college to prove I could get my grades up, and finding people who were willing to listen and help.</p>

<p>This was back in 2000 or so, and several of my closest friends were older than me and had graduated with degrees in computer science. Straight out of undergrad, they landed jobs that paid more than I could have ever imagined.</p>

<p>When I reenrolled, I knew I had to give up physics. I couldn’t switch to computer science, because I wouldn’t be able to graduate within four years. So I switched to Psychology, which I could still complete within the four years, thus minimizing my debt (because by this point I’d lost my scholarship). Psychology was a flexible major, so for my electives I took some programming classes to help fill out my resume and give myself some marketable skills. But, for fun I took courses like Philosophy of Mind, intro to AI programming (in LISP!), sensation and perception… it turns out I accidentally made a Cognitive Science curriculum for myself without even knowing what Cognitive Science was!</p>

<p>I also got another work study job, this time in a neuroscience lab. This lab had decades’ worth of data that had been collected. A lot of the data were surveys that had been digitized as text files. One of the first jobs I was given there was wild: I was supposed to open up each file, copy the data out of it, and paste them into an Excel spreadsheet so that the postdoc could analyze the results. Based on how many files there were, they estimated it would take two weeks or so.</p>

<p>I HATE this kind of repetitive task and I’m always looking for ways to get the work done without having it be so boring. So I flexed my super crude C++ skills and wrote a quick regex script to pull the data into a CSV. I showed up the next day and was like “here you go”. And it was like I performed some kind of magic trick. They didn’t believe me. So I explained to them how “I wrote some code,” and they just didn’t know what that was.</p>

<p>Magic!</p>
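<p>The gist of that trick can be sketched in a few lines of Python (the original was a quick C++ script, and the survey file format below is a hypothetical stand-in, not the lab’s actual data):</p>

```python
# Sketch of the idea described above: pull "Field: value" pairs out of
# plain-text survey files and write them out as CSV. The file format here
# is a made-up example -- the original script was crude C++.
import csv
import io
import re

SAMPLE = """Subject: 101
Q1: 4
Q2: 7
"""

def survey_to_row(text):
    """Extract 'Field: value' pairs from one survey text file."""
    return dict(re.findall(r"^(\w+):\s*(\S+)", text, flags=re.MULTILINE))

row = survey_to_row(SAMPLE)
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=row.keys())
writer.writeheader()
writer.writerow(row)
print(out.getvalue())
```

<p>In a real run you would loop this over every file in the directory and append each row to a single CSV, turning a two-week copy-paste job into a few seconds of compute.</p>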

<p>And thus, I became the “tech guy” in the lab, and began digitizing a lot of their processes. I used my knowledge of web development, which I learned from my girlfriend at the time (and wife now!), and coding to help with neuroscience experimental design and data analysis. I realized that this diverse skillset was a huge strength and that, although I wasn’t as good a programmer as my computer science friends or as good a researcher as my neuroscience friends, I was a better programmer than the neuroscientists and a better neuroscientist than the programmers!</p>

<p>It’s this diverse and distributed nature of Cognitive Science that is its strength. And what an amazing time to be a Cognitive Scientist! The last few years have seen an explosion in deep learning, data science, and AI, which all have their roots in Cognitive Science (many of them at Berkeley and UC San Diego!).</p>

<p>And just like how my crude C++ regex skills seemed like magic to my boss back in 2001, the technologies that are coming out now, that have their roots in Cognitive Science, seem like magic to the rest of the world.</p>

<p>If I’m allowed to revel in my sci-fi and comic book roots for a minute, I’ll quote Arthur C Clarke’s third law: “Any sufficiently advanced technology is indistinguishable from magic.”</p>

<p>And I’ll expand on that and say to all of the parents and guardians here, I’m happy to give you the real explanation of what Cognitive Science is: your students just completed their degrees in magic.</p>

<p>Do we have a sorting hat or how does this work?</p>

<p>It’s silly I know, but I genuinely believe in this idea that we can create technologies that seem like magic. But for me, this metaphor isn’t restricted just to creating technologies. It also applies more broadly to the act of creating new ideas and, with these ideas, shaping our world and ourselves.</p>

<p>After I graduated, I knew I wanted to become a neuroscientist, but I also knew that my GPA was too awful for that to be a realistic shot. So I spent a lot of time talking with my girlfriend (at the time, my wife now) and my closest friends, trying to figure out how to turn my wishful, magical thinking into reality. The first part of the magical spell was to figure out a way to prove that my undergraduate GPA wasn’t an accurate reflection of who I am. Because of my programming skills, I was able to land a job as a full-time researcher running the Positron Emission Tomography scanner at UCLA working for Dr. Edythe London.</p>

<p>This job was awesome. For the most part. What wasn’t exactly made clear to me during the interview was that the radiotracer we used to study brain activity, F18 FDG, though very short lived, is cleared through the bladder. And because we want to minimize the effective radioactive dose to all the participants in our studies, we had to have them use the restroom immediately after the scanning was complete. Buuuut, because some people are… messier… than others, part of my job was to crawl around the bathroom floor with a Geiger counter looking for little radioactive pee pee spots to clean up.</p>

<p>Science!</p>

<p>But I made a name for myself at UCLA’s Brain Mapping Center. Because PET scanning works by injecting a radioactive substance, it requires a nuclear medicine physician to perform the injections. But something like 4–6 months into the job, I was waiting at the scanner with a participant and the rest of the staff, and our physician didn’t show. Finally I went back to my boss’s office; apparently the guy had left a message on her voicemail saying he’d moved back to New York and didn’t want to work there anymore.</p>

<p>So all of these scientists with huge amounts of grant funding were stuck, and I was one of the only people who knew how to run the scanner. I learned that it isn’t only nuclear medicine physicians who can perform the injections: credentialed nuclear medicine technicians, non-physicians, can do so as well. So I called all around LA to find techs who could do this. I proved that I could solve problems. That I could publish research and train others. And in 2004 I was accepted into Berkeley’s new neuroscience PhD program.</p>

<p>When I began my PhD I remember looking at the CVs of researchers whom I admired and seeing page after page after page of amazing awards and scientific achievements. It was daunting. During my very first year here, I even asked my PhD Advisor, and now friend and mentor, Bob Knight, how I got into Berkeley when no one else even gave me an interview (including my home department of UC San Diego Cogsci!). He said – and I’m editing his wonderful bluntness here a bit – “it was obvious you were a… screw up… but a screw up with potential.”</p>

<p>A few years into my PhD, during one of those career discussions with the CSSA students here, someone asked me something like “how is it you’ve been so successful in your research?”</p>

<p>I was like, ARE YOU KIDDING ME?!</p>

<p>But it hit me: hidden behind the plethora of successful outcomes on those CVs I admired were an even greater number of failures. The CSSA students here didn’t know that I was kicked out of undergrad, or that I was immediately placed on academic probation – again! – my first semester as a PhD student here. They didn’t know that Prof. Rich Ivry almost failed me in my qualifying exams. Because we keep all of those things hidden.</p>

<p>So for the last 15 years or so I’ve kept a section at the end of my CV titled “Rejections and Failures”. I list every grant or fellowship I did not receive, and how many times each paper I have published was initially rejected by journal editors.</p>

<p>This was perhaps my most powerful magic spell yet. This act of revealing my otherwise invisible failures has brought into my world some of the most amazing scientists who saw this and said, “I want to work with that guy.”</p>

<p>That’s magic! Taking what was once a real source of shame for me and turning it into a source of strength. It was like a summoning spell for kind, brilliant people who may not have had the most traditional scientific paths. Or, as Bob Knight might call us… “screw ups with potential”. This spell was so powerful that, years later I was even told in private by a very well known senior scientist: “part of the reason the committee decided to give you this award was because we didn’t want our name associated with the failures section of your CV!” I’m still not entirely sure if she was joking or not.</p>

<p>I’ve caught a ton of breaks to help me get to where I am today, and had a lot of excellent mentorship. A lot of it has felt like luck, but I’d be lying if I said it was JUST luck. Because a lot of it was a willful application of my probably naively optimistic view that we can both be good people AND good scientists. That we can push at the boundaries of technology until what we create seems magical, and that we can also summon amazing people to work alongside us to bring about cultural change in science and technology.</p>

<p>So yes “any sufficiently advanced technology is indistinguishable from magic.” I believe that. But I also believe that, “any sufficiently thoughtful application of optimism is indistinguishable from magic.” I have no doubt that your quirky, Berkeley education here in Cognitive Science has armed you with the diversity of skills to create the technologies of the future that will truly feel magical.</p>

<p>And all I can hope is that some of what I’ve said today has resonated with you, and  that the stories I’ve told  help focus your skills not just toward building amazing magical technologies, but also in building a more optimistic, magical version of yourselves.</p>]]></content><author><name></name></author><category term="public" /><category term="May 2024" /><summary type="html"><![CDATA[May 13, 2024 by Bradley Voytek]]></summary></entry><entry><title type="html">Teaching Neural Data Science</title><link href="https://voyteklab.com/data%20science/neural-data-science/" rel="alternate" type="text/html" title="Teaching Neural Data Science" /><published>2021-04-06T00:00:00+00:00</published><updated>2021-04-06T00:00:00+00:00</updated><id>https://voyteklab.com/data%20science/neural-data-science</id><content type="html" xml:base="https://voyteklab.com/data%20science/neural-data-science/"><![CDATA[<p><em>April 6, 2021 by Bradley Voytek</em></p>

<p>A few weeks ago marked the end of my first run of our new Neural Data Science course here at UC San Diego. This course was an experiment to see how far our undergraduate students could push at the boundaries of neuroscience, armed with a ton of publicly available data and new ways of thinking about how we can approach research.</p>

<p>In short: they absolutely exceeded my expectations—especially during the pandemic—and I’m pretty sure <em>at least</em> one peer-reviewed paper will come out of the research they did.</p>

<p>While I’m wildly excited to talk about those projects, I’m getting ahead of myself. I’ll post about the students’ projects over the next week, but first I want to begin with what I think <strong>Neural Data Science</strong> <em>is</em>.</p>

<p>For years now I’ve been arguing that Data Science is more than just (Computer Science) + (Statistics). I truly believe Data Science is something different. I’ve been slowly formalizing my thoughts on the matter, starting with two articles: <a href="https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005037"><em>The Virtuous Cycle of a Data Ecosystem</em> (<em>PLOS Computational Biology</em>, 2016)</a> and <a href="https://www.sciencedirect.com/science/article/pii/S0896627317310681"><em>Social Media, Open Science, and Data Science Are Inextricably Linked</em> (<em>Neuron</em>, 2017)</a>.</p>

<p>I summarize <strong>Neural Data Science</strong> with this:</p>

<p><img class="aligncenter size-large" src="/assets/images/posts/NDS_H1.jpg" alt="Neural Data Science" /></p>

<p>We know a <em>lot</em> about the brain. For any given arbitrary brain region, for example, there are thousands of studies describing its architecture: the thickness of the cortex, what its inputs and outputs are, myelination, cellular density, etc. We know about the cell types there and their morphology; as well as how strongly the 20,000 or so different genes are expressed in that region; what the receptor densities are; what neuro-transmitters / -modulators are present; and so on. We also have information about the electrophysiology of the cells in that region, as well as the macroscale field potential properties including aperiodic activity, oscillation frequency and waveform shape, and so on. Finally, we also have decades of human and animal studies hinting at what cognitive functions are associated with that region, how different diseases and disorders affect that region, and how all of these things change with development and aging.</p>

<p>All of this information, however, is spread across tons of different datasets, papers, etc. Really amazing early data science projects, like <a href="https://talyarkoni.org/">Tal Yarkoni’s</a> <a href="https://neurosynth.org/">NeuroSynth</a>, pioneered mining these datasets to bring them together.</p>

<p>One of the goals of my group is to try and create methods and tools to allow people to <strong>more easily bring all of this information together</strong>. What if, instead of limiting ourselves to analyzing our one limited, biased set of data, we can also bolster our research with the <em>huge</em> amounts of other data that already exist?</p>

<p>This ethos can be seen in our recent paper, led by former PhD student <a href="https://www.rdgao.com/">Richard Gao</a>, <a href="https://elifesciences.org/articles/61277"><em>Neuronal timescales are functionally dynamic and shaped by cortical microarchitecture</em> (<em>eLife</em>, 2020)</a> where we combined theory and simulation with several open datasets, including:</p>

<ol>
  <li>Human MRI data</li>
  <li>Several large datasets of human intracranial electrophysiology</li>
  <li>Human gene expression</li>
  <li>Non-human primate single-unit spiking</li>
</ol>

<p>to show that we can infer population neuronal timescales from field potential / intracranial EEG data, and link the cortical topography of those timescales to brain structure and gene expression.</p>

<p>So when <a href="https://ashleyjuavinett.com/">Ashley Juavinett</a> and I started to brainstorm this course, we wanted to know if we could teach this kind of stuff to <em>undergraduates</em>. Now, certainly we’re not the <em>only</em> ones thinking about neural data science (<a href="https://pensees.pascallisch.net/">Pascal Wallisch</a> wrote a <a href="https://www.amazon.com/Neural-Data-Science-MATLAB%C2%AE-PythonTM-ebook/dp/B06XCW39WX">great book</a> about his take!). Ours is simply a different perspective.</p>

<p>What if we could teach the next generation of neuroscientists to think from a “data first” perspective: what data would you need to answer your scientific questions of interest, without limitations on having to collect every piece of data yourself?</p>

<p>Here’s how I frame things in the course syllabus:</p>

<blockquote>
  <p>Neuroscience is a rapidly changing field that is increasingly moving towards ever larger and more diverse datasets that are analyzed using increasingly sophisticated computational and statistical methods. There is a strong need for neuroscientists who can think deeply about problems that incorporate information from a wide array of domains including psychology and behavior, cognitive science, genomics, pharmacology and chemistry, biophysics, statistics, and AI/ML. With its focus on combining many large, multidimensional, heterogeneous datasets to answer questions and solve problems, data science provides a framework for achieving this goal.</p>
</blockquote>

<blockquote>
  <p>Determining what data one needs, and how to effectively combine datasets, is a creative process. For example, a neural data scientist might be tasked with combining:
1) <em>demographic information</em> and 2) <em>multiple cognitive and behavioral measures</em>, from people from whom we might collect;
3) <em>biometric data</em>, 4) <em>motion capture data</em> to understand motor control, and 5) <em>eye-tracking</em> to study attention, along with;
6) <em>structural connectomic</em> and 7) <em>functional brain imaging</em> data collected using methods with different spatial and temporal resolution (such as fMRI and EEG), and then place those results into context relative to;
8) <em>average human brain gene expression patterns</em> and 9) <em>the existing knowledge embedded within the peer-reviewed neuroscience literature</em> (&gt;3,000,000 papers).</p>
</blockquote>

<blockquote>
  <p>These types of data are very different: continuous and ordinal, time-series, video and images, directed graphs, spatial, high-dimensional categorical / nominal, and unstructured natural language. What is the appropriate way to aggregate and synthesize these data? What are the benefits and caveats for, say, aggregating spatially versus temporally? Being able to conceptualize how to carry out this integration is necessary before leveraging any technical skills will even be useful.</p>
</blockquote>
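<p>To make the integration question above concrete, here is a deliberately tiny, stdlib-only sketch (all subject IDs, fields, and values are made up) of the very first step such a pipeline needs: joining heterogeneous per-subject measures on a shared subject ID.</p>

```python
# Toy sketch with made-up data: inner-join heterogeneous per-subject
# measures on a shared subject ID, the first step before any real
# aggregation or modeling can happen.
demographics = {"s01": {"age": 21}, "s02": {"age": 34}}
behavior = {"s01": {"rt_ms": 412.0}, "s02": {"rt_ms": 388.5}, "s03": {"rt_ms": 501.2}}

def merge_on_subject(*datasets):
    """Keep only subjects present in every dataset and combine their fields."""
    shared = set(datasets[0]).intersection(*datasets[1:])
    return {sid: {key: val for d in datasets for key, val in d[sid].items()}
            for sid in sorted(shared)}

merged = merge_on_subject(demographics, behavior)
print(merged)  # s03 drops out: there is no demographic record for them
```

<p>Even this toy raises the real conceptual questions: should s03 drop out (an inner join) or stay with missing values (an outer join)? And at what grain should time-series or imaging data be aggregated before a per-subject join even makes sense?</p>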

<p><img class="aligncenter size-large" src="/assets/images/posts/NDS_H2.jpg" alt="Neural Data Science" /></p>

<p>This focus in Data Science on creativity and integrating <strong>large, multidimensional, heterogeneous datasets</strong> is something <a href="https://tomdonoghue.github.io/">Tom Donoghue</a>, <a href="http://www.shanellis.com/">Shannon Ellis</a>, and I really coalesced around in the <a href="https://github.com/COGS108"><em>Data Science in Practice</em></a> course we created here at UC San Diego (a course that I’ve talked about <a href="https://voyteklab.com/data%20science/teach-data/">on here before</a>). We learned a lot from putting that course together, and wrote about that in our article, <a href="https://www.tandfonline.com/doi/full/10.1080/10691898.2020.1860725"><em>Teaching Creative and Practical Data Science at Scale</em> (<em>Journal of Statistics and Data Science Education</em>, 2021)</a>.</p>

<p>After the success of that course—which has 300-500 students every quarter now—it seemed like adapting that to neuroscience, specifically, made a lot of sense.</p>

<p>Before the quarter began, my <strong>Course Objectives</strong> were for the students to learn how to:</p>
<ul>
  <li>think from a “data first” perspective: what data would you need to answer your scientific questions of interest?</li>
  <li>develop hypotheses specific to big data environments in neuroscience.</li>
  <li>work with many different neuroscience data types that might include data on brain structure and connectivity, single-unit spiking, field potential, gene expression, and even text-mining of the peer-reviewed neuroscientific literature.</li>
  <li>read and analyze data stored in standard formats (e.g., Neurodata Without Borders and Brain Imaging Data Structure).</li>
  <li>integrate multiple heterogeneous datasets in scientifically meaningful ways.</li>
  <li>choose statistical model(s) informed by the underlying data.</li>
  <li>design a big data experiment and excavate data from multiple open data sources.</li>
  <li>consider alternative hypotheses and assess for spurious correlations and results.</li>
</ul>

<p>Certainly I didn’t meet all of these objectives. Adapting the course from my original vision to the pandemic-induced online environment was imperfect, with one student calling the Zoom-based project interactions “difficult” and “schizophrenic”. I was worried about how it all would “work”, but then there were comments like this:</p>

<blockquote>
  <p>I cannot speak highly enough about my experiences in this course and how well-prepared I feel for my continuing education at UCSD. This course’s material had significantly more personal connection and real-world application than other courses I have completed, and I am sure other students feel the same.</p>
</blockquote>

<p>Seeing feedback like this, combined with the quality of the projects themselves, tells me that there’s something really cool and exciting here with <strong>Neural Data Science</strong>, and I can’t wait to teach it again and share more with you all.</p>

<p>In the meantime, you can look at the course materials on GitHub, <a href="https://github.com/NeuralDataScience/Overview">here</a>, and adapt them as you see fit. If you want to try and teach a version of this course, please let me know! I’d be happy to chat.</p>]]></content><author><name></name></author><category term="Data Science" /><category term="April 2021" /><summary type="html"><![CDATA[April 6, 2021 by Bradley Voytek]]></summary></entry><entry><title type="html">Git/Github Introduction</title><link href="https://voyteklab.com/git/git-primer/" rel="alternate" type="text/html" title="Git/Github Introduction" /><published>2021-03-04T00:00:00+00:00</published><updated>2021-03-04T00:00:00+00:00</updated><id>https://voyteklab.com/git/git-primer</id><content type="html" xml:base="https://voyteklab.com/git/git-primer/"><![CDATA[<p><em>March 4, 2021 by Bradley Voytek</em></p>

<p><img class="aligncenter size-large" src="/assets/images/posts/Git_confusion.jpg" alt="why is Git so confusing" /></p>

<p>Version control for software used to be a <em>huge</em> pain. Imagine co-writing a document with someone. We use Google docs now, which lets multiple people edit the document at once, and it saves a history of each edit. You can open up a browser and look at that history and if you don’t like an edit (or whole set of edits) you can “roll back” (<strong>git reset</strong>) to that previous version.</p>

<p>But it used to be that we’d email a Word doc to someone when we wanted their edits, and then we’d wait for them to email their edits back. And that sucks and means multiple people can’t easily work on the same docs at the same time.</p>

<p>So the modern solution is Git and GitHub.</p>

<p><strong>Git</strong> is the software used for tracking edits to files. The workflow has two steps: modified file(s) are selected (<strong>git add</strong>), and then a snapshot of them is saved (<strong>git commit</strong>) with a unique identifier, so that every edit can be tracked.</p>
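<p>Here’s that add-then-commit cycle in command form (the repo, file, and message names are invented for illustration):</p>

```shell
# make a throwaway repository to play in
git init demo-repo
cd demo-repo

# tell Git who you are (once per machine, or per repo as here)
git config user.name "Your Name"
git config user.email "you@example.com"

# create and edit a file
echo "first draft" > notes.txt

# select the modified file for the next snapshot
git add notes.txt

# create the snapshot; Git gives it a unique identifier (a hash)
git commit -m "Add first draft of notes"

# each line printed here is one commit, prefixed by its identifier
git log --oneline
```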

<p><strong>GitHub</strong> is a website where all those edits can be hosted in the cloud, instead of just living on your computer. This makes it easier for other folks to edit them, too.</p>

<p>Let’s say you’re working on a group project, and that project has 10 different files. Those 10 files live in a <strong>repository (or “repo”)</strong> on GitHub. A repo is just a folder containing all the files for your project. It also contains all the <em>history</em> of your project, too.</p>

<p>Normally when you start a project you’d create a new empty repo, with no history or files, but
because this project is already in progress you don’t want to start a new one. Instead you’ll
<strong>fork</strong> and <strong>clone</strong> the existing project repo, which means you copy <em>everything</em> in that GitHub repo—all the files and their history—onto your computer. This version that you now have is called your <strong>local copy</strong> or <strong>working copy</strong>.</p>
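<p>Forking happens with a button on GitHub (it makes a server-side copy of the repo under your account); cloning is the command that copies everything down to your machine. With a real project you’d paste the repo’s URL from GitHub; in this self-contained sketch a local repo stands in for GitHub:</p>

```shell
# make a small repo to stand in for the one hosted on GitHub
git init upstream
cd upstream
git config user.name "Demo" && git config user.email "demo@example.com"
echo "hello" > README.md
git add README.md
git commit -m "Initial commit"
cd ..

# clone it; with GitHub this would look something like
#   git clone https://github.com/some-org/some-project.git
git clone upstream working-copy

# the clone has all the files AND the full history
cd working-copy
git log --oneline
```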

<p>So now, with this kind of version control in place, whenever you make edits to a file that’s part
of the repo that you just cloned, you can edit those files in your working copy. At the same time
someone else from your team can also edit those same files, but on their working copy. No more
waiting to send emails of files back and forth.</p>

<p>Let’s say the project has been going for a while, and there’s like, 230 revisions (<strong>commits</strong>) that have been made to the files over the course of its history. A revision is the state of the repository at a certain point in time. If someone breaks everything by uploading bad code, you can roll back (<strong>git reset</strong>) to a prior revision.</p>
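<p>Rolling back looks like this (a self-contained sketch with invented names; note that <code>--hard</code> throws the bad changes away outright, so use it with care):</p>

```shell
# a repo with one good revision and one bad one
git init reset-demo
cd reset-demo
git config user.name "Demo" && git config user.email "demo@example.com"
echo "good code" > code.txt
git add code.txt && git commit -m "Good revision"
echo "broken code" > code.txt
git commit -am "Bad revision"

# roll back one revision, discarding the bad changes (careful!)
git reset --hard HEAD~1

cat code.txt       # back to "good code"
git log --oneline  # the bad revision is gone from this branch
```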

<p>Okay, so you’re editing files in that project. Git notices you’ve changed your working copy files, but it’s only noticed that the files have changed, it hasn’t done anything with that information yet. In order for Git to save a snapshot of all the changes, so that you can roll back to that save state or alert your colleagues that you’ve changed them, you have to <strong>commit</strong> those edits.</p>

<p>Once you’ve committed, you can then choose to share those files with the world, and get those edits/commits off of just your computer and put them onto GitHub. In that case, you’ll <strong>push</strong> those commits (all the files you’ve tracked during edits) to GitHub, where you save your changes back to the repository.</p>
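<p>A push, end to end (here a local <em>bare</em> repo — one with just the history, no working files — stands in for GitHub, so the sketch runs anywhere; when you clone from GitHub, the <code>origin</code> remote is set up for you automatically):</p>

```shell
# a bare repo stands in for the shared GitHub repo
git init --bare hub.git
git clone hub.git me
cd me
git config user.name "Demo" && git config user.email "demo@example.com"

# do some work and commit it locally
echo "shared work" > work.txt
git add work.txt && git commit -m "First shared commit"

# send your commits up to the shared repo
git push origin HEAD

# the shared repo now lists your branch and its latest commit
git ls-remote origin
```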

<p>Usually this occurs without any problems.</p>

<p>Sometimes, however, you’ll make changes to the same file that someone else also changed, committed, and pushed.</p>

<p>In the example case we’re using here, you were working off revision #230, but I was editing it too, and I pushed my code before you did so the official GitHub repo for our project is actually now on revision #231. And the official Github revision #231 looks different from the edits you made to #230… so your revision #231 is different from the official repo #231.</p>

<p>So you now have to <strong>pull</strong> the new revision #231 from the GitHub repo, which then tries to <strong>merge</strong> your edits with the #231 version on GitHub. That is, you combine two sets of changes to the files in your project. (Note that pull grabs the files and tries to merge them, in one step. Alternatively, you can <strong>fetch</strong> the files, which just downloads the new data, but doesn’t merge (integrate) those data with your working copy. So some people prefer to fetch-then-merge instead of pull.)</p>
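<p>Both routes as commands (a self-contained sketch: Alice’s repo stands in for the official GitHub repo, and Bob is you, one revision behind):</p>

```shell
# Alice has a repo with one commit
git init alice && cd alice
git config user.name "Alice" && git config user.email "alice@example.com"
echo "revision 230" > file.txt
git add file.txt && git commit -m "Revision 230"
cd ..

# Bob clones it (this stands in for cloning from GitHub)
git clone alice bob

# meanwhile, Alice publishes a new revision
cd alice
echo "revision 231" > file.txt
git commit -am "Revision 231"
cd ../bob

# option 1: pull = fetch + merge in one step
#   git pull
# option 2 (what some people prefer): fetch, look, then merge
git fetch origin                      # download the new commits; touch nothing local
git log HEAD..origin/HEAD --oneline   # peek at what's new upstream
git merge origin/HEAD                 # fold the changes into your working copy
cat file.txt                          # now contains "revision 231"
```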

<p>If you happened to edit the same exact lines of the same exact files as the version #231 I created, this creates a <strong>merge conflict</strong>, and someone has to step in and look at what those differences are and decide how to combine them together.</p>
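<p>A minimal conflict, start to finish (branch names invented for the demo). Git writes both versions into the file between conflict markers, and a human picks the winner:</p>

```shell
# a repo where two branches edit the same line
git init conflict-demo && cd conflict-demo
git config user.name "Demo" && git config user.email "demo@example.com"
echo "original line" > story.txt
git add story.txt && git commit -m "Base"

git checkout -b yours
echo "your version" > story.txt
git commit -am "Your edit"

git checkout -               # back to the original branch
echo "their version" > story.txt
git commit -am "Their edit"

# merging now stops, and Git marks the disputed lines in the file
git merge yours || true
cat story.txt
# <<<<<<< HEAD
# their version
# =======
# your version
# >>>>>>> yours

# a human edits the file to the agreed text, then commits the resolution
echo "agreed version" > story.txt
git commit -am "Resolve conflict"
```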

<p>Once you do this, you can <strong>push</strong> this new revision which is now #232, because it combines the official #231 with your #231 to create a whole new version, which is a <strong>child</strong> of both the official #231 and your #231. This idea of being a “child of” is part of what’s called a <strong>dependency graph</strong>, where you can only push code that is a direct <strong>descendant</strong> of the official repo. Because you had version #230, but I changed #230 to become #231 on the official repo, your edits were not a direct descendant of the official repo’s latest revision (now #231), so they had to be folded in.</p>

<p>If you want folks on your team to look at your edits, make comments on them, review your code, etc. you can initiate a <strong>pull request</strong> on GitHub, which tells people “hey, pull this new revision, test it out, and let me know what you think.”</p>

<p>If you want to try out some side experimental thing—like adding some new feature—without breaking the official code base for everyone else, you can create a <strong>branch</strong> off of the repo at some point in its revision history. This lets you and your team mess around and try things out without breaking the official version. Once you all are happy with things, you can then merge that branch back <em>onto</em> the main branch to fold in all of that branch’s changes, history, etc. into the main repo.</p>

<p>Or, you might decide it was all terrible and no good, so you can change back to, or checkout, the main repo again, abandoning your changes.</p>
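<p>The branch workflow in miniature (names invented; both outcomes are shown, with the abandon-it path commented out so the merge path runs):</p>

```shell
# a repo with some stable work on the main branch
git init branch-demo && cd branch-demo
git config user.name "Demo" && git config user.email "demo@example.com"
echo "stable code" > app.txt
git add app.txt && git commit -m "Stable version"

# create an experimental branch and switch to it
git checkout -b experiment
echo "wild new feature" >> app.txt
git commit -am "Try a wild new feature"

# happy with it? switch back to the main branch and fold it in
git checkout -
git merge experiment

# ...or, if it was all terrible, you would instead abandon it:
#   git checkout -
#   git branch -D experiment

git log --oneline   # the experiment's commits are now on the main branch
```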

<p>Okay, that’s basically all of Git and GitHub that you’ll likely ever need to know. By the time you need to know more, you’ll be expert enough that learning it will be much easier.</p>

<p>The lingo:</p>

<ul>
  <li><strong>Git</strong></li>
  <li><strong>GitHub</strong></li>
  <li><strong>repository</strong> (repo)</li>
  <li><strong>add</strong></li>
  <li><strong>clone</strong></li>
  <li><strong>working copy</strong></li>
  <li><strong>commit</strong> (revision)</li>
  <li><strong>reset</strong> (roll back)</li>
  <li><strong>push</strong></li>
  <li><strong>pull</strong></li>
  <li><strong>merge</strong></li>
  <li><strong>fetch</strong></li>
  <li><strong>merge conflict</strong></li>
  <li><strong>child</strong></li>
  <li><strong>dependency graph</strong></li>
  <li><strong>descendant</strong></li>
  <li><strong>pull request</strong></li>
  <li><strong>branch</strong></li>
  <li><strong>checkout</strong></li>
</ul>]]></content><author><name></name></author><category term="git" /><category term="March 2021" /><summary type="html"><![CDATA[March 4, 2021 by Bradley Voytek]]></summary></entry><entry><title type="html">Oscillating Organoids</title><link href="https://voyteklab.com/organoids/oscillations/oscillation-organoids/" rel="alternate" type="text/html" title="Oscillating Organoids" /><published>2019-09-27T00:00:00+00:00</published><updated>2019-09-27T00:00:00+00:00</updated><id>https://voyteklab.com/organoids/oscillations/oscillation-organoids</id><content type="html" xml:base="https://voyteklab.com/organoids/oscillations/oscillation-organoids/"><![CDATA[<p><em>September 27, 2019 by Bradley Voytek</em></p>

<p>Lab PhD student Richard Gao has written about our latest paper, “Complex Oscillatory Waves Emerging
from Cortical Organoids Model Early Human Brain Network Development,”
<a href="https://www.cell.com/cell-stem-cell/fulltext/S1934-5909(19)30337-6">published in <em>Cell Stem Cell</em></a>.
This work was recently featured in the <em>New York Times</em>!</p>

<p>It’s an amusing, honest look at what has been a long—but scientifically fascinating—collaboration.</p>

<p>Give it a read over on his <a href="http://www.rdgao.com/oscillating-organoids/">blog</a>!</p>]]></content><author><name></name></author><category term="organoids" /><category term="oscillations" /><category term="September 2019" /><summary type="html"><![CDATA[September 27, 2019 by Bradley Voytek]]></summary></entry><entry><title type="html">Inferring synaptic excitation/inhibition balance from field potentials</title><link href="https://voyteklab.com/eibalance/oscillations/publications/inferring/" rel="alternate" type="text/html" title="Inferring synaptic excitation/inhibition balance from field potentials" /><published>2017-09-13T00:00:00+00:00</published><updated>2017-09-13T00:00:00+00:00</updated><id>https://voyteklab.com/eibalance/oscillations/publications/inferring</id><content type="html" xml:base="https://voyteklab.com/eibalance/oscillations/publications/inferring/"><![CDATA[<p><em>September 13, 2017 by Lab Manager</em></p>

<p><span style="text-decoration: underline;"><strong>Highlights (tl;dr)</strong></span></p>

<p>The overarching goal of our recent <em>NeuroImage</em> paper (<a href="/assets/images/posts/Gao-NeuroImage2017.pdf">PDF</a>) is to make inferences about the brain’s synaptic/molecular-level processes using large-scale (very much non-molecular or microscopic) electrical recordings. In the following blog post, I will take you through the concept of excitation-inhibition (EI) balance, why it’s important to quantify, and how we go about doing so in the paper, which is the novel contribution. It’s aimed at a broad audience, so there are a lot of analogies and oversimplifications, and I refer you to the paper itself for the gory details. At the end, I reflect a little on the process and talk about the real (untold) story of how this paper came to be.</p>

<p><span style="text-decoration: underline;"><strong>A Tale of Two Forces</strong></span></p>

<p>Inside all of our brains, there are two fundamental and opposing forces – no, not good and evil – excitation and inhibition. Excitatory input, well, “excites” a neuron, causing it to depolarize (become more positively charged internally) and fire off an action potential if enough excitatory inputs converge. This is the fundamental mechanism by which neurons communicate: short bursts of electrical impulses. Inhibitory inputs, on the other hand, do exactly the opposite: they hyperpolarize a neuron, making it less likely to fire an action potential. Not to be hyperbolic, but since before you were born these two forces have been waging war with and balancing one another through embryonic development, infancy, childhood, and adulthood, until death. There are lots of molecular mechanisms for excitation and inhibition, but for the most part, “excitatory neurons” are responsible for sending excitation via a neurotransmitter called glutamate, and “inhibitory neurons” are responsible for inhibition via GABA.</p>

<p style="text-align:center;"><img src="/assets/images/posts/EI_screen.png" alt="" width="567" height="221" /></p>

<p>Like all great rivalries (think Batman and Joker, Celtics and Lakers), these two forces cannot exist without each other, but they also keep each other in check: too much excitation leads the brain to have runaway activity, such as what happens in seizure, while too much inhibition shuts everything down, as happens during sleep, anesthesia, or being drunk. This makes intuitive sense, and scientists have empirically validated this “excitation-inhibition balance” concept numerous times. This EI balance, as it’s called, is ubiquitous under normal conditions, and has been proposed to be crucial for neural computation, the routing of information in the brain, and many other processes. Furthermore, it’s been hypothesized, with some experimental evidence in animals, that an imbalance of excitation and inhibition is the cause (or result) of many neurological and psychiatric disorders, including epilepsy, schizophrenia, and autism, just to name a few.</p>

<p><span style="text-decoration: underline;"><strong>Finding Balance</strong></span></p>

<p>Given how important this intricate balance is, it is surprisingly difficult to measure, at any given moment, the ratio between excitatory and inhibitory inputs. I mentioned above that there is empirical evidence for balance and imbalance. However, in the vast majority of these cases, measurements are done by poking <strong>tiny</strong> electrodes into single neurons and, via a protocol called voltage clamping, scientists record <em>within</em> a single neuron how much excitatory and inhibitory input that neuron is receiving. Because the setup is so delicate, it’s often done in slices of brain tissue kept alive in a dish, or sometimes in a head-fixed, anesthetized mouse or rat – basically, in brain tissue that can’t move much, but not in humans. I mean, imagine doing this in the intact brain of a living human – yeah, I can’t either. And as far as I know, it’s never been done. This presents a pretty big conundrum: if we want to directly link a psychiatric disorder to an improper ratio between excitation and inhibition in the human brain, but we can’t actually measure that ratio, how can we corroborate that EI (im)balance matters in the way we think it does?</p>

<p style="text-align:center;"><img src="/assets/images/posts/EI_screen_2.png" alt="" width="546" height="315" /></p>


<p><span style="text-decoration: underline;"><strong>Our Approach: Parsing Balance From “Background Noise”</strong></span></p>

<p>This is exactly the problem we try to solve in our recent paper published in <em>NeuroImage</em>: how might one estimate the ratio between excitation and inhibition in a brain region without having to invasively record from within a brain cell (which is not something most people would like to happen to them)?</p>

<p>Well, recording inside a brain cell is hard, but recording outside brain cells – extracellularly – is a LOT easier. It’s still pretty invasive, depending on the technique, but much safer and more feasible in moving, living, behaving people. Of course, recording outside the brain cell is not the same as recording the inside – when we record electrical fluctuations in the space <em>around</em> neurons, rather than from within or right next to a single neuron, we’re picking up the activity of <em>thousands to millions</em> of cells all <a href="http://www.scholarpedia.org/article/Local_field_potential">mixed up together</a>.</p>

<p>The first critical idea of our paper was that this aggregate signal – often referred to as the local field potential (LFP) – reflects excitatory and inhibitory inputs onto a large population of local cells, not just a single one. Therefore, we should be able to get a general estimate of balance by decoding this aggregate signal. The second piece of critical information was the realization that (for the most part) excitatory inputs are fast and inhibitory inputs are slow so that, even when they are mixed together from millions of different sources like in the LFP signal, we are still able to separate their effects: not in time, but in the frequency-domain (see our <a href="https://github.com/voytekresearch/tutorials/blob/master/Power%20Spectral%20Density%20and%20Sampling%20Tutorial.ipynb">frequency domain tutorial</a>).</p>

<p style="text-align:center;"><img src="/assets/images/posts/EI_screen3.png" alt="" width="583" height="208" /></p>

<p>(A: LFP model with excitatory and inhibitory inputs; B: the time course of E and I inputs from a single action potential; C: simulated synaptic inputs (blue and red) and LFP (black); F: LFP index of E:I ratio)</p>

<p><span style="text-decoration: underline;"><strong>Combining Computational Modeling and Empirical (Open) Data</strong></span></p>

<p>Pursuing this line of reasoning, we simulated populations of neurons <em>in silico </em>and looked at how their activity would generate a local field potential recording. What this means is that we can generate, in a computer simulation, different ratios of excitatory or inhibitory inputs into a brain region and see how that influences the simulated LFP. Through this computational model we found an index for the relative ratio between excitation and inhibition.</p>

<p>For those of you that are into frequency-domain analysis of neural signals, this index is the 1/f power law exponent of the LFP power spectrum. Let’s unpack that a bit. In the figure above (panel B) you can see that the excitatory electrical currents (blue) that contribute to the LFP shoot up in voltage really quickly—within a few thousandths of a second—and then slowly decay back down to zero. In contrast, the inhibitory currents (red) also shoot up pretty quickly—but not <em>as</em> quickly—and then decay back to zero <em>much</em> more slowly than the excitatory inputs. When you add up thousands of these currents happening all at different times, the simulated voltage (panel C, black) looks to us humans a lot like noise. But through the mathematical magic of the Fourier transform, when we look at this same signal’s frequency representation, they’re clearly distinguishable!</p>
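<p>For a schematic sense of <em>why</em> the Fourier transform pulls fast and slow currents apart (this is textbook signal-processing math, not the paper’s full model): a current that jumps up and decays exponentially with time constant τ has a Lorentzian power spectrum,</p>

```latex
% one-sided exponential current kernel, time constant \tau
I(t) = I_0 \, e^{-t/\tau}, \qquad t \ge 0
% squared magnitude of its Fourier transform: a Lorentzian
|\hat{I}(f)|^2 = \frac{I_0^2 \, \tau^2}{1 + (2\pi f \tau)^2}
```

<p>which is flat below a “knee” frequency \(f_k = 1/(2\pi\tau)\) and falls off like \(1/f^2\) above it. Fast excitatory currents (small τ) put their knee at high frequencies; slow inhibitory currents (large τ) put it low. So shifting the E:I ratio shifts the balance of high- versus low-frequency power in the summed signal, which is exactly what the 1/f exponent measures.</p>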

<p>More technically, the idea is that the ratio between fast excitation and slow inhibition should be represented by the relative strength between high-frequency (rapidly fluctuating) and low-frequency (slowly fluctuating) signals. With this hypothesis in hand, we were able to make use of several publicly available databases of neural recordings to validate the predictions made by our computational models in a few different ways. One example from the paper: we were lucky enough to find a recording from macaque monkeys undergoing anesthesia, and the anesthetic agent, propofol, acts through amplifying inhibitory inputs in the brain at GABA synapses. Therefore, we predicted that when the monkey goes under, we should see a corresponding change in the power law exponent, and that’s exactly what we found! As you can see below, our EI index remains relatively stable during the awake state, then immediately shoots down toward an inhibition-greater-than-excitation regime during the anesthetized state before coming back to baseline after the anesthesia wears off.</p>

<p style="text-align:center;"><img src="/assets/images/posts/EI_4.png" alt="" width="478" height="217" /></p>


<p><span style="text-decoration: underline;"><strong>Takeaways and Disclaimers</strong></span></p>

<p>So to summarize, we were able to make predictions, borne out of observations from previous physiological experiments and our own computational modeling, and then validate these predictions using data from <strong>existing databases</strong> to draw a link between EI balance, which is a cellular-level process, and the local field potential, which is an aggregate circuit-level signal. Personally, I think that bridging the gap between these different levels of description in the brain is super interesting, and it’s one way for us to confirm our understanding of how the brain gives rise to cognition and behavior at multiple scales. Furthermore, we can now make use of the theoretical idea of EI balance in places where it was previously inaccessible, such as human patients responding to treatments.</p>

<p>Before I wrap up, I just want to point out that <strong>this paper does not conclusively show that EI balance directly shifts the power law exponent</strong> – what we show is a suggestive correlation. Nor does the correlation hold under all circumstances. We had to make a lot of assumptions in our model and the data we found, such as the noise-like process by which we generated the model inputs. I’m not throwing this out here to inhibit the excitement (hah, hah), but rather to limit the scope of our claim, especially for a public-facing blog piece like this.</p>

<p>Rather, ours is the first step of an ongoing investigation, and although we will probably find evidence that <a href="/eibalance/oscillations/publications/ei-balance">corroborates</a> and <a href="https://arxiv.org/abs/1708.09042">contradicts</a> our findings later on, it’s important that anyone reading this and getting excited (hah) about it understands that we do not, and likely will not, have the last word on this. Ultimately, though, I believe we stumbled onto something pretty cool and we’ll definitely follow up on those assumptions one by one, and hopefully have more blog posts to come!</p>

<p><span style="text-decoration: underline;"><strong>Some Personal Reflection</strong></span></p>

<p>This project was my first real scientific research project in grad school, and it definitely created in me a lot of joy and excitement, as well as caused a fair amount of brooding. As a whole, I really enjoyed the process of building a computational model, even if it was quite simple, and using the predictions from that to inform further empirical investigations. As I mentioned, I think we really need to bridge the gap between molecular-level mechanisms in the brain and circuit/organism-level “neural markers”, and computational modeling work allows us to do that in situations where it would be intractable for many reasons. I certainly subscribe to the notion that combining theoretical/computational work with empirical data is an exciting and fruitful line of research, because it fills a space between two successful but largely non-overlapping subfields in neuroscience (though that trend is now changing).</p>

<p>Also, the fact that we were able to test our predictions on publicly available data was such a blessing, as we simply did not have the capacity, as a new lab, to do those <em>in vitro</em> and <em>in vivo </em>experiments ourselves. However, that meant combing through tons and tons of data where there might have been unlabeled or badly labeled information, only to reach the conclusion that the data were unusable for our purposes. There was some (okay, a lot of) headbanging due to this, but ultimately, we found useful (FREE!) data and I’m very grateful for the people that made them available: <a href="https://crcns.org/data-sets/hc/hc-2/about-hc-2">CRCNS</a> and Buzsaki Lab, <a href="http://neurotycho.org/">Neurotycho</a> and Fujii Lab, as well as many friends and collaborators that donated data for us to test different routes. To support this open-access endeavor, all code used to produce the analyses and figures is on our lab GitHub, found <a href="https://github.com/voytekresearch/eislope">here</a>.</p>

<p><span style="text-decoration: underline;"><strong>The Untold Story</strong></span></p>

<p>One last note, for those of you that find the process of scientific discovery interesting: in this blog post, I tried to write the story as a lay-friendly CliffsNotes version of the paper, starting with the importance of EI balance and the motivation to find an accessible index of it in the LFP, then outlining how we went about solving that problem. That’s the scientific story, and while not false, it’s not chronological.</p>

<p>The actual story began with <a href="/oscillations/publications/jneuro-1f">Brad’s 2015 paper</a> showing that aging is associated with 1/f changes. That was actually what first interested me back when I started in 2014 – this seemingly ubiquitous phenomenon (1/f scaling) in neural data. After digging a bit to find various accounts for how 1/f observations arise in nature, we decided to just simulate the LFP ourselves and see what happens. Turns out, the 1/f naturally falls out of the temporal profile of synaptic currents, which both have exponential rise and decay.</p>

<p>Our model contained what I thought to be the bare minimum: excitatory and inhibitory currents. At that point, I didn’t have a clue about what EI balance was and what it has been linked to. I think I was twiddling parameters one day, and realized that changing the relative weight of E and I inputs will cause the 1/f exponent (or slope) to change because of their different time constants. Then, like any modern-day graduate student, I Googled to see if this is something that actually happens <em>in vivo</em>, and the rest was history. This little anecdote really just speaks to the serendipity of science, and it couldn’t have happened without the many hours of spontaneous discussions in the lab, which I’m also very grateful for, and Google. I think these little stories really liven up the otherwise logical world of science, and I’d love to read about such stories from other people!</p>]]></content><author><name></name></author><category term="EIbalance" /><category term="oscillations" /><category term="publications" /><category term="September 2017" /><summary type="html"><![CDATA[September 13, 2017 by Lab Manager]]></summary></entry><entry><title type="html">More evidence 1/f LFP “noise” indexes excitation/inhibition balance</title><link href="https://voyteklab.com/eibalance/oscillations/publications/ei-balance/" rel="alternate" type="text/html" title="More evidence 1/f LFP “noise” indexes excitation/inhibition balance" /><published>2017-09-05T00:00:00+00:00</published><updated>2017-09-05T00:00:00+00:00</updated><id>https://voyteklab.com/eibalance/oscillations/publications/ei-balance</id><content type="html" xml:base="https://voyteklab.com/eibalance/oscillations/publications/ei-balance/"><![CDATA[<p><em>September 5, 2017 by Bradley Voytek</em></p>

<p>Earlier this year my lab published a somewhat controversial claim: that a traditionally unmeasurable
(in humans), but critical, aspect of brain functioning—the relative balance of neuronal excitation
(E) and inhibition (I)—could in fact be inferred from the brain’s electrical activity. While there
are methods (such as MRS) that can get rough E/I estimates in humans, those have no temporal
resolution; they only measure <em>where</em>, not <em>when</em>. Methods that <em>do</em> have temporal resolution require recording
from single neurons, making large-scale measurements in humans impossible.</p>

<p>This is unfortunate, because there are massive amounts of modeling and animal work suggesting that
E/I balance is critical for information gating, working memory maintenance, and so on, while E/I
imbalances have been implicated in nearly every major neurological and psychiatric disorder at this
point.</p>

<p>Anyway, I’m particularly proud of this paper, “Inferring synaptic excitation/inhibition balance from
field potentials,” published in <em>NeuroImage</em>
<a href="/assets/pdfs/posts/Gao-NeuroImage2017.pdf">(PDF)</a>, especially because the
argument first arose out of some computational modeling that then informed the <em>in vivo</em> analyses.</p>

<p>One reason I love the paper is because it really was an effort by lab Cognitive Science PhD student
<a href="https://twitter.com/_rdgao">Richard Gao</a> and former post-doc
<a href="https://twitter.com/parenthetical_e">Erik Peterson</a> to try and figure out, physiologically, what
was driving the results I’d found in my 2015 <em>J Neurosci</em> paper “Age-related Changes in 1/<em>f</em> Neural
Electrophysiological Noise” <a href="/assets/pdfs/posts/Voytek-JNeurosci2015.pdf">(PDF)</a>,
and to put to the test some assertions I’d made in my 2015 <em>Biol Psychiatry</em> review, “Dynamic network
communication as a unifying neural basis for cognition, development, aging, and disease”
<a href="/assets/pdfs/posts/Voytek-BiolPsychiatry2015.pdf">(PDF)</a>. My lab
wasn’t satisfied with my arguments, so we shifted gears a bit and really decided to nail this thing
down.</p>

<p>What we found is that the 1/<em>f</em> “noise” in the LFP/ECOG signal actually reflects balance of the E/I
<em>inputs</em> to that region, due to the fact that excitatory glutamate and inhibitory GABA currents that
dominate the LFP/ECOG have different temporal profiles that manifest in the frequency spectra in
stereotypical ways. Specifically, more excitation leads to a “flatter” 1/<em>f</em> while more inhibition
leads to a more negatively-sloped 1/<em>f</em>.</p>

<p>In spite of (or perhaps <em>because</em> of) my love for this paper, I’m always trying to find evidence to
discredit or temper my enthusiasm. Even though we had supporting evidence from modeling, as well as
from publicly available recordings from rat hippocampus LFP and macaque cortical ECOG, we’re making
quite a bold claim, and of course extraordinary claims require extraordinary evidence.</p>

<p>A <a href="http://www.biorxiv.org/content/early/2017/09/03/180455">paper posted to bioRxiv this week</a> by
<a href="https://iris.ucl.ac.uk/iris/browse/profile?upi=SFFAR49">Simon Farmer’s</a> group, led by Timothy
West at UCL, caught my attention. This paper, “Propagation of Beta/Gamma Rhythms in the
Cortico-Basal Ganglia Circuits of the Parkinsonian Rat,” is amazing because they simultaneously
recorded the ECOG and LFP from multiple different parts of the motor basal ganglia circuit. This
piqued my interest, because this circuit is pretty well understood, with each part of it getting
different amounts of excitatory and/or inhibitory input, with the concomitant E/I changes pretty
well understood in Parkinson’s disease.</p>

<p>Back in 2006 I published a <a href="http://www.jneurosci.org/content/26/28/7317">mini-review in J Neurosci</a>
called “Emergent Basal Ganglia Pathology within Computational Models”, that shows this circuit in
healthy and Parkinsonian brains:</p>

<p><br />
<img src="/assets/images/posts/Voytek-LebloisJournalClub-Fig1-1024x550.jpg" />
<br /></p>

<p>In this figure, arrows indicate excitatory inputs and dots inhibitory ones. You can see that in the
healthy state (left) the subthalamic nucleus (STN) gets excitatory inputs from the motor cortex, the
striatum gets mixed inputs, the external globus pallidus (GPe) gets inhibitory inputs, and the motor
cortex gets excitatory input from the thalamus (though it’s strongly inhibited at rest).</p>

<p>In Parkinson’s disease, the striatum loses nigral (SNc) inputs, shifting its balance toward
over-excitation from the cortex. This over-inhibits the GPe, in turn reducing excitation to the
cortex and making it more difficult for people with Parkinson’s disease to initiate movements.</p>

<p>Check out the power spectra from the West pre-print:</p>

<p><br />
<img src="/assets/images/West-bioRxiv-Fig2-1024x485.jpg" />
<br /></p>

<p>As predicted by our 1/<em>f</em>-reflects-EI model, and known basal ganglia circuitry, the motor cortex
ECOG (purple) is the most negatively-sloped (thus, most inhibitory), which makes sense given the
rats are anesthetized here (this replicates the macaque cortical ECOG from our <em>NeuroImage</em> paper).
In the 6-OHDA lesioned animals (a model for Parkinson’s disease, right) the motor cortex appears to
be even more inhibited, also following our EI predictions and the known circuitry.</p>

<p>EI balance is slightly shifted toward E, though still more inhibitory, in the GPe (blue and green),
which again makes sense as the GPe gets massive inhibitory inputs. Then the striatum (STR) gets
mixed EI, so it’s less sloped and flatter, while the STN, which gets mostly excitatory inputs, is
the flattest, which, in our model, suggests its EI balance is tipped most toward E.</p>

<p>Importantly, this STR/STN dynamic changes in Parkinson’s where the loss of nigral input shifts the
STR to over-excitatory. Note in the spectra that the flattest (and thus most excitatory) are no
longer STN, but STR.</p>

<p>This is <em>awesome</em>!</p>

<p>These results are exactly what we would predict if, indeed, the 1/<em>f</em> slope indexes EI balance, given
the known basal ganglia circuitry, the state of the animal (anesthetized), and the effects of
Parkinson’s disease on that circuit.</p>

<p>Score one for computational modeling!</p>]]></content><author><name></name></author><category term="EIbalance" /><category term="oscillations" /><category term="publications" /><category term="September 2017" /><summary type="html"><![CDATA[September 5, 2017 by Bradley Voytek]]></summary></entry><entry><title type="html">UC San Diego Data Science projects</title><link href="https://voyteklab.com/data%20science/data-projects/" rel="alternate" type="text/html" title="UC San Diego Data Science projects" /><published>2017-07-15T00:00:00+00:00</published><updated>2017-07-15T00:00:00+00:00</updated><id>https://voyteklab.com/data%20science/data-projects</id><content type="html" xml:base="https://voyteklab.com/data%20science/data-projects/"><![CDATA[<p><em>July 15, 2017 by Bradley Voytek</em></p>

<html>
In my <a href="/data%20science/teach-data/">last post</a> I outlined how and why I taught my Data Science in Practice (Cognitive Science 108) class here at UC San Diego.

The sole purpose of <em>this</em> post is to show off the incredible talent of my students and to showcase the work they put into their Final Projects.

The purpose of the project was to find real-world problems and datasets that can be analyzed with the techniques learned in class. Here are the instructions I gave the students:
<blockquote>It is imperative that by doing so you believe extra information will be gained – that you believe you can discover something new! Your question could be just for fun (<em>e.g.</em>, “What are commonly misheard song lyrics?”), scientific (<em>e.g.</em>, “How do different cultures perceive different colors?”), or, ideally, aimed at civic or social good (<em>e.g.</em>, “What parts of San Diego are most in need of dedicated bike lanes?”)</blockquote>
<strong>Edit:</strong> The rationale behind why I chose to have the students put together a Final Project like this was so that they could build a Data Science portfolio to show off their skills. Data Science is multifaceted, and requires programming, statistics, data visualization, communication, information aggregation, and so on. When I was recruiting for Data Science positions back in my industry days, students with a portfolio of projects absolutely jumped out at us.

In addition, for this first offering, groups had the option of submitting their projects early in order to participate in a special judging panel led by former (and first) US Chief Data Scientist and friend, DJ Patil (<a href="https://twitter.com/dpatil">twitter</a>, <a href="https://en.wikipedia.org/wiki/DJ_Patil">wikipedia</a>). DJ is a UC San Diego alumnus (Math, 1996) and happened to be in town the weekend before finals week to receive an Honored Alumni Award from the Chancellor. (You can also watch an interview I did with DJ that weekend below.)

While he was in town, he agreed to watch eight of the best groups each give a 5-minute presentation on their final projects.

To be considered for this challenge, the following rules needed to be followed:
<ul>
 	<li>Communicate your results effectively to both experts and laypersons.</li>
 	<li>Use data scientific approaches to address questions specifically concerning civic utility and social good.</li>
</ul>
Based on these simple criteria, the panel selected the top three projects. In addition to DJ, the panel included Brandon Freeman (UC San Diego Alumni Board of Directors and Leidos Engineering Solutions Architect), Arnaud Vedy (Data Analytics, City of San Diego), and Liz Izhikevich (UC San Diego Computer Science undergrad, President and Co-founder of the <a href="https://medium.com/ds3ucsd">UC San Diego Data Science Student Society</a>, and Voyteklab star!)

<figure>
  <img class="wp-image-492 size-large" src="/assets/images/posts/cogs108-panelists-1024x683.jpg" alt="" width="1024" height="683" />
  <figcaption>The panel judgeth.</figcaption>
</figure>

You'll note that in many of the notebooks below the students sometimes make rudimentary statistical, logical, and/or visualization errors. That's okay! Those are learning experiences... and they learned a <em>lot</em> in the short 10 weeks we have in each quarter.

Without further ado, here are the eight finalists:

<strong>First Place</strong>
<a href="https://github.com/COGS108/FinalProjects-Sp17/blob/master/087-FinalProject.ipynb"><strong>"The Road to DJ Patil is Filled with a Multitude of Potholes"</strong>
</a><em>Lee Anne Mercado, Maggie Chan, Tim Lee, Vinh Doan, Young Jin Yun</em>

The panel unanimously agreed that this project—which took a data-driven approach to understanding how San Diego should allocate their pothole workers and resources—was strongest in adhering to the spirit of the competition: they took a problem relevant to civic good and, through diligent and careful data analysis, came to some actionable conclusions.

<figure>
  <img class="wp-image-492 size-large" src="/assets/images/posts/cogs108-winners-1024x683.jpg" alt="" width="1024" height="683" />
  <figcaption>The winning group!</figcaption>
</figure>

<strong>Second Place
<a href="https://github.com/COGS108/FinalProjects-Sp17/blob/master/026-FinalProject.ipynb">GPA Distributions
</a></strong><em>Diego Saldonid, Roger Ruan, Shu-Wei (Lucas) Hsu, James Mata</em>

This group's project was well-loved by us all. The primary reason it didn't win first place was that it wasn't <em>quite</em> as closely related to the civic/social good focus of the competition. Nevertheless it is <em>super</em> interesting. In brief, the students scraped class reports from UC San Diego's Course And Professor Evaluations (CAPE) website to look at how average grades differ across different professors teaching the same classes in different quarters. They then look at how grades might differ between an average student who has a "harder grader" path versus an average student who takes the same classes with an "easier grader". Amazingly, they find it can be almost an entire <em>grade point</em> of difference: 2.49 <em>vs.</em> 3.42 overall GPA!

<strong>Third Place
<a href="https://github.com/COGS108/FinalProjects-Sp17/blob/master/084-FinalProject.ipynb">Transportation Infrastructure
</a></strong><em>Hudson Cooper, Vlad Bakhurinskiy, Muhammad Islam, Marco Rivera, Wenshuo Li</em>

This group begins by asking: "Can we quantify the quality of life with just a point on the map?" This group really took my advice to heart and pulled in data from several different resources to "try to measure the ‘liveability’ of a neighborhood by looking at different aspects of the availability of methods of transportation rather than just more metrics such as an individual's level of education or income." This was a <em>beautiful</em> project with serious civic infrastructure implications. While they focus on San Diego, their methods are easily replicable across different cities. Do yourself a favor and look at their entire analysis notebook.

<img class="aligncenter wp-image-494" src="/assets/images/posts/cogs108_pg084_f1.jpg" alt="" width="994" height="752" />
<blockquote>The upper left quadrant contains all census tracts that are below median income and above median public transportation use. Because we have identified that lower income neighborhoods use more public transportation in general, it makes sense to us that this quadrant would represent the neighborhoods that are most reliant on public transportation infrastructure. Of course improved infrastructure would likely increase public transportation use, bringing some census tracts from below the median public transit usage line into this quadrant, but use the information contained in this plot as a preliminary measure of reliance. This population is actually the best served by public transportation. From our multiple linear regression, even when you hold one fixed and vary the other, low income and high public transportation use each predict high transit score. You can see this plainly from the above plot since most census tracts with 'good' transit score (in blue) lie in this region. However, there are still lots of census tracts in this most reliant population with low transit scores. The census tracts that are in this quadrant and have transit scores of less than 50 (we will refer to these census tracts as 'underserved') account for 20.6% of San Diego's population. 41.2% of this underserved population spends over 30% of their income on rent, a critical indicator of poverty.</blockquote>
I wish I had the time here to expound the virtues of each of the below projects in detail; sadly, I do not. But please do not take my lack of detailed comments as an unspoken commentary on the amazingness of their work. Please do take the time to look at each of these, as they are truly remarkable!
<br />
<br />
<strong>Finalists</strong>
<br />
<a href="https://github.com/COGS108/FinalProjects-Sp17/blob/master/009-FinalProject.ipynb"><strong>"Crime 'n' Booze"</strong>
<br />
</a><em>Jenny Hamer, Aparna Rangamani, Jairo Chavez</em>
<br />
<a href="https://github.com/COGS108/FinalProjects-Sp17/blob/master/014-FinalProject.ipynb"><strong>Chicago Traffic Violations</strong>
<br />
</a><em>Arun Sugumar, Zichao Wu, Xiaoxin Xu, Qixin Ding, Lijiu Liang</em>
<br />
<a href="https://github.com/COGS108/FinalProjects-Sp17/blob/master/033-FinalProject.ipynb"><strong>San Diego Infrastructure</strong></a>
<br />
<em>Anaelle Kim, Grant Sheagley, Dylan Christiano, Shawn Le</em>
<br />
<a href="https://github.com/COGS108/FinalProjects-Sp17/blob/master/052-FinalProject.ipynb"><strong>Flu Demographics</strong>
<br />
</a><em>Vincent Tierra, Adrian Herrmann, Lynley Yamaguchi</em>
<br />
<a href="https://github.com/COGS108/FinalProjects-Sp17/blob/master/057-FinalProject.ipynb"><strong>San Diego Gentrification</strong>
<br />
</a><em>Megan Chang, Abena Bonsu, Raymond Arevalo, Lauren Liao</em>

&nbsp;
<br />
<br />
<strong>Noteworthy Other Projects</strong>
<br />
<a href="https://github.com/COGS108/FinalProjects-Sp17/blob/master/023-FinalProject.ipynb"><strong>Recurrent Neural Networks for Protein Secondary Structure Prediction</strong>
<br />
</a><em>David Wang, Michelle Franc Ragsac, Jimmy Quach, Dhaivath Raghupathy, Shih-Cheng Huang</em>
<br />
<a href="https://github.com/COGS108/FinalProjects-Sp17/blob/master/037-FinalProject.ipynb"><strong>Exercise vs. Food Environment: Obesity Classification</strong>
<br />
</a><em>Bryant Lin, Swarnakshi Kapil, Hendrik Hannes Holste</em>
<br />
<a href="https://github.com/COGS108/FinalProjects-Sp17/blob/master/045-FinalProject.ipynb"><strong>A Valuation of Public Parks</strong>
</a><em>Chad Atalla, Alicia Chen, Nadah Feteih, Alan Chen, Joshua Van Gogh, Anjali Verma</em>
<br />
<a href="https://github.com/COGS108/FinalProjects-Sp17/blob/master/068-FinalProject.ipynb"><strong>Crime and Public Recreation Areas</strong>
<br />
</a><em>Tyler Ly, Reginald Wu, Karen Ma, Ho Tsun Matthew Ho, Erika Morozumi</em>
<br />
<a href="https://github.com/COGS108/FinalProjects-Sp17/blob/master/069-FinalProject.ipynb"><strong>Causes of Car Accidents</strong>
<br />
</a><em>Andy Thai, Johnson Pang, Ronald Baldonado, Haoyuan Wang</em>
<br />
<a href="https://github.com/COGS108/FinalProjects-Sp17/blob/master/081-FinalProject.ipynb"><strong>Are Universities Worth the Opportunity Cost?</strong>
<br />
</a><em>Sharmaine Manalo, Madeline Hsia, Nathanyel Calero, Tianyu Zhang</em>
<br />
<a href="https://github.com/COGS108/FinalProjects-Sp17/blob/master/092-FinalProject.ipynb"><strong>Using Python to Analyze Billboard Top 100 Pop Songs</strong>
<br />
</a><em>Christopher Lo, Ryan Yang, Ken Truong, Kevin Tan, Vivian Mach</em>
<br />
<a href="https://github.com/COGS108/FinalProjects-Sp17/blob/master/034-FinalProject.ipynb"><strong>Media Violence</strong>
<br />
</a><em>Hazel Baker-harvey, Ting Lin, Pratyusha Meka</em>
<br />
<a href="https://github.com/COGS108/FinalProjects-Sp17/blob/master/029-FinalProject.ipynb"><strong>Optimization of Police Car Placement in San Diego</strong>
<br />
</a><em>Emma Roth, Eric Mauritzen, Keven Nguyen, Taralyn Mcnabb</em>
<br />
<a href="https://github.com/COGS108/FinalProjects-Sp17/blob/master/018-FinalProject.ipynb"><strong>Exploring Desalination Plant Numbers</strong>
<br />
</a><em>Christina Cook, Dominic Suares, Youxi Li, Linzhi Xie, Erik Mei</em>
<br />
<a href="https://github.com/COGS108/FinalProjects-Sp17/blob/master/006-FinalProject.ipynb"><strong>Chicago Crime</strong>
<br />
</a><em>Kevin Li, Prithvi Narasimhan, Rajiv Pasricha, Andy Zhang, Matthew Ho</em>
<br />
<br />
Patil says smart words whilst seated next to me:
<br />
<iframe width="560" height="315" src="https://www.youtube.com/embed/pJ0xxucQfts" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
</html>]]></content><author><name></name></author><category term="Data Science" /><category term="July 2017" /><summary type="html"><![CDATA[July 15, 2017 by Bradley Voytek]]></summary></entry><entry><title type="html">Teaching Data Science at Scale</title><link href="https://voyteklab.com/data%20science/teach-data/" rel="alternate" type="text/html" title="Teaching Data Science at Scale" /><published>2017-07-14T00:00:00+00:00</published><updated>2017-07-14T00:00:00+00:00</updated><id>https://voyteklab.com/data%20science/teach-data</id><content type="html" xml:base="https://voyteklab.com/data%20science/teach-data/"><![CDATA[<p><em>July 14, 2017 by Bradley Voytek</em></p>

<html>
A few weeks back I outlined <a href="/data%20science/data-science">our vision for Data Science at UC San Diego</a>. A major part of that vision is to help define and shape the field. While one way we can do that is through direct research, the most impactful way (in my opinion) is through educating the students who follow our curriculum.
<br /><br />
In this post I'll share some of the things I learned this past quarter while teaching Cognitive Science 108—Data Science in Practice. In a <a href="/data%20science/data-projects/">follow-up post I'll share some of the final projects submitted by my students</a>.
<br /><br />
This was the first-ever offering of this course, so I had to design it from the ground up. To add complexity, the enrollment was massive—403 students total!

<br /><br />
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Cogsci 108 - my new Data Science in Practice class. I think students are interested in Data Science here at <a href="https://twitter.com/UCSanDiego?ref_src=twsrc%5Etfw">@UCSanDiego</a>... <a href="https://t.co/DdGa2nHXVP">pic.twitter.com/DdGa2nHXVP</a></p>&mdash; Brad Voytek (@bradleyvoytek) <a href="https://twitter.com/bradleyvoytek/status/848977695178280960?ref_src=twsrc%5Etfw">April 3, 2017</a></blockquote> <script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<br />

Before I go any further, I need to thank several people. First is Lead TA and Voyteklab PhD student <a href="https://twitter.com/Tomdonoghue">Tom Donoghue</a>, who put an <em>enormous</em> amount of work into setting up and maintaining the <a href="http://github.com/COGS108/">course GitHub account</a>, helping write and test the homeworks, putting together all the Section Materials (see below), and so on.
<br /><br />
Additionally I want to thank my dear friends <a href="https://twitter.com/kirstie_j">Kirstie Whitaker</a> (Turing Institute Research Fellow and Mozilla Fellow for Science) and <a href="https://twitter.com/choldgraf">Chris Holdgraf</a> (Berkeley Institute for Data Science Fellow). Both of them stayed with me in San Diego for a few days while I picked their brains on how best to set this up.
<br /><br />
Finally, I want to thank the students, who all took a <em>huge</em> chance signing up for a new course and being guinea pigs while we worked out all the kinks.
<br /><br />
<strong>Course Philosophy</strong>
<br /><br />
As I said in the course Syllabus, the goal of my other class (Cognitive Science 9—Introduction to Data Science) was to give an appreciation for what can be done with data and where data can even lead you astray. In contrast, for COGS 108 I adopted the educational view that “sometimes the best way to learn something is by doing it,” or, more importantly as author Neil Gaiman says, “sometimes the best way to learn something is by doing it wrong and looking at what you did.”
<br /><br />
I wanted to teach students the joys and frustrations of the practice of Data Science. We didn't dive deeply into the methods or proofs of machine learning, clustering, etc. on purpose. The reasons for that were:
<ol>
 	<li>There are entire classes on pretty much each of the topics we covered if students want in-depth details (which I hope they do after taking this class!)</li>
 	<li>Those classes are taught by true experts in each of those domains.</li>
 	<li><em>My</em> expertise is not machine learning, big data, etc. It is in knowledge discovery and data intuitions.</li>
 	<li>I take an open view to learning: data literacy is critical for modern society, and I don’t believe learning these topics should be limited to only those who excel at math, computation, and so on.</li>
</ol>
So we had students try and implement various methods. At times we asked them to implement techniques we explicitly hadn't even taught them yet, as there may be times in their data science careers where they'll be asked to do just that. We wanted them to build a technical toolkit as well as a skeptical mindset and “data intuition”—that nebulous sense that something in a dataset is “off”.
<br /><br />
<strong>Course Mechanics</strong>
<br /><br />
I decided from day one to make all of my lectures publicly available via UC San Diego's podcast and videocasting system (<a href="https://podcast.ucsd.edu/podcasts/default.aspx?PodcastId=4122&amp;l=2&amp;v=1">here</a>). Faults, bad jokes, mistakes, and all.
<br /><br />
To handle the massive deluge of questions, we used <a href="https://piazza.com/ucsd/spring2017/cogs108/home">Piazza</a>. This let me, the TAs, the undergrad instructional assistants, and the other students help one another anonymously or otherwise.
<br /><br />
During sections, Tom used the <a href="https://software-carpentry.org/">Software Carpentry</a>/<a href="http://www.datacarpentry.org/">Data Carpentry</a> sticky <a href="http://swcarpentry.github.io/instructor-training/15-practices/">note method</a>:
<blockquote>We give each learner two sticky notes of different colors, <em>e.g.</em>, red and green. These can be held up for voting, but their real use is as status flags. If someone has completed an exercise and wants it checked, they put the green sticky note on their laptop; if they run into a problem and need help, they put up the red one. This is better than having people raise their hands because:
<ul>
 	<li>it’s more discreet (which means they’re more likely to actually do it),</li>
 	<li>they can keep working while their flag is raised, and</li>
 	<li>the instructor can quickly see from the front of the room what state the class is in.</li>
</ul>
</blockquote>
This also allowed us to "deputize" students who had more experience, so they could help students who were stuck at various points in the exercises. This both increased our ability to teach, letting us focus on bigger questions and issues rather than smaller technical ones, and provided the "deputized" students with experience in teaching.
<br /><br />
For the first assignment we had each student create a <a href="http://github.com">GitHub</a> account, make a pull request of Assignment 1, make the required simple changes, and then push their changes.
<br /><br />
All homework assignments, as well as the final project, were done using <a href="http://jupyter.org/">Jupyter notebooks</a> and graded using <a href="https://twitter.com/jhamrick">Jess Hamrick's</a> <a href="https://github.com/jupyter/nbgrader">nbgrader tools</a>. The Jupyter toolkit included numpy, scipy, pandas, scikit-learn, patsy, beautifulsoup, and scrapy (among others).
<br /><br />
I chose to use GitHub and Jupyter because:
<ol>
 	<li>For many software and data science jobs, a person's GitHub account acts as their <em>de facto</em> resume, and;</li>
 	<li>Jupyter notebooks are excellent for combining code, narrative text, images, and so on in one clear document. I wanted the students to think through each aspect of the data analysis workflow, including the background rationalization, methodological choices, and discussion of the outcomes.</li>
</ol>
Another added bonus of using GitHub was that it helped make clear who wasn't turning in homeworks on time (can't trick version control systems!) and who contributed what to the final projects.
<br /><br />
Finally, using GitHub let us place a lot of extra materials for students to leverage, including code!
<br /><br />
For example, in addition to the <a href="https://github.com/COGS108/Assignments">Assignments</a>, we had <a href="https://github.com/COGS108/SectionMaterials">Section Materials</a>, which Tom put together to teach basic concepts and how to implement them in Python, such as <a href="https://github.com/COGS108/SectionMaterials/blob/master/13-OLS.ipynb">this notebook on ordinary least squares (OLS)</a>:
<br /><br />
<img src="/assets/images/posts/cogs108_ols-1024x785.jpg" alt="" width="800" height="613" />
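<br /><br />
The OLS notebook linked above walks students through this idea; as a rough sketch (illustrative only, <em>not</em> the notebook's actual code, with made-up data), an ordinary least squares fit can be computed directly in numpy:

```python
# Illustrative ordinary least squares fit with numpy: find coefficients b
# minimizing ||y - Xb||^2. Synthetic data; not the course notebook's code.
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0, 10, 100)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=x.size)  # true intercept 1, slope 2

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])
coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coef

print(f"intercept ~ {intercept:.2f}, slope ~ {slope:.2f}")
```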
<br /><br />
Or <a href="https://github.com/COGS108/LectureMaterials/blob/master/UCSD%20COGS108%20-%20Data%20Science%20in%20Practice%20-%20Central%20Limit%20Theorem.ipynb">this walkthrough</a> I put together to explain the utility of the <a href="https://en.wikipedia.org/wiki/Central_limit_theorem">Central Limit Theorem</a>:
<br /><br />
<img src="/assets/images/posts/cogs108_central_limit-1024x851.jpg" alt="" width="800" height="665" />
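<br /><br />
The core of the theorem can be demoed in a few lines (a hedged sketch, <em>not</em> the walkthrough's own code): sample means of even a strongly skewed distribution cluster symmetrically around the true mean, with spread shrinking as 1/&radic;<em>n</em>:

```python
# Quick Central Limit Theorem demo: means of samples from a skewed
# distribution are approximately normal. Illustrative sketch only.
import numpy as np

rng = np.random.default_rng(1)
n, n_samples = 50, 10_000

# Exponential draws are strongly right-skewed (mean 1, std 1)...
draws = rng.exponential(scale=1.0, size=(n_samples, n))
sample_means = draws.mean(axis=1)

# ...yet their means center near 1, with std near 1/sqrt(50) ~ 0.14
print(sample_means.mean(), sample_means.std())
```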
<br /><br />
We had <a href="https://github.com/COGS108/LectureMaterials">Lecture Materials</a>, <a href="https://github.com/COGS108/Workbooks">Workbooks</a>, and <a href="https://github.com/COGS108/ExtraMaterials">Extra Materials</a> to <a href="https://github.com/COGS108/ExtraMaterials/blob/master/X1-Git.ipynb">help students with things like Git</a>.
<br /><br />
Finally, to keep things interesting, I had a few guest lecturers, including UC San Diego professors <a href="http://cseweb.ucsd.edu/~kamalika/">Kamalika Chaudhuri</a> from Computer Science and <a href="http://brainome.ucsd.edu/people.html">Eran Mukamel</a> from Cognitive Science, as well as industry speakers <a href="http://ucolick.org/~cdorman/ced/Home.html">Claire Dorman</a> from Pandora Data Science and <a href="https://twitter.com/matthewmwhite">Matt White</a> from Sony Data Science.
<br /><br />
&nbsp;

<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Hilarious data science presentation by <a href="https://twitter.com/matthewmwhite?ref_src=twsrc%5Etfw">@matthewmwhite</a> for my Data Science in Practice class today. <a href="https://t.co/RP8my3jfDh">pic.twitter.com/RP8my3jfDh</a></p>&mdash; Brad Voytek (@bradleyvoytek) <a href="https://twitter.com/bradleyvoytek/status/870712655467626496?ref_src=twsrc%5Etfw">June 2, 2017</a></blockquote> <script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<br /><br />
There were some things that worked, and some things that did not. And the students were savvy and picked up on that. For example, these two anonymous comments from my student evaluations nailed it:
<blockquote>
<ul>
 	<li>It seems like this class tried to bridge the experience gap between CS and COGS majors with regards to python, and desperately failed. As a CS major, the assignments were too easy, and yet many of those with no python experience struggled immensely. I'm not sure how this should be solved; perhaps have an introductory python course as a prerequisite which may also be fulfilled by a CS class?</li>
</ul>
<div class="page" title="Page 4">
<div class="layoutArea">
<div class="column">
<ul>
 	<li>The class had a mix of Cogs, and CS students. (For the most part) Many felt it was too hard, and many felt it was too easy. It's hard to decide what direction the class should go. It was the perfect difficulty for me. It was hard enough to have the material be intellectually stimulating, and easy enough for me to not want to change my major and rethink life decisions. Good luck with improving the class. And thank you for offering this great class! It truly is what you make of it.</li>
</ul>
</div>
</div>
</div></blockquote>
The class only had one intro-level programming requirement, and it led to a nearly bimodal distribution of the number of study hours required. I heard over and over again that Computer Science students were finishing assignments in 15 minutes while Cognitive Science students were taking 5-10 hours. That's too big of a spread, so as the Data Science curriculum here matures, we'll require an introduction to Python class.
<br /><br />
That said, I'm not beating myself up too much, as the comments from the students were constructive (showing they cared!) and also very motivating for me to continue to put in the time and effort to build classes such as this one. Yes, I'm singing my own praises by posting the below comments, but they show that teaching Data Science at scale can work—and even be personal!—in a way the students really connect with:
<blockquote>
<ul>
 	<li>For first time teaching this class and there are not many Data Science in courses ANYWHERE kudos to Mr.Voytek. He maintains a relaxed demeanor even though I am sure it is very stressful. He is flexible in the course content even though we had a clear class guide at the beginning. I don't know any professors who could have handled it this well. Very approachable and has a clear passion.</li>
 	<li>He is the best professor at UCSD. Going to his class always felt like a place where an instructor is truly in love what he does. Prof Voytek made me realize one should be happy at what they are doing. Thats what matters in the end - 'having a relationship with your work'.</li>
 	<li>This course has really opened up Data Science as a possibility for my future. It's a fascinating field, and to hear Professor Voytek and the guest lecturers speak so passionately of it makes me really excited to participate.</li>
 	<li>Professor Voytek is one of a kind, very generous, kind, and respectable. I wish most professors that taught here had the same enthusiasm as Voytek, I never had a professor that is just an overall great person like Voytek. The way he runs lectures and his overall class was great, it seemed very well-structured, organized, and not too challenging but not to easy as well. If i recall, this was also his first time teaching Cogs 108. I would say that Voytek showed us a great amount of topics needed for data science. I couldn't say much more other than Professor Voytek left me with an impression that there is still hope for ME and also OTHERS to do well, you can really tell that Voytek truly cares for his students to do well. If you're reading these evaluations professor Voytek, I want to truly thank you for everything that you were able to teach in the span of 10 weeks!</li>
</ul>
</blockquote>
In the next post, I'll show off the Final Projects the students put together!
</html>]]></content><author><name></name></author><category term="Data Science" /><category term="July 2017" /><summary type="html"><![CDATA[July 14, 2017 by Bradley Voytek]]></summary></entry><entry><title type="html">Data Science at UC San Diego</title><link href="https://voyteklab.com/data%20science/data-science/" rel="alternate" type="text/html" title="Data Science at UC San Diego" /><published>2017-05-24T00:00:00+00:00</published><updated>2017-05-24T00:00:00+00:00</updated><id>https://voyteklab.com/data%20science/data-science</id><content type="html" xml:base="https://voyteklab.com/data%20science/data-science/"><![CDATA[<p><em>May 24, 2017 by Bradley Voytek</em></p>

<html>
What is Data Science?
<br /><br />
I've been somewhat obsessed with this question for years now. In this post I outline my views, as well as the semi-consensus view being adopted here at UC San Diego.
<br /><br />
As someone who's held the job title of Data Scientist, who teaches Data Science classes, and who is a Founding Faculty member of the Data Science Major and the new Data Science Institute, I take very seriously the idea that Data Science can and should be an independent, novel field of scientific inquiry. And that it will be a massively <em>important</em> one at that. My musings on this recently came to a head with the release of a <a href="/assets/pdfs/posts/UCSD-DataScience_in_the_SocialSciences2017.pdf">White Paper I helped write for the UC San Diego Division of Social Sciences</a> (detailed below).
<br /><br />
Most critically, I am <em>strongly</em> of the opinion that Data Science <em>does not just equal</em> Machine Learning. This is an opinion I expound on below, but is nicely summarized by friend Josh Wills, Data Engineer at Slack:
<br /><br />
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Rule #1 of Hiring Data Scientists: Anyone who wants to do machine learning isn&#39;t qualified to do machine learning.</p>&mdash; Josh Wills (@josh_wills) <a href="https://twitter.com/josh_wills/status/832827274437197824?ref_src=twsrc%5Etfw">February 18, 2017</a></blockquote> <script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<br />
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Rule #2 of Hiring Data Scientists: You can get a data scientist to do anything if they believe that what they are doing is machine learning.</p>&mdash; Josh Wills (@josh_wills) <a href="https://twitter.com/josh_wills/status/832835672104804354?ref_src=twsrc%5Etfw">February 18, 2017</a></blockquote> <script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<br />
There's a confluence of a <em>lot</em> of Data Science things happening at UCSD right now that make this document timely:
<ul>
 	<li>UC San Diego <a href="http://www.sandiegouniontribune.com/news/science/sd-me-ucsd-fundraising-20170320-story.html">recently received $75M</a> from early Facebook employee and UCSD alumnus Taner Halicioglu to start a new Data Science Institute here.</li>
 	<li>Fall 2017 will be the first year that <a href="http://www.ucsd.edu/catalog/curric/DS.html">the new Data Science major</a> will be offered here at UCSD—a joint major between my home department of Cognitive Science and the departments of Computer Science and Math.</li>
 	<li>On June 8 I'm hosting a two hour fireside chat with <a href="https://en.wikipedia.org/wiki/DJ_Patil">DJ Patil</a>, former Chief Data Officer of the United States under Obama, former LinkedIn Data Scientist, co-coiner of the phrase "Data Scientist", and UCSD alumnus.</li>
</ul>
When I first arrived at UCSD in 2014, there was not a lot happening here in Data Science. So I decided to teach a class, (COGS 9) Introduction to Data Science, with about 24 students. It was the first-ever class I taught as a professor, and the students of that class went on to found the <a href="https://www.facebook.com/groups/653974278012361/">UCSD Data Science Student Society (DS3)</a>.
<br /><br />
I've since taught that class three more times, and most recently it had 280 students. This quarter I'm teaching an entirely new upper-division version of that class, (COGS 108) Data Science in Practice, to about 420 students. (The syllabus for Introduction to Data Science is <a href="/assets/pdfs/posts/COGS108_syllabus.pdf">here</a>; the one for Data Science in Practice is <a href="/assets/pdfs/posts/COGS108_syllabus.pdf">here</a>, along with the public podcasts <a href="https://podcast.ucsd.edu/podcasts/default.aspx?PodcastId=4122">here</a> and some <a href="https://github.com/COGS108/">tutorials on GitHub</a>.)
<br /><br />
The demand for the theory and skills of Data Science is skyrocketing, which I joke about in my lectures:
<br /><br />
<img class="aligncenter size-large wp-image-474" src="/assets/images/posts/UCSD_DataScience_Explosion-1024x479.jpg" alt="UCSD DataScience Explosion" width="1024" height="479" />
<br /><br />
But joking aside, there are <a href="https://blogs.wsj.com/cio/2015/12/31/current-data-scientist-craze-cant-last-and-thats-a-good-thing/">serious concerns that Data Science is a fad</a>. This is Very Bad News when one is trying to establish a new major, Institute, and potentially even field!
<br /><br />
Personally I'm of the opinion that Data Science is simply plummeting toward the <a href="https://en.wikipedia.org/wiki/Hype_cycle">hype cycle</a> Trough of Disillusionment:
<br /><br />
<img class="aligncenter size-full wp-image-475" src="/assets/images/posts/Gartner_Hype_Cycle.jpg" alt="Gartner Hype Cycle" width="750" height="499" />
<br /><br />
I consider this to be a <em>good thing</em>, as it's a time for moving past the hype and doing the boring academic work of carving a niche and laying the foundations for a new field. (See also, <a href="http://courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf"><em>50 years of Data Science</em></a> by David Donoho.)
<br /><br />
When I first began conceiving of Data Science as an independent field of scientific inquiry, I saw that New York University’s Initiative in Data Science website had a “What is Data Science?” page, <a href="http://datascience.nyu.edu/what-is-data-science/">which stated</a>:
<blockquote>There is much debate among scholars and practitioners about what data science is, and what it isn’t. Does it deal only with big data? What constitutes big data? Is data science really that new? How is it different from statistics and analytics?</blockquote>
<br />
I believe that this ambiguity about the scope, aims, and distinctive characteristics of a Data Science discipline has (correctly!) given rise to skepticism.
<br /><br />
But honestly, this skepticism strikes me as reminiscent of criticisms of Computer Science as a distinct discipline back in the 1950s, with <a href="https://en.wikipedia.org/wiki/Computer_science">some decrying it</a> as “impossible that computers themselves could actually be a scientific field of study”.
<br /><br />
I believe that the skepticism of Data Science is similarly misplaced, albeit understandable given the lack of clarity as to what Data Science is, what it can be, and what scientific and social problems are unique to the modern proliferation of massive amounts of contextual and personal data.
<br /><br />
To address the general lack of clarity, we wrote a <a href="/assets/pdfs/posts/UCSD-DataScience_in_the_SocialSciences2017.pdf">White Paper</a> laying out what we believe are the core establishing questions that define Data Science, and lay the groundwork for what it can and should be. I recommend you read the whole thing (I'm proud of it!) but will highlight pieces here.
<br /><br />
<strong>Foundational Questions in Data Science</strong>
<ul>
 	<li><em>Why are some problems more amenable to purely data-driven approaches using generic learning algorithms than to domain-specific structure, or vice versa?</em></li>
 	<li><em>What factors determine data quality with respect to a particular question?</em></li>
 	<li><em>Can we ascertain </em>a priori<em> whether a particular question can be answered via a particular data source?</em></li>
 	<li><em>Can we estimate the data requirements for adequate algorithmic performance in a given domain?</em></li>
 	<li><em>When can training on synthetic data substitute for training on real-world data?</em></li>
 	<li><em>How do we combine unstructured, data-driven machine learning algorithms with human domain expert knowledge?</em></li>
 	<li><em>More generally, how can we design systems that integrate human intelligence with algorithmic data-science predictions, leveraging humans’ rich understanding of the world to improve predictions, and to help human decision-makers with algorithmic forecasts?</em></li>
</ul>
Those first questions regarding how much data and of what quality are needed to address which kinds of problems strike me as very similar to <a href="https://en.wikipedia.org/wiki/Big_O_notation">Big O</a> algorithm analysis in computer science, with a lot of interesting problems embedded in it.
<br /><br />
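That question of how much data a given problem requires can even be probed empirically with a learning curve: train on increasingly large subsets and track held-out performance, watching where it saturates. A minimal sketch in plain Python (the two-Gaussian toy data, the midpoint-threshold "classifier", and all names here are hypothetical, purely for illustration):
<br /><br />

```python
import random
import statistics

def make_data(n, rng):
    """Toy 1-D data: class 0 drawn from N(0, 1), class 1 from N(2, 1)."""
    xs, ys = [], []
    for i in range(n):
        label = i % 2          # alternate labels so both classes appear
        xs.append(rng.gauss(2.0 * label, 1.0))
        ys.append(label)
    return xs, ys

def fit_threshold(xs, ys):
    """'Learn' a decision threshold: the midpoint of the two class means."""
    m0 = statistics.mean(x for x, y in zip(xs, ys) if y == 0)
    m1 = statistics.mean(x for x, y in zip(xs, ys) if y == 1)
    return (m0 + m1) / 2.0

def accuracy(threshold, xs, ys):
    """Fraction of held-out points on the correct side of the threshold."""
    hits = sum(int(x > threshold) == y for x, y in zip(xs, ys))
    return hits / len(xs)

rng = random.Random(0)
test_x, test_y = make_data(5000, rng)  # fixed held-out test set

# Learning curve: held-out accuracy as a function of training-set size.
curve = {n: accuracy(fit_threshold(*make_data(n, rng)), test_x, test_y)
         for n in (4, 40, 400, 4000)}
```

Even in this toy setting the white paper's question shows up concretely: at what training size does performance stop improving, and how does that saturation point depend on the structure of the problem itself?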
For a somewhat trite example, why does Google Translate work <em>as well as it does</em> without any hard-coded information about grammar, semantics, linguistics, etc.? (See <a href="http://ieeexplore.ieee.org/document/4804817/"><em>The Unreasonable Effectiveness of Data</em></a> by Halevy, Norvig, and Pereira).
<br /><br />
There is a <em>lot</em> of room here for uniquely Data Scientific questions that do not neatly fit into Computer Science, Statistics, or the Social Sciences.
<br /><br />
Regarding the latter, there are also many Data Scientific questions highly relevant to the Social Sciences.
<br /><br />
<strong>Social Sciences and Data Science</strong>
<ul>
 	<li><em>How does society balance the demand for data and individuals’ rights to privacy?</em></li>
 	<li><em>Is it possible to develop provably anonymous data-gathering strategies?</em></li>
 	<li><em>How can data-intensive organizations avoid perpetuating and reinforcing the inequalities inherent in data as algorithms and automation gain prominence in ever-more important aspects of modern life</em> (e.g., <a href="http://www.bbc.com/news/technology-21322183">1</a>, <a href="https://www.washingtonpost.com/news/the-intersect/wp/2015/07/06/googles-algorithm-shows-prestigious-job-ads-to-men-but-not-to-women-heres-why-that-should-worry-you/?utm_term=.effe796aaf75">2</a>, <a href="https://www.bloomberg.com/features/2016-richard-berk-future-crime/">3</a>)?</li>
 	<li><em>How do we identify and anticipate challenges that some applications of data science might pose to the functioning of our democratic political, economic, and cultural institutions?</em></li>
 	<li><em>How do we develop forms of data-literacy that foster collaboration across disciplines and generate awareness of the social, political and historical embeddedness of data and data infrastructures?</em></li>
 	<li><em>As an educational institution, how do we promote data-literacy at all levels, from undergraduate education, broadly conceived, to even the data collection activities of the university itself?</em></li>
</ul>
<strong>Data Science at UC San Diego</strong>
<br /><br />
In our White Paper we outline four paths within the new Data Science Major and Institute:
<blockquote><strong>Data Engineering:</strong> The development of data architectures, algorithms, systems, etc. for capturing, storing and processing an exponentially increasing torrent of data. In industry, holders of jobs with this title are often tasked with establishing the data infrastructure necessary to make analytics and machine learning feasible.</blockquote>
<blockquote><strong>Machine Learning and its Foundations:</strong> The mathematical and algorithmic tools for learning from data and their theoretical foundations, including questions such as which problems are more or less amenable to general purpose data-driven approaches, what are the sample, memory, time complexities of specific problems, inter alia.</blockquote>
<blockquote><strong>Social Science Oriented Analytics:</strong> Although some AI/ML applications can be entirely machine-centered (a control system for a quadrocopter), most of Data Science aims to generate usable insights for humans about human behavior. Here the focus is on understanding how data science tools can be adapted to answer social science questions, how social science processes generate data and what this means for their analysis, how the results of large-scale machine learning can be made useful/understandable to people, how to seamlessly integrate human intelligence with machine learning in expert systems and for crowd-sourcing applications, and how expert knowledge interacts with machine learning approaches.</blockquote>
<blockquote><strong>Data and Society:</strong> Beyond being a tool, Data Science itself should be an object of social science investigation. There is a need to bring cutting-edge research and applications in data science in conversation with democratic legal frameworks as well as forms of social analysis that examine how values get designed into technical systems. How can we develop provably anonymous data gathering and reporting strategies, balance the need for privacy with the demand for data, incorporate fairness and accountability into algorithms, and counter statistical/algorithmic discrimination to prevent data-driven approaches from perpetuating and reinforcing inequities? More broadly, we need to understand the social ethics required for a society where data science applications are pervasive.</blockquote>
Finally, we close with what we believe are critical paths to success:
<br /><br />
<strong>UC San Diego Division of Social Sciences Recommendations</strong>
<br /><br />
<em><strong>Advancing Data Science as a distinct field of scientific inquiry and engineering, unifying faculty across many departments and divisions.</strong></em>
<br /><br />
<strong>Improving data science education at the undergraduate, graduate, and faculty levels.</strong>
<ul>
 	<li>Internships and portfolio building projects during education, including capstone projects. In particular, extramural projects wherein students work with outside companies, agencies, or labs.</li>
 	<li>An interdisciplinary graduate program in Computational Social Science, at both the master’s and doctoral levels.</li>
 	<li>Additionally, UC San Diego Data Science should place a special emphasis on Data Ethics.</li>
</ul>
<strong>Pursuing civic and social good through outreach and community involvement.</strong>
<ul>
 	<li>A civic good oriented summer program similar to the <a href="https://dssg.uchicago.edu/">University of Chicago’s Data Science for Social Good Summer Fellowship program</a>.</li>
 	<li>University and community partnerships creating easy-to-use software tools for search, visualization, and analysis of big data for the lay members of the community.</li>
</ul>
This is an incredibly ambitious, but sincerely exciting effort. UC San Diego is poised to become a true founding leader in Data Science, <a href="http://www.cogsci.ucsd.edu/about-us/ucsd-cog-sci/">just as it was with Cognitive Science in the 1980s</a>.
<br /><br />
But this is not a lone effort, and I would very much love to hear feedback from academics, professionals, and anyone else on the future of Data Science both here at UCSD and more broadly.
</html>]]></content><author><name></name></author><category term="Data Science" /><category term="July 2017" /><summary type="html"><![CDATA[May 24, 2017 by Bradley Voytek]]></summary></entry></feed>