GitHub for Biology: Build or Buy

When I describe Synapse, the software platform we are building at Sage Bionetworks, I use the analogy “It’s like GitHub for biologists” or “GitHub for data science” a lot.  This seems to be useful, at least for the subset of people that I talk to that know what GitHub is.  For the rest of you, in a nutshell GitHub is a set of online tools to make it super easy to share code and manage software development projects.  It’s evolved into an active community where developers can quickly launch projects, contribute to or reuse existing projects, or recruit new developers to their projects.  It’s got a particularly strong following in the open source community, but is also used by a number of major corporations to run at least some of their projects.

Now, the idea that basic science would benefit by importing some of the culture and tools of open source software is certainly not an original thought on my part. This particular meme has been floating around the net for some time; for a more extended discussion, see for example Marcio von Muhlen’s post “We need a GitHub of Science”.

I’m not actually going to ask the question “Do we need something like GitHub for biologists?” in this post.  I’m going to assume the answer is yes, and move on to “If we need a GitHub for biologists, why not just adapt the current GitHub so that it is geared for biological data as well as code?”  I think this is a more interesting question because it is often the case that adapting an existing technology to new uses wins out over building something new. In fact, Wired recently published a story about GitHub which mentioned experiments with other forms of text documents on the site: books, legal contracts, even one person who uploaded his genotype information to spur research (which another user promptly forked and issued a pull request on).  Beyond code sharing, GitHub also offers project management tools like wikis and issue trackers.  These are pretty generic tools, and in fact we’ve seen similar tools migrate from the software team to the research teams at Sage (we’re actually using a GitHub competitor, Atlassian’s Jira suite, internally).

So, why not just teach scientists to use GitHub, instead of building a new, dedicated system?  I came up with at least three reasons:

1. Git’s hyperdistributed peer-to-peer data sharing model is good for code, but bad for big scientific data.  This is because Git works by placing a complete version of the entire code repository, including all versioning and branching, onto the laptop of every developer who clones (copies) the code base for their own personal use, and then allows them to pull (merge in) other people’s development as it proceeds.  This has turned out to be a very powerful way to develop because it gives enormous flexibility to developers to experiment, and then select the experiments that work.  This works because code is small and usually evolves through small and mostly orthogonal diffs (changes) in text files that are efficient to merge.

In contrast to code developers, biologists, including those here at Sage Bionetworks, are dealing with fairly large data sets.  Our mission at Sage Bionetworks is to facilitate integration across many clinical studies, each of which may contain 10-100 GB of data including full genotype, copy number, and expression profiles.  The rapidly increasing availability of sequencing technologies like RNAseq, full exome, and even full genome promises to increase the rate of data generation faster than improvements in storage and, especially, network bandwidth.  Furthermore, if you re-run an analysis, you’re probably going to change *all* the output results, and the analysis itself might require a warehouse full of computers to process in a reasonable time.  Giving everyone a copy of the data on their local machine just won’t work.  Giving analyst teams distributed access to shared and centralized data and compute resources is necessary, and is becoming more technically straightforward given the rise of commercial cloud computing platforms.

2. GitHub’s tools are optimized for the production of code.   In contrast, bioinformatics data analysis is not the single task of writing software.  Sure, some aspects of bioinformatics analysis require the writing of code: we do our share of this at Sage Bionetworks, and even have our own GitHub outpost where we post some newly developed algorithms and software.  But most of bioinformatics analysis is the iterative application of existing methods to data.  Analysts tune parameters, massage data into the right formats, and make adjustments until they are satisfied that their data can serve as a robust foundation for their scientific interpretation.  All the top analysts I know work at an interactive command line in a scripting language (mostly R).  They need dedicated tools that capture and share the status of this iterative workflow.

3. Bioinformatic data analysts are a distinct community, with a culture that is very different than that experienced by open source developers.   The real-time transparency of open source that developers take for granted is much less prevalent in science, where information is typically not shared until it can be published in a formal journal article.  Ultimately, I hope Synapse can enrich and support a community of scientists working together more effectively. However, to create the right incentives we’re going to have to create a bridge to the more traditional metrics used to assess scientific productivity: publications and citations.   This will require experimentation to develop the right interfaces to encourage scientists to work in a more open manner (for more detail, see my previous post).
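To put rough numbers on the first reason above, here is a back-of-envelope sketch in Python. Only the 10-100 GB per-study range comes from this post; the link speed and the number of studies are assumptions picked purely for illustration, and the function name is hypothetical.

```python
# Back-of-envelope estimate of what "everyone gets a full local copy" costs.
# Only the 10-100 GB per-study range is taken from the post; the bandwidth
# and study count below are illustrative assumptions.

def transfer_hours(size_gb: float, bandwidth_mbit_s: float) -> float:
    """Hours needed to move size_gb gigabytes over a link of bandwidth_mbit_s."""
    size_megabits = size_gb * 1000 * 8  # 1 GB = 1000 MB = 8000 megabits
    return size_megabits / bandwidth_mbit_s / 3600


ASSUMED_BANDWIDTH_MBIT_S = 100  # assumed sustained download speed
ASSUMED_STUDIES = 20            # assumed number of studies being integrated

for study_gb in (10, 100):
    per_study = transfer_hours(study_gb, ASSUMED_BANDWIDTH_MBIT_S)
    total = per_study * ASSUMED_STUDIES
    print(f"{study_gb} GB per study: {per_study:.1f} h each, "
          f"{total:.1f} h for {ASSUMED_STUDIES} studies")
```

Under these assumptions a single 100 GB study takes a little over two hours to copy, and twenty of them take close to two days per analyst, before any re-run of an analysis rewrites the outputs and the copies have to be refreshed.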

Having thought through what GitHub can offer vs. what bioinformatics scientists need, at Sage Bionetworks we’re taking the approach of “inspired by GitHub”, as opposed to “adapting GitHub”.  If this post convinced you that we made the right choice, that’s great, and we’d appreciate any support you can give us on what I think is a pretty ambitious project.  But if you still think that we can simply adapt GitHub for scientists’ use, then I’d really like to hear from you.  After all, questioning basic assumptions until you get them right is what science and engineering are all about.

About Michael Kellen

I've spent my career working at the intersection of science and technology. Currently I lead the technology team at Sage Bionetworks, but this blog contains my own thoughts.

3 Responses to GitHub for Biology: Build or Buy

  1. Marcin Cieslik says:

    I have to disagree, at least partially. (1) Sure, biological data sets are orders of magnitude larger than source trees, but it is only the unprocessed data which is large. The creative and iterative fork-pull loop could still operate at the level of the analysis code, fractional results etc. (2) The habit of performing analyses ad hoc at the interpreter does not lend itself to reproducible research. Does Synapse enable creating a code / analysis / data snapshot that can be re-executed (and understood) at future time points? (3) I am a little bit surprised how much the new startups that target scientists emphasize how they will allow them to share / collaborate / connect / socialize. This is so different from the usual “please do not show these results / data / protocols yet”. A cultural change is needed, but I would not bet my money on it, or maybe I should?

    • Marcin, appreciate the comments,

      1. Initial pre-processing steps could be as critical to an analysis’s final results as the actual predictive model building. Secondly, even if the data is not in the really big range where you need a distributed file system, some of the machine learning approaches require significant processing power. Staging the data somewhere where they can be operated on directly (e.g. S3) seems useful. I do agree that the fork-pull loop of Git is very nice. We’ve started looking at storing code in Git and possibly we can store analysis records in Git as well. Building another application that uses Git is different from reusing GitHub for a new user community.
      2. Our R package is designed to capture a complete R session so it can be later reconstituted. We are working on lightweight ways to record sessions as steps in a workflow, e.g. recording the data checked out of Synapse to start the session, and giving users simple R commands to save results and scripts back to Synapse to capture the session. This could be turned into a dependency graph of analytical steps showing the logical flow through a complex analysis. This is different from a lot of workflow tools, which assume you have a pre-existing library of steps, and where the purpose of the workflow tool is to help you wire those steps together in a complete sequence so you can run the whole thing together.
      3. Note that Sage Bionetworks is a non-profit organization. The mission is to experiment with ways to catalyze what we see as a needed cultural change. I think we’d have a hard time selling this to a VC. If you’re looking for the best ROI on your investment dollars you’d best look elsewhere!
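The session-capture idea in point 2 of the reply above can be sketched roughly as follows. This is a toy illustration in Python rather than R, and none of it is the Synapse API or the Synapse R package: the `Step` class, the `dependency_order` function, and all data names are hypothetical. It assumes only that each session records what was checked out and what was saved back.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Step:
    """One recorded session: data checked out at the start, results saved at the end."""
    name: str
    inputs: list[str] = field(default_factory=list)
    outputs: list[str] = field(default_factory=list)


def dependency_order(steps: list[Step]) -> list[str]:
    """Order step names so each step follows the steps that produced its inputs."""
    producer = {out: s.name for s in steps for out in s.outputs}
    by_name = {s.name: s for s in steps}
    ordered: list[str] = []
    visiting: set[str] = set()

    def visit(name: str) -> None:
        if name in ordered:
            return
        if name in visiting:
            raise ValueError(f"cycle through step {name!r}")
        visiting.add(name)
        for item in by_name[name].inputs:
            if item in producer:  # inputs nobody produced are treated as raw data
                visit(producer[item])
        visiting.discard(name)
        ordered.append(name)

    for s in steps:
        visit(s.name)
    return ordered


# Sessions recorded in arbitrary order; the graph is recovered from the
# checkout/save records alone, not from a predefined library of steps.
sessions = [
    Step("fit_model", inputs=["normalized_expression"], outputs=["model"]),
    Step("normalize", inputs=["raw_expression"], outputs=["normalized_expression"]),
    Step("evaluate", inputs=["model", "normalized_expression"], outputs=["report"]),
]
print(dependency_order(sessions))  # ['normalize', 'fit_model', 'evaluate']
```

The point of the sketch is that the logical flow falls out of what each session read and wrote, so analysts can keep working interactively without wiring a pipeline together in advance.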

      • Marcin Cieslik says:

        I like the idea of workflows defined as arbitrary steps that depend on data checkouts / commits! Thanks.
