Teaching basic lab skills
for research computing

Rewarding Software Sharing by Mapping Scientific Software

Sharing software you write with other scientists can magnify the impact of your research, but there can be a surprising amount of sometimes thankless extra work involved. I work with a group at Carnegie Mellon's Institute for Software Research who have been asking scientists what that extra work is, and what motivates them to do it—despite a sometimes uncertain link between that extra work and the ways many of them are evaluated in their jobs. We're looking at ways of measuring and mapping software and its impacts, in order to help scientists demonstrate the positive impact that their work on shared software has on science. We're running an experiment that you can help with.

Uploading some code you've written to a public repository like GitHub is easy, but it may invite some unexpected costs down the road [1]. If you haven't documented every last assumption you've made, every command-line option, and the rationale for every design decision, you're going to get questions about them. If the code is buggy, you'll be asked to fix it. If it doesn't perform a Frobnitz analysis, someone will ask you to add one.

You may struggle to make time for that extra work. Tenure committees and grant reviewers may care more about how many papers you've published than how many tests you've written or users you've supported, even though that support may indirectly be enabling a great deal of valuable research. So if you don't want to skimp on software support, you might need to write that maintenance explicitly into your grants or job description, or hire people to help. To justify all this, you'd have to make the case that the work is really advancing science, not just keeping an old pet project on life support.

Unfortunately, it turns out to be fairly difficult to know what scientific impact a particular piece of open source software is really making. You can collect download statistics, but it's hard to know how much usage a download represents, if any. You can try to count citations in published papers that mention your software, but this task is daunting and unreliable: publications do not always mention all the software they use. For example, I spot-checked ten papers listed by researchers as justification for supercomputer time, looking for citations or mentions of a library they all used: FFTW, a popular fast Fourier transform library. Only one mentioned the package, and only because the particulars of the algorithm were relevant to the research. FFTW seems to be a real workhorse of scientific computing across many fields, but its citation count doesn't reflect its apparently broad impact. Furthermore, James Howison at the University of Texas has found [2] that even when scientific software is mentioned in papers, "only 37% of mentions involve formal citations, and only 28% provide any version information".

The citation problem can't be completely laid at the feet of paper authors, however; practices and expectations differ by field, and it would be unwieldy to mention every library used by every program in one's workflow.

So how can we get better information about scientific impact into the hands of scientific software authors? Two possible places for improvement are how publications require, collect, and promulgate usage information, and how scientists' actual usage can be measured in the first place.

From the publications' perspective, there's a social and technical question of agreeing on appropriate citation practices for software, and of building technical means of making that metadata usable and searchable. Several groups of researchers have been pursuing this. Some software authors and repositories, such as the R packages listed on Bioconductor, quote specific citations that they ask users to include. Some journals ask that software be submitted along with papers, and archive that software. ImpactStory has plans to mine publication repositories for software references [3], and organizations like Sage Synapse, the Open Science Framework, and MyExperiment offer ways of storing a paper's software alongside its data in order to make research more replicable.
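R itself gives a glimpse of what standardized citation metadata can look like: any installed package can be queried with the built-in citation() function, which returns the citation the authors request (from a CITATION file if the package ships one, or generated from its DESCRIPTION metadata otherwise):

```r
# Print the citation that a package's authors ask you to use.
# citation() falls back to auto-generating one from the package's
# DESCRIPTION file if no explicit CITATION file is shipped.
print(citation("boot"))

# With no argument, citation() tells you how to cite R itself.
print(citation())
```

If every paper's methods section pasted in the output of calls like these, mining the literature for software usage would be far easier.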

From the perspective of the scientist actually trying to get work done, there's the problem of keeping track of all the software one is making use of, and getting that information both to the publishers of our papers and back to the authors of the software, who desperately need evidence that it's being used. That's the part of the problem our group at Carnegie Mellon and UT Austin (Jim Herbsleb, James Howison, and myself) have been working on.

We are building a tool for monitoring actual usage of scientific software "in the wild", and we're looking for volunteers willing to let us capture their usage data: an anonymous list of the packages used in each session. Specifically, we're asking users of the R language to install our scimapClient package, which collects that information and feeds it into an online repository, http://scisoft-net-map.isri.cmu.edu, where you can browse the growing data set.
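To give a feel for what an "anonymous list of packages" amounts to, here is a minimal base-R sketch of the kind of record involved: just the names of the packages attached or loaded in the current session, with nothing identifying the user. (This is an illustration only; scimapClient's actual implementation and record format may differ.)

```r
# A rough sketch of an anonymous session record: the names of the
# packages attached or loaded in this R session, and nothing else.
# (Illustrative only; scimapClient's real mechanism may differ.)
session_packages <- function() {
  info <- utils::sessionInfo()
  pkgs <- c(names(info$otherPkgs), names(info$loadedOnly))
  if (is.null(pkgs)) character(0) else sort(unique(pkgs))
}

library(tools)      # attach a package so the record is non-empty
session_packages()  # a plain character vector of package names
```

Aggregated over many sessions and many users, even records this simple are enough to show which packages are used, how often, and together with what.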

We're just beginning to collect R data now, but we've already examined analogous data from an easier-to-collect source: scientific software run on a supercomputer. Folks at the Texas Advanced Computing Center (TACC) let us analyze a year's worth of data [4], which you can explore on an alternate version of our website at http://scisoft-net-map.isri.cmu.edu:7777.

For example, in the supercomputer data, there's a dynamically loaded C++ library called Boost, which was loaded by a few different packages people used. We can see how often it was used over time, and what it was used with:

[Figure: Boost usage]

The big line chart shows usage of Boost over time; a Boost developer might use it to watch for trends, and to compare usage against complementary or competing packages.

The inset graph visualization uses node size to represent the number of jobs in which each package was used, and orange pie wedges to show the proportion of those jobs that also used Boost. In this case, three packages (desmond, mpi, and gromacs) all used Boost. Gromacs, a molecular dynamics package, has a version of Boost statically compiled in, which isn't visible in our dataset, but it optionally lets users link against this dynamic library for better performance in some scenarios. That occasional use of an external library is what we're seeing here. A gromacs maintainer might use a visualization like this inset graph to hypothesize that users were seldom taking advantage of the option, at least in the context of TACC.

We believe there's great potential for this kind of analytical toolset, akin to what supercomputer operations centers use, but applied to the broader ecosystem of open software distributed across the laptops and desktops of the world's scientists. If you're an R user who'd like to be a part of that, you can visit our website and follow the instructions to instrument your own R installation.

References

  1. Trainer, E., Chaihirunkarn, C., Kalyanasundaram, A., and Herbsleb, J.D. (2015). "From Personal Tool to Community Resource: What's the Extra Work and Who Will Do It?" In Proceedings of the 2015 ACM Conference on Computer-Supported Cooperative Work & Social Computing (CSCW 2015, Vancouver, Canada), to appear.
  2. Howison, J., and Bullard, J. (under review). "The Visibility of Software in the Scientific Literature: How Do Scientists Mention Software and How Effective Are Those Citations?" Working paper submitted to the Journal of the Association for Information Science and Technology.
  3. Priem, J., and Piwowar, H. (2013). "Toward a Comprehensive Impact Report for Every Software Project." figshare, 790651.
  4. McLay, R., and Cazes, J. (2012). "Characterizing the Workload on Production HPC Clusters." Working paper, Texas Advanced Computing Center.

Addendum: Andy Terrel pointed out on Twitter that after so many words on the topic, I didn't actually cite the software that was used to collect this data. Here are the main things we used:
