I didn't have nearly enough time to enjoy everything that was going on at PyCon 2014 last week. One event I particularly regret missing was a sprint organized in part by the folks at Scrapinghub. They got a bunch of people to write little scrapers to go through old conference websites, pull out speakers' names, and run them through a gender identification library in order to plot changes in gender balance per conference over time. The code is all available on GitHub, and Gayane Petrosyan (currently at the Hacker School in New York City) has started plotting some of the results.
It's a really cool project,
and you really ought to fork it,
add a scraper for a conference or two,
and send a pull request with your results.
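
For a sense of what such a contribution involves, here is a minimal sketch of the tallying step, assuming a scraper has already produced (year, speaker name) pairs. The lookup table, the function names `guess_gender` and `balance_by_year`, and the sample records are all invented for illustration; they are not the project's actual code or data, and a real run would call a proper gender identification library instead of a four-entry dictionary.

```python
from collections import Counter, defaultdict

# Hypothetical lookup table standing in for a real gender identification library.
NAME_TO_GENDER = {
    "alice": "female",
    "carol": "female",
    "bob": "male",
    "dave": "male",
}


def guess_gender(full_name):
    """Guess a label from the first name; anything not in the table is 'unknown'."""
    parts = full_name.split()
    if not parts:
        return "unknown"
    return NAME_TO_GENDER.get(parts[0].lower(), "unknown")


def balance_by_year(speakers):
    """Map each year to a Counter of labels, given (year, speaker_name) pairs."""
    tallies = defaultdict(Counter)
    for year, name in speakers:
        tallies[year][guess_gender(name)] += 1
    return dict(tallies)


# Invented sample records of the kind a scraper might emit; not real conference data.
scraped = [
    (2012, "Alice Example"),
    (2012, "Bob Example"),
    (2012, "Dave Example"),
    (2013, "Carol Example"),
    (2013, "Sam Example"),
]

for year, counts in sorted(balance_by_year(scraped).items()):
    print(year, dict(counts))
```

Keeping an explicit "unknown" label matters here: names the lookup cannot classify are counted separately rather than silently dropped or forced into a category.
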
If you're going to do that on a Mac,
you may run into a problem.
If you're using Continuum's Anaconda installation of Python,
`pip install scrapy` fails
with a lengthy stack trace and the cryptic message, "Reason: image not found."
We eventually figured out that you have to set a shell variable before trying to install
the `cryptography` library that Scrapy depends on.
This piece of information didn't appear anywhere on the Internet
(at least not that we could find).
The clue that set us on the right path
was a fix recommended in issue #693:

    env DYLD_LIBRARY_PATH=/usr/local/opt/openssl/lib/ \
        ARCHFLAGS="-arch x86_64 -Wno-error=unused-command-line-argument-hard-error-in-future" \
        LDFLAGS="-L/usr/local/opt/openssl/lib" \
        CFLAGS="-I/usr/local/opt/openssl/include" \
        pip install cryptography

When I started this post, I planned to enumerate all the bits of background knowledge someone would need in order to solve this problem. The list quickly became unmanageable: environment variables and dynamically-loaded libraries in general, this particular variable and library in particular, the fact that the operating system's search for the latter can be controlled by the former, the way Python libraries can depend on compiled C libraries... There's a lot going on here, and most people, even ones who've been through a bootcamp, would give up on contributing to this project before they got the required software stack to work.
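
One item on that list, the way a variable set for a single command reaches only the process it launches, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the fix itself: `run_with_env` is a hypothetical helper, and `DEMO_LIBRARY_PATH` is a made-up variable name used so the sketch behaves the same on every platform. The command from issue #693 relies on the same inheritance mechanism to pass `DYLD_LIBRARY_PATH` to `pip`.

```python
import os
import subprocess
import sys


def run_with_env(extra_env, code):
    """Run a short Python snippet in a child process with extra environment variables.

    The parent's environment is copied, not modified, which mirrors what
    `env NAME=value command` does in the shell.
    """
    env = dict(os.environ)
    env.update(extra_env)
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# The child reports what it sees for the (made-up) variable.
snippet = "import os; print(os.environ.get('DEMO_LIBRARY_PATH', '<unset>'))"

print("child sees: ", run_with_env({"DEMO_LIBRARY_PATH": "/tmp/demo/lib"}, snippet))
print("parent sees:", os.environ.get("DEMO_LIBRARY_PATH", "<unset>"))
```

The child process sees the value while the parent's own environment is untouched, which is why the setting has to be in place at the moment the installer runs.
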
This post isn't meant to be a criticism of Scrapy or the other tools involved;
they're all useful, well-engineered tools.
My goal was to explain why we can't just teach people end-user tasks like data analysis.
"Gender balance per conference over time" is a pretty innocuous question,
but even someone who understands
how to write a few lines of code to connect a scraping service
to Python's Natural Language Toolkit
would probably be defeated in this case.
Pre-configured virtual machines and all-in-one installers aren't a solution to this problem:
they only give you what their creators thought to include,
and as soon as you want to take advantage of anything else,
you're right back in the minefield.
The internet isn't a solution either:
I've been programming for over thirty years,
and I didn't know what the magic word I needed was.
What I do know is that
every attempt I've seen to sprint for the finish line
(to teach the skills people want
before teaching the basic concepts those skills depend on)
has produced more frustration than enlightenment.
In the end,
it turns out that my high school music teacher was right:
if you want to play a solo,
you have to learn your scales.
Postscript: there were several comments on this post on GitHub while it was still in draft; I've included them below, and more would be welcome.
This is a good issue. I know you were just picking this dynamic library problem as an example, but it's one that's near to my heart as somebody who helps develop cross-platform tools. Dynamic library loaders are very complicated pieces of machinery, and they present similar-yet-different interfaces on OS X, Windows, and Linux.
I do believe that we're getting a lot better at providing "batteries-included" platforms for developers to work on, without having to understand the underlying guts of their tools. It used to be that if you wanted to perform a computation involving solving a system of linear equations, you had to know how to write C or Fortran code, perhaps call the LAPACK library if you wanted it to be very fast, call the compiler, the linker, and maybe use a Makefile, all to run your code from the command line.
Nowadays, you can solve this system of equations in a single line of code in Python (using SciPy), MATLAB, or even on the web with tools like SageMathCloud or Wolfram Alpha. We're making progress in layering low-level library dependencies away from users, but we're not done yet. Tools like conda and hashdist are emerging specifically to help increase the range of experts, allowing developers to focus on building tools higher in the stack.
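
As a rough illustration of the single-line solve described above, here is a minimal sketch using `scipy.linalg.solve`. The 2x2 system is arbitrary and chosen only for illustration, and the sketch assumes NumPy and SciPy are already installed, which is exactly the step this discussion is about.

```python
import numpy as np
from scipy.linalg import solve

# Coefficients and right-hand side of an arbitrary 2x2 system (illustrative values):
#   3x + 1y = 9
#   1x + 2y = 8
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])

# The "single line": SciPy hands the work to compiled LAPACK routines.
x = solve(A, b)

# Check the answer by substituting it back into the original equations.
assert np.allclose(A @ x, b)
print(x)
```

No compiler, linker, or Makefile appears in the script, but the compiled library is still there underneath, which is why an installation problem can surface even when the user's own code is one line long.
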
Packaging, interoperability, and portability remain as challenges in scientific computing, for all languages and platforms. I think advances in tools and techniques will continue to have significant impact in improving productivity, aiding replicability and reproducibility, and reducing the need for scientists to set arcane environment variables to get their web scraper going.
I think well-configured VMs and cloud computing machines are a solution; it's super hard to write software for everyone's individual laptop, but if I tell you "Ubuntu 12.04 with these packages, and here's how to install them with apt-get, and then build our software", well, that's just going to work.
VMs can help, but sometimes they just make the problem "worse" because you will need specific versions of virtualization software, a properly configured network, ... since the VM was made for "Ubuntu X" and you are using "Ubuntu Y".
sure, but the Ubuntu Long Term Support releases should mitigate that quite a bit -- I'm now relying on 12.04 for our scientific software, and 12.04 is supported through 2017, I think.
It comes down to numbers: is the percentage of packages that JFI (Just Install) high enough that most scientists, most of the time, don't run into problems whose solution requires 10X the knowledge they have? Python's `cryptography` library isn't a small or obscure package, and Anaconda on Mac OS X isn't a rare combination; how long until events like that on the VM of your choice are so infrequent as to be a non-issue?
Docker.io might one day help on the Linux platform, but as you point out, Macs are, well, different.
Additionally, suppose you write a piece of code people find useful and then open source it. If it becomes popular, you will be dealing with a torrent of correspondence as people try to run your code on every conceivable software platform under the sun.
@gvwilson: I think we've crossed that bridge with the core SciPy Stack on the major platforms.
@ctb: The VM solution isn't a silver bullet, unfortunately. For one, VMs don't work on supercomputers. Additionally, as Greg alludes, when you combine more tools, you increase the risk that you will need a library at a version that isn't provided by the VM because the VM is too new or old, coming to the limits of "interoperability".
Despite this, I think it remains to be seen whether we see VMs emerge on major HPC platforms, or installation-isolating tools like Docker, HashDist, or Conda enter mainstream popularity.
The message I get from your post is: "This isn't as easy as we'd want it to be".
The reason I don't think VMs are the whole solution is that they don't, in their present general usage, solve the issue of me making multiple difficult-to-install tools work together.
Another solution - not necessarily the right one - is having an expert in every group/department who installs new software on a central set of managed machines for the group (aka the SysAdmin model or HPC centre model). But it can create an artificial divide between those who know what they'd like to do, and those who know what's possible on their machines.
VMs don't solve everything. But they solve an awful lot, especially at the level that Software Carpentry students and instructors work at, and I wish I understood why people ignore 'em. EOM. :)
I would've much rather seen this interesting and useful discussion as comments on the published blog post, rather than on GitHub on the draft blog post. :(
Something like this happened to my partner and me during the SciPy 2013 git/testing tutorial. We got into a situation where git was refusing to commit our changes, and none of the usual suggested solutions from the instructors worked. In the end, the instructor had to "drive" and fix the problem, but we were all in a rush. It got fixed, but we didn't learn how. So ultimately my take-home message regarding git was "here there be dragons."