The movements for open data and open access converge at their best in web-based repositories where researchers can freely upload their datasets: the datasets can be viewed, cited, and downloaded by anyone. Great!
Two major players are emerging: figshare and datahub.io (aka CKAN).
Unlike many other repositories, they are not specific to a scientific field – anyone is welcome to upload. They are international, serving researchers based in any country. And they have made a big effort on usability – for instance, it is not necessary to fill in long forms or get any form of pre-approval: you just create an account and upload your data. Again, great!
Yet both still lack some important features, which slows down their adoption. In my case, this is to the point that I chose not to use figshare, and can use datahub.io only in a limited way. Let’s summarise the situation:
Strengths of figshare:
- attributes a DOI URL to your datasets. Your datasets can be referenced with a much-respected URL, as in “http://dx.doi.org/10.6084/m9.figshare.154972”. Brilliant!
- great website: the user interface is engaging.
- strong community engagement: figshare has just announced a partnership with PLOS ONE to host all the data of PLOS ONE publications. They have also launched an “advisor scheme”, which basically recruits evangelists / community supporters / you name them to spread the word about figshare. Great stuff: it is much needed to grow a community and a knowledge base around figshare.
Weaknesses of figshare:
- individual files inside the datasets don’t get unique identifiers. Let’s say you create a dataset on figshare, and upload two files to it. These two files won’t get unique, separate URLs (you don’t get http://doi.org/file1 and http://doi.org/file2). This is a major impairment, for three reasons.
- First, it means you can’t refer to files individually, say, if you would like to cite them in publications. For instance, if your dataset is a collection of pictures, you won’t be able to refer to each of these pictures individually – they simply don’t exist as independent resources on figshare.
- Second, you can’t add separate metadata to each file individually. That’s just wrong – each file may have its own author, date of creation, etc., which need to be recorded.
- Third, in the absence of a URL for each file, manipulating these files programmatically (that is, through an API) becomes much more problematic. You have to handle each file by performing operations on the dataset, which is the only entity that has a stable reference. I don’t have a specific use case, but it does not seem like a good model (it might slow things down or make them more complex than they need to be; also, what happens if you’d like to assign a file to two different datasets?).
[Note: is this a corner case I describe here? Certainly not. Pictures in a collection are just one example. The point is that you can’t decide from the start, in the design of your application, which use cases for datasets are sensible and which are not. Scientists come with data of all stripes. The need for granularity, citability and permanence of resources should be assumed, not ignored.]
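To make the difference between the two data models concrete, here is a minimal Python sketch. Everything in it (class names, fields, identifiers) is hypothetical and invented for illustration; it is not how figshare or datahub.io are actually implemented, just a toy version of "only datasets have identifiers" versus "every file has its own identifier".

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FileRecord:
    """A file with its own metadata (hypothetical fields)."""
    name: str
    author: str = ""


@dataclass
class DatasetOnlyRepo:
    """Model A: only datasets have identifiers; files live inside them."""
    datasets: Dict[str, List[FileRecord]] = field(default_factory=dict)

    def get_file(self, dataset_id: str, name: str) -> FileRecord:
        # Every file lookup has to go through the parent dataset.
        for record in self.datasets[dataset_id]:
            if record.name == name:
                return record
        raise KeyError(name)


@dataclass
class PerFileRepo:
    """Model B: every file has its own identifier; datasets list file ids."""
    files: Dict[str, FileRecord] = field(default_factory=dict)
    datasets: Dict[str, List[str]] = field(default_factory=dict)

    def get_file(self, file_id: str) -> FileRecord:
        # A file can be fetched, cited or described on its own.
        return self.files[file_id]


# Model A: the picture is only reachable via its dataset.
repo_a = DatasetOnlyRepo(datasets={"ds-1": [FileRecord("photo1.jpg", "Alice")]})
photo_a = repo_a.get_file("ds-1", "photo1.jpg")

# Model B: the same picture belongs to two datasets without being copied.
repo_b = PerFileRepo(
    files={"file-1": FileRecord("photo1.jpg", "Alice")},
    datasets={"ds-1": ["file-1"], "ds-2": ["file-1"]},
)
photo_b = repo_b.get_file("file-1")
```

In the second model, a file can sit in two datasets at once and carry its own author, which is exactly what the first model makes awkward.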
Strengths of datahub.io:
- great website: I personally don’t like it as much as figshare, but by a tiny margin. Very intuitive to use (this is not a trivial question – this is where everything starts).
- attributes unique URLs to each of your files! Yeah! (see the weaknesses of figshare above to see why this is crucial)
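Here is a rough idea of what per-file addressing looks like in practice, assuming the site exposes the standard CKAN action API (version 3), where `package_show` looks up a dataset and `resource_show` looks up a single file. The host name and the two identifiers below are placeholders I made up; the helper function is mine, not part of CKAN.

```python
from urllib.parse import urlencode


def ckan_action_url(base_url: str, action: str, **params: str) -> str:
    """Build the URL of a CKAN action API call (hypothetical helper)."""
    url = f"{base_url.rstrip('/')}/api/3/action/{action}"
    return f"{url}?{urlencode(params)}" if params else url


# Placeholder host and identifiers, for illustration only.
BASE = "https://ckan.example.org"

# A dataset ("package" in CKAN terms) has its own address...
dataset_url = ckan_action_url(BASE, "package_show", id="my-dataset")

# ...and so does each file ("resource") inside it.
resource_url = ckan_action_url(BASE, "resource_show", id="my-resource-id")

print(dataset_url)
print(resource_url)
```

The sketch only builds the URLs and does not call any server; fetching them would return JSON describing the dataset or the file.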
Weaknesses of datahub.io:
- datahub.io is actually just the web interface to a core platform called CKAN. This platform is complex and its target users are people with programming skills. It can be downloaded, installed and run on private or institutional servers to service a repository. Great! Except that, in my honest opinion, having these two services and two audiences (the website datahub.io for users, and the CKAN platform for admins) creates confusion in key places. First, the online documentation provided is mainly for programmers using CKAN, and it is hard to find a specific doc that addresses only the needs of the users of datahub.io.
- Second, and here again these are just personal impressions, I got the feeling that users of the website version of CKAN were not that strong a community, compared to the institutional users of the CKAN platform. This impression comes from having posted a relatively simple question on the CKAN mailing list, followed by the same question on Stack Overflow, which has so far received no answer. To be clear, people at CKAN were immensely helpful and did try to help; it is just that the knowledge base seems stronger around CKAN than around datahub.io.
Which one did I choose?
Because I think that you should start with the right data model, since an incorrect one is the hardest thing to fix down the road, I have chosen datahub.io to host my datasets. I just hope that they will grow a stronger community around it (maybe by differentiating it better from CKAN, which is a different product after all). Or that figshare will reconsider its data model?
Want to help?
So, as I said, I have this unanswered, super easy question about using the API for datahub.io. Earn 100 reputation points easily on Stack Overflow (and be assured of my eternal gratitude): go answer it!
I am Clement Levallois, a social scientist and data visualization specialist currently based at Erasmus University Rotterdam, The Netherlands.
Did you like this post? Visit the website of the consultancy connected to this blog (www.exploreyourdata.com), and share it on Twitter!









