Saturday, April 26, 2014

Licence choices for static and dynamic open data


Uploading dynamic datasets

My recent reading on the prospects for open data within archaeology in general has highlighted the distinction between open data that is static (such as a .pdf file that includes the finalised version of a report) and open data that is dynamic (data that can be incorporated into other datasets, analysed, added to and updated).

As I mentioned in my last post, I am beginning to experiment with uploading dynamic data. As my data is from archaeobotanical analysis, the datasets take the form of .csv files that list the identifications of different seed types found in archaeological deposits, and their quantities. These datasets, uploaded to both my Zenodo account and my Figshare account, have the potential to become dynamic open data, because they are stored in a format that can be reused. This is why they have been stored as .csv files: csv (comma separated values) is a simple text-based format that can be read by many different applications. The format is used to exchange data between applications that do not otherwise "talk" to each other.
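
To illustrate why a text-based format lends itself to re-use, here is a minimal sketch in Python of how someone else might read such a file and total the seed counts per taxon. The column names (sample, taxon, count) and the rows themselves are invented for illustration; they are not taken from my uploaded datasets, which may be laid out differently.

```python
import csv
import io
from collections import Counter

# Hypothetical example rows: the column names, sample labels and counts
# are invented for illustration and do not come from the real datasets.
EXAMPLE_CSV = """sample,taxon,count
X1,Triticum dicoccum,12
X1,Hordeum vulgare,3
X2,Triticum dicoccum,5
"""


def total_counts_by_taxon(csv_text):
    """Sum the seed counts for each taxon across all samples."""
    totals = Counter()
    # DictReader uses the first line as the header, so each row can be
    # addressed by column name rather than by position.
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        totals[row["taxon"]] += int(row["count"])
    return dict(totals)


if __name__ == "__main__":
    print(total_counts_by_taxon(EXAMPLE_CSV))
```

Only the standard library is needed here, which is rather the point: no special software is required to get at the numbers, and the same file could just as easily be opened in a spreadsheet or a statistics package.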

The .csv format has great benefits for re-use over the .pdf file: with a .pdf it is difficult to re-use the data without re-typing the information that it contains. (However, the .pdf does have other advantages, particularly that it can be used for version control, and therefore for referencing. This is one of the most obvious advantages of the static format.)

What licence to choose?

While I have been converting Excel and Open Office files into .csv format, I have been thinking about which licences to choose. When I uploaded reports, I licensed them under a Creative Commons Attribution licence, CC-BY. But these were static .pdf files with their version control. Datasets are different. In Figshare, the CC-0 licence is the default. This means that the data can be re-used by anyone without attribution (Figshare do give reasons for this). There is therefore no legal obligation for the person using the data to cite it, although, as Figshare point out, the moral obligation remains. The conventions of academic citation mean that it is unlikely that someone using the data for genuine purposes would not bother to cite it. Or, if they did fail to cite it, it is unlikely that they would retain their credibility (assuming they were caught).

I struggled with this at first. Mostly, I hope, out of an apprehension that the practice of removing the legal obligations could begin to erode the moral and academic necessity of citation (rather than out of a vain wish to be cited as much as possible). I initially began uploading datasets to Zenodo, because they provide a range of licence choices, for datasets in particular, while Figshare is more limited. 

And then I went to a talk by Puneet Kishor from Creative Commons, one of the people who worked on the creation of the CC-0 licence. He pointed out that the Creative Commons licences do not stop people with no conscience from using data and text without attribution. Instead, the CC-0 licence is a way to help people with good intentions, those who wish to work within the law, to re-use datasets from other researchers. And my ideas about using the CC-0 licence have changed.

Subsequent reading has made me question whether I have the right to license the data anyway; a fact is not copyrightable... and it is a fact that a grain of emmer wheat was found in Sample X from Site Y. (Although this could be questionable when the layers of interpretation that went into the retrieval and identification of that emmer wheat grain are taken into account... where to sample, how much to take, how to process, the identification decisions made based on the morphology of the grain, and so on.)

Nevertheless, I am about to start using the open licence more.