I really like the moments of reflection that usually come at the end of a big project. What to do next? What to learn? Should you continue in the same direction, change it slightly, or change it dramatically? If you are a scientific researcher enjoying such a moment right now and looking for interesting research opportunities, you should not overlook content mining (or, if you prefer, text and data mining, TDM).
TDM is a research technique based on extracting data from previously published content (for example, peer-reviewed papers). The point is that some scientific problems can be addressed without generating new empirical data, by drawing instead on the huge amount of data that is already available. For example, DNA sequences or chemical formulas presented in scientific papers might be aggregated and analyzed by mining software to produce new hypotheses. Charts, graphs, and even concluding sentences might also be analyzed, and may lead to new conclusions in every field of academic activity. Sometimes the same data can be used to solve different problems, and sometimes even an old problem looks different when it is juxtaposed with data generated by hundreds of researchers from various fields.
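To make the idea of extracting and aggregating data from papers more concrete, here is a minimal sketch in Python. It is only an illustration under simplifying assumptions: the two "papers" are invented sample strings, the function names are hypothetical, and a DNA sequence is assumed to be any run of ten or more A, C, G, or T letters. Real mining software would have to deal with PDFs, line breaks inside sequences, and much messier text.

```python
import re
from collections import Counter

# Assumption: a "DNA sequence" is any standalone run of 10+ A/C/G/T letters.
DNA_PATTERN = re.compile(r"\b[ACGT]{10,}\b")


def extract_sequences(text):
    """Return every DNA-like string found in one paper's text."""
    return DNA_PATTERN.findall(text)


def aggregate(papers):
    """Count in how many papers each sequence appears.

    `papers` maps a (hypothetical) paper identifier to its plain text.
    Each sequence is counted at most once per paper.
    """
    counts = Counter()
    for text in papers.values():
        counts.update(set(extract_sequences(text)))
    return counts


# Invented sample data, for illustration only.
papers = {
    "paper_a": "The primer ATGCGTACGTTAGC was used to amplify the region.",
    "paper_b": "We observed ATGCGTACGTTAGC and GGCATTACGGATCC in all samples.",
}

for sequence, n_papers in aggregate(papers).most_common():
    print(sequence, n_papers)
```

A sequence that turns up in many unrelated papers is the kind of pattern that a researcher reading only one field would probably never notice.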
There are 50 million published scholarly articles. Plenty of new problems emerge in every discipline every year, and no doubt some previously published data might be useful in solving them. What is more, no one can search all these articles manually. Following every publication in a single narrow field is usually enough work for one scientist, and some of the data that might be useful to him or her could have been published by researchers from other fields. Some fields are so crowded that data mining could be useful simply to summarize the most common conclusions (by so-called “vote counting”) and to highlight the differences between published articles. You can also perform a meta-analysis to test theories against huge amounts of empirical data.
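Vote counting, in its simplest form, just tallies how many studies report each kind of result. The sketch below assumes that the conclusion of every paper has already been reduced to one label; the labels, the sample data, and the function name are all invented for illustration. Note that plain vote counting ignores sample sizes and effect sizes, which is why a proper meta-analysis is usually preferred when the underlying numbers are available.

```python
from collections import Counter


def vote_count(findings):
    """Tally the reported conclusions and return (tally, most common label).

    `findings` is a list with one label per study, for example
    "positive", "negative", or "no effect". If two labels are tied,
    the one that appears first in the list wins.
    """
    tally = Counter(findings)
    winner, _ = tally.most_common(1)[0]
    return tally, winner


# Invented sample data: one label per mined article.
findings = ["positive", "positive", "negative", "no effect", "positive"]

tally, winner = vote_count(findings)
print(dict(tally))
print("Most common conclusion:", winner)
```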
Content mining offers the possibility of a new revolution. The report from the Expert Group of the European Commission claims that:
There is growing recognition that we are at the threshold of the mass automation of service industries (automation of thinking) comparable with the robotic automation of manufacturing production lines (automation of muscle) in an earlier era. TDM will be widely used to provide insights in the re-design of this digital services economy.
There is, however, one problem with TDM. It is seen as the creation of a derivative work, so it can in fact be treated as a violation of copyright. At the moment, scientific TDM is allowed in the United States for any piece of content, regardless of licensing, because it is regarded as fair use. Similar regulations exist in a few other countries. Since June 1, 2014, it has been legal in the United Kingdom to mine any piece of content without additional permission, but only for non-commercial purposes. In a large number of other countries (including the majority of EU member states) there are no specific regulations on content mining, and it is treated as the creation of a derivative work. To mine works that are licensed under a Creative Commons No Derivatives license, or that are under regular copyright, you need permission from the copyright holder (an author or a publisher, depending on the contract between them). If you would like to mine a big database containing the works of thousands of authors and several publishers, obtaining all these permissions could be very complicated.
The good news is that the European Commission is under growing pressure to solve this problem. The report that I mentioned above was published this spring, and it recommends creating regulations that facilitate data mining in order to preserve the competitive position of European science. Legal barriers to content mining will probably soon be abolished in major parts of the world. I recommend that you prepare for this moment and start learning what to mine and how to mine it.