Matthew Davis is a Postdoctoral Research Fellow at The University of Texas at Austin. His research interests include proteomics, genomics, and bioinformatics. He is also the inventor of PubChase.
WK: What is PubChase?
Matthew Davis: PubChase is a web and mobile application designed to save scientific researchers time. Researchers want to know what they need to read in order to stay up to date with the literature, and PubChase uses statistical modeling and data mining to generate article recommendations customized for each user. I have talked with a lot of people over the last few years about how much time they spend searching for literature, and estimates vary from half an hour to three hours a week. Some people check their RSS feed three times a day, for example in the elevator or on the way to a meeting. The point is that people spend a lot of time searching through newly published literature to find what they should be reading, and PubChase saves time by performing the search for them.
How does PubChase work?
Each user has a personal library on PubChase, and the articles in this library inform us about the user’s research interests. When new articles are published, PubChase can then determine whether each one is likely to be of interest to a user based on the other articles in their library. This is pretty much how Amazon works when it gives you recommendations based on books you have already read. The tricky thing about the scientific literature is that we are looking at articles that have just been published. So we cannot base the recommendations on which users have already read the articles, because new articles have no readership at all. We do consider social clusters and what users similar to you have already read, but the most important thing for PubChase is the article metadata that predicts the likelihood of the user wanting to read the article. And the most predictive metadata are usually the obvious ones, such as who the author of a paper is and where the article was published.
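As a rough illustration of the metadata-based idea described above, here is a minimal Python sketch. It is not PubChase's code, which has not been published. The article dictionaries, the function names, and the scoring rule (the share of library articles with a matching author plus the share from the same journal) are all assumptions made for the example.

```python
from collections import Counter


def build_profile(library):
    """Count how often each author and each journal appears in a user's library."""
    authors = Counter(a for article in library for a in article["authors"])
    journals = Counter(article["journal"] for article in library)
    return authors, journals


def score_article(article, profile, library_size):
    """Score a new article by metadata overlap with the library (0.0 means no overlap).

    Hypothetical rule: frequency of the best-matching author plus frequency
    of the journal, each expressed as a fraction of the library size.
    """
    authors, journals = profile
    if library_size == 0:
        return 0.0
    author_score = max((authors[a] for a in article["authors"]), default=0) / library_size
    journal_score = journals[article["journal"]] / library_size
    return author_score + journal_score


def recommend(new_articles, library, top_n=5):
    """Return the top_n new articles, highest metadata score first."""
    profile = build_profile(library)
    ranked = sorted(
        new_articles,
        key=lambda art: score_article(art, profile, len(library)),
        reverse=True,
    )
    return ranked[:top_n]
```

Because the score uses only author and journal fields, it can be computed for an article on the day it is published, before anyone has read it.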
How many articles should be added to a library to achieve accurate recommendations?
This is a question that I am asked frequently. The answer is that it depends entirely on who you are and what your interests are. The algorithm looks at the articles you have read in the past and tries to predict what will be of interest to you, but it also looks at people similar to you and what they have read. If you have very strange interests and there is no one like you, the algorithm will be learning only from your library. In that case, you need to add more articles to the library, maybe 100 or 200, to get excellent recommendations. But if there are a lot of researchers like you, you can get good recommendations with only 20 or 30 articles. For most people it is about 50 papers.
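The role of "people similar to you" can be sketched in the same hedged way. The example below assumes libraries are lists of article IDs and uses Jaccard overlap as the similarity measure. Both choices are illustrative assumptions, not a description of what PubChase actually computes.

```python
def jaccard(a, b):
    """Similarity of two libraries: shared articles divided by all articles."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def neighbor_scores(user_library, other_libraries):
    """Score articles the user lacks by summing the similarity of users who hold them."""
    mine = set(user_library)
    scores = {}
    for other in other_libraries:
        sim = jaccard(mine, other)
        if sim == 0.0:
            continue  # a user with no overlap contributes no signal
        for article_id in set(other) - mine:
            scores[article_id] = scores.get(article_id, 0.0) + sim
    return scores
```

A user whose library overlaps with nobody else's gets an empty result from `neighbor_scores`, which matches the point made above: with no similar users, only the user's own library can inform the recommendations.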
Are predictive algorithms used by PubChase open or closed source, and why?
The source has not been published. I am always happy to talk about source details, and the code should not be hard to recreate for anyone who wants to do something similar. We are a small company, and there are not many people to work on the code to test it and improve it. Thus the code works, but it is sloppy and I would be a bit embarrassed to release the source code. I should probably take time to clean it up and publish it.
What is the story of the creation of PubChase?
The origins of PubChase are a few Python scripts that I wrote several years ago to search through the Entrez PubMed API. I decided to use some statistics on the search results to give them a rank, so that when I got 500 results, the best results for me were placed at the top of the list. In the early summer of 2012, I was in Cambridge, Massachusetts, and I met a friend from my PhD studies, Lenny Teytelman. Lenny told me he had just started a software company named ZappyLab, to make mobile software for scientists. I said something that people in California usually laugh at: “I have a great idea for an app!” My idea was something similar to Amazon recommendations. It was about informing people which papers they need to read every week.
Our conversation quickly turned to the many reasons why it should be made available to other people, not just myself. Both Lenny and I did our doctoral work in biology, so we understood the problems of scientists. Before my Ph.D., I was working at IBM, cooperating with the Linux community, so the idea of openness and access to information was important to me. We went on to discuss what else PubChase could do once it had enough users: ideas similar to what Altmetrics is trying to do now, providing a new kind of insight for evaluating articles by showing who read each particular article and how he or she reacted to it. We hoped it might be something that evaluates the quality of a work and provides some additional insight on what is valuable to the people reading the papers. That was a big deal for Lenny, so we started work and created a mobile app and website.
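For readers curious what a script of that kind might look like, here is a minimal sketch of the search step against the public NCBI E-utilities ESearch endpoint. It is an illustration only, not the original scripts. The function names are hypothetical, and the ranking statistics are omitted because the interview does not specify them.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, retmax=500):
    """Build an ESearch URL that asks PubMed for up to retmax matching IDs as JSON."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH_URL + "?" + urlencode(params)


def parse_id_list(response_text):
    """Extract the list of PubMed IDs from an ESearch JSON response."""
    return json.loads(response_text)["esearchresult"]["idlist"]


def search_pubmed(term, retmax=500):
    """Run the search over the network and return the matching PubMed IDs."""
    with urlopen(build_esearch_url(term, retmax)) as response:
        return parse_id_list(response.read().decode("utf-8"))
```

Calling `search_pubmed("proteomics")` would return up to 500 PubMed IDs as strings. A ranking step would then reorder them before they are shown to the user.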
You are a postdoctoral researcher, and you probably have enough things to do. Why are you spending your time working on PubChase?
I have made a little money on contract with ZappyLab, but it is nothing close to the market rate. The truth is that PubChase was my idea and I want to see it implemented and growing. I also think that once you make something and invite people to use it, you are also obligated to continue the service. Now we have several thousand users and I feel obligated to these people. I really do believe that this is important for the long-term solutions to problems in scientific publishing. A lot of people write on blogs and in journals about problems with access to information and the evaluation of scientific work. I think that tools like PubChase may play an important role in fixing these problems, by showing more data about articles and about the people who read them.
Do you think that PubChase may help to fix problems with peer-review?
We have recently integrated with PubPeer so that you can see when there is a PubPeer discussion about an article in your library. And we had already integrated with Retraction Watch, so that you get an alert when an article from your library is retracted. These things are connected to the review process. But I can also imagine that data from PubChase might be used in the near future to evaluate articles. And we also invite people to publish their backstories and discuss published articles on PubChase.
How is PubChase funded?
Privately, through ZappyLab. We would rather have users than money, so we have made PubChase and all of our mobile apps totally free. The people who need it the most are graduate students, because they benefit the most from being introduced to the literature. These are people who have small budgets and are not likely to pay. PubChase does not have to make money. It is the first major product from ZappyLab, but not the biggest nor the last. I think that it is important to allow people to trust that we make good software, and we hope PubChase is building that trust.
PubChase is crawling the PubMed Central databases. Do you use, or plan to use, different sources of articles?
I chose PubMed because I am a biologist, but also because of their excellent open access policy. You can get the entire National Library of Medicine metadata, download it, and integrate it into your database, as well as get live updates through their API, so they made it really easy to use. We would like to integrate other sources, for example arXiv, which I think will be easy to do. It is just a matter of time and money.
I have read a post on the PubChase blog claiming that you are not allowed to crawl part of the PubMed Central database due to copyright issues. Did you take any action to solve this problem?
The problem was not with PubMed. We could not reproduce part of the articles using the Lens format, and changing the format of the file is reproduction. The blog post was about the problem that open access is sometimes not so open, because of the different licenses used by publishers and their conditions. Articles are free to read but not completely free.
What should an author do to make his or her article discoverable on PubChase?
That is a funny question. When I first started talking to friends and coworkers about PubChase, there was a group of them that were very skeptical. They thought that people would find a way to game the system. But I think that the algorithm is quite hard to influence. However, if you want to make your articles more discoverable on PubChase, you should find the most popular authors and collaborate with them. Then your name will appear next to theirs and you will get a higher rank. So if you want to be more discoverable, go do a postdoc at Harvard and publish a paper with a famous Nobel Laureate. When you finish it and start working in your own lab, PubChase will put your articles at the top of the list.
Thank you for the interview!