This document was written in March 2007, and last modified on May 11, 2008.

 

Some of my thoughts on science communication, knowledge processing, and knowledge management.  

The problem today

Nowadays huge amounts of scientific data and information are generated by millions of scientists, from hundreds of different scientific domains. It has become impossible for a single person to acquire all the knowledge of a traditional discipline such as physics or mathematics. Scientific knowledge is not only vast, it is also very deep. Moreover, within the last two centuries we have seen the creation of a multitude of scientific disciplines, and we are now witnessing the segmentation of these domains into sub-domains, and the creation of interdisciplinary fields. All this poses huge problems for the communication, sharing, and processing of information. In medicine, for instance, the problem is very acute, for doctors must be able to blend information from chemistry, biology, and sometimes physics, and to integrate it with knowledge from their own practice. Computers are now used to make this information crisis bearable. These machines store, process, and retrieve information, and with the advent of the Internet, it is shared across the planet. However, computer technology promises even more, provided that we accept some changes in our ways of doing things.

Some of the causes of this scientific information crisis are: people use different languages; the information is formatted in different ways (tables and graphics are not readable by all automatic agents); the information is fragmented into millions of scientific papers; a great number of scientific papers have been shown to be inaccurate (although this may not be obvious to someone not specialized in the particular field); the information is not available to the general population for free, etc. I believe that these causes can be eliminated. Before proposing a solution, I will define a concept that I refer to as the scientific oeuvre.

The scientific oeuvre represents the lifework of a scientist, or his/her entire contribution to our scientific knowledge. The entire oeuvre is presented in a structured way, much like a book is written.  Moreover, the oeuvre is a dynamic entity, in the sense that the author can, at any time, manage it (improve, enrich, augment it). It is also a historical entity, in the sense that all past versions of it are stored, and can be retrieved. The author is the only person who can change the slightest thing in his/her oeuvre. More than one individual can be considered as an author, and a single individual can coauthor more than one scientific oeuvre. The scientific oeuvre “lives” on the Internet, and is accessible by everyone, anytime. All scientific oeuvres are built on a unique platform, in order to facilitate search, and automatic information processes. 
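To make the definition more concrete, here is a minimal sketch of how an oeuvre could be modeled as a data structure. It is only an illustration under my own assumptions, not a specification of the platform: the names Oeuvre, Version, update, current, and at_version, as well as the author identifiers, are hypothetical. The sketch captures three of the properties above: only the authors can change the oeuvre, several individuals can be authors, and every past version remains retrievable.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Version:
    """One immutable snapshot of the oeuvre's content."""
    content: str
    editor: str
    saved_at: datetime


@dataclass
class Oeuvre:
    """Hypothetical model of a scientific oeuvre with full version history."""
    authors: frozenset
    history: list = field(default_factory=list)

    def update(self, editor: str, content: str) -> Version:
        # Only a listed author may change the oeuvre.
        if editor not in self.authors:
            raise PermissionError(f"{editor} is not an author of this oeuvre")
        version = Version(content, editor, datetime.now(timezone.utc))
        # Past versions are never overwritten; new ones are appended.
        self.history.append(version)
        return version

    def current(self) -> str:
        # The latest version is what readers see by default.
        if not self.history:
            raise LookupError("oeuvre has no content yet")
        return self.history[-1].content

    def at_version(self, index: int) -> str:
        # Retrieve any past version by its position in the history.
        return self.history[index].content


# Illustrative usage with made-up author identifiers.
oeuvre = Oeuvre(authors=frozenset({"a.author", "b.coauthor"}))
oeuvre.update("a.author", "Chapter 1: first draft")
oeuvre.update("b.coauthor", "Chapter 1: revised draft")
```

Keeping the history append-only is one simple way to make the oeuvre both dynamic and historical at the same time; a real platform would presumably also need authentication and persistent storage, which are left out here.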

Before the 20th century, most scientific and philosophical publications took the form of a book. Authors gathered a large amount of information and presented it in a coherent and structured way. Science was not a popular activity, and certainly not as dynamic as it is today. During the industrial revolution, science and technology became the very source of economic growth. The need to communicate scientific knowledge effectively and in real time became crucial. Printing technology was already there to respond to this growing need, and the era of the scientific paper in science communication began. Its role was to respond to fast-paced scientific development, to the immense quantity of scientific data generated, and to the significant segmentation of science into a multitude of domains and sub-domains. There was no time to wait for Dr. Einstein to publish his book on E=mc². A scientific journal containing short, easily written papers, produced at low cost, could distribute important but small discoveries to the entire scientific community in a relatively short time. Specialized scientific publications rapidly appeared for all existing research areas. Massive catalogs were also created to assist the search and retrieval of papers from the vast scientific literature that was generated.

Authorship recognition is maintained by a system of references. This system also serves to complete the information in a paper in a very compact way, by referring to work covered in another paper rather than rewriting everything at length. Quality assurance systems were also implemented to weed out the "bad science". Very large publishing organizations were quickly created, which covered wide scientific domains and produced a large number of different scientific journals. Some of these organizations have also specialized in other types of media; for example, they organize live events or publish proceedings. Others specialize in science popularization, targeting individuals outside of the scientific community.

As technology evolved, the presentation (layout) improved, and publication costs dropped. Publishing organizations began to offer a better product to a larger population. However, the greatest change was introduced by the advent of the Internet, following the development of computer and telecommunication technologies. In my opinion, this will ultimately put an end to the era of the scientific paper, sooner rather than later.

We can look at the Internet as a source of information: people store information on different machines, and the Internet makes it possible to share or distribute it among a huge number of individuals across the planet. But in reality the Internet is more than just that. Using powerful search engines we can search, find, sort, and structure information no matter how varied it is, or how scattered across the network. Specific scientific information is retrieved within seconds, without the need to know where it is actually located. This last observation is at the origin of the expression “I found it on the Internet”, which implicitly conveys the idea of non-locality. Other important improvements brought by Internet technology are e-mail-based scientific newsletters, scientific forums, and tele- or video-conferences. As I mentioned earlier, technology has also greatly contributed to the presentation of scientific concepts and results, making possible the inclusion of complex graphical presentations, and of digital image, video, and sound files. All these possibilities introduced by computer and Internet technologies have already been implemented in the field of science communication, but we are far from harvesting the full potential offered by this marvellous technology.

Recently we have witnessed the emergence and the explosion in popularity of wiki and social networking technologies. This phenomenon has taken the name of Web 2.0. The fundamental idea is that the Internet enables the user (author) not only to create and share content, but also to modify or manage it at any time. Moreover, a group of people can coordinate their efforts to produce content, which in some cases represents highly complex and complete scientific knowledge. Better communication capabilities and unlimited access to scientific information can lead to better critique, and to real-time correction of errors. Powerful search engines combined with clustering technologies enable one to generate domains of scientific knowledge from a gigantic and very diverse database. All this together leads to something bigger than just scientific communication: we are talking here about scientific knowledge management and processing, as well as about the liberalization of scientific knowledge.

Web 2.0 has not fully penetrated the scientific community yet. Its full implementation means, in my opinion, the extinction of the scientific paper, and its replacement with the scientific oeuvre, or something equivalent to it. But this doesn’t come easily, for several reasons. First, the members of the scientific community are slow to change, because their status relies on the current system: the reputation of a scientist is formally calculated according to a system of points, which is based on the number of papers published, on the rank of the journals where the papers are published, and on the number of times the author is cited in other papers. Second, people incorrectly believe that the scientific paper ensures a higher quality of scientific information, and see the Internet as a haystack, a place where good and bad information is thrown in together and becomes impossible to tell apart. And third, I believe change has been hindered by the publishing organizations themselves, for economic reasons. The problem with the scientific publishers is that they don’t produce the scientific knowledge, they merely distribute it. We have to understand that the very existence of these organizations rests on the distribution of scientific knowledge, and the transformation proposed here induces a serious change in the way scientific knowledge is distributed. It is only normal to expect some resistance on their part when something is forcing them to review their business model, when something is threatening their foundations.

Beyond Web 2.0

The Internet as we know it is a human creation, and we must take into account its future development, and the directions it might take. 

Talking about  web-action, and "living" content. This idea will be developed soon.
 

 

The scientific paper system vs the new system

 

Accessibility

Currently, most scientific knowledge is protected and is not free of charge. Academic institutions spend millions of dollars to maintain access to the most important databases. However, an independent person with a limited budget is unable to access and process reasonable amounts of scientific information. The scientific knowledge produced by an individual is his/her intellectual property. Moreover, most of this knowledge relies on public infrastructure supported with public funds. Any individual should have free and unlimited access to this information. No one can have a monopoly on the scientific knowledge produced, with public resources, by individuals who want to share their work. Technology makes the management and publication of scientific information very cost-effective. Wikipedia is free; why can’t scientific knowledge be the same?

 

Fragmentation

In the current system, the work of a scientist is fragmented into hundreds of publications. Furthermore, information on one specific subject is also highly fragmented. We all recognize the value of well-done reviews or books, where authors synthesize massive amounts of information scattered around the scientific literature. In the new system, a scientific oeuvre represents a coherent presentation of all the work of one individual. This format makes it suitable for automatic information processing.

 

Accuracy

A large number of publications have been proven to be inaccurate. It is usually possible to find in the literature other publications that set the record straight, and that may or may not refer to the original wrong paper. When we search for scientific information, we always run the risk of finding that wrong paper. If the person involved is knowledgeable in that particular domain, alarm bells should ring when this particular paper is first read. But scientific knowledge is not there only for the specialist. People with different backgrounds should also have easy access to accurate information. In the current system, the author cannot manage his/her own papers in real time. Errors accumulate over time, and are buried in a very complex web of cross-references. In the new system, the author can react to criticism, and update his/her oeuvre. Only after the author becomes inactive does the oeuvre become static.

 

Processing the scientific knowledge

If the scientific oeuvres are built on a universal platform using a unifying ontology, search engines, combined with clustering technologies, can be used to generate scientific knowledge domains. A domain can be defined as narrow, or as broad, as you want it to be. In more concrete terms, automatic agents can be designed to continuously search, store (the address and other locating parameters), and structure information related to a very specific scientific topic. With minimal input from a human operator to refine the results, this would enable the creation and maintenance of very complete review works. In the current state this cannot be achieved, because the information is stored in such a way that it cannot be easily interpreted and retrieved by an automatic agent. This is due to different factors: 

  • Different standards are used, and some information is lost in the formatting process.
  • Fragmentation: the information is fragmented into a large number of small publications.
  • Accuracy: a lot of publications contain scientific information that has been proven invalid, and automatic agents can’t "know" that.
  • Ontology: across domains, automatic interpretation is impossible because of incompatible ontologies underlying the technical languages.
  • Access: most scientific publications are protected, and not free of charge.

In the case of the new system these problems disappear:

  • Access: all the scientific information is available to everyone, free of charge.
  • Ontology: all oeuvres are based on a unique underlying structure and ontology, which makes it easier for an automatic agent to recognize, interpret, and retrieve information.
  • Fragmentation: the scientific knowledge produced by one author is presented within a unified structure (it is not fragmented).
  • Accuracy: the oeuvre is up to date, as the author has the possibility to manage it in real time. 
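
As a rough illustration of how an automatic agent could generate a knowledge domain once every oeuvre shares one ontology, consider the sketch below. Everything in it is an assumption made for the example: the Section record, the build_domain function, the "oeuvre://" addresses, and the ontology terms are all hypothetical, and plain term overlap stands in for the more powerful search and clustering technologies mentioned above. The agent stores only the address of each matching section, and the domain becomes narrower or broader depending on the topic terms and the overlap threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Section:
    """A section of an oeuvre, tagged with terms from the shared ontology."""
    address: str      # where the section lives (hypothetical address scheme)
    author: str
    terms: frozenset  # ontology terms describing the section


def build_domain(sections, topic_terms, min_overlap=1):
    """Collect addresses of sections sharing enough terms with the topic.

    More topic terms or a smaller min_overlap gives a broader domain.
    """
    topic = frozenset(topic_terms)
    matches = []
    for section in sections:
        overlap = len(section.terms & topic)
        if overlap >= min_overlap:
            matches.append((overlap, section.address))
    # Most relevant sections first; ties are ordered by address for stability.
    matches.sort(key=lambda m: (-m[0], m[1]))
    return [address for _, address in matches]


# Made-up sections from three made-up oeuvres.
sections = [
    Section("oeuvre://smith/ch2", "smith", frozenset({"enzyme", "kinetics"})),
    Section("oeuvre://lee/ch5", "lee",
            frozenset({"kinetics", "catalysis", "enzyme"})),
    Section("oeuvre://kim/ch1", "kim", frozenset({"galaxy", "redshift"})),
]

# A narrow domain: sections must match both topic terms.
narrow = build_domain(sections, {"enzyme", "kinetics"}, min_overlap=2)
# A broader domain: any single matching term is enough.
broad = build_domain(sections, {"enzyme", "kinetics", "redshift"})
```

The point of the sketch is that this kind of matching only works because all sections are described with the same vocabulary; with incompatible ontologies, the term sets could not be compared at all.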

 

It is clear that the scientific paper has become outdated. It is clear that computers and the Internet will play a bigger role in the creation, processing, management, and distribution of scientific knowledge. This is how I view the future. What do you think?

 

Other people express similar ideas

 

Peter Murray-Rust

Peter Murray-Rust is Reader in Molecular Informatics at the University of Cambridge and Senior Research Fellow of Churchill College.

See his Google presentation HERE, and the abstract of this presentation below.

ABSTRACT: The millions of scientific papers published each year are an amazing source for scientific discovery but in most of them the experimental data is destroyed by the publication process. Publishers insist on converting semantic data into PDF which effectively destroys everything. We have been developing social and technical strategies to preserve and liberate this data and where this has happened have been able to create completely new mashups and other semantic resources.
Chemistry is the most tractable discipline for the semantic web - most chemistry can be turned into XML with little semantic loss, using Chemical Markup Language and complementary MLs such as XHTML, MathML and SVG.
We have to mobilise a bottom-up revolution through modern Internet ideas - blogs, communal source development, interoperability. We have done this in chemistry through the Blue Obelisk movement - an informal but coherent group of young-at-heart hackers. We are adopting lightweight web technologies ("REST", etc.) to chemistry - an example will be CMLRSS which we run in a Bioclipse environment.

 

Andrew Walkingshaw

Researcher at the Unilever Centre for Molecular Informatics. Watch his seminar "Web 2.0 for Scientists - an introduction" online, on his blog or on YouTube.