Data as an inexhaustible resource, the expert’s raw material
Following Part 2: The end of averaging
3) Data: an inexhaustible resource – the expert’s raw material
Whether massive, small or intermediate in volume, data are always generated with a precise objective (they are then characterised as primary data).
They can then be used for other purposes (secondary data).
Secondary data remain rich in information, provided that they are approached with specific questions or objectives and with a minimum of information on the context in which they were collected and on the objectives of their collection.
First of all, it is important to distinguish between unstructured and structured data:
1. Unstructured data are typically texts, images, sounds or videos, the volume of which has grown exponentially in recent years with the development of information systems and, more recently, of social networks and mobile devices. These unstructured data can be the subject of specific analyses in order to extract meaning from them or simply to transform them into structured data. Dictated, transcribed or handwritten medical reports are a good example of unstructured data.
2. Structured data correspond to quantities or modalities whose nature is known. Whether numerical, discrete or continuous, textual or categorical, structured data describe specific situations or phenomena.
Structured data are mostly collected by experts of a discipline in order to verify a hypothesis, or to follow a process. To this end, they document a set of observations in a variety of contexts at a given moment or over a given period of time. These observations are described by a set of variables, more or less numerous, reflecting a reality which is often complex and impossible to describe in all its dimensions.
Incidentally, it is notable that the expert is able to perceive a certain number of phenomena that are almost never recorded in databases. For example, the structured data collected in clinical studies contain demographic (sex, age), clinical (weight, height, diagnosis, treatment, dose, duration), biological and biomedical data. However, they almost never contain data on the psychological or emotional profile of patients, or on the quality of their relationship with their doctor. Yet these variables have a significant influence on the effectiveness of treatments.
The possibility of pooling and analyzing a very large amount of data from multiple experts – and therefore of drawing on a much wider range of experience than that of a single expert – more than compensates for the relative difficulty or impossibility of collecting certain variables. Indeed, although technology does not yet make it possible to measure or qualify certain phenomena that are naturally perceived by the expert, the capacity of computers to centralize, store, manipulate and process multidimensional data opens new horizons to experts of all disciplines.
If you observe the occurrence or intensity of a phenomenon while simultaneously monitoring a given parameter, you will soon detect by yourself – and without a computer – the presence and nature of a possible correlation between the parameter and the phenomenon.
This will also be possible – although at the cost of a slightly greater effort – with 2 parameters.
On the other hand, it will quickly become very difficult with 3 or more parameters, and impossible with more than 7 or 8 parameters.
Let’s take a look at the way the expert acquires his knowledge through practice. It is by accumulating experience through observations over a sufficiently long period of time that the expert is able to establish links between his actions, the contexts in which he acts, and the consequences of his actions.
Once again, these links are not necessarily conscious, and can be intuitive.
Is the expert’s mental process global and homogeneous by nature? Does he try to summarize all the particular cases he has been confronted with in a single large multivariate function that would allow him to predict the outcome of this or that action, and to make his choices according to this prediction?
Is the expert’s mental process instead specific and heterogeneous in nature? Does he seek to detect positive or negative biases in his experience, in relation to what he is seeking to achieve? By bias, we mean specific situations in which the phenomenon to be reproduced (or avoided) occurs with an abnormally high (or abnormally low) frequency compared to the average.
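The notion of bias defined above can be made concrete with a small sketch. It assumes hypothetical records that each carry a context label and a binary outcome; the field names `context` and `outcome`, the function name and the data are invented for illustration and do not come from any real study. For each context, it compares the frequency of the phenomenon with the overall average.

```python
from collections import defaultdict

def subgroup_bias(records, key="context", outcome="outcome"):
    """Return {subgroup: (frequency, ratio)}.

    frequency = share of records in the subgroup where the outcome is 1
    ratio     = frequency / overall frequency (1.0 means "no bias")
    """
    overall = sum(r[outcome] for r in records) / len(records)
    if overall == 0:
        raise ValueError("the phenomenon never occurs; the ratio is undefined")

    counts = defaultdict(lambda: [0, 0])  # subgroup -> [occurrences, size]
    for r in records:
        counts[r[key]][0] += r[outcome]
        counts[r[key]][1] += 1

    return {g: (hits / size, (hits / size) / overall)
            for g, (hits, size) in counts.items()}

# Hypothetical observations: 8 cases, overall frequency 4/8 = 0.5.
records = (
    [{"context": "A", "outcome": 1}] * 3 + [{"context": "A", "outcome": 0}] * 1 +
    [{"context": "B", "outcome": 1}] * 1 + [{"context": "B", "outcome": 0}] * 3
)

bias = subgroup_bias(records)
for group, (freq, ratio) in sorted(bias.items()):
    print(f"context {group}: frequency {freq:.2f}, ratio to average {ratio:.2f}")
```

In this toy data set, context A is a positive bias (the phenomenon occurs 1.5 times as often as on average) and context B a negative one (half as often). With so few cases, such ratios would of course not be statistically meaningful; the sketch only illustrates the definition.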
One thing is certain: the expert begins to learn from his experience with his first successes, just as he stops learning from it once he no longer experiences failures. Learning feeds on contrast, when it emerges from noise and randomness.
This is also true for machine learning, which consists in exploiting available data based on past observations, often with the aim of predicting a phenomenon from variables selected for their influence on it.
These machine learning techniques are now widely used in many sectors, and in particular in the medical sector, especially in the field of diagnosis and biomarkers. However, their success is mixed. Indeed, most of the biomarkers or bio-signatures predictive of a pathology or of a response to a treatment, developed from the data of one set of patients, fail in the validation phase. In other words, they lose their predictive performance when tested on data from other sets of patients that are totally distinct from those on which they were developed. Most publications on new markers or new signatures only propose so-called “cross-validations”, which consist in dividing the training data into X subsets and learning X times in a row, each time leaving one subset out and using it for validation.
Although these validations, which could be described as “inbred”, make it possible to partially evaluate the quality of the model and to limit overfitting, they do not show how the model may behave with another data set. An article published in PLoS Computational Biology in 2011 demonstrated this very well (David Venet, Jacques E. Dumont, Vincent Detours, “Most Random Gene Expression Signatures Are Significantly Associated with Breast Cancer Outcome”, October 20, 2011, DOI: 10.1371/journal.pcbi.1002240).
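For readers who want to see the mechanics, here is a minimal sketch of such a cross-validation in plain Python. Everything in it is an assumption made for illustration: the function names, the contiguous (unshuffled) folds, the toy labels, and the deliberately naive "majority class" model that stands in for a real signature.

```python
from collections import Counter

def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    if not 1 < k <= n_samples:
        raise ValueError("k must satisfy 1 < k <= n_samples")
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # first `extra` folds get one more
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(xs, ys, k, fit, predict):
    """Learn k times, each time holding one fold out; return accuracy per fold."""
    scores = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        hits = sum(predict(model, xs[i]) == ys[i] for i in held_out)
        scores.append(hits / len(held_out))
    return scores

# Hypothetical stand-in model: always predict the most frequent training label.
def fit_majority(xs, ys):
    return Counter(ys).most_common(1)[0][0]

def predict_majority(model, x):
    return model

xs = list(range(9))                 # placeholder features, ignored by the model
ys = [1, 1, 1, 1, 1, 1, 1, 0, 0]    # toy labels
scores = cross_validate(xs, ys, 3, fit_majority, predict_majority)
print(scores)
```

Note that every score here is computed on patients drawn from the same data set as the training folds. An external validation would instead apply the fitted model to a completely distinct cohort, which is exactly the step this procedure leaves out.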
These approaches, which mainly aim to provide a global prediction, necessarily make the implicit assumption that the training data are representative of the source population. This implies that the biases present in the training data are identical to the biases present in the source population, as well as to those present in any other sample of observations drawn from it. However, the smaller the sample is compared to the source population, the less this assumption holds.
Can a model built on a few hundred or even a few thousand patients be generalized to tens or hundreds of millions of patients? Maybe, with a lot of luck. Most of the time, no.
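The effect of sample size alone can be illustrated with a rough simulation. It rests on strong assumptions that should be kept in mind: a hypothetical phenomenon occurring in 30% of the source population, and samples drawn purely at random, which is precisely what real cohorts rarely are. The sketch therefore shows only the random part of the mismatch between a sample and its population, not selection effects. All names and numbers are invented for illustration.

```python
import math
import random
import statistics

def standard_error(p, n):
    """Theoretical spread of an observed frequency for a random sample of size n."""
    return math.sqrt(p * (1 - p) / n)

def simulated_spread(p, n, repeats, rng):
    """Draw `repeats` random samples of size n and return the standard
    deviation of the frequencies observed across those samples."""
    freqs = []
    for _ in range(repeats):
        hits = sum(rng.random() < p for _ in range(n))
        freqs.append(hits / n)
    return statistics.pstdev(freqs)

TRUE_FREQUENCY = 0.3      # hypothetical frequency in the source population
rng = random.Random(0)    # fixed seed so the run is reproducible

spread = {n: simulated_spread(TRUE_FREQUENCY, n, 200, rng) for n in (20, 2000)}
for n, s in spread.items():
    print(f"n={n}: simulated spread {s:.3f}, "
          f"theoretical {standard_error(TRUE_FREQUENCY, n):.3f}")
```

Under these assumptions, the frequency observed in a sample of 20 patients wanders much further from the true value than the one observed in a sample of 2,000, so a bias detected in a small sample may simply not exist in the population.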
The number of patients included in clinical trials systematically remains exceedingly low compared to the source population, and hence very far from the world of “Big Data”.
Incidentally, it is striking to note that the number of targeted therapies remains relatively low, and that personalized medicine largely remains a concept despite the explosion of so-called “biomic” data, which provide more and more information on the biology of patients. Accumulating these data on increasingly large series of patients is probably one way to reduce the failure rate of these models, by working with samples that are increasingly representative of the source population. This is what Craig Venter intends to do with his new company, Human Longevity Inc., which plans to sequence 40,000 patients per year, in order to quickly reach a total of 100,000.
However, is this the only way? And will it be enough?
Find out more in the next episode: Big Data: a new therapeutic option? Part 4: Predict or cure?