The world can work better

Theory versus practice: multiple regression instead of expensive experiments

When we attempt to decipher the causal structure of the reality that surrounds us, the first and probably the biggest barrier is the number of variables. When discussing the construction of econometric models, Greenspan rightly pointed out how laborious it is to select the variables and assumptions of the equations before a result reproduces historical data well enough for the identified correlations between the observed variables to serve as a basis for future forecasts. To meet the demand for analyzing connections and correlations between the quantities that describe reality, statistics offers analysts the tool of multiple regression. It is a correlational technique that simultaneously calculates the relationship between many independent (explanatory) variables and a given dependent (explained) variable; in other words, it attempts to answer the question: what is the impact of the examined variable on the dependent variable, once the influence of all the other observed variables is taken into account? Leaving statistical jargon aside: what is the cause and what is the effect? If an observed variable is the cause of a given phenomenon, to what extent? Or are the variables interdependent, with the reason for their variability lying deeper?
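To make the idea concrete, here is a minimal sketch, in Python with only numpy, of what a multiple regression actually computes; the data and variable names are invented for illustration, not taken from any real study. Each fitted coefficient estimates the effect of one explanatory variable with the influence of the others taken into account.

    # Minimal multiple-regression sketch on synthetic (hypothetical) data.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 500
    x1 = rng.normal(size=n)                 # first explanatory variable
    x2 = rng.normal(size=n)                 # second explanatory variable
    noise = rng.normal(scale=0.5, size=n)
    y = 2.0 * x1 - 1.0 * x2 + noise         # the "true" structure we pretend not to know

    X = np.column_stack([np.ones(n), x1, x2])       # intercept + predictors
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)    # ordinary least squares
    print(coef)  # roughly [0, 2, -1]: each coefficient is one variable's effect
                 # with the other variables "held constant"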

Taking a moment away from the economy: the effects of amateurs seeking cause-and-effect relationships by means of multiple regression surround us as constant information noise in the mass media. Every now and then we learn that scientists of one provenance or another have proved that coffee is bad for the heart, or rather the opposite; that even the smallest dose of alcohol carries some risk, or is good for you. We are bombarded with assurances that some diet, dietary supplement, shark liver extract or evening primrose oil has been examined and that the study showed this or that, and so on. Some time later we may come across information, printed in a smaller font and not on the front page, that things may not be exactly as initially trumpeted.

The sheer amount of information generated by research drawing on multiple regression analyses may seem bothersome, but hardly dangerous. Beware. Studies that establish cause-and-effect relationships on the basis of multiple regression carry (as Richard E. Nisbett writes in his book Mindware) the danger of a basic cognitive fallacy: self-selection(46). With a large number of variables and an unwitting researcher, the selection of the sample of analyzed cases may not be random. Despite efforts to preserve the principle of random selection, the sample is infected with deformations of non-representativeness: two, three or many common group features which, instead of being unimportant stowaways, are the very causes we are looking for but whose presence in the supposedly representative sample we do not suspect. Statistical analysis then shows the correlation of the variables, their coexistence, but it also evokes the illusion of causality.
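A toy simulation, again in Python with numpy and again with hypothetical variables, shows how such a stowaway works: a hidden group feature drives both who ends up in the "exposed" part of the sample and the outcome, so the naive regression reports an effect that is not there.

    # Self-selection illustrated: z is the unobserved common cause ("stowaway").
    import numpy as np

    rng = np.random.default_rng(1)
    n = 2000
    z = rng.normal(size=n)                                   # hidden group feature
    exposure = (z + rng.normal(size=n) > 0).astype(float)    # self-selected exposure
    outcome = 1.5 * z + rng.normal(size=n)                   # depends only on z, not on exposure

    # Naive regression: outcome ~ exposure, confounder ignored
    X_naive = np.column_stack([np.ones(n), exposure])
    b_naive, *_ = np.linalg.lstsq(X_naive, outcome, rcond=None)

    # Regression that also observes z
    X_full = np.column_stack([np.ones(n), exposure, z])
    b_full, *_ = np.linalg.lstsq(X_full, outcome, rcond=None)

    print("apparent effect of exposure:", round(b_naive[1], 2))   # clearly non-zero
    print("effect once z is observed:  ", round(b_full[1], 2))    # close to zero

The correlation between exposure and outcome is real; the causality it suggests is an illusion created by the shared hidden feature.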

False dependencies, Taleb writes, are the quickest to show up(47). When a researcher has a certain picture, some intuition or concept he wants to validate through observation and the available research statistics, it is extremely easy to be deceived by one's own behavioral lenses; hundreds of such cases are described by Kahneman in Thinking, Fast and Slow and by Richard E. Nisbett in Mindware, which discusses in detail the various levels of error generated by multiple regression analysis. A randomized experiment with a double-blind test is the tool that protects us against the cognitive errors our minds inevitably generate. Randomization is the random division of the studied objects into comparable groups. A double-blind test means that neither the participants of the study nor those who conduct it have access to the key information about the study being conducted. And the experiment itself means high costs, which grow quickly with the size of the sample.
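A sketch of the randomization step alone, continuing the hypothetical setup from the previous example, shows why it helps: assignment drawn at random balances the hidden feature across the groups, which is precisely what a self-selected sample fails to do.

    # Randomization sketch: assignment is drawn at random, not self-selected.
    import numpy as np

    rng = np.random.default_rng(2)
    n = 2000
    z = rng.normal(size=n)                                    # the same hidden feature as before
    assignment = rng.permutation(np.r_[np.ones(n // 2), np.zeros(n // 2)])

    print("mean z, treatment group:", round(z[assignment == 1].mean(), 3))
    print("mean z, control group:  ", round(z[assignment == 0].mean(), 3))
    # Both means are close to zero: randomization removes the stowaway difference,
    # so any remaining difference in outcomes can be read as a causal effect.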

Understanding the costs of research on the one hand and the no doubt positive desire to do good on the other, it is easy to tilt the balance in favor of acquiring new medicaments, financial instruments, new government expenditures or legal regulations. If we do not recognize the significant difference between the (neo)classical approximation, built on the simple pursuit of profit by an individual occasionally suffering a spasm of "animal spirits", and a complex, emergent, relational, adaptive system with internal memory, in which the introduced changes may be subject to scaling effects, we risk what Taleb calls fragility. In the pursuit of benefits we do not see the dangers arising from the fact that the world is full of mutual, strongly non-linear dependencies beyond our cognition. And the idea, to paraphrase a well-known saying, flies out of our mouth as a sparrow and returns as an ox.