Epigenetics regulate our genes: but how do they change as we grow up?

Rosa Mulder1,2, Esther Walton3,4 & Charlotte Cecil1,5,6

Follow Esther and Charlotte on Twitter.

Epigenetics can help explain how our genes and environment interact to shape our development. Interest in epigenetics has grown steadily within the research community, but until now little was known about how epigenetics changes over time. We therefore studied changes in our epigenome from birth to late adolescence and created an interactive website inviting other researchers to explore our findings.

What is epigenetics?

The term ‘epigenetics’ refers to the molecular structures around the DNA in our cells, that affect if, when, and how our genes work. Even though nearly every cell in our body contains the exact same copy of DNA, cells can look and function entirely differently. Epigenetics can explain this. For example, every cell in our body has the potential to store fat, but in adipose tissues the cells’ epigenetic structures cause the cells to actually store fat.

Before birth, epigenetics plays a role in the specialization of cells from conception onwards by turning genes ‘on’ and ‘off’. After birth, epigenetics helps our body develop even further and maintains the specialization of our cells. However, the way epigenetics influences how our cells function is not only programmed by our genes, but may also be affected by the environment. Hence, our development and health are shaped by both our genes and our environment. Researchers are therefore trying to measure epigenetic processes to understand the role that epigenetics plays in this process of ‘nurture affecting nature’.

Both nurture and nature influence our health; understanding epigenetics helps us to find out how they might interact.

How can we measure epigenetics?

One of the types of molecular structures that can affect gene functioning is ‘DNA methylation’. Here, a small molecule (a methyl group of one carbon atom bonded to three hydrogen atoms; Figure 1) is attached to the DNA sequence. DNA methylation affects the three-dimensional structure of the DNA and can thereby turn genes ‘on’ or ‘off’. DNA methylation can now easily be measured in the lab with the help of micro-chips: very small chips that can detect hundreds of thousands of methylation sites in the genome at a time, from just a small droplet of blood. Such chips are now used in large epidemiological cohorts such as ALSPAC to measure the level of DNA methylation for each of these sites. In epigenome-wide association studies (EWASs), researchers study the associations between each of these methylation sites and a trait, such as prenatal smoking, BMI, or stress.

Figure 1: DNA sequence with DNA methylation

How does DNA methylation change throughout development?

Until recently, EWASs have mainly been cross-sectional, studying DNA methylation only at one time-point. So, even though research indicates that epigenetics is important in postnatal development, we do not know how true this is for DNA methylation sites measured with these epigenome-wide arrays. Studying a mechanism that supposedly changes over time without knowing how it changes can be problematic: say that we find an association between smoking during pregnancy and DNA methylation at birth, can we still expect this association to be there at a later age? To fully interpret EWAS findings, and to compare research findings between different studies, we need a full understanding of how DNA methylation changes throughout development.

We therefore set out to study DNA methylation from birth to late adolescence, using DNA methylation measured in blood from the participants of ALSPAC in the UK, as well as from participants from another large cohort, the Generation R Study in the Netherlands.

We studied the change in levels of DNA methylation over time as well as variation in this change between individuals. If DNA methylation is indeed mainly linked to the basic developmental stages we go through as we grow up, we would expect methylation changes to be largely consistent between individuals. However, if DNA methylation is affected more by the different environments we live in, and individual health profiles, we would expect a proportion of sites to change differently for different individuals.
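The contrast between these two scenarios can be illustrated with a small Python sketch. The methylation values below are invented for illustration, and fitting a separate straight line per person is a deliberate simplification rather than the statistical models used in the study; the names `slope`, `summarise`, `site_consistent` and `site_varying` are hypothetical.

```python
from statistics import mean, pstdev

AGES = [0, 6, 10, 17]  # measurement ages in years (birth, 6, 10 and 17)

def slope(ages, values):
    """Ordinary least-squares slope of methylation level on age."""
    age_bar, val_bar = mean(ages), mean(values)
    num = sum((a - age_bar) * (v - val_bar) for a, v in zip(ages, values))
    den = sum((a - age_bar) ** 2 for a in ages)
    return num / den

def summarise(site):
    """Return (average rate of change, spread of that rate between people)."""
    slopes = [slope(AGES, values) for values in site.values()]
    return mean(slopes), pstdev(slopes)

# Invented data: proportion methylated at one site for three people.
# Scenario 1: everyone changes at the same rate (different starting levels).
site_consistent = {
    "p1": [0.50, 0.56, 0.60, 0.67],
    "p2": [0.40, 0.46, 0.50, 0.57],
    "p3": [0.60, 0.66, 0.70, 0.77],
}
# Scenario 2: same average change, but the rate differs between people.
site_varying = {
    "p1": [0.50, 0.50, 0.50, 0.50],
    "p2": [0.50, 0.56, 0.60, 0.67],
    "p3": [0.50, 0.62, 0.70, 0.84],
}

mean_c, spread_c = summarise(site_consistent)
mean_v, spread_v = summarise(site_varying)
print(f"consistent site: mean slope {mean_c:.3f}, spread {spread_c:.3f}")
print(f"varying site:    mean slope {mean_v:.3f}, spread {spread_v:.3f}")
```

In this toy example both sites show the same average change per year, but only the second shows a non-zero spread in the rate of change between individuals.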

Between ALSPAC and Generation R, we created a unique dataset containing over 5,000 samples from about 2,500 participants with DNA methylation measurements at almost half a million methylation sites measured repeatedly at birth, 6 years, 10 years, and at 17 years. With various statistical models we studied different trajectories of change in DNA methylation.

We found change in DNA methylation at just over half of the sites (see Figure 2a for an example). At about a quarter of sites, DNA methylation changed at a different rate for different individuals (Figure 2b). We further saw that sometimes change only happened in a specific time period, for example only between birth and the age of 6 years, after which DNA methylation remained stable (Figure 2c), and that sometimes differences in the rate of change only started from the age of 9 years (Figure 2d). Last, for less than 1% of the sites on the chromosomes tested (we excluded the sex chromosomes), we saw that DNA methylation changed differently for boys and girls (Figure 2e).

Figure 2. Different examples of methylation sites, with every graph representing one methylation site with age on the x-axis and level of DNA methylation on the y-axis. Every line represents change in DNA methylation over time for one individual, showing (a) change in DNA methylation, (b) different rates of change for different individuals, (c) change during the first six years of life, (d) different rates of change starting from 9 years of age, (e) different change for boys and girls, and (f) change, but no differences in rate of change, in a site associated with prenatal smoking.

How can we use these findings in future research?

These results show that there are sites in the genome that show change in DNA methylation that is consistent between individuals, as well as sites that change at a different rate for different individuals. We have published the trajectories of change for each methylation site on a publicly available website. This makes it easier for other researchers to find sites that are developmentally important and may be of relevance for health and disease. For example, a methylation site previously associated with prenatal smoking remained stable over time (Figure 2f), indicating that prenatal influences of smoking may be long-lasting, at least up to adolescence. In the future, we hope to associate traits, such as stress and BMI, to these longitudinal changes, to further our understanding of the developmental nature of DNA methylation and the associated biological pathways leading to health and disease.

 

1Department of Child and Adolescent Psychiatry/Psychology, Erasmus MC, University Medical Center Rotterdam, Rotterdam, the Netherlands

2 Department of Child and Adolescent Psychiatry/Psychology, Erasmus MC, University Medical Center Rotterdam, Rotterdam, the Netherlands

3 MRC Integrative Epidemiology Unit, Population Health Sciences, Bristol Medical School, University of Bristol, Bristol, UK

4 Department of Psychology, University of Bath, Bath, UK

5 Department of Epidemiology, Erasmus MC, University Medical Center Rotterdam, Rotterdam, the Netherlands

6 Department of Psychology, Institute of Psychology, Psychiatry & Neuroscience, King’s College London, London, UK

 

Further reading

Mulder, R. H., Neumann, A. H., Cecil, C. A., Walton, E., Houtepen, L. C., Simpkin, A. J., … & Jaddoe, V. W. (2020). Epigenome-wide change and variation in DNA methylation from birth to late adolescence. bioRxiv. (preprint)

Epidelta project website: http://epidelta.mrcieu.ac.uk/

Are schools in the COVID-19 era safe?

Sarah Lewis, Marcus Munafo and George Davey Smith

Follow Sarah, George and Marcus on Twitter

The COVID-19 pandemic caused by the SARS-CoV-2 virus in 2020 has so far resulted in a heavy death toll and caused unprecedented disruption worldwide. Many countries have opted for drastic measures and even full lockdowns of all but essential services to slow the spread of disease and to stop health care systems becoming overwhelmed. However, whilst lockdowns happened fast and were well adhered to in most countries, coming out of lockdown is proving to be more challenging. Policymakers have been trying to balance relaxing restriction measures with keeping virus transmission low. One of the most controversial aspects has been when and how to reopen schools.

Many parents and teachers are asking: Are schools safe?

The answer to this question depends on how much risk an individual is prepared to accept – schools have never been completely “safe”. Also, in the context of this particular pandemic, the risk from COVID-19 to an individual varies substantially by age, sex and underlying health status. However, from a historical context, the risk of death from contracting an infectious disease in UK schools (even in the era of COVID-19) is very low compared to just 40 years ago, when measles, mumps, rubella and whooping cough were endemic in schools. Similarly, from a global perspective UK schools are very safe – in Malawi, for example, the mortality rate for teachers is around five times higher than in the UK, with tuberculosis causing more than 25% of deaths among teachers.

In this blog post we use data on death rates to discuss safety, because there is currently better evidence on death rates by occupational status than, for example, infection rates. This is because death rates related to COVID-19 have been consistently reported by the Office for National Statistics, whereas data on infection rates depend very much on the level of testing in the community (which has changed over time and differs by region).

Risks to children

Thankfully the risk of serious disease and death to children throughout the pandemic, across the UK and globally, has been low. Children (under 18 years) make up around 20% of the UK population, but account for only around 1.5% of those hospitalised with COVID-19. This age group have had better outcomes according to all measures compared to adults. As of the 12th June 2020, there have been 6 deaths in those with COVID-19 among those aged under 15 years across England and Wales. Whilst extremely sad, these deaths represent a risk of around 1 death per 2 million children. To place this in some kind of context, the number of deaths expected due to lower respiratory tract infections among this age group in England and Wales over a 3 month period is around 50 and 12 children would normally die due to road traffic accidents in Great Britain over a 3-month period.

Risks to teachers

Our previous blog post concluded that based on available evidence the risk to teachers and childcare workers within the UK from Covid-19 did not appear to be any greater than for any other group of working age individuals. It considered mortality from COVID-19 among teachers and other educational professionals who were exposed to the virus prior to the lockdown period (23rd March 2020) and had died by the 20th April 2020 in the UK. This represents the period when infection rates were highest, and when children were attending school in large numbers. There were 2,494 deaths among working-age individuals up to this date, and we found that the 47 deaths among teachers over this period represented a similar risk to all professional occupations – 6.7 (95% CI 4.1 to 10.3) per 100,000 among males and 3.3 (95% CI 2.0 to 4.9) per 100,000 among females.

The Office for National Statistics (ONS) has since updated the information on deaths according to occupation to include all deaths up to the 25th May 2020. The new dataset includes a further 2,267 deaths among individuals with COVID-19. As the number of deaths had almost doubled during this extended period, so too had the risk. A further 43 deaths had occurred among teaching and education professionals, bringing the total number of deaths involving COVID-19 among this occupational group to 90. It therefore appears that lockdown (during which time many teachers have not been in school) has not had an impact on the rate at which teachers have been dying from COVID-19.
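A quick arithmetic check of this reasoning, using only the counts quoted above, is sketched below in Python. It compares the share of working-age COVID-19 deaths that occurred among teaching and education professionals in the two reporting periods. This share is a crude proxy, not one of the age-standardised rates published by the ONS, and the variable names are illustrative.

```python
# Counts quoted in the text (ONS deaths by occupation).
deaths_all_first = 2494       # working-age deaths up to 20 April 2020
deaths_all_added = 2267       # further deaths up to 25 May 2020
deaths_teachers_first = 47    # teaching and education professionals, first period
deaths_teachers_added = 43    # teaching and education professionals, added period

# Share of all working-age COVID-19 deaths that were among teachers.
share_first = deaths_teachers_first / deaths_all_first
share_added = deaths_teachers_added / deaths_all_added

total_teachers = deaths_teachers_first + deaths_teachers_added
total_all = deaths_all_first + deaths_all_added

print(f"first period share:  {share_first:.2%}")
print(f"added period share:  {share_added:.2%}")
print(f"cumulative: {total_teachers} of {total_all} deaths")
```

The two shares are both close to 1.9%, which is consistent with the statement that teachers' deaths rose in step with deaths in the working-age population as a whole.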

As before, COVID-19 risk does not appear greater for teachers than other working age individuals

The revised risk to teachers of dying from COVID-19 remains very similar to the overall risk for all professionals, at 12.9 (95% CI 9.3 to 17.4) per 100,000 among all male teaching and educational professionals and 6.0 (95% CI 4.2 to 8.1) per 100,000 among all females, compared with 11.6 (95% CI 10.2 to 13.0) per 100,000 and 8.0 (95% CI 6.8 to 9.3) per 100,000 among all male and female professionals respectively. It is useful to look at the rate at which we would normally expect teaching and educational professionals to die during this period, as this tells us by how much COVID-19 has increased mortality in this group. The ONS provide this in the form of average mortality rates for each occupational group for the same 11-week period over the last 5 years. The mortality due to COVID-19 during this period represents 33% for males and 19% for females of their average mortality over the last 5 years for the same period. For male teaching and educational professionals, the proportion of average mortality due to COVID-19 is very close to the value for all working-aged males (31%) and all male professionals (34%). For females the proportion of average mortality due to COVID-19 is lower than for all working-aged females (25%) and for female professionals (25%). During the pandemic period covered by the ONS, there was little evidence that deaths from all causes among the group of teaching and educational professionals were elevated above the 5-year average for this group.

Teaching is a comparatively safe profession

It is important to note that according to ONS data on adults of working age (20-59 years) between 2001-2011, teachers and other educational professionals have low overall mortality rates compared with other occupations (ranking 3rd safest occupation for women and 6th for men). The same study found a 3-fold difference between annual mortality among teachers and among the occupational groups with the highest mortality rates (plant and machine operatives for women and elementary construction occupations among men). These disparities in mortality from all causes also exist in the ONS data covering the COVID-19 pandemic period, but were even more pronounced, with a 7-fold difference between male teaching and educational professionals and male elementary construction occupations, and a 16-fold difference between female teachers and female plant and machine operatives.

There is therefore currently no indication that teachers have an elevated risk of dying from COVID-19 relative to other occupations, and despite some teachers having died with COVID-19, the mortality rate from all causes (including COVID-19) for this occupational group over this pandemic period is not substantially higher than the 5-year average.

Will reopening schools increase risks to teachers?

One could argue that the risk to children and teachers has been low because schools were closed for much of the pandemic, and children have largely been confined to mixing with their own households, so that when schools open fully risk will increase. However, infection rates in the community are now much lower than they were at their peak, when schools were fully open to all pupils without social distancing. Studies which have used contact tracing to determine whether infected children have transmitted the disease to others have consistently shown that they have not, although the number of cases included has been small, and asymptomatic children are often not tested. Modelling studies estimate that even if schools fully reopen without social distancing, this is likely to have only modest effects on virus transmission in the community. If infection levels can be controlled – for example by testing and contact tracing efforts – and cases can be quickly isolated, then we believe that schools pose a minimal risk in terms of the transmission of COVID-19, and to the health of teachers and children. Furthermore, the risk is likely to be more than offset by the harms caused by ongoing disruption to children’s educational opportunities.

Sarah Lewis is a Senior Lecturer in Genetic Epidemiology in the department of Population Health Sciences, and is an affiliated member of the MRC Integrative Epidemiology Unit (IEU), University of Bristol.

Marcus Munafo is a Professor of Biological Psychology, in the School of Psychology Science and leads the Causes, Consequences and Modification of Health Behaviours programme of research in the IEU, University of Bristol.

George Davey Smith is a Professor of Clinical Epidemiology, and director of the MRC IEU, University of Bristol.

We should be cautious about associations of patient characteristics with COVID-19 outcomes that are identified in hospitalised patients.

Gareth J Griffith, Gibran Hemani, Annie Herbert, Giulia Mancano, Tim Morris, Lindsey Pike, Gemma C Sharp, Matt Tudball, Kate Tilling and Jonathan A C Sterne, together with the authors of a preprint on collider bias in COVID-19 studies.

All authors are members of the MRC Integrative Epidemiology Unit at the University of Bristol. Jonathan Sterne is Director of Health Data Research UK South West.

Among successful actors, being physically attractive is inversely related to being a good actor. Among American college students, being academically gifted is inversely related to being good at sport.

Among people who have had a heart attack, smokers have better subsequent health than non-smokers. And among low birthweight infants, those whose mothers smoked during pregnancy are less likely to die than those whose mothers did not smoke.

These relationships are not likely to reflect cause and effect in the general population: smoking during pregnancy does not improve the health of low birthweight infants. Instead, they arise from a phenomenon called ‘selection bias’, or ‘collider bias’.

Understanding selection bias

Selection bias occurs when two characteristics influence whether a person is included in a group for which we analyse data. Suppose that two characteristics (for example, physical attractiveness and acting talent) are unrelated in the population but that each causes selection into the group (for example, people who have a successful Hollywood acting career). Among individuals with a successful acting career we will usually find that physical attractiveness will be negatively associated with acting talent: individuals who are more physically attractive will be less talented actors (Figure 1). Selection bias arises if we try to infer a cause-effect relationship between these two characteristics in the selected group. The term ‘collider bias’ refers to the two arrows indicating cause and effect that ‘collide’ at the effect (being a successful actor).

Figure 1: Selection effects exerted on successful Hollywood actors. Green boxes highlight characteristics that influence selection. Yellow boxes indicate the variable selected upon. Arrows indicate causal relationships: the dotted line indicates a non-causal induced relationship that arises because of selection bias.

Figure 2 below explains this phenomenon. Each point represents a hypothetical person, with their level of physical attractiveness plotted against their level of acting talent. In the general population (all data points) an individual’s attractiveness tells us nothing about their acting ability – the two characteristics are unrelated. The red data points represent successful Hollywood actors, who tend to be more physically attractive and to be more talented actors. The blue data points represent other people in the population. Among successful actors the two characteristics are strongly negatively associated (green line), solely because of the selection process. The direction of the bias (whether it is towards a positive or negative association) depends on the direction of the selection processes. If they act in the same direction (both positive or both negative) the bias will usually be towards a negative association. If they act in opposite directions the bias will usually be towards a positive association.

Figure 2: The effect of sample selection on the relationship between attractiveness and acting talent. The green line depicts the negative association seen in successful actors.
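The pattern shown in Figure 2 can be reproduced with a short simulation. The Python sketch below is illustrative only: it assumes two independent, standard normal traits, 20,000 simulated people and a hard selection cutoff of 1.5, none of which come from the article, and the function names are hypothetical. It also checks the direction rule described above by selecting once on the sum of the traits (both act in the same direction) and once on their difference (they act in opposite directions).

```python
import random

def pearson(pairs):
    """Pearson correlation for a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    vx = sum((x - mx) ** 2 for x, _ in pairs)
    vy = sum((y - my) ** 2 for _, y in pairs)
    return cov / (vx * vy) ** 0.5

def simulate(n=20_000, cutoff=1.5, seed=0):
    rng = random.Random(seed)
    # Attractiveness and talent are drawn independently: no true relationship.
    people = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]
    # Both traits raise the chance of selection (same direction).
    same_direction = [(a, t) for a, t in people if a + t > cutoff]
    # One trait raises and the other lowers the chance (opposite directions).
    opposite_direction = [(a, t) for a, t in people if a - t > cutoff]
    return pearson(people), pearson(same_direction), pearson(opposite_direction)

r_all, r_same, r_opposite = simulate()
print(f"whole population:              r = {r_all:+.2f}")
print(f"selected, same direction:      r = {r_same:+.2f}")
print(f"selected, opposite directions: r = {r_opposite:+.2f}")
```

Under these assumptions the correlation is close to zero in the whole population, clearly negative in the group selected on the sum, and clearly positive in the group selected on the difference, even though the traits were generated independently.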

 

Why is selection bias important for COVID-19 research?

In health research, selection processes may be less well understood, and we are often unable to observe the unselected group. For example, many studies of COVID-19 have been restricted to hospitalised patients, because it was not possible to identify all symptomatic patients, and testing was not widely available in the early phase of the pandemic. Selection bias can seriously distort relationships of risk factors for hospitalisation with COVID-19 outcomes such as requiring invasive ventilation, or mortality.

Figure 3 shows how selection bias can distort risk factor associations in hospitalised patients. We want to know the causal effect of smoking on risk of death due to COVID-19, and the data available to us is on patients hospitalised with COVID-19. Associations between all pairs of factors that influence hospitalisation will be distorted in hospitalised patients. For example, if smoking and frailty each make an individual more likely to be hospitalised with COVID-19 (either because they influence infection with SARS-CoV-2 or because they influence COVID-19 disease severity), then their association in hospitalised patients will usually be more negative than in the whole population. Unless we control for all causes of hospitalisation, our estimate of the effect of any individual risk factor on COVID-19 mortality will be biased. For example, it would be unsurprising that within hospitalised patients with COVID-19 we observe that smokers have better health than non-smokers because they are likely to be younger and less frail, and therefore less likely to die after hospitalisation. But that finding may not reflect a protective effect of smoking on COVID-19 mortality in the whole population.

Figure 3: Selection effects on hospitalisation with COVID-19. Box colours are as in Figure 1. Blue boxes represent outcomes. Arrows indicate causal relationships: the dotted line indicates a non-causal induced relationship that arises because of selection bias.

 

Selection bias may also be a problem in studies based on data from participants who volunteer to download and use COVID-19 symptom reporting apps. People with COVID-19 symptoms are more likely to use the app, and so are people with other characteristics (younger people, people who own a smartphone, and those to whom the app is promoted on social media). Risk factor associations within app users may therefore not generalise to the wider population.

What can be done?

Findings from COVID-19 studies conducted in selected groups should be interpreted with great caution unless selection bias has been explicitly addressed. Two ways to do so are readily available. The preferred approach uses representative data collection for the whole population to weight the sample and adjust for the selection bias. In the absence of data on the whole population, researchers should conduct sensitivity analyses that adjust their findings based on a range of assumptions about the selection effects. A series of resources providing further reading, and tools allowing researchers to investigate plausible selection effects, are provided below.
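The weighting approach can be sketched with inverse probability weighting, in which each selected person is weighted by one over their probability of being selected. The Python example below is a toy illustration, not a method taken from the article: it assumes the selection probabilities are known exactly (0.9 when the sum of two traits is positive, 0.1 otherwise), whereas a real study would have to estimate them from representative population data. All names and numbers are hypothetical.

```python
import random

def weighted_corr(xs, ys, ws):
    """Weighted Pearson correlation of xs and ys with weights ws."""
    total = sum(ws)
    mx = sum(w * x for x, w in zip(xs, ws)) / total
    my = sum(w * y for y, w in zip(ys, ws)) / total
    cov = sum(w * (x - mx) * (y - my) for x, y, w in zip(xs, ys, ws))
    vx = sum(w * (x - mx) ** 2 for x, w in zip(xs, ws))
    vy = sum(w * (y - my) ** 2 for y, w in zip(ys, ws))
    return cov / (vx * vy) ** 0.5

def simulate(n=50_000, seed=1):
    rng = random.Random(seed)
    xs, ys, ws = [], [], []
    for _ in range(n):
        # Two independent traits: no relationship in the whole population.
        x, y = rng.gauss(0, 1), rng.gauss(0, 1)
        # Assumed (known) selection mechanism: both traits favour selection.
        p_select = 0.9 if x + y > 0 else 0.1
        if rng.random() < p_select:
            xs.append(x)
            ys.append(y)
            ws.append(1 / p_select)  # inverse probability weight
    naive = weighted_corr(xs, ys, [1.0] * len(xs))  # ignores selection
    adjusted = weighted_corr(xs, ys, ws)            # weights undo selection
    return naive, adjusted

naive_r, adjusted_r = simulate()
print(f"unweighted correlation in selected sample: {naive_r:+.2f}")
print(f"inverse-probability-weighted correlation:  {adjusted_r:+.2f}")
```

In this sketch the unweighted correlation is negative because of selection, while the weighted correlation returns to roughly zero. Weighting only works when everyone has some chance of being selected; if some groups are never observed, sensitivity analyses are the remaining option.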

For further information please contact Gareth Griffith (g.griffith@bristol.ac.uk) or Jonathan Sterne (jonathan.sterne@bristol.ac.uk).

Further reading and selection tools:

Dahabreh IJ and Kent DM. Index Event Bias as an Explanation for the Paradoxes of Recurrence Risk Research. JAMA 2011; 305(8): 822-823.

Griffith, Gareth, Tim M. Morris, Matt Tudball, Annie Herbert, Giulia Mancano, Lindsey Pike, Gemma C. Sharp, Jonathan Sterne, Tom M. Palmer, George Davey Smith, Kate Tilling, Luisa Zuccolo, Neil M. Davies, and Gibran Hemani. Collider Bias undermines our understanding of COVID-19 disease risk and severity. Interactive App 2020 http://apps.mrcieu.ac.uk/ascrtain/

Groenwold, RH, Palmer TM and Tilling K. Conditioning on a mediator to adjust for unmeasured confounding OSF Preprint 2020: https://osf.io/vrcuf/

Hernán MA, Hernández-Díaz S and Robins JM. A structural approach to selection bias. Epidemiology 2004; 15: 615-625.

Munafo MR, Tilling K, Taylor AE, Evans DM and Davey Smith G. Collider Scope: When Selection Bias Can Substantially Influence Observed Associations. International Journal of Epidemiology 2018; 47: 226-35.

Luque-Fernandez MA, Schomaker M, Redondo-Sanchez D, Sanchez Perez MJ, Vaidya A and Schnitzer ME. Educational Note: Paradoxical collider effect in the analysis of non-communicable disease epidemiological data: a reproducible illustration and web application International Journal of Epidemiology 2019; 48: 640-653. Interactive App: https://watzilei.com/shiny/collider/

Smith LH and VanderWeele TJ. Bounding bias due to selection. Epidemiology 2019; 30: 509-516. Interactive App: https://selection-bias.herokuapp.com

 

Are teachers at high risk of death from Covid19?

Sarah Lewis, George Davey Smith and Marcus Munafo

Follow Sarah, George and Marcus on Twitter

Due to the SARS-CoV-2 pandemic schools across the United Kingdom were closed to all but a small minority of pupils (children of keyworkers and vulnerable children) on the 20th March 2020, with some schools reporting as few as 5 pupils currently attending. The UK government have now issued guidance that primary schools in England should start to accept pupils back from the 1st June 2020 with a staggered return, starting with reception, year 1 and year 6.

Concern from teachers’ unions

This has prompted understandable concern from the teachers’ unions, and on the 13th May, nine unions which represent teachers and education professionals signed a joint statement calling on the government to postpone reopening schools on the 1st June: “We all want schools to re-open, but that should only happen when it is safe to do so. The government is showing a lack of understanding about the dangers of the spread of coronavirus within schools, and outwards from schools to parents, sibling and relatives, and to the wider community.” At the same time, others have suggested that the harms to many children due to neglect, abuse and missed educational opportunity arising from school closures outweigh the small increased risk to children, teachers and other adults of catching the virus.

Weighing up the risks to children and teachers

So what do we know about the risk to children and to teachers? We know that children are about half as likely to catch the virus from an infected person as adults, and if they do catch the virus they are likely to have only mild symptoms. The current evidence, although inconclusive, also suggests that they may be less likely to transmit the virus than adults. However, teachers have rightly pointed out that there is a risk of transmission between the teachers themselves and between parents and teachers.

The first death from COVID-19 in England was recorded at the beginning of March 2020 and by the 8th May 2020 39,071 deaths involving COVID-19 had been reported in England and Wales. Just three of these deaths were among children aged under 15 years and only a small proportion of the deaths (4,416 individuals, 11.3%) were among working-age people. Even among this age group risk is not uniform; it increases sharply with age, from 2.6 in 100,000 for 25-44 year olds to 26 in 100,000 for those aged 45-64, a tenfold increase.

Risks to teachers compared to other occupations

In addition, each underlying health condition increases the risk of dying from COVID-19, with those having at least 1 underlying health problem making up most cases. The Office for National Statistics in the UK have published age-standardised deaths by occupation for all deaths involving COVID-19 up to the 20th April 2020. Most of the people dying by this date would have been infected at the peak of the pandemic in the UK prior to the lockdown period. They found that during this period there were 2,494 deaths involving Covid-19 in the working age population. The mortality rate for Covid-19 during this period was 9.9 (95% confidence intervals 9.4-10.4) per 100,000 males and 5.2 (95%CI 4.9-5.6) per 100,000 females, with Covid-19 involved in around 1 in 4 and 1 in 5 of all deaths among males and females respectively.

Amongst teaching and education professionals (which includes school teachers, university lecturers and other education professionals) a total of 47 deaths (involving Covid-19) were recorded, equating to mortality rates of 6.7 (95%CI 4.1-10.3) per 100,000 among males and 3.3 (95%CI 2.0-4.9) per 100,000 among females, which was very similar to the rates of 5.6 (95%CI 4.6-6.6) per 100,000 among males and 4.2 (95%CI 3.3-5.2) per 100,000 among females for all professionals. The mortality figures for all education professionals include 7 deaths among 437,000 primary and nursery school teachers (1.6 per 100,000) and 17 deaths among 395,000 secondary school teachers (4.3 per 100,000). A further 20 deaths occurred amongst childcare workers, giving a mortality rate amongst this group of 3.4 (95%CI 2.0-5.5) per 100,000 females (males were highly underrepresented in this group). This is in contrast to rates of 6.5 (95%CI 4.9-9.1) for female sales assistants and 12.7 (95%CI 9.8-16.2) for female care home workers.
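The per-100,000 figures for primary and secondary school teachers follow directly from the counts given. The short Python sketch below reproduces them; note that these are crude rates calculated from the quoted numbers, unlike the age-standardised rates with confidence intervals reported by the ONS, and the helper name `rate_per_100k` is illustrative.

```python
def rate_per_100k(deaths, population):
    """Crude mortality rate per 100,000 people."""
    return deaths / population * 100_000

# Counts quoted in the text.
primary_rate = rate_per_100k(7, 437_000)     # primary and nursery teachers
secondary_rate = rate_per_100k(17, 395_000)  # secondary school teachers

print(f"primary and nursery: {primary_rate:.1f} per 100,000")   # 1.6
print(f"secondary:           {secondary_rate:.1f} per 100,000") # 4.3
```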

Covid-19 risk does not appear greater for teachers than other working age individuals

In summary, based on current evidence the risk to teachers and childcare workers within the UK from Covid-19 does not appear to be any greater than for any other group of working age individuals. However, perceptions of elevated risk may have arisen, prompting some to ask “Why are so many teachers dying?”, because of the way this issue is portrayed in the media, with headlines such as “Revealed: At least 26 teachers have died from Covid-19” currently on the https://www.tes.com website. This kind of reporting, along with the inability of the government to communicate the substantial differences in risk between different population groups – in particular according to age – has caused understandable anxiety among teachers. Whilst some teachers may not be prepared to accept any level of risk of becoming infected with the virus whilst at work, others may be reassured that the risk to them is small, particularly given that we all accept some level of risk in our lives, a value that can never be zero.

Likely impact on transmission in the community is unclear

As the majority of parents or guardians of school aged children will be in the 25-45 age range, the risk to them is also likely to be small. Questions remain however around the effect of school openings on transmission in the community and the associated risk. This will be affected by many factors including the existing infection levels in the community, the extent to which pupils, parents and teachers are mixing outside of school (and at the school gate) and mixing between individuals of different age groups. This is the primary consideration of the government Scientific Advisory Group for Emergencies (SAGE) who are using modelling based on a series of assumptions to determine the effect of school openings on R0.

 

Sarah Lewis is a Senior Lecturer in Genetic Epidemiology in the department of Population Health Sciences, and is an affiliated member of the MRC Integrative Epidemiology Unit (IEU), University of Bristol.

George Davey Smith is a Professor of Clinical Epidemiology, and director of the MRC IEU, University of Bristol.

Marcus Munafò is a Professor of Biological Psychology in the School of Psychological Science and leads the Causes, Consequences and Modification of Health Behaviours programme of research in the IEU, University of Bristol.

 

What can genetics tell us about how sleep affects our health?

Deborah Lawlor, Professor of Epidemiology, Emma Anderson, MRC Research Fellow, Marcus Munafò, Professor of Experimental Psychology, Mark Gibson, PhD student, Rebecca Richmond, Vice Chancellor’s Research Fellow

Follow Deborah, Marcus, and Rebecca on Twitter

Association is not causation – are we fooled (confounded) when we see associations between sleep problems and disease?

Sleep is important for health. Observational studies show that people who report having sleep problems are more likely to be overweight, and have more health problems including heart disease, some cancers and mental health problems.

A major problem with conventional observational studies is that we cannot tell whether these associations are causal; does being overweight cause sleep problems, or do sleep problems cause people to become overweight? Alternatively, factors that influence how we sleep may also influence our health. For example, smoking might cause sleep problems as well as heart disease and so we are fooled (confounded) into thinking sleep problems cause heart disease when it is really all explained by smoking. In the green paper Advancing our Health: Prevention in the 2020s, the UK Government acknowledged that sleep has had little attention in policy, and that causality between sleep and health is likely to run in both directions.

But how can we determine the direction of causality for sure? And how do we make sure our results are not confounded?

Randomly allocated genetic variation

Our genes are randomly allocated to us from our parents when we are conceived. They do not change across our lifespan, and cannot be changed by smoking, overweight or ill health.

Here at the MRC Integrative Epidemiology Unit we have developed a research method called Mendelian randomization, which uses this family-level random allocation of genes to explore causal effects. To find out more about Mendelian randomization take a look at this primer from the Director of the Unit (Prof George Davey Smith).

In the last two years, we and colleagues from the Universities of Manchester, Exeter and Harvard have identified large numbers of genetic variants that relate to different sleep characteristics. These include:

  • Insomnia symptoms
  • How long, on average, someone sleeps each night
  • Chronotype (whether someone is an ‘early bird’ or ‘lark’ and prefers mornings, or a ‘night owl’ and prefers evenings). Chronotype is thought to reflect variation in our body clock (known as circadian rhythms).

We can use these genetic variants in Mendelian randomization studies to get a better understanding of whether sleep characteristics affect health and disease.

What we did

In our initial studies we used Mendelian randomization to explore the effects of sleep duration, insomnia and chronotype on body mass index, coronary heart disease, mental health problems, Alzheimer’s disease, and breast cancer. We analysed whether the genetic traits that are related to sleep characteristics – rather than the sleep characteristics themselves – are associated with the health outcomes. We combined those results with the effect of the genetic variants on sleep traits which allows us to estimate a causal effect. Using genetic variants rather than participants’ reports of their sleep characteristics makes us much more certain that the effects we identify are not due to confounding or reverse causation.
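As a rough illustration of the combining step described above, the simplest single-variant version divides the variant-outcome association by the variant-exposure association (often called a Wald ratio). This is only a sketch, not the analysis code used in these studies: the function names and numbers are made up for illustration, and the standard error is a first-order approximation that ignores uncertainty in the variant-exposure estimate.

```python
def wald_ratio(beta_outcome: float, beta_exposure: float) -> float:
    """Estimated change in the outcome per unit change in the sleep trait."""
    if beta_exposure == 0:
        raise ValueError("the variant must be associated with the exposure")
    return beta_outcome / beta_exposure


def wald_ratio_se(se_outcome: float, beta_exposure: float) -> float:
    """First-order standard error (ignores uncertainty in beta_exposure)."""
    return se_outcome / abs(beta_exposure)


# Hypothetical summary statistics for one genetic variant (not real results).
beta_gx = 0.10   # variant -> sleep trait
beta_gy = 0.02   # variant -> health outcome
se_gy = 0.005    # standard error of the variant -> outcome estimate

estimate = wald_ratio(beta_gy, beta_gx)
se = wald_ratio_se(se_gy, beta_gx)
lower, upper = estimate - 1.96 * se, estimate + 1.96 * se
print(f"estimate = {estimate:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

In practice many variants are combined and additional checks are run, so this should be read as the underlying idea rather than a full method.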

Are you a night owl or a lark?

What we found

Our results show a mixed picture; different sleep characteristics have varying effects on a range of health outcomes.

What does this mean?

Having better research evidence about the effects of sleep traits on different health outcomes means that we can give better advice to people at risk of specific health problems. For example, developing effective programmes to alleviate insomnia may prevent coronary heart disease and depression in those at risk. It can also help reduce worry about sleep and health, by demonstrating that some associations that have been found in previous studies are not likely to reflect causality.

If you are worried about your own sleep, the NHS has some useful guidance and signposting to further support.

Want to find out more?

Contact the researchers

Deborah A Lawlor: d.a.lawlor@bristol.ac.uk

Further reading

This research has been published in the following open access research papers:

Genome-wide association analyses of chronotype in 697,828 individuals provides insights into circadian rhythms. Nature Comms. (2019) https://www.nature.com/articles/s41467-018-08259-7

Biological and clinical insights from genetics of insomnia symptoms. Nature Gen. (2019) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6415688/

Genome-wide association study identifies genetic loci for self-reported habitual sleep duration supported by accelerometer-derived estimates. Nature Comms. (2019) https://www.nature.com/articles/s41467-019-08917-4

Investigating causal relations between sleep traits and risk of breast cancer in women: mendelian randomisation study. BMJ (2019) https://www.bmj.com/content/365/bmj.l2327

Is disrupted sleep a risk factor for Alzheimer’s disease? Evidence from a two-sample Mendelian randomization analysis. https://www.biorxiv.org/content/10.1101/609834v1 (open access pre-print)

Evidence for Genetic Correlations and Bidirectional, Causal Effects Between Smoking and Sleep Behaviors. Nicotine & Tobacco Research (2018) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6528151/

Do development indicators underlie global variation in the number of young people injecting drugs?

Dr Lindsey Hines, Sir Henry Wellcome Postdoctoral Fellow in The Centre for Academic Mental Health & the Integrative Epidemiology Unit, University of Bristol

Dr Adam Trickey, Senior Research Associate in Population Health Sciences, University of Bristol

Follow Lindsey on Twitter

Injecting drug use is a global issue: around the world an estimated 15.6 million people inject psychoactive drugs. People who inject drugs tend to begin doing so in adolescence, and countries that have larger numbers of adolescents who inject drugs may be at risk of emerging epidemics of blood borne viruses unless they take urgent action. We mapped the global differences in the proportion of adolescents who inject drugs, but found that we may be missing the vital data we need to protect the lives of vulnerable young people. If we want to prevent HIV, hepatitis C, and overdose from sweeping through a new generation of adolescents we urgently need many countries to scale up harm reduction interventions, and to collect accurate data which can inform public health and policy.

People who inject drugs are engaging in a behaviour that can expose them to multiple health risks such as addiction, blood-borne viruses, and overdose, and are often stigmatised. New generations of young people are still starting to inject drugs, and young people who inject drugs are often part of other vulnerable groups.

Much of the research into the causes of injecting drug use focuses on individual factors, but we wanted to explore the effect of global development on youth injecting. A recent systematic review showed wide country-level variation in the number of young people who comprise the population of people who inject drugs. By considering variation in countries, we hoped to be able to inform prevention and intervention efforts.

It’s important to note that effective interventions can reduce the harms of injecting drug use. Harm reduction programmes provide clean needles and syringes to reduce transmission of blood borne viruses. Opiate substitution therapy seeks to tackle the physical dependence on opiates that maintains injecting behaviour and has been shown to improve health outcomes.

What we did

Through a global systematic review and meta-analysis we aimed to find data on injecting drug use in published studies, public health and policy documents from every country. We used these data to estimate the global percentage of people who inject drugs that are aged 15-25 years old, and also estimated this for each region and country. We wanted to understand what might underlie variation in the number of young people in populations of people who inject drugs, and so we used data from the World Bank to identify markers of a country’s wealth, equality, and development.

What we found

Our study estimated that, globally, around a quarter of people who inject drugs are adolescents and young adults. Applied to the global estimate of 15.6 million people who inject drugs, this gives approximately 3.9 million young people who inject drugs. As a global average, people start injecting drugs at 23 years old.

Estimated percentage of young people amongst those who inject drugs in each country

We found huge variation in the percentage of young people in each country’s population of people who inject drugs. Regionally, Eastern Europe had the highest proportion of young people amongst its population of people who inject drugs, and the Middle Eastern and North African region had the lowest. In both Russia and the Philippines, over 50% of the people who inject drugs were aged 25 or under, and the average age of the populations of people who inject drugs was amongst the lowest observed.

Average age of the population of people who inject drugs in each country

In relation to global development indicators, people who inject drugs in countries with lower wealth (indicated through Gross Domestic Product per capita) were younger and had been injecting drugs for a shorter time period. In rapidly urbanising countries (indicated through urbanisation growth rate) people were likely to start injecting drugs at later ages than people in countries with a slower current rate of urbanisation. We didn’t find any relationships between the age of people who inject drugs and a country’s youth unemployment, economic equality, or level of provision of opiate substitution therapy.

However, many countries were missing data on injecting age and behaviours, or injecting drug use in general, which could affect these results.

What this means

1. The epidemic of injecting drug use is being maintained over time.

A large percentage of people who inject drugs are adolescents, meaning that a new generation are being exposed to the risks of injecting – and we found that this risk was especially high in less wealthy countries.

2. We need to scale up access to harm reduction interventions

There are highly punitive policies towards drug use in the countries with the largest numbers of young people in their populations of people who inject drugs. Since 2016, thousands of people who use drugs in the Philippines have died at the hands of the police. In contrast, Portugal has adopted a public health approach to drug use and addiction for decades, taking the radical step of taking people caught with drugs for personal use into addiction services rather than prisons. The rate of drug-related deaths and HIV infections in Portugal has since plummeted, as has the overall rate of drug use amongst young people: our data show that Portugal has a high average age for its population of people who inject drugs. If we do not want HIV, hepatitis C, and drug overdoses to sweep through a new generation of adolescents, we urgently need to see more countries adopting the approach pioneered by Portugal, and scaling up access to harm reduction interventions to the levels recommended by the WHO.

3. We need to think about population health, and especially mental health, alongside urban development.

Global development appears to be linked to injecting drug use, and the results suggest that countries with higher urbanisation growth are seeing new, older populations beginning to inject drugs. It may be that changes in environment are providing opportunities for injecting drug use that people hadn’t previously had. It’s estimated that almost 70% of the global population will live in urban areas by 2050, with most of this growth driven by low and middle-income countries.

4. We need to collect accurate data

Despite the health risks of injecting drug use, and the urgent need to reduce risks for new generations, our study has revealed a paucity of data monitoring this behaviour. Most concerning, we know the least about youth injecting drug use in low- and middle-income countries: areas likely to have the highest numbers of young people in their populations of people who inject drugs. Due to the stigma and the illicit nature of injecting drug use it is often under-studied, but by failing to collect accurate data to inform public health and policy we are risking the lives of vulnerable young people.

Contact the researchers

Lindsey.hines@bristol.ac.uk

Lindsey is funded by the Wellcome Trust.

Social media in peer review: the case of CCR5

Last week IEU colleague Dr Sean Harrison was featured on BBC’s Inside Science, discussing his role in the CCR5-mortality story. Here’s the BBC’s synopsis:

‘In November 2018 news broke via YouTube that He Jiankui, then a professor at Southern University of Science and Technology in Shenzhen, China had created the world’s first gene-edited babies from two embryos. The edited gene was CCR5 delta 32 – a gene that conferred protection against HIV. Alongside the public, most of the scientific community were horrified. There was a spate of correspondence, not just on the ethics, but also on the science. One prominent paper was by Rasmus Nielsen and Xinzhu Wei of the University of California, Berkeley. They published a study in June 2019 in Nature Medicine that found an increased mortality rate in people with an HIV-preventing gene variant. It was another stick used to beat Jiankui – had he put a gene in these babies that was not just not helpful, but actually harmful? However it now turns out that the study by Nielsen and Wei has a major flaw. In a series of tweets, Nielsen was notified of an error in the UK Biobank data and his analysis. Sean Harrison at the University of Bristol tried and failed to replicate the result using the UK Biobank data. He posted his findings on Twitter and communicated with Nielsen and Wei who have now requested a retraction. UCL’s Helen O’Neill is intimately acquainted with the story and she chats to Adam Rutherford about the role of social media in the scientific process of this saga.’

Below, we re-post Sean’s blog which outlines how the story unfolded, and the analysis that he ran.

Follow Sean on Twitter

Listen to Sean on Inside Science

*****************************************************************************************************************************************

“CCR5-∆32 is deleterious in the homozygous state in humans” – is it?

I debated for quite a long time on whether to write this post. I had said pretty much everything I’d wanted to say on Twitter, but I’ve done some more analysis and writing a post might be clearer than another Twitter thread.

To recap, a couple of weeks ago a paper by Xinzhu (April) Wei & Rasmus Nielsen of the University of California was published, claiming that a deletion in the CCR5 gene increased mortality (in white people of British ancestry in UK Biobank). I had some issues with the paper, which I posted here. My tweets got more attention than anything I’d posted before. I’m pretty sure they got more attention than my published papers and conference presentations combined. ¯\_(ツ)_/¯

The CCR5 gene is topical because, as the paper states in the introduction:

In late 2018, a scientist from the Southern University of Science and Technology in Shenzhen, Jiankui He, announced the birth of two babies whose genomes were edited using CRISPR

To be clear, gene-editing human babies is awful. Selecting zygotes that don’t have a known, life-limiting genetic abnormality may be reasonable in some cases, but directly manipulating the genetic code is something else entirely. My arguments against the paper did not stem from any desire to protect the actions of Jiankui He, but to a) highlight a peer review process that was actually pretty awful, b) encourage better use of UK Biobank genetic data, and c) refute an analysis that seemed likely biased.

This paper has received an incredible amount of attention. If it is flawed, then poor science is being heavily promoted. Apart from the obvious problems with promoting something that is potentially biased, others may try to do their own studies using this as a guideline, which I think would be a mistake.


I’ll quickly recap the initial problems I had with the paper (excluding the things that were easily solved by reading the online supplement), then go into what I did to try to replicate the paper’s results. I ran some additional analyses that I didn’t post on Twitter, so I’ll include those results too.

Full disclosure: in addition to tweeting to me, Rasmus and I exchanged several emails, and they ran some additional analyses. I’ll try not to talk about any of these analyses as it wasn’t my work, but, if necessary, I may mention pertinent bits of information.

I should also mention that I’m not a geneticist. I’m an epidemiologist/statistician/evidence synthesis researcher who for the past year has been working with UK Biobank genetic data in a unit that is very, very keen on genetic epidemiology. So while I’m confident I can critique the methods for the main analyses with some level of expertise, and have spent an inordinate amount of time looking at this paper in particular, there are some things where I’ll say I just don’t know what the answer is.

I don’t think I’ll write a formal response to the authors in a journal – if anyone is going to, I’ll happily share whatever information you want from my analyses, but it’s not something I’m keen to do myself.

All my code for this is here.

The Issues

Not accounting for relatedness

Not accounting for relatedness (i.e. related people in a sample) is a problem. It can bias genetic analyses through population stratification or familial structure, and can be easily dealt with by removing related individuals in a sample (or fancy analysis techniques, e.g. Bolt-LMM). The paper ignored this and used everyone.

Quality control

Quality control (QC) is also an issue. When the IEU at the University of Bristol was QCing the UK Biobank genetic data, they looked for sex mismatches, sex chromosome aneuploidy (having sex chromosomes different to XX or XY), and participants with outliers in heterozygosity and missing rates (yeah, ok, I don’t have a good grasp on what this means, but I see it as poor data quality for particular individuals). The paper ignored these too.

Ancestry definition

The paper states it looks at people of “British ancestry”. Judging by the number of participants in the paper and the reference they used, the authors meant “white British ancestry”. I feel this should have been picked up on in peer review, since the terms are different. The Bycroft article referenced uses “white British ancestry”, so it would have certainly been clearer sticking to that.

Covariable choice

The main analysis should have also been adjusted for all principal components (PCs) and centre (where participants went to register with UK Biobank). This helps to control for population stratification, and we know that UK Biobank has problems with population stratification. I thought choosing variables to include as covariables based on statistical significance was discouraged, but apparently I was wrong. Still, I see no plausible reason to do so in this case – principal components represent population stratification, population stratification is a confounder of the association between SNPs and any outcome, so adjust for them. There are enough people in this analysis to take the hit.

The analysis


I don’t know why the main analysis was a ratio of the crude mortality rates at 76 years of age (rather than a Cox regression), and I don’t know why there are no confidence intervals (CIs) on the estimate. The CI exists, it’s in the online supplement. Peer review should have had problems with this. It is unconscionable that any journal, let alone a top-tier journal, would publish a paper when the main result doesn’t have any measure of the variability of the estimate. A P value isn’t good enough when it’s a non-symmetrical error term, since you can’t estimate the standard error.

So why is the CI buried in an additional file when it would have been so easy to put it into the main text? The CI is from bootstrapping, whereas the P value is from a log-rank test, and the CI of the main result crosses the null. The main result is non-significant and significant at the same time. This could be a reason why the CI wasn’t in the main text.

It’s also noteworthy that although the deletion appears strongly to be recessive (only has an effect if both chromosomes have the deletion), the main analysis reports delta-32/delta-32 against +/+, which surely has less power than delta-32/delta-32 against +/+ or delta-32/+. The CI might have been significant otherwise.


I think it’s wrong to present one-sided P values (in general, but definitely here). The hypothesis should not have been that the CCR5 deletion would increase mortality; it should have been ambivalent, like almost all hypotheses in this field. The whole point of the CRISPR was that the babies would be more protected from HIV, so unless the authors had an unimaginably strong prior that CCR5 was deleterious, why would they use one-sided P values? Cynically, but without a strong reason to think otherwise, I can only imagine because one-sided P values are half as large as two-sided P values.
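To make the "half as large" point concrete, here is a minimal sketch using a standard normal test statistic. It is not the log-rank test from the paper; the z value is hypothetical, and the halving only holds when the observed effect is in the hypothesised direction.

```python
import math


def p_values_from_z(z: float) -> tuple[float, float]:
    """Return (one-sided, two-sided) P values for a standard normal z statistic.

    The one-sided test assumes the alternative hypothesis is "effect > 0".
    """
    one_sided = 0.5 * math.erfc(z / math.sqrt(2))
    two_sided = math.erfc(abs(z) / math.sqrt(2))
    return one_sided, two_sided


# Hypothetical z statistic, chosen only to show the halving.
one, two = p_values_from_z(1.8)
print(f"one-sided P = {one:.3f}, two-sided P = {two:.3f}")
```

With this made-up statistic the one-sided P value falls below 0.05 while the two-sided one does not, which is why the choice between them matters for borderline results.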

The best analysis, I think, would have been a Cox regression. Happily, the authors did this after the main analysis. But the full analysis that included all PCs (but not centre) was relegated to the supplement, for reasons that are baffling since it gives the same result as using just 5 PCs.

Also, the survival curve should have CIs. We know nothing about whether those curves are separate without CIs. I reproduced survival curves with a different SNP (see below) – the CIs are large.


I’m not going to talk about the Hardy-Weinberg Equilibrium (HWE, inbreeding) analysis – it’s still not an area I’m familiar with, and I don’t really think it adds much to the analysis. There are loads of reasons why a SNP might be out of HWE – dying early is certainly one of them, but it feels like this would just be a confirmation of something you’d know from a Cox regression.

Replication Analyses

I have access to UK Biobank data for my own work, so I didn’t think it would be too complex to replicate the analyses to see if I came up with the same answer. I don’t have access to rs62625034, the SNP the paper says is a great proxy of the delta-32 deletion, for reasons that I’ll go into later. However, I did have access to rs113010081, which the paper said gave the same results. I also used rs113341849, which is another SNP in the same region that has extremely high correlation with the deletion (both SNPs have R² values above 0.93 with rs333, which is the rs ID for the delta-32 deletion). Ideally, all three SNPs would give the same answer.

First, I created the analysis dataset:

  1. Grabbed age, sex, centre, principal components, date of registration and date of death from the UK Biobank phenotypic data
  2. Grabbed the genetic dosages of rs113010081 and rs113341849 from the UK Biobank genetic data
  3. Grabbed the list of related participants in UK Biobank, and our usual list of exclusions (including withdrawals)
  4. Merged everything together, estimating the follow-up time for everyone, and creating a dummy variable of death (1 for those that died, 0 for everyone else) and another one for relateds (0 for completely unrelated people, 1 for those I would typically remove because of relatedness)
  5. Dropped the standard exclusions, because there aren’t many and they really shouldn’t be here
  6. I created dummy variables for the SNPs, with 1 for participants with two effect alleles (corresponding to a proxy for having two copies of the delta-32 deletion), and 0 for everyone else
  7. I also looked at what happened if I left the dosage as 0, 1 or 2, but since there was no evidence that 1 was any different from 0 in terms of mortality, I only reported the 2 versus 0/1 results
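The derived variables in the steps above can be sketched roughly as follows. This is an illustration in Python, not the Stata code used for the analysis; the field names, the end-of-follow-up date and the example participant are all hypothetical, and the cut-off of 1.5 for "two effect alleles" is the rounding rule described later in this post.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical end of follow-up; the post does not state the actual date.
CENSOR_DATE = date(2016, 2, 1)


@dataclass
class Participant:
    registered: date        # date of registration with UK Biobank
    died: Optional[date]    # date of death, or None if not recorded
    dosage: float           # imputed dosage of the effect allele (0 to 2)


def derive(p: Participant) -> tuple[float, int, int]:
    """Return (follow-up in years, death dummy, two-effect-alleles dummy)."""
    death = 1 if p.died is not None and p.died <= CENSOR_DATE else 0
    end = p.died if death else CENSOR_DATE
    follow_up_years = (end - p.registered).days / 365.25
    two_copies = 1 if p.dosage > 1.5 else 0
    return follow_up_years, death, two_copies


# Made-up participant, only to show the shape of the output.
print(derive(Participant(date(2009, 1, 1), None, 1.8)))
```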

I conducted 12 analyses in total (6 for each SNP), but they were all pretty similar:

  1. Original analysis: time = study time (so x-axis went from 0 to 10 years, survival from baseline to end of follow-up), with related people included, and using age, sex, principal components and centre as covariables
  2. Original analysis, without relateds: as above, but excluding related people
  3. Analysis 2: time = age of participant (so x-axis went from 40 to 80 years, survival up to each year of life, which matches the paper), with related people included, and using sex, principal components and centre as covariables
  4. Analysis 2, without relateds: as above, but excluding related people
  5. Analysis 3: as analysis 2, but without covariables
  6. Analysis 3, without relateds: as above, but excluding related people
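The count of 12 comes from crossing two SNPs, three model specifications and two choices about related participants. A small sketch of that grid follows; the labels and dictionary layout are my own shorthand for the list above, not names from the actual analysis code.

```python
from itertools import product

snps = ["rs113010081", "rs113341849"]

# (label, time scale, covariables) for the three model specifications
specs = [
    ("original", "study time", ["age", "sex", "principal components", "centre"]),
    ("analysis 2", "age", ["sex", "principal components", "centre"]),
    ("analysis 3", "age", []),
]

relateds_included = [True, False]

analyses = [
    {"snp": snp, "spec": label, "time": time, "covariables": covs, "relateds": rel}
    for snp, (label, time, covs), rel in product(snps, specs, relateds_included)
]

print(len(analyses))
```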

With this suite of analyses, I was hoping to find out whether:

  • either SNP was associated with mortality
  • including covariables changed the results
  • the time variable changed the results
  • including relateds changed the results

Results

[Image: results table]

I found… Nothing. There was very little evidence the SNPs were associated with mortality (the hazard ratios, HRs, were barely different from 1, and the confidence intervals were very wide). There was little evidence including relateds or more covariables, or changing the time variable, changed the results.

Here’s just one example of the many survival curves I made, looking at delta-32/delta-32 (1) versus both other genotypes in unrelated people only (not adjusted, as Stata doesn’t want to give me a survival curve with CIs that is also adjusted) – this corresponds to the analysis in row 6.

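[Figure: survival curves with confidence intervals, delta-32/delta-32 versus both other genotypes, unrelated people only]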

You’ll notice that the CIs overlap. A lot. You can also see that both events and participants are rare in the late 70s (the long horizontal and vertical stretches) – I think that’s because there are relatively few people who were that old at the end of their follow-up. Average follow-up time was 7 years, so to estimate mortality up to 76 years, I imagine you’d want quite a few people to be 69 years or older, so they’d be 76 at the end of follow-up (if they didn’t die). Only 3.8% of UK Biobank participants were 69 years or older.

In my original tweet thread, I only did the analysis in row 2, but I think all the results are fairly conclusive for not showing much.

In a reply to me, Rasmus stated:

[Image: Rasmus’s reply]

This is the claim that turned out to be incorrect:

[Image: the claim in question]

Never trust data that isn’t shown – apart from anything else, when repeating analyses and changing things each time, it’s easy to forget to redo an extra analysis if the manuscript doesn’t contain the results anywhere.

This also means I couldn’t directly replicate the paper’s analysis, as I don’t have access to rs62625034. Why not? I’m not sure, but the likely explanation is that it didn’t pass the quality control process (either ours or UK Biobank’s, I’m not sure).

SNPs

I’ve concluded that the only possible reason for a difference between my analysis and the paper’s analysis is that the SNPs are different. Much more different than would be expected, given the high amount of correlation between my two SNPs and the deletion, which the paper claims rs62625034 is measuring directly.

One possible reason for this is the imputation of SNP data. As far as I can tell, neither of my SNPs were measured directly, they were imputed. This isn’t uncommon for any particular SNP, as imputation of SNP data is generally very good. As I understand it, genetic code is transmitted in blocks, and the blocks are fairly steady between people of the same population, so if you measure one or two SNPs in a block, you can deduce the remaining SNPs in the same block.

In any case there is a lot of genetic data to start with – each genotyping chip measures hundreds of thousands of SNPs. Also, we can measure the likely success rate of the imputation, and SNPs that are poorly imputed (for a given value of “poorly”) are removed before anyone sees them.

The two SNPs I used had good “info scores” (around 0.95 I think – for reference, we dropped all SNPs with an info score of less than 0.3 for SNPs with similar minor allele frequencies), so we can be pretty confident in their imputation. On the other hand, rs62625034 was not imputed in the paper, it was measured directly. That doesn’t mean everyone had a measurement – I understand the missing rate of the SNP was around 3.4% in UK Biobank (this is from direct communication with the authors, not from the paper).

But. And this is a weird but that I don’t have the expertise to explain, the imputation of the SNPs I used looks… well… weird. When you impute SNP data, you impute values between 0 and 2. They don’t have to be integer values, so dosages of 0.07 or 1.5 are valid. Ideally, the imputation would only give integer values, so you’d be confident this person had 2 mutant alleles, and this person 1, and that person none. In many cases, that’s mostly what happens.

Non-integer dosages don’t seem like a big problem to me. If I’m using polygenic risk scores, I don’t even bother making them integers, I just leave them as decimals. Across a population, it shouldn’t matter, the variance of my final estimate will just be a bit smaller than it should be. But for this work, I had to make the non-integer dosages integers, so anything less than 0.5 I made 0, anything 0.5 to 1.5 was 1, and anything above 1.5 was 2. I’m pretty sure this is fine.
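The cut-offs just described can be written as a small function. This is a Python sketch of the rule as stated in the text, not the original Stata code; the function name is made up.

```python
def round_dosage(dosage: float) -> int:
    """Convert an imputed dosage (0 to 2) to a whole number of effect alleles.

    Below 0.5 becomes 0, 0.5 to 1.5 (inclusive) becomes 1, above 1.5 becomes 2.
    """
    if not 0.0 <= dosage <= 2.0:
        raise ValueError("dosage must be between 0 and 2")
    if dosage < 0.5:
        return 0
    if dosage <= 1.5:
        return 1
    return 2


# Example dosages, including the two non-integer values mentioned in the post.
for d in (0.07, 0.5, 1.5, 1.51, 2.0):
    print(d, round_dosage(d))
```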

Unless there’s more non-integer doses in one allele than the other.

rs113010081 has non-integer dosages for almost 14% of white British participants in UK Biobank (excluding relateds). But the non-integer dosages are not distributed evenly across dosages. No. The twos had way more non-integer dosages than the ones, which had way more non-integer dosages than the zeros.

In the below tables, the non-integers are represented by being missing (a full stop) in the rs113010081_x_tri variable, whereas the rs113010081_tri variable is the one I used in the analysis. You can see that of the 4,736 participants I thought had twos, 3,490 (73.69%) of those actually had non-integer dosages somewhere between 1.5 and 2.

[Image: tables of rs113010081_x_tri and rs113010081_tri]
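The comparison in those tables amounts to asking, within each rounded genotype, what fraction of participants had a dosage that was not a whole number. A rough Python sketch follows: the function names and example dosages are invented, and only the 3,490 and 4,736 counts come from the text above.

```python
from collections import Counter


def to_genotype(dosage: float) -> int:
    """Round a dosage to 0, 1 or 2 using the cut-offs described in the post."""
    return 0 if dosage < 0.5 else (1 if dosage <= 1.5 else 2)


def non_integer_share(dosages: list[float]) -> dict[int, float]:
    """Fraction of each rounded genotype whose dosage was not a whole number."""
    totals: Counter = Counter()
    non_integer: Counter = Counter()
    for d in dosages:
        g = to_genotype(d)
        totals[g] += 1
        if not float(d).is_integer():
            non_integer[g] += 1
    return {g: non_integer[g] / totals[g] for g in sorted(totals)}


# Made-up dosages, only to show the shape of the calculation.
example = [0.0, 0.0, 0.1, 1.0, 1.2, 2.0, 1.9, 1.8]
shares = non_integer_share(example)
print(shares)

# The figure quoted in the post: 3,490 of 4,736 rounded "twos" were non-integer.
quoted_share = 100 * 3490 / 4736
print(f"{quoted_share:.2f}%")
```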

What does this mean?

I’ve no idea.

I think it might mean the imputation for this region of the genome might be a bit weird. rs113341849 has the same pattern, so it isn’t just this one SNP.

But I don’t know why it’s happened, or even whether it’s particularly relevant. I admit ignorance – this is something I’ve never looked for, let alone seen, and I don’t know enough to say what’s typical.

I looked at a few hundred other SNPs to see if this is just a function of the minor allele frequency, and so the imputation was naturally just less certain because there was less information. But while there is an association between the minor allele frequency and non-integer dosages across dosages, it doesn’t explain all the variance in the estimate. There were very few SNPs with patterns as pronounced as in rs113010081 and rs113341849, even for SNPs with far smaller minor allele frequencies.

Does this undermine my analysis, and make the paper’s more believable?

I don’t know.

I tried to look at this with a couple more analyses. In the “x” analyses, I only included participants with integer values of dose, and in the “y” analyses, I only included participants with dosages within 0.05 of an integer. You can see in the results table that only using integers removed any effect of either SNP. This could be evidence that the imputation is having an effect, or it could be chance. Who knows.


rs62625034

rs62625034 was directly measured, but not imputed, in the paper. Why?

It’s possibly because the SNP isn’t measuring what the probe was meant to measure. It clearly has a very different minor allele frequency in UK Biobank (0.1159) than in the GO-ESP population (~0.03). The paper states this means it’s likely measuring the delta-32 deletion, since the frequencies are similar and rs62625034 sits in the deletion region. This mismatch may have made it fail quality control.

But this raises a couple of issues. First is whether the missingness in rs62625034 is a problem: is the data missing completely at random, or not missing at random? If the former, great. If the latter, not great.

The second issue is that rs62625034 should be measuring a SNP, not a deletion. In people without the deletion, the probe could well be picking up people with the SNP. The rs62625034 measurement in UK Biobank should be a mixture between the deletion and a SNP. The R² between rs62625034 and the deletion is not 1 (although it is higher than for my SNPs – again, this was mentioned in an email to me from the authors, not in the paper), which could happen if the SNP is picking up more than the deletion.

The third issue, one I’ve realised only just now, is that previous research has shown that rs62625034 is not associated with lifespan in UK Biobank (and other datasets). This means that maybe it doesn’t matter that rs62625034 is likely picking up more than just the deletion.

Peter Joshi, author of the article, helpfully posted these tweets:


If I read this right, Peter used UK Biobank (and other data) to produce the above plot showing lots of SNPs and their association with mortality (the higher the SNP, the more it affects mortality).

Not only does rs62625034 not show any association with mortality, but how did Peter find a minor allele frequency of 0.035 for rs62625034 and the paper find 0.1159? This is crazy. A minor allele frequency of 0.035 is about the same as the GO-ESP population, so it seems perfectly fine, whereas 0.1159 does not.

I didn’t clock this when I first saw it (sorry Peter), but using the same datasets and getting different minor allele frequencies is weird. Properly weird. Like counting the number of men and women in a dataset and getting wildly different answers. Maybe I’m misunderstanding, it wouldn’t be the first time – maybe the minor allele frequencies are different because of something else. But they both used UK Biobank, so I have no idea how.
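For reference, this is how a minor allele frequency is usually worked out from 0/1/2 dosages: add up the allele counts, divide by two alleles per person, and take whichever allele is rarer. The sketch below is only that arithmetic, with made-up dosages; it does not explain the discrepancy, though it does show that the answer depends on which participants end up in the denominator.

```python
def minor_allele_frequency(dosages):
    """MAF from per-person dosages of one allele (0, 1 or 2; None = missing)."""
    called = [d for d in dosages if d is not None]
    p = sum(called) / (2 * len(called))   # frequency of the counted allele
    return min(p, 1 - p)                  # the minor allele is the rarer one

# Made-up dosages, purely illustrative
everyone = [0, 0, 0, 0, 0, 0, 0, 1, 1, 2]
with_missing = [0, 0, 0, None, None, None, None, 1, 1, 2]

print(minor_allele_frequency(everyone))      # uses all ten people
print(minor_allele_frequency(with_missing))  # same allele counts, fewer people
```

In the second call the same four copies of the allele are spread over six people instead of ten, so the frequency goes up. Whether anything like that happened here, I can’t say.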

I have no answer for this. I also feel like I’ve buried the lead in this post now. But let’s pretend it was all building up to this.

Conclusion

This paper has been enormously successful, at least in terms of publicity. I also like to think that my “post-publication peer review” and Rasmus’s reply represent a nice collaborative exchange that wouldn’t have been possible without Twitter. I suppose I could have sent an email, but that doesn’t feel as useful somehow.

However, there are many flaws with the paper that should have been addressed in peer review. I’d love to ask the reviewers why they didn’t insist on the following:

  • The sample should be well defined, i.e. “white British ancestry” not “British ancestry”
  • Standard exclusions should be made for sex mismatches, sex chromosome aneuploidy, participants with outliers in heterozygosity and missing rates, and withdrawals from the study (this is important to mention in all papers, right?)
  • Relatedness should either be accounted for in the analysis (e.g. Bolt-LMM) or related participants should be removed
  • Population stratification should be addressed both in the analysis (maximum principal components and centre) and in the limitations
  • All effect estimates should have confidence intervals (I mean, come on)
  • All survival curves should have confidence intervals (ditto)
  • If it’s a survival analysis, surely Cox regression is better than ratios of survival rates? Also, somewhere it would be useful to note how many people died, and separately for each dosage
  • One-tailed P values need a huge prior belief to be used in preference to two-tailed P values
  • Over-reliance on P values in interpretation of results is also to be avoided
  • Choice of SNP, if you’re only using one SNP, is super important. If your SNP has a very different minor allele frequency from a published paper using a very similar dataset, maybe reference it and state why that might be. Also note if there is any missing data, and why that might be ok
  • When there is an online supplement to a published paper, I see no legitimate reason why “data not shown” should ever appear
  • Putting code online is wonderful. Indeed, the paper has a good amount of transparency, with code put on github, and lab notes also put online. I really like this.
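To show that confidence intervals on a survival curve are not much to ask, here is a minimal sketch using the Kaplan-Meier estimator with Greenwood's variance formula and a plain normal-approximation interval. Those choices are mine, for illustration only; they are not what the paper did, and the follow-up times below are made up. A Cox regression would need a proper statistics package, so it isn't shown.

```python
import math

def kaplan_meier(times, events, z=1.96):
    """Return [(time, survival, lower, upper)] at each distinct death time.

    times  : follow-up time for each participant
    events : 1 if the participant died at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv, greenwood = 1.0, 0.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = leaving = 0
        # Gather everyone whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            surv *= 1 - deaths / n_at_risk
            if n_at_risk > deaths:
                # Greenwood's formula: running sum of d / (n * (n - d))
                greenwood += deaths / (n_at_risk * (n_at_risk - deaths))
            se = surv * math.sqrt(greenwood)
            curve.append((t, surv, max(0.0, surv - z * se), min(1.0, surv + z * se)))
        n_at_risk -= leaving
    return curve

# Made-up follow-up data, purely illustrative
for t, s, lo, hi in kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]):
    print(f"t={t}: S={s:.3f} (95% CI {lo:.3f} to {hi:.3f})")
```

Running this separately for each dosage group, and reporting the number of deaths in each, would cover several of the points above in one go.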

So, do I believe “CCR5-∆32 is deleterious in the homozygous state in humans”?

No, I don’t believe there is enough evidence to say that the delta-32 deletion in CCR5 affects mortality in people of white British ancestry, let alone people of other ancestries.

I know that this post has likely come out far too late to dam the flood of news articles that have already come out. But I kind of hope that what I’ve done will be useful to someone.

Conference time at the MRC Integrative Epidemiology Unit!

Dr Jack Bowden, Programme Leader

Follow Jack on Twitter

 

Every two years my department puts on a conference on the topic of Mendelian Randomization (MR), a field that has been pioneered by researchers in Bristol over the last two decades. After months of planning, including finding a venue, inviting speakers from around the world and arranging the scientific programme, it’s a week and a half to go and we’re almost there!

But what is Mendelian Randomization research all about I hear you ask? Are you sure you want to know? Please read on but understand there is no going back…..

Are you sure you want to know about Mendelian Randomisation?

Have you ever had the feeling that something wasn’t quite right, that you are being controlled in some way by a higher force?

Well, it’s true. We are all in The Matrix. Like it or not, each of us has been recruited into an experiment from the moment we were born. Our genes, which are given to us by our parents at the point of conception, influence every aspect of our lives: how much we eat, sleep, drink, weigh, smoke, study, worry and play. The controlling effect is cleverly very small, and scientists only discovered the pattern by taking measurements across large populations, so as individuals we generally don’t notice. But the effect is real, very real!

How can we fight back?

We cannot escape The Matrix, but we can fight back by extracting knowledge from this unfortunate experiment we find ourselves in and using it for society’s advantage. For example, if we know that our genes predict 1-2% of variation in Low-Density Lipoprotein cholesterol (LDL-c – the ‘bad’ cholesterol) in the population, we can see if genes known to predict LDL-c also predict later life health outcomes in a group of individuals such as an increased risk of heart disease. If they do, then it provides strong evidence that reducing LDL-c will reduce heart disease risk, and we can then take steps to act. This is, in essence, the science of Mendelian randomization. See here for a nice animation of the method by our Unit director, George Davey Smith – our Neo if you like.

An example of the mathematical framework that leads to our analysis (honest)

Mendelian randomization is very much a team effort, involving scientists with expertise across many disciplines. My role, as a statistician and data scientist, is to provide the mathematical framework to ensure the analysis is performed in a rigorous and reliable manner.

We start by drawing a diagram that makes explicit the assumptions our analysis rests on. The arrows show which factors influence which. In our case we must assume that a set of genes influence LDL-c, and can only influence heart disease risk through LDL-c. We can then translate this diagram into a system of equations that we apply to our data.

The great thing about Mendelian randomization is that, even when many other factors jointly influence LDL-c and heart disease risk, the Mendelian randomization approach should still work.
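To make that claim concrete, here is a small simulation sketch. Everything in it is an assumption chosen for illustration: a single gene variant, a continuous outcome standing in for heart disease risk, the effect sizes, and the function names. It compares a naive regression of outcome on exposure with the simplest Mendelian randomization estimate, the ratio of the gene-outcome association to the gene-exposure association.

```python
import random

def cov(a, b):
    """Population covariance of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

def simulate_mr(n=20000, true_effect=0.3, seed=1):
    """Simulate gene -> exposure -> outcome with an unmeasured confounder.

    Returns (naive_estimate, ratio_estimate) of the exposure's effect.
    """
    rng = random.Random(seed)
    g, x, y = [], [], []
    for _ in range(n):
        gi = (rng.random() < 0.3) + (rng.random() < 0.3)  # 0, 1 or 2 alleles
        u = rng.gauss(0, 1)                               # unmeasured confounder
        xi = 0.5 * gi + u + rng.gauss(0, 1)               # exposure (think LDL-c)
        yi = true_effect * xi + u + rng.gauss(0, 1)       # outcome
        g.append(gi)
        x.append(xi)
        y.append(yi)
    naive = cov(x, y) / cov(x, x)   # ordinary regression slope, confounded by u
    ratio = cov(g, y) / cov(g, x)   # gene-outcome slope / gene-exposure slope
    return naive, ratio

naive, ratio = simulate_mr()
print(f"true effect 0.30, naive {naive:.2f}, MR ratio {ratio:.2f}")
```

Because the confounder pushes exposure and outcome in the same direction, the naive slope overstates the effect, while the ratio estimate stays close to the true value. This only holds under the assumptions in the diagram: the gene affects the outcome through the exposure and nothing else.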

Recently, the validity of the Mendelian randomization approach has been called into question due to the problem of pleiotropy. In our example this would be when a gene affects heart disease through a separate unmodelled pathway.

 

An illustration of pleiotropy

This can lead to bias in the analysis and therefore misleading results. My research is focused on novel methods that try to overcome the issue of pleiotropy, by detecting and adjusting for its presence in the analysis. For further details please see this video.

The MR Data challenge

At this year’s conference we are organising an MR Data Challenge, to engage conference participants in exploring and developing innovative approaches to Mendelian randomization using a publicly available data set. At a glance, the data comprises information on 150 genes and their association with

  • 118 lipid measurements (including LDL cholesterol)
  • 7 health outcomes (including type II diabetes)

Eight research teams have submitted an entry to the competition, to describe how they would analyse the data and the conclusions they would draw. The great thing about these data is that the information on all 118 lipid traits can be simultaneously assessed to improve the robustness of the Mendelian randomization analysis.

Genetic data can help us understand how to resolve population health issues. Image credit: www.genome.gov

A key aim of the session is to bring together data scientists with experts from the medical world to comment on and debate the results. We will publish all of the computer code online so that anyone can re-run the analyses. In the future, we hope to add further data to this resource and for many new teams to join the party with their own analysis attempt.

Please come and join us at the MR conference in Bristol, 17-19 July. It promises to be epic!