Genealogical Musings: September 2017

Friday, September 29, 2017

MyHeritageDNA Matching Issues

I'm amazed, just not in a good way

I'm going to illustrate why your DNA match list at MyHeritageDNA shouldn't be trusted. This has been touched on in a few other blogs (see here and here), but I want to highlight and update some details which are really very concerning.

The main problem is that the system is clearly including a high percentage of either false positive matches or false negatives - and more than that, they are not necessarily weak matches with only small segments shared. Granted, false positives are a part of DNA matching no matter what company you test with. The nature of DNA means there are always going to be matches known as "Identical by State/Type" (IBS or IBT) versus "Identical by Descent" (IBD). Not to be confused with the medical bowel conditions using the same abbreviations, IBS matches are ones which share small amounts of DNA with you by chance, making them false positives, whereas Identical by Descent means your shared DNA comes from a common ancestor. IBS matches are unavoidable, but they typically only share small segments of DNA with you. This is why most companies have a cut off point where any match sharing no segments above about 7, 6, or 5 cM is automatically excluded, in an attempt to reduce the number of IBS/false positive matches included. The smaller the amount of DNA you share, the more likely it is an IBS match. According to ISOGG, "False positive matching rates of between 12% and 23% have been reported for Family Finder data, and up to 34% at Ancestry using their current algorithm." So this is a normal part of DNA matching.
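To make the cut off idea concrete, here is a minimal Python sketch of how a minimum segment size filters a match list. The match names, the data layout, and the 7 cM default are all made up for illustration; this is not any company's actual matching algorithm.

```python
# Hypothetical match list: (name, longest shared segment in cM).
matches = [
    ("Match A", 42.0),
    ("Match B", 12.3),
    ("Match C", 6.1),
    ("Match D", 4.8),
]

def apply_cutoff(matches, min_segment_cm=7.0):
    """Keep only matches whose longest shared segment meets the cut off."""
    return [(name, cm) for name, cm in matches if cm >= min_segment_cm]

# With a 7 cM cut off, the two small-segment matches are dropped.
print(apply_cutoff(matches))

# A looser 5 cM cut off lets one more (likelier IBS) match through.
print(apply_cutoff(matches, min_segment_cm=5.0))
```

Raising the cut off removes more chance matches, but it can also remove genuine distant cousins, which is why companies settle on different values.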

My closest match on MyHeritage is not a match to either my mother or father

The trouble is, not only are MyHeritageDNA's rates of false positives much higher (around 60% according to my results and others), but more alarmingly, they are not all matches who share only small segments at the bottom of your list like normal IBS matches do. Once you get into the more distant cousins in your match list, you know that some of them are going to be IBS. However, when your top, closest match (after immediate family), who shares 89.5 cM with you (a significant amount), doesn't match either your mom or dad, you know something is very wrong. These are not just expected IBS matches; there is clearly a problem with the DNA matching system.

In fact, MyHeritage's cut off point for minimum segment size to qualify as a match appears to be 12 cM (none of my matches had a longest segment below this), which should almost assure that all the matches are IBD, not IBS (the normal cut off points are typically 5-7 cM)... and yet about 63% of my matches are clearly false positives. One of the blogs I linked to above seems to suggest these false positives are the result of imputed data (explained on their blog). While I don't fully understand imputed data, the blog is written by a professional scientist and therefore a reliable source.

Of course, it's also possible instead of being false positives for me, some of them are legit matches to me and false negatives for my parents, but here again, to have such a high match to me not turning up for either of my parents, something is clearly still wrong. If that's the case, it makes you wonder how many strong, legit matches are missing from my own match list too. Supporting the false negatives theory is the fact that none of my matches are shared matches with my Dad (don't worry, he and I match as father/child at all venues so there's no question he is my father). One could argue that's just because anyone on my dad's side who has tested is too distant to also match me, but that would have to mean I also wouldn't have any shared matches with my paternal grandfather, and yet I do. DNA doesn't skip a generation - how can I have shared matches with my paternal grandfather, but not my father, unless they are false negatives for my dad?
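The check I've been describing, comparing my match list against each parent's, can be sketched in a few lines of Python. This is only an illustration: the function name and the match IDs are made up, and it assumes both parents have kits in the same database. Note that it can only flag a match as unexplained; it cannot tell whether that match is a false positive for the child or a false negative for a parent.

```python
def check_against_parents(child, mother, father):
    """Split a child's matches into those also seen for a parent and those not."""
    child, mother, father = set(child), set(mother), set(father)
    via_parent = child & (mother | father)   # matches shared with at least one parent
    unexplained = child - via_parent         # matches shared with neither parent
    share = len(unexplained) / len(child) if child else 0.0
    return via_parent, unexplained, share

# Made-up match IDs for illustration only.
child_matches = ["m1", "m2", "m3", "m4"]
mother_matches = ["m1", "m9"]
father_matches = ["m2"]

via_parent, unexplained, share = check_against_parents(
    child_matches, mother_matches, father_matches
)
print(sorted(unexplained), share)
```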

Of lesser concern is MyHeritage's relationship estimate ranges, which are often as specific as "1st cousin twice removed - 4th cousin", for example. No other company attempts to be as specific as this because there is so much overlap in how much DNA may be shared for different relationship types and degree. Looking at the chart to the right, you'll see that a 1st cousin twice removed is lumped into the same group, and therefore the same range of possible shared DNA, as a 2nd cousin. So why isn't the estimated range 2nd-4th cousins instead? Granted, relationship estimates take other things into consideration - not only the total amount of DNA shared, but over how many segments, and how long those segments are. Even so, if no other company is able to be as specific as 1st cousins twice removed instead of 2nd cousins, how have MyHeritage managed it? It also makes it difficult for people to understand their possible relationship with a match, since a lot of people don't even understand what "removed" means to begin with, and even those that do may have trouble knowing what relationships would be within a range that included a "removed". However, since the relationship range is only an estimate anyway, it's not a huge concern, just more of an annoyance.

I should note that all my tests (mine, my mom's, dad's, and paternal grandfather's) were transfers, meaning I had tested with another company and then uploaded my raw DNA data to MyHeritageDNA. I did not buy a test with MyHeritageDNA, and I've seen other blogs saying this makes a difference. However, since the matching database and algorithms are all the same regardless of whether you uploaded or bought a test, I don't see how this could be the case. If there's a false match in my list, then I am a false match in their list, regardless of whether one, both, or neither of us tested directly with MyHeritageDNA.

When you combine this very serious matching problem with the fact that their ethnicity report is also seemingly inferior for most people compared to other companies, it really makes their DNA test pretty worthless. Your mileage may vary regarding the ethnicity report, and of course DNA ethnicity reports are only estimates anyway, but in my experience, and that of many others, it is the least accurate out of all 4 of the big DNA genealogy companies. I highly recommend you don't get sucked in by their long running and continually reduced sales, but instead test with a more reliable company (even if it costs more) and then you can upload your raw DNA data to MyHeritage for free. Because that's about what their results are worth.

UPDATE (01/11/2018): Today I checked my kits' DNA matches at MyHeritage and so far, it looks like many of these issues have finally been resolved. The match in question who shared a significant amount of DNA with me but wasn't a match to either of my parents is now showing as a match to my dad and my paternal grandfather. In fact, my dad's kit now has a lot more matches than he had before (he previously only had 19 matches), many of them matches to me too, resolving the problem that we had no matches in common at first (except each other). Several of his original matches are now missing, suggesting they may have been false positives. And several of those still present have seen updates to how much DNA they share. So there have been a lot of changes, and it looks like they are good ones - matches now seem to make a lot more sense. Now I'm just looking forward to seeing updates in their ethnicity report.

Tuesday, September 19, 2017

A Gedmatch Admixture Guide: Parts 3 and 4

Continuing on from Parts 1 and 2 where I covered the different projects and calculators available for Admixture Proportions and what Oracle is and how to read it, I've had some requests to cover the other viewing options available like Admixture Proportions by Chromosome and Chromosome Painting. So that's what I'll be covering in Parts 3 and 4. For Part 5 on Spreadsheets, click here.

Part 3 - Admixture Proportions by Chromosome

How to find it: From your Gedmatch home page, under "Analyze your data" and then "DNA raw data", choose the option for "Admixture (Heritage)" like you did in Part 1, but this time you're going to select "Admixture Proportions by Chromosome" from the bullet list. Be sure to select a project and then calculator and put in your kit number like normal. I would go with whatever calculator you found reflected your known ancestry best. If you haven't read Part 1 yet, you should do so first.

Admixture Proportions by Chromosome shows you your admixture proportions as broken down by individual chromosome; or, in other words, what percentages of each chromosome are most commonly found in which populations/ethnicity. This gives you a much more detailed view of where your DNA is most commonly found.

Admixture proportions (or ethnicity percentages) broken down by chromosome

So with Eurogenes K13, it shows my chromosome 1 is 28.1% North Atlantic, 15.7% Baltic, 27.7% West Mediterranean, 16.9% West Asian, 10.9% East Mediterranean, and 1.1% Amerindian. This option can often show results in populations that don't show up in a normal Admixture Proportions calculator. However, always keep in mind small percentages may just be from "noise" - like a false positive. I have no Native American ancestry so the 1.1% Amerindian probably doesn't mean anything. You'll also note how I get some North Atlantic results, in varying amounts, on every single one of my chromosomes.

My Eurogenes K13 results

In my normal K13 results, I got 39.03% in North Atlantic, so this is just breaking that average of 39.03% down by chromosome. If you add up all the percentages for one population and divide it by 22 (number of chromosomes) you'll get your overall average for that population. You may note it's a little off from what the admixture calculator originally gave you - for example my average for North Atlantic when each chromosome is added up and divided by 22 is 38.89%, not the original 39.03%. I am not sure why that is, but it's such a small difference I'm not going to worry about it too much. If someone has more information on this discrepancy, please comment below!
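One possible explanation, offered as my own assumption rather than anything Gedmatch documents: chromosomes differ in length and in how many SNPs are evaluated on each, so a simple average that treats all 22 equally won't necessarily equal a genome-wide figure. The Python sketch below uses made-up percentages and SNP counts for just three chromosomes to show how a plain average and a SNP-weighted average can differ.

```python
# Made-up example values, not real Gedmatch output.
percent_by_chromosome = [28.1, 45.0, 40.0]    # one population's % on each chromosome
snps_by_chromosome = [60000, 55000, 30000]    # assumed per-chromosome SNP counts

def simple_mean(values):
    """Plain average: every chromosome counts equally."""
    return sum(values) / len(values)

def weighted_mean(values, weights):
    """Average where chromosomes with more SNPs count for more."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

simple = simple_mean(percent_by_chromosome)
weighted = weighted_mean(percent_by_chromosome, snps_by_chromosome)
print(round(simple, 2), round(weighted, 2))
```

If the overall calculator weights by SNPs (or by chromosome length) in some similar way, a small gap like 38.89% versus 39.03% is about what you'd expect.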

At the bottom it says "Number of SNPs eval" - this is just how many of your SNPs were used for the evaluation.

It doesn't show which particular segments each percentage is found on though, but that brings us to the next options.

Part 4 - Chromosome Painting and Reduced Size

How to find it: Same as above, but select "Chromosome Painting" or "Chromosome Painting - Reduced Size" from the bullet list instead.

Chromosome Painting is a visual representation of your admixture proportions not only by chromosome but by segments of each chromosome. The different colors show which segments of each chromosome were most similar to which populations. When there are overlapping colors on the same segment, it means that segment is found in more than one population. The higher the spike, the stronger the match to that population. So segments where there are solid blocks of one color are more solidly found in only that population. Above is just a small portion of one of my chromosomes (7, I believe), as an example of the various populations that will show up for any given segment.

You'll note there are numbers along the bottom of each chromosome - these mark the position along the chromosome in millions of base pairs. As a very rough rule of thumb, one centiMorgan corresponds to about one million base pairs, though the real ratio varies from one part of the genome to another. So if you have a segment painted with a certain color stretching from "10M" to "20M", for example, that's 10 million base pairs, or roughly 10 cM. Don't get too excited if you see colors for some unexpected populations - small segments could just be noise.
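As a quick worked example of reading those axis numbers, here is a small Python sketch. The function names are made up, and the default of 1 cM per million base pairs is only a rough rule of thumb, since the real ratio varies along each chromosome.

```python
def segment_length_mb(start_mb, end_mb):
    """Length of a painted segment in millions of base pairs."""
    return end_mb - start_mb

def rough_cm_estimate(start_mb, end_mb, cm_per_mb=1.0):
    """Very rough cM estimate, assuming a constant cM-per-Mb ratio."""
    return segment_length_mb(start_mb, end_mb) * cm_per_mb

# A segment painted from "10M" to "20M" on the axis.
print(segment_length_mb(10, 20), rough_cm_estimate(10, 20))
```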

Chromosome painting reduced size

The reduced size option just condenses it so it's easier to view on a single screen. After viewing the full size, you'll quickly see just how cumbersome it is to get an overview, so the reduced size is ideal for that. The full size is better for examining particular portions. They don't label each chromosome but they are listed chromosome 1 to 22, from left to right. They are also rotated so the start of each chromosome is at the bottom.

You may notice in either the full or reduced size (though it's more noticeable in full) that similar populations, or neighboring regions, often spike and dip almost in unison with each other. This is because neighboring regions tend to share a lot of DNA and be genetically similar, so when you see this, what you're seeing is that these portions of your DNA may be somewhat indistinguishable among two or more groups. This is important in understanding that not all DNA can be narrowed down, with any reliability, to the more specific areas or countries that so many people wish it could. It also illustrates why you might get results in a region that you have no known ancestry in when it neighbors a region you do have ancestry in.

23andMe's chromosome painting

If you tested with 23andMe, you may be somewhat familiar with chromosome painting already. 23andMe's option for it is a little more straightforward. It doesn't have all the spikes and dips, just solid blocks showing which segments were put into which groups (shown left). However, it does show the two sides of each chromosome whereas Gedmatch doesn't seem to do this. Although in some ways Gedmatch's painting is more detailed, it is essentially the same concept, just a slightly different approach.

As another example, below is also a graphic from 23andMe - it's not a part of your results from this company, it's just showing, in part, how they determine ethnicity. Their example uses the more detailed type of chromosome painting found at Gedmatch, and it is labelled to show the probability of each ancestry on one side with increasing percentages of likelihood. It can be found in their guide article on ancestry composition. Gedmatch's chromosome painting can be read the same way (ie, the higher the peak, the higher the probability of that segment being from that population).

Disclaimer: Please note I am not a professional in the genetics industry, and it is difficult to find information, particularly on some of the more advanced admixture tools on Gedmatch. This is how I have come to understand the results and tools through my own experiences and research, but if someone more knowledgeable can correct something I've misunderstood, or fill in some gaps, please let me know by commenting below.