Genetic Genealogy is a Restricted Group with 68 members.
There is a simple way to date SNPs, surname groups, and haplotype strings from counts on our dated YDNA phylogenetic tree!
I have just posted this note to the genealogydna@... forum:

I have had an interesting insight that I am working on. I can derive a dated YDNA phylogenetic tree, given a set of haplotypes that can be arbitrarily long. For a haplotype length of 37 markers, we can derive a time scale from a large number of pedigrees that gives the result that 10 RCC is about 433 years. The insight follows:

In our dated YDNA phylogenetic tree, if you count (along a constant RCC line on the tree) the number of times that a descendant line is crossed, that number N is related to RCC by an exponential of the form: N equals K times e to the power ax, that is, N = K e^(ax), where:
• N is the number of times a descendant line is crossed at each value of RCC on the tree,
• K is the number of testees in the sample of haplotypes we use to form the tree,
• x is RCC (a time scale derived from over 100 testee pedigrees),
• e is 2.71828..., the base of the natural logarithm,
• and 'a' is a constant of the set. Let's call 'a' the "tree factor".

Call this relation "the tree equation".

For our phylogenetic trees, 'a' is a negative number and probably is composed of factors that include:
• the average number of sons along the descendant lines
• the average rate at which descendant lines die out
• the average mutation rate in the set of testees
• characteristics of the testee set chosen for the tree, etc.

The tree equation is not a perfect exponential. It has glitches in it, but the quality of the exponential relationship can be quite high, with values of R-squared (the coefficient of determination) exceeding 0.9.

The fact that the relationship is exponential is not unexpected, since the growth of the world's population is also exponential.

This insight provides additional impetus to understanding what a well-defined set of inputs to the phylogenetic tree can tell us about the evolution of haplotypes, the dates of origin of family surname clusters and SNPs, and subhaplogroups. That date of origin is where N=1 in the tree equation, and it can be found either from the equation or from an extrapolation of the graph of N vs. RCC from which the tree was derived!

Sincerely, Bill Howard
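For readers who want to try the tree equation numerically, here is a minimal Python sketch, not Bill's actual procedure. It fits ln N = ln K + a·x by ordinary least squares, reports R-squared (computed on the logarithms), and solves N = 1 for x, which gives x = -ln(K)/a. The function names and the five data points are hypothetical: the counts are synthetic, generated from K = 100 and a = -0.05, and come from no real tree. The 43.3 years-per-RCC factor is simply the post's "10 RCC is about 433 years" for 37 markers. One further assumption: the fit leaves K free, whereas the post identifies K with the number of testees.

```python
import math

YEARS_PER_RCC = 43.3  # from the post: 10 RCC is about 433 years (37 markers)


def fit_tree_equation(rcc_values, crossing_counts):
    """Fit ln(N) = ln(K) + a*x by least squares; return (K, a, r_squared)."""
    log_counts = [math.log(n) for n in crossing_counts]
    n_points = len(rcc_values)
    mean_x = sum(rcc_values) / n_points
    mean_y = sum(log_counts) / n_points
    sxx = sum((x - mean_x) ** 2 for x in rcc_values)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(rcc_values, log_counts))
    a = sxy / sxx                      # the "tree factor" (slope in log space)
    ln_k = mean_y - a * mean_x         # intercept in log space
    ss_res = sum((y - (ln_k + a * x)) ** 2
                 for x, y in zip(rcc_values, log_counts))
    ss_tot = sum((y - mean_y) ** 2 for y in log_counts)
    return math.exp(ln_k), a, 1.0 - ss_res / ss_tot


def rcc_where_n_is_one(k, a):
    """Solve 1 = K * exp(a*x) for x, i.e. x = -ln(K) / a."""
    return -math.log(k) / a


# Synthetic, illustrative data only: N = 100 * exp(-0.05 * RCC).
rcc = [0.0, 10.0, 20.0, 30.0, 40.0]
counts = [100.0 * math.exp(-0.05 * x) for x in rcc]

k_fit, a_fit, r2 = fit_tree_equation(rcc, counts)
origin_rcc = rcc_where_n_is_one(k_fit, a_fit)
print(f"K = {k_fit:.1f}, a = {a_fit:.3f}, R-squared = {r2:.3f}")
print(f"N = 1 at RCC = {origin_rcc:.1f}, "
      f"about {origin_rcc * YEARS_PER_RCC:.0f} years")
```

With real counts the fit will not be perfect, so the R-squared value is the quantity to watch before trusting the extrapolation to N = 1.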
Bill:
I'm sorry you did that before we had hashed out our discussion about crossovers vs branch points. I believe your public statement is premature and significantly flawed. I thought we agreed to continue our discussions after the holidays.
My position is that you should interpret your tree by progressing from the root of the tree (i.e., from the progenitor) and moving forward in time counting branch points, instead of starting from the present and working backward counting crossovers. The difference is subtle but it yields a significant difference in the result. I suggest that an estimate for the date of the most recent common ancestor (MRCA) is directly related to the first branch point on the tree and does not need to be extrapolated, as you do graphically or from your derived exponential equation. Although your RCC approach uses clustering to develop a tree, the purpose is (or should be) to characterize clades from real past populations. And in the real world, clades are defined by mutations that cause branch points. If you wish to make some kind of adjustment to account for the difference in dates between the progenitor and the MRCA for the entire set of genotypes you are working with, then that adjustment also applies to all other branch points that define clades and subclades, each of which has its own progenitor and MRCA.
As I have mentioned to you before, within population genetics there is a well-developed coalescence theory. This theory is based on well-stated assumptions, such as a constant mutation rate and that the effect of mutations is neutral. The coalescence theory is now relatively mature, taking into consideration other factors such as population bottlenecks and interbreeding between populations. You implicitly make similar simplifying assumptions in your RCC approach to developing a tree and then calibrating it. However, once you have actually drawn a tree, your interpretation of that tree is significantly different and, I believe, incorrect. Subtle but significant.
Jim
====================
J. J. (Jim) Logan
Logan DNA Project, GenGenNV, ISOGG, GOONS, CWG/VASSAR
====================
(In reply to weh8@...'s post of 12/30/2013 3:01 PM.)
Jim,

In today's posting, I did not go into where you start on the tree, only that I have found that the fit is to an exponential. And you agreed to that. So, I do not understand why you think that mentioning that we are dealing with an exponential is SIGNIFICANTLY flawed. That seems a bit harsh to me.

In my work so far, I have seen no significant difference in the analysis whether you sample cuts through the tree at regular intervals or only at the branch points. In fact, in one comparison I made between the two ways of counting, I found that the R-squared of the cuts at regular intervals was slightly higher than that of the branch point cuts, but the difference, although it favors my approach, is not significant.

I also fail to see why you say that the result is significantly difficult depending on whether you count forward or backward in time. If we differ, I think it is because I believe that the progenitor lived before the TMRCA of the group and you believe that the progenitor is the MRCA. We apparently disagree here. I remember an exchange I had with a staff member at FTDNA a few years ago who seemed to understand what I was doing (dealing only with the TMRCA point), and they stated that the challenge was to date the progenitor of a haplogroup (or a SNP) who lived before the MRCA. That is what I believe I am doing here. We can only directly determine the earliest branch point, but that is where the mutation occurred, not when the progenitor lived. One of his sons was responsible for the branching.

Bye from Bill
Sent from Bill Howard's iPad

(In reply to J. J. (Jim) Logan <jjlnv@...>, Dec 30, 2013, at 18:42.)
See below.
====================
J. J. (Jim) Logan
Logan DNA Project, GenGenNV, ISOGG, GOONS, CWG/VASSAR
====================
(In reply to Bill Howard's message of 12/30/2013 8:00 PM.)

Yes, I agreed to an exponential fit to the graph of branch point times for the tree. And, no, you did not explicitly mention the direction of counting. However, you introduced it implicitly by stating that you were counting crossings for various values of RCC. In your approach you start with N=K and end with N=2, and then extend a graph or a derived equation to determine a value for N=1. I have been telling you that your counts are all off by one unit and that analyzing the tree from the other direction yields count values from N=1 to N=K-1. The whole curve is shifted. As I said, the difference is subtle but significant.
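The off-by-one point can be made concrete with a small Python sketch. This is only one possible reading of the two counting conventions, assuming a fully resolved binary tree with K testees and K-1 branch points at distinct RCC values; the function name and the example RCC values are hypothetical. Counting backward from the present assigns each branch point the number of lines crossed just on its present-day side (K down to 2); counting forward from the root assigns the number of lines on its root side (1 up to K-1).

```python
def counts_at_branch_points(branch_rcc, n_testees):
    """Return (backward, forward) lists of (RCC, N) pairs for a binary tree."""
    if len(branch_rcc) != n_testees - 1:
        raise ValueError("a fully resolved binary tree has K - 1 branch points")
    ordered = sorted(branch_rcc)  # most recent branch point (smallest RCC) first
    # Backward from the present: lines crossed just on the present side.
    backward = [(x, n_testees - i) for i, x in enumerate(ordered)]
    # Forward from the root: lines existing just on the root side.
    forward = [(x, n_testees - i - 1) for i, x in enumerate(ordered)]
    return backward, forward


# Hypothetical tree: K = 5 testees, branch points at these RCC values.
backward, forward = counts_at_branch_points([5.0, 12.0, 30.0, 60.0], 5)
for (x, n_back), (_, n_fwd) in zip(backward, forward):
    print(f"RCC {x:5.1f}: backward N = {n_back}, forward N = {n_fwd}")
```

Under this reading the two conventions differ by exactly one at every branch point, so only the forward count reaches N = 1 at the oldest branch point without extrapolation.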
Of course, the R-squared values are similar, because in each case you are using the exact same curve except for a slight shift in its position. I did not bring up the question of sampling at regular intervals; that is a different issue. I am addressing only the point of branch points vs. crossovers, and I believe that is much more significant.

Please forget about the discussion with FTDNA for now; that is not germane to the principle being discussed. I agree that you "fail to see why . . . the result is significantly different [not difficult]". It has been the point of contention for some time. You are correct when you say the earliest branch point is where the mutation occurred. I also agree that this point may not be the progenitor; however, it is the most recent common ancestor for the data set.

But now let's consider this branch point in more detail. You say one of the sons is responsible for the branching. Although it is a technical point, I disagree. It was the father (the most recent common ancestor) who was responsible, because he is the one who passed on the mutant marker(s). In any case, the son that was different instantly became the progenitor of a new branch. However, since we are dealing with a time scale of many generations, that son was probably not the most recent common ancestor of the persons in that branch, since there are likely many generations between his birth and any known mutations; the point in time associated with such mutations is symbolized by the next branch point on the mutant son's branch.
Now consider the son of the other branch, the one that does not have the mutation. That son is now an instant progenitor of that branch. (This is, of course, a simplified explanation, since there could actually be more sons or cousins to consider. Ultimately, however, branches die out, leaving a single common ancestor.) Similarly, he is probably not the most recent common ancestor of that branch; that is symbolized by the next branch point on that son's branch.
Now consider the father and two sons as a family symbolized by the first (in time) branch point. In summary, the father is the most recent common ancestor of the two branches, and the two sons are each progenitors of their respective branches. The date of the common ancestor and these two progenitors is estimated as that of the branch point. Then the date of the most recent common ancestor of each of the branches is determined by the RCC of the first (in time) branch points of their respective branches. A similar process can be carried forward until we ultimately reach the terminal points of the kits themselves. Thus we have identified an approximate point in time for each most recent common ancestor and progenitor for each branch in the tree.
Now look at the tree as a whole. Since we are typically dealing with a time scale of thousands of years at the first branch point, the difference in dates of birth of the father and each of his two sons is relatively small and can be ignored. We thus know the date of the progenitor of each branch and the most recent common ancestor of these branches. We do not know the progenitor of the data set. This is what we are trying to determine. There are several ways to do this. The best way, in my opinion, is to place this data set in the context of a larger data set and repeat the analysis. That is, build a bigger tree such that the tree you are analyzing is a branch. This is what I did some time ago to estimate the age of the various branches of mtDNA Haplogroup J. I did an analysis of the JT mega haplotype and from that determined the age of the origin of Haplotypes J and T. Another approach is a statistical analysis of the average path length from the origin to each terminal, followed by applying a mutation rate to determine a date; that is not feasible here since mutations per se are not available. Another is some sort of statistical projection. Perhaps a projection using a derived exponential equation is appropriate, but I would like to see it tested further, e.g., by applying it to branches. Another possible projection might be to do an analysis to determine the average ratio of progenitor date to MRCA date for the various branches and apply that as a projection for the whole tree.
I believe your RCC approach is useful and, as you know, I have been using it in my study on the "Origin and Distribution of the Logan Surname in the Early Modern British Isles". My critique has not been to cast doubt on its usefulness, but rather to make its interpretation conform to well-developed theory in population genetics and to justify it beyond simply saying "I found something that works".
Jim
(In reply to J. J. (Jim) Logan's message of Dec 30, 2013, 11:45 PM.)

When I posted my insight that the junction points on the dated YDNA phylogenetic tree could be described by a simple exponential function, I had not expected that Jim Logan would say it was significantly flawed, because he had already agreed (offline) that the run of junction points vs. RCC (time) was exponential. We have been having an offline discussion about how to determine the time when the progenitor lived, but his recent criticism unfortunately opens up the discussion more broadly. So, here is my reply.

It is true that it APPEARS from the tree that the progenitor MIGHT have lived at the RCC that corresponds to the point where the first pair of haplotypes share an MRCA, but in my discussions with both the FTDNA staff (which Jim says I should forget) and with others, it is apparent that the oldest junction point on the tree is only where the SAMPLE we use is shown to meet. But the sample is incomplete. The real progenitor lived earlier. If we had a complete sample, the progenitor would be located at an RCC that is very near that oldest junction point, but our samples are far from complete. More on this point at the end of this note.

Since there are descendant lines that die out and since the sample is incomplete, we can take the run of points that we see in the exponential and use them to ESTIMATE (not derive) the time to the true progenitor, since the run of points on the tree already results from lines that did not die out. The RATE of the run of junction points already reflects the dying-out process, and the remnants of that process are already incorporated in the run of points used to derive the exponential. (My work so far indicates that whether you count upward or downward is only a minor problem. Why? Because the same result is arrived at using either approach. The difference we see is both subtle and insignificant. The number of testees is K in the equation, not K-1.)

Jim should not ignore the discussion I had with FTDNA about how to find the progenitor. They, Sidney Sachs, and I agree that the earliest pair in the sample is not the progenitor, who must have lived earlier in time. The incompleteness of the sample, combined with lines dying out, dictates that you have to find an earlier time for the origin of the group, whether the group is a surname, a haplogroup or a SNP. Since those two impediments are incorporated in the run of the junction points expressed by the exponential, solving for the RCC at which N=1, the progenitor, makes sense.

Jim writes that he agrees that the earliest branch point may not be the progenitor. We both agree that the earliest branch point is where the mutation occurred and that it might not be the progenitor. We agree also that it is the MRCA of the (incomplete) data set. But in a more complete data set, the MRCA of the branch point and the progenitor will be closer together in time.

In an earlier email between us, I took two large clusters in the same tree and used the same process for each of them BEFORE they joined on the tree. Knowing the time (RCC) when they did join, I could compare their individual progenitors, anticipating that they would be the same. In both cases the progenitor of EACH big subcluster was at the branch point that was shown on the tree, within the errors of both determinations. So, if it works for a subcluster set, it should also work for the full set. That is the essence of the "proof": if the two progenitors of subclusters that are known to join are seen to be the same through this process, then the progenitor of the full set can be derived using the same process. This extrapolation gets neatly around the problem of incomplete sets and of lines dying out. It is the best estimate we can make from the data.

In Jim's penultimate paragraph, he suggests that "The best way in my opinion, is to place this data set in context of a larger data set and repeat the analysis. That is, build a bigger tree such that the tree you are analyzing is a branch." Jim should read our prior email exchanges and the paragraph just above, because that is exactly what I have done. In the case of the two subclusters, their progenitor is at the junction point where they connect, and the bigger tree is merely the larger set in which the subclusters are two components.

Jim then writes "Another possible projection might be .... to do an analysis to determine the average ratio of progenitor date to MRCA date for the various branches and apply that as a projection for the whole tree." I have also done that, and I found from an analysis of the tree-matrix combination that the multiplication factor to get from the earliest junction point on the tree to the time of the progenitor is about 1.22. In other words, take the RCC of the earliest tree junction point and multiply it by 1.22 to estimate the date of the progenitor. Completely independently of that, in my paper 1** the same problem arose when I was discussing how to determine the TMRCA of a surname cluster with the JoGG editor, Whit Athey, and on page 262 of that paper I derive the progenitor-determining factor as a multiplier of 52.7 instead of 43.3. That ratio (52.7/43.3 = 1.22) is in exact agreement with what I have found from the tree-matrix analysis of completely independent data, although the exact agreement is probably fortuitous. However, the sample sizes were in the same ballpark, which may explain why the results are the same to within the errors of the determinations. If they had been very different, the multiplier would have been lower than 1.22 for a much larger sample, because the derived time of the progenitor should be closer to the time of the earliest pair on the tree, for reasons that I mentioned earlier in this note.

I am pleased that Jim mentions that my RCC correlation approach is useful and that he has been using it in his studies.

Bye from Bill
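The arithmetic behind the 1.22 multiplier in the message above can be checked with a short Python sketch. The two years-per-RCC factors (43.3 and 52.7) are taken from the message; the function name and the example RCC of 50 for the earliest junction point are hypothetical, chosen only to show the calculation.

```python
YEARS_PER_RCC_MRCA = 43.3        # from the thread: 10 RCC is about 433 years
YEARS_PER_RCC_PROGENITOR = 52.7  # progenitor factor quoted in the message

# Ratio of the two factors; the message rounds this to 1.22.
multiplier = YEARS_PER_RCC_PROGENITOR / YEARS_PER_RCC_MRCA


def estimate_progenitor_rcc(earliest_junction_rcc, factor=multiplier):
    """Scale the RCC of the earliest junction point by the multiplier."""
    return earliest_junction_rcc * factor


# Hypothetical example: earliest junction point on the tree at RCC = 50.
progenitor_rcc = estimate_progenitor_rcc(50.0)
progenitor_years = progenitor_rcc * YEARS_PER_RCC_MRCA
print(f"multiplier = {multiplier:.3f}")
print(f"progenitor RCC = {progenitor_rcc:.2f}, about {progenitor_years:.0f} years")
```

Multiplying the RCC by the ratio and then converting at 43.3 years per RCC is the same as converting the original RCC at 52.7 years per RCC, which is why the two descriptions in the message agree.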
Bill:
I hereby retract and apologize for my statement that your approach was "significantly flawed". Although there are still differences in detail, it is clear that we have common objectives. Furthermore, I now believe any differences in results are "insignificant", certainly within any margin of error associated with the source data.
I do have some comments, but I will address them to you privately.
Jim
====================
J. J. (Jim) Logan
Logan DNA Project, GenGenNV, ISOGG, GOONS, CWG/VASSAR
====================