- Feb 23, 2013

Sorry for the misspelling, it is Poisson. If I trusted the data I could use Poisson to tell me how likely 2 logs with 6 QSOs would show up on the Queue Status page. That's not much use.

Keith

**From:** Pete Ennis <nelasat@...>
**To:** "ARRL-LOTW@yahoogroups.com" <ARRL-LOTW@yahoogroups.com>
**Sent:** Saturday, February 23, 2013 5:10 PM
**Subject:** Re: [ARRL-LOTW] Re: Duplicates

Poission statistics are used to predict the occurrence of a unit. In our case, logs. It is not used to find a mean or average.

Keith

**From:** "Joe Subich, W4TV" <lists@...>
**To:** ARRL-LOTW@yahoogroups.com
**Sent:** Saturday, February 23, 2013 4:13 PM
**Subject:** Re: [ARRL-LOTW] Re: Duplicates

Warning - long boring post on sampling methods - hit delete now if you are tired of this ongoing subject.

> K2DSL: You need a data point you don't have to solve for either total # of QSOs or previously processed QSOs.
> Formula: Total # QSOs uploaded - New QSOs Uploaded = Previously Uploaded QSOs
> T - N = P abbreviating the above.
> Data isn't available for T or P so you have 2 unknowns. What you have presented is your personal determination of T based on assumptions extrapolated from the data snapshots. That's what I noted in my original comment above - there are 2 unknowns.

I'm not using P to solve for T .... T is determined *independently* by sampling the entire population and P is calculated from that. There is only one unknown - T (total QSOs) since P is T-N and we know N from the status data at https://p1k.arrl.org/lotwuser/default.

> Your start date is 12/3/12 but in looking at the data, 1/18 is when the system seems to have reliably caught up where the queue reached 0 and didn't have any significant backlogs re-occur. That's a sampling of just 36 days.

The backlog reliably went under one day at 17:59 12 January, below four hours at 04:59 on 13 January and less than one hour at 09:59 on 13 Jan. For various reasons - including large log uploads, large numbers of uploads, and system outages - delays have ranged from zero to 3 hours, 13 minutes since that time but for our purposes LotW can be considered to be "operating normally" (or have "caught up") by 9:59 on 1/13.
However, whether "normal" operation is defined as starting 1/13 or 1/18 is immaterial.

> I also removed any hourly data points where there was no data reported leaving data where there was at least 1 log in the queue. This reduced the number of hourly data points to 738. I didn't look to see what Joe previously stated the sample size was but that's what I consider the sample size of the data for analysis purposes.

We are *not concerned* with the number of sample points - the purpose is not to determine the number of logs being processed as we already know that *exactly* from https://p1k.arrl.org/lotwuser/default. Our purpose is to determine the average number of QSOs in the log so we can determine the number of QSOs processed: T = Number of logs (N) times the average number of QSOs in each log (Q).

In the period I used - 23:59z 13 Jan, 2013 through 23:59z 22 Feb 2013 - there are 30,318 logs in the sample (excluding samples that are not valid because of duplication when the backlog is longer than one hour) from a total population of 194,874 logs ("user files") processed in that time period. That means the sample is slightly more than 15% of all logs processed (which is a huge level of "over sampling").

> The median helps to minimize skewing from the more extreme outliers in the data such as hourly snapshots with very small # logs & QSOs and with very large # of logs & QSOs. The median of 101 QSOs per snapshot is approximately the middle value where 1/2 of the 738 snapshots are less than 101 QSOs and 1/2 the 738 snapshots are greater than 101 QSOs.

Median, mean and standard deviation are for a normal distribution. This population is far from "normal" - it is a Poisson process. A Poisson process is one in which events happen discretely and are independent - the number of customers arriving at a bank in an hour, the number of trees in an acre of forest, the number of pieces of litter along one mile of highway, etc.
The number of QSOs in a log upload is also a Poisson process. In Poisson statistics we deal with only a mean and variance (which are identical). The entire goal here is to have enough samples (k) that the mean we calculate is, with high probability, within the error value we are willing to accept of the true mean of the whole population. By over sampling we assure that the calculated value is "close enough" - thanks to the properties of limits (the error in a sample goes to zero as the size of the sample reaches the whole population).

> Averaging comes out to be 328 QSOs per log (10801/33). The median comes out to be 25 QSOs per log (101/4). You can see that assuming the average number is representative of all logs shows a vastly different picture from what the median shows across 738 snapshots.

As I've shown above, the number of "snapshots" means nothing - it is the number of logs in the sample that is important - the greater the number of logs, the more accurate will be the estimation of the mean.

In any case, we know absolutely that your "normal" median can not be anywhere close to the actual median as for the last five weeks the average (mean) number of *new* QSOs per log (New QSOs divided by User Files from https://p1k.arrl.org/lotwuser/default) was 63, 68, 65, 82 and 70. Averaged across the entire period the average number of *new* QSOs per log is 70. The fact that "new" QSOs alone are nearly three times greater than your median would argue that the true mean is much closer to what you give as the mean.

The "maximum likelihood" of the Poisson distribution happens to be the simple mean but that is also a *minimum value unbiased estimator.* This is more statistical theory than we need to go into here - but it simply says the mean of any Poisson distribution will be no lower than the simple mean of the samples.

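The arithmetic quoted in the paragraphs above can be rechecked with a short sketch. It uses only figures stated in this thread; the variable names are illustrative, and the unweighted mean of the five weekly values is an assumption standing in for the period average (the exact figure would weight each week by its number of logs).

```python
from statistics import mean

# Figures quoted in the thread (not independently verified).
weekly_new_qsos_per_log = [63, 68, 65, 82, 70]  # New QSOs / User Files, five weeks
median_based_estimate = 101 / 4                 # K2DSL's median-based QSOs per log
sample_logs, population_logs = 30_318, 194_874  # logs sampled vs. logs processed

# Assumption: an unweighted mean of the weekly figures approximates the
# period average; the exact value would weight each week by its log count.
avg_new_per_log = mean(weekly_new_qsos_per_log)
ratio_to_median = avg_new_per_log / median_based_estimate
sample_fraction = sample_logs / population_logs

print(f"average new QSOs per log: {avg_new_per_log:.1f}")
print(f"ratio to median-based estimate: {ratio_to_median:.2f}")
print(f"sample fraction of all logs: {sample_fraction:.1%}")
```

Under these assumptions the weekly figures average to roughly 70 new QSOs per log, a little under three times the median-based estimate, and the sample covers a bit more than 15% of the logs processed.
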
Again, the only issue becomes whether the number of independent samples - in this case the *sum of the logs* in the hourly reports - is large enough that their mean will, with high probability, be within N% of the mean of the population. With a sample size that exceeds 15% - the answer is an unequivocal *yes* for the entire six week period. I have not calculated the sample sizes for the individual weeks but I have no reason to doubt that they will also be more than sufficient.

It is unfortunate that ARRL have not seen fit to release the "input" data the way they release the "new QSOs" and "user files" numbers as there would be no question concerning the level of wasted processing but even your mean (maximum likelihood) puts the level of previously processed QSOs at more than 75% for the five week period.

73,

... Joe, W4TV

On 2/23/2013 12:29 PM, David Levine wrote:
> For those that don't care about statistics or a different view of the statistics being discussed, press delete.
>
> I only had a brief amount of time today to look at the data Joe sent me before I need to head out. Before I recap some findings, I need to address 1 comment Joe made:
>
>> I just don't believe your assumptions/extrapolation to be accurate. It's like in a simple math equation.... A + x = y and you are solving for both. We know A but the doubt has been on the accuracy of x, something you adamantly stomp your feet as accurate and others feel is not the case.
>
> W4TV: The analysis solves for only one thing ... the only unknown is the total number of QSOs processed each week. Since we know exactly the number of files processed, the only issue is determining the average number of QSOs per file which is a straight forward problem.
>
> K2DSL: You need a data point you don't have to solve for either total # of QSOs or previously processed QSOs.
> Formula: Total # QSOs uploaded - New QSOs Uploaded = Previously Uploaded QSOs
> T - N = P abbreviating the above.
> Data isn't available for T or P so you have 2 unknowns. What you have presented is your personal determination of T based on assumptions extrapolated from the data snapshots. That's what I noted in my original comment above - there are 2 unknowns.
>
> In my brief analysis of the data which I will look further into later or tomorrow, I notice the following:
> Your start date is 12/3/12 but in looking at the data, 1/18 is when the system seems to have reliably caught up where the queue reached 0 and didn't have any significant backlogs re-occur. That's a sampling of just 36 days.
>
> I also removed any hourly data points where there was no data reported leaving data where there was at least 1 log in the queue. This reduced the number of hourly data points to 738. I didn't look to see what Joe previously stated the sample size was but that's what I consider the sample size of the data for analysis purposes.
>
> In looking at those 738 hourly snapshots I come up with the following:
> Average # logs in the queue shown across the hourly snapshots: 33
> Median # logs in the queue shown across the hourly snapshots: 4
>
> Average # of QSOs shown across the hourly snapshots: 10,801
> Median # of QSOs shown across the hourly snapshots: 101
>
> The median helps to minimize skewing from the more extreme outliers in the data such as hourly snapshots with very small # logs & QSOs and with very large # of logs & QSOs. The median of 101 QSOs per snapshot is approximately the middle value where 1/2 of the 738 snapshots are less than 101 QSOs and 1/2 the 738 snapshots are greater than 101 QSOs.
>
> Averaging comes out to be 328 QSOs per log (10801/33). The median comes out to be 25 QSOs per log (101/4). You can see that assuming the average number is representative of all logs shows a vastly different picture from what the median shows across 738 snapshots.
> I don't believe either of these numbers to be accurate for extrapolation in determining the # of total QSOs per log. This number is important because it is then used with the # of logs uploaded (actual data point provided) to determine the total # of QSOs uploaded (one of the missing values noted at the top of my post). Joe's premise is that his calculated average is "statistically" valid, and that is what others are challenging in their responses.
>
> K2DSL - David
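
To make the T - N = P relation discussed in this thread concrete, here is a minimal sketch that plugs in the per-log figures quoted above: the two competing estimates of total QSOs per log (10801/33 and 101/4) and the reported average of about 70 new QSOs per log. The function name is hypothetical, and working per log assumes T and N are both taken over the same set of logs, so the log count cancels out of the ratio.

```python
def previously_uploaded_share(total_per_log: float, new_per_log: float) -> float:
    """Return P/T = (T - N)/T, using per-log averages for T and N."""
    if total_per_log <= 0:
        raise ValueError("total_per_log must be positive")
    return 1.0 - new_per_log / total_per_log


# Figures quoted in the thread (not independently verified).
new_per_log = 70                 # average new QSOs per log over the period
mean_based = 10_801 / 33         # average QSOs / average logs per snapshot
median_based = 101 / 4           # median QSOs / median logs per snapshot

for label, total in (("mean-based", mean_based), ("median-based", median_based)):
    share = previously_uploaded_share(total, new_per_log)
    print(f"{label}: previously uploaded share = {share:.1%}")
```

With the mean-based estimate the share comes out a little under 79%, in line with the "more than 75%" mentioned in the thread. With the median-based estimate the share is negative, which is not possible for a real count; this is another way of stating the objection that new QSOs per log alone already exceed that estimate.
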