## Re: [ARRL-LOTW] Re: Duplicates

Message 1 of 147 , Feb 23, 2013
> Poisson statistics are used to predict the occurrence of a unit

Events per unit of time are one example of a Poisson process but they
are not the *only* case. The generalized definition of a Poisson
process is any given number of events in a fixed interval or area.
Poisson statistics are equally valid for describing the number of beer
bottles in (each) trash bin in a stadium as they are for describing the
number of cars passing the stadium each hour.

Each log can be considered an independent bin or area - the number of
QSOs contained in each log is independent of the number of QSOs in each
of the other logs - thus the number of QSOs per log is Poisson.
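
One property of a Poisson-distributed count that can be checked directly is that its mean and variance are equal. The sketch below sums the Poisson probability mass function numerically to confirm this. The rate `lam = 70.0` and the helper names are illustrative assumptions, not values or code taken from the LoTW data.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for a Poisson count with rate lam, computed in log space."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def poisson_mean_and_variance(lam: float, k_max: int = 2000) -> tuple[float, float]:
    """Numerically sum the PMF up to k_max to get the mean and variance."""
    mean = sum(k * poisson_pmf(k, lam) for k in range(k_max + 1))
    second_moment = sum(k * k * poisson_pmf(k, lam) for k in range(k_max + 1))
    return mean, second_moment - mean * mean

lam = 70.0  # hypothetical rate: average number of events per bin
mean, variance = poisson_mean_and_variance(lam)
print(f"mean={mean:.3f} variance={variance:.3f}")
```

A real sample can be compared against this: if its variance is far larger than its mean, the counts are over-dispersed relative to a Poisson model.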

73,

... Joe, W4TV

On 2/23/2013 6:10 PM, Pete Ennis wrote:
> Poisson statistics are used to predict the occurrence of a unit. In our case logs. It is not used to find a mean or average.
>
> Keith
>
> From: "Joe Subich, W4TV" <lists@...>
> To: ARRL-LOTW@yahoogroups.com
> Sent: Saturday, February 23, 2013 4:13 PM
> Subject: Re: [ARRL-LOTW] Re: Duplicates
>
>
>
> Warning - long boring post on sampling methods - hit delete now if you
> are tired of this ongoing subject.
>
>> K2DSL: You need a data point you don't have to solve for either total
>> # of QSOs or previously processed QSOs.
>> Formula: Total # QSOs uploaded - New QSOs Uploaded = Previously
>> Uploaded QSOs T - N = P abbreviating the above.
>> Data isn't available for T or P so you have 2 unknowns. What you
>> have presented is your personal determination of T based on
>> assumptions extrapolated from the data snapshots. That's what I noted
>> in my original comment above - there are 2 unknowns.
>
> I'm not using P to solve for T .... T is determined *independently* by
> sampling the entire population and P is calculated from that. There is
> only one unknown - T (total QSOs) since P is T-N and we know N from the
> status data at https://p1k.arrl.org/lotwuser/default.
>
>> Your start date is 12/3/12 but in looking at the data, 1/18 is when
>> the system seems to have reliably caught up where the queue reached
>> 0 and didn't have any significant backlogs re-occur. That's a
>> sampling of just 36 days.
>
> The backlog reliably went under one day at 17:59 12 January, below four
> hours at 04:59 on 13 January and less than one hour at 09:59 on 13 Jan.
> For various reasons - including large log uploads, large numbers of
> uploads, and system outages - delays have ranged from zero to 3 hours,
> 13 minutes since that time but for our purposes LotW can be considered
> to be "operating normally" (or have "caught up") by 9:59 on 1/13.
>
> However, whether "normal" operation is defined as starting 1/13 or 1/18
> is immaterial.
>
>> I also removed any hourly data points where there was no data
>> reported leaving data where there was at least 1 log in the queue.
>> This reduced the number of hourly data points to 738. I didn't look
>> to see what Joe previously stated the sample size was but that's
>> what I consider the sample size of the data for analysis purposes.
>
> We are *not concerned* with the number of sample points - the purpose
> is not to determine the number of logs being processed as we already
> know that *exactly* from https://p1k.arrl.org/lotwuser/default. Our
> purpose is to determine the average number of QSOs in the log so we
> can determine the number of QSOs processed: T = Number of logs (N)
> times the average number of QSOs in each log (Q).
>
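The arithmetic described above can be sketched as follows. The function and variable names are hypothetical, and the inputs in the example are made-up round numbers rather than LoTW figures. The thread uses N both for new QSOs and for the number of logs, so the sketch spells the names out.

```python
def estimate_totals(num_logs: int, mean_qsos_per_log: float, new_qsos: int) -> tuple[float, float]:
    """Return (T, P): estimated total QSOs uploaded and previously uploaded QSOs."""
    total_qsos = num_logs * mean_qsos_per_log    # T = logs x average QSOs per log
    previously_uploaded = total_qsos - new_qsos  # P = T - N
    return total_qsos, previously_uploaded

# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
total, previous = estimate_totals(num_logs=1_000, mean_qsos_per_log=300.0, new_qsos=70_000)
print(total, previous)
```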
> In the period I used - 23:59z 13 Jan, 2013 through 23:59z 22 Feb 2013 -
> there are 30,318 logs in the sample (excluding samples that are not
> valid because of duplication when the backlog is longer than one hour)
> from a total population of 194,874 logs ("user files") processed in
> that time period. That means the sample is slightly more than 15% of
> all logs processed (which is a huge level of "over sampling").
>
>> The median helps to minimize skewing from the more extreme outliers
>> in the data such as hourly snapshots with very small # logs & QSOs
>> and with very large # of logs & QSOs. The median of 101 QSOs per
>> snapshot is approximately the middle value where 1/2 of the 738
>> snapshots are less than 101 QSOs and 1/2 the 738 snapshots are
>> greater than 101 QSOs.
>
> Median, mean and standard deviation are for a normal distribution.
> This population is far from "normal" - it is a Poisson process. A
> Poisson process is one in which events happen discretely and are
> independent - the number of customers arriving at a bank in an hour,
> the number of trees in an acre of forest, the number of pieces of
> litter along one mile of highway, etc. The number of QSOs in a log
> upload is also a Poisson process.
>
> In Poisson statistics we deal with only a Mean and variance (which
> are identical). The entire goal here is to have enough samples (k)
> so that the mean we calculate is, with high probability, within the
> error value we are willing to accept of the true mean of the whole
> population. By over sampling we assure that the calculated value
> is "close enough" - thanks to the properties of limits (the error
> in a sample goes to zero as the size of the sample reaches the
> whole population).
>
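A rough way to quantify "close enough" is the standard error of the sample mean with a finite population correction, which shrinks to zero as the sample approaches the whole population. The sketch below assumes simple random sampling and, following the Poisson argument, a variance equal to the mean. The mean of 328 is borrowed from a figure quoted in this thread purely for illustration, and the function names are hypothetical. If the real per-log counts are more variable than a Poisson model implies, the true error would be larger.

```python
import math

def finite_population_correction(sample_size: int, population_size: int) -> float:
    """sqrt((N - n) / (N - 1)); equals 0 when the sample is the whole population."""
    return math.sqrt((population_size - sample_size) / (population_size - 1))

def standard_error_of_mean(variance: float, sample_size: int, population_size: int) -> float:
    """Standard error of the sample mean, sampling without replacement."""
    fpc = finite_population_correction(sample_size, population_size)
    return math.sqrt(variance / sample_size) * fpc

population_size = 194_874  # logs processed in the period (from the post)
sample_size = 30_318       # logs in the sample (from the post)
assumed_mean = 328.0       # illustrative; Poisson assumption => variance == mean

se = standard_error_of_mean(assumed_mean, sample_size, population_size)
print(f"sample fraction = {sample_size / population_size:.1%}")
print(f"standard error  = {se:.3f} QSOs per log")
```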
>> Averaging comes out to be 328 QSOs per log (10801/33). The median
>> comes out to be 25 QSOs per log (101/4). You can see that assuming
>> the average number is representative of all logs shows a vastly
>> different picture from what the median shows across 738 snapshots.
>
> As I've shown above, the number of "snapshots" means nothing - it is
> the number of logs in the sample that is important - the greater the
> number of logs, the more accurate will be the estimation of the mean.
> In any case, we know absolutely that your "normal" median can not be
> anywhere close to the actual median as for the last five weeks the
> average (mean) number of *new* QSOs per log (New QSOs divided by User
> Files from https://p1k.arrl.org/lotwuser/default) was 63, 68, 65, 82
> and 70. Averaged across the entire period the average number of *new*
> QSOs per log is 70. The fact that "new" QSOs alone are nearly three
> times greater than your median would argue that the true mean is much
> closer to what you give as the mean. The "maximum likelihood" of the
> Poisson distribution happens to be the simple mean but that is also
> a *minimum value unbiased estimator.* This is more statistical theory
> than we need to go into here - but it simply says the mean of any
> Poisson distribution will be no lower than the simple mean of the
> samples.
>
> Again, the only issue becomes whether the number of independent samples
> - in this case the *sum of the logs* in the hourly reports - is large
> enough that their mean will, with high probability, be within N% of
> the mean of the population. With a sample size that exceeds 15% - the answer is
> an unequivocal *yes* for the entire six week period.
>
> I have not calculated the sample sizes for the individual weeks but I
> have no reason to doubt that they will be more than sufficient
> as well.
>
> It is unfortunate that ARRL have not seen fit to release the "input"
> data the way they release the "new QSOs" and "user files" numbers as
> there would be no question concerning the level of wasted processing
> but even your mean (maximum likelihood) puts the level of previously
> processed QSOs at more than 75% for the five week period.
>
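The "more than 75%" figure follows from the per-log numbers quoted in this thread: about 70 new QSOs per log against an estimated mean of 328 total QSOs per log. A minimal sketch of that arithmetic, with a hypothetical function name:

```python
def previously_processed_fraction(new_per_log: float, total_per_log: float) -> float:
    """Fraction of uploaded QSOs that were already processed: 1 - new / total."""
    return 1.0 - new_per_log / total_per_log

new_per_log = 70.0  # average new QSOs per log over the period (from the post)

# 328 and 25 are the mean-based and median-based figures quoted in the thread.
for label, total_per_log in (("mean", 328.0), ("median", 25.0)):
    fraction = previously_processed_fraction(new_per_log, total_per_log)
    print(f"{label}: {fraction:.1%}")
```

With 328 the fraction is roughly 79%. With 25 it is negative, which is impossible, and that is why 25 cannot stand in for the average number of QSOs per log.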
> 73,
>
> ... Joe, W4TV
>
> On 2/23/2013 12:29 PM, David Levine wrote:
>> For those that don't care about statistics or a different view of the
>> statistics being discussed, press delete.
>>
>> I only had a brief amount of time today to look at the data Joe sent me
>> before I need to head out. Before I recap some findings, I need to address
>>
>>> I just don't believe your assumptions/extrapolation to be accurate.
>>> It's like in a simple math equation.... A + x = y and you are solving
>>> for both. We know A but the doubt has been on the accuracy of x,
>>> something you adamantly stomp your feet as accurate and others feel is
>>> not the case.
>>
>> W4TV: The analysis solves for only one thing ... the only unknown is the
>> total number of QSOs processed each week. Since we know exactly the
>> number of files processed, the only issue is determining the average
>> number of QSOs per file which is a straight forward problem.
>>
>> K2DSL: You need a data point you don't have to solve for either total # of
>> QSOs or previously processed QSOs.
>> Formula: Total # QSOs uploaded - New QSOs Uploaded = Previously Uploaded
>> QSOs
>> T - N = P abbreviating the above.
>> Data isn't available for T or P so you have 2 unknowns. What you have
>> presented is your personal determination of T based on assumptions
>> extrapolated from the data snapshots. That's what I noted in my original
>> comment above - there are 2 unknowns.
>>
>> In my brief analysis of the data which I will look further into later or
>> tomorrow, I notice the following:
>> Your start date is 12/3/12 but in looking at the data, 1/18 is when the
>> system seems to have reliably caught up where the queue reached 0 and
>> didn't have any significant backlogs re-occur. That's a sampling of just 36
>> days.
>>
>> I also removed any hourly data points where there was no data reported
>> leaving data where there was at least 1 log in the queue. This reduced the
>> number of hourly data points to 738. I didn't look to see what Joe
>> previously stated the sample size was but that's what I consider the sample
>> size of the data for analysis purposes.
>>
>> In looking at those 738 hourly snapshots I come up with the following:
>> Average # logs in the queue shown across the hourly snapshots: 33
>> Median # logs in the queue shown across the hourly snapshots: 4
>>
>> Average # of QSOs shown across the hourly snapshots: 10,801
>> Median # of QSOs shown across the hourly snapshots: 101
>>
>> The median helps to minimize skewing from the more extreme outliers in the
>> data such as hourly snapshots with very small # logs & QSOs and with very
>> large # of logs & QSOs. The median of 101 QSOs per snapshot is
>> approximately the middle value where 1/2 of the 738 snapshots are less than
>> 101 QSOs and 1/2 the 738 snapshots are greater than 101 QSOs.
>>
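The effect described here, a few very large values pulling the average far above the median, can be seen with Python's `statistics` module. The snapshot values below are invented for illustration and are not taken from the LoTW data.

```python
import statistics

# Hypothetical QSO counts for five hourly snapshots; one is a very large upload.
snapshot_qsos = [1, 2, 3, 4, 1000]

mean_qsos = statistics.mean(snapshot_qsos)
median_qsos = statistics.median(snapshot_qsos)
print(mean_qsos, median_qsos)
```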
>> Averaging comes out to be 328 QSOs per log (10801/33). The median comes out
>> to be 25 QSOs per log (101/4). You can see that assuming the average number
>> is representative of all logs shows a vastly different picture from what
>> the median shows across 738 snapshots. I don't believe either of these
>> numbers to be accurate for extrapolation in determining the # of total QSOs
>> per log. This number is important because it is then used with the # of
>> logs uploaded (actual data point provided) to determine the total # of QSOs
>> uploaded (one of the missing values noted at the top of my post). Joe's
>> premise is that his calculated average is "statistically" valid and what
>> others are challenging in their responses.
>>
>> K2DSL - David
>>
>
>
Message 147 of 147 , Aug 24, 2014
I run ACLog... When you hit the "ALL SINCE" button, change the date to
be something about a week prior to the LoTW failure...

I "believe", ACLog got very confused as a result of the fail mode of
LoTW. That corrected a very similar problem for me.
--
Thanks and 73's,
For equipment, and software setups and reviews see:
www.nk7z.net
for MixW support see;
http://groups.yahoo.com/neo/groups/mixw/info
for Dopplergram information see:
http://groups.yahoo.com/neo/groups/dopplergram/info
for MM-SSTV see:
http://groups.yahoo.com/neo/groups/MM-SSTV/info

On Sun, 2014-08-24 at 09:05 -0700, reillyjf@... [ARRL-LOTW]
wrote:
>
>
> Thanks for the suggestion. I did a complete download, and beat the
> number of duplicates down from 275 to 30. Not exactly sure why the
> N3FJP ACL is missing this information.
> - 73, John, N0TA
>
>
>