
Re: [ARRL-LOTW] Re: Duplicates

  • Joe Subich, W4TV
    Message 1 of 147, Feb 23, 2013
      > Poisson statistics are used to predict the occurrence of a unit

      Events per unit of time are one example of a Poisson process, but they
      are not the *only* case. More generally, a Poisson process describes
      counts of independent events in any fixed interval or region, not just
      an interval of time. Poisson statistics are equally valid for describing
      the number of beer bottles in (each) trash bin in a stadium as they are
      for describing the number of cars passing the stadium each hour.

      Each log can be considered an independent bin or area - the number of
      QSOs contained in each log is independent of the number of QSOs in each
      of the other logs - thus the number of QSOs per log is Poisson.
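
      As a quick illustration - a minimal simulation sketch with a made-up
      rate, not LotW data - counts drawn for independent "bins" show the
      defining Poisson property that the mean and variance come out roughly
      equal:

      # Hypothetical illustration: each log is an independent "bin" and the
      # number of QSOs it contains is drawn from a Poisson distribution.
      # The rate of 70 QSOs per log is an example value, not measured data.
      import numpy as np

      rng = np.random.default_rng(1)
      counts = rng.poisson(lam=70, size=10_000)   # 10,000 simulated logs

      print("mean     :", counts.mean())   # ~70
      print("variance :", counts.var())    # also ~70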

      73,

      ... Joe, W4TV


      On 2/23/2013 6:10 PM, Pete Ennis wrote:
      > Poisson statistics are used to predict the occurrence of a unit - in our case, logs. They are not used to find a mean or average.
      >
      > Keith
      >
      > From: "Joe Subich, W4TV" <lists@...>
      > To: ARRL-LOTW@yahoogroups.com
      > Sent: Saturday, February 23, 2013 4:13 PM
      > Subject: Re: [ARRL-LOTW] Re: Duplicates
      >
      >
      >
      > Warning - long boring post on sampling methods - hit delete now if you
      > are tired of this ongoing subject.
      >
      >> K2DSL: You need a data point you don't have to solve for either total
      >> # of QSOs or previously processed QSOs.
      >> Formula: Total # QSOs uploaded - New QSOs Uploaded = Previously
      >> Uploaded QSOs
      >> T - N = P abbreviating the above.
      >> Data isn't available for T or P so you have 2 unknowns. What you
      >> have presented is your personal determination of T based on
      >> assumptions extrapolated from the data snapshots. That's what I noted
      >> in my original comment above - there are 2 unknowns.
      >
      > I'm not using P to solve for T .... T is determined *independently* by
      > sampling the entire population and P is calculated from that. There is
      > only one unknown - T (total QSOs) since P is T-N and we know N from the
      > status data at https://p1k.arrl.org/lotwuser/default.
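      >
      > To make that bookkeeping concrete, here is a minimal sketch with
      > hypothetical numbers - only the "user files" and "new QSOs" figures
      > would come from the status page; Q is the estimated mean QSOs per
      > log, and T and P are derived from it, not observed:
      >
      > # Sketch of T = logs x Q and P = T - N (all values hypothetical).
      > logs_processed = 40_000        # "user files" for the period (known)
      > mean_qsos_per_log = 300        # Q, estimated from the hourly samples
      > new_qsos = 2_800_000           # N, "new QSOs" from the status page
      >
      > total_qsos = logs_processed * mean_qsos_per_log   # T
      > previously_uploaded = total_qsos - new_qsos       # P = T - N
      > print(total_qsos, previously_uploaded)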
      >
      >> Your start date is 12/3/12 but in looking at the data, 1/18 is when
      >> the system seems to have reliably caught up where the queue reached
      >> 0 and didn't have any significant backlogs re-occur. That's a
      >> sampling of just 36 days.
      >
      > The backlog reliably went under one day at 17:59 on 12 January, below
      > four hours at 04:59 on 13 January, and under one hour at 09:59 on 13
      > Jan. For various reasons - including large log uploads, large numbers
      > of uploads, and system outages - delays have ranged from zero to 3
      > hours, 13 minutes since that time, but for our purposes LotW can be
      > considered to be "operating normally" (or to have "caught up") by
      > 09:59 on 1/13.
      >
      > However, whether "normal" operation is defined as starting 1/13 or 1/18
      > is immaterial.
      >
      >> I also removed any hourly data points where there was no data
      >> reported leaving data where there was at least 1 log in the queue.
      >> This reduced the number of hourly data points to 738. I didn't look
      >> to see what Joe previously stated the sample size was but that's
      >> what I consider the sample size of the data for analysis purposes.
      >
      > We are *not concerned* with the number of sample points - the purpose
      > is not to determine the number of logs being processed as we already
      > know that *exactly* from https://p1k.arrl.org/lotwuser/default. Our
      > purpose is to determine the average number of QSOs in the log so we
      > can determine the number of QSOs processed: T = Number of logs (N)
      > times the average number of QSOs in each log (Q).
      >
      > In the period I used - 23:59z 13 Jan, 2013 through 23:59z 22 Feb 2013 -
      > there are 30,318 logs in the sample (excluding samples that are not
      > valid because of duplication when the backlog is longer than one hour)
      > from a total population of 194,874 logs ("user files") processed in
      > that time period. That means the sample is slightly more than 15% of
      > all logs processed (which is a huge level of "over sampling").
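      >
      > As a sketch of how Q (the mean QSOs per log) is pooled from those
      > hourly snapshots - the tuples below are placeholders, not the real
      > series - each snapshot contributes its logs and its QSOs, so the
      > estimate is weighted by logs rather than by snapshots:
      >
      > # Pooled estimate of Q from hourly queue snapshots (placeholder data).
      > snapshots = [(12, 3_600), (3, 240), (51, 19_000)]   # (logs, QSOs)
      > total_logs = sum(logs for logs, _ in snapshots)
      > total_qsos = sum(qsos for _, qsos in snapshots)
      > q_hat = total_qsos / total_logs        # each *log* weighted equally
      > print(q_hat)
      >
      > # Sampling fraction quoted above:
      > print(30_318 / 194_874)                # ~0.156, slightly over 15%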
      >
      >> The median helps to minimize skewing from the more extreme outliers
      >> in the data such as hourly snapshots with very small # logs & QSOs
      >> and with very large # of logs & QSOs. The median of 101 QSOs per
      >> snapshot is approximately the middle value where 1/2 of the 738
      >> snapshots are less than 101 QSOs and 1/2 the 738 snapshots are
      >> greater than 101 QSOs.
      >
      > Median, mean and standard deviation, used that way, assume a roughly
      > normal distribution. This population is far from "normal" - it is a
      > Poisson process. A
      > Poisson process is one in which events happen discretely and are
      > independent - the number of customers arriving at a bank in an hour,
      > the number of trees in an acre of forest, the number of pieces of
      > litter along one mile of highway, etc. The number of QSOs in a log
      > upload is also a Poisson process.
      >
      > In Poisson statistics we deal only with a mean and a variance (which
      > are equal). The entire goal here is to have enough samples (k) that
      > there is a high probability the mean we calculate falls within the
      > error we are willing to accept of the true mean of the whole
      > population. By oversampling we ensure the calculated value is "close
      > enough" - the error in a sample estimate goes to zero as the sample
      > size approaches the whole population.
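      >
      > For a rough sense of scale (an assumed rate, for illustration only):
      > with k independent Poisson counts of mean lam, the sample mean has
      > standard error sqrt(lam / k), so the relative error shrinks like
      > 1 / sqrt(k).
      >
      > import math
      > lam = 300      # assumed true mean QSOs per log (illustrative)
      > k = 30_318     # logs in the sample, per the figures quoted above
      > se = math.sqrt(lam / k)
      > print(se, se / lam)    # ~0.10 QSOs, roughly 0.03% of the mean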
      >
      >> Averaging comes out to be 328 QSOs per log (10801/33). The median
      >> comes out to be 25 QSOs per log (101/4). You can see that assuming
      >> the average number is representative of all logs shows a vastly
      >> different picture from what the median shows across 738 snapshots.
      >
      > As I've shown above, the number of "snapshots" means nothing - it is
      > the number of logs in the sample that is important - the greater the
      > number of logs, the more accurate will be the estimation of the mean.
      > In any case, we know absolutely that your "normal" median cannot be
      > anywhere close to the actual median because, for the last five weeks,
      > the average (mean) number of *new* QSOs per log (New QSOs divided by
      > User Files from https://p1k.arrl.org/lotwuser/default) was 63, 68, 65,
      > 82 and 70. Averaged across the entire period, the average number of
      > *new* QSOs per log is 70. The fact that "new" QSOs alone are nearly
      > three times greater than your median argues that the true mean is much
      > closer to what you give as the mean. The "maximum likelihood" estimate
      > for a Poisson distribution happens to be the simple mean, and it is
      > also a *minimum variance unbiased estimator.* This is more statistical
      > theory than we need to go into here - but it simply says the sample
      > mean is the best unbiased estimate of the true mean.
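      >
      > A small numerical check of that "maximum likelihood" point, using
      > arbitrary example counts rather than LotW data - the Poisson
      > log-likelihood is maximized at exactly the simple sample mean:
      >
      > import math
      > counts = [63, 68, 65, 82, 70]          # example counts only
      > xbar = sum(counts) / len(counts)
      >
      > def loglik(lam):
      >     return sum(c * math.log(lam) - lam - math.lgamma(c + 1)
      >                for c in counts)
      >
      > best = max((g / 100 for g in range(5_000, 9_001)), key=loglik)
      > print(xbar, best)                      # both 69.6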
      >
      > Again, the only issue is whether the number of independent samples
      > - in this case the *sum of the logs* in the hourly reports - is large
      > enough that their mean will, with high probability, be within N% of
      > the mean of the population. With a sample that covers more than 15%
      > of all logs, the answer is an unequivocal *yes* for the entire
      > six-week period.
      >
      > I have not calculated the sample sizes for the individual weeks, but
      > I have no reason to doubt that they will also be more than sufficient.
      >
      > It is unfortunate that ARRL have not seen fit to release the "input"
      > data the way they release the "new QSOs" and "user files" numbers, as
      > there would then be no question about the level of wasted processing;
      > but even your mean (the maximum likelihood estimate) puts the level of
      > previously processed QSOs at more than 75% for the five-week period.
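      >
      > As a rough cross-check of that figure, using the two means quoted in
      > this thread (your 328 QSOs per log from the hourly snapshots, and the
      > 70 *new* QSOs per log from the status-page totals):
      >
      > mean_qsos_per_log = 328    # K2DSL's snapshot average
      > new_qsos_per_log = 70      # New QSOs / User Files, period average
      > dup_fraction = 1 - new_qsos_per_log / mean_qsos_per_log
      > print(f"{dup_fraction:.1%}")   # ~78.7% previously processed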
      >
      > 73,
      >
      > ... Joe, W4TV
      >
      > On 2/23/2013 12:29 PM, David Levine wrote:
      >> For those that don't care about statistics or a different view of the
      >> statistics being discussed, press delete.
      >>
      >> I only had a brief amount of time today to look at the data Joe sent me
      >> before I need to head out. Before I recap some findings, I need to address
      >> 1 comment Joe made:
      >>
      >>> I just don't believe your assumptions/extrapolation to be accurate.
      >>> It's like in a simple math equation.... A + x = y and you are solving
      >>> for both. We know A but the doubt has been on the accuracy of x,
      >>> something you adamantly stamp your feet about as accurate and others
      >>> feel is not the case.
      >>
      >> W4TV: The analysis solves for only one thing ... the only unknown is the
      >> total number of QSOs processed each week. Since we know exactly the
      >> number of files processed, the only issue is determining the average
      >> number of QSOs per file, which is a straightforward problem.
      >>
      >> K2DSL: You need a data point you don't have to solve for either total # of
      >> QSOs or previously processed QSOs.
      >> Formula: Total # QSOs uploaded - New QSOs Uploaded = Previously Uploaded
      >> QSOs
      >> T - N = P abbreviating the above.
      >> Data isn't available for T or P so you have 2 unknowns. What you have
      >> presented is your personal determination of T based on assumptions
      >> extrapolated from the data snapshots. That's what I noted in my original
      >> comment above - there are 2 unknowns.
      >>
      >> In my brief analysis of the data which I will look further into later or
      >> tomorrow, I notice the following:
      >> Your start date is 12/3/12 but in looking at the data, 1/18 is when the
      >> system seems to have reliably caught up where the queue reached 0 and
      >> didn't have any significant backlogs re-occur. That's a sampling of just 36
      >> days.
      >>
      >> I also removed any hourly data points where there was no data reported
      >> leaving data where there was at least 1 log in the queue. This reduced the
      >> number of hourly data points to 738. I didn't look to see what Joe
      >> previously stated the sample size was but that's what I consider the sample
      >> size of the data for analysis purposes.
      >>
      >> In looking at those 738 hourly snapshots I come up with the following:
      >> Average # logs in the queue shown across the hourly snapshots: 33
      >> Median # logs in the queue shown across the hourly snapshots: 4
      >>
      >> Average # of QSOs shown across the hourly snapshots: 10,801
      >> Median # of QSOs shown across the hourly snapshots: 101
      >>
      >> The median helps to minimize skewing from the more extreme outliers in the
      >> data such as hourly snapshots with very small # logs & QSOs and with very
      >> large # of logs & QSOs. The median of 101 QSOs per snapshot is
      >> approximately the middle value where 1/2 of the 738 snapshots are less than
      >> 101 QSOs and 1/2 the 738 snapshots are greater than 101 QSOs.
      >>
      >> Averaging comes out to be 328 QSOs per log (10801/33). The median comes out
      >> to be 25 QSOs per log (101/4). You can see that assuming the average number
      >> is representative of all logs shows a vastly different picture from what
      >> the median shows across 738 snapshots. I don't believe either of these
      >> numbers to be accurate for extrapolation in determining the # of total QSOs
      >> per log. This number is important because it is then used with the # of
      >> logs uploaded (actual data point provided) to determine the total # of QSOs
      >> uploaded (one of the missing values noted at the top of my post). Joe's
      >> premise is that his calculated average is "statistically" valid and what
      >> others are challenging in their responses.
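      >>
      >> A quick arithmetic check of those two ratios, using the figures given
      >> above:
      >>
      >> print(10_801 / 33)   # ~327.3, i.e. the 328 QSOs per log average
      >> print(101 / 4)       # 25.25, i.e. the 25 QSOs per log figure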
      >>
      >> K2DSL - David
      >>
      >
      >
    • David Cole
      Message 147 of 147, Aug 24, 2014
        I run ACLog... When you hit the "ALL SINCE" button, change the date to
        something about a week prior to the LoTW failure...

        I "believe" ACLog got very confused as a result of the failure mode of
        LoTW. That corrected a very similar problem for me.
        --
        Thanks and 73's,
        For equipment, and software setups and reviews see:
        www.nk7z.net
        for MixW support see:
        http://groups.yahoo.com/neo/groups/mixw/info
        for Dopplergram information see:
        http://groups.yahoo.com/neo/groups/dopplergram/info
        for MM-SSTV see:
        http://groups.yahoo.com/neo/groups/MM-SSTV/info


        On Sun, 2014-08-24 at 09:05 -0700, reillyjf@... [ARRL-LOTW]
        wrote:
        >
        >
        > Thanks for the suggestion. I did a complete download, and beat the
        > number of duplicates down from 275 to 30. Not exactly sure why the
        > N3FJP ACL is missing this information.
        > - 73, John, N0TA
        >
        >
        >