## Re: [ARRL-LOTW] Re: Duplicates

Message 1 of 147 , Feb 23, 2013
> K2DSL: I didn't say you were using P to solve for T so why state
> that? I said there are 2 unknowns in the equation which is accurate.

You said it again ... there is only *one* unknown in that equation
since the two items you claim to be "unknown" are related to each
other by a constant (known). Thus, if you know one, you know both
- they cannot both be unknown.
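A minimal sketch of the relation under discussion, T - N = P, using the figures K2DSL quotes in this thread. The function name is made up for illustration. It only shows that once N is known, fixing T fixes P; it says nothing about which estimate of T is right.

```python
def previously_uploaded(total_qsos: int, new_qsos: int) -> int:
    """P = T - N: previously uploaded QSOs implied by a total and the new count."""
    return total_qsos - new_qsos


N = 8_709_697            # new QSOs in the sample period (quoted in this thread)
T_estimate = 10_598_584  # one disputed estimate of total QSOs (quoted in this thread)

P = previously_uploaded(T_estimate, N)
print(P)  # 1888887
```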

> If I remove the top 20% of status records based on backed up QSOs
> leaving 80% of the samples, that removes 95% of the QSOs you are
> using from the samples.

You can't arbitrarily remove 20% of the samples - each sample is as
valid as any other. Picking and choosing data to suit your hypothesis
is wrong, wrong, wrong.

> Without the top 20%, the remaining 80% of the samples result in an average
> 85 QSOs/log. Knowing 125,261 logs uploaded as a fact within the sample
> period, Total QSOs in the sample period = 10,598,584 QSOs.

You don't know *as a fact* that the total QSOs are 10,598,584. You have
an inaccurate value for QSOs/log because you arbitrarily discarded the
largest logs. Any time you lop off the top 20% of the samples you are
going to understate the average log size and excessively reduce the
level of previously processed records - specifically because it is the
large non-contest logs that are the primary source of the duplicates
we're trying to measure! Of course you know that and are intentionally
trying to manipulate the data to support your "low dupes" mythology.
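The effect of discarding the largest samples can be illustrated with synthetic data. This is only a sketch: the log sizes below are hypothetical draws from a right-skewed (lognormal) distribution, not LoTW data, and the distribution parameters are arbitrary assumptions.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical log sizes (QSOs per upload) with a heavy right tail.
log_sizes = [int(rng.lognormvariate(4.0, 1.5)) + 1 for _ in range(10_000)]

full_mean = statistics.mean(log_sizes)

# Discard the largest 20% of samples and keep the smallest 80%.
keep = sorted(log_sizes)[: int(len(log_sizes) * 0.8)]
trimmed_mean = statistics.mean(keep)

# Fraction of all QSOs that disappeared along with the discarded samples.
share_removed = 1 - sum(keep) / sum(log_sizes)

print(f"full mean    = {full_mean:.1f} QSOs/log")
print(f"trimmed mean = {trimmed_mean:.1f} QSOs/log")
print(f"QSOs removed = {share_removed:.0%}")
```

Whatever the exact numbers, the trimmed mean comes out below the full mean, because only the largest values were removed.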

> that calculates out to:
> New QSO = 82%
> Previously uploaded QSOs = 18%

Which is an absurdly low number since it is only about 1/3 of the dupe
level given by K1MK. You're not going to have a 65% decline in the
rate of dupes between the first four weeks of December and February
with only a 15% increase in new QSOs - not when the number of logs
processed increased by 50% in the same time period. Your hypothesis
fails even the most basic examination.
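For reference, the 82%/18% split does follow arithmetically from the three figures K2DSL quotes; the dispute is over whether the total is a valid input. A short check, with variable names chosen here for illustration:

```python
logs_uploaded = 125_261   # logs uploaded in the sample period (quoted)
total_qsos = 10_598_584   # K2DSL's disputed total for the sample period (quoted)
new_qsos = 8_709_697      # new QSOs in the sample period (quoted)

avg_per_log = total_qsos / logs_uploaded  # implied average log size
new_share = new_qsos / total_qsos         # fraction of QSOs that were new
dupe_share = 1 - new_share                # fraction previously uploaded

print(f"{avg_per_log:.1f} QSOs/log, new {new_share:.0%}, "
      f"previously uploaded {dupe_share:.0%}")
```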

There is nothing to support a 75% *decrease* in average log size -
which would be required to reach your 18% dupe rate - between December
and February. Instead, all of the signs point to a modest increase in
average upload size - on the order of 15% or so.

> That is how skewed the data is and also what others have been
> criticizing. With knowledge of the fundamental way the data will
> show in any one second, a single large file will have much greater
> impact as can be seen from this simple analysis.

The only place skewness enters the equation with sampling is that
one needs a larger sample to reach the same confidence level as one
would achieve with a less skewed (smaller standard deviation in terms
of "normal" distributions) population. Go study statistics instead of
spewing a bunch of stuff that is provably wrong.
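The sample-size point can be sketched with the textbook normal-approximation formula n = (z * s / E)^2 for estimating a mean to within a margin E. The function name and the numbers are assumptions for illustration only; for a strongly skewed population the normal approximation itself takes larger n to hold, which points the same way.

```python
import math


def required_sample_size(std_dev: float, margin: float, z: float = 1.96) -> int:
    """Samples needed so a ~95% interval for the mean is within +/- margin.

    Uses the normal approximation n = (z * s / E)^2, rounded up.
    """
    return math.ceil((z * std_dev / margin) ** 2)


# Same margin and confidence level, two hypothetical standard deviations.
n_low = required_sample_size(std_dev=10, margin=1)
n_high = required_sample_size(std_dev=100, margin=1)

print(n_low, n_high)  # the wider spread needs roughly 100x the samples
```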

> You don't have to agree with the approach (I didn't particularly
> agree with yours) and I know you won't like the outcome, but it shows
> that not having the facts and making general assumptions based on a
> limited set of data and one that is particularly skewed can result in
> potentially inaccurate analysis.

All this shows is that 1) you can distort the result when you discard
data that does not agree with your desired outcome and 2) results are
unreliable when you arbitrarily reduce the sample size. Both of those
are known principles of statistics/statistical analysis.

You're never going to agree with my analysis because it doesn't support
your belief system. There is obviously no use debating this further -
however, all of the equations to calculate the "expected value" of a
distribution, probability mass function, cumulative distribution
function, confidence intervals, and the other parameters on which to
base an unbiased analysis are available on-line for you to study and
maybe learn something.

73,

... Joe, W4TV

On 2/23/2013 7:51 PM, David Levine wrote:
>> K2DSL: You need a data point you don't have to solve for either total
>> # of QSOs or previously processed QSOs.
>> Formula: Total # QSOs uploaded - New QSOs Uploaded = Previously
>> Uploaded QSOs T - N = P abbreviating the above.
>> Data isn't available for T or P so you have 2 unknowns. What you
>> have presented is your personal determination of T based on
>> assumptions extrapolated from the data snapshots. That's what I noted
>> in my original comment above - there are 2 unknowns.
>
> W4TV: I'm not using P to solve for T .... T is determined *independently* by
> sampling the entire population and P is calculated from that. There is
> only one unknown - T (total QSOs) since P is T-N and we know N from the
> status data at https://p1k.arrl.org/lotwuser/default.
>
> K2DSL: I didn't say you were using P to solve for T so why state that? I
> said there are 2 unknowns in the equation which is accurate. The only FACTs
> are # new qsos and # logs uploaded. You can't claim Total # of QSOs as a
> FACT as it's your guesstimate based on the incomplete data provided. This
> specific item is what others are questioning.
>
> If I remove the top 20% of status records based on backed up QSOs leaving
> 80% of the samples, that removes 95% of the QSOs you are using from the
> samples. That is how skewed the data is and also what others have
> been criticizing. With knowledge of the fundamental way the data will show
> in any one second, a single large file will have much greater impact as can
> be seen from this simple analysis.
>
> Without the top 20%, the remaining 80% of the samples result in an average
> 85 QSOs/log. Knowing 125,261 logs uploaded as a fact within the sample
> period, Total QSOs in the sample period = 10,598,584 QSOs. We know as a
> fact the new QSOs in that sample period is 8,709,697 so that calculates out
> to:
> New QSO = 82%
> Previously uploaded QSOs = 18%
>
> You don't have to agree with the approach (I didn't particularly agree with
> yours) and I know you won't like the outcome, but it shows that not having
> the facts and making general assumptions based on a limited set of data and
> one that is particularly skewed can result in potentially inaccurate
> analysis.
>
> David - K2DSL
>
Message 147 of 147 , Aug 24, 2014
I run ACLog... When you hit the "ALL SINCE" button, change the date to
be something about a week prior to the LoTW failure...

I "believe", ACLog got very confused as a result of the fail mode of
LoTW. That corrected a very similar problem for me.
--
Thanks and 73's,
For equipment, and software setups and reviews see:
www.nk7z.net
for MixW support see:
http://groups.yahoo.com/neo/groups/mixw/info
for Dopplergram information see:
http://groups.yahoo.com/neo/groups/dopplergram/info
for MM-SSTV see:
http://groups.yahoo.com/neo/groups/MM-SSTV/info

On Sun, 2014-08-24 at 09:05 -0700, reillyjf@... [ARRL-LOTW]
wrote:
>
>
> Thanks for the suggestion. I did a complete download, and beat the
> number of duplicates down from 275 to 30. Not exactly sure why the
> N3FJP ACL is missing this information.
> - 73, John, N0TA
>
>
>