
AW: AW: ANN: LOD Cloud - Statistics and compliance with best practices

  • Chris Bizer
    Message 1 of 4, Oct 21, 2010
      Hi Martin, Thomas and Kingsley,

      > >> First, I think it is pretty funny that you list Denny's April Fools'
      > >> dataset of creating triples for numbers as an acceptable part of the
      > >> cloud,

      Why?

      As I said, we are including all datasets which fulfill the minimal
      technical requirements. As Denny's dataset does this, it is included. The
      same would of course be true for BestBuy and other GoodRelations
      datasets, if they were connected by RDF links to other datasets in the
      cloud.
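
      To illustrate, the link-counting requirement can be approximated
      mechanically. The sketch below (Python with rdflib) counts triples whose
      subject and object live on different hosts; note that "dump.nt" is a
      hypothetical N-Triples dump, and equating "dataset" with DNS host is a
      simplification of how the cloud is actually curated:

          # Sketch: approximate the "50 RDF links" criterion against a dump.
          from urllib.parse import urlparse

          from rdflib import Graph, URIRef

          g = Graph()
          g.parse("dump.nt", format="nt")  # hypothetical N-Triples dump

          # Count triples whose object URI lives on a different host than the
          # subject URI, as a crude stand-in for "outbound RDF links".
          outbound = sum(
              1 for s, _, o in g
              if isinstance(s, URIRef) and isinstance(o, URIRef)
              and urlparse(str(s)).netloc != urlparse(str(o)).netloc
          )
          print(f"outbound links: {outbound}, meets threshold: {outbound >= 50}")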

      > >> The fundamental mistake of what you say is that linked open
      > >> e-commerce data is not "a dataset" but a wealth of smaller datasets.
      > >> Asking me to create CKAN entries for each store or business in the
      > >> world that provides GoodRelations data is as if Google was asking any
      > >> site owner in the world to register his or her site manually via
      > >> CKAN.
      > >>
      > >> That is 1990s style and does not have anything to do with a "Web" of
      > >> data.

      I agree with you that it would be much better if somebody set up a
      crawler, properly crawled the Web of Data and then provided a catalog of
      all datasets. As long as nobody does this, I think it is useful to have
      the manually maintained CKAN catalog as a first step.

      An interesting step in this direction is the profiling work done by Felix
      Naumann's group for the BTC dataset. See
      http://www.cs.vu.nl/~pmika/swc/submissions/swc2010_submission_3.pdf

      > >> Is HTML + RDFa with hash fragments, available via HTTP GET,
      > >> "dereferencable" for you? E.g.

      Absolutely!
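
      To make "dereferencable" concrete: a minimal sketch of such a check,
      assuming that a 2xx response carrying an RDF or (X)HTML media type (the
      latter possibly with RDFa payload) counts as a pass. The set of accepted
      media types is an assumption of this sketch, not a formal definition:

          import urllib.request

          # Media types treated as a pass; an assumption for this sketch.
          ACCEPTABLE = {"application/rdf+xml", "text/turtle", "text/html",
                        "application/xhtml+xml"}

          def dereferences(uri: str) -> bool:
              # Plain HTTP GET with an Accept header, as a Linked Data client
              # would issue it.
              req = urllib.request.Request(
                  uri, headers={"Accept": "application/rdf+xml, text/html;q=0.9"})
              with urllib.request.urlopen(req, timeout=10) as resp:
                  return (resp.status < 300
                          and resp.headers.get_content_type() in ACCEPTABLE)

          # The BestBuy store page from this thread: HTML + RDFa over GET.
          print(dereferences("http://stores.bestbuy.com/10/"))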

      > >> To be frank, I think the bubbles diagram fundamentally misses the
      > >> point in the sense that the power of linked data is in integrating a
      > >> huge amount of small, specific data sources, and not in linking a
      > >> manually maintained blend of ca. 100 monolithic datasets.

      Valid point. I agree with you that the power of the Linked Data
      architecture is that it provides for building a single global dataspace
      which of course may contain small as well as big data sources.

      The goal of the LOD diagram is not to visualize every small chunk of RDF on
      the Web, as this would be impossible for obvious reasons - including the
      size of your screen.

      We restrict the diagram to bigger datasets, hoping that these may be
      especially relevant to data consumers.

      Of course, you may disagree with this restriction.

      From Thomas:
      > > How about handling GoodRelations the same way as FOAF, representing it
      > > as a somewhat existing bubble without exactly specifying where it
      > > links to and from where inbound links come from

      We also don't do this for FOAF anymore in the new version of the diagram.

      From Thomas:
      > > In the end, the idea of a Web catalogue was mostly abandoned at some
      > > point due to being unmanageable, maybe the same happens to the Web
      > > /data/ "catalogue", aka. LOD cloud (the metaphor doesn't work
      > > perfectly, but you get the point).

      Yes. But I personally think that the Yahoo catalog was rather useful in the
      early days of the Web.

      In the same way, I think that the CKAN catalog is rather useful in the
      current development stage of the Web of Data, and I'm looking forward to
      the time when the Web of Data has grown to a point where such a catalog
      becomes unmanageable.

      But again: I agree that crawling the Web of Data and then deriving a dataset
      catalog as well as meta-data about the datasets directly from the crawled
      data would be clearly preferable and would also scale way better.

      Thus: could somebody please start a crawler and build such a catalog?

      As long as nobody does this, I will keep on using CKAN.
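
      To sketch what such a crawler could look like (Python with rdflib): this
      is only a toy under heavy simplifying assumptions - the seed URI is
      arbitrary, the fetch limit is tiny, and "one host = one dataset" is a
      crude approximation of what a real catalog would record:

          from collections import deque
          from urllib.parse import urlparse

          from rdflib import Graph, URIRef

          def crawl(seed: str, max_fetches: int = 20) -> dict:
              """Breadth-first crawl following URIs in object position,
              grouping triple counts by host as a crude dataset catalog."""
              seen, queue, catalog = {seed}, deque([seed]), {}
              fetches = 0
              while queue and fetches < max_fetches:
                  uri = queue.popleft()
                  fetches += 1
                  g = Graph()
                  try:
                      g.parse(uri)  # content-negotiates, guesses the format
                  except Exception:
                      continue      # dead link or non-RDF resource: skip it
                  host = urlparse(uri).netloc
                  catalog[host] = catalog.get(host, 0) + len(g)
                  for _, _, o in g:
                      if isinstance(o, URIRef) and str(o) not in seen:
                          seen.add(str(o))
                          queue.append(str(o))
              return catalog

          print(crawl("http://dbpedia.org/resource/Berlin"))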

      Cheers,

      Chris


      > -----Original Message-----
      > From: semantic-web-request@... [mailto:semantic-web-
      > request@...] On Behalf Of Kingsley Idehen
      > Sent: Wednesday, October 20, 2010 20:30
      > To: Thomas Steiner
      > Cc: Martin Hepp; Chris Bizer; Semantic Web; public-lod@...; Anja
      > Jentzsch; semanticweb@yahoogroups.com
      > Subject: Re: AW: ANN: LOD Cloud - Statistics and compliance with best
      > practices
      >
      > On 10/20/10 2:13 PM, Thomas Steiner wrote:
      > > Hi all,
      > >
      > > How about handling GoodRelations the same way as FOAF, representing it
      > > as a somewhat existing bubble without exactly specifying where it
      > > links to and from where inbound links come from (on the road right
      > > now, so can't check for sure whether it is already done this way)? The
      > > individual datasets are too small to be entered manually into CKAN (+1
      > > for Martin's arguments here).
      > > In the end, the idea of a Web catalogue was mostly abandoned at some
      > > point due to being unmanageable, maybe the same happens to the Web
      > > /data/ "catalogue", aka. LOD cloud (the metaphor doesn't work
      > > perfectly, but you get the point).
      > >
      > > Martin's point as I get it is that GR forms part of the Web of data.
      > > Currently this is (about to be) honored by search engines and the
      > > like, GR-enabled price/product comparison engines etc. are probably
      > > being worked on (or are already live?), so Linked Open Commerce (well,
      > > an aspect of it) will be/is real soon/now. Whether/how GR forms part
      > > of the LOD cloud is a secondary, if at all, question in my humble
      > > opinion.
      > >
      > > All this is my private point of view, my Google hat completely off.
      > > Sorry for the many slashes/alternative sentence endings.
      >
      > This is why we opted to make a LOC (Linked Open Commerce) pictorial [1]
      > that connects to LOD. In short, I would encourage all Linked Data
      > publishers and curators to embark upon similar endeavors, as long as
      > they accurately depict their specific Linked Data slant and
      > contributions. Remember, this is about the Web, LOD is just one of many
      > Linked Data clusters within the burgeoning Web of Linked Data :-)
      >
      > Links:
      >
      > 1. http://linkedopencommerce.com -- this space includes a variety of
      > purpose-specific Linked Data pictorials.
      >
      > Kingsley
      > > Best,
      > > Tom
      > >
      > > Thank God not sent from a BlackBerry, but from my iPhone
      > >
      > > On 20.10.2010, at 19:16, Martin Hepp <martin.hepp@...> wrote:
      > >
      > >> Hi Chris:
      > >>
      > >> First, I think it is pretty funny that you list Denny's April Fools'
      > >> dataset of creating triples for numbers as an acceptable part of the
      > >> cloud,
      > >>
      > >> http://ckan.net/package/linked-open-numbers
      > >>
      > >> <Picture 39.png>
      > >> (right next to WordNet)
      > >>
      > >> The fundamental mistake of what you say is that linked open
      > >> e-commerce data is not "a dataset" but a wealth of smaller datasets.
      > >> Asking me to create CKAN entries for each store or business in the
      > >> world that provides GoodRelations data is as if Google was asking any
      > >> site owner in the world to register his or her site manually via
      > >> CKAN.
      > >>
      > >> That is 1990s style and does not have anything to do with a "Web" of
      > >> data.
      > >>
      > >>> 1. Data items are accessible via dereferencable URIs (providing only
      > >>> access via SPARQL is not enough, as linked data browsers and search
      > >>> engines cannot work with SPARQL endpoints)
      > >> Is HTML + RDFa with hash fragments, available via HTTP GET,
      > >> "dereferencable" for you? E.g.
      > >>
      > >> http://stores.bestbuy.com/10/
      > >>
      > >> If yes, fine. If not - why? IMO, HTML with RDFa payload does not
      > >> break any fundamental principles of the Web architecture.
      > >>
      > >>
      > >>> 2. The dataset sets at least 50 RDF links pointing at other datasets
      > >>> or at least one other dataset is setting 50 RDF links pointing at
      > >>> your dataset.
      > >>
      > >> This is often hard to meet and seems like a very artificial
      > >> requirement to me.
      > >>
      > >> First, many small datasets may be just 50 triples in total. Why
      > >> should a hairdresser in Kentucky, exposing its description in
      > >> GoodRelations + RDFa, have 50 outbound links? What should this beauty
      > >> store in CA exposing 800 triples do to qualify as linked data?
      > >>
      > >> http://www.plushbeautybar.com/services.html
      > >>
      > >> Second, what kind of links to core LOD entities do you expect from
      > >> shop operators? For example, take
      > >>
      > >> http://semantic.eurobau.com/
      > >>
      > >> That dataset contains some 30 million triples of
      > >> construction-materials information. Which links to DBpedia would you
      > >> reasonably expect? Is this Linked Data in your opinion or not? If
      > >> not, why?
      > >>
      > >> To be frank, I think the bubbles diagram fundamentally misses the
      > >> point in the sense that the power of linked data is in integrating a
      > >> huge amount of small, specific data sources, and not in linking a
      > >> manually maintained blend of ca. 100 monolithic datasets.
      > >>
      > >> Integrating 100 datasets does not have anything to do with Web-scale
      > >> information integration. Note that Google estimated back in 2008 that
      > >> there were ca. 1 trillion URIs in their index alone. So what are 100
      > >> manually converted datasets in comparison to that?
      > >>
      > >> Best
      > >>
      > >> Martin
      > >>
      > >> On 20.10.2010, at 08:49, Chris Bizer wrote:
      > >>
      > >>> Hi Martin,
      > >>>
      > >>> we are not ignoring anything.
      > >>>
      > >>> I personally think that http://linkedopencommerce.com/ is a quite
      > >>> exciting effort and would love to see more e-commerce data in the
      > >>> LOD cloud.
      > >>>
      > >>> We have asked the community repeatedly to provide information about
      > >>> datasets that they would like to have included in the LOD cloud on
      > >>> CKAN.
      > >>>
      > >>> You did not do this. And at that time, we had not yet heard about
      > >>> http://linkedopencommerce.com/.
      > >>>
      > >>> It would be great if you would add information about your dataset(s)
      > >>> to CKAN, so that we can include it in the next version of the cloud
      > >>> diagram.
      > >>>
      > >>> Of course given that they fulfill the minimal requirements for
      > >>> inclusion, which are:
      > >>>
      > >>> 1. Data items are accessible via dereferencable URIs (providing only
      > >>> access via SPARQL is not enough, as linked data browsers and search
      > >>> engines cannot work with SPARQL endpoints)
      > >>> 2. The dataset sets at least 50 RDF links pointing at other datasets
      > >>> or at least one other dataset is setting 50 RDF links pointing at
      > >>> your dataset.
      > >>>
      > >>> Cheers,
      > >>>
      > >>> Chris
      > >>>
      > >>> -----Original Message-----
      > >>> From: Martin Hepp [mailto:martin.hepp@...]
      > >>> Sent: Tuesday, October 19, 2010 22:09
      > >>> To: Anja Jentzsch; Chris Bizer
      > >>> Cc: Semantic Web; semanticweb@yahoogroups.com
      > >>> Subject: Re: ANN: LOD Cloud - Statistics and compliance with best
      > >>> practices
      > >>>
      > >>> Hi Anja, Chris:
      > >>>
      > >>> It's kind of a joke that you ignore the 1 billion triples of
      > >>> GoodRelations data on the Web, e.g. available at
      > >>>
      > >>> http://linkedopencommerce.com/
      > >>>
      > >>> or
      > >>>
      > >>> http://www.ebusiness-unibw.org/wiki/GoodRelations#Examples_in_the_Wild
      > >>>
      > >>> Martin
      > >>>
      > >>>
      > >>> On 19.10.2010, at 17:56, Anja Jentzsch wrote:
      > >>>
      > >>>> Hi all,
      > >>>>
      > >>>> in the last weeks, we have analyzed which data sources in the new
      > >>>> version of the LOD cloud comply with various best practices that
      > >>>> are recommended by the W3C or have emerged within the LOD community.
      > >>>>
      > >>>> We have checked the implementation of the following nine best
      > >>>> practices:
      > >>>>
      > >>>> 1. Provide dereferencable URIs
      > >>>> 2. Set RDF links pointing at other data sources
      > >>>> 3. Use terms from widely deployed vocabularies
      > >>>> 4. Make proprietary vocabulary terms dereferencable
      > >>>> 5. Map proprietary vocabulary terms to other vocabularies
      > >>>> 6. Provide provenance metadata
      > >>>> 7. Provide licensing metadata
      > >>>> 8. Provide data-set-level metadata
      > >>>> 9. Refer to additional access methods
      > >>>>
      > >>>> The compliance with the best practices was either checked manually
      > >>>> or by using scripts that downloaded and analyzed some data from the
      > >>>> data sources.
      > >>>> We have added the results of the evaluation in the form of tags to
      > >>>> the LOD data set catalog on CKAN [1].
      > >>>>
      > >>>> We are now happy to release the first statistics about the structure
      > >>>> of the LOD cloud as well as the compliance of the datasets with the
      > >>>> best practices.
      > >>>> The statistics can be found here:
      > >>>>
      > >>>> http://www4.wiwiss.fu-berlin.de/lodcloud/state/
      > >>>>
      > >>>> The document contains an initial, preliminary release of the
      > >>>> statistics. If you spot any errors in the data describing the LOD
      > >>>> data sets on CKAN, it would be great if you would correct them
      > >>>> directly on CKAN.
      > >>>>
      > >>>> For information on how to describe datasets on CKAN please refer to
      > >>>> the Guidelines for Collecting Metadata on Linked Datasets in CKAN
      > >>>> [2].
      > >>>>
      > >>>> After your feedback and corrections, we will then move the corrected
      > >>>> version of the statistics to http://www.lod-cloud.net/ (around
      > >>>> October 24th).
      > >>>>
      > >>>> Have fun with the statistics and the encouraging as well as
      > >>>> disappointing insights that they provide.
      > >>>>
      > >>>> Cheers,
      > >>>>
      > >>>> Chris Bizer, Anja Jentzsch and Richard Cyganiak
      > >>>>
      > >>>> [1] http://www.ckan.net/group/lodcloud
      > >>>> [2]
      > >>>> http://esw.w3.org/TaskForces/CommunityProjects/LinkingOpenData/DataSets/CKANmetainformation
      > >>>>
      > >>>
      > >
      >
      >
      > --
      >
      > Regards,
      >
      > Kingsley Idehen
      > President & CEO
      > OpenLink Software
      > Web: http://www.openlinksw.com
      > Weblog: http://www.openlinksw.com/blog/~kidehen
      > Twitter/Identi.ca: kidehen
      >
      >
      >
      >
    • Giovanni Tummarello
      Message 2 of 4, Oct 21, 2010
        > But again: I agree that crawling the Web of Data and then deriving a dataset
        > catalog as well as meta-data about the datasets directly from the crawled
        > data would be clearly preferable and would also scale way better.
        >
        > Thus: Could please somebody start a crawler and build such a catalog?
        >
        > As long as nobody does this, I will keep on using CKAN.
        >

        Hi Chris, all

        I can only restate that within Sindice we're very open to anyone who
        wants to develop data analysis apps that create catalogs automatically.
        A map-reduce job a couple of weeks ago gave in excess of 100k
        independent datasets. How many are interlinked etc.? That remains to be
        analyzed.

        Our interest (and the interest of the Semantic Web vision I want to
        sponsor) is to make sure RDFa sites are fully included, and so are those
        which provide markup that can be translated into RDF in an
        automatic/agreeable way (so no scraping or "sponging") - that is,
        anything that any23.org can turn into triples.
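
        For illustration, a sketch of calling that service over HTTP. The REST
        path pattern /<format>/<uri> is recalled from the Any23 documentation
        of the time and should be treated as an assumption; the target page is
        just an example from earlier in this thread:

            import urllib.request

            def any23_ntriples(page: str) -> str:
                # Assumed endpoint pattern: http://any23.org/<format>/<uri>
                url = "http://any23.org/ntriples/" + page
                with urllib.request.urlopen(url, timeout=30) as resp:
                    return resp.read().decode("utf-8")

            # A GoodRelations+RDFa page mentioned earlier in this thread.
            print(any23_ntriples("http://www.plushbeautybar.com/services.html")[:500])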

        If you were indeed interested in running or developing your algorithms
        on our dataset, no problem; the code can be made open source so it
        would run on other, similarly structured datasets.

        This said, yes, I too think that in this phase a CKAN-like repository
        can be an interesting aggregation point, why not.

        But I do think the diagram, which made great sense as an example when
        Richard started it, is now at risk of doing a disservice, in line with
        what Martin is pointing out.

        The diagram as it is now kinda implicitly conveys the sense that if
        something is so large, then all that matters must be there - and that's
        absolutely not the case.

        a) there are plenty of extremely useful datasets in RDF/RDFa etc. which
        are not there
        b) the usefulness of being linked is far from a proven fact, so on the
        one hand people might want to "be there", while on the other you'd have
        to push serious commercial entities (for example) to "link to DBpedia"
        for reasons that aren't clear, and that hurts your credibility.

        So Denny has fun linking to DBpedia, so he is in there with his joke
        dataset, but you can't credibly bring that argument to large retailers,
        so they're left out?

        This would be OK if the diagram was just "hey, it's my own thing, I set
        my rules" - fine, but the fanfare around it gives it a different
        meaning, and thus the controversy above.

        .. just tried to put into words what might be a general unspoken feeling..

        Short message recap:
        a) CKAN - nice, why not, might be useful, but..
        b) generated diagram: we have the data or can collect it, so whoever is
        interested in analytics please let us know and we can work it out (as a
        matter of fact, it turns out most of us in here are paid by the EU for
        doing this in collaborative projects :-) )

        cheers
        Giovanni
      • Chris Bizer
        Message 3 of 4, Oct 22, 2010
          Hi Denny,

          thank you for your smart and insightful comments.

          > I also find it a shame that this thread has been hijacked,
          > especially since the original topic was so interesting. The original
          > email by Anja was not about the LOD cloud, but rather about -- as
          > the title of the thread still suggests -- the compliance of LOD with
          > some best practices. Instead of the question "is X in the diagram",
          > I would much rather see a discussion on "are the selected quality
          > criteria good criteria? why are some of them so little followed? how
          > can we improve the situation?"

          Absolutely. Opening up the discussion on these topics is exactly the reason
          why we compiled the statistics.

          In order to guide the discussion back to this topic, maybe it is useful to
          repost the original link:

          http://www4.wiwiss.fu-berlin.de/lodcloud/state/

          A quick initial comment concerning the term "quality criteria". I
          think it is essential to distinguish between:

          1. The quality of the way data is published, meaning to what extent
          the publishers comply with best practices (a possible set of best
          practices is listed in the document)
          2. The quality of the data itself. I think Enrico's comment was going
          in this direction.

          The Web of documents is an open system built on people agreeing on standards
          and best practices.
          Open system means in this context that everybody can publish content and
          that there are no restrictions on the quality of the content.
          This is in my opinion one of the central facts that made the Web successful.

          The same is true for the Web of Data. There obviously cannot be any
          restrictions on what people can/should publish (including different
          opinions on a topic, but also pure SPAM). As on the classic Web, it
          is the job of the information/data consumer to figure out which data
          they want to believe and use (definition of information quality =
          usefulness of information, which is a subjective thing).

          Thus it also does not make sense to discuss the "objective quality"
          of the data that should be included in the LOD cloud (objective
          quality just does not exist); it makes much more sense to discuss the
          major issues that we are still having in regard to compliance with
          publishing best practices.

          > Anja has pointed to a wealth of openly
          > available numbers (no pun intended), that have not been discussed at all.
          For
          > example, only 7.5% of the data source provide a mapping of "proprietary
          > vocabulary terms" to "other vocabulary terms". For anyone building
          > applications to work with LOD, this is a real problem.

          Yes, this is also the figure that scared me most.
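
          To make that figure concrete, here is a sketch of how best practice 5
          could be tested for a single term via follow-your-nose: dereference
          the term and look for mapping triples that point into a foreign
          namespace. The chosen set of mapping properties is an assumption of
          this sketch, not the methodology behind the published statistics:

              from urllib.parse import urlparse

              from rdflib import Graph, URIRef
              from rdflib.namespace import OWL, RDFS

              # Properties treated as "mappings to other vocabularies";
              # an assumption for this sketch.
              MAPPING_PROPS = {OWL.equivalentClass, OWL.equivalentProperty,
                               RDFS.subClassOf, RDFS.subPropertyOf}

              def is_mapped(term: str) -> bool:
                  g = Graph()
                  g.parse(term)  # dereference the vocabulary term itself
                  host = urlparse(term).netloc
                  return any(p in MAPPING_PROPS and isinstance(o, URIRef)
                             and urlparse(str(o)).netloc != host
                             for p, o in g.predicate_objects(URIRef(term)))

              print(is_mapped("http://dbpedia.org/ontology/capital"))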

          > but in order to figure out what really needs to be done, and what
          > the criteria for good data on the Semantic Web need to look like, we
          > need to get back to Anja's original questions. I think that is a
          > question we may try to tackle in Shanghai in some form; I at least
          > would find that an interesting topic.

          Same with me.
          Shanghai was also the reason for the timing of the post.

          Cheers,

          Chris

          > -----Original Message-----
          > From: semantic-web-request@... [mailto:semantic-web-
          > request@...] On Behalf Of Denny Vrandecic
          > Sent: Friday, October 22, 2010 08:44
          > To: Martin Hepp
          > Cc: Kingsley Idehen; public-lod; Enrico Motta; Chris Bizer; Thomas
          > Steiner; Semantic Web; Anja Jentzsch; semanticweb; Giovanni
          > Tummarello; Mathieu d'Aquin
          > Subject: Re: AW: ANN: LOD Cloud - Statistics and compliance with
          > best practices
          >
          > I usually dislike to comment on such discussions, as I don't find
          > them particularly productive, but 1) since the number of people
          > pointing me to this thread is growing, 2) it contains some wrong
          > statements, and 3) I feel that this thread has been hijacked from a
          > topic that I consider productive and important, I hope you won't
          > mind me giving a comment. I wanted to keep it brief, but I failed.
          >
          > Let's start with the wrong statements:
          >
          > First, although I take responsibility as a co-creator for Linked
          > Open Numbers, I surely cannot take full credit for it. The dataset
          > was a shared effort by a number of people in Karlsruhe over a few
          > days, and thus calling the whole thing "Denny's numbers dataset" is
          > simply wrong due to the effort spent by my colleagues on it. It is
          > fine to call it "Karlsruhe's numbers dataset" or simply Linked Open
          > Numbers, but providing me with the sole attribution is too much of
          > an honor.
          >
          > Second, although it is claimed that Linked Open Numbers are "by
          > design and known to everybody in the core community, not data but
          > noise", being one of the co-designers of the system I have to
          > disagree. It is "noise by design". One of my motivations for LON was
          > to raise a few points for discussion, and at the same time provide a
          > dataset fully adhering to Linked Open Data principles. We were
          > obviously able to get the first goal right, and we didn't do too bad
          > on the second, even though we got an interesting list of bugs from
          > Richard Cyganiak, which, regrettably, we still did not fix. I am
          > very sorry for that. But, to make the point very clear again, this
          > dataset was designed to follow LOD principles as well as possible,
          > to be correct, and to have an implementation that is so simple that
          > we are usually up, so anyone can use LON as a testing ground. Due to
          > a number of mails and personal communications I know that LON has
          > been used in that sense, and some developers even found it useful
          > for other features, like our provision of number names in several
          > languages. So, what is called "noise by design" here is actually an
          > actively used dataset that managed to raise, as we had hoped,
          > discussions about the point of counting triples, was a factor in the
          > discussion about literals as subjects, made us rethink the notion of
          > "semantics" and computational properties of RDF entities in a
          > different way, and is involved in the discussion about quality of
          > LOD. With respect to that, in my opinion, LON has achieved and
          > exceeded its expectations, but I understand anyone who disagrees.
          > Besides that, it was, and is, huge fun.
          >
          > Now to some topics of the discussion:
          >
          > On the issue of the LOD cloud diagram: I want to express my
          > gratitude to all the people involved, for the effort they
          > voluntarily put in its development and maintenance. I find it
          > especially great that it is becoming increasingly transparent how
          > the diagram is created and how the datasets are selected. Chris has
          > referred to a set of conditions that are expected for inclusion, and
          > before the creation of the newest iteration there was an explicit
          > call on this mailing list to gather more information. I can only
          > echo the sentiment that if someone is unhappy with that diagram,
          > they are free to create their own and put it online. The data is
          > available, the SVG is available and editable, and they use licenses
          > that allow modification and republishing.
          >
          > Enrico is right that a system like Watson (or Sindice), which
          > automatically gathers datasets from the Web instead of using a
          > manually submitted and managed catalog, will probably turn out to be
          > the better approach. Watson used to have an overview with statistics
          > on its current content, and I really loved that overview, but this
          > feature has been disabled for a few months. If it was available,
          > especially in any graphical format that can be easily reused in
          > slides -- for example, graphs on the growth of the number of
          > triples, datasets, etc., graphs on the change of cohesion,
          > vocabulary reuse, etc. over time, within the Watson corpus -- I have
          > no doubt that such graphs and data would be widely reused, and would
          > in many instances replace the current usage of the cloud diagram.
          > (I am furthermore curious about Enrico's statement that the Semantic
          > Web =/= Linked Open Data and wonder what he means here, but that is
          > a completely different thread.)
          >
          > Finally, to what I consider most important in this thread:
          >
          > I also find it a shame that this thread has been hijacked,
          > especially since the original topic was so interesting. The original
          > email by Anja was not about the LOD cloud, but rather about -- as
          > the title of the thread still suggests -- the compliance of LOD with
          > some best practices. Instead of the question "is X in the diagram",
          > I would much rather see a discussion on "are the selected quality
          > criteria good criteria? why are some of them so little followed? how
          > can we improve the situation?" Anja has pointed to a wealth of
          > openly available numbers (no pun intended), that have not been
          > discussed at all. For example, only 7.5% of the data sources provide
          > a mapping of "proprietary vocabulary terms" to "other vocabulary
          > terms". For anyone building applications to work with LOD, this is a
          > real problem.
          >
          > Whenever I was working on actual applications using LOD, I got
          > disillusioned. The current state of LOD is simply insufficient to
          > sustain serious application development on top of it. Current best
          > practices (like follow-your-nose) are theoretically sufficient, but
          > not fully practical. To just give a few examples:
          > * imagine you get an RDF file with some 100 triples, including some
          > 120 vocabulary terms. In order to actually display those, you need
          > the label for every single one of these terms, preferably in the
          > user's language. But most RDF files do not provide such labels for
          > terms they merely reference. In order to actually display them, we
          > need to resolve all these 120 terms, i.e. we need to make more than
          > a hundred calls to the Web -- and we are only talking about the
          > display of a single file! In Semantic MediaWiki we had, from the
          > beginning, made sure that all referenced terms are accompanied by
          > some minimum definition, providing labels, types, etc., which
          > enables tools to at least create a display quickly and then gather
          > further data, but that practice was not adopted. Nevermind the fact
          > that language labels are basically not used for multi-linguality
          > (check out Chapter 4 of my thesis for the data, it's devastating).
          > * URIs. Perfectly valid URIs, e.g. as used in Geonames, like
          > http://sws.geonames.org/3202326/ suddenly cause trouble, because
          > their serialization as a QName is, well, problematic.
          > * missing definitions. E.g. DBpedia has the properties
          > http://dbpedia.org/ontology/capital and
          > http://dbpedia.org/property/capital -- used in the very same file
          > about the same country. Resolving them will not help you at all to
          > figure out how they relate to each other. As a human I may make an
          > educated guess, but for a machine agent? And in this case we are
          > talking about the *same* data provider, nevermind
          > cross-data-provider mapping.
          >
          > I could go on for a while -- and these are just examples *on top* of
          > the problems that Anja raises in her original post, and I am sure
          > that everyone who has actually used LOD in the wild has stumbled
          > upon even more such problems. She is raising here a very important
          > point for the practical application of the data. But instead of
          > discussing these issues that actually matter, we talk about bubble
          > graphs, that are created and maintained voluntarily, and why a
          > dataset is included or not, even though the criteria have been made
          > transparent and explicit. All these issues seriously hamper the
          > uptake of LOD and lead to the result that it is so much easier to
          > use dedicated, proprietary APIs in many cases.
          >
          > At one point it was stated that Chris' criteria were random and hard
          > to fulfill in certain cases. If you'd ask me, I would suggest much
          > more draconian criteria, in order to make data reuse as simple as we
          > all envision. I really enjoy the work of the pedantic web group with
          > respect to this, providing validators and guidelines, but in order
          > to figure out what really needs to be done, and what the criteria
          > for good data on the Semantic Web need to look like, we need to get
          > back to Anja's original questions. I think that is a question we may
          > try to tackle in Shanghai in some form; I at least would find that
          > an interesting topic.
          >
          > Sorry again for the length of this rant, and I hope I have offended
          > everyone equally; I really tried not to single anyone out,
          > Denny
          >
          > P.S.: Finally, a major reason why I think I shouldn't have commented
          > on this thread is because it involves something I co-created, and
          > thus I am afraid it is impossible to stay unbiased. I consider
          > constant advertising of your own ideas tiring, impolite, and bound
          > to lead to unproductive discussions due to emotional investment. If
          > the work you do is good enough, you will find champions for it. If
          > not, improve it or do something else.
          >
          >
          >
          > On Oct 21, 2010, at 20:56, Martin Hepp wrote:
          >
          > > Hi all:
          > >
          > > I think that Enrico really made two very important points:
          > >
          > > 1. The LOD bubbles diagram has very high visibility inside and
          > > outside of the community (up to the point that broad audiences
          > > believe the diagram would define relevance or quality).
          > >
          > > 2. Its creators have a special responsibility (in particular as
          > > scientists) to maintain the diagram in a way that enhances insight
          > > and understanding, rather than conveying false facts and confusing
          > > people.
          > >
          > > So Kingsley's argument that anybody could provide a better diagram
          > > does not really hold. It will harm the community as a whole,
          > > sooner or later, if the diagram misses the point, simply based on
          > > the popularity of this diagram.
          > >
          > > And to be frank, despite other design decisions, it is really
          > > ridiculous that Chris justifies the inclusion of Denny's numbers
          > > dataset as valid Linked Data, because that dataset is, by design
          > > and known to everybody in the core community, not data but noise.
          > >
          > > This is the "linked data landfill" mindset that I have kept on
          > > complaining about. You make it very easy for others to discard the
          > > idea of linked data as a whole.
          > >
          > > Best
          > >
          > > Martin
          > >
          > >
        • john.nj.davies@bt.com
          Message 4 of 4, Oct 22, 2010

            This article from the NYT may provide an amusing distraction from the current discussion - I thought the PowerPoint slide shown looked eerily familiar ;-)

            http://www.nytimes.com/2010/04/27/world/27powerpoint.html?_r=1

            John

            PS excellent post Denny IMHO

            Dr John Davies
            Chief Researcher
            Future Business Applications & Services
            BT Innovate & Design
            __________________________________________________
            Tel:    +44 1473 609583
            Email:    john.nj.davies@...

