
Re: [APeLearning] Re: Dividing Subject Materials for Download

  • Marc
    Nov 24 11:35 PM
      At 07:34 AM 11/13/02 +0000, you wrote:

      David, and the gang.

      >Keith, I have some question and further suggestions concerning your
      >great recommendations. I hope others will join in:

      Okay, one 'other' joining in ... [where is everyone else? <G>]

      > > 1. Library Cataloging. I believe that for this project to be
      > > truly helpful to our schools, there _must_ be an easy way for
      > > users to access the available materials. I propose a two-fold
      > > method.
      > > a. use of an indexer on the local server. Presently our school
      > > is using a linux based lan and I have installed htDig, an open
      > > source indexer. This software automatically indexes all the
      > > materials in the intranet and makes it available to the user via
      > > a yahoo/google-type search page. It even has relevance scores
      > > for searches, etc.
      >Does htDig index everything in a PDF? What is being indexed? If htDig
      >can read the contents of both PDF and HTML files, the whole nature of
      >cataloging changes.

      Technically speaking, htDig allows the use of an 'external parser' for
      selected filename extensions ... AND, in the Linux world there is a
      reasonably good parser available that works well with htDig and
      extracts the 'text' from a PDF. Thus what htDig indexes is a
      'plain text' rendition (or representation/interpretation) of the original PDF.
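
      For anyone curious how such an 'external parser' hooks in, here is a
      minimal sketch -- in Python, purely to illustrate; htDig's real converters
      are usually small shell/Perl wrappers -- that shells out to the 'pdftotext'
      tool (from the xpdf package) and hands plain text back for indexing:

      import subprocess
      import sys

      def pdf_to_text(pdf_path):
          """Extract a plain-text rendition of a PDF by shelling out to
          pdftotext (part of the xpdf tools). Returns "" if extraction fails."""
          try:
              result = subprocess.run(
                  ["pdftotext", "-layout", pdf_path, "-"],  # "-" = write to stdout
                  capture_output=True, text=True, check=True)
          except (OSError, subprocess.CalledProcessError):
              return ""
          return result.stdout

      if __name__ == "__main__":
          # usage: python pdf_parser.py book.pdf > book.txt
          print(pdf_to_text(sys.argv[1]))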

      But, the PDF parser has some serious limitations, based upon PDF's internal
      storage of text/data, which can NOT be overcome.
      1) It does not understand text flows.
      2) Hyphenated words, or words split across lines, stay broken and separated.
      3) ALL info about headers, size/font, importance, and the other "weightings"
      used by htDig is lost and not available.
      4) The entire PDF document/file exists in page-broken chunks, but the
      indexer treats it as if it were just =one= page.

      THEREFORE, if you were to have the same book/content ... one stored in HTML
      page(s), and the other stored in a PDF file ... the htDig indexing/indexes,
      and the index-search results would be very different for each.

      How that affects our strategy/approach, I do not know. ??
      I do know that PDF makes for =big= files, which can be functional on an
      Intranet but are NOT practical for the Internet or slow bandwidths. Also,
      indexed PDF does not allow for 'weighting' since it treats ALL text equally,
      and as seen on the screen (and thus sometimes in fragments). Basically, PDF
      is a page preservation and presentation format ... think of it as a frozen
      'Print to Screen'.

      YET, at this time, htDig and the associated PDF parser are still the best
      available tools/methods for this task. We are investigating further.

      Another approach/solution which seems quite elegant is the one used by
      Google.com ... they parse the page to a pseudo text/html format ... and
      then save that in their cache ... and index that version/copy of the text.
      AND, when you ask to see the file, they show you their version, with an
      option of seeing the original -- the result is much faster, and a bit more
      elegant (technically speaking).
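
      As a rough, hypothetical sketch of that cache-and-index idea (the module
      name and path are made up, and it reuses the pdf_to_text() sketch above):
      walk the collection, save a plain-text 'cached' rendition next to each PDF,
      point the indexer at the text copies, and keep the download links pointing
      at the originals.

      from pathlib import Path

      from pdf_parser import pdf_to_text  # hypothetical module from the sketch above

      def build_text_cache(library_root):
          """Write a .txt 'cache' copy alongside every PDF under library_root.
          The indexer crawls the .txt files; readers still get the original PDF."""
          for pdf in Path(library_root).rglob("*.pdf"):
              cache = pdf.with_suffix(".txt")
              if cache.exists() and cache.stat().st_mtime >= pdf.stat().st_mtime:
                  continue  # cached text is already up to date
              cache.write_text(pdf_to_text(str(pdf)), encoding="utf-8")

      if __name__ == "__main__":
          build_text_cache("/var/www/intranet/library")  # hypothetical path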

      > > b. creation of a library catalog entry for each book or major
      > > resource (i.e. call number, subject listings). This is a much
      > > bigger task and will require human hands on involvement, but if
      > > the cataloging data can be presented in a format that most
      > > library programs can read, the data can be incorporated directly
      > > into the library catalog and be searchable in the same way all
      > > the other books are. I am looking into the z39.50 protocol which
      > > allows one to automatically search large library catalogs (eg.
      > > Library of Congress). If we can make this work, it will make
      > > things much easier.
      >As stupid as this sounds,

      Not stupid ... AND, all viable/interesting ideas should be given a chance
      to be thought through and/or thought out.

      >what would you all think if we named the
      >files by LC numbers, with the .pdf or .html? If we do this from the
      >beginning it will not be that much of a hassle since we can access
      >either LC or any Seminary Library online and get the call#.
      >Our files themselves would function like an open-stack library! And we
      >could actually use these other libraries to do our searching.

      At first crack/thought ... a good idea. I actually like it.

      BUT, there are some obvious difficulties.
      1) as you point out below, many many books are NOT in LC ... and since our
      focus is Indonesian materials, that would represent about 99.9%
      2) quite a few books we have found actually exist with several/multiple LC
      numbers for one book ... especially the older and public domain books for
      which we are likely to get the largest quantities and easiest permissions.
      [IF not convinced, try using California's Melvyl Catalog ... very nice]
      3) most 'html' books, and many 'pdf' books, are in multiple files. AND, to
      rename all of the internal links and dependencies would be a guaranteed
      nightmare; besides being lots of extra work, it could even lead to potential
      claims of us 'changing/tampering with' the materials/originals.
      4) some books have more than one version/printing, and perhaps even a
      foreign language version (Indonesian, we hope <G>)
      5) in a world of MSWindows-dominated/handicapped browsers, there are
      technical reasons for avoiding any second/extra "." (periods) in filenames
      (see the sketch after this list).
      6) the resulting file name would be unreadable and meaningless to the
      average human user (but then again, so are many other naming schemes).
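
      For what it's worth, here is a hypothetical sketch of how an LC call number
      could be flattened into a filename that sidesteps difficulty number 5 -- the
      call number in the comment is made up for illustration, and the scheme
      itself is only a suggestion:

      import re

      def lc_to_filename(call_number, extension="pdf"):
          """Flatten an LC-style call number into a filename that is safe for
          web servers and picky browsers: no spaces, and only ONE period --
          the one in front of the extension."""
          safe = re.sub(r"[^A-Za-z0-9]+", "-", call_number).strip("-").upper()
          return "%s.%s" % (safe, extension)

      # example (call number invented for illustration):
      #   lc_to_filename("BS2506 .S3 1983")  ->  "BS2506-S3-1983.pdf"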

      HOWEVER, I'm happy to report that "Library Cards" (especially electronic
      ones) can and do function in almost exactly that way ...

      BUT, I do like the idea, and I/we over here will give it some more thought.

      >Those works that are not catalogued in LC are a problem, however.

      Yep ... very much so.

      AND, at this point in my discussions with KEITH, the cataloguing is one of
      the most CRITICAL/Important factors -- especially the ability, tools, and
      underlying understanding for doing the cataloguing.
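
      Just to make 'a library card' concrete, here is a hypothetical, minimal
      record -- every field name and value below is invented for illustration; a
      real import for library software would have to follow whatever format
      (MARC, z39.50 results, etc.) that software actually expects:

      # A hypothetical minimal "electronic library card" record.
      catalog_card = {
          "call_number": "BS2506 .S3 1983",   # made-up LC-style call number
          "title": "Example Title",
          "author": "Example, Author",
          "subjects": ["New Testament -- Commentaries"],
          "language": "Indonesian",
          "files": {
              "pdf": "library/BS2506-S3-1983.pdf",
              "html": "library/BS2506-S3-1983/index.html",
          },
      }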

      > > 2. Download format. Ideally we should probably work in two basic
      > > formats, html and pdf (with pure text as an extension to html).
      > > To be searchable, the pdf files must have OCRed text in them.
      > > PDFs only containing scanned materials could not be indexed. It
      > > is also possible to have other formats, eg., word doc or rtf,
      > > which could easily be converted into either pdf or html.
      >I basically agree about the html and pdf format system. There are
      >some freeware drivers that will enable people to convert the txt,
      >rtf, and doc files into pdf.

      There are tools that help take *.DOC to RTF, and then there are good tools
      to take RTF to HTML and/or to PDF, or DOC directly to PDF. Along the way,
      some information/formatting is lost ...
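
      As a rough illustration of such a chain (the converters named here --
      wvHtml from the 'wv' package and htmldoc -- are just examples of free tools
      I believe are available; the exact commands and flags would need to be
      checked against whatever is actually installed):

      import subprocess
      from pathlib import Path

      def convert_doc(doc_path):
          """One possible DOC -> HTML -> PDF chain using free command-line
          converters; tool names and flags are assumptions, not a recipe."""
          doc = Path(doc_path)
          html = doc.with_suffix(".html")
          pdf = doc.with_suffix(".pdf")
          # DOC -> HTML: keeps the document structure, loses some formatting
          subprocess.run(["wvHtml", str(doc), str(html)], check=True)
          # HTML -> PDF: a frozen 'print to screen' copy for download
          subprocess.run(["htmldoc", "--webpage", "-f", str(pdf), str(html)], check=True)
          return html, pdf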

      BUT, if the final/only format is PDF, that can be a bit dangerous ...
      ... since PDF is a ONE-WAY and NON-REVERSIBLE formatting trip.
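
      And, tying back to the point quoted earlier that a PDF must contain OCRed
      text to be searchable: a rough sketch (again reusing the hypothetical
      pdf_to_text() wrapper from above) for flagging scan-only PDFs that need
      OCR before they are worth indexing:

      from pathlib import Path

      from pdf_parser import pdf_to_text  # hypothetical module from the earlier sketch

      def needs_ocr(pdf_path, min_chars=200):
          """Heuristic: if pdftotext pulls out almost no text, the PDF is
          probably just scanned page images and should be OCRed first."""
          return len(pdf_to_text(str(pdf_path)).strip()) < min_chars

      if __name__ == "__main__":
          for pdf in Path(".").rglob("*.pdf"):
              if needs_ocr(pdf):
                  print("needs OCR:", pdf)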

      >Do we really want to use html? Or would it not be better to save them
      >also as pdf?

      Good question.
      Needs further thought.

      Although it is possible to save an entire site into a PDF document, it can
      also be quite annoying, and in my opinion the result is bulky!

      Remember, PDF and HTML are built on two fundamentally different 'page'
      paradigms --
      1) HTML is focused on document structure and is rather free form (and
      browser dependent) for the display, while
      2) PDF is focused on recreating the physical/printed page by saving the
      'print image' of the content; it thus throws away much of the document
      structure/info but it does succeed in being 'browser independent' (since
      you must use PDF's browser).

      As you pointed out, approach number 2 is good for preserving the look and
      feel of fonts and layout. BUT, if the page(s) are already in HTML, that
      benefit disappears ... although it might be nice to wrap ALL of it into one
      file.

      AND, let me add that I've noticed that whenever I open your html web pages,
      my browser wants me to download Japanese Fonts/Extension ... ;-) ... since
      they were available when you made/formatted/saved the page.

      >I'm thinking about font problems on the local computer,
      >the need to save a folder with the pictures, etc. PDF would be all
      >embedded. With Acrobat 5 you can export any of the data. However, I
      >really do like having the pics separate.

      Fonts for OT/NT languages seem to me to be the main issue.

      [In the near future, probably, we need to worry about the language fonts
      associated with Asia (e.g., Japanese, Chinese, Thai, etc.). BTW, how
      well does PDF support any/all of those languages?]

      I think I/we need more input on this ...
      What is best, what is preferred ... ??
      AND, the answer might be different for Internet versus Intranet; and again
      different for different language groups, or theological school types/categories.

      > > 3. Subject areas. I suggest we start with what we have and
      > > develop it as we need.
      >Whatever is being downloaded by an individual should be reported.

      Keith has set up a good system for that, not to mention also providing a
      good precedent -- he has made a good first step towards collecting lots of
      materials ...

      AND, I agree ... start with what we have (Keith has copies of ALL of ours)
      ... catalog that, and keep getting more. :)