Re: [APeLearning] Re: Dividing Subject Materials for Download
Nov 24, 2002

At 07:34 AM 11/13/02 +0000, you wrote:
David, and the gang.
>Keith, I have some questions and further suggestions concerning your
>great recommendations. I hope others will join in:
Okay, one 'other' joining in ... [where is everyone else <G> ??]
> > 1. Library Cataloging. I believe that for this project to be
> > truly helpful to our schools, there _must_ be an easy way for
> > users to access the materials available. I propose a two-fold
> > method.
> > a. use of an indexer on the local server. Presently our school
> > is using a linux based lan and I have installed htDig, an open
> > source indexer. This software automatically indexes all the
> > materials in the intranet and makes it available to the user via
> > a yahoo/google-type search page. It even has relevance scores
> > for searches, etc.
>Does htDig index everything in a PDF? What is being indexed? If htDig
>can read the contents of both PDF and HTML files the whole nature of
Technically speaking, htDig allows the use of an 'external parser' for
selected file name extensions ... AND, there is in the linux world a
reasonably good included parser available that works well with htDig and
allows for the extraction of 'text' from a PDF. Thus what htDig indexes is
a 'plain text' rendition(/representation/interpretation) of the original PDF.
But, the PDF parser has some serious limitations, based upon PDF's internal
storage of text/data, which can NOT be overcome.
1) It does not understand text flows.
2) Hyphenated words, or words across lines, stay broken and separated.
3) ALL info about headers, size/font, importance, and other "weighings"
used by htDig are lost and not available.
4) The entire PDF document/file exists in page broken chunks, but for the
indexer it is treated as if it were just =one= page.
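For concreteness, the wiring described above amounts to a one-line addition to htdig.conf. The `external_parsers` attribute is htDig's own; the converter path and script name here are hypothetical, and invocation details vary by htDig version:

```
# htdig.conf fragment (sketch): hand PDFs to an external converter that
# emits plain text for the indexer. pdf2text.sh is a hypothetical wrapper
# that might simply run `pdftotext "$1" -` from the xpdf tools.
external_parsers: application/pdf->text/plain /usr/local/bin/pdf2text.sh
```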
THEREFORE, if you were to have the same book/content ... one stored in HTML
page(s), and the other stored in a PDF file ... the htDig indexing/indexes,
and the index-search results would be very different for each.
How that affects our strategy/approach, I do not know. ??
I do know that PDF makes for =big= files which can be functional for
Intranet, but NOT practical for Internet or slow bandwidths. Also, indexed
PDF does not allow for 'weighing' since it treats ALL text equal and as
seen on the screen (and thus sometimes in fragments). Basically, PDF is a
page preservation and presentation format ... think of it as a frozen
'Print to Screen'.
YET, at this time, htDig and its associated PDF parser are still the best
available tool/method for this task. We are investigating it further.
Another approach/solution which seems quite elegant is the one used by
Google.com ... they parse the page to a pseudo text/html format ... and
then save that in their cache ... and index that version/copy of the text.
AND, when you ask to see the file, they show you their version, with an
option of seeing the original -- result is much faster, and a bit more
elegant (technically speaking).
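As a rough sketch of that cache-and-index idea (all names here are hypothetical; this is not htDig or Google code), assuming only Python's standard html.parser:

```python
# Sketch of the "Google-style" approach described above: keep a plain-text
# rendition of each page in a cache, build the search index from that copy,
# and serve the cached version (with a link back to the original) on request.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML page, ignoring tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

def cache_text(url, html, cache):
    """Store the plain-text rendition of a page under its URL."""
    p = TextExtractor()
    p.feed(html)
    cache[url] = " ".join(p.chunks)
    return cache[url]

cache = {}
text = cache_text("http://example.org/book.html",
                  "<html><body><h1>Genesis</h1>"
                  "<p>In the beginning</p></body></html>",
                  cache)
print(text)  # → Genesis In the beginning
```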
> > b. creation of a library catalog entry for each book or major
> > resource (i.e. call number, subject listings). This is a much
> > bigger task and will require human hands-on involvement, but if
> > the cataloging data can be presented in a format that most
> > library programs can read, the data can be incorporated directly
> > into the library catalog and be searchable in the same way all
> > the other books are. I am looking into the Z39.50 protocol which
> > allows one to automatically search large library catalogs (e.g.
> > Library of Congress). If we can make this work, it will make
> > things much easier.
>As stupid as this sounds,
Not stupid ... AND, all viable/interesting ideas should be given a chance
to be thought through and/or thought out.
>what would you all think if we named the
>files by LC numbers, with the .pdf or .html? If we do this from the
>beginning it will not be that much of a hassle since we can access
>either LC or any Seminary Library online and get the call #.
>Our files themselves would function like an open-stack library! And we
>could actually use these other libraries to do our searching.
At first crack/thought ... a good idea. I actually like it.
BUT, there are some obvious difficulties.
1) as you point out below, many many books are NOT in LC ... and since our
focus is Indonesian materials, that would represent about 99.9%
2) quite a few books we have found actually exist with several/multiple LC
numbers for one book ... especially the older and public domain books for
which we are likely to get the largest quantities and easiest permissions.
[IF not convinced, try using California's Melvyl Catalog ... very nice]
3) most 'html' books, and many 'pdf' books are in multiple files. AND, to
rename all of the internal links and dependencies would be a guaranteed
nightmare, and besides lots of extra work, it could even lead to potential
claims of us 'changing/tampering' with the materials/originals.
4) some books have more than one version/printing, and perhaps even a
foreign language version (Indonesian we hope <G><)
5) in the world of MSWindows dominated/handicapped browsers, there are
technical reasons for avoiding any second/extra "." (periods) in filenames.
6) the resulting file name would be unreadable, and meaningless to the
average human user (but then again, so are many other naming schemes).
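If the naming idea were still pursued despite these difficulties, deriving a single-extension filename from a call number could be sketched as below; the call number and the mapping rules are purely illustrative, not a real cataloguing scheme:

```python
# Hypothetical sketch of the file-naming idea above: derive a filename from
# an LC call number while avoiding the extra '.' characters that (per
# difficulty 5) can confuse MSWindows-era browsers.
def lc_to_filename(call_number, extension):
    # Replace dots and spaces with '-' so only the final extension dot remains.
    safe = call_number.strip().replace(".", "-").replace(" ", "-")
    return safe + "." + extension

print(lc_to_filename("BS1235.2 .W4 1987", "pdf"))  # → BS1235-2--W4-1987.pdf
```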
HOWEVER, I'm happy to report that "Library Cards" (especially electronic
ones) can and do function in almost exactly that way ...
BUT, I do like the idea, and I/we over here will give it some more thought.
>Those works that are not catalogued in LC are a problem, however.
Yep ... very much so.
AND, at this point in my discussions with KEITH, the cataloguing is one
of the most CRITICAL/important factors -- especially the ability, tools and
underlying understanding for doing the cataloguing.
> > 2. Download format. Ideally we should probably work in two basic
> > formats, html and pdf (with pure text as an extension to html).
> > To be searchable, the pdf files must have OCRed text in them.
> > PDFs only containing scanned materials could not be indexed. It
> > is also possible to have other formats, e.g., word doc or rtf,
> > which could easily be converted into either pdf or html.
>I basically agree about the html and pdf format system. There are
>some freeware drivers that will enable people to convert the txt,
>rtf, doc files into pdf.
There are tools that help take *.DOC to RTF, and then there are good tools
to take RTF to HTML and/or to PDF, or DOC directly to PDF. Along the way
some information/formatting is lost ...
BUT, if the final/only format is PDF, that can be a bit dangerous ...
... since PDF is a ONE-WAY and NON-REVERSIBLE formatting trip.
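The 'OCRed text' point quoted above can be illustrated with a crude heuristic: a page that is only a scanned image contains no show-text operators for an indexer to find. This byte-scan is an assumption-laden sketch (real PDFs usually compress their content streams, which it would miss), not a real detector:

```python
# Crude sketch: look for PDF's show-text operators ('Tj'/'TJ') in the raw
# bytes. A purely scanned PDF carries image XObjects but no text to show,
# so an indexer would find nothing to extract. Compressed content streams
# defeat this naive check; it is for illustration only.
def looks_searchable(pdf_bytes):
    return b"Tj" in pdf_bytes or b"TJ" in pdf_bytes

print(looks_searchable(b"%PDF-1.4 BT (Hello) Tj ET"))          # → True
print(looks_searchable(b"%PDF-1.4 /XObject /DCTDecode image")) # → False
```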
>Do we really want to use html? Or would it not be better to save them
>also as pdf?
Good question.
Needs further thought.
Although it is possible to save an entire site into a PDF document, it can
also be quite annoying, and in my opinion the result is bulky!
Remember, PDF and HTML are built on two fundamentally different 'page' models:
1) HTML is focused on document structure and is rather free form (and
browser dependent) for the display, while
2) PDF is focused on recreating the physical/printed page by saving the
'print image' of the content; it thus throws away much of the document
structure/info but it does succeed in being 'browser independent' (since
you must use PDF's browser).
As you pointed out, approach number 2 is good for preserving the look and
feel of fonts and layout. BUT, if the page(s) are already in HTML, that
benefit disappears ... although it might be nice to wrap ALL of it into one
PDF file.
AND, let me add that I've noticed that whenever I open your html web pages,
my browser wants me to download Japanese Fonts/Extensions ... ;-) ... since
they were available when you made/formatted/saved the page.
>I'm thinking about font problems on the local computer,
>the need to save a folder with the pictures, etc. PDF would be all
>embedded. With Acrobat 5 you can export any of the data. However, I
>really do like having the pics separate.
Fonts, for OT/NT languages, seem to me to be the main issue.
[In the near future, probably, we need to worry about the language fonts
associated with Asia (e.g., Japanese, Chinese, Thai, etc. ...). BTW: how
well does PDF support any/all of those languages?]
I think I/we need more input on this ...
What is best, what is preferred ... ??
AND, the answer might be different for Internet versus Intranet; and again
different for different Language Groups; or theological school types/categories.
> > 3. Subject areas. I suggest we start with what we have and
> > develop it as we need.
>Whatever is being downloaded by an individual should be reported.
Keith has set up a good system for that, not to mention also providing a
good precedent -- he has made a good first step towards collecting lots of
materials.
AND, I agree ... start with what we have (Keith has copies of ALL of ours)
... catalog that, and keep getting more. :)