
Searching in large text files

  • mehlng
    Message 1 of 2, Aug 3, 2004
      I'm thinking of implementing a free-software solution for searching
      for a couple of words within a distance bound in a large text file.
      I.e. given a large text file, I'm given some words [aba, ima, moomie]
      and a number, say 10, and I want to determine whether and where in
      the big text file there is a position x such that all the words
      occur in [x-10:x+10]. This is for searching in large computerized
      books and the like. Two similar commercial programs I've found on
      the web are the BIU responsa project http://www.biu.ac.il/JH/Responsa/
      and CDI's takdin program http://www.takdinet.co.il/. Obviously this
      should support Hebrew.
      A) Anyone who knows of similar open-source initiatives is welcome to
      let me know.
      B) I thought of a general algorithm for the problem and I'd like to
      hear your opinion on my method:
      Build a big Berkeley DB (or similar) database of all words in the
      text, attaching to every word *all* of its occurrences in the text
      (i.e. "hello" occurs at [134, 1233, 3004] bytes from the beginning).
      A search for a list of words will pick the word with the smallest
      number of occurrences and check, at each of its occurrences, whether
      the other words appear near it. A possible optimization for words
      with many occurrences is adding a new database entry for each word
      occurring near them, i.e. in a text with "rare3 common rare1 rare2"
      we'll add common+rare1, common+rare2, common+rare3 to the database,
      all pointing to the same paragraph of the text quoted before. My
      only fear is that the database will be too large.
      Another option is dividing the text (either naturally, by
      chapters/pages, or whatever) into k-byte paragraphs, pointing each
      word to the paragraphs where it occurs, and then searching those
      paragraphs manually for the word, so that the database's size is
      reduced.
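      A rough, untested sketch of the first (byte-offset) variant of that
      idea, just to make it concrete -- the names build_index,
      has_offset_near and find_ranges are placeholders, and the distance
      bound here is measured in bytes around the anchor occurrence:

      import re
      from bisect import bisect_left

      def build_index(text):
          """Map every word to the sorted list of its byte offsets."""
          index = {}
          for match in re.finditer(r'\w+', text):
              index.setdefault(match.group(), []).append(match.start())
          return index

      def has_offset_near(offsets, pos, dist):
          """True if a sorted offset list has an entry within dist of pos."""
          i = bisect_left(offsets, pos - dist)
          return i < len(offsets) and offsets[i] <= pos + dist

      def find_ranges(index, words, dist):
          """Yield offsets of the rarest word around which all words occur."""
          rarest = min(words, key=lambda w: len(index.get(w, [])))
          others = [w for w in words if w != rarest]
          for pos in index.get(rarest, []):
              if all(has_offset_near(index.get(w, []), pos, dist)
                     for w in others):
                  yield pos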

      I'm eager to hear wiser solutions
    • Moshe Zadka
      Message 2 of 2, Aug 8, 2004
        On Tue, 3 Aug 2004, mehlng wrote:

        > B) I thought of a general algorithm for the problem and I'd like to
        > hear your opinion on my method:
        <snip>

        Well, the algorithm seems heavy (very heavy, in fact) on memory
        (consider that the text will often contain enough garbage words to
        blow up the database) and also more complicated than necessary.

        Here is some pseudo-code (IOW, undebugged Python) that solves this using
        a Buffer and a (fairly specialized) multi-set, which always knows how many
        words it needs to complete a match. In my solution, I assume the GUI wants
        to highlight the whole range of where the words appear, but it is easy
        to give other solutions. For example, my solution is completely line/column
        based, and does not give byte-information. That, too, is easily fixed.

        The solution is 40 lines or so, and should be fast enough for all but the
        most demanding applications.

        An easy optimization is to reimplement Buffer to use a rolling pointer
        rather than moving all N elements on each new one -- I haven't done this
        mostly because this would give less clear code, and also because such an
        optimization is probably better done through using a custom C class.
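        For illustration, a rough, untested sketch of that rolling-pointer
        variant might look like the following; note that search() below
        would then have to iterate items() instead of reading .buffer
        directly, since the underlying list is no longer in insertion
        order.

        class RingBuffer:
            """Same add() contract as Buffer, but overwrites one slot in
            place instead of shifting all N elements."""
            def __init__(self, size, default=None):
                self.buffer = [default] * size
                self.pos = 0      # oldest slot, i.e. the next one to overwrite
            def add(self, item):
                old = self.buffer[self.pos]   # the element falling out of the window
                self.buffer[self.pos] = item
                self.pos = (self.pos + 1) % len(self.buffer)
                return old
            def items(self):
                """Window contents from oldest to newest."""
                return self.buffer[self.pos:] + self.buffer[:self.pos]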

        Of course, this algorithm is based on the
        assumption, probably not true for stuff like Takdin, that a document
        is not searched often enough for it to be worth it to produce a database.

        The first thing a database can do is, of course, use a fixed-field
        format into which it converts the words, so that the parsing (but
        not the calculation) will be much faster. It could also convert
        words into CRC32 values, which are then used to index the multi-set
        in a much faster way. This might give false positives, but false
        positives are merely an inefficiency, not an incorrectness, as long
        as a second layer verifies matches with a slower, more dependable
        algorithm. In all those cases the database would be approximately
        as big as the original documents, so the worst-case scenario is
        needing twice the disk space. For example, a database that stores
        only the word's offset and its CRC32 (which is good enough to pass
        to the second, slow layer) will take 8 bytes per word. If we assume
        the average word is 5-6 characters (6-7 including the space), then
        the database will take about the same space as the text itself.

        Hell, for pseudo-code, you can use roughly the same code: all that needs
        to change is readWords.
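        An untested sketch of what that replacement might look like,
        assuming a made-up record layout of a 4-byte little-endian byte
        offset followed by the word's 4-byte CRC32; the query words passed
        to search() would then be their CRC32s, and a slower second pass
        would confirm the reported matches against the original text:

        import struct
        import zlib

        RECORD = struct.Struct('<II')   # (offset, crc32): 8 bytes per word

        def word_crc(word):
            return zlib.crc32(word.encode('utf-8')) & 0xffffffff

        def readIndexedWords(fin):
            """Yield (crc32, byte_offset) pairs from the fixed-field index.
            fin must be opened in binary mode ('rb')."""
            while True:
                record = fin.read(RECORD.size)
                if len(record) < RECORD.size:
                    break
                offset, crc = RECORD.unpack(record)
                yield crc, offset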

        # Full pseudo-code: a sliding window of `size` words over the
        # stream produced by readWords.
        import itertools
        import re

        class MultiSet:
            """Counts the query words currently inside the window and
            tracks how many distinct words are still missing from it."""
            def __init__(self, possible):
                self.possible = dict(zip(possible, itertools.repeat(0)))
                self.missing = len(possible)
            def add(self, word):
                if not self.possible[word]:
                    self.missing -= 1
                self.possible[word] += 1
            def remove(self, word):
                self.possible[word] -= 1
                if not self.possible[word]:
                    self.missing += 1

        class Buffer:
            """Fixed-size window; add() returns the element that fell out."""
            def __init__(self, size, default=None):
                self.buffer = [default]*size
            def add(self, item):
                self.buffer.append(item)
                return self.buffer.pop(0)

        def search(words, size, lst):
            mset = MultiSet(words)
            buffer = Buffer(size, (None, None))
            for (word, meta) in lst:
                if word not in words:
                    word = None              # non-query word: only takes up space
                delword, _ = buffer.add((word, meta))
                if delword is not None:
                    mset.remove(delword)     # a query word slid out of the window
                if word is not None:
                    mset.add(word)
                if not mset.missing:         # every query word is in the window
                    yield [meta for word, meta in buffer.buffer
                           if word is not None]

        def readWords(fin):
            word = re.compile(r'\w+')
            for num, line in enumerate(fin):
                for match in word.finditer(line):
                    yield match.group(), (num, match.span())

        def rangesInFile(fin, words, size):
            for places in search(words, size, readWords(fin)):
                yield ((places[0][0], places[0][1][0]),
                       (places[-1][0], places[-1][1][1]))
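
        A small usage example, assuming a UTF-8 text file-like object; a
        window of 21 words corresponds roughly to "10 words either side"
        in the original question. Each result is a pair of (line, column)
        positions spanning the match.

        import io

        sample = io.StringIO("aba something ima more words moomie\n")
        for start, end in rangesInFile(sample, ['aba', 'ima', 'moomie'], 21):
            print(start, end)    # prints: (0, 0) (0, 35)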