
Re: Matching/Sorting line terminations: -- Solved

  • J.B. Lethbridge
    Message 1 of 1 , Oct 1, 2012
      On Tue, Oct 2, 2012 at 2:18 AM, Tony Mechelynck
      <antoine.mechelynck@...> wrote:
      > On 30/09/12 20:18, jbl wrote:
      >>
      >> Thanks -- the lines are in English 400 years old, hence the eccentric
      >> spelling.
      >>
      >> The sorting you suggest is just what produced the first list, the raw
      >> file: it is mixed up with lines without repetitions, and those
      >> repetitions that occur are not ordered: so that in the whole file you
      >> might find two repetitions here and three there, with a six between
      >> them, and odd single lines getting in the way. Over 33,000 lines, this
      >> becomes impossible to manage by hand.
      >>
      >> 'his', 'her' etc are not related or linked: they are different terms
      >> and do not count as repetitions.
      >>
      >> In the first group (repeating two terms) there are three separate
      >> groups of terminations: 'to abate', 'gan abate' and 'by might'.
      >>
      >> In the next group (repeating three terms) I accidentally left out the
      >> second line in each case -- again there are supposed to be three
      >> different pairs. And in the six repeating terms group, I missed out
      >> the first line of the group.
      >>
      >> Sorry for the missing lines, and thanks for your comments,
      >>
      >> Julian
      >>
      >>
      >> On Sun, Sep 30, 2012 at 7:20 PM, Tony Mechelynck
      >> <antoine.mechelynck@...> wrote:
      >>>
      >>> On 30/09/12 18:14, jbl wrote:
      >>>>
      >>>>
      >>>> Hi: The first difficulty with the problem I describe below is that I
      >>>> don't know what the key terms would be to search Google accurately. I
      >>>> have searched for a long time already. So if anyone could even tell me
      >>>> what it is I am looking for I'd be very grateful.
      >>>>
      >>>> The problem is this: I have a large file of poetry in alphabetical
      >>>> order sorted on the last term in each line; I post an excerpt in
      >>>> sample1 below. I want to sort it so that lines that share, say, the
      >>>> last two terms (on the right) with the last two terms of any other
      >>>> line are in one group, those lines that share the last three terms in
      >>>> another and so on up to seven places -- as in sample2 below.
      >>>>
      >>>> The first difficulty I have is getting the search terms into an :ex
      >>>> command -- I need to find for each line whether there are any others
      >>>> that match it to seven terminal places, then to six and so on. I could
      >>>> do the simple locating with something cumbersome like this:
      >>>>
      >>>> map ö $BB2yW: p0ig/ A$/m0
      >>>> map ä $ByW: p0ig/ A$/m0
      >>>>
      >>>> and so on up to seven places. But it must be possible to generalize
      >>>> that somehow. What would be the general form of an expression for
      >>>> finding the last 'x' words of a line in the same position somewhere
      >>>> else in the file?
      >>>>
      >>>> Apart from the crudeness of the operation, the trouble would be
      >>>> exporting (redirecting?) the results automatically and keeping the
      >>>> exported results in order (as in sample2 below). And also how to
      >>>> iterate it usefully through the whole file.
      >>>>
      >>>> If I started at the top of the raw file, iterating something like
      >>>> these commands, checking each line and exporting the results to a
      >>>> single file, the resulting file would be identical to the original
      >>>> file. I need, I think, to be able to eliminate those lines which do
      >>>> not share any terminations with any other lines. I think starting
      >>>> (somehow) with seven places then six and down to two, would leave me
      >>>> the non-sharing lines by themselves in the original file(?).
      >>>>
      >>>> But I'm not even sure what the strategic logic should be: exactly what
      >>>> tasks should I be trying to get the program to perform? The process
      >>>> needs to be automated because the file is 33,000 lines long. As I say,
      >>>> if someone could tell me what key terms, what types of operations, I
      >>>> should be looking for on Google, it would help a great deal.
      >>>>
      >>>> Many thanks for any help, JBL
      >>>> Vim 7.x Debian/Win7
      >>>>
      >>>> Here are the samples, one before (from the raw file) and one after (as
      >>>> I'd like the whole thing organized).
      >>>>
      >>>> Raw Lines
      >>>> 6.4.30.7 All these our ioyes and all our blisse abate
      >>>> 2.12.15.9 And after them did driue with all her power and might
      >>>> 3.9.14.4 And both full liefe his boasting to abate
      >>>> 6.6.27.9 And layd at him amaine with all his will and might
      >>>> 6.1.38.2 At once did heaue with all their powre and might
      >>>> 6.1.12.7 But through misfortune which did me abase
      >>>> 5.11.57.9 Did set vpon those troupes with all his powre and might
      >>>> 6.2.26.5 For deare affection and vnfayned zeale
      >>>> 3.2.13.6 For hardy thing it is to weene by might
      >>>> 4.9.6.9 He her vnwares attacht and captiue held by might
      >>>> 6.1.32.9 He spide come pricking on with al his powre and might
      >>>> 6.6.31.9 He stayd his second strooke and did his hand abase
      >>>> 3.8.51.6 Mote not mislike you also to abate
      >>>> 3.8.28.7 Ne ought your burning fury mote abate
      >>>> 1.7.35.1 No magicke arts hereof had any might
      >>>> 5.8.46.8 She at her ran with all her force and might
      >>>> 1.10.2.8 She cast to bring him where he chearen might
      >>>> 3.7.35.3 That at the last his fiercenesse gan abate
      >>>> 4.8.17.8 That her inburning wrath she gan abate
      >>>> 1.10.47.7 That hill they scale with all their powre and might
      >>>> 4.6.3.4 The armes he bore his speare he gan abase
      >>>> 5.9.39.4 To all assayes; his name was called Zele
      >>>> 2.9.7.4 To serue that Queene with all my powre and might
      >>>> 2.1.26.7 When suddenly that warriour gan abace
      >>>> 6.12.23.9 Where he him found despoyling all with maine and might
      >>>> 1.5.1.8 With greatest honour he atchieuen might
      >>>> 4.8.1.7 With sufferaunce soft which rigour can abate
      >>>> 5.5.30.1 With that she turn'd her head as halfe abashed
      >>>>
      >>>> Sorted lines
      >>>> ---Lines not repeating final term (=Unique lines):
      >>>> FQ 2.1.26.7 When suddenly that warriour gan abace
      >>>> FQ 5.5.30.1 With that she turn'd her head as halfe abashed
      >>>> FQ 6.2.26.5 For deare affection and vnfayned zeale
      >>>> FQ 5.9.39.4 To all assayes; his name was called Zele
      >>>>
      >>>> ---Lines repeating final term only:
      >>>> FQ 6.1.12.7 But through misfortune which did me abase
      >>>> FQ 6.6.31.9 He stayd his second strooke and did his hand abase
      >>>> FQ 4.6.3.4 The armes he bore his speare he gan abase
      >>>> FQ 3.8.28.7 Ne ought your burning fury mote abate
      >>>> FQ 4.8.1.7 With sufferaunce soft which rigour can abate
      >>>> FQ 6.4.30.7 All these our ioyes and all our blisse abate
      >>>> FQ 1.7.35.1 No magicke arts hereof had any might
      >>>> FQ 1.10.2.8 She cast to bring him where he chearen might
      >>>> FQ 1.5.1.8 With greatest honour he atchieuen might
      >>>> FQ 6.6.27.9 And layd at him amaine with all his will and might
      >>>
      >>>
      >>> abase == abate == might? I guess I'm too stupid.
      >>>
      >>>>
      >>>> ---Lines repeating final two terms:
      >>>> FQ 3.9.14.4 And both full liefe his boasting to abate
      >>>> FQ 3.8.51.6 Mote not mislike you also to abate
      >>>> FQ 4.8.17.8 That her inburning wrath she gan abate
      >>>> FQ 3.7.35.3 That at the last his fiercenesse gan abate
      >>>> FQ 3.2.13.6 For hardy thing it is to weene by might
      >>>> FQ 4.9.6.9 He her vnwares attacht and captiue held by might
      >>>>
      >>>> ---Lines repeating final three terms:
      >>>> FQ 5.8.46.8 She at her ran with all her force and might
      >>>> FQ 6.12.23.9 Where he him found despoyling all with maine and might
      >>>> FQ 2.9.7.4 To serue that Queene with all my powre and might
      >>>
      >>>
      >>> force == maine == powre (sic)? You will have to explain that to me.
      >>>
      >>>
      >>>> ........
      >>>>
      >>>> ---Lines repeating final six terms:
      >>>> FQ 2.12.15.9 And after them did driue with all her power and might
      >>>> FQ 5.11.57.9 Did set vpon those troupes with all his powre and might
      >>>> FQ 6.1.32.9 He spide come pricking on with all his powre and might
      >>>> FQ 1.10.47.7 That hill they scale with all their powre and might
      >>>> FQ 6.1.38.2 At once did heaue with all their powre and might
      >>>
      >>>
      >>> I suppose "powre" is four times a typo.
      >>> her == his == their? Or are there three different sets of lines, one
      >>> of them a singleton?
      >>>
      >>>>
      >>>
      >>> This sounds like a "decorate - sort - undecorate" problem:
      >>> 1. Put each line into "sortable" order (in this case, reverse the order
      >>> of the terms, so that the last term comes at the start of the line, then
      >>> one space, then the last but one, then one space, etc.);
      >>> 2. Sort
      >>> 3. Put the lines back like they used to be (i.e., reverse the order of
      >>> the terms again).
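
The three steps above can be sketched in Python as follows. This is only an illustration, not code from the thread: the helper names `reverse_terms` and `sort_by_line_endings` are hypothetical, and it assumes terms are separated by whitespace (runs of spaces collapse to one space on the round trip).

```python
def reverse_terms(line):
    """Reverse the order of the whitespace-separated terms in a line."""
    return " ".join(reversed(line.split()))


def sort_by_line_endings(lines):
    """Decorate (reverse the terms), sort, then undecorate (reverse again)."""
    decorated = [reverse_terms(line) for line in lines]   # step 1: decorate
    decorated.sort()                                      # step 2: sort
    return [reverse_terms(line) for line in decorated]    # step 3: undecorate


# A few lines taken from the raw sample quoted earlier in the thread.
sample = [
    "3.2.13.6 For hardy thing it is to weene by might",
    "3.9.14.4 And both full liefe his boasting to abate",
    "4.9.6.9 He her vnwares attacht and captiue held by might",
    "3.8.51.6 Mote not mislike you also to abate",
]

for line in sort_by_line_endings(sample):
    print(line)
```

Because the whole line is reversed, the leading reference number ends up as the last sort key, so it only breaks ties between lines whose words are otherwise identical.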
      >>>
      >>> Note that no "dumb" computer will be able to find out that "his", "her"
      >>> and "their" are to be sorted together, unless you somehow program it
      >>> into the logic of your steps 1 and 3.
      >>>
      >>>
      >>> Best regards,
      >>> Tony.
      >>> --
      >>> "I'd love to go out with you, but I'm taking punk totem pole carving."
      >>>
      >>> --
      >>> You received this message from the "vim_use" maillist.
      >>> Do not top-post! Type your reply below the text you are replying to.
      >>> For more information, visit http://www.vim.org/maillist.php
      >>
      >>
      >>
      >>
      > So you want to group lines so that the groups of lines which have most in
      > common come at the top. Okay, this is a little more difficult:
      >
      > 1. Reverse the order of the terms
      > 2. Sort
      > 3. Count how many words match, and add that count at the start of every line
      > in the group.
      > 4. Sort again, numerically.
      > 5. Un-decorate (remove the number and reverse the words)
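
A minimal sketch of these five steps, under stated assumptions: the first token of each line is a reference number and is ignored when matching, and each line is counted once, by the best match it has with any other line. The function names are hypothetical. The sketch relies on the fact that, once the reversed lines are sorted, the longest match for any line is always with one of its immediate neighbours.

```python
def common_prefix_len(a, b):
    """Number of leading items that two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def group_by_shared_endings(lines):
    """Return (count, line) pairs, lines with the longest shared endings first."""
    # Steps 1 and 2: reverse the words (dropping the assumed leading
    # reference number) and sort on the reversed words.
    keyed = sorted((list(reversed(line.split()[1:])), line) for line in lines)

    counted = []
    for i, (key, line) in enumerate(keyed):
        # Step 3: count matching words against both sorted neighbours.
        prev = common_prefix_len(key, keyed[i - 1][0]) if i > 0 else 0
        nxt = common_prefix_len(key, keyed[i + 1][0]) if i + 1 < len(keyed) else 0
        counted.append((max(prev, nxt), key, line))

    # Step 4: sort numerically, highest count first, keeping ending order.
    counted.sort(key=lambda item: (-item[0], item[1]))

    # Step 5: undecorate by returning the original, untouched line.
    return [(count, line) for count, _key, line in counted]


sample = [
    "3.9.14.4 And both full liefe his boasting to abate",
    "3.8.51.6 Mote not mislike you also to abate",
    "3.8.28.7 Ne ought your burning fury mote abate",
    "2.1.26.7 When suddenly that warriour gan abace",
    "2.9.7.4 To serue that Queene with all my powre and might",
    "6.1.38.2 At once did heaue with all their powre and might",
]

for count, line in group_by_shared_endings(sample):
    print(count, line)
```

Matching here is case-sensitive and exact, so spelling variants such as "powre" and "power" are treated as different words.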
      >
      > You will have to handle the case where lines with many common words are
      > included in larger groups with only two or three common words, as follows:
      > the following are after step 2, and I'm adding the number of matching words
      > at the end to show your choices:
      >
      > The length of the border between Scotland and England |1
      > Cambridge, Massachusetts and Cambridge, England |1
      > London, the capital of England |2 |1
      > The House of Commons invited the Queen of England |3 |2 |1
      > The flower gardens of the former Queen of England |3 |2 |1
      > Once upon a time when all the ships of England |4 |2 |1
      > All the ships of Spain couldn't sink the ships of England |4 |2 |1
      >
      > If you remove the lines with 4 matching words and those with 3 matching
      > words, will you bring them back to create a group with 2 matching words? Or
      > will the line "London…" find itself in only a group with 1 matching word?
      > And the "England" group (1 matching word) which started out with 7 lines,
      > will it be "amputated" of better matching lines, and reduced to 2 lines
      > (without "London…") or 3 (with it)?
      >
      >
      > Best regards,
      > Tony.
      > --
      > The wind doth taste so bitter sweet,
      > Like Jaspar wine and sugar,
      > It must have blown through someone's feet,
      > Like those of Caspar Weinberger.
      > -- P. Opus

      Dear Tony:

      Thanks for coming back to the problem. Indeed I wanted the lines
      grouped by number of repetitions at line end as you suggest. Using
      programmes I know I managed that as follows.


      I used the Python script Tim sent (which is in this thread, but
      somehow not quoted in this exchange), which gave me six files: one of
      7 terminating repetitions:

      2.1.10.6 As on the earth great mother of vs all
      7.7.17.6 And first the Earth great mother of vs all
      4.3.36.2 As if but then the battell had begonne
      4.9.27.2 As if but then the battell had begonne

      one of six, one of five, and so on down to two (a big file). Each file
      also contained the lines of the previous file. Thus the 6 repetitions
      file also contained the 7 repetitions.
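
Tim's script is not quoted in this message, so the following is not his code. It is only a rough sketch of how nested files of this kind could be produced. It assumes that the first token of each line is a reference number, and that matching is case-insensitive (as the "earth"/"Earth" pair above suggests). The function name `lines_sharing_last_n` is hypothetical.

```python
from collections import defaultdict


def lines_sharing_last_n(lines, n):
    """Return the lines whose last n words also end at least one other line."""
    groups = defaultdict(list)
    for line in lines:
        # Drop the assumed leading reference number and ignore case.
        words = line.lower().split()[1:]
        if len(words) >= n:
            groups[tuple(words[-n:])].append(line)
    # Keep only endings shared by two or more lines.
    return [line for members in groups.values() if len(members) > 1
            for line in members]


sample = [
    "2.1.10.6 As on the earth great mother of vs all",
    "7.7.17.6 And first the Earth great mother of vs all",
    "4.3.36.2 As if but then the battell had begonne",
    "4.9.27.2 As if but then the battell had begonne",
    "1.7.35.1 No magicke arts hereof had any might",
]

seven = lines_sharing_last_n(sample, 7)
six = lines_sharing_last_n(sample, 6)
print(len(seven), len(six))
```

Any two lines that share their last seven words necessarily share their last six, which is why each file contains the lines of the one before it.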

      I then sorted the files in pairs (2 and 3, then 3 and 4 ...) and ran
      'uniq -u' on the results, which removed the shared lines and left me
      with only the lines with two repetitions, three, and so on. Thus:

      sort -i 2words.txt 3words.txt | uniq -u > 2-3words.txt

      I then sorted 2-3words.txt again (reversing the lines, sorting, and
      un-reversing them) to get the 2wordsONLY.txt file.
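
The 'sort | uniq -u' step amounts to a set difference: it drops every line that occurs more than once in the combined input. That equals "lines in the larger file that are not in the smaller one" only if every line of the smaller file is also in the larger one and neither file contains duplicate lines. A sketch of the same operation in Python, with a hypothetical function name and made-up stand-in data, could look like this. It keeps the original order, so no second sort is needed.

```python
def only_in_first(larger, smaller):
    """Lines of `larger` that do not occur in `smaller`, in original order."""
    exclude = set(smaller)
    return [line for line in larger if line not in exclude]


# Stand-ins for the contents of 2words.txt and 3words.txt (illustrative only).
two_words = [
    "3.8.51.6 Mote not mislike you also to abate",
    "2.9.7.4 To serue that Queene with all my powre and might",
    "3.9.14.4 And both full liefe his boasting to abate",
    "6.1.38.2 At once did heaue with all their powre and might",
]
three_words = [
    "2.9.7.4 To serue that Queene with all my powre and might",
    "6.1.38.2 At once did heaue with all their powre and might",
]

for line in only_in_first(two_words, three_words):
    print(line)
```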

      Cumbersome, as it happens, but at least I knew how to do it that way. I
      then ran 'cat' to combine the files with the two repetitions file at
      the top.

      I also made a copy of the 'cat' file and sorted again alphabetically
      on line termination; so now I have numbered repetitions separated in
      one file and combined in another. The combined file has no lines in it
      without repetition and taking those out in this manner has revealed a
      lot of patterns where the line-ending is substantively the same but
      has different pronouns. You noticed this earlier. For the computer of
      course 'his' is not related to 'her' for example as you spotted; but
      for a human being reading the line, both count as part of the same
      formula: '[he, she, it, they, etc.] returned back againe' are all the
      same formula. There are lots of cases of this sort of thing, synonyms
      of various sorts, prepositions and so on.
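
One way to let a program treat such pronoun variants as the same formula is to replace every pronoun with a placeholder before comparing line endings. The sketch below is only an illustration: the `PRONOUNS` set is a guess, not a list from this thread, and the function name and the leading reference number are assumptions as well.

```python
# Hypothetical pronoun list; a real one would need tuning to the text.
PRONOUNS = {
    "he", "she", "it", "they", "him", "them",
    "his", "her", "their", "my", "our", "your", "thy",
}


def normalized_ending(line, n):
    """Last n words of a line, lower-cased, with pronouns replaced by '<pron>'."""
    words = line.lower().split()[1:]   # drop the assumed reference number
    return tuple("<pron>" if w in PRONOUNS else w for w in words[-n:])


a = "2.9.7.4 To serue that Queene with all my powre and might"
b = "6.1.38.2 At once did heaue with all their powre and might"

# Both lines now end in the same six-term formula.
print(normalized_ending(a, 6) == normalized_ending(b, 6))
```

Spelling variants ("powre", "power") and true synonyms would need their own mapping; this covers only the pronoun case.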

      Thank you very much,

      Best wishes,

      Julian
