Search for the most frequently occurring patterns in text (to replace them with macros)

Are there any readily available tools for finding the most frequently occurring text patterns (word combinations, phrases) in the text you are working on, so as to suggest which ones could be replaced by short macros?
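
The idea would be to turn phrases that recur many times into short macros, roughly like this (a minimal sketch, with made-up phrases and macro names):

\documentclass{article}
% Hypothetical short macros for phrases assumed to recur in the text:
\newcommand{\wrt}{with respect to}
\newcommand{\ode}{ordinary differential equation}
\begin{document}
The \ode\ is then differentiated \wrt\ time.
\end{document}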










macros






edited Mar 8 '13 at 11:11 by Mico
asked Mar 8 '13 at 10:37 by Dennis Yurichev












  • Interesting question, though for "patterns" longer than, say, one word, things might get a bit difficult due to space/newline treatment, the various ways to get a space after a macro, etc. (see the sketch just after these comments)
    – mbork
    Mar 8 '13 at 10:42










  • Often used word combinations, phrases...
    – Dennis Yurichev
    Mar 8 '13 at 10:54










  • Welcome to TeX.SX.
    – Claudio Fiandrino
    Mar 8 '13 at 10:55










  • An interesting question, but sounds like a job for an external tool...
    – cmhughes
    Mar 8 '13 at 16:19






  • I wonder what your goal is with this. For writing a document, you're better off analyzing which combinations you find significant and worth automating. You seem to be asking for a sort of "literate LZW" algorithm.
    – Ryan Reich
    May 2 '13 at 21:35
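
A minimal illustration of the space issue raised in the first comment above, using the xspace package as one common workaround (the macro names are made up for the example):

\documentclass{article}
\usepackage{xspace}
\newcommand{\tdA}{two turtle doves}          % plain definition
\newcommand{\tdB}{two turtle doves\xspace}   % \xspace re-inserts a following space
\begin{document}
\tdA and more\par    % space after the macro is swallowed: "two turtle dovesand more"
\tdA\ and more\par   % explicit control space: "two turtle doves and more"
\tdB and more\par    % with xspace: "two turtle doves and more"
\end{document}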


















2 Answers

EDIT: I had originally implemented a version of the Lempel-Ziv-Welch algorithm, following a comment by Ryan Reich. I have since changed the code completely, and I don't know whether the new algorithm has a name.



I fed the new code the lyrics of "The Twelve Days of Christmas", as David Carlisle asked, and obtained the following result (for plain TeX):



longdefA#1#2{def#1{#2}A}AAG{sparACAAB}AH{AEWXVTU mVAD}AA
{CDEFG}AB{HIJAparbigskipB}AC{OPparMNKL nd}AD{RSparQ
A ring}AE{Y es dancV}AF{tZ a leapV}B{par On the }C{ day of C}D
{hristmas }E{my true l}F{ove gave }G{to mepar}H{a partrid}I{ge in a p}J
{ear tree.}K{two turtl}L{e dovespar a}M{three fre}N{nch henspar}O{four
call}P{ing birds}Q{five gold}R{six geese}S{ a laying}T{seven swa}U{ns a
swim}V{ingpar}W{eight mai}X{ds a milk}Y{nine ladi}Z{en lords } A{ }B
firstAAAB secondAAKL nd AB thirdAAMNKL nd AB fourthAAACAAB
fifthAAQA ringAG sixthAAADAG seventhAATU mVADAG eighthAAWXVT
U mVADAG ninthAAAHAG tenthAAAFAHAG eleventhAA eleven pipers pipV
AFAHAG twelfthAA twelve drummers drummV eleven pipers pipVAFAH spar
ACAHIJAbye


which is 878 bytes long. This should be compared to the 767 bytes of David Carlisle's xii.tex, which produces the same lyrics.
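
For reference, the document-level wrapper defined near the end of the listing below is called roughly like this (a sketch; the empty first argument is the letter prefix used to build the generated macro names, and the second argument is the text to compress, which is typeset as usual while the compressed form is written to the terminal):

% Sketch of the call; the complete, self-contained example is at the end of
% the listing below.
\FindMatches{}{%
  On the first day of Christmas my true love gave to me\par
  a partridge in a pear tree.
}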



RequirePackage{expl3}
ExplSyntaxOn
cs_generate_variant:Nn tl_replace_all:Nnn { Nxx, NVc, Nxn }
cs_generate_variant:Nn tl_replace_once:Nnn { Nxx }
cs_generate_variant:Nn tl_if_in:NnT { No }
cs_generate_variant:Nn int_min:nn { c, cc }
cs_generate_variant:Nn tl_if_eq:NNTF { cc }
cs_generate_variant:Nn tl_if_eq:nnTF { xx }
cs_generate_variant:Nn tl_if_eq:nnT { xx }
cs_generate_variant:Nn str_count:N { c }
cs_generate_variant:Nn regex_match:nnF { nV }
cs_generate_variant:Nn str_set:Nn { NV }
str_new:N l__find_prefix_str
int_new:N l__find_prefix_int
tl_new:N l__find_tl
tl_new:N l__find_chunks_tl
int_step_inline:nnnn { 1 } { 1 } { 9 }
{ seq_new:c { l__find_chunks_#1_seq } }
int_new:N l__find_common_int
int_new:N l__find_nesting_int
tl_new:N l__find_previous_tl
seq_new:N l__find_chunks_seq
int_new:N l__find_best_score_int
int_new:N l__find_macro_int
tl_new:N l__find_macros_tl
tl_new:N l__find_result_tl
int_new:N l__find_length_int
int_new:N l__find_previous_length_int
tl_new:N l__find_display_tl

cs_new_protected_nopar:Npn dot: { tex_message:D { . } }

cs_new_protected:Npn find_matches:nnN #1#2#3
{
% '#1' is the prefix, '#2' is the token list to study.
%
__find_set_prefix:n {#1}
tl_set:Nn l__find_tl { ~! #2 }
__find_escape_spaces:xN { l__find_prefix_str A } l__find_tl
int_set_eq:NN l__find_macro_int c_one
__find_get_length:V l__find_prefix_str
iow_term:x { int_use:N l__find_length_int }
int_set_eq:NN l__find_length_int c_max_int
__find_matches_aux:V l__find_prefix_str
tl_replace_once:Nnn l__find_tl { ! } { { ~ } }
tl_set:Nx #3
{
__find_preamble:
exp_not:V l__find_tl
}
}
cs_new_protected:Npn __find_escape_spaces:nN #1#2
{
regex_replace_all:nnN { cS } { c{#1} } #2
dot:
}
cs_generate_variant:Nn __find_escape_spaces:nN { x }
cs_new_nopar:Npn __find_preamble:
{
exp_not:n { long def } exp_not:c { l__find_prefix_str A } ####1####2
{
exp_not:N def ####1{####2}
exp_not:c { l__find_prefix_str A }
}
exp_not:c { l__find_prefix_str A }
}
cs_new_protected:Npn __find_matches_aux:n #1
{
int_set_eq:NN l__find_previous_length_int l__find_length_int
tl_set_eq:NN l__find_previous_tl l__find_tl
__find_escape_tl:nN {#1} l__find_tl
int_compare:nNnTF { tl_count:N l__find_tl } < c_nine
{ tl_set_eq:Nn l__find_tl l__find_previous_tl }
{
__find_set_chunks:
__find_sort_chunks:
__find_common:
__find_best_macros:
__find_undefine_tmp:
tl_set_eq:NN l__find_tl l__find_result_tl
__find_unescape_tl:nN {#1} l__find_tl
__find_get_length:n {#1}
iow_term:x { int_use:N l__find_length_int }
int_compare:nNnTF l__find_length_int < l__find_previous_length_int
{ __find_matches_aux:n {#1} }
{
iow_term:n { }
iow_term:x { l__find_display_tl }
iow_term:n { }
tl_set_eq:NN l__find_tl l__find_previous_tl
}
}
}
cs_generate_variant:Nn __find_matches_aux:n { V }

cs_new_protected:Npn __find_get_length:n #1
{
tl_set_eq:NN l__find_display_tl l__find_tl
tl_replace_once:Nxx l__find_display_tl
{ exp_not:c { #1 A } ! }
{ ~ exp_not:c { #1 A } {~} }
str_set:NV l__find_display_tl l__find_display_tl
tl_replace_all:Nxn l__find_display_tl
{ c_backslash_str #1 token_to_str:N A ~ } c_space_tl
tl_replace_all:Nnn l__find_display_tl
{ ~ c_space_tl } { ~ exp_not:c { #1 A } }
dot:
str_set:Nx l__find_display_tl { __find_preamble: l__find_display_tl }
tl_replace_all:Nxx l__find_display_tl
{ c_hash_str c_hash_str } { c_hash_str }
dot:
regex_replace_all:nnN
{ (\[A-Za-z]+) ([A-Za-z]) }
{ 1 2 }
l__find_display_tl
dot:
regex_replace_all:nnN
{ (\[A-Za-z]+) }
{ 1 c{c__iow_wrap_zwbs_marker_tl} }
l__find_display_tl
dot:
iow_wrap:nnnN { l__find_display_tl } { } { } __find_get_length_aux:n
}
cs_generate_variant:Nn __find_get_length:n { V }
cs_new_protected:Npn __find_get_length_aux:n #1
{
int_set:Nn l__find_length_int { str_count:n {#1} }
tl_set:Nn l__find_display_tl {#1}
% iow_term:n { }
% iow_term:n {#1}
% iow_term:n { }
}



cs_new_protected:Npn __find_set_prefix:n #1
{
% Check that the prefix |#1| is made only of alphabetic characters.
%
str_set:Nx l__find_prefix_str {#1}
int_set:Nn l__find_prefix_int { str_count:N l__find_prefix_str }
regex_match:nVF { Aw*Z } l__find_prefix_str
{
msg_error:nnx { find } { invalid-prefix }
{ l__find_prefix_str }
}
dot:
}

cs_new_protected:Npn __find_escape_tl:nN #1#2
{
% During the 'study' step, we manipulate the token list |#2|
% with all begin-group and end-group tokens replaced by a
% control sequence built from the prefix. We must change both
% begin-group and end-group tokens in one pass, to avoid getting an
% unbalanced result. Also replace macro parameters because they
% cannot be used as delimiters for macros. Spaces have been
% turned into a control sequence earlier. At this stage, every
% token in |#2| can be grabbed as an N-type argument.
%
regex_replace_all:nnN { cB. } { cB{ } #2
dot:
regex_replace_all:nnN { cE. } { cE} } #2
dot:
regex_replace_all:nnN { cP. } { c{ #1 # } } #2
dot:
regex_replace_all:nnN { c[BEP]. } { c{ #1 } } #2
dot:
}




cs_new_protected_nopar:Npn __find_set_chunks:
{
% Build a token list whose items are each nine consecutive tokens
% of the original token list, in a running window. So for instance
% |ABCDEFGHIJKL| would lead to the following (12) items:
% |ABCDEFGHI|, |BCDEFGHIJ|, |CDEFGHIJK|, |DEFGHIJKL|, |EFGHIJKL|,
% |FGHIJKL|, |GHIJKL|, |HIJKL|, |IJKL|, |JKL|, |KL|, |L|. The items
% of this token list are built in an |x|-expanded loop.
% A special case arises if the |find| token list is too short to
% safely perform the loop: then our whole algorithm is not going to
% do any good anyways, so we build an empty chunk list.
%
tl_set:Nx l__find_chunks_tl
{
exp_after:wN __find_set_chunks_loop:NNNNNNNNN l__find_tl
q_recursion_tail q_recursion_stop
}
}
cs_new:Npn __find_set_chunks_loop:NNNNNNNNN #1#2#3#4#5#6#7#8#9
{
% As usual in a TeX loop, first check for the end-loop marker (here,
% cs{q_recursion_tail}). If it is reached, we fill in the last few
% chunks (which become shorter and shorter as we go). Otherwise,
% add (to the token list we are building) an item containing (9)
% tokens, and loop, dropping the first of the items.
%
quark_if_recursion_tail_stop_do:Nn #9
{ __find_set_chunks_end:NNNNNNNN #1 #2 #3 #4 #5 #6 #7 #8 }
{ exp_not:n { #1 #2 #3 #4 #5 #6 #7 #8 #9 } }
__find_set_chunks_loop:NNNNNNNNN #2#3#4#5#6#7#8#9
}
cs_new:Npn __find_set_chunks_end:NNNNNNNN #1#2#3#4#5#6#7#8
{
exp_not:n
{
{ #1 #2 #3 #4 #5 #6 #7 #8 }
{ #2 #3 #4 #5 #6 #7 #8 }
{ #3 #4 #5 #6 #7 #8 }
{ #4 #5 #6 #7 #8 }
{ #5 #6 #7 #8 }
{ #6 #7 #8 }
{ #7 #8 }
{ #8 }
}
}
cs_new_protected:Npn __find_sort_chunks:
{
tl_sort:Nn l__find_chunks_tl
{
int_compare:nNnTF
{
pdftex_strcmp:D
{ exp_not:n {##1} }
{ exp_not:n {##2} }
}
> c_zero
{ sort_return_swapped: }
{ sort_return_same: }
}
}

cs_new_protected:Npn __find_common:
{
int_step_inline:nnnn { 1 } { 1 } { 9 }
{ seq_clear:c { l__find_chunks_##1_seq } }
exp_after:wN __find_common_loop:nn l__find_chunks_tl
q_recursion_tail q_recursion_tail q_recursion_stop
int_step_inline:nnnn { 4 } { 1 } { 9 }
{
seq_map_inline:cn { l__find_chunks_##1_seq }
{
tl_if_exist:cTF { l__find_chunk_ ' tl_to_str:n {####1} ' _tl }
{
tl_put_right:cn
{ l__find_chunk_ ' tl_to_str:n {####1} ' _tl } { i }
}
{
cs_set_eq:cN
{ l__find_chunk_ ' tl_to_str:n {####1} ' _tl } c_empty_tl
}
}
}
}

cs_new_protected:Npn __find_common_loop:nn #1#2
{
quark_if_recursion_tail_stop:n {#2}
int_zero:N l__find_common_int
__find_count_common_aux:nn {#1} {#2}
use:c { __find_common_ int_use:N l__find_common_int :w }
#1 X X X X X X X X X q_stop
__find_common_loop:nn {#2}
}
cs_new_protected:Npn __find_count_common_aux:nn #1#2
{
tl_if_empty:nF {#1}
{
tl_if_empty:nF {#2}
{
tl_if_eq:xxT { tl_head:n {#1} } { tl_head:n {#2} }
{
int_incr:N l__find_common_int
__find_count_common_aux:xx
{ tl_tail:n {#1} } { tl_tail:n {#2} }
}
}
}
}
cs_generate_variant:Nn __find_count_common_aux:nn { xx }

cs_new_eq:cN { __find_common_0:w } use_none_delimit_by_q_stop:w
cs_new_protected:cpn { __find_common_1:w } #1
{ __find_common_auxii:nnw { 1 } {#1} }
cs_new_protected:cpn { __find_common_2:w } #1#2
{ __find_common_auxii:nnw { 2 } { #1 #2 } }
cs_new_protected:cpn { __find_common_3:w } #1#2#3
{ __find_common_auxii:nnw { 3 } { #1 #2 #3 } }
cs_new_protected:cpn { __find_common_4:w } #1#2#3#4
{ __find_common_auxii:nnw { 4 } { #1 #2 #3 #4 } }
cs_new_protected:cpn { __find_common_5:w } #1#2#3#4#5
{
dot:
__find_common_auxii:nnw { 5 } { #1 #2 #3 #4 #5 }
}
cs_new_protected:cpn { __find_common_6:w } #1#2#3#4#5#6
{ __find_common_auxii:nnw { 6 } { #1 #2 #3 #4 #5 #6 } }
cs_new_protected:cpn { __find_common_7:w } #1#2#3#4#5#6#7
{ __find_common_auxii:nnw { 7 } { #1 #2 #3 #4 #5 #6 #7 } }
cs_new_protected:cpn { __find_common_8:w } #1#2#3#4#5#6#7#8
{ __find_common_auxii:nnw { 8 } { #1 #2 #3 #4 #5 #6 #7 #8 } }
cs_new_protected:cpn { __find_common_9:w } #1#2#3#4#5#6#7#8#9
{ __find_common_auxii:nnw { 9 } { #1 #2 #3 #4 #5 #6 #7 #8 #9 } }
cs_new_protected:Npn __find_common_auxii:nnw #1#2#3 q_stop
{
int_zero:N l__find_nesting_int
tl_map_inline:nn {#2}
{
str_case_x:nnn { exp_not:n {##1} }
{
{ exp_not:c { l__find_prefix_str c_lbrace_str } }
{ int_incr:N l__find_nesting_int }
{ exp_not:c { l__find_prefix_str c_rbrace_str } }
{
int_compare:nNnF l__find_nesting_int > c_zero
{ use_none_delimit_by_q_stop:w }
int_decr:N l__find_nesting_int
}
}
{ }
}
int_compare:nNnF l__find_nesting_int = c_zero
{ use_none_delimit_by_q_stop:w }
seq_put_right:cn { l__find_chunks_#1_seq } {#2}
use_none_delimit_by_q_stop:w
q_stop
}

cs_new_protected_nopar:Npn __find_best_macros:
{
tl_clear:N l__find_macros_tl
tl_clear:N l__find_result_tl
__find_best_macros_aux:
tl_put_left:NV l__find_result_tl l__find_macros_tl
}
cs_new_protected:Npn __find_best_macros_aux:
{
exp_after:wN __find_best_macros_auxii:NNNNNNNNN l__find_tl
q_nil q_nil q_nil q_nil q_nil q_nil q_nil q_nil q_nil q_stop
tl_if_empty:NF l__find_tl { __find_best_macros_aux: }
}
cs_new_protected:Npn __find_best_macros_auxii:NNNNNNNNN #1#2#3#4#5#6#7#8#9
{
int_zero:N l__find_best_score_int
tl_clear:N l__find_best_chunk_tl
tl_map_inline:nn
{
{#1} {#1#2} {#1#2#3} {#1#2#3#4} {#1#2#3#4#5} {#1#2#3#4#5#6}
{#1#2#3#4#5#6#7} {#1#2#3#4#5#6#7#8} {#1#2#3#4#5#6#7#8#9}
}
{
int_compare:nNnT
{ __find_score:n {##1} } > l__find_best_score_int
{
tl_set:Nn l__find_best_chunk_tl {##1}
int_set:Nn l__find_best_score_int
{ __find_score:n {##1} }
}
}
tl_if_empty:NF l__find_best_chunk_tl
{
int_incr:N l__find_macro_int
tl_put_right:Nx l__find_macros_tl
{
exp_not:c
{ l__find_prefix_str int_to_Alph:n { l__find_macro_int } }
{ exp_not:V l__find_best_chunk_tl }
}
tl_replace_all:NVc l__find_tl l__find_best_chunk_tl
{ l__find_prefix_str int_to_Alph:n { l__find_macro_int } }
dot:
}
tl_put_right:Nx l__find_result_tl { tl_head:N l__find_tl }
tl_set:Nx l__find_tl { tl_tail:N l__find_tl }
use_none_delimit_by_q_stop:w
}
cs_new:Npn __find_score:n #1
{
% Turning ####1 into a length (p+2) macro
% (e.g. |<prefix>A|) saves this number of chars.
% Good if non-negative.
cs_if_exist:cTF { l__find_chunk_ ' tl_to_str:n {#1} ' _tl }
{
int_eval:n
{
tl_count:c { l__find_chunk_ ' tl_to_str:n {#1} ' _tl }
* ( str_count:n {#1} - 3 - l__find_prefix_int )
- 2 * l__find_prefix_int - 9
}
}
{ -1 }
}

cs_new_protected:Npn __find_undefine_tmp:
{
int_step_inline:nnnn { 4 } { 1 } { 9 }
{
seq_map_inline:cn { l__find_chunks_##1_seq }
{ cs_undefine:c { l__find_chunk_ ' tl_to_str:n {####1} ' _tl } }
}
}

cs_new_protected:Npn __find_unescape_tl:nN #1#2
{
regex_replace_all:nnN { c{#1{} } { cB{ cE? } #2
dot:
regex_replace_all:nnN { c{#1}} } { cB? cE} } #2
dot:
regex_replace_all:nnN { c[BE]? } { } #2
dot:
regex_replace_all:nnN { c{#1#} } { cP# } #2
dot:
}

%%%% Hack LaTeX3's iow_wrap to allow zero-width breaking spaces
tl_const:Nx c__iow_wrap_zwbs_marker_tl
{
c_catcode_other_space_tl
c__iow_wrap_marker_tl
c_catcode_other_space_tl
zwbs
c_catcode_other_space_tl
}
cs_new_protected_nopar:Npn __iow_wrap_zwbs:
{
tl_put_right:Nn l__iow_current_line_tl
{ tex_romannumeral:D -`0 }
int_decr:N l__iow_current_line_int
}
%%% End of hack.

msg_new:nnn { find } { invalid-prefix }
{ The~prefix~used,~'#1',~must~be~made~of~letters. }

cs_set_protected:Npn FindMatches #1#2
{
find_matches:nnN {#1} {#2} l_tmpa_tl
tl_use:N l_tmpa_tl
}
ExplSyntaxOff

% Set up some basic things to behave a bit like plain TeX.
%
documentclass{article}
begin{document}
setlength{parindent}{0pt}
defbye{end{document}}
%
% End of setup.

FindMatches{}{% Let's go!

On the first day of Christmas my true love gave to mepar
a partridge in a pear tree.

bigskip

On the second day of Christmas my true love gave to mepar
two turtle dovespar
and a partridge in a pear tree.

bigskip

On the third day of Christmas my true love gave to mepar
three french henspar
two turtle dovespar
and a partridge in a pear tree.

bigskip

On the fourth day of Christmas my true love gave to mepar
four calling birdspar
three french henspar
two turtle dovespar
and a partridge in a pear tree.

bigskip

On the fifth day of Christmas my true love gave to mepar
five gold ringspar
four calling birdspar
three french henspar
two turtle dovespar
and a partridge in a pear tree.

bigskip

On the sixth day of Christmas my true love gave to mepar
six geese a layingpar
five gold ringspar
four calling birdspar
three french henspar
two turtle dovespar
and a partridge in a pear tree.

bigskip

On the seventh day of Christmas my true love gave to mepar
seven swans a swimmingpar
six geese a layingpar
five gold ringspar
four calling birdspar
three french henspar
two turtle dovespar
and a partridge in a pear tree.

bigskip

On the eighth day of Christmas my true love gave to mepar
eight maids a milkingpar
seven swans a swimmingpar
six geese a layingpar
five gold ringspar
four calling birdspar
three french henspar
two turtle dovespar
and a partridge in a pear tree.

bigskip

On the ninth day of Christmas my true love gave to mepar
nine ladies dancingpar
eight maids a milkingpar
seven swans a swimmingpar
six geese a layingpar
five gold ringspar
four calling birdspar
three french henspar
two turtle dovespar
and a partridge in a pear tree.

bigskip

On the tenth day of Christmas my true love gave to mepar
ten lords a leapingpar
nine ladies dancingpar
eight maids a milkingpar
seven swans a swimmingpar
six geese a layingpar
five gold ringspar
four calling birdspar
three french henspar
two turtle dovespar
and a partridge in a pear tree.

bigskip

On the eleventh day of Christmas my true love gave to mepar
eleven pipers pipingpar
ten lords a leapingpar
nine ladies dancingpar
eight maids a milkingpar
seven swans a swimmingpar
six geese a layingpar
five gold ringspar
four calling birdspar
three french henspar
two turtle dovespar
and a partridge in a pear tree.

bigskip

On the twelfth day of Christmas my true love gave to mepar
twelve drummers drummingpar
eleven pipers pipingpar
ten lords a leapingpar
nine ladies dancingpar
eight maids a milkingpar
seven swans a swimmingpar
six geese a layingpar
five gold ringspar
four calling birdspar
three french henspar
two turtle dovespar
and a partridge in a pear tree.
bye
}





  • Awesome as always, Bruno! :)
    – Paulo Cereda
    May 3 '13 at 0:54






  • This is one of the more outrageous things I've ever seen here. (It is very impressive!)
    – Ryan Reich
    May 3 '13 at 0:56










  • Hehe! Thanks @RyanReich. By the way, do you know of any algorithm in the vein of LZW that would be able to produce macros with varying arguments?
    – Bruno Le Floch
    May 3 '13 at 1:10












  • @Bruno All I know about LZW is the idea, unfortunately. You mean, you want the "dictionary" to be macros that take an argument, or that take a variable number of arguments? Isn't this the same as having an array and then having the macro argument index into it?
    – Ryan Reich
    May 3 '13 at 4:15






  • @BrunoLeFloch no wonder we get stalled on the middle layer of L3, but at least nobody can any longer deny that expl3 is up to arbitrary programming tasks :-)
    – Frank Mittelbach
    May 4 '13 at 20:16



















What I present is a total kludge, stealing things from my titlecaps package. So the format is not optimal, but it might give a starting point from which to proceed. A nice thing is that punctuation is screened out of the search in the target text.



When a search string and a target text (not exceeding one paragraph) are passed to seekphrase, it outputs where each word of the search phrase appears in the target text. Each hit is output as a pair "target-location:search-word-index". To make this easier to interpret, a [ is output before a match with search-word-index = 1 and a ] after a match with search-word-index = n. Thus, to qualify as a matching "phrase" in, for example, a three-word search, the output would have to contain an instance of something like [214:1 215:2 216:3]



Further refinement is clearly possible, but it would take more time than I have at the moment.



documentclass{article}
usepackage{titlecaps}
usepackage{lipsum}
makeatletter
renewcommandseek@lcwords{%
kill@punct%
setcounter{word@count}{0}%
whiledo{value{word@count} < narg}{%
addtocounter{word@count}{1}%
protected@edefcurrent@word{csname argroman{word@count}endcsname}%
deffound@word{F}%
setcounter{lcword@index}{0}%
expandafterdefcsname%
found@wordroman{word@count}endcsname{F}%
whiledo{value{lcword@index} < value{lc@words}}{%
addtocounter{lcword@index}{1}%
protected@edefcurrent@lcword{%
csname lcwordroman{lcword@index}endcsname}%
%% THE FOLLOWING THREE LINES ARE FROM DAVID CARLISLE
protected@edeftmp{noexpandscantokens{defnoexpandtmp%
{noexpandifthenelse{noexpandequal{current@word}{current@lcword}}}}}%
tmpifhmodeunskipfitmp
%%
{expandafterdefcsname%
found@wordroman{word@count}endcsname{T}%
ifthenelse{equal{value{lcword@index}}{1}}{[}{}%
arabic{word@count}:arabic{lcword@index}%
ifthenelse{equal{value{lcword@index}}{value{lc@words}}}{]}{ }%
setcounter{lcword@index}{value{lc@words}}}%
{}%
}%
}%
restore@punct%
}
letgetargsCget@argsC

newcommandseekphrase[2]{%
Seeking phrase ``#1'':\%
Addlcwords{#1}%
redefine@tertius%
getargsC{#2}%
seek@lcwords%
Resetlcwords%
par%
}
makeatother

begin{document}

lipsum[1]
lipsum[1]

defx{%
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Ut
purus elit, vestibulum ut, placerat ac, adipiscing vitae, felis.
Curabitur dictum gravida mauris. Nam arcu libero, nonummy eget,
consectetuer id, vulputate a, magna. Donec vehicula augue eu neque.
Pellentesque habitant morbi tristique senectus et netus et malesuada
fames ac turpis egestas. Mauris ut leo. Cras viverra metus rhoncus sem.
Nulla et lectus vestibulum urna fringilla ultrices. Phasellus eu tellus
sit amet tortor gravida placerat. Integer sapien est, iaculis in,
pretium quis, viverra ac, nunc. Praesent eget sem vel leo ultrices
bibendum. Aenean faucibus. Morbi dolor nulla, malesuada eu, pulvinar
at, mollis ac, nulla. Curabitur auctor semper nulla. Donec varius orci
eget risus. Duis nibh mi, congue eu, accumsan eleifend, sagittis quis,
diam. Duis eget orci sit amet orci dignissim rutrum.
%
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Ut
purus elit, vestibulum ut, placerat ac, adipiscing vitae, felis.
Curabitur dictum gravida mauris. Nam arcu libero, nonummy eget,
consectetuer id, vulputate a, magna. Donec vehicula augue eu neque.
Pellentesque habitant morbi tristique senectus et netus et malesuada
fames ac turpis egestas. Mauris ut leo. Cras viverra metus rhoncus sem.
Nulla et lectus vestibulum urna fringilla ultrices. Phasellus eu tellus
sit amet tortor gravida placerat. Integer sapien est, iaculis in,
pretium quis, viverra ac, nunc. Praesent eget sem vel leo ultrices
bibendum. Aenean faucibus. Morbi dolor nulla, malesuada eu, pulvinar
at, mollis ac, nulla. Curabitur auctor semper nulla. Donec varius orci
eget risus. Duis nibh mi, congue eu, accumsan eleifend, sagittis quis,
diam. Duis eget orci sit amet orci dignissim rutrum.
}

seekphrase{et}{x}

seekphrase{et netus}{x}

seekphrase{bibendum Aenean}{x}

seekphrase{eget sem vel}{x}

end{document}


(image: typeset output of the seekphrase examples above)






share|improve this answer























    Your Answer








    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "85"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: false,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2ftex.stackexchange.com%2fquestions%2f101434%2fsearch-for-most-frequently-occurring-patterns-in-text-to-replace-them-by-macros%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    2 Answers
    2






    active

    oldest

    votes








    2 Answers
    2






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    22














    EDIT: I had implemented some version of the Lempel-Ziv-Welch algorithm, following a comment of Ryan Reich. I've now changed the code completely and I don't know if the algorithm has a name.



    I fed the new code the 12 days of Christmas, as David Carlisle asked, and obtained the following result (for plain TeX)



    longdefA#1#2{def#1{#2}A}AAG{sparACAAB}AH{AEWXVTU mVAD}AA
    {CDEFG}AB{HIJAparbigskipB}AC{OPparMNKL nd}AD{RSparQ
    A ring}AE{Y es dancV}AF{tZ a leapV}B{par On the }C{ day of C}D
    {hristmas }E{my true l}F{ove gave }G{to mepar}H{a partrid}I{ge in a p}J
    {ear tree.}K{two turtl}L{e dovespar a}M{three fre}N{nch henspar}O{four
    call}P{ing birds}Q{five gold}R{six geese}S{ a laying}T{seven swa}U{ns a
    swim}V{ingpar}W{eight mai}X{ds a milk}Y{nine ladi}Z{en lords } A{ }B
    firstAAAB secondAAKL nd AB thirdAAMNKL nd AB fourthAAACAAB
    fifthAAQA ringAG sixthAAADAG seventhAATU mVADAG eighthAAWXVT
    U mVADAG ninthAAAHAG tenthAAAFAHAG eleventhAA eleven pipers pipV
    AFAHAG twelfthAA twelve drummers drummV eleven pipers pipVAFAH spar
    ACAHIJAbye


    which is 878 bytes long. This should be compared to the 767 bytes of xii.tex, which produces the same lyrics.



    RequirePackage{expl3}
    ExplSyntaxOn
    cs_generate_variant:Nn tl_replace_all:Nnn { Nxx, NVc, Nxn }
    cs_generate_variant:Nn tl_replace_once:Nnn { Nxx }
    cs_generate_variant:Nn tl_if_in:NnT { No }
    cs_generate_variant:Nn int_min:nn { c, cc }
    cs_generate_variant:Nn tl_if_eq:NNTF { cc }
    cs_generate_variant:Nn tl_if_eq:nnTF { xx }
    cs_generate_variant:Nn tl_if_eq:nnT { xx }
    cs_generate_variant:Nn str_count:N { c }
    cs_generate_variant:Nn regex_match:nnF { nV }
    cs_generate_variant:Nn str_set:Nn { NV }
    str_new:N l__find_prefix_str
    int_new:N l__find_prefix_int
    tl_new:N l__find_tl
    tl_new:N l__find_chunks_tl
    int_step_inline:nnnn { 1 } { 1 } { 9 }
    { seq_new:c { l__find_chunks_#1_seq } }
    int_new:N l__find_common_int
    int_new:N l__find_nesting_int
    tl_new:N l__find_previous_tl
    seq_new:N l__find_chunks_seq
    int_new:N l__find_best_score_int
    int_new:N l__find_macro_int
    tl_new:N l__find_macros_tl
    tl_new:N l__find_result_tl
    int_new:N l__find_length_int
    int_new:N l__find_previous_length_int
    tl_new:N l__find_display_tl

    cs_new_protected_nopar:Npn dot: { tex_message:D { . } }

    cs_new_protected:Npn find_matches:nnN #1#2#3
    {
    % '#1' is the prefix, '#2' is the token list to study.
    %
    __find_set_prefix:n {#1}
    tl_set:Nn l__find_tl { ~! #2 }
    __find_escape_spaces:xN { l__find_prefix_str A } l__find_tl
    int_set_eq:NN l__find_macro_int c_one
    __find_get_length:V l__find_prefix_str
    iow_term:x { int_use:N l__find_length_int }
    int_set_eq:NN l__find_length_int c_max_int
    __find_matches_aux:V l__find_prefix_str
    tl_replace_once:Nnn l__find_tl { ! } { { ~ } }
    tl_set:Nx #3
    {
    __find_preamble:
    exp_not:V l__find_tl
    }
    }
    cs_new_protected:Npn __find_escape_spaces:nN #1#2
    {
    regex_replace_all:nnN { cS } { c{#1} } #2
    dot:
    }
    cs_generate_variant:Nn __find_escape_spaces:nN { x }
    cs_new_nopar:Npn __find_preamble:
    {
    exp_not:n { long def } exp_not:c { l__find_prefix_str A } ####1####2
    {
    exp_not:N def ####1{####2}
    exp_not:c { l__find_prefix_str A }
    }
    exp_not:c { l__find_prefix_str A }
    }
    cs_new_protected:Npn __find_matches_aux:n #1
    {
    int_set_eq:NN l__find_previous_length_int l__find_length_int
    tl_set_eq:NN l__find_previous_tl l__find_tl
    __find_escape_tl:nN {#1} l__find_tl
    int_compare:nNnTF { tl_count:N l__find_tl } < c_nine
    { tl_set_eq:Nn l__find_tl l__find_previous_tl }
    {
    __find_set_chunks:
    __find_sort_chunks:
    __find_common:
    __find_best_macros:
    __find_undefine_tmp:
    tl_set_eq:NN l__find_tl l__find_result_tl
    __find_unescape_tl:nN {#1} l__find_tl
    __find_get_length:n {#1}
    iow_term:x { int_use:N l__find_length_int }
    int_compare:nNnTF l__find_length_int < l__find_previous_length_int
    { __find_matches_aux:n {#1} }
    {
    iow_term:n { }
    iow_term:x { l__find_display_tl }
    iow_term:n { }
    tl_set_eq:NN l__find_tl l__find_previous_tl
    }
    }
    }
    cs_generate_variant:Nn __find_matches_aux:n { V }

    cs_new_protected:Npn __find_get_length:n #1
    {
    tl_set_eq:NN l__find_display_tl l__find_tl
    tl_replace_once:Nxx l__find_display_tl
    { exp_not:c { #1 A } ! }
    { ~ exp_not:c { #1 A } {~} }
    str_set:NV l__find_display_tl l__find_display_tl
    tl_replace_all:Nxn l__find_display_tl
    { c_backslash_str #1 token_to_str:N A ~ } c_space_tl
    tl_replace_all:Nnn l__find_display_tl
    { ~ c_space_tl } { ~ exp_not:c { #1 A } }
    dot:
    str_set:Nx l__find_display_tl { __find_preamble: l__find_display_tl }
    tl_replace_all:Nxx l__find_display_tl
    { c_hash_str c_hash_str } { c_hash_str }
    dot:
    regex_replace_all:nnN
    { (\[A-Za-z]+) ([A-Za-z]) }
    { 1 2 }
    l__find_display_tl
    dot:
    regex_replace_all:nnN
    { (\[A-Za-z]+) }
    { 1 c{c__iow_wrap_zwbs_marker_tl} }
    l__find_display_tl
    dot:
    iow_wrap:nnnN { l__find_display_tl } { } { } __find_get_length_aux:n
    }
    cs_generate_variant:Nn __find_get_length:n { V }
    cs_new_protected:Npn __find_get_length_aux:n #1
    {
    int_set:Nn l__find_length_int { str_count:n {#1} }
    tl_set:Nn l__find_display_tl {#1}
    % iow_term:n { }
    % iow_term:n {#1}
    % iow_term:n { }
    }



    cs_new_protected:Npn __find_set_prefix:n #1
    {
    % Check that the prefix |#1| is made only of alphabetic characters.
    %
    str_set:Nx l__find_prefix_str {#1}
    int_set:Nn l__find_prefix_int { str_count:N l__find_prefix_str }
    regex_match:nVF { Aw*Z } l__find_prefix_str
    {
    msg_error:nnx { find } { invalid-prefix }
    { l__find_prefix_str }
    }
    dot:
    }

    cs_new_protected:Npn __find_escape_tl:nN #1#2
    {
    % During the 'study' step, we manipulate the token list |#2|
    % with all begin-group and end-group tokens replaced by a
    % control sequence built from the prefix. We must change both
    % begin-group and end-group tokens in one pass, to avoid getting an
    % unbalanced result. Also replace macro parameters because they
    % cannot be used as delimiters for macros. Spaces have been
    % turned into a control sequence earlier. At this stage, every
    % token in |#2| can be grabbed as an N-type argument.
    %
    regex_replace_all:nnN { cB. } { cB{ } #2
    dot:
    regex_replace_all:nnN { cE. } { cE} } #2
    dot:
    regex_replace_all:nnN { cP. } { c{ #1 # } } #2
    dot:
    regex_replace_all:nnN { c[BEP]. } { c{ #1 } } #2
    dot:
    }




    cs_new_protected_nopar:Npn __find_set_chunks:
    {
    % Build a token list whose items are each nine consecutive tokens
    % of the original token list, in a running window. So for instance
    % |ABCDEFGHIJKL| would lead to the following (12) items:
    % |ABCDEFGHI|, |BCDEFGHIJ|, |CDEFGHIJK|, |DEFGHIJKL|, |EFGHIJKL|,
    % |FGHIJKL|, |GHIJKL|, |HIJKL|, |IJKL|, |JKL|, |KL|, |L|. The items
    % of this token list are built in an |x|-expanded loop.
    % A special case arises if the |find| token list is too short to
    % safely perform the loop: then our whole algorithm is not going to
    % do any good anyways, so we build an empty chunk list.
    %
    tl_set:Nx l__find_chunks_tl
    {
    exp_after:wN __find_set_chunks_loop:NNNNNNNNN l__find_tl
    q_recursion_tail q_recursion_stop
    }
    }
    cs_new:Npn __find_set_chunks_loop:NNNNNNNNN #1#2#3#4#5#6#7#8#9
    {
    % As usual in a TeX loop, first check for the end-loop marker (here,
    % cs{q_recursion_tail}). If it is reached, we fill in the last few
    % chunks (which become shorter and shorter as we go). Otherwise,
    % add (to the token list we are building) an item containing (9)
    % tokens, and loop, dropping the first of the items.
    %
    quark_if_recursion_tail_stop_do:Nn #9
    { __find_set_chunks_end:NNNNNNNN #1 #2 #3 #4 #5 #6 #7 #8 }
    { exp_not:n { #1 #2 #3 #4 #5 #6 #7 #8 #9 } }
    __find_set_chunks_loop:NNNNNNNNN #2#3#4#5#6#7#8#9
    }
    cs_new:Npn __find_set_chunks_end:NNNNNNNN #1#2#3#4#5#6#7#8
    {
    exp_not:n
    {
    { #1 #2 #3 #4 #5 #6 #7 #8 }
    { #2 #3 #4 #5 #6 #7 #8 }
    { #3 #4 #5 #6 #7 #8 }
    { #4 #5 #6 #7 #8 }
    { #5 #6 #7 #8 }
    { #6 #7 #8 }
    { #7 #8 }
    { #8 }
    }
    }
    cs_new_protected:Npn __find_sort_chunks:
    {
    tl_sort:Nn l__find_chunks_tl
    {
    int_compare:nNnTF
    {
    pdftex_strcmp:D
    { exp_not:n {##1} }
    { exp_not:n {##2} }
    }
    > c_zero
    { sort_return_swapped: }
    { sort_return_same: }
    }
    }

    cs_new_protected:Npn __find_common:
    {
    int_step_inline:nnnn { 1 } { 1 } { 9 }
    { seq_clear:c { l__find_chunks_##1_seq } }
    exp_after:wN __find_common_loop:nn l__find_chunks_tl
    q_recursion_tail q_recursion_tail q_recursion_stop
    int_step_inline:nnnn { 4 } { 1 } { 9 }
    {
    seq_map_inline:cn { l__find_chunks_##1_seq }
    {
    tl_if_exist:cTF { l__find_chunk_ ' tl_to_str:n {####1} ' _tl }
    {
    tl_put_right:cn
    { l__find_chunk_ ' tl_to_str:n {####1} ' _tl } { i }
    }
    {
    cs_set_eq:cN
    { l__find_chunk_ ' tl_to_str:n {####1} ' _tl } c_empty_tl
    }
    }
    }
    }

    cs_new_protected:Npn __find_common_loop:nn #1#2
    {
    quark_if_recursion_tail_stop:n {#2}
    int_zero:N l__find_common_int
    __find_count_common_aux:nn {#1} {#2}
    use:c { __find_common_ int_use:N l__find_common_int :w }
    #1 X X X X X X X X X q_stop
    __find_common_loop:nn {#2}
    }
    cs_new_protected:Npn __find_count_common_aux:nn #1#2
    {
    tl_if_empty:nF {#1}
    {
    tl_if_empty:nF {#2}
    {
    tl_if_eq:xxT { tl_head:n {#1} } { tl_head:n {#2} }
    {
    int_incr:N l__find_common_int
    __find_count_common_aux:xx
    { tl_tail:n {#1} } { tl_tail:n {#2} }
    }
    }
    }
    }
    cs_generate_variant:Nn __find_count_common_aux:nn { xx }

    cs_new_eq:cN { __find_common_0:w } use_none_delimit_by_q_stop:w
    cs_new_protected:cpn { __find_common_1:w } #1
    { __find_common_auxii:nnw { 1 } {#1} }
    cs_new_protected:cpn { __find_common_2:w } #1#2
    { __find_common_auxii:nnw { 2 } { #1 #2 } }
    cs_new_protected:cpn { __find_common_3:w } #1#2#3
    { __find_common_auxii:nnw { 3 } { #1 #2 #3 } }
    cs_new_protected:cpn { __find_common_4:w } #1#2#3#4
    { __find_common_auxii:nnw { 4 } { #1 #2 #3 #4 } }
    cs_new_protected:cpn { __find_common_5:w } #1#2#3#4#5
    {
    dot:
    __find_common_auxii:nnw { 5 } { #1 #2 #3 #4 #5 }
    }
    cs_new_protected:cpn { __find_common_6:w } #1#2#3#4#5#6
    { __find_common_auxii:nnw { 6 } { #1 #2 #3 #4 #5 #6 } }
    cs_new_protected:cpn { __find_common_7:w } #1#2#3#4#5#6#7
    { __find_common_auxii:nnw { 7 } { #1 #2 #3 #4 #5 #6 #7 } }
    cs_new_protected:cpn { __find_common_8:w } #1#2#3#4#5#6#7#8
    { __find_common_auxii:nnw { 8 } { #1 #2 #3 #4 #5 #6 #7 #8 } }
    cs_new_protected:cpn { __find_common_9:w } #1#2#3#4#5#6#7#8#9
    { __find_common_auxii:nnw { 9 } { #1 #2 #3 #4 #5 #6 #7 #8 #9 } }
    cs_new_protected:Npn __find_common_auxii:nnw #1#2#3 q_stop
    {
    int_zero:N l__find_nesting_int
    tl_map_inline:nn {#2}
    {
    str_case_x:nnn { exp_not:n {##1} }
    {
    { exp_not:c { l__find_prefix_str c_lbrace_str } }
    { int_incr:N l__find_nesting_int }
    { exp_not:c { l__find_prefix_str c_rbrace_str } }
    {
    int_compare:nNnF l__find_nesting_int > c_zero
    { use_none_delimit_by_q_stop:w }
    int_decr:N l__find_nesting_int
    }
    }
    { }
    }
    int_compare:nNnF l__find_nesting_int = c_zero
    { use_none_delimit_by_q_stop:w }
    seq_put_right:cn { l__find_chunks_#1_seq } {#2}
    use_none_delimit_by_q_stop:w
    q_stop
    }

    cs_new_protected_nopar:Npn __find_best_macros:
    {
    tl_clear:N l__find_macros_tl
    tl_clear:N l__find_result_tl
    __find_best_macros_aux:
    tl_put_left:NV l__find_result_tl l__find_macros_tl
    }
    cs_new_protected:Npn __find_best_macros_aux:
    {
    exp_after:wN __find_best_macros_auxii:NNNNNNNNN l__find_tl
    q_nil q_nil q_nil q_nil q_nil q_nil q_nil q_nil q_nil q_stop
    tl_if_empty:NF l__find_tl { __find_best_macros_aux: }
    }
    cs_new_protected:Npn __find_best_macros_auxii:NNNNNNNNN #1#2#3#4#5#6#7#8#9
    {
    int_zero:N l__find_best_score_int
    tl_clear:N l__find_best_chunk_tl
    tl_map_inline:nn
    {
    {#1} {#1#2} {#1#2#3} {#1#2#3#4} {#1#2#3#4#5} {#1#2#3#4#5#6}
    {#1#2#3#4#5#6#7} {#1#2#3#4#5#6#7#8} {#1#2#3#4#5#6#7#8#9}
    }
    {
    int_compare:nNnT
    { __find_score:n {##1} } > l__find_best_score_int
    {
    tl_set:Nn l__find_best_chunk_tl {##1}
    int_set:Nn l__find_best_score_int
    { __find_score:n {##1} }
    }
    }
    tl_if_empty:NF l__find_best_chunk_tl
    {
    int_incr:N l__find_macro_int
    tl_put_right:Nx l__find_macros_tl
    {
    exp_not:c
    { l__find_prefix_str int_to_Alph:n { l__find_macro_int } }
    { exp_not:V l__find_best_chunk_tl }
    }
    tl_replace_all:NVc l__find_tl l__find_best_chunk_tl
    { l__find_prefix_str int_to_Alph:n { l__find_macro_int } }
    dot:
    }
    tl_put_right:Nx l__find_result_tl { tl_head:N l__find_tl }
    tl_set:Nx l__find_tl { tl_tail:N l__find_tl }
    use_none_delimit_by_q_stop:w
    }
    cs_new:Npn __find_score:n #1
    {
    % Turning ####1 into a length (p+2) macro
    % (e.g. |<prefix>A|) saves this number of chars.
    % Good if non-negative.
    cs_if_exist:cTF { l__find_chunk_ ' tl_to_str:n {#1} ' _tl }
    {
    int_eval:n
    {
    tl_count:c { l__find_chunk_ ' tl_to_str:n {#1} ' _tl }
    * ( str_count:n {#1} - 3 - l__find_prefix_int )
    - 2 * l__find_prefix_int - 9
    }
    }
    { -1 }
    }

    cs_new_protected:Npn __find_undefine_tmp:
    {
    int_step_inline:nnnn { 4 } { 1 } { 9 }
    {
    seq_map_inline:cn { l__find_chunks_##1_seq }
    { cs_undefine:c { l__find_chunk_ ' tl_to_str:n {####1} ' _tl } }
    }
    }

    cs_new_protected:Npn __find_unescape_tl:nN #1#2
    {
    regex_replace_all:nnN { c{#1{} } { cB{ cE? } #2
    dot:
    regex_replace_all:nnN { c{#1}} } { cB? cE} } #2
    dot:
    regex_replace_all:nnN { c[BE]? } { } #2
    dot:
    regex_replace_all:nnN { c{#1#} } { cP# } #2
    dot:
    }

    %%%% Hack LaTeX3's iow_wrap to allow zero-width breaking spaces
    tl_const:Nx c__iow_wrap_zwbs_marker_tl
    {
    c_catcode_other_space_tl
    c__iow_wrap_marker_tl
    c_catcode_other_space_tl
    zwbs
    c_catcode_other_space_tl
    }
    cs_new_protected_nopar:Npn __iow_wrap_zwbs:
    {
    tl_put_right:Nn l__iow_current_line_tl
    { tex_romannumeral:D -`0 }
    int_decr:N l__iow_current_line_int
    }
    %%% End of hack.

    msg_new:nnn { find } { invalid-prefix }
    { The~prefix~used,~'#1',~must~be~made~of~letters. }

    cs_set_protected:Npn FindMatches #1#2
    {
    find_matches:nnN {#1} {#2} l_tmpa_tl
    tl_use:N l_tmpa_tl
    }
    ExplSyntaxOff

    % Set up some basic things to behave a bit like plain TeX.
    %
    documentclass{article}
    begin{document}
    setlength{parindent}{0pt}
    defbye{end{document}}
    %
    % End of setup.

    FindMatches{}{% Let's go!

    On the first day of Christmas my true love gave to mepar
    a partridge in a pear tree.

    bigskip

    On the second day of Christmas my true love gave to mepar
    two turtle dovespar
    and a partridge in a pear tree.

    bigskip

    On the third day of Christmas my true love gave to mepar
    three french henspar
    two turtle dovespar
    and a partridge in a pear tree.

    bigskip

    On the fourth day of Christmas my true love gave to mepar
    four calling birdspar
    three french henspar
    two turtle dovespar
    and a partridge in a pear tree.

    bigskip

    On the fifth day of Christmas my true love gave to mepar
    five gold ringspar
    four calling birdspar
    three french henspar
    two turtle dovespar
    and a partridge in a pear tree.

    bigskip

    On the sixth day of Christmas my true love gave to mepar
    six geese a layingpar
    five gold ringspar
    four calling birdspar
    three french henspar
    two turtle dovespar
    and a partridge in a pear tree.

    bigskip

    On the seventh day of Christmas my true love gave to mepar
    seven swans a swimmingpar
    six geese a layingpar
    five gold ringspar
    four calling birdspar
    three french henspar
    two turtle dovespar
    and a partridge in a pear tree.

    bigskip

    On the eighth day of Christmas my true love gave to mepar
    eight maids a milkingpar
    seven swans a swimmingpar
    six geese a layingpar
    five gold ringspar
    four calling birdspar
    three french henspar
    two turtle dovespar
    and a partridge in a pear tree.

    bigskip

    On the ninth day of Christmas my true love gave to mepar
    nine ladies dancingpar
    eight maids a milkingpar
    seven swans a swimmingpar
    six geese a layingpar
    five gold ringspar
    four calling birdspar
    three french henspar
    two turtle dovespar
    and a partridge in a pear tree.

    bigskip

    On the tenth day of Christmas my true love gave to mepar
    ten lords a leapingpar
    nine ladies dancingpar
    eight maids a milkingpar
    seven swans a swimmingpar
    six geese a layingpar
    five gold ringspar
    four calling birdspar
    three french henspar
    two turtle dovespar
    and a partridge in a pear tree.

    bigskip

    On the eleventh day of Christmas my true love gave to mepar
    eleven pipers pipingpar
    ten lords a leapingpar
    nine ladies dancingpar
    eight maids a milkingpar
    seven swans a swimmingpar
    six geese a layingpar
    five gold ringspar
    four calling birdspar
    three french henspar
    two turtle dovespar
    and a partridge in a pear tree.

    bigskip

    On the twelfth day of Christmas my true love gave to mepar
    twelve drummers drummingpar
    eleven pipers pipingpar
    ten lords a leapingpar
    nine ladies dancingpar
    eight maids a milkingpar
    seven swans a swimmingpar
    six geese a layingpar
    five gold ringspar
    four calling birdspar
    three french henspar
    two turtle dovespar
    and a partridge in a pear tree.
    bye
    }





    share|improve this answer



















    • 5




      Awesome as always, Bruno! :)
      – Paulo Cereda
      May 3 '13 at 0:54






    • 5




      This is one of the more outrageous things I've ever seen here. (It is very impressive!)
      – Ryan Reich
      May 3 '13 at 0:56










    • Hehe! Thanks @RyanReich. By the way, do you know of any algorithm in the vein of LZW that would be able to produce macros with varying arguments?
      – Bruno Le Floch
      May 3 '13 at 1:10












    • @Bruno All I know about LZW is the idea, unfortunately. You mean, you want the "dictionary" to be macros that take an argument, or that take a variable number of arguments? Isn't this the same as having an array and then having the macro argument index into it?
      – Ryan Reich
      May 3 '13 at 4:15






    • 2




      @BrunoLeFloch no wonder we get stalled on the middle layer of L3, but at least nobody can any longer deny that expl3 is up to arbitrary programming tasks :-)
      – Frank Mittelbach
      May 4 '13 at 20:16
















    22














    EDIT: I had implemented some version of the Lempel-Ziv-Welch algorithm, following a comment of Ryan Reich. I've now changed the code completely and I don't know if the algorithm has a name.



    I fed the new code the 12 days of Christmas, as David Carlisle asked, and obtained the following result (for plain TeX)



    longdefA#1#2{def#1{#2}A}AAG{sparACAAB}AH{AEWXVTU mVAD}AA
    {CDEFG}AB{HIJAparbigskipB}AC{OPparMNKL nd}AD{RSparQ
    A ring}AE{Y es dancV}AF{tZ a leapV}B{par On the }C{ day of C}D
    {hristmas }E{my true l}F{ove gave }G{to mepar}H{a partrid}I{ge in a p}J
    {ear tree.}K{two turtl}L{e dovespar a}M{three fre}N{nch henspar}O{four
    call}P{ing birds}Q{five gold}R{six geese}S{ a laying}T{seven swa}U{ns a
    swim}V{ingpar}W{eight mai}X{ds a milk}Y{nine ladi}Z{en lords } A{ }B
    firstAAAB secondAAKL nd AB thirdAAMNKL nd AB fourthAAACAAB
    fifthAAQA ringAG sixthAAADAG seventhAATU mVADAG eighthAAWXVT
    U mVADAG ninthAAAHAG tenthAAAFAHAG eleventhAA eleven pipers pipV
    AFAHAG twelfthAA twelve drummers drummV eleven pipers pipVAFAH spar
    ACAHIJAbye


    which is 878 bytes long. This should be compared to the 767 bytes of xii.tex, which produces the same lyrics.



    RequirePackage{expl3}
    ExplSyntaxOn
    cs_generate_variant:Nn tl_replace_all:Nnn { Nxx, NVc, Nxn }
    cs_generate_variant:Nn tl_replace_once:Nnn { Nxx }
    cs_generate_variant:Nn tl_if_in:NnT { No }
    cs_generate_variant:Nn int_min:nn { c, cc }
    cs_generate_variant:Nn tl_if_eq:NNTF { cc }
    cs_generate_variant:Nn tl_if_eq:nnTF { xx }
    cs_generate_variant:Nn tl_if_eq:nnT { xx }
    cs_generate_variant:Nn str_count:N { c }
    cs_generate_variant:Nn regex_match:nnF { nV }
    cs_generate_variant:Nn str_set:Nn { NV }
    str_new:N l__find_prefix_str
    int_new:N l__find_prefix_int
    tl_new:N l__find_tl
    tl_new:N l__find_chunks_tl
    int_step_inline:nnnn { 1 } { 1 } { 9 }
    { seq_new:c { l__find_chunks_#1_seq } }
    int_new:N l__find_common_int
    int_new:N l__find_nesting_int
    tl_new:N l__find_previous_tl
    seq_new:N l__find_chunks_seq
    int_new:N l__find_best_score_int
    int_new:N l__find_macro_int
    tl_new:N l__find_macros_tl
    tl_new:N l__find_result_tl
    int_new:N l__find_length_int
    int_new:N l__find_previous_length_int
    tl_new:N l__find_display_tl

    cs_new_protected_nopar:Npn dot: { tex_message:D { . } }

    cs_new_protected:Npn find_matches:nnN #1#2#3
    {
    % '#1' is the prefix, '#2' is the token list to study.
    %
    __find_set_prefix:n {#1}
    tl_set:Nn l__find_tl { ~! #2 }
    __find_escape_spaces:xN { l__find_prefix_str A } l__find_tl
    int_set_eq:NN l__find_macro_int c_one
    __find_get_length:V l__find_prefix_str
    iow_term:x { int_use:N l__find_length_int }
    int_set_eq:NN l__find_length_int c_max_int
    __find_matches_aux:V l__find_prefix_str
    tl_replace_once:Nnn l__find_tl { ! } { { ~ } }
    tl_set:Nx #3
    {
    __find_preamble:
    exp_not:V l__find_tl
    }
    }
    cs_new_protected:Npn __find_escape_spaces:nN #1#2
    {
    regex_replace_all:nnN { cS } { c{#1} } #2
    dot:
    }
    cs_generate_variant:Nn __find_escape_spaces:nN { x }
    cs_new_nopar:Npn __find_preamble:
    {
    exp_not:n { long def } exp_not:c { l__find_prefix_str A } ####1####2
    {
    exp_not:N def ####1{####2}
    exp_not:c { l__find_prefix_str A }
    }
    exp_not:c { l__find_prefix_str A }
    }
    cs_new_protected:Npn __find_matches_aux:n #1
    {
    int_set_eq:NN l__find_previous_length_int l__find_length_int
    tl_set_eq:NN l__find_previous_tl l__find_tl
    __find_escape_tl:nN {#1} l__find_tl
    int_compare:nNnTF { tl_count:N l__find_tl } < c_nine
    { tl_set_eq:Nn l__find_tl l__find_previous_tl }
    {
    __find_set_chunks:
    __find_sort_chunks:
    __find_common:
    __find_best_macros:
    __find_undefine_tmp:
    tl_set_eq:NN l__find_tl l__find_result_tl
    __find_unescape_tl:nN {#1} l__find_tl
    __find_get_length:n {#1}
    iow_term:x { int_use:N l__find_length_int }
    int_compare:nNnTF l__find_length_int < l__find_previous_length_int
    { __find_matches_aux:n {#1} }
    {
    iow_term:n { }
    iow_term:x { l__find_display_tl }
    iow_term:n { }
    tl_set_eq:NN l__find_tl l__find_previous_tl
    }
    }
    }
    cs_generate_variant:Nn __find_matches_aux:n { V }

    cs_new_protected:Npn __find_get_length:n #1
    {
    tl_set_eq:NN l__find_display_tl l__find_tl
    tl_replace_once:Nxx l__find_display_tl
    { exp_not:c { #1 A } ! }
    { ~ exp_not:c { #1 A } {~} }
    str_set:NV l__find_display_tl l__find_display_tl
    tl_replace_all:Nxn l__find_display_tl
    { c_backslash_str #1 token_to_str:N A ~ } c_space_tl
    tl_replace_all:Nnn l__find_display_tl
    { ~ c_space_tl } { ~ exp_not:c { #1 A } }
    dot:
    str_set:Nx l__find_display_tl { __find_preamble: l__find_display_tl }
    tl_replace_all:Nxx l__find_display_tl
    { c_hash_str c_hash_str } { c_hash_str }
    dot:
    regex_replace_all:nnN
    { (\[A-Za-z]+) ([A-Za-z]) }
    { 1 2 }
    l__find_display_tl
    dot:
    regex_replace_all:nnN
    { (\[A-Za-z]+) }
    { 1 c{c__iow_wrap_zwbs_marker_tl} }
    l__find_display_tl
    dot:
    iow_wrap:nnnN { l__find_display_tl } { } { } __find_get_length_aux:n
    }
    cs_generate_variant:Nn __find_get_length:n { V }
    cs_new_protected:Npn __find_get_length_aux:n #1
    {
    int_set:Nn l__find_length_int { str_count:n {#1} }
    tl_set:Nn l__find_display_tl {#1}
    % iow_term:n { }
    % iow_term:n {#1}
    % iow_term:n { }
    }



    cs_new_protected:Npn __find_set_prefix:n #1
    {
    % Check that the prefix |#1| is made only of alphabetic characters.
    %
    str_set:Nx l__find_prefix_str {#1}
    int_set:Nn l__find_prefix_int { str_count:N l__find_prefix_str }
    regex_match:nVF { Aw*Z } l__find_prefix_str
    {
    msg_error:nnx { find } { invalid-prefix }
    { l__find_prefix_str }
    }
    dot:
    }

    cs_new_protected:Npn __find_escape_tl:nN #1#2
    {
    % During the 'study' step, we manipulate the token list |#2|
    % with all begin-group and end-group tokens replaced by a
    % control sequence built from the prefix. We must change both
    % begin-group and end-group tokens in one pass, to avoid getting an
    % unbalanced result. Also replace macro parameters because they
    % cannot be used as delimiters for macros. Spaces have been
    % turned into a control sequence earlier. At this stage, every
    % token in |#2| can be grabbed as an N-type argument.
    %
    regex_replace_all:nnN { cB. } { cB{ } #2
    dot:
    regex_replace_all:nnN { cE. } { cE} } #2
    dot:
    regex_replace_all:nnN { cP. } { c{ #1 # } } #2
    dot:
    regex_replace_all:nnN { c[BEP]. } { c{ #1 } } #2
    dot:
    }




    cs_new_protected_nopar:Npn __find_set_chunks:
    {
    % Build a token list whose items are each nine consecutive tokens
    % of the original token list, in a running window. So for instance
    % |ABCDEFGHIJKL| would lead to the following (12) items:
    % |ABCDEFGHI|, |BCDEFGHIJ|, |CDEFGHIJK|, |DEFGHIJKL|, |EFGHIJKL|,
    % |FGHIJKL|, |GHIJKL|, |HIJKL|, |IJKL|, |JKL|, |KL|, |L|. The items
    % of this token list are built in an |x|-expanded loop.
    % A special case arises if the |find| token list is too short to
    % safely perform the loop: then our whole algorithm is not going to
    % do any good anyways, so we build an empty chunk list.
    %
    tl_set:Nx l__find_chunks_tl
    {
    exp_after:wN __find_set_chunks_loop:NNNNNNNNN l__find_tl
    q_recursion_tail q_recursion_stop
    }
    }
    cs_new:Npn __find_set_chunks_loop:NNNNNNNNN #1#2#3#4#5#6#7#8#9
    {
    % As usual in a TeX loop, first check for the end-loop marker (here,
    % cs{q_recursion_tail}). If it is reached, we fill in the last few
    % chunks (which become shorter and shorter as we go). Otherwise,
    % add (to the token list we are building) an item containing (9)
    % tokens, and loop, dropping the first of the items.
    %
    quark_if_recursion_tail_stop_do:Nn #9
    { __find_set_chunks_end:NNNNNNNN #1 #2 #3 #4 #5 #6 #7 #8 }
    { exp_not:n { #1 #2 #3 #4 #5 #6 #7 #8 #9 } }
    __find_set_chunks_loop:NNNNNNNNN #2#3#4#5#6#7#8#9
    }
    cs_new:Npn __find_set_chunks_end:NNNNNNNN #1#2#3#4#5#6#7#8
    {
    exp_not:n
    {
    { #1 #2 #3 #4 #5 #6 #7 #8 }
    { #2 #3 #4 #5 #6 #7 #8 }
    { #3 #4 #5 #6 #7 #8 }
    { #4 #5 #6 #7 #8 }
    { #5 #6 #7 #8 }
    { #6 #7 #8 }
    { #7 #8 }
    { #8 }
    }
    }
    cs_new_protected:Npn __find_sort_chunks:
    {
    tl_sort:Nn l__find_chunks_tl
    {
    int_compare:nNnTF
    {
    pdftex_strcmp:D
    { exp_not:n {##1} }
    { exp_not:n {##2} }
    }
    > c_zero
    { sort_return_swapped: }
    { sort_return_same: }
    }
    }
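    % Sorting the chunks lexicographically (via |pdftex_strcmp:D|) is what
    % makes the counting step below cheap: identical leading substrings end
    % up next to each other, so |__find_common_loop:nn| only has to compare
    % each chunk with its immediate successor.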

    cs_new_protected:Npn __find_common:
    {
    int_step_inline:nnnn { 1 } { 1 } { 9 }
    { seq_clear:c { l__find_chunks_##1_seq } }
    exp_after:wN __find_common_loop:nn l__find_chunks_tl
    q_recursion_tail q_recursion_tail q_recursion_stop
    int_step_inline:nnnn { 4 } { 1 } { 9 }
    {
    seq_map_inline:cn { l__find_chunks_##1_seq }
    {
    tl_if_exist:cTF { l__find_chunk_ ' tl_to_str:n {####1} ' _tl }
    {
    tl_put_right:cn
    { l__find_chunk_ ' tl_to_str:n {####1} ' _tl } { i }
    }
    {
    cs_set_eq:cN
    { l__find_chunk_ ' tl_to_str:n {####1} ' _tl } c_empty_tl
    }
    }
    }
    }

    cs_new_protected:Npn __find_common_loop:nn #1#2
    {
    quark_if_recursion_tail_stop:n {#2}
    int_zero:N l__find_common_int
    __find_count_common_aux:nn {#1} {#2}
    use:c { __find_common_ int_use:N l__find_common_int :w }
    #1 X X X X X X X X X q_stop
    __find_common_loop:nn {#2}
    }
    cs_new_protected:Npn __find_count_common_aux:nn #1#2
    {
    tl_if_empty:nF {#1}
    {
    tl_if_empty:nF {#2}
    {
    tl_if_eq:xxT { tl_head:n {#1} } { tl_head:n {#2} }
    {
    int_incr:N l__find_common_int
    __find_count_common_aux:xx
    { tl_tail:n {#1} } { tl_tail:n {#2} }
    }
    }
    }
    }
    cs_generate_variant:Nn __find_count_common_aux:nn { xx }

    cs_new_eq:cN { __find_common_0:w } use_none_delimit_by_q_stop:w
    cs_new_protected:cpn { __find_common_1:w } #1
    { __find_common_auxii:nnw { 1 } {#1} }
    cs_new_protected:cpn { __find_common_2:w } #1#2
    { __find_common_auxii:nnw { 2 } { #1 #2 } }
    cs_new_protected:cpn { __find_common_3:w } #1#2#3
    { __find_common_auxii:nnw { 3 } { #1 #2 #3 } }
    cs_new_protected:cpn { __find_common_4:w } #1#2#3#4
    { __find_common_auxii:nnw { 4 } { #1 #2 #3 #4 } }
    cs_new_protected:cpn { __find_common_5:w } #1#2#3#4#5
    {
    dot:
    __find_common_auxii:nnw { 5 } { #1 #2 #3 #4 #5 }
    }
    cs_new_protected:cpn { __find_common_6:w } #1#2#3#4#5#6
    { __find_common_auxii:nnw { 6 } { #1 #2 #3 #4 #5 #6 } }
    cs_new_protected:cpn { __find_common_7:w } #1#2#3#4#5#6#7
    { __find_common_auxii:nnw { 7 } { #1 #2 #3 #4 #5 #6 #7 } }
    cs_new_protected:cpn { __find_common_8:w } #1#2#3#4#5#6#7#8
    { __find_common_auxii:nnw { 8 } { #1 #2 #3 #4 #5 #6 #7 #8 } }
    cs_new_protected:cpn { __find_common_9:w } #1#2#3#4#5#6#7#8#9
    { __find_common_auxii:nnw { 9 } { #1 #2 #3 #4 #5 #6 #7 #8 #9 } }
    cs_new_protected:Npn __find_common_auxii:nnw #1#2#3 q_stop
    {
    int_zero:N l__find_nesting_int
    tl_map_inline:nn {#2}
    {
    str_case_x:nnn { exp_not:n {##1} }
    {
    { exp_not:c { l__find_prefix_str c_lbrace_str } }
    { int_incr:N l__find_nesting_int }
    { exp_not:c { l__find_prefix_str c_rbrace_str } }
    {
    int_compare:nNnF l__find_nesting_int > c_zero
    { use_none_delimit_by_q_stop:w }
    int_decr:N l__find_nesting_int
    }
    }
    { }
    }
    int_compare:nNnF l__find_nesting_int = c_zero
    { use_none_delimit_by_q_stop:w }
    seq_put_right:cn { l__find_chunks_#1_seq } {#2}
    use_none_delimit_by_q_stop:w
    q_stop
    }
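    % A candidate substring is recorded only if it is brace-balanced: every
    % escaped end-group token must be matched by an earlier escaped
    % begin-group token and vice versa. An unbalanced chunk could not later
    % serve as the replacement text of a macro, so it is silently dropped.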

    cs_new_protected_nopar:Npn __find_best_macros:
    {
    tl_clear:N l__find_macros_tl
    tl_clear:N l__find_result_tl
    __find_best_macros_aux:
    tl_put_left:NV l__find_result_tl l__find_macros_tl
    }
    cs_new_protected:Npn __find_best_macros_aux:
    {
    exp_after:wN __find_best_macros_auxii:NNNNNNNNN l__find_tl
    q_nil q_nil q_nil q_nil q_nil q_nil q_nil q_nil q_nil q_stop
    tl_if_empty:NF l__find_tl { __find_best_macros_aux: }
    }
    cs_new_protected:Npn __find_best_macros_auxii:NNNNNNNNN #1#2#3#4#5#6#7#8#9
    {
    int_zero:N l__find_best_score_int
    tl_clear:N l__find_best_chunk_tl
    tl_map_inline:nn
    {
    {#1} {#1#2} {#1#2#3} {#1#2#3#4} {#1#2#3#4#5} {#1#2#3#4#5#6}
    {#1#2#3#4#5#6#7} {#1#2#3#4#5#6#7#8} {#1#2#3#4#5#6#7#8#9}
    }
    {
    int_compare:nNnT
    { __find_score:n {##1} } > l__find_best_score_int
    {
    tl_set:Nn l__find_best_chunk_tl {##1}
    int_set:Nn l__find_best_score_int
    { __find_score:n {##1} }
    }
    }
    tl_if_empty:NF l__find_best_chunk_tl
    {
    int_incr:N l__find_macro_int
    tl_put_right:Nx l__find_macros_tl
    {
    exp_not:c
    { l__find_prefix_str int_to_Alph:n { l__find_macro_int } }
    { exp_not:V l__find_best_chunk_tl }
    }
    tl_replace_all:NVc l__find_tl l__find_best_chunk_tl
    { l__find_prefix_str int_to_Alph:n { l__find_macro_int } }
    dot:
    }
    tl_put_right:Nx l__find_result_tl { tl_head:N l__find_tl }
    tl_set:Nx l__find_tl { tl_tail:N l__find_tl }
    use_none_delimit_by_q_stop:w
    }
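    % Summary of the step above: at the current position the nine candidate
    % substrings |#1|, |#1#2|, ..., |#1...#9| are scored; if the best score
    % is positive, that substring becomes the next short macro (the prefix
    % plus the next letter) and every occurrence of it in the token list is
    % replaced; the window then advances by a single token.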
    cs_new:Npn __find_score:n #1
    {
    % Turning #1 into a macro of length p+2
    % (e.g. |<prefix>A|, where p is the prefix length) saves this
    % number of chars. Good if non-negative.
    cs_if_exist:cTF { l__find_chunk_ ' tl_to_str:n {#1} ' _tl }
    {
    int_eval:n
    {
    tl_count:c { l__find_chunk_ ' tl_to_str:n {#1} ' _tl }
    * ( str_count:n {#1} - 3 - l__find_prefix_int )
    - 2 * l__find_prefix_int - 9
    }
    }
    { -1 }
    }
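    % Reading of the score: if the chunk is L characters long as a string,
    % the prefix has p characters, and the chunk picked up n 'i' tokens in
    % |__find_common:|, the score is n*(L - 3 - p) - 2p - 9. For instance
    % L = 9, p = 0 and n = 2 give 2*(9 - 3) - 9 = 3, so a nine-character
    % chunk recorded twice already pays for its macro.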

    cs_new_protected:Npn __find_undefine_tmp:
    {
    int_step_inline:nnnn { 4 } { 1 } { 9 }
    {
    seq_map_inline:cn { l__find_chunks_##1_seq }
    { cs_undefine:c { l__find_chunk_ ' tl_to_str:n {####1} ' _tl } }
    }
    }

    cs_new_protected:Npn __find_unescape_tl:nN #1#2
    {
    regex_replace_all:nnN { c{#1{} } { cB{ cE? } #2
    dot:
    regex_replace_all:nnN { c{#1}} } { cB? cE} } #2
    dot:
    regex_replace_all:nnN { c[BE]? } { } #2
    dot:
    regex_replace_all:nnN { c{#1#} } { cP# } #2
    dot:
    }

    %%%% Hack LaTeX3's iow_wrap to allow zero-width breaking spaces
    tl_const:Nx c__iow_wrap_zwbs_marker_tl
    {
    c_catcode_other_space_tl
    c__iow_wrap_marker_tl
    c_catcode_other_space_tl
    zwbs
    c_catcode_other_space_tl
    }
    cs_new_protected_nopar:Npn __iow_wrap_zwbs:
    {
    tl_put_right:Nn l__iow_current_line_tl
    { tex_romannumeral:D -`0 }
    int_decr:N l__iow_current_line_int
    }
    %%% End of hack.

    msg_new:nnn { find } { invalid-prefix }
    { The~prefix~used,~'#1',~must~be~made~of~letters. }

    cs_set_protected:Npn FindMatches #1#2
    {
    find_matches:nnN {#1} {#2} l_tmpa_tl
    tl_use:N l_tmpa_tl
    }
    ExplSyntaxOff

    % Set up some basic things to behave a bit like plain TeX.
    %
    documentclass{article}
    begin{document}
    setlength{parindent}{0pt}
    defbye{end{document}}
    %
    % End of setup.

    FindMatches{}{% Let's go!

    On the first day of Christmas my true love gave to mepar
    a partridge in a pear tree.

    bigskip

    On the second day of Christmas my true love gave to mepar
    two turtle dovespar
    and a partridge in a pear tree.

    bigskip

    On the third day of Christmas my true love gave to mepar
    three french henspar
    two turtle dovespar
    and a partridge in a pear tree.

    bigskip

    On the fourth day of Christmas my true love gave to mepar
    four calling birdspar
    three french henspar
    two turtle dovespar
    and a partridge in a pear tree.

    bigskip

    On the fifth day of Christmas my true love gave to mepar
    five gold ringspar
    four calling birdspar
    three french henspar
    two turtle dovespar
    and a partridge in a pear tree.

    bigskip

    On the sixth day of Christmas my true love gave to mepar
    six geese a layingpar
    five gold ringspar
    four calling birdspar
    three french henspar
    two turtle dovespar
    and a partridge in a pear tree.

    bigskip

    On the seventh day of Christmas my true love gave to mepar
    seven swans a swimmingpar
    six geese a layingpar
    five gold ringspar
    four calling birdspar
    three french henspar
    two turtle dovespar
    and a partridge in a pear tree.

    bigskip

    On the eighth day of Christmas my true love gave to mepar
    eight maids a milkingpar
    seven swans a swimmingpar
    six geese a layingpar
    five gold ringspar
    four calling birdspar
    three french henspar
    two turtle dovespar
    and a partridge in a pear tree.

    bigskip

    On the ninth day of Christmas my true love gave to mepar
    nine ladies dancingpar
    eight maids a milkingpar
    seven swans a swimmingpar
    six geese a layingpar
    five gold ringspar
    four calling birdspar
    three french henspar
    two turtle dovespar
    and a partridge in a pear tree.

    bigskip

    On the tenth day of Christmas my true love gave to mepar
    ten lords a leapingpar
    nine ladies dancingpar
    eight maids a milkingpar
    seven swans a swimmingpar
    six geese a layingpar
    five gold ringspar
    four calling birdspar
    three french henspar
    two turtle dovespar
    and a partridge in a pear tree.

    bigskip

    On the eleventh day of Christmas my true love gave to mepar
    eleven pipers pipingpar
    ten lords a leapingpar
    nine ladies dancingpar
    eight maids a milkingpar
    seven swans a swimmingpar
    six geese a layingpar
    five gold ringspar
    four calling birdspar
    three french henspar
    two turtle dovespar
    and a partridge in a pear tree.

    bigskip

    On the twelfth day of Christmas my true love gave to mepar
    twelve drummers drummingpar
    eleven pipers pipingpar
    ten lords a leapingpar
    nine ladies dancingpar
    eight maids a milkingpar
    seven swans a swimmingpar
    six geese a layingpar
    five gold ringspar
    four calling birdspar
    three french henspar
    two turtle dovespar
    and a partridge in a pear tree.
    bye
    }
























    • 5




      Awesome as always, Bruno! :)
      – Paulo Cereda
      May 3 '13 at 0:54






    • 5




      This is one of the more outrageous things I've ever seen here. (It is very impressive!)
      – Ryan Reich
      May 3 '13 at 0:56










    • Hehe! Thanks @RyanReich. By the way, do you know of any algorithm in the vein of LZW that would be able to produce macros with varying arguments?
      – Bruno Le Floch
      May 3 '13 at 1:10












    • @Bruno All I know about LZW is the idea, unfortunately. You mean, you want the "dictionary" to be macros that take an argument, or that take a variable number of arguments? Isn't this the same as having an array and then having the macro argument index into it?
      – Ryan Reich
      May 3 '13 at 4:15






    • 2




      @BrunoLeFloch no wonder we get stalled on the middle layer of L3, but at least nobody can any longer deny that expl3 is up to arbitrary programming tasks :-)
      – Frank Mittelbach
      May 4 '13 at 20:16














    answered May 3 '13 at 0:48









    Bruno Le Floch























    What I present is a total kludge, stealing things from my titlecaps package. So the format is not optimal, but it might give a place from which to proceed. Nice thing is that punctuation is screened out of the search in the target-text.



    When a search-string and target-text (not exceeding one paragraph) are passed to \seekphrase, it will output where each word of the search-phrase appears in the target-text. I output each hit as a pair "target-location:search-word-index". To try to make sense of it, I output a [ before a search-word-index=1 match and a ] after a search-word-index=n match. Thus, to qualify as a matching "phrase" in, for example, a 3-word search, the output would have to contain an instance of something like [214:1 215:2 216:3].
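
    In other words, as I read the scheme above: in a 3-word search, a run such as [214:1 215:2 216:3] says that search words 1, 2 and 3 were found at the consecutive target-word positions 214, 215 and 216, i.e. a genuine phrase match, whereas an isolated 214:1 with no 215:2 right after it only shows that the first word of the phrase occurs at position 214.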



    Further refinement is clearly possible, but it would take more time than I have at the moment.



    \documentclass{article}
    \usepackage{titlecaps}
    \usepackage{lipsum}
    \makeatletter
    \renewcommand\seek@lcwords{%
    \kill@punct%
    \setcounter{word@count}{0}%
    \whiledo{\value{word@count} < \narg}{%
    \addtocounter{word@count}{1}%
    \protected@edef\current@word{\csname arg\roman{word@count}\endcsname}%
    \def\found@word{F}%
    \setcounter{lcword@index}{0}%
    \expandafter\def\csname%
    found@word\roman{word@count}\endcsname{F}%
    \whiledo{\value{lcword@index} < \value{lc@words}}{%
    \addtocounter{lcword@index}{1}%
    \protected@edef\current@lcword{%
    \csname lcword\roman{lcword@index}\endcsname}%
    %% THE FOLLOWING THREE LINES ARE FROM DAVID CARLISLE
    \protected@edef\tmp{\noexpand\scantokens{\def\noexpand\tmp%
    {\noexpand\ifthenelse{\noexpand\equal{\current@word}{\current@lcword}}}}}%
    \tmp\ifhmode\unskip\fi\tmp
    %%
    {\expandafter\def\csname%
    found@word\roman{word@count}\endcsname{T}%
    \ifthenelse{\equal{\value{lcword@index}}{1}}{[}{}%
    \arabic{word@count}:\arabic{lcword@index}%
    \ifthenelse{\equal{\value{lcword@index}}{\value{lc@words}}}{]}{ }%
    \setcounter{lcword@index}{\value{lc@words}}}%
    {}%
    }%
    }%
    \restore@punct%
    }
    \let\getargsC\get@argsC

    \newcommand\seekphrase[2]{%
    Seeking phrase ``#1'':\\%
    \Addlcwords{#1}%
    \redefine@tertius%
    \getargsC{#2}%
    \seek@lcwords%
    \Resetlcwords%
    \par%
    }
    \makeatother

    \begin{document}

    \lipsum[1]
    \lipsum[1]

    \def\x{%
    Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Ut
    purus elit, vestibulum ut, placerat ac, adipiscing vitae, felis.
    Curabitur dictum gravida mauris. Nam arcu libero, nonummy eget,
    consectetuer id, vulputate a, magna. Donec vehicula augue eu neque.
    Pellentesque habitant morbi tristique senectus et netus et malesuada
    fames ac turpis egestas. Mauris ut leo. Cras viverra metus rhoncus sem.
    Nulla et lectus vestibulum urna fringilla ultrices. Phasellus eu tellus
    sit amet tortor gravida placerat. Integer sapien est, iaculis in,
    pretium quis, viverra ac, nunc. Praesent eget sem vel leo ultrices
    bibendum. Aenean faucibus. Morbi dolor nulla, malesuada eu, pulvinar
    at, mollis ac, nulla. Curabitur auctor semper nulla. Donec varius orci
    eget risus. Duis nibh mi, congue eu, accumsan eleifend, sagittis quis,
    diam. Duis eget orci sit amet orci dignissim rutrum.
    %
    Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Ut
    purus elit, vestibulum ut, placerat ac, adipiscing vitae, felis.
    Curabitur dictum gravida mauris. Nam arcu libero, nonummy eget,
    consectetuer id, vulputate a, magna. Donec vehicula augue eu neque.
    Pellentesque habitant morbi tristique senectus et netus et malesuada
    fames ac turpis egestas. Mauris ut leo. Cras viverra metus rhoncus sem.
    Nulla et lectus vestibulum urna fringilla ultrices. Phasellus eu tellus
    sit amet tortor gravida placerat. Integer sapien est, iaculis in,
    pretium quis, viverra ac, nunc. Praesent eget sem vel leo ultrices
    bibendum. Aenean faucibus. Morbi dolor nulla, malesuada eu, pulvinar
    at, mollis ac, nulla. Curabitur auctor semper nulla. Donec varius orci
    eget risus. Duis nibh mi, congue eu, accumsan eleifend, sagittis quis,
    diam. Duis eget orci sit amet orci dignissim rutrum.
    }

    \seekphrase{et}{\x}

    \seekphrase{et netus}{\x}

    \seekphrase{bibendum Aenean}{\x}

    \seekphrase{eget sem vel}{\x}

    \end{document}
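
    Any phrase can be hunted for the same way; for instance, one more call of the same shape (an extra test of my own, using words that do occur in the sample text) would be

    \seekphrase{sit amet}{\x}

    which, by the scheme described above, should print runs like [p:1 q:2] with q = p + 1 exactly where "sit" is immediately followed by "amet", plus isolated entries wherever either word occurs on its own.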


    (Screenshot of the typeset output of the four \seekphrase examples above.)


































      5














        edited May 2 '13 at 20:21

























        answered May 2 '13 at 20:10









        Steven B. Segletes

        152k9192400



