Search for most frequently occurring patterns in text (to replace them by macros)
Are there any readily available tools for finding the most frequently occurring text patterns (word combinations, phrases) in the text you are working on, so as to suggest which ones to replace with short macros?
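For illustration, the kind of external helper meant here would be something like a small script that counts word n-grams in the .tex source and prints the most frequent ones. The following is only a rough sketch (in Python, with placeholder file names, crudely ignoring macros, math and punctuation), not a reference to an existing packaged tool:

import re
import sys
from collections import Counter

# Read the .tex source and split it into words, crudely skipping
# anything that is not a run of letters (macros, math, punctuation).
text = open(sys.argv[1], encoding="utf-8").read()
words = re.findall(r"[A-Za-z']+", text)

# Count every 2-word and 3-word sequence with a sliding window.
counts = Counter()
for n in (2, 3):
    for i in range(len(words) - n + 1):
        counts[" ".join(words[i:i + n])] += 1

# Print the 20 most frequent phrases as candidates for short macros.
for phrase, freq in counts.most_common(20):
    print(f"{freq:5d}  {phrase}")

Run as, say, python ngrams.py mydoc.tex; phrases near the top of the list are candidates for macros.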
macros
edited Mar 8 '13 at 11:11 by Mico
asked Mar 8 '13 at 10:37 by Dennis Yurichev
Interesting question, though in case of "patterns" longer than, say, one word, things might get a bit difficult due to space/newline treatment, various ways to get space after a macro etc.
– mbork
Mar 8 '13 at 10:42
Often used word combinations, phrases...
– Dennis Yurichev
Mar 8 '13 at 10:54
Welcome to TeX.SX.
– Claudio Fiandrino
Mar 8 '13 at 10:55
An interesting question, but sounds like a job for an external tool...
– cmhughes
Mar 8 '13 at 16:19
I wonder what your goal is with this. For writing a document, you're better off analyzing which combinations you find significant and worth automating. You seem to be asking for a sort of "literate LZW" algorithm.
– Ryan Reich
May 2 '13 at 21:35
2 Answers
EDIT: I had implemented some version of the Lempel-Ziv-Welch algorithm, following a comment by Ryan Reich. I've now changed the code completely, and I don't know whether the algorithm has a name.
I fed the new code the Twelve Days of Christmas, as David Carlisle asked, and obtained the following result (for plain TeX):
longdefA#1#2{def#1{#2}A}AAG{sparACAAB}AH{AEWXVTU mVAD}AA
{CDEFG}AB{HIJAparbigskipB}AC{OPparMNKL nd}AD{RSparQ
A ring}AE{Y es dancV}AF{tZ a leapV}B{par On the }C{ day of C}D
{hristmas }E{my true l}F{ove gave }G{to mepar}H{a partrid}I{ge in a p}J
{ear tree.}K{two turtl}L{e dovespar a}M{three fre}N{nch henspar}O{four
call}P{ing birds}Q{five gold}R{six geese}S{ a laying}T{seven swa}U{ns a
swim}V{ingpar}W{eight mai}X{ds a milk}Y{nine ladi}Z{en lords } A{ }B
firstAAAB secondAAKL nd AB thirdAAMNKL nd AB fourthAAACAAB
fifthAAQA ringAG sixthAAADAG seventhAATU mVADAG eighthAAWXVT
U mVADAG ninthAAAHAG tenthAAAFAHAG eleventhAA eleven pipers pipV
AFAHAG twelfthAA twelve drummers drummV eleven pipers pipVAFAH spar
ACAHIJAbye
which is 878 bytes long. This should be compared to the 767 bytes of xii.tex, which produces the same lyrics.
\RequirePackage{expl3}
\ExplSyntaxOn
cs_generate_variant:Nn tl_replace_all:Nnn { Nxx, NVc, Nxn }
cs_generate_variant:Nn tl_replace_once:Nnn { Nxx }
cs_generate_variant:Nn tl_if_in:NnT { No }
cs_generate_variant:Nn int_min:nn { c, cc }
cs_generate_variant:Nn tl_if_eq:NNTF { cc }
cs_generate_variant:Nn tl_if_eq:nnTF { xx }
cs_generate_variant:Nn tl_if_eq:nnT { xx }
cs_generate_variant:Nn str_count:N { c }
cs_generate_variant:Nn regex_match:nnF { nV }
cs_generate_variant:Nn str_set:Nn { NV }
str_new:N l__find_prefix_str
int_new:N l__find_prefix_int
tl_new:N l__find_tl
tl_new:N l__find_chunks_tl
int_step_inline:nnnn { 1 } { 1 } { 9 }
{ seq_new:c { l__find_chunks_#1_seq } }
int_new:N l__find_common_int
int_new:N l__find_nesting_int
tl_new:N l__find_previous_tl
seq_new:N l__find_chunks_seq
int_new:N l__find_best_score_int
int_new:N l__find_macro_int
tl_new:N l__find_macros_tl
tl_new:N l__find_result_tl
int_new:N l__find_length_int
int_new:N l__find_previous_length_int
tl_new:N l__find_display_tl
cs_new_protected_nopar:Npn dot: { tex_message:D { . } }
cs_new_protected:Npn find_matches:nnN #1#2#3
{
% '#1' is the prefix, '#2' is the token list to study.
%
__find_set_prefix:n {#1}
tl_set:Nn l__find_tl { ~! #2 }
__find_escape_spaces:xN { l__find_prefix_str A } l__find_tl
int_set_eq:NN l__find_macro_int c_one
__find_get_length:V l__find_prefix_str
iow_term:x { int_use:N l__find_length_int }
int_set_eq:NN l__find_length_int c_max_int
__find_matches_aux:V l__find_prefix_str
tl_replace_once:Nnn l__find_tl { ! } { { ~ } }
tl_set:Nx #3
{
__find_preamble:
exp_not:V l__find_tl
}
}
cs_new_protected:Npn __find_escape_spaces:nN #1#2
{
regex_replace_all:nnN { cS } { c{#1} } #2
dot:
}
cs_generate_variant:Nn __find_escape_spaces:nN { x }
cs_new_nopar:Npn __find_preamble:
{
exp_not:n { long def } exp_not:c { l__find_prefix_str A } ####1####2
{
exp_not:N def ####1{####2}
exp_not:c { l__find_prefix_str A }
}
exp_not:c { l__find_prefix_str A }
}
cs_new_protected:Npn __find_matches_aux:n #1
{
int_set_eq:NN l__find_previous_length_int l__find_length_int
tl_set_eq:NN l__find_previous_tl l__find_tl
__find_escape_tl:nN {#1} l__find_tl
int_compare:nNnTF { tl_count:N l__find_tl } < c_nine
{ tl_set_eq:Nn l__find_tl l__find_previous_tl }
{
__find_set_chunks:
__find_sort_chunks:
__find_common:
__find_best_macros:
__find_undefine_tmp:
tl_set_eq:NN l__find_tl l__find_result_tl
__find_unescape_tl:nN {#1} l__find_tl
__find_get_length:n {#1}
iow_term:x { int_use:N l__find_length_int }
int_compare:nNnTF l__find_length_int < l__find_previous_length_int
{ __find_matches_aux:n {#1} }
{
iow_term:n { }
iow_term:x { l__find_display_tl }
iow_term:n { }
tl_set_eq:NN l__find_tl l__find_previous_tl
}
}
}
cs_generate_variant:Nn __find_matches_aux:n { V }
cs_new_protected:Npn __find_get_length:n #1
{
tl_set_eq:NN l__find_display_tl l__find_tl
tl_replace_once:Nxx l__find_display_tl
{ exp_not:c { #1 A } ! }
{ ~ exp_not:c { #1 A } {~} }
str_set:NV l__find_display_tl l__find_display_tl
tl_replace_all:Nxn l__find_display_tl
{ c_backslash_str #1 token_to_str:N A ~ } c_space_tl
tl_replace_all:Nnn l__find_display_tl
{ ~ c_space_tl } { ~ exp_not:c { #1 A } }
dot:
str_set:Nx l__find_display_tl { __find_preamble: l__find_display_tl }
tl_replace_all:Nxx l__find_display_tl
{ c_hash_str c_hash_str } { c_hash_str }
dot:
regex_replace_all:nnN
{ (\[A-Za-z]+) ([A-Za-z]) }
{ 1 2 }
l__find_display_tl
dot:
regex_replace_all:nnN
{ (\[A-Za-z]+) }
{ 1 c{c__iow_wrap_zwbs_marker_tl} }
l__find_display_tl
dot:
iow_wrap:nnnN { l__find_display_tl } { } { } __find_get_length_aux:n
}
cs_generate_variant:Nn __find_get_length:n { V }
cs_new_protected:Npn __find_get_length_aux:n #1
{
int_set:Nn l__find_length_int { str_count:n {#1} }
tl_set:Nn l__find_display_tl {#1}
% iow_term:n { }
% iow_term:n {#1}
% iow_term:n { }
}
cs_new_protected:Npn __find_set_prefix:n #1
{
% Check that the prefix |#1| is made only of alphabetic characters.
%
str_set:Nx l__find_prefix_str {#1}
int_set:Nn l__find_prefix_int { str_count:N l__find_prefix_str }
regex_match:nVF { Aw*Z } l__find_prefix_str
{
msg_error:nnx { find } { invalid-prefix }
{ l__find_prefix_str }
}
dot:
}
cs_new_protected:Npn __find_escape_tl:nN #1#2
{
% During the 'study' step, we manipulate the token list |#2|
% with all begin-group and end-group tokens replaced by a
% control sequence built from the prefix. We must change both
% begin-group and end-group tokens in one pass, to avoid getting an
% unbalanced result. Also replace macro parameters because they
% cannot be used as delimiters for macros. Spaces have been
% turned into a control sequence earlier. At this stage, every
% token in |#2| can be grabbed as an N-type argument.
%
regex_replace_all:nnN { cB. } { cB{ } #2
dot:
regex_replace_all:nnN { cE. } { cE} } #2
dot:
regex_replace_all:nnN { cP. } { c{ #1 # } } #2
dot:
regex_replace_all:nnN { c[BEP]. } { c{ #1 } } #2
dot:
}
cs_new_protected_nopar:Npn __find_set_chunks:
{
% Build a token list whose items are each nine consecutive tokens
% of the original token list, in a running window. So for instance
% |ABCDEFGHIJKL| would lead to the following (12) items:
% |ABCDEFGHI|, |BCDEFGHIJ|, |CDEFGHIJK|, |DEFGHIJKL|, |EFGHIJKL|,
% |FGHIJKL|, |GHIJKL|, |HIJKL|, |IJKL|, |JKL|, |KL|, |L|. The items
% of this token list are built in an |x|-expanded loop.
% A special case arises if the |find| token list is too short to
% safely perform the loop: then our whole algorithm is not going to
% do any good anyways, so we build an empty chunk list.
%
tl_set:Nx l__find_chunks_tl
{
exp_after:wN __find_set_chunks_loop:NNNNNNNNN l__find_tl
q_recursion_tail q_recursion_stop
}
}
cs_new:Npn __find_set_chunks_loop:NNNNNNNNN #1#2#3#4#5#6#7#8#9
{
% As usual in a TeX loop, first check for the end-loop marker (here,
% cs{q_recursion_tail}). If it is reached, we fill in the last few
% chunks (which become shorter and shorter as we go). Otherwise,
% add (to the token list we are building) an item containing (9)
% tokens, and loop, dropping the first of the items.
%
quark_if_recursion_tail_stop_do:Nn #9
{ __find_set_chunks_end:NNNNNNNN #1 #2 #3 #4 #5 #6 #7 #8 }
{ exp_not:n { #1 #2 #3 #4 #5 #6 #7 #8 #9 } }
__find_set_chunks_loop:NNNNNNNNN #2#3#4#5#6#7#8#9
}
cs_new:Npn __find_set_chunks_end:NNNNNNNN #1#2#3#4#5#6#7#8
{
exp_not:n
{
{ #1 #2 #3 #4 #5 #6 #7 #8 }
{ #2 #3 #4 #5 #6 #7 #8 }
{ #3 #4 #5 #6 #7 #8 }
{ #4 #5 #6 #7 #8 }
{ #5 #6 #7 #8 }
{ #6 #7 #8 }
{ #7 #8 }
{ #8 }
}
}
cs_new_protected:Npn __find_sort_chunks:
{
tl_sort:Nn l__find_chunks_tl
{
int_compare:nNnTF
{
pdftex_strcmp:D
{ exp_not:n {##1} }
{ exp_not:n {##2} }
}
> c_zero
{ sort_return_swapped: }
{ sort_return_same: }
}
}
cs_new_protected:Npn __find_common:
{
int_step_inline:nnnn { 1 } { 1 } { 9 }
{ seq_clear:c { l__find_chunks_##1_seq } }
exp_after:wN __find_common_loop:nn l__find_chunks_tl
q_recursion_tail q_recursion_tail q_recursion_stop
int_step_inline:nnnn { 4 } { 1 } { 9 }
{
seq_map_inline:cn { l__find_chunks_##1_seq }
{
tl_if_exist:cTF { l__find_chunk_ ' tl_to_str:n {####1} ' _tl }
{
tl_put_right:cn
{ l__find_chunk_ ' tl_to_str:n {####1} ' _tl } { i }
}
{
cs_set_eq:cN
{ l__find_chunk_ ' tl_to_str:n {####1} ' _tl } c_empty_tl
}
}
}
}
cs_new_protected:Npn __find_common_loop:nn #1#2
{
quark_if_recursion_tail_stop:n {#2}
int_zero:N l__find_common_int
__find_count_common_aux:nn {#1} {#2}
use:c { __find_common_ int_use:N l__find_common_int :w }
#1 X X X X X X X X X q_stop
__find_common_loop:nn {#2}
}
cs_new_protected:Npn __find_count_common_aux:nn #1#2
{
tl_if_empty:nF {#1}
{
tl_if_empty:nF {#2}
{
tl_if_eq:xxT { tl_head:n {#1} } { tl_head:n {#2} }
{
int_incr:N l__find_common_int
__find_count_common_aux:xx
{ tl_tail:n {#1} } { tl_tail:n {#2} }
}
}
}
}
cs_generate_variant:Nn __find_count_common_aux:nn { xx }
cs_new_eq:cN { __find_common_0:w } use_none_delimit_by_q_stop:w
cs_new_protected:cpn { __find_common_1:w } #1
{ __find_common_auxii:nnw { 1 } {#1} }
cs_new_protected:cpn { __find_common_2:w } #1#2
{ __find_common_auxii:nnw { 2 } { #1 #2 } }
cs_new_protected:cpn { __find_common_3:w } #1#2#3
{ __find_common_auxii:nnw { 3 } { #1 #2 #3 } }
cs_new_protected:cpn { __find_common_4:w } #1#2#3#4
{ __find_common_auxii:nnw { 4 } { #1 #2 #3 #4 } }
cs_new_protected:cpn { __find_common_5:w } #1#2#3#4#5
{
dot:
__find_common_auxii:nnw { 5 } { #1 #2 #3 #4 #5 }
}
cs_new_protected:cpn { __find_common_6:w } #1#2#3#4#5#6
{ __find_common_auxii:nnw { 6 } { #1 #2 #3 #4 #5 #6 } }
cs_new_protected:cpn { __find_common_7:w } #1#2#3#4#5#6#7
{ __find_common_auxii:nnw { 7 } { #1 #2 #3 #4 #5 #6 #7 } }
cs_new_protected:cpn { __find_common_8:w } #1#2#3#4#5#6#7#8
{ __find_common_auxii:nnw { 8 } { #1 #2 #3 #4 #5 #6 #7 #8 } }
cs_new_protected:cpn { __find_common_9:w } #1#2#3#4#5#6#7#8#9
{ __find_common_auxii:nnw { 9 } { #1 #2 #3 #4 #5 #6 #7 #8 #9 } }
cs_new_protected:Npn __find_common_auxii:nnw #1#2#3 q_stop
{
int_zero:N l__find_nesting_int
tl_map_inline:nn {#2}
{
str_case_x:nnn { exp_not:n {##1} }
{
{ exp_not:c { l__find_prefix_str c_lbrace_str } }
{ int_incr:N l__find_nesting_int }
{ exp_not:c { l__find_prefix_str c_rbrace_str } }
{
int_compare:nNnF l__find_nesting_int > c_zero
{ use_none_delimit_by_q_stop:w }
int_decr:N l__find_nesting_int
}
}
{ }
}
int_compare:nNnF l__find_nesting_int = c_zero
{ use_none_delimit_by_q_stop:w }
seq_put_right:cn { l__find_chunks_#1_seq } {#2}
use_none_delimit_by_q_stop:w
q_stop
}
cs_new_protected_nopar:Npn __find_best_macros:
{
tl_clear:N l__find_macros_tl
tl_clear:N l__find_result_tl
__find_best_macros_aux:
tl_put_left:NV l__find_result_tl l__find_macros_tl
}
cs_new_protected:Npn __find_best_macros_aux:
{
exp_after:wN __find_best_macros_auxii:NNNNNNNNN l__find_tl
q_nil q_nil q_nil q_nil q_nil q_nil q_nil q_nil q_nil q_stop
tl_if_empty:NF l__find_tl { __find_best_macros_aux: }
}
cs_new_protected:Npn __find_best_macros_auxii:NNNNNNNNN #1#2#3#4#5#6#7#8#9
{
int_zero:N l__find_best_score_int
tl_clear:N l__find_best_chunk_tl
tl_map_inline:nn
{
{#1} {#1#2} {#1#2#3} {#1#2#3#4} {#1#2#3#4#5} {#1#2#3#4#5#6}
{#1#2#3#4#5#6#7} {#1#2#3#4#5#6#7#8} {#1#2#3#4#5#6#7#8#9}
}
{
int_compare:nNnT
{ __find_score:n {##1} } > l__find_best_score_int
{
tl_set:Nn l__find_best_chunk_tl {##1}
int_set:Nn l__find_best_score_int
{ __find_score:n {##1} }
}
}
tl_if_empty:NF l__find_best_chunk_tl
{
int_incr:N l__find_macro_int
tl_put_right:Nx l__find_macros_tl
{
exp_not:c
{ l__find_prefix_str int_to_Alph:n { l__find_macro_int } }
{ exp_not:V l__find_best_chunk_tl }
}
tl_replace_all:NVc l__find_tl l__find_best_chunk_tl
{ l__find_prefix_str int_to_Alph:n { l__find_macro_int } }
dot:
}
tl_put_right:Nx l__find_result_tl { tl_head:N l__find_tl }
tl_set:Nx l__find_tl { tl_tail:N l__find_tl }
use_none_delimit_by_q_stop:w
}
cs_new:Npn __find_score:n #1
{
% Turning ####1 into a length (p+2) macro
% (e.g. |<prefix>A|) saves this number of chars.
% Good if non-negative.
cs_if_exist:cTF { l__find_chunk_ ' tl_to_str:n {#1} ' _tl }
{
int_eval:n
{
tl_count:c { l__find_chunk_ ' tl_to_str:n {#1} ' _tl }
* ( str_count:n {#1} - 3 - l__find_prefix_int )
- 2 * l__find_prefix_int - 9
}
}
{ -1 }
}
cs_new_protected:Npn __find_undefine_tmp:
{
int_step_inline:nnnn { 4 } { 1 } { 9 }
{
seq_map_inline:cn { l__find_chunks_##1_seq }
{ cs_undefine:c { l__find_chunk_ ' tl_to_str:n {####1} ' _tl } }
}
}
cs_new_protected:Npn __find_unescape_tl:nN #1#2
{
regex_replace_all:nnN { c{#1{} } { cB{ cE? } #2
dot:
regex_replace_all:nnN { c{#1}} } { cB? cE} } #2
dot:
regex_replace_all:nnN { c[BE]? } { } #2
dot:
regex_replace_all:nnN { c{#1#} } { cP# } #2
dot:
}
%%%% Hack LaTeX3's iow_wrap to allow zero-width breaking spaces
tl_const:Nx c__iow_wrap_zwbs_marker_tl
{
c_catcode_other_space_tl
c__iow_wrap_marker_tl
c_catcode_other_space_tl
zwbs
c_catcode_other_space_tl
}
cs_new_protected_nopar:Npn __iow_wrap_zwbs:
{
tl_put_right:Nn l__iow_current_line_tl
{ tex_romannumeral:D -`0 }
int_decr:N l__iow_current_line_int
}
%%% End of hack.
msg_new:nnn { find } { invalid-prefix }
{ The~prefix~used,~'#1',~must~be~made~of~letters. }
cs_set_protected:Npn FindMatches #1#2
{
find_matches:nnN {#1} {#2} l_tmpa_tl
tl_use:N l_tmpa_tl
}
\ExplSyntaxOff
% Set up some basic things to behave a bit like plain TeX.
%
\documentclass{article}
\begin{document}
\setlength{\parindent}{0pt}
\def\bye{\end{document}}
%
% End of setup.
\FindMatches{}{% Let's go!
On the first day of Christmas my true love gave to me\par
a partridge in a pear tree.
\bigskip
On the second day of Christmas my true love gave to me\par
two turtle doves\par
and a partridge in a pear tree.
\bigskip
On the third day of Christmas my true love gave to me\par
three french hens\par
two turtle doves\par
and a partridge in a pear tree.
\bigskip
On the fourth day of Christmas my true love gave to me\par
four calling birds\par
three french hens\par
two turtle doves\par
and a partridge in a pear tree.
\bigskip
On the fifth day of Christmas my true love gave to me\par
five gold rings\par
four calling birds\par
three french hens\par
two turtle doves\par
and a partridge in a pear tree.
\bigskip
On the sixth day of Christmas my true love gave to me\par
six geese a laying\par
five gold rings\par
four calling birds\par
three french hens\par
two turtle doves\par
and a partridge in a pear tree.
\bigskip
On the seventh day of Christmas my true love gave to me\par
seven swans a swimming\par
six geese a laying\par
five gold rings\par
four calling birds\par
three french hens\par
two turtle doves\par
and a partridge in a pear tree.
\bigskip
On the eighth day of Christmas my true love gave to me\par
eight maids a milking\par
seven swans a swimming\par
six geese a laying\par
five gold rings\par
four calling birds\par
three french hens\par
two turtle doves\par
and a partridge in a pear tree.
\bigskip
On the ninth day of Christmas my true love gave to me\par
nine ladies dancing\par
eight maids a milking\par
seven swans a swimming\par
six geese a laying\par
five gold rings\par
four calling birds\par
three french hens\par
two turtle doves\par
and a partridge in a pear tree.
\bigskip
On the tenth day of Christmas my true love gave to me\par
ten lords a leaping\par
nine ladies dancing\par
eight maids a milking\par
seven swans a swimming\par
six geese a laying\par
five gold rings\par
four calling birds\par
three french hens\par
two turtle doves\par
and a partridge in a pear tree.
\bigskip
On the eleventh day of Christmas my true love gave to me\par
eleven pipers piping\par
ten lords a leaping\par
nine ladies dancing\par
eight maids a milking\par
seven swans a swimming\par
six geese a laying\par
five gold rings\par
four calling birds\par
three french hens\par
two turtle doves\par
and a partridge in a pear tree.
\bigskip
On the twelfth day of Christmas my true love gave to me\par
twelve drummers drumming\par
eleven pipers piping\par
ten lords a leaping\par
nine ladies dancing\par
eight maids a milking\par
seven swans a swimming\par
six geese a laying\par
five gold rings\par
four calling birds\par
three french hens\par
two turtle doves\par
and a partridge in a pear tree.
\bye
}
Awesome as always, Bruno!:)
– Paulo Cereda
May 3 '13 at 0:54
This is one of the more outrageous things I've ever seen here. (It is very impressive!)
– Ryan Reich
May 3 '13 at 0:56
Hehe! Thanks @RyanReich. By the way, do you know of any algorithm in the vein of LZW that would be able to produce macros with varying arguments?
– Bruno Le Floch
May 3 '13 at 1:10
@Bruno All I know about LZW is the idea, unfortunately. You mean, you want the "dictionary" to be macros that take an argument, or that take a variable number of arguments? Isn't this the same as having an array and then having the macro argument index into it?
– Ryan Reich
May 3 '13 at 4:15
@BrunoLeFloch no wonder we get stalled on the middle layer of L3, but at least nobody can any longer deny that expl3 is up to arbitrary programming tasks :-)
– Frank Mittelbach
May 4 '13 at 20:16
What I present is a total kludge, stealing things from my titlecaps package. So the format is not optimal, but it might give a place from which to proceed. A nice thing is that punctuation is screened out of the search in the target text.
When a search string and a target text (not exceeding one paragraph) are passed to \seekphrase, it will output where each word of the search phrase appears in the target text. I output it as a pair "target-location:search-word-index". To try to make sense of it, I output a [ before a search-word-index=1 match and a ] after a search-word-index=n match. Thus, to qualify as a matching "phrase" in, for example, a 3-word search, the output would have to contain an instance of something like [214:1 215:2, 216:3].
Further refinement is clearly possible, but it would take more time than I have at the moment.
\documentclass{article}
\usepackage{titlecaps}
\usepackage{lipsum}
\makeatletter
\renewcommand\seek@lcwords{%
  \kill@punct%
  \setcounter{word@count}{0}%
  \whiledo{\value{word@count} < \narg}{%
    \addtocounter{word@count}{1}%
    \protected@edef\current@word{\csname arg\roman{word@count}\endcsname}%
    \def\found@word{F}%
    \setcounter{lcword@index}{0}%
    \expandafter\def\csname%
      found@word\roman{word@count}\endcsname{F}%
    \whiledo{\value{lcword@index} < \value{lc@words}}{%
      \addtocounter{lcword@index}{1}%
      \protected@edef\current@lcword{%
        \csname lcword\roman{lcword@index}\endcsname}%
      %% THE FOLLOWING THREE LINES ARE FROM DAVID CARLISLE
      \protected@edef\tmp{\noexpand\scantokens{\def\noexpand\tmp%
        {\noexpand\ifthenelse{\noexpand\equal{\current@word}{\current@lcword}}}}}%
      \tmp\ifhmode\unskip\fi\tmp
      %%
      {\expandafter\def\csname%
        found@word\roman{word@count}\endcsname{T}%
        \ifthenelse{\equal{\value{lcword@index}}{1}}{[}{}%
        \arabic{word@count}:\arabic{lcword@index}%
        \ifthenelse{\equal{\value{lcword@index}}{\value{lc@words}}}{]}{ }%
        \setcounter{lcword@index}{\value{lc@words}}}%
      {}%
    }%
  }%
  \restore@punct%
}
\let\getargsC\get@argsC
\newcommand\seekphrase[2]{%
  Seeking phrase ``#1'':\\%
  \Addlcwords{#1}%
  \redefine@tertius%
  \getargsC{#2}%
  \seek@lcwords%
  \Resetlcwords%
  \par%
}
\makeatother
\begin{document}
\lipsum[1]
\lipsum[1]
\def\x{%
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Ut
purus elit, vestibulum ut, placerat ac, adipiscing vitae, felis.
Curabitur dictum gravida mauris. Nam arcu libero, nonummy eget,
consectetuer id, vulputate a, magna. Donec vehicula augue eu neque.
Pellentesque habitant morbi tristique senectus et netus et malesuada
fames ac turpis egestas. Mauris ut leo. Cras viverra metus rhoncus sem.
Nulla et lectus vestibulum urna fringilla ultrices. Phasellus eu tellus
sit amet tortor gravida placerat. Integer sapien est, iaculis in,
pretium quis, viverra ac, nunc. Praesent eget sem vel leo ultrices
bibendum. Aenean faucibus. Morbi dolor nulla, malesuada eu, pulvinar
at, mollis ac, nulla. Curabitur auctor semper nulla. Donec varius orci
eget risus. Duis nibh mi, congue eu, accumsan eleifend, sagittis quis,
diam. Duis eget orci sit amet orci dignissim rutrum.
%
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Ut
purus elit, vestibulum ut, placerat ac, adipiscing vitae, felis.
Curabitur dictum gravida mauris. Nam arcu libero, nonummy eget,
consectetuer id, vulputate a, magna. Donec vehicula augue eu neque.
Pellentesque habitant morbi tristique senectus et netus et malesuada
fames ac turpis egestas. Mauris ut leo. Cras viverra metus rhoncus sem.
Nulla et lectus vestibulum urna fringilla ultrices. Phasellus eu tellus
sit amet tortor gravida placerat. Integer sapien est, iaculis in,
pretium quis, viverra ac, nunc. Praesent eget sem vel leo ultrices
bibendum. Aenean faucibus. Morbi dolor nulla, malesuada eu, pulvinar
at, mollis ac, nulla. Curabitur auctor semper nulla. Donec varius orci
eget risus. Duis nibh mi, congue eu, accumsan eleifend, sagittis quis,
diam. Duis eget orci sit amet orci dignissim rutrum.
}
\seekphrase{et}{\x}
\seekphrase{et netus}{\x}
\seekphrase{bibendum Aenean}{\x}
\seekphrase{eget sem vel}{\x}
\end{document}
add a comment |
Your Answer
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "85"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2ftex.stackexchange.com%2fquestions%2f101434%2fsearch-for-most-frequently-occurring-patterns-in-text-to-replace-them-by-macros%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
EDIT: I had implemented some version of the Lempel-Ziv-Welch algorithm, following a comment of Ryan Reich. I've now changed the code completely and I don't know if the algorithm has a name.
I fed the new code the 12 days of Christmas, as David Carlisle asked, and obtained the following result (for plain TeX)
longdefA#1#2{def#1{#2}A}AAG{sparACAAB}AH{AEWXVTU mVAD}AA
{CDEFG}AB{HIJAparbigskipB}AC{OPparMNKL nd}AD{RSparQ
A ring}AE{Y es dancV}AF{tZ a leapV}B{par On the }C{ day of C}D
{hristmas }E{my true l}F{ove gave }G{to mepar}H{a partrid}I{ge in a p}J
{ear tree.}K{two turtl}L{e dovespar a}M{three fre}N{nch henspar}O{four
call}P{ing birds}Q{five gold}R{six geese}S{ a laying}T{seven swa}U{ns a
swim}V{ingpar}W{eight mai}X{ds a milk}Y{nine ladi}Z{en lords } A{ }B
firstAAAB secondAAKL nd AB thirdAAMNKL nd AB fourthAAACAAB
fifthAAQA ringAG sixthAAADAG seventhAATU mVADAG eighthAAWXVT
U mVADAG ninthAAAHAG tenthAAAFAHAG eleventhAA eleven pipers pipV
AFAHAG twelfthAA twelve drummers drummV eleven pipers pipVAFAH spar
ACAHIJAbye
which is 878 bytes long. This should be compared to the 767 bytes of xii.tex
, which produces the same lyrics.
RequirePackage{expl3}
ExplSyntaxOn
cs_generate_variant:Nn tl_replace_all:Nnn { Nxx, NVc, Nxn }
cs_generate_variant:Nn tl_replace_once:Nnn { Nxx }
cs_generate_variant:Nn tl_if_in:NnT { No }
cs_generate_variant:Nn int_min:nn { c, cc }
cs_generate_variant:Nn tl_if_eq:NNTF { cc }
cs_generate_variant:Nn tl_if_eq:nnTF { xx }
cs_generate_variant:Nn tl_if_eq:nnT { xx }
cs_generate_variant:Nn str_count:N { c }
cs_generate_variant:Nn regex_match:nnF { nV }
cs_generate_variant:Nn str_set:Nn { NV }
str_new:N l__find_prefix_str
int_new:N l__find_prefix_int
tl_new:N l__find_tl
tl_new:N l__find_chunks_tl
int_step_inline:nnnn { 1 } { 1 } { 9 }
{ seq_new:c { l__find_chunks_#1_seq } }
int_new:N l__find_common_int
int_new:N l__find_nesting_int
tl_new:N l__find_previous_tl
seq_new:N l__find_chunks_seq
int_new:N l__find_best_score_int
int_new:N l__find_macro_int
tl_new:N l__find_macros_tl
tl_new:N l__find_result_tl
int_new:N l__find_length_int
int_new:N l__find_previous_length_int
tl_new:N l__find_display_tl
cs_new_protected_nopar:Npn dot: { tex_message:D { . } }
cs_new_protected:Npn find_matches:nnN #1#2#3
{
% '#1' is the prefix, '#2' is the token list to study.
%
__find_set_prefix:n {#1}
tl_set:Nn l__find_tl { ~! #2 }
__find_escape_spaces:xN { l__find_prefix_str A } l__find_tl
int_set_eq:NN l__find_macro_int c_one
__find_get_length:V l__find_prefix_str
iow_term:x { int_use:N l__find_length_int }
int_set_eq:NN l__find_length_int c_max_int
__find_matches_aux:V l__find_prefix_str
tl_replace_once:Nnn l__find_tl { ! } { { ~ } }
tl_set:Nx #3
{
__find_preamble:
exp_not:V l__find_tl
}
}
cs_new_protected:Npn __find_escape_spaces:nN #1#2
{
regex_replace_all:nnN { cS } { c{#1} } #2
dot:
}
cs_generate_variant:Nn __find_escape_spaces:nN { x }
cs_new_nopar:Npn __find_preamble:
{
exp_not:n { long def } exp_not:c { l__find_prefix_str A } ####1####2
{
exp_not:N def ####1{####2}
exp_not:c { l__find_prefix_str A }
}
exp_not:c { l__find_prefix_str A }
}
cs_new_protected:Npn __find_matches_aux:n #1
{
int_set_eq:NN l__find_previous_length_int l__find_length_int
tl_set_eq:NN l__find_previous_tl l__find_tl
__find_escape_tl:nN {#1} l__find_tl
int_compare:nNnTF { tl_count:N l__find_tl } < c_nine
{ tl_set_eq:Nn l__find_tl l__find_previous_tl }
{
__find_set_chunks:
__find_sort_chunks:
__find_common:
__find_best_macros:
__find_undefine_tmp:
tl_set_eq:NN l__find_tl l__find_result_tl
__find_unescape_tl:nN {#1} l__find_tl
__find_get_length:n {#1}
iow_term:x { int_use:N l__find_length_int }
int_compare:nNnTF l__find_length_int < l__find_previous_length_int
{ __find_matches_aux:n {#1} }
{
iow_term:n { }
iow_term:x { l__find_display_tl }
iow_term:n { }
tl_set_eq:NN l__find_tl l__find_previous_tl
}
}
}
cs_generate_variant:Nn __find_matches_aux:n { V }
cs_new_protected:Npn __find_get_length:n #1
{
tl_set_eq:NN l__find_display_tl l__find_tl
tl_replace_once:Nxx l__find_display_tl
{ exp_not:c { #1 A } ! }
{ ~ exp_not:c { #1 A } {~} }
str_set:NV l__find_display_tl l__find_display_tl
tl_replace_all:Nxn l__find_display_tl
{ c_backslash_str #1 token_to_str:N A ~ } c_space_tl
tl_replace_all:Nnn l__find_display_tl
{ ~ c_space_tl } { ~ exp_not:c { #1 A } }
dot:
str_set:Nx l__find_display_tl { __find_preamble: l__find_display_tl }
tl_replace_all:Nxx l__find_display_tl
{ c_hash_str c_hash_str } { c_hash_str }
dot:
regex_replace_all:nnN
{ (\[A-Za-z]+) ([A-Za-z]) }
{ 1 2 }
l__find_display_tl
dot:
regex_replace_all:nnN
{ (\[A-Za-z]+) }
{ 1 c{c__iow_wrap_zwbs_marker_tl} }
l__find_display_tl
dot:
iow_wrap:nnnN { l__find_display_tl } { } { } __find_get_length_aux:n
}
cs_generate_variant:Nn __find_get_length:n { V }
cs_new_protected:Npn __find_get_length_aux:n #1
{
int_set:Nn l__find_length_int { str_count:n {#1} }
tl_set:Nn l__find_display_tl {#1}
% iow_term:n { }
% iow_term:n {#1}
% iow_term:n { }
}
cs_new_protected:Npn __find_set_prefix:n #1
{
% Check that the prefix |#1| is made only of alphabetic characters.
%
str_set:Nx l__find_prefix_str {#1}
int_set:Nn l__find_prefix_int { str_count:N l__find_prefix_str }
regex_match:nVF { Aw*Z } l__find_prefix_str
{
msg_error:nnx { find } { invalid-prefix }
{ l__find_prefix_str }
}
dot:
}
cs_new_protected:Npn __find_escape_tl:nN #1#2
{
% During the 'study' step, we manipulate the token list |#2|
% with all begin-group and end-group tokens replaced by a
% control sequence built from the prefix. We must change both
% begin-group and end-group tokens in one pass, to avoid getting an
% unbalanced result. Also replace macro parameters because they
% cannot be used as delimiters for macros. Spaces have been
% turned into a control sequence earlier. At this stage, every
% token in |#2| can be grabbed as an N-type argument.
%
regex_replace_all:nnN { cB. } { cB{ } #2
dot:
regex_replace_all:nnN { cE. } { cE} } #2
dot:
regex_replace_all:nnN { cP. } { c{ #1 # } } #2
dot:
regex_replace_all:nnN { c[BEP]. } { c{ #1 } } #2
dot:
}
cs_new_protected_nopar:Npn __find_set_chunks:
{
% Build a token list whose items are each nine consecutive tokens
% of the original token list, in a running window. So for instance
% |ABCDEFGHIJKL| would lead to the following (12) items:
% |ABCDEFGHI|, |BCDEFGHIJ|, |CDEFGHIJK|, |DEFGHIJKL|, |EFGHIJKL|,
% |FGHIJKL|, |GHIJKL|, |HIJKL|, |IJKL|, |JKL|, |KL|, |L|. The items
% of this token list are built in an |x|-expanded loop.
% A special case arises if the |find| token list is too short to
% safely perform the loop: then our whole algorithm is not going to
% do any good anyways, so we build an empty chunk list.
%
tl_set:Nx l__find_chunks_tl
{
exp_after:wN __find_set_chunks_loop:NNNNNNNNN l__find_tl
q_recursion_tail q_recursion_stop
}
}
cs_new:Npn __find_set_chunks_loop:NNNNNNNNN #1#2#3#4#5#6#7#8#9
{
% As usual in a TeX loop, first check for the end-loop marker (here,
% cs{q_recursion_tail}). If it is reached, we fill in the last few
% chunks (which become shorter and shorter as we go). Otherwise,
% add (to the token list we are building) an item containing (9)
% tokens, and loop, dropping the first of the items.
%
quark_if_recursion_tail_stop_do:Nn #9
{ __find_set_chunks_end:NNNNNNNN #1 #2 #3 #4 #5 #6 #7 #8 }
{ exp_not:n { #1 #2 #3 #4 #5 #6 #7 #8 #9 } }
__find_set_chunks_loop:NNNNNNNNN #2#3#4#5#6#7#8#9
}
cs_new:Npn __find_set_chunks_end:NNNNNNNN #1#2#3#4#5#6#7#8
{
exp_not:n
{
{ #1 #2 #3 #4 #5 #6 #7 #8 }
{ #2 #3 #4 #5 #6 #7 #8 }
{ #3 #4 #5 #6 #7 #8 }
{ #4 #5 #6 #7 #8 }
{ #5 #6 #7 #8 }
{ #6 #7 #8 }
{ #7 #8 }
{ #8 }
}
}
cs_new_protected:Npn __find_sort_chunks:
{
tl_sort:Nn l__find_chunks_tl
{
int_compare:nNnTF
{
pdftex_strcmp:D
{ exp_not:n {##1} }
{ exp_not:n {##2} }
}
> c_zero
{ sort_return_swapped: }
{ sort_return_same: }
}
}
cs_new_protected:Npn __find_common:
{
int_step_inline:nnnn { 1 } { 1 } { 9 }
{ seq_clear:c { l__find_chunks_##1_seq } }
exp_after:wN __find_common_loop:nn l__find_chunks_tl
q_recursion_tail q_recursion_tail q_recursion_stop
int_step_inline:nnnn { 4 } { 1 } { 9 }
{
seq_map_inline:cn { l__find_chunks_##1_seq }
{
tl_if_exist:cTF { l__find_chunk_ ' tl_to_str:n {####1} ' _tl }
{
tl_put_right:cn
{ l__find_chunk_ ' tl_to_str:n {####1} ' _tl } { i }
}
{
cs_set_eq:cN
{ l__find_chunk_ ' tl_to_str:n {####1} ' _tl } c_empty_tl
}
}
}
}
cs_new_protected:Npn __find_common_loop:nn #1#2
{
quark_if_recursion_tail_stop:n {#2}
int_zero:N l__find_common_int
__find_count_common_aux:nn {#1} {#2}
use:c { __find_common_ int_use:N l__find_common_int :w }
#1 X X X X X X X X X q_stop
__find_common_loop:nn {#2}
}
cs_new_protected:Npn __find_count_common_aux:nn #1#2
{
tl_if_empty:nF {#1}
{
tl_if_empty:nF {#2}
{
tl_if_eq:xxT { tl_head:n {#1} } { tl_head:n {#2} }
{
int_incr:N l__find_common_int
__find_count_common_aux:xx
{ tl_tail:n {#1} } { tl_tail:n {#2} }
}
}
}
}
cs_generate_variant:Nn __find_count_common_aux:nn { xx }
cs_new_eq:cN { __find_common_0:w } use_none_delimit_by_q_stop:w
cs_new_protected:cpn { __find_common_1:w } #1
{ __find_common_auxii:nnw { 1 } {#1} }
cs_new_protected:cpn { __find_common_2:w } #1#2
{ __find_common_auxii:nnw { 2 } { #1 #2 } }
cs_new_protected:cpn { __find_common_3:w } #1#2#3
{ __find_common_auxii:nnw { 3 } { #1 #2 #3 } }
cs_new_protected:cpn { __find_common_4:w } #1#2#3#4
{ __find_common_auxii:nnw { 4 } { #1 #2 #3 #4 } }
cs_new_protected:cpn { __find_common_5:w } #1#2#3#4#5
{
dot:
__find_common_auxii:nnw { 5 } { #1 #2 #3 #4 #5 }
}
cs_new_protected:cpn { __find_common_6:w } #1#2#3#4#5#6
{ __find_common_auxii:nnw { 6 } { #1 #2 #3 #4 #5 #6 } }
cs_new_protected:cpn { __find_common_7:w } #1#2#3#4#5#6#7
{ __find_common_auxii:nnw { 7 } { #1 #2 #3 #4 #5 #6 #7 } }
cs_new_protected:cpn { __find_common_8:w } #1#2#3#4#5#6#7#8
{ __find_common_auxii:nnw { 8 } { #1 #2 #3 #4 #5 #6 #7 #8 } }
cs_new_protected:cpn { __find_common_9:w } #1#2#3#4#5#6#7#8#9
{ __find_common_auxii:nnw { 9 } { #1 #2 #3 #4 #5 #6 #7 #8 #9 } }
cs_new_protected:Npn __find_common_auxii:nnw #1#2#3 q_stop
{
int_zero:N l__find_nesting_int
tl_map_inline:nn {#2}
{
str_case_x:nnn { exp_not:n {##1} }
{
{ exp_not:c { l__find_prefix_str c_lbrace_str } }
{ int_incr:N l__find_nesting_int }
{ exp_not:c { l__find_prefix_str c_rbrace_str } }
{
int_compare:nNnF l__find_nesting_int > c_zero
{ use_none_delimit_by_q_stop:w }
int_decr:N l__find_nesting_int
}
}
{ }
}
int_compare:nNnF l__find_nesting_int = c_zero
{ use_none_delimit_by_q_stop:w }
seq_put_right:cn { l__find_chunks_#1_seq } {#2}
use_none_delimit_by_q_stop:w
q_stop
}
cs_new_protected_nopar:Npn __find_best_macros:
{
tl_clear:N l__find_macros_tl
tl_clear:N l__find_result_tl
__find_best_macros_aux:
tl_put_left:NV l__find_result_tl l__find_macros_tl
}
cs_new_protected:Npn __find_best_macros_aux:
{
exp_after:wN __find_best_macros_auxii:NNNNNNNNN l__find_tl
q_nil q_nil q_nil q_nil q_nil q_nil q_nil q_nil q_nil q_stop
tl_if_empty:NF l__find_tl { __find_best_macros_aux: }
}
cs_new_protected:Npn __find_best_macros_auxii:NNNNNNNNN #1#2#3#4#5#6#7#8#9
{
int_zero:N l__find_best_score_int
tl_clear:N l__find_best_chunk_tl
tl_map_inline:nn
{
{#1} {#1#2} {#1#2#3} {#1#2#3#4} {#1#2#3#4#5} {#1#2#3#4#5#6}
{#1#2#3#4#5#6#7} {#1#2#3#4#5#6#7#8} {#1#2#3#4#5#6#7#8#9}
}
{
int_compare:nNnT
{ __find_score:n {##1} } > l__find_best_score_int
{
tl_set:Nn l__find_best_chunk_tl {##1}
int_set:Nn l__find_best_score_int
{ __find_score:n {##1} }
}
}
tl_if_empty:NF l__find_best_chunk_tl
{
int_incr:N l__find_macro_int
tl_put_right:Nx l__find_macros_tl
{
exp_not:c
{ l__find_prefix_str int_to_Alph:n { l__find_macro_int } }
{ exp_not:V l__find_best_chunk_tl }
}
tl_replace_all:NVc l__find_tl l__find_best_chunk_tl
{ l__find_prefix_str int_to_Alph:n { l__find_macro_int } }
dot:
}
tl_put_right:Nx l__find_result_tl { tl_head:N l__find_tl }
tl_set:Nx l__find_tl { tl_tail:N l__find_tl }
use_none_delimit_by_q_stop:w
}
cs_new:Npn __find_score:n #1
{
% Turning ####1 into a length (p+2) macro
% (e.g. |<prefix>A|) saves this number of chars.
% Good if non-negative.
cs_if_exist:cTF { l__find_chunk_ ' tl_to_str:n {#1} ' _tl }
{
int_eval:n
{
tl_count:c { l__find_chunk_ ' tl_to_str:n {#1} ' _tl }
* ( str_count:n {#1} - 3 - l__find_prefix_int )
- 2 * l__find_prefix_int - 9
}
}
{ -1 }
}
cs_new_protected:Npn __find_undefine_tmp:
{
int_step_inline:nnnn { 4 } { 1 } { 9 }
{
seq_map_inline:cn { l__find_chunks_##1_seq }
{ cs_undefine:c { l__find_chunk_ ' tl_to_str:n {####1} ' _tl } }
}
}
cs_new_protected:Npn __find_unescape_tl:nN #1#2
{
regex_replace_all:nnN { c{#1{} } { cB{ cE? } #2
dot:
regex_replace_all:nnN { c{#1}} } { cB? cE} } #2
dot:
regex_replace_all:nnN { c[BE]? } { } #2
dot:
regex_replace_all:nnN { c{#1#} } { cP# } #2
dot:
}
%%%% Hack LaTeX3's iow_wrap to allow zero-width breaking spaces
tl_const:Nx c__iow_wrap_zwbs_marker_tl
{
c_catcode_other_space_tl
c__iow_wrap_marker_tl
c_catcode_other_space_tl
zwbs
c_catcode_other_space_tl
}
cs_new_protected_nopar:Npn __iow_wrap_zwbs:
{
tl_put_right:Nn l__iow_current_line_tl
{ tex_romannumeral:D -`0 }
int_decr:N l__iow_current_line_int
}
%%% End of hack.
msg_new:nnn { find } { invalid-prefix }
{ The~prefix~used,~'#1',~must~be~made~of~letters. }
cs_set_protected:Npn FindMatches #1#2
{
find_matches:nnN {#1} {#2} l_tmpa_tl
tl_use:N l_tmpa_tl
}
ExplSyntaxOff
% Set up some basic things to behave a bit like plain TeX.
%
documentclass{article}
begin{document}
setlength{parindent}{0pt}
defbye{end{document}}
%
% End of setup.
FindMatches{}{% Let's go!
On the first day of Christmas my true love gave to mepar
a partridge in a pear tree.
bigskip
On the second day of Christmas my true love gave to mepar
two turtle dovespar
and a partridge in a pear tree.
bigskip
On the third day of Christmas my true love gave to mepar
three french henspar
two turtle dovespar
and a partridge in a pear tree.
bigskip
On the fourth day of Christmas my true love gave to mepar
four calling birdspar
three french henspar
two turtle dovespar
and a partridge in a pear tree.
bigskip
On the fifth day of Christmas my true love gave to mepar
five gold ringspar
four calling birdspar
three french henspar
two turtle dovespar
and a partridge in a pear tree.
bigskip
On the sixth day of Christmas my true love gave to mepar
six geese a layingpar
five gold ringspar
four calling birdspar
three french henspar
two turtle dovespar
and a partridge in a pear tree.
bigskip
On the seventh day of Christmas my true love gave to mepar
seven swans a swimmingpar
six geese a layingpar
five gold ringspar
four calling birdspar
three french henspar
two turtle dovespar
and a partridge in a pear tree.
bigskip
On the eighth day of Christmas my true love gave to mepar
eight maids a milkingpar
seven swans a swimmingpar
six geese a layingpar
five gold ringspar
four calling birdspar
three french henspar
two turtle dovespar
and a partridge in a pear tree.
bigskip
On the ninth day of Christmas my true love gave to mepar
nine ladies dancingpar
eight maids a milkingpar
seven swans a swimmingpar
six geese a layingpar
five gold ringspar
four calling birdspar
three french henspar
two turtle dovespar
and a partridge in a pear tree.
bigskip
On the tenth day of Christmas my true love gave to mepar
ten lords a leapingpar
nine ladies dancingpar
eight maids a milkingpar
seven swans a swimmingpar
six geese a layingpar
five gold ringspar
four calling birdspar
three french henspar
two turtle dovespar
and a partridge in a pear tree.
bigskip
On the eleventh day of Christmas my true love gave to mepar
eleven pipers pipingpar
ten lords a leapingpar
nine ladies dancingpar
eight maids a milkingpar
seven swans a swimmingpar
six geese a layingpar
five gold ringspar
four calling birdspar
three french henspar
two turtle dovespar
and a partridge in a pear tree.
bigskip
On the twelfth day of Christmas my true love gave to mepar
twelve drummers drummingpar
eleven pipers pipingpar
ten lords a leapingpar
nine ladies dancingpar
eight maids a milkingpar
seven swans a swimmingpar
six geese a layingpar
five gold ringspar
four calling birdspar
three french henspar
two turtle dovespar
and a partridge in a pear tree.
bye
}
5
Awesome as always, Bruno!:)
– Paulo Cereda
May 3 '13 at 0:54
5
This is one of the more outrageous things I've ever seen here. (It is very impressive!)
– Ryan Reich
May 3 '13 at 0:56
Hehe! Thanks @RyanReich. By the way, do you know of any algorithm in the vein of LZW that would be able to produce macros with varying arguments?
– Bruno Le Floch
May 3 '13 at 1:10
@Bruno All I know about LZW is the idea, unfortunately. You mean, you want the "dictionary" to be macros that take an argument, or that take a variable number of arguments? Isn't this the same as having an array and then having the macro argument index into it?
– Ryan Reich
May 3 '13 at 4:15
2
@BrunoLeFloch no wonder we get stalled on the middle layer of L3, but at least nobody can any longer deny that expl3 is up to arbitrary programming tasks :-)
– Frank Mittelbach
May 4 '13 at 20:16
|
show 3 more comments
EDIT: I had implemented some version of the Lempel-Ziv-Welch algorithm, following a comment of Ryan Reich. I've now changed the code completely and I don't know if the algorithm has a name.
I fed the new code the 12 days of Christmas, as David Carlisle asked, and obtained the following result (for plain TeX)
longdefA#1#2{def#1{#2}A}AAG{sparACAAB}AH{AEWXVTU mVAD}AA
{CDEFG}AB{HIJAparbigskipB}AC{OPparMNKL nd}AD{RSparQ
A ring}AE{Y es dancV}AF{tZ a leapV}B{par On the }C{ day of C}D
{hristmas }E{my true l}F{ove gave }G{to mepar}H{a partrid}I{ge in a p}J
{ear tree.}K{two turtl}L{e dovespar a}M{three fre}N{nch henspar}O{four
call}P{ing birds}Q{five gold}R{six geese}S{ a laying}T{seven swa}U{ns a
swim}V{ingpar}W{eight mai}X{ds a milk}Y{nine ladi}Z{en lords } A{ }B
firstAAAB secondAAKL nd AB thirdAAMNKL nd AB fourthAAACAAB
fifthAAQA ringAG sixthAAADAG seventhAATU mVADAG eighthAAWXVT
U mVADAG ninthAAAHAG tenthAAAFAHAG eleventhAA eleven pipers pipV
AFAHAG twelfthAA twelve drummers drummV eleven pipers pipVAFAH spar
ACAHIJAbye
which is 878 bytes long. This should be compared to the 767 bytes of xii.tex
, which produces the same lyrics.
RequirePackage{expl3}
ExplSyntaxOn
cs_generate_variant:Nn tl_replace_all:Nnn { Nxx, NVc, Nxn }
cs_generate_variant:Nn tl_replace_once:Nnn { Nxx }
cs_generate_variant:Nn tl_if_in:NnT { No }
cs_generate_variant:Nn int_min:nn { c, cc }
cs_generate_variant:Nn tl_if_eq:NNTF { cc }
cs_generate_variant:Nn tl_if_eq:nnTF { xx }
cs_generate_variant:Nn tl_if_eq:nnT { xx }
cs_generate_variant:Nn str_count:N { c }
cs_generate_variant:Nn regex_match:nnF { nV }
cs_generate_variant:Nn str_set:Nn { NV }
str_new:N l__find_prefix_str
int_new:N l__find_prefix_int
tl_new:N l__find_tl
tl_new:N l__find_chunks_tl
int_step_inline:nnnn { 1 } { 1 } { 9 }
{ seq_new:c { l__find_chunks_#1_seq } }
int_new:N l__find_common_int
int_new:N l__find_nesting_int
tl_new:N l__find_previous_tl
seq_new:N l__find_chunks_seq
int_new:N l__find_best_score_int
int_new:N l__find_macro_int
tl_new:N l__find_macros_tl
tl_new:N l__find_result_tl
int_new:N l__find_length_int
int_new:N l__find_previous_length_int
tl_new:N l__find_display_tl
cs_new_protected_nopar:Npn dot: { tex_message:D { . } }
cs_new_protected:Npn find_matches:nnN #1#2#3
{
% '#1' is the prefix, '#2' is the token list to study.
%
__find_set_prefix:n {#1}
tl_set:Nn l__find_tl { ~! #2 }
__find_escape_spaces:xN { l__find_prefix_str A } l__find_tl
int_set_eq:NN l__find_macro_int c_one
__find_get_length:V l__find_prefix_str
iow_term:x { int_use:N l__find_length_int }
int_set_eq:NN l__find_length_int c_max_int
__find_matches_aux:V l__find_prefix_str
tl_replace_once:Nnn l__find_tl { ! } { { ~ } }
tl_set:Nx #3
{
__find_preamble:
exp_not:V l__find_tl
}
}
cs_new_protected:Npn __find_escape_spaces:nN #1#2
{
regex_replace_all:nnN { cS } { c{#1} } #2
dot:
}
cs_generate_variant:Nn __find_escape_spaces:nN { x }
cs_new_nopar:Npn __find_preamble:
{
exp_not:n { long def } exp_not:c { l__find_prefix_str A } ####1####2
{
exp_not:N def ####1{####2}
exp_not:c { l__find_prefix_str A }
}
exp_not:c { l__find_prefix_str A }
}
cs_new_protected:Npn __find_matches_aux:n #1
{
int_set_eq:NN l__find_previous_length_int l__find_length_int
tl_set_eq:NN l__find_previous_tl l__find_tl
__find_escape_tl:nN {#1} l__find_tl
int_compare:nNnTF { tl_count:N l__find_tl } < c_nine
{ tl_set_eq:Nn l__find_tl l__find_previous_tl }
{
__find_set_chunks:
__find_sort_chunks:
__find_common:
__find_best_macros:
__find_undefine_tmp:
tl_set_eq:NN l__find_tl l__find_result_tl
__find_unescape_tl:nN {#1} l__find_tl
__find_get_length:n {#1}
iow_term:x { int_use:N l__find_length_int }
int_compare:nNnTF l__find_length_int < l__find_previous_length_int
{ __find_matches_aux:n {#1} }
{
iow_term:n { }
iow_term:x { l__find_display_tl }
iow_term:n { }
tl_set_eq:NN l__find_tl l__find_previous_tl
}
}
}
cs_generate_variant:Nn __find_matches_aux:n { V }
cs_new_protected:Npn __find_get_length:n #1
{
tl_set_eq:NN l__find_display_tl l__find_tl
tl_replace_once:Nxx l__find_display_tl
{ exp_not:c { #1 A } ! }
{ ~ exp_not:c { #1 A } {~} }
str_set:NV l__find_display_tl l__find_display_tl
tl_replace_all:Nxn l__find_display_tl
{ c_backslash_str #1 token_to_str:N A ~ } c_space_tl
tl_replace_all:Nnn l__find_display_tl
{ ~ c_space_tl } { ~ exp_not:c { #1 A } }
dot:
str_set:Nx l__find_display_tl { __find_preamble: l__find_display_tl }
tl_replace_all:Nxx l__find_display_tl
{ c_hash_str c_hash_str } { c_hash_str }
dot:
regex_replace_all:nnN
{ (\[A-Za-z]+) ([A-Za-z]) }
{ 1 2 }
l__find_display_tl
dot:
regex_replace_all:nnN
{ (\[A-Za-z]+) }
{ 1 c{c__iow_wrap_zwbs_marker_tl} }
l__find_display_tl
dot:
iow_wrap:nnnN { l__find_display_tl } { } { } __find_get_length_aux:n
}
cs_generate_variant:Nn __find_get_length:n { V }
cs_new_protected:Npn __find_get_length_aux:n #1
{
int_set:Nn l__find_length_int { str_count:n {#1} }
tl_set:Nn l__find_display_tl {#1}
% iow_term:n { }
% iow_term:n {#1}
% iow_term:n { }
}
cs_new_protected:Npn __find_set_prefix:n #1
{
% Check that the prefix |#1| is made only of alphabetic characters.
%
str_set:Nx l__find_prefix_str {#1}
int_set:Nn l__find_prefix_int { str_count:N l__find_prefix_str }
regex_match:nVF { Aw*Z } l__find_prefix_str
{
msg_error:nnx { find } { invalid-prefix }
{ l__find_prefix_str }
}
dot:
}
cs_new_protected:Npn __find_escape_tl:nN #1#2
{
% During the 'study' step, we manipulate the token list |#2|
% with all begin-group and end-group tokens replaced by a
% control sequence built from the prefix. We must change both
% begin-group and end-group tokens in one pass, to avoid getting an
% unbalanced result. Also replace macro parameters because they
% cannot be used as delimiters for macros. Spaces have been
% turned into a control sequence earlier. At this stage, every
% token in |#2| can be grabbed as an N-type argument.
%
regex_replace_all:nnN { cB. } { cB{ } #2
dot:
regex_replace_all:nnN { cE. } { cE} } #2
dot:
regex_replace_all:nnN { cP. } { c{ #1 # } } #2
dot:
regex_replace_all:nnN { c[BEP]. } { c{ #1 } } #2
dot:
}
cs_new_protected_nopar:Npn __find_set_chunks:
{
% Build a token list whose items are each nine consecutive tokens
% of the original token list, in a running window. So for instance
% |ABCDEFGHIJKL| would lead to the following (12) items:
% |ABCDEFGHI|, |BCDEFGHIJ|, |CDEFGHIJK|, |DEFGHIJKL|, |EFGHIJKL|,
% |FGHIJKL|, |GHIJKL|, |HIJKL|, |IJKL|, |JKL|, |KL|, |L|. The items
% of this token list are built in an |x|-expanded loop.
% A special case arises if the |find| token list is too short to
% safely perform the loop: then our whole algorithm is not going to
% do any good anyways, so we build an empty chunk list.
%
tl_set:Nx l__find_chunks_tl
{
exp_after:wN __find_set_chunks_loop:NNNNNNNNN l__find_tl
q_recursion_tail q_recursion_stop
}
}
cs_new:Npn __find_set_chunks_loop:NNNNNNNNN #1#2#3#4#5#6#7#8#9
{
% As usual in a TeX loop, first check for the end-loop marker (here,
% cs{q_recursion_tail}). If it is reached, we fill in the last few
% chunks (which become shorter and shorter as we go). Otherwise,
% add (to the token list we are building) an item containing (9)
% tokens, and loop, dropping the first of the items.
%
quark_if_recursion_tail_stop_do:Nn #9
{ __find_set_chunks_end:NNNNNNNN #1 #2 #3 #4 #5 #6 #7 #8 }
{ exp_not:n { #1 #2 #3 #4 #5 #6 #7 #8 #9 } }
__find_set_chunks_loop:NNNNNNNNN #2#3#4#5#6#7#8#9
}
cs_new:Npn __find_set_chunks_end:NNNNNNNN #1#2#3#4#5#6#7#8
{
exp_not:n
{
{ #1 #2 #3 #4 #5 #6 #7 #8 }
{ #2 #3 #4 #5 #6 #7 #8 }
{ #3 #4 #5 #6 #7 #8 }
{ #4 #5 #6 #7 #8 }
{ #5 #6 #7 #8 }
{ #6 #7 #8 }
{ #7 #8 }
{ #8 }
}
}
cs_new_protected:Npn __find_sort_chunks:
{
tl_sort:Nn l__find_chunks_tl
{
int_compare:nNnTF
{
pdftex_strcmp:D
{ exp_not:n {##1} }
{ exp_not:n {##2} }
}
> c_zero
{ sort_return_swapped: }
{ sort_return_same: }
}
}
cs_new_protected:Npn __find_common:
{
int_step_inline:nnnn { 1 } { 1 } { 9 }
{ seq_clear:c { l__find_chunks_##1_seq } }
exp_after:wN __find_common_loop:nn l__find_chunks_tl
q_recursion_tail q_recursion_tail q_recursion_stop
int_step_inline:nnnn { 4 } { 1 } { 9 }
{
seq_map_inline:cn { l__find_chunks_##1_seq }
{
tl_if_exist:cTF { l__find_chunk_ ' tl_to_str:n {####1} ' _tl }
{
tl_put_right:cn
{ l__find_chunk_ ' tl_to_str:n {####1} ' _tl } { i }
}
{
cs_set_eq:cN
{ l__find_chunk_ ' tl_to_str:n {####1} ' _tl } c_empty_tl
}
}
}
}
cs_new_protected:Npn __find_common_loop:nn #1#2
{
quark_if_recursion_tail_stop:n {#2}
int_zero:N l__find_common_int
__find_count_common_aux:nn {#1} {#2}
use:c { __find_common_ int_use:N l__find_common_int :w }
#1 X X X X X X X X X q_stop
__find_common_loop:nn {#2}
}
cs_new_protected:Npn __find_count_common_aux:nn #1#2
{
tl_if_empty:nF {#1}
{
tl_if_empty:nF {#2}
{
tl_if_eq:xxT { tl_head:n {#1} } { tl_head:n {#2} }
{
int_incr:N l__find_common_int
__find_count_common_aux:xx
{ tl_tail:n {#1} } { tl_tail:n {#2} }
}
}
}
}
cs_generate_variant:Nn __find_count_common_aux:nn { xx }
cs_new_eq:cN { __find_common_0:w } use_none_delimit_by_q_stop:w
cs_new_protected:cpn { __find_common_1:w } #1
{ __find_common_auxii:nnw { 1 } {#1} }
cs_new_protected:cpn { __find_common_2:w } #1#2
{ __find_common_auxii:nnw { 2 } { #1 #2 } }
cs_new_protected:cpn { __find_common_3:w } #1#2#3
{ __find_common_auxii:nnw { 3 } { #1 #2 #3 } }
cs_new_protected:cpn { __find_common_4:w } #1#2#3#4
{ __find_common_auxii:nnw { 4 } { #1 #2 #3 #4 } }
cs_new_protected:cpn { __find_common_5:w } #1#2#3#4#5
{
dot:
__find_common_auxii:nnw { 5 } { #1 #2 #3 #4 #5 }
}
cs_new_protected:cpn { __find_common_6:w } #1#2#3#4#5#6
{ __find_common_auxii:nnw { 6 } { #1 #2 #3 #4 #5 #6 } }
cs_new_protected:cpn { __find_common_7:w } #1#2#3#4#5#6#7
{ __find_common_auxii:nnw { 7 } { #1 #2 #3 #4 #5 #6 #7 } }
cs_new_protected:cpn { __find_common_8:w } #1#2#3#4#5#6#7#8
{ __find_common_auxii:nnw { 8 } { #1 #2 #3 #4 #5 #6 #7 #8 } }
cs_new_protected:cpn { __find_common_9:w } #1#2#3#4#5#6#7#8#9
{ __find_common_auxii:nnw { 9 } { #1 #2 #3 #4 #5 #6 #7 #8 #9 } }
cs_new_protected:Npn __find_common_auxii:nnw #1#2#3 q_stop
{
int_zero:N l__find_nesting_int
tl_map_inline:nn {#2}
{
str_case_x:nnn { exp_not:n {##1} }
{
{ exp_not:c { l__find_prefix_str c_lbrace_str } }
{ int_incr:N l__find_nesting_int }
{ exp_not:c { l__find_prefix_str c_rbrace_str } }
{
int_compare:nNnF l__find_nesting_int > c_zero
{ use_none_delimit_by_q_stop:w }
int_decr:N l__find_nesting_int
}
}
{ }
}
int_compare:nNnF l__find_nesting_int = c_zero
{ use_none_delimit_by_q_stop:w }
seq_put_right:cn { l__find_chunks_#1_seq } {#2}
use_none_delimit_by_q_stop:w
q_stop
}
cs_new_protected_nopar:Npn __find_best_macros:
{
tl_clear:N l__find_macros_tl
tl_clear:N l__find_result_tl
__find_best_macros_aux:
tl_put_left:NV l__find_result_tl l__find_macros_tl
}
cs_new_protected:Npn __find_best_macros_aux:
{
exp_after:wN __find_best_macros_auxii:NNNNNNNNN l__find_tl
q_nil q_nil q_nil q_nil q_nil q_nil q_nil q_nil q_nil q_stop
tl_if_empty:NF l__find_tl { __find_best_macros_aux: }
}
cs_new_protected:Npn __find_best_macros_auxii:NNNNNNNNN #1#2#3#4#5#6#7#8#9
{
int_zero:N l__find_best_score_int
tl_clear:N l__find_best_chunk_tl
tl_map_inline:nn
{
{#1} {#1#2} {#1#2#3} {#1#2#3#4} {#1#2#3#4#5} {#1#2#3#4#5#6}
{#1#2#3#4#5#6#7} {#1#2#3#4#5#6#7#8} {#1#2#3#4#5#6#7#8#9}
}
{
int_compare:nNnT
{ __find_score:n {##1} } > l__find_best_score_int
{
tl_set:Nn l__find_best_chunk_tl {##1}
int_set:Nn l__find_best_score_int
{ __find_score:n {##1} }
}
}
tl_if_empty:NF l__find_best_chunk_tl
{
int_incr:N l__find_macro_int
tl_put_right:Nx l__find_macros_tl
{
exp_not:c
{ l__find_prefix_str int_to_Alph:n { l__find_macro_int } }
{ exp_not:V l__find_best_chunk_tl }
}
tl_replace_all:NVc l__find_tl l__find_best_chunk_tl
{ l__find_prefix_str int_to_Alph:n { l__find_macro_int } }
dot:
}
tl_put_right:Nx l__find_result_tl { tl_head:N l__find_tl }
tl_set:Nx l__find_tl { tl_tail:N l__find_tl }
use_none_delimit_by_q_stop:w
}
cs_new:Npn __find_score:n #1
{
% Turning ####1 into a length (p+2) macro
% (e.g. |<prefix>A|) saves this number of chars.
% Good if non-negative.
cs_if_exist:cTF { l__find_chunk_ ' tl_to_str:n {#1} ' _tl }
{
int_eval:n
{
tl_count:c { l__find_chunk_ ' tl_to_str:n {#1} ' _tl }
* ( str_count:n {#1} - 3 - l__find_prefix_int )
- 2 * l__find_prefix_int - 9
}
}
{ -1 }
}
cs_new_protected:Npn __find_undefine_tmp:
{
int_step_inline:nnnn { 4 } { 1 } { 9 }
{
seq_map_inline:cn { l__find_chunks_##1_seq }
{ cs_undefine:c { l__find_chunk_ ' tl_to_str:n {####1} ' _tl } }
}
}
cs_new_protected:Npn __find_unescape_tl:nN #1#2
{
regex_replace_all:nnN { c{#1{} } { cB{ cE? } #2
dot:
regex_replace_all:nnN { c{#1}} } { cB? cE} } #2
dot:
regex_replace_all:nnN { c[BE]? } { } #2
dot:
regex_replace_all:nnN { c{#1#} } { cP# } #2
dot:
}
%%%% Hack LaTeX3's \iow_wrap to allow zero-width breaking spaces
\tl_const:Nx \c__iow_wrap_zwbs_marker_tl
  {
    \c_catcode_other_space_tl
    \c__iow_wrap_marker_tl
    \c_catcode_other_space_tl
    zwbs
    \c_catcode_other_space_tl
  }
\cs_new_protected_nopar:Npn \__iow_wrap_zwbs:
  {
    \tl_put_right:Nn \l__iow_current_line_tl
      { \tex_romannumeral:D -`0 }
    \int_decr:N \l__iow_current_line_int
  }
%%% End of hack.
\msg_new:nnn { find } { invalid-prefix }
  { The~prefix~used,~'#1',~must~be~made~of~letters. }
\cs_set_protected:Npn \FindMatches #1#2
  {
    \find_matches:nnN {#1} {#2} \l_tmpa_tl
    \tl_use:N \l_tmpa_tl
  }
\ExplSyntaxOff
% Set up some basic things to behave a bit like plain TeX.
%
\documentclass{article}
\begin{document}
\setlength{\parindent}{0pt}
\def\bye{\end{document}}
%
% End of setup.
\FindMatches{}{% Let's go!
On the first day of Christmas my true love gave to me\par
a partridge in a pear tree.
\bigskip
On the second day of Christmas my true love gave to me\par
two turtle doves\par
and a partridge in a pear tree.
\bigskip
On the third day of Christmas my true love gave to me\par
three french hens\par
two turtle doves\par
and a partridge in a pear tree.
\bigskip
On the fourth day of Christmas my true love gave to me\par
four calling birds\par
three french hens\par
two turtle doves\par
and a partridge in a pear tree.
\bigskip
On the fifth day of Christmas my true love gave to me\par
five gold rings\par
four calling birds\par
three french hens\par
two turtle doves\par
and a partridge in a pear tree.
\bigskip
On the sixth day of Christmas my true love gave to me\par
six geese a laying\par
five gold rings\par
four calling birds\par
three french hens\par
two turtle doves\par
and a partridge in a pear tree.
\bigskip
On the seventh day of Christmas my true love gave to me\par
seven swans a swimming\par
six geese a laying\par
five gold rings\par
four calling birds\par
three french hens\par
two turtle doves\par
and a partridge in a pear tree.
\bigskip
On the eighth day of Christmas my true love gave to me\par
eight maids a milking\par
seven swans a swimming\par
six geese a laying\par
five gold rings\par
four calling birds\par
three french hens\par
two turtle doves\par
and a partridge in a pear tree.
\bigskip
On the ninth day of Christmas my true love gave to me\par
nine ladies dancing\par
eight maids a milking\par
seven swans a swimming\par
six geese a laying\par
five gold rings\par
four calling birds\par
three french hens\par
two turtle doves\par
and a partridge in a pear tree.
\bigskip
On the tenth day of Christmas my true love gave to me\par
ten lords a leaping\par
nine ladies dancing\par
eight maids a milking\par
seven swans a swimming\par
six geese a laying\par
five gold rings\par
four calling birds\par
three french hens\par
two turtle doves\par
and a partridge in a pear tree.
\bigskip
On the eleventh day of Christmas my true love gave to me\par
eleven pipers piping\par
ten lords a leaping\par
nine ladies dancing\par
eight maids a milking\par
seven swans a swimming\par
six geese a laying\par
five gold rings\par
four calling birds\par
three french hens\par
two turtle doves\par
and a partridge in a pear tree.
\bigskip
On the twelfth day of Christmas my true love gave to me\par
twelve drummers drumming\par
eleven pipers piping\par
ten lords a leaping\par
nine ladies dancing\par
eight maids a milking\par
seven swans a swimming\par
six geese a laying\par
five gold rings\par
four calling birds\par
three french hens\par
two turtle doves\par
and a partridge in a pear tree.
\bye
}
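In case the self-defining preamble that \FindMatches emits looks mysterious: the single helper \A reads <name><replacement> pairs, defines each one, and calls itself again, and the very last pair redefines \A itself (to a harmless space), which is what stops the recursion. A minimal stand-alone sketch of that mechanism, with made-up macro names rather than the ones the code actually generates (compile with plain TeX):
\long\def\A#1#2{\def#1{#2}\A}
\A
\C{ day of Christmas my true love gave to me\par}
\D{a partridge in a pear tree.}
\A{ }% last pair: redefines \A to a space, ending the definition loop
On the first\C \D
\bye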
Awesome as always, Bruno! :)
– Paulo Cereda
May 3 '13 at 0:54
This is one of the more outrageous things I've ever seen here. (It is very impressive!)
– Ryan Reich
May 3 '13 at 0:56
Hehe! Thanks @RyanReich. By the way, do you know of any algorithm in the vein of LZW that would be able to produce macros with varying arguments?
– Bruno Le Floch
May 3 '13 at 1:10
@Bruno All I know about LZW is the idea, unfortunately. You mean, you want the "dictionary" to be macros that take an argument, or that take a variable number of arguments? Isn't this the same as having an array and then having the macro argument index into it?
– Ryan Reich
May 3 '13 at 4:15
@BrunoLeFloch no wonder we get stalled on the middle layer of L3, but at least nobody can any longer deny that expl3 is up to arbitrary programming tasks :-)
– Frank Mittelbach
May 4 '13 at 20:16
edited 28 mins ago
answered May 3 '13 at 0:48
Bruno Le Floch
What I present is a total kludge, stealing things from my titlecaps package. So the format is not optimal, but it might give a place from which to proceed. The nice thing is that punctuation is screened out of the search in the target text.
When a search string and a target text (not exceeding one paragraph) are passed to \seekphrase, it will output where each word of the search phrase appears in the target text. I output it as a pair "target-location:search-word-index". To make sense of it, I output a [ before a search-word-index=1 match and a ] after a search-word-index=n match. Thus, to qualify as a matching "phrase" in, for example, a 3-word search, the output would have to contain an instance of something like [214:1 215:2, 216:3].
Further refinement is clearly possible, but it would take more time than I have at the moment.
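To make that output format concrete, here is a mock-up; the word positions below are invented for illustration and are not taken from an actual run:
% Hypothetical output of \seekphrase{et netus}{\x}:
%   Seeking phrase ``et netus'':
%   [36:1 [40:1 41:2] [57:1 ...
% The fully bracketed ascending run [40:1 41:2] is a genuine two-word match
% ("et" at word 40 immediately followed by "netus" at word 41); an opening
% [36:1 that never closes means "et" occurred at word 36 without "netus"
% directly after it.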
\documentclass{article}
\usepackage{titlecaps}
\usepackage{lipsum}
\makeatletter
\renewcommand\seek@lcwords{%
  \kill@punct%
  \setcounter{word@count}{0}%
  \whiledo{\value{word@count} < \narg}{%
    \addtocounter{word@count}{1}%
    \protected@edef\current@word{\csname arg\roman{word@count}\endcsname}%
    \def\found@word{F}%
    \setcounter{lcword@index}{0}%
    \expandafter\def\csname%
      found@word\roman{word@count}\endcsname{F}%
    \whiledo{\value{lcword@index} < \value{lc@words}}{%
      \addtocounter{lcword@index}{1}%
      \protected@edef\current@lcword{%
        \csname lcword\roman{lcword@index}\endcsname}%
      %% THE FOLLOWING THREE LINES ARE FROM DAVID CARLISLE
      \protected@edef\tmp{\noexpand\scantokens{\def\noexpand\tmp%
        {\noexpand\ifthenelse{\noexpand\equal{\current@word}{\current@lcword}}}}}%
      \tmp\ifhmode\unskip\fi\tmp
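      % (Added note, not from the original answer:) the three lines above use
      % David Carlisle's trick: \protected@edef expands the two harvested
      % words, \scantokens retokenizes the comparison so \equal can test them
      % as plain text, the first \tmp performs that definition,
      % \ifhmode\unskip\fi drops the stray space \scantokens leaves behind,
      % and the second \tmp delivers the \ifthenelse test whose true/false
      % branches follow.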
      %%
      {\expandafter\def\csname%
        found@word\roman{word@count}\endcsname{T}%
        \ifthenelse{\equal{\value{lcword@index}}{1}}{[}{}%
        \arabic{word@count}:\arabic{lcword@index}%
        \ifthenelse{\equal{\value{lcword@index}}{\value{lc@words}}}{]}{ }%
        \setcounter{lcword@index}{\value{lc@words}}}%
      {}%
    }%
  }%
  \restore@punct%
}
\let\getargsC\get@argsC
\newcommand\seekphrase[2]{%
  Seeking phrase ``#1'':\\%
  \Addlcwords{#1}%
  \redefine@tertius%
  \getargsC{#2}%
  \seek@lcwords%
  \Resetlcwords%
  \par%
}
\makeatother
\begin{document}
\lipsum[1]
\lipsum[1]
\def\x{%
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Ut
purus elit, vestibulum ut, placerat ac, adipiscing vitae, felis.
Curabitur dictum gravida mauris. Nam arcu libero, nonummy eget,
consectetuer id, vulputate a, magna. Donec vehicula augue eu neque.
Pellentesque habitant morbi tristique senectus et netus et malesuada
fames ac turpis egestas. Mauris ut leo. Cras viverra metus rhoncus sem.
Nulla et lectus vestibulum urna fringilla ultrices. Phasellus eu tellus
sit amet tortor gravida placerat. Integer sapien est, iaculis in,
pretium quis, viverra ac, nunc. Praesent eget sem vel leo ultrices
bibendum. Aenean faucibus. Morbi dolor nulla, malesuada eu, pulvinar
at, mollis ac, nulla. Curabitur auctor semper nulla. Donec varius orci
eget risus. Duis nibh mi, congue eu, accumsan eleifend, sagittis quis,
diam. Duis eget orci sit amet orci dignissim rutrum.
%
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Ut
purus elit, vestibulum ut, placerat ac, adipiscing vitae, felis.
Curabitur dictum gravida mauris. Nam arcu libero, nonummy eget,
consectetuer id, vulputate a, magna. Donec vehicula augue eu neque.
Pellentesque habitant morbi tristique senectus et netus et malesuada
fames ac turpis egestas. Mauris ut leo. Cras viverra metus rhoncus sem.
Nulla et lectus vestibulum urna fringilla ultrices. Phasellus eu tellus
sit amet tortor gravida placerat. Integer sapien est, iaculis in,
pretium quis, viverra ac, nunc. Praesent eget sem vel leo ultrices
bibendum. Aenean faucibus. Morbi dolor nulla, malesuada eu, pulvinar
at, mollis ac, nulla. Curabitur auctor semper nulla. Donec varius orci
eget risus. Duis nibh mi, congue eu, accumsan eleifend, sagittis quis,
diam. Duis eget orci sit amet orci dignissim rutrum.
}
\seekphrase{et}{\x}
\seekphrase{et netus}{\x}
\seekphrase{bibendum Aenean}{\x}
\seekphrase{eget sem vel}{\x}
\end{document}
edited May 2 '13 at 20:21
answered May 2 '13 at 20:10
Steven B. Segletes