Detecting unique lines from log file

I have a large log file and would like to detect the line patterns it contains rather than look at each specific line.

For example:

/path/messages-20181116:11/15/2018 14:23:05.159|worker001|clusterm|I|userx deleted job 5018

/path/messages-20181116:11/15/2018 14:41:25.662|worker001|clusterm|I|userx deleted job 4895

/path/messages-20181116:11/15/2018 14:41:25.673|worker000|clusterm|I|userx deleted job 4890

/path/messages-20181116:11/15/2018 14:41:25.681|worker000|clusterm|I|userx deleted job 4889

11/09/2018 06:18:55.115|scheduler000|clusterm|P|PROF: job profiling(low job) of 9473507.1 

11/09/2018 06:18:55.118|scheduler000|clusterm|P|PROF: job profiling(low job) of 9473507.1                

11/09/2018 06:18:55.120|scheduler000|clusterm|P|PROF: job profiling(low job) of 9473507.1                

11/09/2018 06:18:55.140|scheduler000|clusterm|P|PROF: job dispatching took 5.005 s (10 fast)             

11/09/2018 06:18:55.143|scheduler000|clusterm|P|PROF: dispatched 1 job(s)             

11/09/2018 06:18:55.143|scheduler000|clusterm|P|PROF: dispatched 5 job(s)             

11/09/2018 06:18:55.143|scheduler000|clusterm|P|PROF: dispatched 3 job(s)             

11/09/2018 06:18:55.145|scheduler000|clusterm|P|PROF: parallel matching   14  0438 107668                 

11/09/2018 06:18:55.148|scheduler000|clusterm|P|PROF: sequential matching  9  0261   8203               

11/09/2018 06:18:55.561|scheduler000|clusterm|P|PROF(1776285440): job sorting :wc =0.006s              

11/09/2018 06:18:55.564|scheduler000|clusterm|P|PROF(1776285440): job dispatching: wc=5.005              

11/09/2018 06:18:55.561|scheduler000|clusterm|P|PROF(1776285440): job sorting : wc=0.006s

11/09/2018 06:18:55.564|scheduler000|clusterm|P|PROF(1776285440): job dispatching: wc =0.015

should become something like this:

/path/messages-*NUMBER*:*DATE* *TIME*|worker001|clusterm|I|userx deleted job *NUMBER*

*DATE* *TIME*|scheduler*NUMBER*|clusterm|P|PROF: job profiling(low job) of *NUMBER* 

*DATE* *TIME*|scheduler*NUMBER*|clusterm|P|PROF: job dispatching took *NUMBER* s (*NUMBER* fast)             

*DATE* *TIME*|scheduler*NUMBER*|clusterm|P|PROF: dispatched *NUMBER* job(s)             

*DATE* *TIME*|scheduler*NUMBER*|clusterm|P|PROF: parallel matching   *NUMBER*  *NUMBER* *NUMBER*                 

*DATE* *TIME*|scheduler*NUMBER*|clusterm|P|PROF: sequential matching  *NUMBER*  *NUMBER*   *NUMBER*               

*DATE* *TIME*|scheduler*NUMBER*|clusterm|P|PROF(*NUMBER*): job sorting :wc =*NUMBER*s              

*DATE* *TIME*|scheduler*NUMBER*|clusterm|P|PROF(*NUMBER*): job dispatching: wc=*NUMBER*

This greatly reduces the number of lines and makes reading and analyzing the log by eye easier.

Basically, I want to detect the variable words and replace them with a placeholder symbol.
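
One possible approach, as a minimal sketch rather than a definitive solution: mask the variable parts with regular expressions, then count the distinct templates that remain. The sketch is in Python; the names `RULES`, `to_template` and `count_templates` are made up for illustration, and the three patterns assume the MM/DD/YYYY dates, HH:MM:SS.mmm times and plain decimal numbers seen in the sample lines. Unlike the desired output shown above, it also masks the digits in `worker001`.

```python
import re
from collections import Counter

# Assumed rules, applied in order: dates and times must be masked before
# the generic number rule, which would otherwise split them into pieces.
RULES = [
    (re.compile(r"\d{2}/\d{2}/\d{4}"), "*DATE*"),
    (re.compile(r"\d{2}:\d{2}:\d{2}(?:\.\d+)?"), "*TIME*"),
    (re.compile(r"\d+(?:\.\d+)?"), "*NUMBER*"),
]


def to_template(line):
    """Replace the variable parts of one log line with placeholders."""
    line = line.rstrip()  # trailing blanks should not make lines differ
    for pattern, placeholder in RULES:
        line = pattern.sub(placeholder, line)
    return line


def count_templates(lines):
    """Count occurrences of each distinct template, skipping blank lines."""
    return Counter(to_template(line) for line in lines if line.strip())


# A few lines copied from the example log above.
sample = [
    "11/09/2018 06:18:55.143|scheduler000|clusterm|P|PROF: dispatched 1 job(s)             ",
    "11/09/2018 06:18:55.143|scheduler000|clusterm|P|PROF: dispatched 5 job(s)",
    "11/09/2018 06:18:55.140|scheduler000|clusterm|P|PROF: job dispatching took 5.005 s (10 fast)",
    "/path/messages-20181116:11/15/2018 14:41:25.673|worker000|clusterm|I|userx deleted job 4890",
]

for template, count in count_templates(sample).items():
    print(count, template)
```

Here the two "dispatched" lines collapse into a single template with a count of 2. For a real log, something like `count_templates(open(path))` with a hypothetical `path` should work the same way. Lines that differ only in internal spacing (such as `:wc =` versus `: wc=`) still count as separate templates, so an extra whitespace rule may be needed.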

asked 2 days ago

user772266

New contributor

What steps have you tried to take on your own? What were the results? Please include this in your question.
– Panki
2 days ago

Have you looked at cut and uniq?
– ctrl-alt-delor
2 days ago

uniq only works on exactly matching lines, so it will not work for two otherwise identical lines with different timestamps. With cut, you need to read the whole log file first, and you still don't know the patterns.
– user772266
2 days ago

I did use sort -u | uniq, but it still prints equivalent lines: if two lines differ only in their timestamp, both are printed.
– user772266
2 days ago

I don't understand the transformations you're expecting. Do you want the dates and times replaced by *DATE* *TIME* or are those placeholders for real values of some sort? What makes a line non-unique?
– Jeff Schaller
2 days ago

command-line logs wildcards text
