How to split a single file into multiple files based on a column in Linux?

I have a text file with the following information:

Hugo_Symbol Tumor_Sample_Barcode    Entrez_Gene_Id  Center  NCBI_Build

MTHFR   TCGA-BD-A2L6-01A-11D-A20W-10    4524    BCM GRCh38

SLC30A1 TCGA-BD-A2L6-01A-11D-A20W-10    7779    BCM GRCh38

USH2A   TCGA-BD-A2L6-01A-11D-A20W-10    7399    BCM GRCh38

SOS1    TCGA-BD-A2L6-01A-11D-A20W-10    6654    BCM GRCh38

TMEM51  TCGA-O8-A75V-01A-11D-A32G-10    55092   BCM GRCh38

FLG TCGA-O8-A75V-01A-11D-A32G-10    2312    BCM GRCh38

FLG TCGA-O8-A75V-01A-11D-A32G-10    2312    BCM GRCh38

PRDM16  TCGA-G3-A7M5-01A-11D-A33Q-10    63976   BCM GRCh38

DNAJC11 TCGA-G3-A7M5-01A-11D-A33Q-10    55735   BCM GRCh38

HNRNPCL2    TCGA-G3-A7M5-01A-11D-A33Q-10    440563  BCM GRCh38

C1orf94 TCGA-G3-A7M5-01A-11D-A33Q-10    84970   BCM GRCh38

NFYC    TCGA-G3-A7M5-01A-11D-A33Q-10    4802    BCM GRCh38

IPP TCGA-G3-A7M5-01A-11D-A33Q-10    3652    BCM GRCh38

As you can see, there are multiple samples. I want to split the file into multiple files based on the column "Tumor_Sample_Barcode". Each output file should be named after its sample, as samplename.txt.

First output - TCGA-BD-A2L6-01A-11D-A20W-10.txt

Hugo_Symbol Tumor_Sample_Barcode    Entrez_Gene_Id  Center  NCBI_Build

MTHFR   TCGA-BD-A2L6-01A-11D-A20W-10    4524    BCM GRCh38

SLC30A1 TCGA-BD-A2L6-01A-11D-A20W-10    7779    BCM GRCh38

USH2A   TCGA-BD-A2L6-01A-11D-A20W-10    7399    BCM GRCh38

SOS1    TCGA-BD-A2L6-01A-11D-A20W-10    6654    BCM GRCh38

Second output - TCGA-O8-A75V-01A-11D-A32G-10.txt

Hugo_Symbol Tumor_Sample_Barcode    Entrez_Gene_Id  Center  NCBI_Build

TMEM51  TCGA-O8-A75V-01A-11D-A32G-10    55092   BCM GRCh38

FLG TCGA-O8-A75V-01A-11D-A32G-10    2312    BCM GRCh38

FLG TCGA-O8-A75V-01A-11D-A32G-10    2312    BCM GRCh38

Third output - TCGA-G3-A7M5-01A-11D-A33Q-10.txt

Hugo_Symbol Tumor_Sample_Barcode    Entrez_Gene_Id  Center  NCBI_Build

PRDM16  TCGA-G3-A7M5-01A-11D-A33Q-10    63976   BCM GRCh38

DNAJC11 TCGA-G3-A7M5-01A-11D-A33Q-10    55735   BCM GRCh38

HNRNPCL2    TCGA-G3-A7M5-01A-11D-A33Q-10    440563  BCM GRCh38

C1orf94 TCGA-G3-A7M5-01A-11D-A33Q-10    84970   BCM GRCh38

NFYC    TCGA-G3-A7M5-01A-11D-A33Q-10    4802    BCM GRCh38

IPP TCGA-G3-A7M5-01A-11D-A33Q-10    3652    BCM GRCh38

How can I do this in Linux?

edited Jan 31 '18 at 12:56

asked Jan 31 '18 at 12:41

user3351523


Thank you for the reply. But I don't see any headers in the output files. How do I get the column names in the outputs as well?

– user3351523
Jan 31 '18 at 12:51


@user3351523, the "headers" can come next. First, please post a testable input (as text, not as an image).

– RomanPerekhrest
Jan 31 '18 at 12:53

Yes, sorry for that. I have now posted the test table as text. How do I get the headers in the output files?

– user3351523
Jan 31 '18 at 12:57

linux files split


1 Answer

Awk solution:

awk 'NR==1{ h=$0 } NR>1{ print (!a[$2]++ ? h ORS $0 : $0) > ($2".txt") }' file

NR==1{ h=$0 } - capture the 1st line/record as the header line (NR holds the current record number, $0 contains the current line)

NR>1 - for all records except the first one:
- <cond> ? <operand_1> : <operand_2> - the classic ternary operator
- !a[$2]++ - true only on the 1st occurrence of the barcode value $2, which is used as a key of the associative array a
- h ORS $0 - the header line concatenated with ORS (output record separator, defaults to \n) and the current record $0
- print ... > ($2".txt") - print either the header plus the current line (1st occurrence) or just the current line into the file <barcode_value>.txt; the parentheses around the file name keep the concatenation unambiguous across awk implementations

Or a more self-explanatory version:

awk 'NR==1 {header = $0; next}
     !header_printed[$2]++ {print header > ($2".txt")}
     {print > ($2".txt")}' < file
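The commands above keep one output file open per barcode for the whole run. With a very large number of distinct barcodes, some awk implementations may run into a limit on simultaneously open files. The following is only a sketch of one way around that, not part of the original answer: it appends to each file and closes it again after every write. The names split_demo, samples.txt and the variable outdir are made up for the demo, and the input is assumed to be whitespace-separated with the barcode in column 2.

```shell
#!/bin/sh
# Demo input; the names split_demo and samples.txt are hypothetical.
mkdir -p split_demo
printf '%s\n' \
  'Hugo_Symbol Tumor_Sample_Barcode Entrez_Gene_Id Center NCBI_Build' \
  'MTHFR TCGA-BD-A2L6-01A-11D-A20W-10 4524 BCM GRCh38' \
  'TMEM51 TCGA-O8-A75V-01A-11D-A32G-10 55092 BCM GRCh38' \
  'SOS1 TCGA-BD-A2L6-01A-11D-A20W-10 6654 BCM GRCh38' > split_demo/samples.txt

# outdir is an assumed helper variable that keeps the outputs in the demo directory.
awk -v outdir='split_demo' '
NR == 1 { header = $0; next }      # keep the header, skip to the data rows
{
    out = outdir "/" $2 ".txt"     # one output file per barcode
    if (!seen[$2]++) {             # first row for this barcode:
        print header > out         # ">" starts the file with the header
        close(out)
    }
    print >> out                   # ">>" appends, since the file was closed
    close(out)                     # never hold more than one file open
}' split_demo/samples.txt
```

Closing after every line is slow on large inputs; it trades speed for never exceeding the open-file limit. If the input is sorted by barcode, closing only when the barcode changes would likely be enough.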

Viewing results:

$ head TCGA*.txt

==> TCGA-BD-A2L6-01A-11D-A20W-10.txt <==

Hugo_Symbol Tumor_Sample_Barcode    Entrez_Gene_Id  Center  NCBI_Build

MTHFR   TCGA-BD-A2L6-01A-11D-A20W-10    4524    BCM GRCh38

SLC30A1 TCGA-BD-A2L6-01A-11D-A20W-10    7779    BCM GRCh38

USH2A   TCGA-BD-A2L6-01A-11D-A20W-10    7399    BCM GRCh38

SOS1    TCGA-BD-A2L6-01A-11D-A20W-10    6654    BCM GRCh38



==> TCGA-G3-A7M5-01A-11D-A33Q-10.txt <==

Hugo_Symbol Tumor_Sample_Barcode    Entrez_Gene_Id  Center  NCBI_Build

PRDM16  TCGA-G3-A7M5-01A-11D-A33Q-10    63976   BCM GRCh38

DNAJC11 TCGA-G3-A7M5-01A-11D-A33Q-10    55735   BCM GRCh38

HNRNPCL2    TCGA-G3-A7M5-01A-11D-A33Q-10    440563  BCM GRCh38

C1orf94 TCGA-G3-A7M5-01A-11D-A33Q-10    84970   BCM GRCh38

NFYC    TCGA-G3-A7M5-01A-11D-A33Q-10    4802    BCM GRCh38

IPP TCGA-G3-A7M5-01A-11D-A33Q-10    3652    BCM GRCh38



==> TCGA-O8-A75V-01A-11D-A32G-10.txt <==

Hugo_Symbol Tumor_Sample_Barcode    Entrez_Gene_Id  Center  NCBI_Build

TMEM51  TCGA-O8-A75V-01A-11D-A32G-10    55092   BCM GRCh38

FLG TCGA-O8-A75V-01A-11D-A32G-10    2312    BCM GRCh38

FLG TCGA-O8-A75V-01A-11D-A32G-10    2312    BCM GRCh38

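To name each file after the first 15 characters of the barcode value (the array is keyed on the shortened file name, so the header is written once per output file even if several barcodes share the same 15-character prefix):

awk 'NR==1{ h=$0; next }{ f=substr($2, 1, 15)".txt"; print (!a[f]++ ? h ORS $0 : $0) > f }' file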
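The commands above address the barcode by position ($2). If the column order is not guaranteed, the column can be looked up by its name in the header instead. This is a rough sketch under assumptions that are not confirmed by the question: the input is taken to be tab-separated, and the names by_name_demo, samples.tsv, col and outdir are invented for the demo.

```shell
#!/bin/sh
# Demo input; by_name_demo and samples.tsv are hypothetical names,
# and tab-separated columns are an assumption.
mkdir -p by_name_demo
{
  printf 'Hugo_Symbol\tTumor_Sample_Barcode\tEntrez_Gene_Id\tCenter\tNCBI_Build\n'
  printf 'MTHFR\tTCGA-BD-A2L6-01A-11D-A20W-10\t4524\tBCM\tGRCh38\n'
  printf 'FLG\tTCGA-O8-A75V-01A-11D-A32G-10\t2312\tBCM\tGRCh38\n'
} > by_name_demo/samples.tsv

awk -F '\t' -v col='Tumor_Sample_Barcode' -v outdir='by_name_demo' '
NR == 1 {
    header = $0
    for (i = 1; i <= NF; i++)          # find the key column by its name
        if ($i == col) idx = i
    if (!idx) {                        # stop early if the name is missing
        print "column not found: " col > "/dev/stderr"
        exit 1
    }
    next
}
{
    out = outdir "/" substr($idx, 1, 15) ".txt"   # 15-character prefix as file name
    if (!seen[out]++) print header > out          # header once per output file
    print > out
}' by_name_demo/samples.tsv
```

With this demo input the run should leave TCGA-BD-A2L6-01.txt and TCGA-O8-A75V-01.txt in by_name_demo, each holding the header and one data row.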

edited Jan 31 '18 at 13:53

answered Jan 31 '18 at 13:06

RomanPerekhrest


Thank you very much! Could you please explain the command?

– user3351523
Jan 31 '18 at 13:13

Please explain the command. Could you also tell me how to name the output files using only the first 15 characters of the sample names, like TCGA-BD-A2L6-01.txt, TCGA-G3-A7M5-01.txt and TCGA-O8-A75V-01.txt?

– user3351523
Jan 31 '18 at 13:31


@user3351523, yes, see my explanation

– RomanPerekhrest
Jan 31 '18 at 13:47
