Browser-like Reader Mode with text-only output
Background: Reader Mode, as seen in Safari and other browsers, extracts the main content of article-based web pages using sophisticated heuristics and displays it in a very readable font.
All navigation, headers, footers, and other fluff are removed. The mode only works with "articles", i.e. pages where there is a "main content" like a news article, scientific paper, etc.
The question: Is there an open source implementation of this for terminals (i.e. text-only)? Or alternatively, another way to accomplish the same thing?
Example: This article from The New York Times should output like so:
$ utility --reader-mode https://www.nytimes.com/2019/01/30/reader-center/polar-vortex-tips.html
SEND US YOUR IDEAS FOR WHAT TO DO DURING THE POLAR VORTEX. WE
WANT TO HEAR FROM YOU.
It’s so cold in much of the Midwest today that you could get
frostbite within five minutes once you step outside. If you’re
living through it indoors, give us your tips.
A commuter during an extremely light morning rush hour in Chicago
on Wednesday. Businesses and schools have closed as the city
copes with record low temperatures.
Across the Midwest, where wind chills were minus 51 in
Minneapolis and minus 45 in Chicago, the risks of going outside
on Wednesday were dire. So, many people simply didn’t bother,
while others took a chance to briefly experience the coldest
weather in a generation.
Whether you’re an adventurer or a hibernator, tell us your
recommendations for staying warm and busy. What are you cooking
or binge-watching? What board games are you playing? If you’re
venturing outside, what are you doing to stay safe? (Experts warn
that even a short time in the extreme cold can be very
dangerous.) How many layers of clothing are you wearing, and
which special hats and gloves are necessary? Send us your photos
and your stories.
Tags: terminal, browser
asked 20 hours ago by forthrin, edited 12 mins ago
determining the "main content" seems to me to be a tricky problem to solve
– Jeff Schaller
15 hours ago
Yes. This is "solved" on a best-effort basis by various implementations of "Reader Mode" using heuristics. So it would have to be a text-only port of one of those, or something similar. google.com/search?q=reader+mode+source+code
– forthrin
15 hours ago
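To make the "heuristics" mentioned in the comments concrete, here is a minimal sketch of one such heuristic, assuming reasonably well-formed HTML with explicit closing tags. It is not how Safari or any other browser implements Reader Mode; it simply credits each paragraph to its nearest enclosing container and keeps the container with the most paragraph text. The names ParagraphScorer and extract_main_text, the tag lists, and the sample page are all made up for illustration.

```python
from html.parser import HTMLParser

# Assumed tag lists for this sketch; real reader modes use far richer rules.
CONTAINERS = {"article", "main", "section", "div"}
SKIPPED = {"script", "style", "nav", "header", "footer", "aside"}


class ParagraphScorer(HTMLParser):
    """Collect <p> text per enclosing container and remember the richest one."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []       # one list of paragraphs per open container
        self.skip_depth = 0   # > 0 while inside nav/header/footer/script/...
        self.in_p = False
        self.current = []     # text pieces of the paragraph being read
        self.best = []        # paragraphs of the best container seen so far

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1
        elif tag in CONTAINERS:
            self.stack.append([])
        elif tag == "p" and not self.skip_depth:
            self.in_p, self.current = True, []

    def handle_endtag(self, tag):
        if tag in SKIPPED and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "p" and self.in_p:
            self.in_p = False
            text = " ".join("".join(self.current).split())
            if text and self.stack:
                self.stack[-1].append(text)
        elif tag in CONTAINERS and self.stack:
            paragraphs = self.stack.pop()
            # Score = total characters of paragraph text directly inside.
            if sum(map(len, paragraphs)) > sum(map(len, self.best)):
                self.best = paragraphs

    def handle_data(self, data):
        if self.in_p and not self.skip_depth:
            self.current.append(data)


def extract_main_text(html):
    """Return the paragraphs of the container holding the most <p> text."""
    scorer = ParagraphScorer()
    scorer.feed(html)
    scorer.close()
    return scorer.best


SAMPLE = """
<html><body>
<nav><p>Home | World | Sports | Sign in</p></nav>
<div id="sidebar"><p>Most popular</p></div>
<article>
<p>It is so cold in much of the Midwest today.</p>
<p>Send us your photos and your stories.</p>
</article>
<footer><p>Privacy - Terms</p></footer>
</body></html>
"""
print("\n\n".join(extract_main_text(SAMPLE)))
```

On the made-up sample, only the two paragraphs inside the article element survive. Real pages are much messier (unclosed paragraphs, deeply nested wrappers, headlines inside a header element), which is why this remains a best-effort problem.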
2 Answers
The comment about "navigation content" is addressed by the -nolist option, e.g.,
lynx -nolist -dump www.google.com > file.txt
which shows no links, etc:
$ lynx -nolist -dump www.google.com > file.txt
$ cat file.txt
Search Images Maps Play YouTube News Gmail Drive More »
Web History | Settings | Sign in
Google
_______________________________________________________
Google Search I'm Feeling Lucky Advanced search
Language tools
Advertising Programs Business Solutions +Google About
Google
© 2019 - Privacy - Terms
w3m gives something similar, without the option:
$ w3m -dump https://www.google.com
Search Images Maps Play YouTube News Gmail Drive More >>
Web History | Settings | Sign in
Google
[ ] Advanced
searchLanguage
[Google Search][I'm Feeling Lucky] tools
Advertising ProgramsBusiness Solutions+GoogleAbout Google
(C) 2019 - Privacy - Terms
links2 output looks much like w3m's (note the missing space before About):
$ links2 -dump www.google.com
Search Images Maps Play YouTube News Gmail Drive More >>========(97,1) 31% ==
Web History | Settings | Sign in
Google
__________________________________________________________ Advanced
[ Google Search ] [ I'm Feeling Lucky ] searchLanguage
tools
Advertising ProgramsBusiness Solutions+GoogleAbout Google
(c) 2019 - Privacy - Terms
$ links2 -dump www.google.com >file.txt
$ cat file.txt
Search Images Maps Play YouTube News Gmail Drive More >>
Web History | Settings | Sign in
Google
__________________________________________________________ Advanced
[ Google Search ] [ I'm Feeling Lucky ] searchLanguage
tools
Advertising ProgramsBusiness Solutions+GoogleAbout Google
(c) 2019 - Privacy - Terms
(oddly enough, it also prints progress if the dump goes directly to the terminal—not a good feature)
and elinks apparently only dumps the format with "navigation content" (ymmv).
From further comments, it turns out that OP is interested in something which could render the contents of a given division on the page. Comparing the sizes of the source and dump for that page gives some clues:
Size Buffer name Contents
------- -------------------- ----------------------------------------------------------------------------------------
0# 267624 [!lynx -source ht-1] !lynx -source https://www.nytimes.com/2019/01/30/reader-center/polar-vortex-tips.html
1 5475 [!lynx -dump -nolis] !lynx -dump -nolist https://www.nytimes.com/2019/01/30/reader-center/polar-vortex-tips.html
shows that the dump is about 2% of the size of the source. Most of the page is non-informational, and the text-browsers show the information. But the division requested is in a two-line chunk that looks like this (only the beginning: the first line actually has 62265 characters):
<div id="app"><div class="css-v89234 e3w10z60"><div><div><div class="css-13lpfd6 e1nre7570"><header class="css-1bymuyk e1>
<script>window.__preloadedData = {"initialState":{"Article:QXJ0aWNsZTpueXQ6Ly9hcnRpY2xlLzBhODc0MTcxLWM0MjEtNWRjOS1hN2IzLW>
The first line holds the article text (plus a lot of markup), and offhand, looking at the second line, that's probably the script which the GUI browsers detect to show the article. None of the above-mentioned text-browsers has a feature for just showing a given <div>...</div>, or interpreting a script in that manner. These articles mention the absence of a standard URI for reader mode in several GUI browsers:
- Web Reading Mode: The non-standard rendering mode
- Web Reading Mode: A bad reading experience
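Since none of the text browsers can show just one division, that step has to happen outside them. The following is only a sketch of the idea using Python's standard html.parser, not a feature of lynx, w3m, links2 or elinks. The id "app" and the class name are taken from the page source quoted above; the DivText class, the div_text function and the rest of the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class DivText(HTMLParser):
    """Collect the text found inside <div id="..."> for one given id."""

    def __init__(self, target_id):
        super().__init__(convert_charrefs=True)
        self.target_id = target_id
        self.div_depth = 0      # 0 means we are outside the target <div>
        self.in_script = False  # True while inside <script> or <style>
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.div_depth:
                self.div_depth += 1          # nested <div> inside the target
            elif dict(attrs).get("id") == self.target_id:
                self.div_depth = 1           # entering the target <div>
        elif tag in ("script", "style"):
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1
        elif tag in ("script", "style"):
            self.in_script = False
        elif tag in ("p", "h1", "h2", "h3") and self.div_depth:
            self.pieces.append("\n\n")       # paragraph break after a block

    def handle_data(self, data):
        if self.div_depth and not self.in_script:
            self.pieces.append(data)


def div_text(html, target_id):
    """Return the text blocks found inside the <div> with the given id."""
    parser = DivText(target_id)
    parser.feed(html)
    parser.close()
    blocks = "".join(parser.pieces).split("\n\n")
    return [" ".join(block.split()) for block in blocks if block.strip()]


PAGE = """
<div id="masthead"><a href="/">Home</a></div>
<div id="app"><div class="css-v89234"><h1>Polar vortex tips</h1>
<p>Give us your <b>tips</b>.</p></div></div>
<script>window.__preloadedData = {};</script>
"""
print(div_text(PAGE, "app"))
```

The caller still has to know which id to ask for, so this addresses "render a given division" rather than the harder problem of deciding which division holds the article.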
answered 10 hours ago by Thomas Dickey, edited 8 hours ago
Thanks for sharing! I've updated the question a bit to point out that the target for such use is article pages, where the objective is to extract the article itself. Have a look at the posting again and see if you can help.
– forthrin
9 hours ago
Does this satisfy your requirement? (From https://stackoverflow.com/questions/12422289/bash-command-to-convert-html-page-to-a-text-file )
lynx --dump www.google.com > file.txt
answered 18 hours ago by VBB (new contributor)
Nope. This dumps a ton of navigation links, e.g. lynx -dump https://www.nytimes.com/2019/01/30/reader-center/polar-vortex-tips.html
A solution should strip away all navigation and fluff, and leave ONLY the MAIN CONTENT, like Reader Mode in a browser does.
– forthrin
15 hours ago
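Once the main content has been isolated by whatever means, producing the layout of the example in the question is the easy part. A minimal sketch, assuming the headline and paragraphs are already available as plain strings: the headline is printed in capitals and everything is wrapped at 65 columns, a width guessed from the line lengths in the example. The function name render_article and the mixed-case input headline are assumptions for illustration.

```python
import textwrap


def render_article(title, paragraphs, width=65):
    """Return the title in capitals plus the paragraphs, each wrapped to `width`."""
    blocks = [textwrap.fill(title.upper(), width)]
    # Collapse stray whitespace before wrapping each paragraph.
    blocks.extend(textwrap.fill(" ".join(p.split()), width) for p in paragraphs)
    return "\n\n".join(blocks)


text = render_article(
    "Send us your ideas for what to do during the polar vortex. "
    "We want to hear from you.",
    [
        "It’s so cold in much of the Midwest today that you could get "
        "frostbite within five minutes once you step outside. If you’re "
        "living through it indoors, give us your tips.",
    ],
)
print(text)
```

The blank line between blocks is a choice made here for readability; the example in the question does not show whether paragraphs should be separated that way.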