​html2text​​是一个Python模块,用来把 HTML格式转换为文本( Markdown )格式。


# 安装html2text


$ pip install html2text


# 使用html2text


import html2text

html = '''
<html>
<body>
<h1>Title</h1>
<p>Hello World</p>
<ul>
<li>Here's one thing</li>
<li>And here's another!</li>
</ul>
</body>
</html>
'''

markdown = html2text.html2text(html)

print(markdown)

输出结果:




它还提供一个命令行接口:


$ html2text -h
Usage: html2text [(filename|url) [encoding]]

Options:
--version show program's version number and exit
-h, --help show this help message and exit
--pad-tables pad the cells to equal column width in tables
--no-wrap-links wrap links during conversion
--ignore-emphasis don't include any formatting for emphasis
--reference-links use reference style links instead of inline links
--ignore-links don't include any formatting for links
--protect-links protect links from line breaks surrounding them with
angle brackets
--ignore-images don't include any formatting for images
--images-to-alt Discard image data, only keep alt text
--images-with-size Write image tags with height and width attrs as raw
html to retain dimensions
-g, --google-doc convert an html-exported Google Document
-d, --dash-unordered-list
use a dash rather than a star for unordered list items
-e, --asterisk-emphasis
use an asterisk rather than an underscore for
emphasized text
-b BODY_WIDTH, --body-width=BODY_WIDTH
number of characters per output line, 0 for no wrap
-i LIST_INDENT, --google-list-indent=LIST_INDENT
number of pixels Google indents nested lists
-s, --hide-strikethrough
hide strike-through text. only relevant when -g is
specified as well
--escape-all Escape all special characters. Output is less
readable, but avoids corner case formatting issues.
--bypass-tables Format tables in HTML rather than Markdown syntax.
--single-line-break Use a single line break after a block element rather
than two line breaks. NOTE: Requires --body-width=0
--unicode-snob Use unicode throughout document
--no-automatic-links Do not use automatic links wherever applicable
--no-skip-internal-links
Do not skip internal links
--links-after-para Put links after each paragraph instead of document
--mark-code Mark program code blocks with [code]...[/code]
--decode-errors=DECODE_ERRORS
What to do in case of decode errors.'ignore', 'strict'
and 'replace' are acceptable values

其它工具:



还有一个node.js的在线工具:


​http://island205.github.io/h2m/​