Reference http://www.crummy.com/software/BeautifulSoup/bs4/doc/
soup = BeautifulSoup(markup{, parser})
Parsers
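- html.parser ("html.parser"): Python's built-in parser, no extra install. If no parser is given, Beautiful Soup picks the best one installed (lxml, then html5lib, then html.parser)
- lxml ("lxml", "xml"): very fast (pip install lxml)
- html5lib ("html5lib"): HTML5-compliant but slow (pip install html5lib)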
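A minimal sketch of building a soup with an explicit parser. The markup string is made up for illustration; "html.parser" is used because it needs no extra install.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, for illustration only
markup = "<p>Hello <b>world</b></p>"

# Naming the parser explicitly keeps results the same across machines
soup = BeautifulSoup(markup, "html.parser")
first_bold = soup.b.string
```

Swapping "html.parser" for "lxml" or "html5lib" should work the same way once those packages are installed, although each parser may repair broken markup differently.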
Types
- Tag (includes Tag and BeautifulSoup)
  - tag = soup.b
  - tag.name (can assign too)
  - tag['class'] / tag.attrs (can assign; multi-valued attributes such as class return a list, tag.attrs is a dict)
- NavigableString (includes NavigableString, Comment, CData, ProcessingInstruction, Declaration, and Doctype)
  - tag.string
  - (Should convert to a plain string before use: unicode() in Python 2, str() in Python 3)
  - tag.string.replace_with("new stuff")
  - Many properties are similar to Tag
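The two types can be sketched as follows, assuming Python 3 (so str() takes the place of unicode()). The markup is hypothetical.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

soup = BeautifulSoup(
    '<b class="boldest strikeout">bold</b><!--note-->', "html.parser"
)

tag = soup.b
name = tag.name              # tag name as a string
classes = tag["class"]       # class is multi-valued, so this is a list
attrs = tag.attrs            # dict of all attributes

text = str(tag.string)       # plain Python string copy of the NavigableString
comment = soup.contents[1]   # the comment node that follows <b>

# Replace the string inside the tag in place
tag.string.replace_with("new stuff")
```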
Navigate
Going Down
- soup.head / soup.body.b (only the first one is returned; use soup.find_all('a') to get all)
- tag.contents (list) / tag.children (generator) / tag.descendants (generator)
- tag.string:
  - One child → return the string of the child
  - Multiple children → return None !!!
- tag.strings (generator) / tag.stripped_strings (generator; whitespace strings removed)
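A short sketch of going down the tree, including the tag.string pitfall. The markup is made up for illustration.

```python
from bs4 import BeautifulSoup

html = "<div><p>One</p><p>Two <b>bold</b></p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

first_p = div.p                      # attribute access: first match only
all_p = div.find_all("p")            # every <p> under div

n_children = len(div.contents)       # direct children only
n_descendants = len(list(div.descendants))  # tags and strings, all levels

single = all_p[0].string             # one child: its string
multi = all_p[1].string              # "Two " plus <b>: None

texts = list(div.stripped_strings)   # whitespace trimmed, empties skipped
```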
Going Up
- tag.parent (can be used on NavigableString too)
- tag.parents (generator)
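Going up can be sketched like this, on a hypothetical document. The top-level BeautifulSoup object reports its name as "[document]".

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><body><p><b>text</b></p></body></html>", "html.parser"
)

b = soup.b
string_parent = b.string.parent          # strings have parents too
ancestors = [p.name for p in b.parents]  # innermost first
```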
Going Sideways
- tag.next_sibling / tag.previous_sibling
- tag.next_siblings (generator) / tag.previous_siblings (generator)
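A sketch of sideways navigation on made-up markup. The second soup shows a common surprise: whitespace between tags is itself a sibling string.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><a>1</a><b>2</b><i>3</i></p>", "html.parser")

b = soup.b
after_b = b.next_sibling.name            # tag right after <b>
before_b = b.previous_sibling.name       # tag right before <b>
no_sibling = soup.a.previous_sibling     # first child has none
rest = [s.name for s in soup.a.next_siblings]

# With a newline between tags, the next sibling is that newline string
spaced = BeautifulSoup("<p><a>1</a>\n<b>2</b></p>", "html.parser")
whitespace_sibling = spaced.a.next_sibling
```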
Going in Parse Order
- tag.next_element / tag.previous_element
- tag.next_elements (generator) / tag.previous_elements (generator)
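Parse order differs from sibling order: the element after a tag is usually its own first child. A minimal sketch on hypothetical markup:

```python
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup("<p><a>1</a><b>2</b></p>", "html.parser")
a = soup.a

sibling = a.next_sibling     # the <b> tag
element = a.next_element     # the string "1" inside <a>

# Label tags by name and strings by their text, in parse order
order = [e.name if isinstance(e, Tag) else str(e) for e in a.next_elements]

before_b = soup.b.previous_element   # the string parsed just before <b>
```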
Search
Acceptable Argument Types (AAT)
- string: 'b'
- regex: re.compile("t") (uses the search() method, so /t/ will also match html)
- list: ['a', 'b']
- True (all tags, not strings)
- function that takes a Tag and returns True or False
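Each argument type above can be sketched against one small, made-up document. The regex case shows that matching is by search, so "t" hits html as well as title.

```python
import re
from bs4 import BeautifulSoup

html = (
    "<html><head><title>T</title></head>"
    "<body><a id='x'>A</a><b>B</b></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

def names(tags):
    """Tag names of a result list, in document order."""
    return [t.name for t in tags]

by_string = names(soup.find_all("b"))
by_regex = names(soup.find_all(re.compile("t")))     # names containing "t"
by_list = names(soup.find_all(["a", "b"]))
by_true = names(soup.find_all(True))                 # every tag, no strings
by_func = names(soup.find_all(lambda tag: tag.has_attr("id")))
```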
Find All
The main function is tag.find_all(name, attrs, recursive=True, text=None, limit=None, **kwargs)
Shorthand: tag(...) is equivalent to tag.find_all(...)
- Name: name (any AAT)
  - Note that tag.a is equivalent to tag.find('a'), i.e. the first match only
- Attributes: **kwargs, attrs
  - id='link2' / href=re.compile("elsie") / ... (any AAT)
  - class_="sister" / class_=re.compile("itl") / ... (any AAT)
    - 1 class string → any position
    - multiple class strings → must be in the same order! (tag.select(...) is more flexible)
  - attrs={"data-foo": "value"} (for attributes with hyphens / any AAT)
- Text: text (named string since Beautiful Soup 4.4.0)
  - Without name: search for text (any AAT)
  - With name: search for tags such that tag.string matches the text argument (any AAT)
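The find_all arguments can be sketched as below. The markup and the ids() helper are made up for illustration, and the text search uses the string keyword, which assumes Beautiful Soup 4.4.0 or later.

```python
import re
from bs4 import BeautifulSoup

html = """
<p class="title"><b>The Dormouse's story</b></p>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister big" href="http://example.com/lacie" id="link2">Lacie</a>
<div data-foo="value">foo</div>
"""
soup = BeautifulSoup(html, "html.parser")

def ids(tags):
    """Hypothetical helper: id attribute of each tag in a result list."""
    return [t.get("id") for t in tags]

by_name = ids(soup.find_all("a"))
shorthand = ids(soup("a"))                        # same as find_all("a")
by_id = ids(soup.find_all(id="link2"))
by_href = ids(soup.find_all(href=re.compile("elsie")))
one_class = ids(soup.find_all(class_="sister"))   # matches in any position
wrong_order = ids(soup.find_all(class_="big sister"))
right_order = ids(soup.find_all(class_="sister big"))
hyphenated = [t.name for t in soup.find_all(attrs={"data-foo": "value"})]

strings_only = soup.find_all(string="Elsie")      # returns strings
tags_by_text = ids(soup.find_all("a", string=re.compile("^La")))
limited = soup.find_all("a", limit=1)
```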
Other Functions
- tag.find(...) finds 1 result (same signature as find_all, except limit)
- tag.find_{parents|{next|previous}_siblings|all_{next|previous}}(...)
- tag.find_{parent|{next|previous}_sibling|{next|previous}}(...)
- tag.select(...) (CSS pattern)
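A sketch of the directional variants and of select, on hypothetical markup. It assumes a Beautiful Soup version where select is available (in recent releases it relies on the soupsieve package, which is installed alongside bs4).

```python
from bs4 import BeautifulSoup

html = (
    "<div id='box'><p id='one'>1</p><p id='two'>2</p>"
    "<p id='three'>3</p></div>"
)
soup = BeautifulSoup(html, "html.parser")
two = soup.find(id="two")

missing = soup.find("span")                      # no match: None, not []
parent_id = two.find_parent("div")["id"]
next_id = two.find_next_sibling("p")["id"]
prev_id = two.find_previous_sibling("p")["id"]
later = [p["id"] for p in soup.find(id="one").find_next_siblings("p")]
next_in_parse_order = two.find_next("p")["id"]
earlier = [p["id"] for p in two.find_all_previous("p")]

# CSS selector: <p> tags that are direct children of #box
selected = [p["id"] for p in soup.select("#box > p")]
```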
Modify
- tag.name = "blockquote" / tag['class'] = 'verybold' / del tag['class']
- tag.string = "New link text." (old content and children will be removed!)
- tag.append(new_...) / tag.insert(idx, new_...) / tag.insert_{before|after}(new_...)
  - new_string = soup.new_string(" there") (Python unicode object works too)
  - new_comment = soup.new_string("Nice to see you.", Comment)
  - new_tag = soup.new_tag("a", href="http://www.example.com")
- tag.clear() (clear inside but not self)
- tag.extract() (remove self and return)
- tag.decompose() (remove self and destroy completely)
- tag.replace_with(soup.new_tag("b")) (replace and return the old one)
- tag.wrap(soup.new_tag("b"))
- tag.unwrap() (replace self with the inside and return the old one)
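Several of the modifications above, chained on one made-up fragment. The intermediate strings are captured so each step can be inspected.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<p><b class='x'>old</b><i>gone</i></p>", "html.parser"
)
tag = soup.b

# Rename, drop an attribute, and replace the contents
tag.name = "strong"
del tag["class"]
tag.string = "New text"
step1 = str(soup)

# Build a new tag and place it right after the renamed one
link = soup.new_tag("a", href="http://www.example.com")
link.string = "site"
tag.insert_after(link)
step2 = str(soup)

removed = soup.i.extract()        # detached from the tree and returned
link.wrap(soup.new_tag("em"))     # <em> now surrounds the link
tag.unwrap()                      # <strong> goes away, its text stays
final = str(soup)
```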
Output
- tag.prettify() / tag.decode() = unicode(tag) / tag.encode() = str(tag) (encoded in UTF-8; this is Python 2 naming, in Python 3 str(tag) gives text and tag.encode() returns UTF-8 bytes)
  - argument formatter can be "minimal", "html", None, or a function
- tag.get_text() / tag.get_text('|', strip=True) (specify separator)
  - can also use [text for text in tag.stripped_strings]
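A sketch of the output calls, assuming Python 3 (str for text, bytes for encoded output). The markup is hypothetical.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b> world </b></p>", "html.parser")
p = soup.p

as_text = p.decode()                  # same as str(p) in Python 3
as_bytes = p.encode()                 # UTF-8 bytes by default
pretty = p.prettify()                 # indented, one node per line

raw = p.get_text()                    # strings joined as-is
joined = p.get_text("|", strip=True)  # stripped pieces, "|" between them
pieces = list(p.stripped_strings)

# The formatter controls how characters are escaped on output
accented = BeautifulSoup("<p>café &amp; bar</p>", "html.parser").p
minimal = accented.decode(formatter="minimal")
as_html = accented.decode(formatter="html")
```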