Parsers

Types

(includes NavigableString, Comment, CData, ProcessingInstruction, Declaration, and Doctype)

soup.head / soup.body.b (only the first match is returned; use soup.find_all('b') to get all)
tag.contents (list) / tag.children (generator) / tag.descendants (generator)
tag.string:
- One child → returns that child's string (if the child is a tag, its own .string)
- Multiple children → returns None!
tag.strings (generator) / tag.stripped_strings (generator; each string stripped, whitespace-only strings skipped)
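
A minimal sketch of the navigation attributes above, assuming the bs4 package is installed and using the built-in "html.parser". The markup and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative markup (not from any real page).
html = (
    "<html><head><title>T</title></head>"
    "<body><p><b>one</b> and <b>two</b></p><p> three </p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

first_b = soup.body.b              # attribute access returns the first <b> only
all_b = soup.find_all("b")         # every <b> in the document

p = soup.p                         # first <p>
contents = p.contents              # list of direct children
children = list(p.children)        # same children, from a generator
descendants = list(p.descendants)  # children, grandchildren, ... in document order

single = soup.title.string         # one string child
nested = soup.head.string          # one tag child that itself has a .string
multi = p.string                   # several children, so None

strings = list(soup.body.strings)
stripped = list(soup.body.stripped_strings)

print(contents)   # [<b>one</b>, ' and ', <b>two</b>]
print(multi)      # None
print(stripped)   # ['one', 'and', 'two', 'three']
```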

string: 'b'
regex: re.compile("t") (matched with the search() method, so re.compile("t") also matches html)
list: ['a', 'b']
True (all tags, not strings)
function that takes a Tag and returns True or False
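
The five filter types listed above can be sketched as follows. This assumes bs4 with the built-in "html.parser"; the markup, the names() helper, and has_id_but_no_class() are illustrative, not part of the library.

```python
import re
from bs4 import BeautifulSoup

html = (
    "<html><head><title>T</title></head>"
    "<body><a id='x' href='/1'>one</a><b>two</b></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

def names(tags):
    """Hypothetical helper: reduce a result list to tag names."""
    return [t.name for t in tags]

by_string = names(soup.find_all("b"))              # exact tag name
by_regex = names(soup.find_all(re.compile("t")))   # name contains "t" anywhere
by_list = names(soup.find_all(["a", "b"]))         # any name in the list
by_true = names(soup.find_all(True))               # every tag, no strings

def has_id_but_no_class(tag):
    """Illustrative predicate: receives a Tag, returns True or False."""
    return tag.has_attr("id") and not tag.has_attr("class")

by_func = names(soup.find_all(has_id_but_no_class))

print(by_regex)  # ['html', 'title']
print(by_true)   # ['html', 'head', 'title', 'body', 'a', 'b']
print(by_func)   # ['a']
```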

The main function is tag.find_all(name=None, attrs={}, recursive=True, string=None, limit=None, **kwargs) (string was called text before Beautiful Soup 4.4.0)

Shorthand: tag(...) is equivalent to tag.find_all(...)
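
A short sketch of the call shorthand together with the limit and recursive arguments, assuming bs4 and the built-in "html.parser". The markup and the texts() helper are made up for illustration.

```python
from bs4 import BeautifulSoup

html = "<div><p>a</p><section><p>b</p></section><p>c</p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

def texts(tags):
    """Hypothetical helper: reduce a result list to each tag's string."""
    return [str(t.string) for t in tags]

all_p = div.find_all("p")                    # searches all descendants
shorthand = div("p")                         # calling a tag is find_all
first_two = div.find_all("p", limit=2)       # stop after two matches
direct = div.find_all("p", recursive=False)  # direct children only

print(texts(all_p))      # ['a', 'b', 'c']
print(texts(first_two))  # ['a', 'b']
print(texts(direct))     # ['a', 'c']
```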

Name: name (any AAT)
- Note that tag.a is equivalent to tag.find('a') (first match only), not tag.find_all('a')
Attributes: **kwargs, attrs
- id='link2' / href=re.compile("elsie") / ... (any AAT)
- class_="sister" / class_=re.compile("itl") / ... (any AAT)
  - one class name → matches a tag carrying that class in any position
  - several class names in one string → must match the tag's class attribute in exactly that order! (tag.select(...) is more flexible)
- attrs={"data-foo": "value"} (for attributes with hyphens / any AAT)
Text: string (called text before Beautiful Soup 4.4.0)
- Without name: search for text (any AAT)
- With name: search for tags such that tag.string matches the text argument (any AAT)
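
The attribute, class, and text searches above can be sketched like this. It assumes bs4 4.7 or later (so that select() and the string argument are available) and the built-in "html.parser"; the markup is illustrative.

```python
import re
from bs4 import BeautifulSoup

html = """
<p class="body strikeout">intro</p>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<div data-foo="value">custom</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_id = soup.find_all(id="link2")
by_href = soup.find_all(href=re.compile("elsie"))
by_class = soup.find_all("a", class_="sister")

one_class = soup.find_all(class_="strikeout")          # matches in any position
exact_order = soup.find_all(class_="body strikeout")   # same order as the attribute
wrong_order = soup.find_all(class_="strikeout body")   # different order: no match
css = soup.select("p.strikeout.body")                  # CSS selector ignores order

by_data = soup.find_all(attrs={"data-foo": "value"})   # hyphenated attribute name

strings_only = soup.find_all(string="Elsie")           # no name: returns strings
tags_by_string = soup.find_all("a", string="Lacie")    # with name: returns tags

print(len(by_class))         # 2
print(wrong_order)           # []
print(strings_only)          # ['Elsie']
print(tags_by_string[0]["id"])  # link2
```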

tag.name = "blockquote" / tag['class'] = 'verybold' / del tag['class']
tag.string = "New link text." (old content and children will be removed!)
tag.append(new_...) / tag.insert(idx, new_...) / tag.insert_{before|after}(new_...)
- new_string = soup.new_string(" there") (a plain Python string works too)
- new_comment = soup.new_string("Nice to see you.", Comment)
- new_tag = soup.new_tag("a", href="http://www.example.com")
tag.clear() (clear inside but not self)
tag.extract() (remove self and return)
tag.decompose() (remove self and destroy completely)
tag.replace_with(soup.new_tag("b")) (replace and return the old one)
tag.wrap(soup.new_tag("b"))
tag.unwrap() (replace self with the inside and return the old one)
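
A minimal walk-through of the tree-modification calls above, applied one after another to a small made-up document. It assumes bs4 and the built-in "html.parser"; the markup and variable names are illustrative.

```python
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup(
    '<div><a class="x" href="/old">Old <i>text</i></a><p>para</p></div>',
    "html.parser",
)
a = soup.a

# Attributes and string content.
a["class"] = "verybold"
del a["class"]
a.string = "New link text."        # drops the old text and the <i> child
a.append(soup.new_string(" there"))
a.append(soup.new_string("Nice to see you.", Comment))
after_append = str(a)

# New tags and positional inserts.
link = soup.new_tag("a", href="http://www.example.com")
link.string = "Example"
a.insert_after(link)                          # sibling right after <a>
soup.p.insert(0, soup.new_string("first "))   # first child of <p>
p_text = soup.p.get_text()

# Removing, replacing, wrapping.
removed = link.extract()                        # detached but still usable
old = a.replace_with(soup.new_tag("b"))         # returns the replaced element
wrapper = soup.p.wrap(soup.new_tag("section"))  # <section><p>...</p></section>
soup.p.clear()                                  # <p> stays, its contents go
soup.p.decompose()                              # <p> itself is destroyed
wrapper.unwrap()                                # <section> replaced by its (empty) contents
soup.b.name = "strong"                          # rename a tag in place
final = str(soup)

print(after_append)  # <a href="/old">New link text. there<!--Nice to see you.--></a>
print(final)         # <div><strong></strong></div>
```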

tag.prettify() / tag.decode() = str(tag) (Unicode text) / tag.encode() (bytes, UTF-8 by default); in Python 2 these were unicode(tag) and str(tag)
- argument formatter can be "minimal", "html", None, or a function
tag.get_text() / tag.get_text('|', strip=True) (specify separator)
- can also use [text for text in tag.stripped_strings]
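
A small sketch of the output methods and formatter values above, assuming Python 3, bs4, and the built-in "html.parser". The markup is made up so that it contains both an ampersand and a non-ASCII character.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p>caf\u00e9 &amp; <a href="/x">link</a> <b> bold </b></p>',
    "html.parser",
)
p = soup.p

as_text = p.decode()      # str, same as str(p)
as_bytes = p.encode()     # bytes, UTF-8 by default
pretty = p.prettify()     # str, one tag or string per line, indented

minimal = p.decode(formatter="minimal")  # escapes & < > only
html_fmt = p.decode(formatter="html")    # also uses named entities such as &eacute;
raw = p.decode(formatter=None)           # no escaping at all

text = p.get_text()                      # all strings concatenated
joined = p.get_text("|", strip=True)     # stripped strings joined by "|"
same = "|".join(p.stripped_strings)      # equivalent via stripped_strings

print(joined)  # café &|link|bold
```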