Beautiful Soup

Reference http://www.crummy.com/software/BeautifulSoup/bs4/doc/

soup = BeautifulSoup(markup{, parser})

Parsers

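  • Default: the best parser installed (lxml, then html5lib, then Python's built-in html.parser); name one explicitly for consistent results
  • lxml ("lxml" for HTML, "xml" for XML): very fast (pip install lxml)
  • html5lib ("html5lib"): HTML5-compliant but slow (pip install html5lib)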
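
A minimal sketch of building a soup with the parser named explicitly. The markup is made up for illustration, and "html.parser" is used only because it needs no extra install.

```python
from bs4 import BeautifulSoup

markup = "<p>Hello, <b>world</b></p>"  # made-up sample markup

# Naming the parser keeps the resulting tree the same across machines.
soup = BeautifulSoup(markup, "html.parser")

first_bold = soup.b.string      # string inside the first <b>
paragraph = soup.p.get_text()   # all text inside the first <p>
print(first_bold, "/", paragraph)
```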

Types

Tag

(includes Tag and BeautifulSoup)

  • tag = soup.b
  • tag.name (can assign too)
  • tag['class'] (can assign; multi-valued attributes such as class return a list) / tag.attrs (dict of all attributes)

NavigableString

(includes NavigableString, Comment, CData, ProcessingInstruction, Declaration, and Doctype)

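  • tag.string (convert with str(), or unicode() in Python 2, before using it outside the tree; a NavigableString keeps a reference to the whole tree)
  • tag.string.replace_with("new stuff")
  • Supports most of Tag's navigation and search features, but not .contents, .string, or find() (a string cannot contain anything)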
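
The two kinds of object can be told apart as in this sketch. The sample markup is an assumption chosen so that the tag holds both a plain string and a comment.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

soup = BeautifulSoup('<b class="boldest big" id="x">Bold<!--note--></b>', "html.parser")
tag = soup.b

name = tag.name          # tag name as a plain string
classes = tag["class"]   # multi-valued attribute, returned as a list
tag_id = tag["id"]       # single-valued attribute, returned as a string
attrs = tag.attrs        # dict of every attribute

text_node = tag.contents[0]     # NavigableString
comment_node = tag.contents[1]  # Comment, a NavigableString subclass
plain = str(text_node)          # plain str, detached from the tree
```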

Navigate

Going Down

  • soup.head / soup.body.b (only the first one is returned; use soup.find_all('a') to get all)
  • tag.contents (list) / tag.children (generator) / tag.descendants (generator)
  • tag.string:
    • One child → returns that child's string (recursing if the only child is itself a tag)
    • Multiple children → returns None!
  • tag.strings (generator) / tag.stripped_strings (generator; whitespace strings removed)
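
The .string rule is easy to trip over, so here is a small sketch of it next to the other downward attributes. The markup is invented for the example.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<div><p><b>one</b></p><p>two <i>three</i></p></div>"
soup = BeautifulSoup(html, "html.parser")
first_p, second_p = soup.find_all("p")

single = first_p.string      # only child is <b>, so its string is returned
multiple = second_p.string   # two children ("two " and <i>), so None

child_count = len(soup.div.contents)  # direct children only
descendant_tags = [d.name for d in soup.div.descendants if isinstance(d, Tag)]
all_strings = list(soup.div.stripped_strings)  # stripped, whitespace-only strings skipped
```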

Going Up

  • tag.parent (can be used on NavigableString too)
  • tag.parents (generator)
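
A short sketch of walking upward, assuming a small made-up document. Note that .parent works on the string as well as on the tag.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p><b>text</b></p></body></html>", "html.parser")
text = soup.b.string

string_parent = text.parent.name   # the tag that directly holds the string
tag_parent = soup.b.parent.name    # the tag that directly holds <b>

# .parents walks outward from <b>; the first three entries are real tags.
ancestors = [p.name for p in soup.b.parents]
```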

Going Sideways

  • tag.next_sibling / tag.previous_sibling
  • tag.next_siblings (generator) / tag.previous_siblings (generator)
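
Siblings include strings, which is a common surprise. The sketch below assumes a made-up paragraph with text between the links.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<p><a>1</a>, <a>2</a> and <a>3</a></p>", "html.parser")
first_link = soup.a

next_node = first_link.next_sibling       # the string ", ", not the second <a>
before = first_link.previous_sibling      # None: nothing precedes the first child

# Keep only the tags among the following siblings.
later_links = [s.get_text() for s in first_link.next_siblings if isinstance(s, Tag)]
```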

Going in Parse Order

  • tag.next_element / tag.previous_element
  • tag.next_elements (generator) / tag.previous_elements (generator)
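
To illustrate the difference from siblings: next_element follows the order in which things were parsed, so it can step into a tag. A sketch on invented markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>bold</b><i>italic</i></p>", "html.parser")
b = soup.b

sibling = str(b.next_sibling)   # the <i> tag, which sits next to <b>
element = str(b.next_element)   # the string inside <b>, which was parsed next

# Everything parsed after the opening <b>, in order.
parse_order = [str(e) for e in b.next_elements]
```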

Search

Acceptable Argument Types (AAT)

  • string: 'b'
  • regex: re.compile("t") (uses the search() method, so /t/ also matches html)
  • list: ['a', 'b']
  • True (all tags, not strings)
  • function that takes a Tag and returns True or False
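
Each argument type can be tried against the tag name, as in this sketch. The document and the helper names (names, has_id) are hypothetical.

```python
import re
from bs4 import BeautifulSoup

html = "<html><head><title>T</title></head><body><b>x</b><a id='l'>y</a></body></html>"
soup = BeautifulSoup(html, "html.parser")

def names(results):
    """Hypothetical helper: tag names of a result list."""
    return [t.name for t in results]

def has_id(tag):
    """Hypothetical filter: True for tags carrying an id attribute."""
    return tag.has_attr("id")

by_string = names(soup.find_all("b"))
by_regex = names(soup.find_all(re.compile("t")))   # any name containing "t"
by_list = names(soup.find_all(["a", "b"]))         # results come in document order
every_tag = names(soup.find_all(True))
by_function = names(soup.find_all(has_id))
```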

Find All

The main function is tag.find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)

Shorthand: tag(...) is equivalent to tag.find_all(...)

  • Name: name (any AAT)
    • Note that tag.a is equivalent to tag.find('a') (first match only), not tag.find_all('a')
  • Attributes: **kwargs, attrs
    • id='link2' / href=re.compile("elsie") / ... (any AAT)
    • class_="sister" / class_=re.compile("itl") / ... (any AAT)
      • single class string → matches if it is any one of the tag's classes
      • multi-class string → must match the class attribute exactly, in the same order (tag.select(...) is more flexible)
    • attrs={"data-foo": "value"} (for attributes with hyphens / any AAT)
  • Text: text (called string since Beautiful Soup 4.4.0)
    • Without name: search for text (any AAT)
    • With name: search for tags such that tag.string matches the text argument (any AAT)
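
The options above can be sketched as follows. The markup and the ids helper are made up, and the example uses string=, the Beautiful Soup 4.4.0+ spelling of the text argument, so it assumes a reasonably recent install.

```python
import re
from bs4 import BeautifulSoup

html = '''
<p class="body strikeout" id="p1">one</p>
<p class="strikeout body" id="p2">two</p>
<a class="sister" href="http://example.com/elsie" id="link1" data-foo="value">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
'''
soup = BeautifulSoup(html, "html.parser")

def ids(results):
    """Hypothetical helper: id attribute of each tag found."""
    return [t["id"] for t in results]

by_name = ids(soup.find_all("a"))
shorthand = ids(soup("a"))                             # same as find_all("a")
by_id = ids(soup.find_all(id="link2"))
by_href = ids(soup.find_all(href=re.compile("elsie")))
one_class = ids(soup.find_all(class_="body"))          # class position is irrelevant
two_classes = ids(soup.find_all(class_="body strikeout"))  # exact string, order matters
hyphenated = ids(soup.find_all(attrs={"data-foo": "value"}))
strings_only = soup.find_all(string="Elsie")           # returns strings, not tags
tag_by_string = ids(soup.find_all("a", string="Lacie"))
limited = ids(soup.find_all("a", limit=1))
```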

Other Functions

  • tag.find(...) returns the first result, or None if nothing matches (same signature as find_all, except limit)
  • tag.find_{parents|{next|previous}_siblings|all_{next|previous}}(...)
  • tag.find_{parent|{next|previous}_sibling|{next|previous}}(...)
  • tag.select(...) (CSS pattern)
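
A sketch of the directional variants and of select, on invented markup. select relies on the soupsieve package, which is normally installed together with Beautiful Soup.

```python
from bs4 import BeautifulSoup

html = ('<div id="box"><p class="strikeout body" id="p1">one <b>bold</b></p>'
        '<p id="p2">two</p></div>')
soup = BeautifulSoup(html, "html.parser")
b = soup.find("b")

missing = soup.find("table")                      # no match
enclosing = b.find_parent("div")["id"]            # nearest <div> ancestor
sibling = soup.find(id="p1").find_next_sibling("p")["id"]
after = b.find_next("p")["id"]                    # next <p> in parse order
earlier = b.find_previous("p")["id"]              # previous <p> in parse order

# CSS selectors ignore the order in which classes were written.
by_css = [t["id"] for t in soup.select("p.body.strikeout")]
children = [t["id"] for t in soup.select("div > p")]
```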

Modify

  • tag.name = "blockquote" / tag['class'] = 'verybold' / del tag['class']
  • tag.string = "New link text." (old content and children will be removed!)
  • tag.append(new_...) / tag.insert(idx, new_...) / tag.insert_{before|after}(new_...)
    • new_string = soup.new_string(" there") (a plain Python string works too)
    • new_comment = soup.new_string("Nice to see you.", Comment)
    • new_tag = soup.new_tag("a", href="http://www.example.com")
  • tag.clear() (clear inside but not self)
  • tag.extract() (remove self and return)
  • tag.decompose() (remove self and destroy completely)
  • tag.replace_with(soup.new_tag("b")) (replace and return the old one)
  • tag.wrap(soup.new_tag("b"))
  • tag.unwrap() (replace self with the inside and return the old one)
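
Several of these operations chained together, as a sketch on made-up markup. The order of the steps is arbitrary and only meant to show the effect of each call.

```python
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup('<p><b class="old">Hello</b><i>gone</i><u>keep</u></p>', "html.parser")
b = soup.b

b.name = "strong"                 # rename the tag
b["class"] = "verybold"           # overwrite an attribute
b.string = "Hi"                   # replaces everything inside the tag
b.append(soup.new_string(" there"))
b.append(soup.new_string("note", Comment))

link = soup.new_tag("a", href="http://www.example.com")
link.string = "link"
b.insert_after(link)              # new tag placed right after <strong>

removed = soup.i.extract()        # detached from the tree but still usable
soup.u.unwrap()                   # <u> goes away, its contents stay
link.wrap(soup.new_tag("em"))     # <a> is now inside a new <em>

result = str(soup)
print(result)
```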

Output

  • tag.prettify() / tag.decode() = str(tag) (text) / tag.encode() (bytes, UTF-8 by default); in Python 2 these were unicode(tag) and str(tag)
    • argument formatter can be "minimal", "html", None, or a function
  • tag.get_text() / tag.get_text('|', strip=True) (specify separator)
    • can also use [text for text in tag.stripped_strings]
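
A sketch comparing the output calls under Python 3, with made-up markup that has extra whitespace so the effect of strip=True is visible.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p> Hello <b> world </b></p>", "html.parser")
p = soup.p

as_text = p.decode()                 # markup as str, same as str(p)
as_bytes = p.encode()                # markup as bytes, UTF-8 unless told otherwise
pretty = p.prettify()                # indented, one node per line

joined = p.get_text()                # strings concatenated as they are
piped = p.get_text("|", strip=True)  # stripped pieces joined with "|"
pieces = list(p.stripped_strings)    # the same pieces as a list
```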