Reference http://www.crummy.com/software/BeautifulSoup/bs4/doc/
soup = BeautifulSoup(markup{, parser})
Parsers
- Default: Python's html.parser ("html.parser")
- lxml ("lxml", "xml"): very fast (pip install lxml)
- html5lib ("html5lib"): HTML5-compliant but slow (pip install html5lib)
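As a minimal sketch of the constructor call, the snippet below names the built-in parser explicitly. The markup and variable names are made up for illustration; swapping in "lxml" or "html5lib" assumes those packages are installed.

```python
from bs4 import BeautifulSoup

markup = "<p>Hello, <b>world</b></p>"  # illustrative markup

# "html.parser" ships with Python, so no extra install is needed.
soup = BeautifulSoup(markup, "html.parser")

first_bold = soup.b.string      # text inside the first <b>
paragraph_text = soup.p.get_text()
```

Passing the parser name explicitly keeps results the same across machines, since the best available parser can differ from one installation to another.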
Types
Tag
(includes Tag and BeautifulSoup)
- tag = soup.b
- tag.name (can assign too)
- tag['class'] / tag.attrs (can assign; tag['class'] returns a list because class is multi-valued, tag.attrs returns a dict)
NavigableString
(includes NavigableString, Comment, CData, ProcessingInstruction, Declaration, and Doctype)
- tag.string (convert to a plain string with str(), or unicode() on Python 2, before using it outside the tree)
- tag.string.replace_with("new stuff")
- Many properties are similar to Tag
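The two main types can be inspected as in the sketch below. It assumes Python 3 and the built-in parser; the markup is illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

soup = BeautifulSoup('<b class="boldest big">Extremely bold</b>', "html.parser")
tag = soup.b

original_name = tag.name     # 'b'
classes = tag["class"]       # multi-valued attribute, returned as a list

# Both the name and the attributes can be assigned.
tag.name = "strong"
tag["id"] = "intro"

# tag.string is a NavigableString; str() gives a plain Python string.
text = str(tag.string)
```

Converting with str() drops the reference back to the parse tree, which is useful when the string outlives the soup.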
Navigate
Going Down
- soup.head / soup.body.b (only the first match is returned; use soup.find_all('a') to get all)
- tag.contents (list) / tag.children (generator) / tag.descendants (generator)
- tag.string:
  - One child → returns the string of that child
  - Multiple children → returns None!
- tag.strings (generator) / tag.stripped_strings (generator; whitespace strings removed)
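The behaviour of tag.string with one child versus several can be seen in a small sketch like this one (illustrative markup, built-in parser assumed):

```python
from bs4 import BeautifulSoup

html = (
    "<html><head><title>The Title</title></head>"
    "<body><p>One <b>two</b> three</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")
p = soup.body.p

child_count = len(p.contents)              # direct children only
descendant_count = len(list(p.descendants))  # children and everything below them

title_string = soup.title.string   # single string child
head_string = soup.head.string     # single child tag, so its string is used
p_string = p.string                # several children, so None

stripped = list(p.stripped_strings)
```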
Going Up
- tag.parent (can be used on NavigableString too)
- tag.parents (generator)
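A short sketch of walking upward from a string node, assuming the built-in parser and made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p><b>text</b></p></div>", "html.parser")
text_node = soup.b.string

# .parent works on strings as well as tags.
parent_name = text_node.parent.name

# .parents walks all the way up to the BeautifulSoup object itself.
ancestor_names = [parent.name for parent in text_node.parents]
```

The last entry is the BeautifulSoup object, whose name is the special value "[document]".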
Going Sideways
- tag.next_sibling / tag.previous_sibling
- tag.next_siblings (generator) / tag.previous_siblings (generator)
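Sibling navigation might look like the sketch below. The markup deliberately has no whitespace between the list items; with pretty-printed HTML the siblings would often be whitespace strings instead of tags.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>a</li><li>b</li><li>c</li></ul>", "html.parser")
first = soup.li

second = first.next_sibling                  # the next <li>
before_first = first.previous_sibling        # nothing precedes the first item
later = [str(item.string) for item in first.next_siblings]
```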
Going in Parse Order
- tag.next_element / tag.previous_element
- tag.next_elements (generator) / tag.previous_elements (generator)
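Parse order differs from sibling order: the next element is whatever was parsed immediately afterwards, which is usually the tag's own first child. A minimal sketch with illustrative markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b><i>two</i></p>", "html.parser")
b = soup.b

sibling = b.next_sibling          # the <i> tag, at the same level
element = b.next_element          # the string inside <b>, parsed right after it
before_i = soup.i.previous_element  # the string parsed just before <i>
```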
Search
Acceptable Argument Types (AAT)
- string: 'b'
- regex: re.compile("t") (uses the search() method, so /t/ will also match html)
- list: ['a', 'b']
- True (all tags, not strings)
- function that takes a Tag and returns True or False
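Each argument type can be tried against a tiny document, as in this sketch. The markup and the helper has_href are hypothetical and only serve the illustration.

```python
import re
from bs4 import BeautifulSoup

html = (
    "<html><head><title>T</title></head>"
    "<body><a href='#'>x</a><b>y</b></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

by_string = [t.name for t in soup.find_all("b")]
by_regex = [t.name for t in soup.find_all(re.compile("t"))]   # any name containing "t"
by_list = [t.name for t in soup.find_all(["a", "b"])]
all_tags = [t.name for t in soup.find_all(True)]


def has_href(tag):
    """Hypothetical filter: keep tags that carry an href attribute."""
    return tag.has_attr("href")


by_function = [t.name for t in soup.find_all(has_href)]
```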
Find All
The main function is tag.find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)
Shorthand: tag(...) is equivalent to tag.find_all(...)
- Name: name (any AAT)
  - Note that tag.a is equivalent to tag.find('a')
- Attributes: **kwargs, attrs
  - id='link2' / href=re.compile("elsie") / ... (any AAT)
  - class_="sister" / class_=re.compile("itl") / ... (any AAT)
    - 1 class string → matches in any position
    - multiple class strings → must match the class attribute exactly, in the same order! (tag.select(...) is more flexible)
  - attrs={"data-foo": "value"} (for attributes with hyphens / any AAT)
- Text: text (named string since Beautiful Soup 4.4.0)
  - Without name: search for text (any AAT)
  - With name: search for tags such that tag.string matches the text argument (any AAT)
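The attribute and text filters can be sketched as follows. The markup is illustrative, and the example uses string=, the newer name for the text argument, which assumes Beautiful Soup 4.4.0 or later.

```python
import re
from bs4 import BeautifulSoup

html = """
<p class="title"><b>The Dormouse's story</b></p>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister big" href="http://example.com/lacie" id="link2">Lacie</a>
<div data-foo="value">foo!</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_id = [a["id"] for a in soup(id="link2")]            # soup(...) is find_all(...)
by_href = [a["id"] for a in soup.find_all(href=re.compile("elsie"))]

one_class = [a["id"] for a in soup.find_all("a", class_="sister")]
class_regex = [t.name for t in soup.find_all(class_=re.compile("itl"))]
same_order = [a["id"] for a in soup.find_all(class_="sister big")]
wrong_order = soup.find_all(class_="big sister")       # order matters here

hyphenated = [t.name for t in soup.find_all(attrs={"data-foo": "value"})]

strings_only = soup.find_all(string=re.compile("^L"))  # returns strings
tags_by_string = [a["id"] for a in soup.find_all("a", string="Elsie")]
limited = soup.find_all("a", limit=1)
```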
Other Functions
- tag.find(...) finds 1 result (same signature as find_all, except limit)
- tag.find_{parents|{next|previous}_siblings|all_{next|previous}}(...)
- tag.find_{parent|{next|previous}_sibling|{next|previous}}(...)
- tag.select(...) (CSS pattern)
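A few of these functions side by side, as a sketch on made-up markup. The select() call assumes a Beautiful Soup installation with its CSS selector support available, which is the case for a standard pip install.

```python
from bs4 import BeautifulSoup

html = '<div id="box"><p class="x">one</p><p>two</p><span>three</span></div>'
soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p")
missing = soup.find("table")                       # no match gives None

container_id = first_p.find_parent("div")["id"]
next_span = first_p.find_next_sibling("span").string
next_p = first_p.find_next("p").string             # next <p> in parse order

selected = [str(t.string) for t in soup.select("div#box > p.x")]
```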
Modify
- tag.name = "blockquote" / tag['class'] = 'verybold' / del tag['class']
- tag.string = "New link text." (old content and children will be removed!)
- tag.append(new_...) / tag.insert(idx, new_...) / tag.insert_{before|after}(new_...)
  - new_string = soup.new_string(" there") (a plain Python string works too)
  - new_comment = soup.new_string("Nice to see you.", Comment)
  - new_tag = soup.new_tag("a", href="http://www.example.com")
- tag.clear() (clear inside but not self)
- tag.extract() (remove self and return)
- tag.decompose() (remove self and destroy completely)
- tag.replace_with(soup.new_tag("b")) (replace and return the old one)
- tag.wrap(soup.new_tag("b"))
- tag.unwrap() (replace self with the inside and return the old one)
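Several of the modification calls combined in one sketch. The markup is illustrative and the built-in parser is assumed.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p class="a"><b>old</b> text <i>drop me</i></p>', "html.parser"
)
p = soup.p

p["class"] = "verybold"          # assign an attribute
soup.b.string = "new"            # replaces everything inside <b>

# Build a new tag and append it as the last child of <p>.
new_tag = soup.new_tag("a", href="http://www.example.com")
new_tag.string = "link"
p.append(new_tag)

removed = soup.i.extract()       # detached from the tree, still usable
soup.b.wrap(soup.new_tag("em"))  # <b> now sits inside a new <em>
```

extract() is the call to reach for when the removed element is still needed; decompose() is meant for elements that can be thrown away.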
Output
- tag.prettify() / tag.decode() (text; same as str(tag), or unicode(tag) on Python 2) / tag.encode() (bytes encoded in UTF-8; same as str(tag) on Python 2)
  - argument formatter can be "minimal", "html", None, or a function
- tag.get_text() / tag.get_text('|', strip=True) (specify separator)
  - can also use [text for text in tag.stripped_strings]
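The output methods can be compared on a small fragment, as sketched below. Python 3 is assumed, so decode() yields str and encode() yields bytes; the markup is made up.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b> big </b><i>world</i></p>", "html.parser")
p = soup.p

as_text = p.decode()     # markup as text, same as str(p)
as_bytes = p.encode()    # markup as UTF-8 bytes
pretty = p.prettify()    # indented, one element per line

plain = p.get_text()                    # strings concatenated as they are
joined = p.get_text("|", strip=True)    # stripped strings with a separator
pieces = [text for text in p.stripped_strings]
```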