HTML parsing with BeautifulSoup
Programming, Python. February 28, 2007
Here is an example using BeautifulSoup, a Python module that lets us parse HTML.
First we have to download the library from here. Once it is installed, let's look at a simple example:

from BeautifulSoup import BeautifulSoup

html = '''<html><head><title>Titulo de la pagina</title></head>
<body>
<div id="cabecera">
<h1>Cabecera</h1>
</div>
<div id="contenido">
Vamos a poner una lista. Lista:
<ul id="lista1">
<li>Elemento 1</li>
<li>Elemento 2</li>
<li>Elemento 3</li>
<li>Elemento 4</li>
</ul></div>
</body>
</html>'''

soup = BeautifulSoup(html)

# Show the page title
print soup.head.title.string

# Show the header
print soup.find("div", {"id": "cabecera"}).contents

# Show the content
contenido = soup.find("div", {"id": "contenido"})
print contenido.contents

# Now show every element of lista1.
# findAll("li") skips the whitespace text nodes between the <li> tags.
lista = contenido.find("ul", {"id": "lista1"})
for x in lista.findAll("li"):
    print x.string

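
The example above is written for Python 2 and the old `BeautifulSoup` import path. As a rough sketch only, this is how the same idea might look on Python 3, assuming the `beautifulsoup4` package (imported as `bs4`) is installed. The variable names `title`, `heading` and `items` are mine, not part of the original example, and passing `"html.parser"` is just one possible parser choice.

```python
# Assumption: the beautifulsoup4 package is installed (it provides bs4).
from bs4 import BeautifulSoup

html = """<html><head><title>Titulo de la pagina</title></head>
<body>
<div id="cabecera"> <h1>Cabecera</h1> </div>
<div id="contenido"> Vamos a poner una lista. Lista:
<ul id="lista1">
<li>Elemento 1</li>
<li>Elemento 2</li>
<li>Elemento 3</li>
<li>Elemento 4</li>
</ul></div>
</body>
</html>"""

# Name the parser explicitly; "html.parser" ships with Python itself.
soup = BeautifulSoup(html, "html.parser")

# Page title
title = soup.head.title.string

# Text of the <h1> inside the header div
heading = soup.find("div", {"id": "cabecera"}).h1.string

# Text of every <li> inside lista1 (find_all returns only the tags,
# not the whitespace between them)
lista = soup.find("ul", {"id": "lista1"})
items = [li.get_text() for li in lista.find_all("li")]

print(title)
print(heading)
for item in items:
    print(item)
```

The lookups by tag name and `id` work the same way as before; the main differences are the import, `print` as a function, and `find_all` in place of `findAll`.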
Instead of a string of HTML code, we can read from a URL this way:

...
import urllib2

sock = urllib2.urlopen("http://servidor/documento.html")
soup = BeautifulSoup(sock.read())
...

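
`urllib2` only exists in Python 2. On Python 3 the equivalent call lives in `urllib.request`, so a minimal sketch of the same step could look like the following. It again assumes `beautifulsoup4` is installed, and `soup_from_url` is a hypothetical helper name I made up for illustration.

```python
from urllib.request import urlopen

# Assumption: the beautifulsoup4 package is installed (it provides bs4).
from bs4 import BeautifulSoup


def soup_from_url(url):
    """Download the document at `url` and return it parsed (hypothetical helper)."""
    # The response object works as a context manager, so it is closed for us.
    with urlopen(url) as sock:
        return BeautifulSoup(sock.read(), "html.parser")


# Example call, using the placeholder URL from the text above:
# soup = soup_from_url("http://servidor/documento.html")
```

The call is left commented out because `http://servidor/documento.html` is only a placeholder address.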
This is a very simple example; for details, read the BeautifulSoup documentation.

