Writing a web crawler in Python: come along, you can learn it too


A Python crawler is generally used to grab specific content from the web. I recently wanted to learn how to crawl the network for the things I want. The example program has one main job: grabbing the pictures from our campus network's news pages.

#coding=utf-8
# Written for Python 2: in Python 3, urlopen and urlretrieve live in urllib.request.
import urllib
import re

# Fetch the content of a web page
def getHtml(url):
    webPage = urllib.urlopen(url)
    html = webPage.read()
    return html

# Download the pictures found in the page
def getNewsImgs(html):
    # Regular expression that captures the src value of each .jpg image
    reg = r'src="(.+?\.jpg)"'
    img = re.compile(reg)
    # Collect the URLs of all matching pictures on the page
    imglist = re.findall(img, html)
    x = 0
    # Download the pictures one by one and rename them in sequence
    for imgUrl in imglist:
        urllib.urlretrieve("http://www.abc.edu.cn/news/" + imgUrl, 'news-%s.jpg' % x)
        x += 1

# Fetch the web page
html = getHtml("http://www.abc.edu.cn/news/show.aspx?id=21413&cid=5")
# Grab the pictures
getNewsImgs(html)
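The program above relies on the Python 2 urllib module and needs the campus site to be reachable. To try the matching step on its own, here is a minimal sketch for Python 3 that runs the same kind of pattern against a made-up HTML snippet. The names SAMPLE_HTML and find_jpg_paths are invented for this illustration, and the pattern is assumed to be close to the one in the listing.

```python
import re

# Hypothetical sample markup standing in for the downloaded news page.
SAMPLE_HTML = '''
<div class="news">
  <img src="images/a.jpg" alt="first">
  <img src="images/b.jpg" alt="second">
  <img src='images/c.jpg' alt="single quotes">
  <img src="images/logo.png" alt="png logo">
</div>
'''

# Capture whatever sits between src=" and the closing quote after .jpg
IMG_PATTERN = re.compile(r'src="(.+?\.jpg)"')


def find_jpg_paths(html):
    """Return every double-quoted src value that ends in .jpg."""
    return IMG_PATTERN.findall(html)


paths = find_jpg_paths(SAMPLE_HTML)
print(paths)
```

Only the two double-quoted .jpg sources are returned. The single-quoted one is silently skipped, which is the kind of slip that makes hand-written patterns fragile.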

That is all it takes to grab the pictures from the campus news. The code above matches the data with a regular expression, but regular expressions are error-prone to write. If you have DOM development experience or have used jQuery, BeautifulSoup will feel like meeting an old friend. First, install BeautifulSoup. On a Mac the installation is very simple: open the terminal, run the following command, and enter your password when prompted.

sudo easy_install beautifulsoup4

The modified code:

#coding=utf-8
# Written for Python 2: in Python 3, urlopen and urlretrieve live in urllib.request.
import urllib
from bs4 import BeautifulSoup

# Fetch the content of a web page
def getHtml(url):
    webPage = urllib.urlopen(url)
    html = webPage.read()
    return html

# Download the pictures found in the page
def getNewsImgs(html):
    # Create the BeautifulSoup object
    soup = BeautifulSoup(html, "html.parser")
    # Find all img tags
    urlList = soup.find_all("img")
    length = len(urlList)
    # Walk through the tags and download each picture
    for i in range(length):
        imgUrl = urlList[i].attrs["src"]
        urllib.urlretrieve("http://www.abc.edu.cn/news/" + imgUrl, 'news-%s.jpg' % i)

# Fetch the web page
html = getHtml("http://www.abc.edu.cn/news/show.aspx?id=21430&cid=5")
# Grab the pictures
getNewsImgs(html)
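If BeautifulSoup is not installed, the same tag-based idea can be tried with the standard library alone. The sketch below is for Python 3 and swaps in html.parser instead of BeautifulSoup. It also resolves each src with urljoin rather than by string concatenation, so absolute and root-relative paths still produce valid URLs. The class ImgSrcCollector, the function collect_img_urls, and the sample markup are all invented for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before this call.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def collect_img_urls(html, base_url):
    """Return absolute URLs for all <img> tags, resolved against base_url."""
    parser = ImgSrcCollector()
    parser.feed(html)
    return [urljoin(base_url, src) for src in parser.sources]


# Hypothetical sample markup standing in for the downloaded news page.
SAMPLE_HTML = """
<img src="images/a.jpg">
<IMG SRC='/upload/b.jpg'>
<img alt="no source here">
<img src="http://img.example.com/c.png"/>
"""

urls = collect_img_urls(SAMPLE_HTML, "http://www.abc.edu.cn/news/")
for index, url in enumerate(urls):
    print(index, url)
```

Unlike the regular expression, the parser handles upper-case tags and single quotes, and it simply skips the tag that has no src. To actually download the files in Python 3, urllib.request.urlretrieve(url, filename) is the counterpart of the Python 2 call used above.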

Running the BeautifulSoup version saves the pictures into the current directory as news-0.jpg, news-1.jpg, and so on.

Writing a crawler in Python really is this simple, so why not give it a try? I have set up a Python learning and exchange group where members help one another, look out for one another, and share material, so the more people join, the more people there are to help with your problem. The group number is 301, then 056, then 051, and you will find plenty of experts gathered there. If you only want help for yourself and are not willing to share or help others, please do not join. Feel free to pass this on to others.