Python Sleep and Then Try to Connect to Socket Again
Networked programs
While many of the examples in this book have focused on reading files and looking for data in those files, there are many different sources of information when one considers the Internet.
In this chapter we will pretend to be a web browser and retrieve web pages using the Hypertext Transfer Protocol (HTTP). Then we will read through the web page data and parse it.
Hypertext Transfer Protocol - HTTP
The network protocol that powers the web is actually quite simple, and Python has built-in support for it in a module called socket, which makes it very easy to make network connections and retrieve data over those sockets in a Python program.
A socket is much like a file, except that a single socket provides a two-way connection between two programs. You can both read from and write to the same socket. If you write something to a socket, it is sent to the application at the other end of the socket. If you read from the socket, you are given the data which the other application has sent.
But if you try to read a socket when the program on the other end of the socket has not sent any data, you just sit and wait. If the programs on both ends of the socket simply wait for some data without sending anything, they will wait for a very long time, so an important part of programs that communicate over the Internet is to have some sort of protocol.
A protocol is a set of precise rules that determine who is to go first, what they are to do, and then what the responses are to that message, and who sends next, and so on. In a sense the two applications at either end of the socket are doing a dance and making sure not to step on each other's toes.
There are many documents that describe these network protocols. The Hypertext Transfer Protocol is described in the following document:
https://www.w3.org/Protocols/rfc2616/rfc2616.txt
This is a long and complex 176-page document with a lot of detail. If you find it interesting, feel free to read it all. But if you take a look around page 36 of RFC2616 you will find the syntax for the GET request. To request a document from a web server, we make a connection to the data.pr4e.org server on port 80, and then send a line of the form
GET http://data.pr4e.org/romeo.txt HTTP/1.0
where the second parameter is the web page we are requesting, and then we also send a blank line. The web server will respond with some header information about the document and a blank line followed by the document content.
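To make the shape of the request concrete, here is a minimal sketch that builds the bytes for such a request. The helper name build_get_request is an assumption made for this illustration; it is not part of the socket library, and it only assembles the text without sending anything.

```python
def build_get_request(url):
    """Return the bytes of a minimal HTTP/1.0 GET request for url.

    build_get_request is a hypothetical helper used only for illustration.
    """
    # The request line: the GET command, the document, and the protocol version
    request_line = 'GET ' + url + ' HTTP/1.0\r\n'
    # The extra \r\n is the blank line that tells the server the request is done
    return (request_line + '\r\n').encode()

request = build_get_request('http://data.pr4e.org/romeo.txt')
print(request)
```

The result is a bytes object that ends in two end-of-line sequences, which is the form a socket expects when the request is sent.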
The world's simplest web browser
Perhaps the easiest way to show how the HTTP protocol works is to write a very simple Python program that makes a connection to a web server and follows the rules of the HTTP protocol to request a document and display what the server sends back.
    import socket

    mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    mysock.connect(('data.pr4e.org', 80))
    cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
    mysock.send(cmd)

    while True:
        data = mysock.recv(512)
        if len(data) < 1:
            break
        print(data.decode(),end='')

    mysock.close()

    # Code: http://www.py4e.com/code3/socket1.py
First the program makes a connection to port 80 on the server data.pr4e.org. Since our program is playing the role of the "web browser", the HTTP protocol says we must send the GET command followed by a blank line. \r\n signifies an EOL (end of line), so \r\n\r\n signifies nothing between two EOL sequences. That is the equivalent of a blank line.
Once we send that blank line, we write a loop that receives data from the socket in chunks of up to 512 bytes and prints the data out until there is no more data to read (i.e., recv() returns an empty bytes object).
The program produces the following output:

    HTTP/1.1 200 OK
    Date: Wed, 11 Apr 2018 18:52:55 GMT
    Server: Apache/2.4.7 (Ubuntu)
    Last-Modified: Sat, 13 May 2017 11:22:22 GMT
    ETag: "a7-54f6609245537"
    Accept-Ranges: bytes
    Content-Length: 167
    Cache-Control: max-age=0, no-cache, no-store, must-revalidate
    Pragma: no-cache
    Expires: Wed, 11 Jan 1984 05:00:00 GMT
    Connection: close
    Content-Type: text/plain

    But soft what light through yonder window breaks
    It is the east and Juliet is the sun
    Arise fair sun and kill the envious moon
    Who is already sick and pale with grief
The output starts with headers which the web server sends to describe the document. For example, the Content-Type header indicates that the document is a plain text document (text/plain).
After the server sends us the headers, it adds a blank line to indicate the end of the headers, and then sends the actual data of the file romeo.txt.
This example shows how to make a low-level network connection with sockets. Sockets can be used to communicate with a web server or with a mail server or many other kinds of servers. All that is needed is to find the document which describes the protocol and write the code to send and receive the data according to the protocol.
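A low-level connection can fail, for example when the server is not yet accepting connections. The sketch below shows one possible way to sleep and then try to connect again. The function name connect_with_retry and its attempts and delay parameters are assumptions made for this illustration, not part of the socket library; only socket.socket(), connect(), close(), and time.sleep() are standard calls.

```python
import socket
import time

def connect_with_retry(host, port, attempts=3, delay=0.25):
    """Try to connect up to `attempts` times, sleeping `delay` seconds between tries.

    Returns a connected socket, or None if every attempt failed.
    connect_with_retry is a hypothetical helper used only for illustration.
    """
    for attempt in range(1, attempts + 1):
        # A socket whose connect() failed should not be reused, so make a new one
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.connect((host, port))
            return sock
        except OSError as err:
            sock.close()
            print('Attempt', attempt, 'failed:', err)
            if attempt < attempts:
                time.sleep(delay)
    return None
```

A call such as connect_with_retry('data.pr4e.org', 80) would return a socket that can be used in place of mysock in the program above, or None when the server could not be reached. A new socket is created for each attempt because the state of a socket after a failed connect() is not something to rely on.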
However, since the protocol that we use most commonly is the HTTP web protocol, Python has a special library specifically designed to support the HTTP protocol for the retrieval of documents and data over the web.
One of the requirements for using the HTTP protocol is the need to send and receive data as bytes objects, instead of strings. In the preceding example, the encode() and decode() methods convert strings into bytes objects and back again.
The next example uses b'' notation to specify that a variable should be stored as a bytes object. encode() and b'' are equivalent.
    >>> b'Hello world'
    b'Hello world'
    >>> 'Hello world'.encode()
    b'Hello world'
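As a small illustration of the round trip between strings and bytes objects, the sketch below encodes a string, decodes it again, and compares lengths. The variable names and the sample word are chosen only for this example.

```python
text = 'Hello world'

# str -> bytes; encode() uses UTF-8 unless another encoding is given
data = text.encode()
print(type(data), data)
print(data == b'Hello world')

# bytes -> str
back = data.decode()
print(type(back), back)
print(back == text)

# A character outside ASCII takes more than one byte in UTF-8,
# so the length in characters and the length in bytes can differ
word = 'café'
print(len(word), len(word.encode()))
```

This is one reason a count of bytes received from a socket is not always the same as a count of characters in the decoded text.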
Retrieving an image over HTTP
In the above example, we retrieved a plain text file which had newlines in the file and we simply copied the data to the screen as the program ran. We can use a similar program to retrieve an image using HTTP. Instead of copying the data to the screen as the program runs, we accumulate the data in a bytes object, trim off the headers, and then save the image data to a file as follows:
    import socket
    import time

    HOST = 'data.pr4e.org'
    PORT = 80
    mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    mysock.connect((HOST, PORT))
    mysock.sendall(b'GET http://data.pr4e.org/cover3.jpg HTTP/1.0\r\n\r\n')
    count = 0
    picture = b""

    while True:
        data = mysock.recv(5120)
        if len(data) < 1: break
        #time.sleep(0.25)
        count = count + len(data)
        print(len(data), count)
        picture = picture + data

    mysock.close()

    # Look for the end of the header (2 CRLF)
    pos = picture.find(b"\r\n\r\n")
    print('Header length', pos)
    print(picture[:pos].decode())

    # Skip past the header and save the picture data
    picture = picture[pos+4:]
    fhand = open("stuff.jpg", "wb")
    fhand.write(picture)
    fhand.close()

    # Code: http://www.py4e.com/code3/urljpeg.py
When the program runs, it produces the following output:

    $ python urljpeg.py
    5120 5120
    5120 10240
    4240 14480
    5120 19600
    ...
    5120 214000
    3200 217200
    5120 222320
    5120 227440
    3167 230607
    Header length 393
    HTTP/1.1 200 OK
    Date: Wed, 11 Apr 2018 18:54:09 GMT
    Server: Apache/2.4.7 (Ubuntu)
    Last-Modified: Mon, 15 May 2017 12:27:40 GMT
    ETag: "38342-54f8f2e5b6277"
    Accept-Ranges: bytes
    Content-Length: 230210
    Vary: Accept-Encoding
    Cache-Control: max-age=0, no-cache, no-store, must-revalidate
    Pragma: no-cache
    Expires: Wed, 11 Jan 1984 05:00:00 GMT
    Connection: close
    Content-Type: image/jpeg
You can see that for this URL, the Content-Type header indicates that the body of the document is an image (image/jpeg). Once the program completes, you can view the image data by opening the file stuff.jpg in an image viewer.
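If a program needs to look at individual headers such as Content-Type rather than just print them, the header block can be split into a dictionary. The following is a minimal sketch; the function name split_response and the sample bytes are made up for this illustration, and it assumes the whole response has already been received.

```python
def split_response(raw):
    """Split a complete raw HTTP response into (status_line, headers, body).

    split_response is a hypothetical helper used only for illustration.
    """
    # The first blank line (two CRLF sequences in a row) ends the headers
    pos = raw.find(b'\r\n\r\n')
    if pos == -1:
        raise ValueError('no blank line found; the headers are incomplete')
    head = raw[:pos].decode()
    body = raw[pos + 4:]          # skip the four bytes of \r\n\r\n

    lines = head.split('\r\n')
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        # Each header line has the form "Name: value"
        name, _, value = line.partition(':')
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body

# A made-up response used only to exercise the function
sample = (b'HTTP/1.1 200 OK\r\n'
          b'Content-Type: image/jpeg\r\n'
          b'Content-Length: 4\r\n'
          b'\r\n'
          b'\xff\xd8\xff\xe0')

status, headers, body = split_response(sample)
print(status)
print(headers['content-type'])
print(len(body))
```

Header names are stored in lowercase here because HTTP header names are not case-sensitive, so a lookup does not depend on how a particular server capitalizes them.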
As the program runs, you can see that we don't get 5120 characters each time we call the recv() method. We get as many characters as have been transferred across the network to us by the web server at the moment we call recv(). In this example, some calls return as few as 3200 characters even though we request up to 5120 characters of data each time.
Your results may be different depending on your network speed. Also note that on the last call to recv() we get 3167 bytes, which is the end of the stream, and in the next call to recv() we get a zero-length bytes object that tells us that the server has called close() on its end of the socket and there is no more data forthcoming.
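The same behavior can be observed without a network by connecting two sockets inside one program. The sketch below is only an illustration under stated assumptions: socket.socketpair() stands in for the connection to the web server, and the names recv_all and fake_server, along with the payload of repeated bytes, are invented for this example.

```python
import socket
import threading

def recv_all(sock, bufsize=5120):
    """Call recv() until it returns an empty bytes object.

    Returns the collected data and the size of each chunk received.
    recv_all is a hypothetical helper used only for illustration.
    """
    chunks = []
    while True:
        data = sock.recv(bufsize)
        if len(data) < 1:        # the other end has closed the connection
            break
        chunks.append(data)
    return b''.join(chunks), [len(chunk) for chunk in chunks]

def fake_server(sock, payload):
    """Stand-in for the web server: send everything, then close."""
    sock.sendall(payload)
    sock.close()

# Two connected sockets in the same process, used in place of a real network
client, server = socket.socketpair()
payload = b'x' * 230607

worker = threading.Thread(target=fake_server, args=(server, payload))
worker.start()
received, sizes = recv_all(client)
worker.join()
client.close()

print(len(received), 'bytes in', len(sizes), 'recv() calls')
print('largest chunk:', max(sizes))
```

The number of recv() calls and the individual chunk sizes depend on the operating system and timing, so they may differ from run to run. What stays the same is that no chunk is larger than the requested size and that the loop ends when recv() returns an empty bytes object.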
We can slow down our successive recv() calls by uncommenting the call to time.sleep(). This way, we wait a quarter of a second after each call so that the server can "get ahead" of us and send more data to us before we call recv() again. With the delay in place, the program executes as follows:
    $ python urljpeg.py
    5120 5120
    5120 10240
    5120 15360
    ...
    5120 225280
    5120 230400
    207 230607
    Header length 393
    HTTP/1.1 200 OK
    Date: Wed, 11 Apr 2018 21:42:08 GMT
    Server: Apache/2.4.7 (Ubuntu)
    Last-Modified: Mon, 15 May 2017 12:27:40 GMT
    ETag: "38342-54f8f2e5b6277"
    Accept-Ranges: bytes
    Content-Length: 230210
    Vary: Accept-Encoding
    Cache-Control: max-age=0, no-cache, no-store, must-revalidate
    Pragma: no-cache
    Expires: Wed, 11 Jan 1984 05:00:00 GMT
    Connection: close
    Content-Type: image/jpeg
Now, other than the first and last calls to recv(), we get 5120 characters each time we ask for new data.
There is a buffer between the server making send() requests and our application making recv() requests. When we run the program with the delay in place, at some point the server might fill up the buffer in the socket and be forced to pause until our program starts to empty the buffer. The pausing of either the sending application or the receiving application is called "flow control."
Retrieving web pages with urllib
While we can manually send and receive data over HTTP using the socket library, there is a much simpler way to perform this common task in Python by using the urllib library.
Using urllib, you can treat a web page much like a file. You simply indicate which web page you would like to retrieve and urllib handles all of the HTTP protocol and header details.
The equivalent code to read the romeo.txt file from the web using urllib is as follows:
    import urllib.request

    fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
    for line in fhand:
        print(line.decode().strip())

    # Code: http://www.py4e.com/code3/urllib1.py
Once the web page has been opened with urllib.request.urlopen, we can treat it like a file and read through it using a for loop.
When the program runs, we only see the output of the contents of the file. The headers are still sent, but the urllib code consumes the headers and only returns the data to us.
    But soft what light through yonder window breaks
    It is the east and Juliet is the sun
    Arise fair sun and kill the envious moon
    Who is already sick and pale with grief
As an example, we can write a program to retrieve the data for romeo.txt and compute the frequency of each word in the file as follows:
    import urllib.request, urllib.parse, urllib.error

    fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')

    counts = dict()
    for line in fhand:
        words = line.decode().split()
        for word in words:
            counts[word] = counts.get(word, 0) + 1
    print(counts)

    # Code: http://www.py4e.com/code3/urlwords.py
Again, once we have opened the web page, we can read it like a local file.
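Because the object returned by urlopen behaves like a sequence of bytes lines, the counting logic can be written as a function that accepts any such sequence. The sketch below is a variation for illustration only: the name count_words is an assumption, and a short list of bytes lines stands in for the web page so that the example runs without a network connection.

```python
def count_words(lines):
    """Count words in an iterable of bytes lines.

    The iterable could be the object returned by urllib.request.urlopen()
    or a plain list. count_words is a hypothetical helper for illustration.
    """
    counts = dict()
    for line in lines:
        for word in line.decode().split():
            counts[word] = counts.get(word, 0) + 1
    return counts

# Two lines of romeo.txt as bytes, standing in for the web page
sample = [b'But soft what light through yonder window breaks\n',
          b'It is the east and Juliet is the sun\n']

counts = count_words(sample)

# Show the most frequent words first; ties are ordered alphabetically
ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
for word, number in ordered[:3]:
    print(word, number)
```

Passing urllib.request.urlopen('http://data.pr4e.org/romeo.txt') in place of sample would count the words of the whole document in the same way.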
Reading binary files using urllib
Sometimes you want to retrieve a non-text (or binary) file such as an image or video file. The data in these files is generally not useful to print out, but you can easily make a copy of a URL to a local file on your hard disk using urllib.
The pattern is to open the URL and use read to download the entire contents of the document into a bytes variable (img) and then write that information to a local file as follows:
    import urllib.request, urllib.parse, urllib.error

    img = urllib.request.urlopen('http://data.pr4e.org/cover3.jpg').read()
    fhand = open('cover3.jpg', 'wb')
    fhand.write(img)
    fhand.close()

    # Code: http://www.py4e.com/code3/curl1.py
This program reads all of the data in at once across the network and stores it in the variable img in the main memory of your computer, then opens the file cover3.jpg and writes the data out to your disk. The wb argument for open() opens a binary file for writing only. This program will work if the size of the file is less than the size of the memory of your computer.
However, if this is a large audio or video file, this program may crash or at least run extremely slowly when your computer runs out of memory. In order to avoid running out of memory, we retrieve the data in blocks (or buffers) and then write each block to your disk before retrieving the next block. This way the program can read any size file without using up all of the memory you have in your computer.
    import urllib.request, urllib.parse, urllib.error

    img = urllib.request.urlopen('http://data.pr4e.org/cover3.jpg')
    fhand = open('cover3.jpg', 'wb')
    size = 0
    while True:
        info = img.read(100000)
        if len(info) < 1: break
        size = size + len(info)
        fhand.write(info)

    print(size, 'characters copied.')
    fhand.close()

    # Code: http://www.py4e.com/code3/curl2.py
In this example, we read only 100,000 characters at a time and then write those characters to the cover3.jpg file before retrieving the next 100,000 characters of data from the web.
This program runs as follows:
    python curl2.py
    230210 characters copied.
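The block-by-block pattern does not depend on where the data comes from. The sketch below wraps the loop in a function and exercises it with in-memory streams instead of a URL and a disk file; the function name copy_in_blocks and the zero-filled test data are assumptions made for this illustration.

```python
import io

def copy_in_blocks(src, dst, block_size=100000):
    """Copy src to dst one block at a time and return the number of bytes copied.

    src needs a read(n) method and dst a write(data) method, so src could be
    the object returned by urlopen() and dst a file opened with 'wb'.
    copy_in_blocks is a hypothetical helper used only for illustration.
    """
    size = 0
    while True:
        info = src.read(block_size)
        if len(info) < 1:        # nothing left to read
            break
        size = size + len(info)
        dst.write(info)
    return size

# In-memory streams stand in for the web page and the local file
src = io.BytesIO(b'\x00' * 230210)
dst = io.BytesIO()
print(copy_in_blocks(src, dst), 'bytes copied.')
```

Only one block is held in memory at a time, which is why the approach works for files larger than the available memory. The standard library function shutil.copyfileobj() performs a similar block-wise copy.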
Parsing HTML and scraping the web
One of the common uses of the urllib capability in Python is to scrape the web. Web scraping is when we write a program that pretends to be a web browser and retrieves pages, then examines the data in those pages looking for patterns.
As an example, a search engine such as Google will look at the source of one web page and extract the links to other pages and retrieve those pages, extracting links, and so on. Using this technique, Google spiders its way through nearly all of the pages on the web.
Google also uses the frequency of links from pages it finds to a particular page as one measure of how "important" a page is and how high the page should appear in its search results.
Parsing HTML using regular expressions
One simple way to parse HTML is to use regular expressions to repeatedly search for and extract substrings that match a particular pattern.
Here is a simple web page:

    <h1>The First Page</h1>
    <p>
    If you like, you can switch to the
    <a href="http://www.dr-chuck.com/page2.htm">
    Second Page</a>.
    </p>
We can construct a well-formed regular expression to match and extract the link values from the above text as follows:

    href="http[s]?://.+?"

Our regular expression looks for strings that start with "href="http://" or "href="https://", followed by one or more characters (.+?), followed by another double quote. The question mark behind the [s]? indicates to search for the string "http" followed by zero or one "s".
The question mark added to the .+? indicates that the match is to be done in a "non-greedy" fashion instead of a "greedy" fashion. A non-greedy match tries to find the smallest possible matching string and a greedy match tries to find the largest possible matching string.
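The difference between the two kinds of match is easiest to see on a line that contains more than one link. In the sketch below, the HTML string and the example domains are made up for illustration; the two patterns differ only in the question mark after .+

```python
import re

# A made-up line of HTML with two links on it
html = ('<a href="http://a.example/one">One</a> '
        '<a href="https://b.example/two">Two</a>')

# Greedy: .+ grabs as much as it can, so the match runs to the LAST double quote
greedy = re.findall('href="http[s]?://.+"', html)

# Non-greedy: .+? stops at the FIRST double quote that completes a match
non_greedy = re.findall('href="http[s]?://.+?"', html)

print(len(greedy), greedy)
print(len(non_greedy), non_greedy)
```

The greedy pattern returns a single string that swallows both links and the text between them, while the non-greedy pattern returns the two links separately.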
We add parentheses to our regular expression to indicate which part of our matched string we would like to extract, and produce the following program:
    # Search for link values within URL input
    import urllib.request, urllib.parse, urllib.error
    import re
    import ssl

    # Ignore SSL certificate errors
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE

    url = input('Enter - ')
    html = urllib.request.urlopen(url, context=ctx).read()
    links = re.findall(b'href="(http[s]?://.*?)"', html)
    for link in links:
        print(link.decode())

    # Code: http://www.py4e.com/code3/urlregex.py
The ssl library allows this program to access web sites that strictly enforce HTTPS; the context created here skips certificate verification. The read method returns HTML source code as a bytes object instead of returning an HTTPResponse object. The findall regular expression method will give us a list of all of the strings that match our regular expression, returning only the link text between the double quotes.
When we run the program and input a URL, we get the following output:
    Enter - https://docs.python.org
    https://docs.python.org/3/index.html
    https://www.python.org/
    https://docs.python.org/3.8/
    https://docs.python.org/3.7/
    https://docs.python.org/3.5/
    https://docs.python.org/2.7/
    https://www.python.org/doc/versions/
    https://www.python.org/dev/peps/
    https://wiki.python.org/moin/BeginnersGuide
    https://wiki.python.org/moin/PythonBooks
    https://www.python.org/doc/av/
    https://www.python.org/
    https://www.python.org/psf/donations/
    http://sphinx.pocoo.org/
Regular expressions work very nicely when your HTML is well formatted and predictable. But since there are a lot of "broken" HTML pages out there, a solution only using regular expressions might either miss some valid links or end up with bad data.
This can be solved by using a robust HTML parsing library.
Parsing HTML using BeautifulSoup
Even though HTML looks like XML[1] and some pages are carefully constructed to be XML, most HTML is generally broken in ways that cause an XML parser to reject the entire page of HTML as improperly formed.
There are a number of Python libraries which can help you parse HTML and extract data from the pages. Each of the libraries has its strengths and weaknesses and you can pick one based on your needs.
As an example, we will simply parse some HTML input and extract links using the BeautifulSoup library. BeautifulSoup tolerates highly flawed HTML and still lets you easily extract the data you need. You can download and install the BeautifulSoup code from:
https://pypi.python.org/pypi/beautifulsoup4
Information on installing BeautifulSoup with the Python Package Index tool pip is available at:
https://packaging.python.org/tutorials/installing-packages/
We will use urllib to read the page and then use BeautifulSoup to extract the href attributes from the anchor (a) tags.
    # To run this, download the BeautifulSoup zip file
    # http://www.py4e.com/code3/bs4.zip
    # and unzip it in the same directory as this file

    import urllib.request, urllib.parse, urllib.error
    from bs4 import BeautifulSoup
    import ssl

    # Ignore SSL certificate errors
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE

    url = input('Enter - ')
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')

    # Retrieve all of the anchor tags
    tags = soup('a')
    for tag in tags:
        print(tag.get('href', None))

    # Code: http://www.py4e.com/code3/urllinks.py
The program prompts for a web address, then opens the web page, reads the data and passes the data to the BeautifulSoup parser, and then retrieves all of the anchor tags and prints out the href attribute for each tag.
When the program runs, it produces the following output:
    Enter - https://docs.python.org
    genindex.html
    py-modindex.html
    https://www.python.org/
    #
    whatsnew/3.6.html
    whatsnew/index.html
    tutorial/index.html
    library/index.html
    reference/index.html
    using/index.html
    howto/index.html
    installing/index.html
    distributing/index.html
    extending/index.html
    c-api/index.html
    faq/index.html
    py-modindex.html
    genindex.html
    glossary.html
    search.html
    contents.html
    bugs.html
    about.html
    license.html
    copyright.html
    download.html
    https://docs.python.org/3.8/
    https://docs.python.org/3.7/
    https://docs.python.org/3.5/
    https://docs.python.org/2.7/
    https://www.python.org/doc/versions/
    https://www.python.org/dev/peps/
    https://wiki.python.org/moin/BeginnersGuide
    https://wiki.python.org/moin/PythonBooks
    https://www.python.org/doc/av/
    genindex.html
    py-modindex.html
    https://www.python.org/
    #
    copyright.html
    https://www.python.org/psf/donations/
    bugs.html
    http://sphinx.pocoo.org/
This list is much longer because some HTML anchor tags are relative paths (e.g., tutorial/index.html) or in-page references (e.g., '#') that do not include "http://" or "https://", which was a requirement in our regular expression.
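A relative path only becomes a usable address once it is combined with the address of the page it came from. The standard library function urllib.parse.urljoin does this. In the sketch below the base address is an assumption chosen for illustration (a real program would use the address of the page it actually retrieved), and the href values are taken from the output above.

```python
from urllib.parse import urljoin

# Assumed address of the page the links were found on (for illustration only)
base = 'https://docs.python.org/3/'

# A mix of relative and absolute href values from the output above
hrefs = ['tutorial/index.html', 'https://www.python.org/', 'genindex.html']

# Relative paths are resolved against base; absolute URLs are left unchanged
resolved = [urljoin(base, href) for href in hrefs]
for url in resolved:
    print(url)
```

With this step, a spidering program can follow relative links in the same way as absolute ones.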
You can also use BeautifulSoup to pull out various parts of each tag:
    # To run this, download the BeautifulSoup zip file
    # http://www.py4e.com/code3/bs4.zip
    # and unzip it in the same directory as this file

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    import ssl

    # Ignore SSL certificate errors
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE

    url = input('Enter - ')
    html = urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, "html.parser")

    # Retrieve all of the anchor tags
    tags = soup('a')
    for tag in tags:
        # Look at the parts of a tag
        print('TAG:', tag)
        print('URL:', tag.get('href', None))
        print('Contents:', tag.contents[0])
        print('Attrs:', tag.attrs)

    # Code: http://www.py4e.com/code3/urllink2.py
    python urllink2.py
    Enter - http://www.dr-chuck.com/page1.htm
    TAG: <a href="http://www.dr-chuck.com/page2.htm">
    Second Page</a>
    URL: http://www.dr-chuck.com/page2.htm
    Content: ['\nSecond Page']
    Attrs: [('href', 'http://www.dr-chuck.com/page2.htm')]
html.parser is the HTML parser included in the standard Python 3 library. Information on other HTML parsers is available at:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
These examples only begin to show the power of BeautifulSoup when it comes to parsing HTML.
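For comparison, the standard library html.parser module can also be used on its own, without BeautifulSoup. The following is a minimal sketch rather than a replacement for the programs above: the class name LinkCollector is an assumption made for this illustration, and the page text is the simple web page shown earlier in this chapter.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every anchor (a) tag.

    LinkCollector is a hypothetical class used only for illustration.
    """
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

page = '''<h1>The First Page</h1>
<p>
If you like, you can switch to the
<a href="http://www.dr-chuck.com/page2.htm">
Second Page</a>.
</p>'''

parser = LinkCollector()
parser.feed(page)
print(parser.links)
```

The parser calls handle_starttag once for each opening tag it encounters. BeautifulSoup offers a more convenient interface on top of this kind of event-driven parsing, which is why the chapter's examples use it.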
Bonus section for Unix / Linux users
If you have a Linux, Unix, or Macintosh computer, you probably have commands built in to your operating system that retrieve both plain text and binary files using the HTTP or File Transfer (FTP) protocols. One of these commands is curl:
$ curl -O http://www.py4e.com/cover.jpg
The command curl is short for "copy URL" and so the two examples listed earlier to retrieve binary files with urllib are cleverly named curl1.py and curl2.py on www.py4e.com/code3 as they implement similar functionality to the curl command. There is also a curl3.py sample program that does this task a little more effectively, in case you actually want to use this pattern in a program you are writing.
A second command that functions very similarly is wget:
$ wget http://www.py4e.com/cover.jpg
Both of these commands make retrieving webpages and remote files a simple task.
Glossary
- BeautifulSoup
- A Python library for parsing HTML documents and extracting data from HTML documents that compensates for most of the imperfections in the HTML that browsers generally ignore. You can download the BeautifulSoup code from www.crummy.com.
- port
- A number that generally indicates which application you are contacting when you make a socket connection to a server. As an example, web traffic usually uses port 80 while email traffic uses port 25.
- scrape
- When a program pretends to be a web browser and retrieves a web page, then looks at the web page content. Often programs are following the links in one page to find the next page so they can traverse a network of pages or a social network.
- socket
- A network connection between two applications where the applications can send and receive data in either direction.
- spider
- The act of a web search engine retrieving a page and then all the pages linked from a page and so on until they have nearly all of the pages on the Internet which they use to build their search index.
Exercises
Exercise 1: Change the socket program socket1.py to prompt the user for the URL so it can read any web page. You can use split('/') to break the URL into its component parts so you can extract the host name for the socket connect call. Add error checking using try and except to handle the condition where the user enters an improperly formatted or non-existent URL.
Exercise 2: Change your socket program so that it counts the number of characters it has received and stops displaying any text after it has shown 3000 characters. The program should retrieve the entire document and count the total number of characters and display the count of the number of characters at the end of the document.
Exercise 3: Use urllib to replicate the previous exercise of (1) retrieving the document from a URL, (2) displaying up to 3000 characters, and (3) counting the overall number of characters in the document. Don't worry about the headers for this exercise, simply show the first 3000 characters of the document contents.
Exercise 4: Change the urllinks.py program to extract and count paragraph (p) tags from the retrieved HTML document and display the count of the paragraphs as the output of your program. Do not display the paragraph text, only count them. Test your program on several small web pages as well as some larger web pages.
Exercise 5: (Advanced) Change the socket program so that it only shows data after the headers and a blank line have been received. Remember that recv receives characters (newlines and all), not lines.
[1] The XML format is described in the next chapter.
If you find a mistake in this book, feel free to send me a fix using Github.
Source: https://www.py4e.com/html3/12-network