Using the Tesseract library in Python to recognize verification codes (24)

(1) About Tesseract

Tesseract is an OCR library (OCR stands for Optical Character Recognition). It analyzes and processes a scanned image file to extract the text and layout information it contains. Tesseract is widely regarded as one of the most accurate open-source OCR engines.

(2) Use of Tesseract

1. Download and install Tesseract.

2. Set the environment variables on Windows:

# Configure the environment variables according to the directory where Tesseract was installed

3. Install the pytesseract module:

pip install pytesseract

4. Point the Python script at the tesseract.exe application:

pytesseract.pytesseract.tesseract_cmd = r'F:\Tesseract-OCR\tesseract.exe'
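The hard-coded path above only works if Tesseract was installed in exactly that directory. As a minimal sketch, assuming the executable is named tesseract, the lookup can try the PATH first and fall back to a fixed location. The helper resolve_tesseract_cmd and its parameters are hypothetical names for illustration, not part of pytesseract:

```python
import os
import shutil


def resolve_tesseract_cmd(fallback, which=shutil.which, is_file=os.path.isfile):
    """Return a usable path to the Tesseract executable, or None.

    `fallback` is a hard-coded install path to try when the executable
    is not on the PATH. `which` and `is_file` are injectable only so the
    lookup can be exercised without a real installation.
    """
    # shutil.which searches the PATH and returns the full path or None
    found = which("tesseract")
    if found:
        return found
    # Otherwise accept the fallback only if that file really exists
    if is_file(fallback):
        return fallback
    return None
```

A script could then call cmd = resolve_tesseract_cmd(r'F:\Tesseract-OCR\tesseract.exe') and assign pytesseract.pytesseract.tesseract_cmd = cmd only when cmd is not None.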

5. Case demonstration

Recognize the text in the following picture:

import pytesseract
from PIL import Image
# 1. Point pytesseract at the Tesseract program
pytesseract.pytesseract.tesseract_cmd = r'F:\Tesseract-OCR\tesseract.exe'
# 2. Open the picture with the open() function of the Image module
image = Image.open('6.jpg', mode='r')
print(image)
# 3. Recognize the text in the picture
code = pytesseract.image_to_string(image)
print(code)

Results demonstration:

<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=611x210 at 0x1A5DFDCB4A8>

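Note: the Tesseract OCR engine cannot reliably recognize every image. Generated verification codes (CAPTCHAs), for example, often come back wrong or empty. If you need to crawl data from a site such as Douban, you may have to enter the verification code manually, as the login code in the next section allows.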
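Recognition of noisy pictures sometimes improves if the image is converted to pure black and white before it is handed to Tesseract. The following is only a sketch of that idea, not something the original code does. The helper names and the threshold value of 140 are assumptions that would need tuning for each kind of picture:

```python
def threshold_table(threshold=140):
    """Build a 256-entry lookup table for 8-bit grayscale pixels.

    Values below `threshold` map to 0 (black), all others to 255 (white).
    The default of 140 is an arbitrary starting point, not a recommended value.
    """
    return [0 if value < threshold else 255 for value in range(256)]


def binarize(image, threshold=140):
    """Return a black-and-white copy of a PIL image object.

    `image` is expected to be a PIL.Image.Image, for example the result
    of Image.open('captcha.jpg').
    """
    # convert('L') produces an 8-bit grayscale image;
    # point() maps every pixel through the lookup table
    return image.convert('L').point(threshold_table(threshold))
```

The cleaned picture can then be passed on as before, for example pytesseract.image_to_string(binarize(image)). Whether this helps depends heavily on the picture, so the manual fallback is still worth keeping.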

(3) Source code for a simulated Zhihu login

import requests
import time
import pytesseract
from PIL import Image
from bs4 import BeautifulSoup

def captcha(data):
    # Save the captcha image bytes to disk so that PIL can open them
    with open('captcha.jpg', 'wb') as fp:
        fp.write(data)
    image = Image.open("captcha.jpg")
    text = pytesseract.image_to_string(image)
    print("The verification code after machine identification is: " + text)
    command = input("Enter Y to accept it, or press any other key to type the code yourself: ")
    if command == "Y" or command == "y":
        return text
    else:
        return input('Enter verification code: ')

def zhihuLogin(username, password):

    # Build a session object to save the Cookie value
    sessiona = requests.Session()
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'}

    # First get the page, find the data that has to be POSTed (and record the Cookie of the current page)
    html = sessiona.get('', headers=headers).content

    # Find the input tag whose name attribute is _xsrf and take out its value
    _xsrf = BeautifulSoup(html, 'lxml').find('input', attrs={'name': '_xsrf'}).get('value')

    # Fetch the verification code. The value after r is the Unix timestamp in milliseconds, time.time() * 1000
    captcha_url = '' % (time.time() * 1000)
    response = sessiona.get(captcha_url, headers=headers)

    data = {
        "_xsrf": _xsrf,
        "email": username,      # field name is an assumption about the login form
        "password": password,   # field name is an assumption about the login form
        "captcha": captcha(response.content)
    }

    response = sessiona.post('', data=data, headers=headers)
    print(response.text)

    response = sessiona.get('', headers=headers)
    print(response.text)

if __name__ == "__main__":
    username = input("username")
    password = input("password")
    zhihuLogin(username, password)
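The text returned by image_to_string often carries stray whitespace, such as a trailing newline, which would make the submitted code wrong even when the characters themselves were recognized correctly. A small clean-up step before returning the text from captcha() may help. This is a sketch that assumes the verification code consists only of ASCII letters and digits. clean_captcha_text is a hypothetical helper, not part of pytesseract:

```python
import string

# Assumption: the verification code uses only ASCII letters and digits
ALLOWED_CHARS = set(string.ascii_letters + string.digits)


def clean_captcha_text(text):
    """Drop whitespace and any character outside the assumed alphabet."""
    return "".join(ch for ch in text if ch in ALLOWED_CHARS)


if __name__ == "__main__":
    # Simulated OCR output with spaces and trailing control characters
    print(clean_captcha_text(" aB3 d\n\x0c"))
```

In the captcha() function above, this would be applied as text = clean_captcha_text(pytesseract.image_to_string(image)).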

Tags: Windows Session pip Python

Posted on Fri, 03 Apr 2020 08:58:10 -0700 by wiggst3r