[Interesting Case] Using Python to Crack a Verification Code (CAPTCHA)

1. Experiment Description

This experiment uses a simple example to explain the principle of decoding a verification code (CAPTCHA), practicing the following knowledge points:

  1. Python basics
  2. Use of the PIL (Pillow) module

2. Experiment Content

Install the Pillow (PIL fork) library:

$ sudo apt-get update

$ sudo apt-get install python3-dev

$ sudo apt-get install libtiff5-dev libjpeg8-dev zlib1g-dev \
libfreetype6-dev liblcms2-dev libwebp-dev tcl8.6-dev tk8.6-dev python3-tk

$ sudo pip3 install pillow

Download the experimental file:

$ wget http://labfile.oss.aliyuncs.com/courses/364/python_captcha.zip
$ unzip python_captcha.zip
$ cd python_captcha

This is captcha.gif, the verification code used in this experiment.

Extract the text image

Create a new crack.py file in the working directory and edit it.


# -*- coding: utf8 -*-
from PIL import Image

im = Image.open("captcha.gif")
# Convert the picture to 8-bit palette ("P") mode
im = im.convert("P")

# Print the color histogram
print(im.histogram())

Output:

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 1, 2, 0, 1, 0, 0, 1, 0, 2, 0, 0, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 3, 1, 3, 3, 0, 0, 0, 0, 0, 0, 1, 0, 3, 2, 132, 1, 1, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 15, 0, 1, 0, 1, 0, 0, 8, 1, 0, 0, 0, 0, 1, 6, 0, 2, 0, 0, 0, 0, 18, 1, 1, 1, 1, 1, 2, 365, 115, 0, 1, 0, 0, 0, 135, 186, 0, 0, 1, 0, 0, 0, 116, 3, 0, 0, 0, 0, 0, 21, 1, 1, 0, 0, 0, 2, 10, 2, 0, 0, 0, 0, 2, 10, 0, 0, 0, 0, 1, 0, 625]

Each entry of the color histogram gives the number of pixels in the picture whose palette index equals that entry's position.

An 8-bit palette image can hold 256 colors, and you will find that white pixels dominate: index 255 (white) is the last entry, and it counts 625 white pixels. The indices of the red text pixels cluster around 200. By sorting the counts we can pick out the useful colors.
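The meaning of the histogram can be checked with plain Python: for a paletted image, entry i counts the pixels whose palette index is i. A minimal sketch using a made-up pixel list (not the actual captcha data):

```python
from collections import Counter

# Hypothetical pixel data for a tiny 3x3 paletted image:
# palette index 255 is the white background, 220 the red text.
pixels = [255, 255, 220, 255, 220, 220, 255, 255, 255]

# Build a 256-entry histogram: histogram[i] = number of pixels with color i.
counts = Counter(pixels)
histogram = [counts.get(i, 0) for i in range(256)]

print(histogram[255])  # 6 white pixels
print(histogram[220])  # 3 red pixels
```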

his = im.histogram()
values = {}

for i in range(256):
    values[i] = his[i]

for j, k in sorted(values.items(), key=lambda x: x[1], reverse=True)[:10]:
    print(j, k)

Output:

255 625
212 365
220 186
219 135
169 132
227 116
213 115
234 21
205 18
184 15

We get the 10 most frequent colors in the picture, of which 220 and 227 are the red and gray we need. With this information we can construct a black-and-white binary picture.


# -*- coding: utf8 -*-
from PIL import Image

im = Image.open("captcha.gif")
im = im.convert("P")
im2 = Image.new("P", im.size, 255)

for x in range(im.size[1]):
    for y in range(im.size[0]):
        pix = im.getpixel((y, x))
        if pix == 220 or pix == 227:  # the two colors of the text
            im2.putpixel((y, x), 0)

im2.show()

Results obtained:
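As an aside, the per-pixel loop above can also be written with Pillow's Image.point, which maps every 8-bit pixel value through a lookup function in one call. A small self-contained sketch on a synthetic image (not the article's captcha.gif):

```python
from PIL import Image

# Build a small paletted test image instead of captcha.gif, so the
# snippet runs standalone: background 255, "text" pixels 220 and 227.
im = Image.new("P", (4, 2), 255)
im.putpixel((1, 0), 220)
im.putpixel((2, 1), 227)

# Map the two text colors to black (0) and everything else to white (255).
im2 = im.point(lambda p: 0 if p in (220, 227) else 255)

print(list(im2.getdata()))  # [255, 0, 255, 255, 255, 255, 0, 255]
```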

Extract single-character pictures

The next step is to get the pixel set of each single character. Because the example is relatively simple, we cut it vertically:

inletter = False
foundletter = False
start = 0
end = 0

letters = []

for y in range(im2.size[0]):
    for x in range(im2.size[1]):
        pix = im2.getpixel((y, x))
        if pix != 255:
            inletter = True
    if not foundletter and inletter:
        foundletter = True
        start = y

    if foundletter and not inletter:
        foundletter = False
        end = y
        letters.append((start, end))

    inletter = False
print(letters)

Output:

[(6, 14), (15, 25), (27, 35), (37, 46), (48, 56), (57, 67)]

This gives the start and end column numbers of each character.
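The column scan does not actually depend on PIL; the same logic can be verified on a plain 2D array of palette indices. A toy example with two narrow "letters" (made-up data, not the article's image):

```python
# Toy 3x6 "image" as rows of palette indices: 255 = background, 0 = ink.
# Columns 1 and 3-4 contain ink, so we expect two letters.
rows = [
    [255, 0,   255, 0,   255, 255],
    [255, 0,   255, 255, 0,   255],
    [255, 255, 255, 0,   255, 255],
]
width = len(rows[0])

letters = []
foundletter = False
start = end = 0

for y in range(width):  # scan columns left to right
    inletter = any(row[y] != 255 for row in rows)
    if not foundletter and inletter:
        foundletter = True
        start = y               # first column containing ink
    if foundletter and not inletter:
        foundletter = False
        end = y                 # first empty column after the letter
        letters.append((start, end))

print(letters)  # [(1, 2), (3, 5)]
```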

import hashlib
import time

count = 0
for letter in letters:
    m = hashlib.md5()
    im3 = im2.crop((letter[0], 0, letter[1], im2.size[1]))
    # Hash the time and index to get a unique file name
    m.update(("%s%s" % (time.time(), count)).encode("utf8"))
    im3.save("./%s.gif" % (m.hexdigest()))
    count += 1

(This continues the code above.)

Cutting the picture gives us one image for each character.
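Note that crop takes a (left, upper, right, lower) box whose right and lower edges are excluded, so crop((start, 0, end, height)) keeps exactly columns start through end-1. The same slice in plain Python, on a toy 2D array for illustration only:

```python
# Toy 2x5 "image": each value stands in for a pixel.
rows = [
    [10, 11, 12, 13, 14],
    [20, 21, 22, 23, 24],
]

start, end = 1, 3  # a (start, end) column range from the segmentation step

# Equivalent of im2.crop((start, 0, end, height)): keep columns start..end-1.
cropped = [row[start:end] for row in rows]

print(cropped)  # [[11, 12], [21, 22]]
```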

AI and vector space image recognition

Here we use a vector space search engine for character recognition, which has several advantages:

  • No need for a large number of training iterations
  • No risk of overtraining
  • You can add or remove mislabeled data at any time and check the effect
  • Easy to understand and to code
  • Provides ranked results, so you can inspect the closest matches
  • Anything it fails to recognize can simply be added to the search engine and is recognized immediately

Of course, it also has disadvantages: for example, classification is much slower than with a neural network, and it cannot find its own way to solve a problem.

The name "vector space search engine" sounds grand, but the principle is simple. Take the classic document example:

Suppose you have three documents. How do we compute the similarity between them? The more words two documents share, the more similar they are! But what if there are too many words? We select a few key words; the selected words are also called features. Each feature is like a dimension (x, y, z, and so on) in space, and a group of features forms a vector. We can compute such a vector for each document; the angle between two vectors then gives the similarity of the documents.

Implement the vector space with a Python class:

import math

class VectorCompare:
    # Calculate the magnitude (length) of a vector
    def magnitude(self, concordance):
        total = 0
        for word, count in concordance.items():
            total += count ** 2
        return math.sqrt(total)

    # Calculate the cosine of the angle between two vectors
    def relation(self, concordance1, concordance2):
        topvalue = 0
        for word, count in concordance1.items():
            if word in concordance2:
                topvalue += count * concordance2[word]
        return topvalue / (self.magnitude(concordance1) * self.magnitude(concordance2))

It compares two Python dictionaries and outputs their similarity as a number between 0 and 1.
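For example, restating relation as a standalone Python 3 function (the same cosine formula as in the class above) and feeding it two made-up feature dictionaries:

```python
import math

def magnitude(c):
    # Euclidean length of a sparse vector stored as a dict.
    return math.sqrt(sum(v ** 2 for v in c.values()))

def relation(c1, c2):
    # Cosine similarity: dot product over the product of the lengths.
    dot = sum(count * c2[key] for key, count in c1.items() if key in c2)
    return dot / (magnitude(c1) * magnitude(c2))

a = {"x": 1, "y": 2}
b = {"x": 1, "y": 2, "z": 3}

print(relation(a, a))                # identical vectors: ~1.0
print(relation({"x": 1}, {"y": 1}))  # no shared features: 0.0
print(relation(a, b))                # partial overlap: between 0 and 1
```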

Put the previous content together

Many more verification code images still need to be cut into single-character images to build a training set, but anyone who has followed the steps above carefully will know how to do this, so it is omitted here. You can use the provided training set directly for the following steps.

Under the iconset directory is our training set.


import os

# Convert a picture to a vector
def buildvector(im):
    d1 = {}
    count = 0
    for i in im.getdata():
        d1[count] = i
        count += 1
    return d1

v = VectorCompare()

iconset = ['0','1','2','3','4','5','6','7','8','9',
           'a','b','c','d','e','f','g','h','i','j','k','l','m',
           'n','o','p','q','r','s','t','u','v','w','x','y','z']

# Load the training set
imageset = []
for letter in iconset:
    for img in os.listdir('./iconset/%s/' % (letter)):
        temp = []
        if img != "Thumbs.db" and img != ".DS_Store":
            temp.append(buildvector(Image.open("./iconset/%s/%s" % (letter, img))))
        imageset.append({letter: temp})

count = 0
# Cut the verification code picture
for letter in letters:
    m = hashlib.md5()
    im3 = im2.crop((letter[0], 0, letter[1], im2.size[1]))

    guess = []

    # Compare each cut segment of the verification code with every training image
    for image in imageset:
        for x, y in image.items():
            if len(y) != 0:
                guess.append((v.relation(y[0], buildvector(im3)), x))

    guess.sort(reverse=True)
    print(guess[0])
    count += 1

Get results

We are ready to run our code:

$ python3 crack.py

Output:

(0.96376811594202894, '7')
(0.96234028545977002, 's')
(0.9286884286888929, '9')
(0.98350370609844473, 't')
(0.96751165072506273, '9')
(0.96989711688772628, 'j')

That is the correct answer: 7s9t9j. Nice job.


Posted on Wed, 29 Apr 2020 01:58:14 -0700 by Asday