Regular expressions for web crawlers

1. Why learn regular expressions for crawling?
When crawling a web page, we often need only part of the content of a tag, or only part of the value of one of its attributes. XPath and CSS selectors can locate the tag, but they cannot easily cut a substring out of its text or attribute value. In those cases we use a regular expression to match and extract exactly the piece we want.
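
A minimal sketch of that situation, assuming a made-up HTML fragment (the tag, the URL layout, and the variable names are invented for illustration). It uses re.search, which looks for the pattern anywhere in the string, to pull one part out of an attribute value and one part out of the tag text:

```python
import re

# Hypothetical HTML fragment, invented for this example.
html = '<a class="title" href="/article/12345.html">Regex basics (2019)</a>'

# Capture only the digits of the article id inside the href attribute.
id_match = re.search(r'href="/article/(\d+)\.html"', html)
article_id = id_match.group(1) if id_match else None

# Capture only the four-digit year in parentheses inside the tag text.
year_match = re.search(r'>[^<]*\((\d{4})\)</a>', html)
year = year_match.group(1) if year_match else None

print(article_id, year)   # 12345 2019
```

In a real crawler the tag would normally be located with XPath or a CSS selector first, and the regular expression applied only to the extracted text or attribute.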
2. What is a regular expression?
A regular expression (often abbreviated to regex, regexp, or RE in code) is a concept from computer science: a pattern that describes a set of strings. Regular expressions are typically used to search for, extract, and replace text that conforms to a pattern (rule).
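
As a small illustration of "retrieve and replace", the sketch below uses two functions from Python's standard re module, re.findall and re.sub. The sample text is made up:

```python
import re

text = 'price: 30 yuan, discount: 5 yuan'   # made-up sample text

# Retrieve: collect every run of digits in the text.
numbers = re.findall(r'\d+', text)

# Replace: swap every run of digits for a placeholder.
masked = re.sub(r'\d+', 'N', text)

print(numbers)   # ['30', '5']
print(masked)    # price: N yuan, discount: N yuan
```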

'''
Regular expression basics
'''

import re

line = 'jijianXksA123'

# ^j : "^" anchors the match to the start of the string, so the string must start with j
# .  : matches any single character (except a newline)
# *  : the preceding element may appear any number of times (0 or more)
reg_str01 = '^j.*'    # a string that starts with j

# re.match(pattern, string)
# The first parameter is the pattern.
# The second parameter is the string to match against.
# Matching always begins at the start of the string.
# Return value: a match object on success, otherwise None.

if re.match(reg_str01, line):
    print("Match succeeded!")   # reg_str01 = '^j.*' succeeds
else:
    print("Matching failed!")   # reg_str01 = '^i.*' would fail

# 23$ : "$" anchors the match to the end of the string, so the string must end with 23
reg_str02 = '^j.*23$'
if re.match(reg_str02, line):
    print("Match succeeded!")   # reg_str02 = '^j.*23$' succeeds
else:
    print("Matching failed!")   # reg_str02 = '^j.*13$' would fail


line01 = 'boooboaobxby'
# (...) : a capture group; read what it matched with the match object's group() method
# Greedy matching: the leading .* consumes as much as it can, so the
# group ends up matching as far to the right as possible.
reg_str03 = '.*(b.*b).*'
test01 = re.match(reg_str03, line01)
if test01:
    print(test01.group(1))      # result: bxb
else:
    print("Matching failed!")

# Non-greedy (lazy) matching: a ? after a quantifier makes it consume as
# little as possible, so the group starts at the first place (from the left)
# where the pattern fits.
reg_str03 = '.*?(b.*b).*'   # lazy prefix, greedy group ("semi-greedy")
reg_str04 = '.*?(b.*?b).*'  # lazy prefix, lazy group (non-greedy)
test01 = re.match(reg_str03, line01)
test02 = re.match(reg_str04, line01)
if test01 and test02:
    print(test01.group(1))      # result: boooboaobxb
    print(test02.group(1))      # result: booob
else:
    print("Matching failed!")
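
The patterns above all begin with .* or .*? because re.match only tries to match at the start of the string. The sketch below contrasts it with re.search, another function in the standard re module that scans the whole string, and shows the same greedy and lazy behaviour without the leading prefix. The variable names are chosen for this example only:

```python
import re

line01 = 'boooboaobxby'

# re.match only tries position 0, so a pattern that starts with "a" fails here.
at_start = re.match(r'a.b', line01)

# re.search scans forward and returns the first place where the pattern fits.
anywhere = re.search(r'a.b', line01)

# Greedy vs. lazy with re.search: no leading .* is needed.
greedy = re.search(r'b.*b', line01).group(0)
lazy = re.search(r'b.*?b', line01).group(0)

print(at_start)            # None
print(anywhere.group(0))   # aob
print(anywhere.span())     # (6, 9)
print(greedy)              # boooboaobxb
print(lazy)                # booob
```

Note that re.search always reports the leftmost match, so it cannot reproduce the "rightmost" result bxb that the greedy .* prefix gives with re.match.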

 

import re

line01 = 'boooboaobcxby'

def regtest(reg_str, line=line01):
    """Print the first capture group if reg_str matches line, else a failure message."""
    test = re.match(reg_str, line)
    if test:
        print(test.group(1))
    else:
        print("Matching failed!")

# + : the preceding element must appear at least once
reg_str04 = '.*(b.+b).*'    # (b.+b): at least one character between the two b's
regtest(reg_str04)          # result: bcxb

# {n} : controls how many times the preceding element is repeated
# a{2}   : a exactly 2 times
# b{3,4} : b at least 3 times, at most 4 times
# c{4,}  : c at least 4 times
reg_str05 = '.*(b.{2}b).*'      # (b.{2}b): exactly 2 characters between the two b's
reg_str06 = '.*(b.{3,4}b).*'    # (b.{3,4}b): at least 3 and at most 4 characters between the two b's
reg_str07 = '.*(b.{4,}b).*'     # (b.{4,}b): at least 4 characters between the two b's
regtest(reg_str05)   # result: bcxb
regtest(reg_str06)   # result: boaob
regtest(reg_str07)   # result: boaobcxb

# | : alternation ("or")
# (abc|123) : matches either abc or 123
reg_str08 = '.*(boo|abc)'
reg_str09 = '.*(abc|boo)'
regtest(reg_str08)   # result: boo
regtest(reg_str09)   # result: boo  (line01 contains no "abc")

# [] : a character class; matches exactly one character from the set.
#      Most special characters (such as . and *) lose their special meaning inside [].
# [abcd] : matches one character that is a, b, c, or d
# [0-9]  : matches one character in the range 0-9
# [^x]   : matches one character that is not x
line02 = 'Tel: 15573563467'
reg_str10 = '.*(1[3458][0-9]{9}).*'
reg_str11 = '.*(1[3458][^1]{9}).*'
regtest(reg_str10, line02)   # result: 15573563467
regtest(reg_str11, line02)   # result: 15573563467

# \s : matches one whitespace character (space, tab, newline, ...)
# \S : matches one character that is not whitespace
# \w : matches one word character: A-Z, a-z, 0-9, or _
#      (for str patterns in Python 3, also other Unicode letters and digits)
# \W : the opposite of \w
# \d : matches one digit
# [\u4E00-\u9FA5] : matches one Chinese character (Unicode range U+4E00 to U+9FA5)

def regtest_test(reg_str, line=line01):
    """Print the four capture groups as name:year-month-day."""
    test = re.match(reg_str, line)
    if test:
        print(test.group(1) + ':' + test.group(2) + '-' + test.group(3) + '-' + test.group(4))
    else:
        print("Matching failed!")

# Simple example: extract the name and the date of birth.
# 年 (year), 月 (month), and 日 (day) are the markers used in Chinese-style dates.
str01 = 'Zhang San was born in 1997年12月20日'
str02 = 'Li Si was born in 1989-01-20'
str03 = 'Wang Wu was born in 1997/2/5'
str04 = 'Zhao Liu was born in 1997.12.20'
samples = [str01, str02, str03, str04]   # avoid naming the list "str", which shadows the built-in

# Pattern: name, then a 4-digit year, a separator, a 1-2 digit month,
# a separator, and a 1-2 digit day. A raw string keeps \d intact.
reg_str12 = r'(.*) was born in (\d{4})[.年/-](\d{1,2})[.月/-](\d{1,2}).*?'
for sample in samples:
    regtest_test(reg_str12, sample)
# result:
#       Zhang San:1997-12-20
#       Li Si:1989-01-20
#       Wang Wu:1997-2-5
#       Zhao Liu:1997-12-20
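
When the same pattern is applied to many lines, as a crawler usually does, it can be compiled once and its groups given names. The sketch below is one possible variant of the example above, not the original author's code: BIRTH_RE, parse_birth, and the group names are invented for illustration, and it handles only the ".", "/", and "-" separators. It also pads the month and day to two digits.

```python
import re

# Compiled once and reused for every line. Names are chosen for this sketch.
BIRTH_RE = re.compile(
    r'(?P<name>.+?) was born in '
    r'(?P<year>\d{4})[./-](?P<month>\d{1,2})[./-](?P<day>\d{1,2})'
)

def parse_birth(line):
    """Return (name, 'YYYY-MM-DD'), or None if the line does not match."""
    m = BIRTH_RE.match(line)
    if m is None:
        return None
    date = '{}-{:02d}-{:02d}'.format(
        m.group('year'), int(m.group('month')), int(m.group('day'))
    )
    return m.group('name'), date

print(parse_birth('Wang Wu was born in 1997/2/5'))      # ('Wang Wu', '1997-02-05')
print(parse_birth('Zhao Liu was born in 1997.12.20'))   # ('Zhao Liu', '1997-12-20')
print(parse_birth('no date here'))                      # None
```

Named groups make the extraction code independent of the order of the parentheses, which helps when the pattern is later extended.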

Posted on Sun, 01 Dec 2019 05:48:14 -0800 by ttomnmo