Use Python to segment text into punctuated sentences

Word segmentation toolkits such as jieba make it easy to cut a sentence into words. But what if you need to cut a text into whole sentences instead?

  • Divide paragraphs into sentences by periods

1. Word segmentation with jieba

Word segmentation is a common operation when processing Chinese text, for example:

import jieba.posseg as pseg
txt = "[#Female Mercedes-Benz owners do not accept 4 S Shop apology# The two sides spoke fiercely on the spot. In April 13th, the 4 women and owners of Mercedes Benz S When the stores met, the two sides did not settle on the spot. Four S Relevant shopkeepers said that due to travel and other reasons, the owner did not contact the owner in time, but the owner retorted that he could contact the owner by telephone, “Nobody gave me your contact information.” In the course of negotiation, the two sides had a fierce verbal confrontation.#Car owners' rights protection# "
words = pseg.cut(txt)
for word, flag in words:
    print('%s %s' % (word, flag))

This code segments the text into words and tags each word with its part of speech. It produces the following output (abridged):

[ x
# x
 Mercedes Benz v
 Female b
 Owner n
...

2. Segmentation of whole sentences by punctuation

Maybe you don't need word segmentation at all and only want to split the text at punctuation marks. In that case you can do this:

import re
txt = "[#Female Mercedes-Benz owners do not accept 4 S Shop apology# In April 13th, Xi'an's Mercedes Benz female owners and 4 S When the stores met, the two sides did not settle on the spot. Four S Relevant shopkeepers said that due to travel and other reasons, the owner did not contact the owner in time, but the owner retorted that he could contact the owner by telephone, “Nobody gave me your contact information.” In the course of negotiation, the two sides had a fierce verbal confrontation.#Car owners' rights protection#  "
# Split on ASCII punctuation and on the full-width punctuation used in Chinese text
# (\u3000 is the full-width space)
pattern = r',|\.|/|;|\'|`|\[|\]|<|>|\?|:|"|\{|\}|\~|!|@|#|\$|%|\^|&|\(|\)|-|=|\_|\+|,|。|、|;|‘|’|【|】|·|!|\u3000|…|(|)'
result_list = re.split(pattern, txt)
print(result_list)

This splits the text with a regular expression: every punctuation mark in the pattern becomes a cut point, and adjacent delimiters leave empty strings in the list. The result is as follows:

['', '', 'Female Mercedes-Benz owners do not accept 4 S Shop apology', '', 'On-site negotiations between the two sides were fierce', '4 13 June', 'Mercedes Benz female owners and 4 S Shop meeting', 'The two sides did not reconcile on the spot', '4S The person in charge of the store', 'Failure to contact the owner in time due to business trip and other reasons', 'The owner retorted', 'Contact by telephone', '"Nobody gave me your contact information.', '"The verbal confrontation between the two sides was intense in the course of negotiation.', '', "Xi'an Mercedes Benz female owners' rights protection", '', '']
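The empty strings in such a result appear wherever two delimiters sit next to each other or a delimiter starts or ends the text. As a minimal sketch, assuming you want those dropped, a character class followed by `+` treats a run of delimiters as a single cut point. The names `DELIMITERS`, `split_clauses` and the sample text are illustrative, and the class lists only a subset of the punctuation used in the pattern above.

```python
import re

# Hypothetical sample text; any string with mixed punctuation works.
sample = "[#Topic one# First clause, second clause. Third clause!"

# A character class is easier to maintain than a long alternation.
# The trailing + treats a run of delimiters as one split point.
DELIMITERS = re.compile(r"[\[\]#,.!?;:,。、;!?【】…]+")

def split_clauses(text):
    """Split text at punctuation and drop empty or whitespace-only pieces."""
    pieces = DELIMITERS.split(text)
    return [piece.strip() for piece in pieces if piece.strip()]

print(split_clauses(sample))
# ['Topic one', 'First clause', 'second clause', 'Third clause']
```

Add or remove characters inside the brackets to control where the text is cut.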

3. Whole Sentence Segmentation - Segmentation by Period

Maybe you want complete sentences instead: keep the commas and cut only at periods and the other marks. To do that, just remove the comma alternatives from the regular expression.

import re
txt = "[#Female Mercedes-Benz owners do not accept 4 S Shop apology# In April 13th, Xi'an's Mercedes Benz female owners and 4 S When the stores met, the two sides did not settle on the spot. Four S Relevant shopkeepers said that due to travel and other reasons, the owner did not contact the owner in time, but the owner retorted that he could contact the owner by telephone, “Nobody gave me your contact information.” In the course of negotiation, the two sides had a fierce verbal confrontation.#Car owners' rights protection# "
# Same pattern as in section 2, with the two comma alternatives (, and ,) removed
pattern = r'\.|/|;|\'|`|\[|\]|<|>|\?|:|"|\{|\}|\~|!|@|#|\$|%|\^|&|\(|\)|-|=|\_|\+|。|、|;|‘|’|【|】|·|!|\u3000|…|(|)'
result_list = re.split(pattern, txt)
print(result_list)

The result is as follows:

['', '', 'Female Mercedes-Benz owners do not accept 4 S Shop apology', '', 'On-site negotiations between the two sides were fierce', '4 On 13 June, Mercedes Benz female owners and 4 S When the stores met, the two sides did not settle on the spot.', '4S The person in charge of the store said that the owner had not been contacted in time because of business trip and other reasons. The owner retorted that he could contact the owner by telephone. "No one gave me your contact information."', '"The verbal confrontation between the two sides was intense in the course of negotiation.', '', "Car owners' rights protection", '', '']
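Note that `re.split` discards the delimiters it splits on, so the sentence-ending marks are lost. If they should stay attached to each sentence, one option is to match sentences with `re.findall` instead of splitting. This is only a sketch under simple assumptions: the names `SENTENCE` and `split_sentences` and the sample text are illustrative, and only `.`, `!`, `?` and their full-width forms count as sentence terminators.

```python
import re

# One sentence = a run of non-terminator characters followed by
# its closing terminator(s), ASCII or full-width.
SENTENCE = re.compile(r"[^.!?。!?]+[.!?。!?]*")

def split_sentences(text):
    """Return the sentences in text with their end punctuation kept."""
    return [s.strip() for s in SENTENCE.findall(text) if s.strip()]

# Hypothetical sample; the last sentence is Chinese for "The weather is nice today."
sample = "The two sides met, but did not settle. Who is responsible? 今天天气很好。"
print(split_sentences(sample))
# ['The two sides met, but did not settle.', 'Who is responsible?', '今天天气很好。']
```

Commas stay inside each sentence because they are not in the terminator set. Abbreviations and decimal numbers that contain a period would still be cut, so this simple rule suits text where a period always ends a sentence.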

Did you get it?

Posted on Wed, 09 Oct 2019 10:00:57 -0700 by tkm