Chinese data set

When you work on NLP tasks on your own, finding a corpus is often a headache.
Many public corpora are either sold by others or can only be downloaded with points.
This causes a lot of inconvenience in day-to-day development.
I have collected some corpora of my own here for your convenience.

News text classification corpus

THUCNews was generated by filtering historical data from the Sina News RSS subscription channels from 2005 to 2011. It includes 740,000 news documents (2.19 GB), all in UTF-8 plain text format. Many thanks to the seniors and peers who helped me along the way.
GitHub address: text-classification-with-cnn-and-rnn
Download address

A simplified subset of the cnews news data
For training, 10 categories are used, with 6,500 documents in each category.

The categories are as follows:

Sports, finance, real estate, home furnishing, education, technology, fashion, current affairs, games, entertainment

This subset can be downloaded here: link: password: qfud
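To check that a downloaded copy really has the balanced categories described above, a small sketch like the following may help. It assumes each line of the file has the form label, a tab, then the news text; that layout and the file name passed on the command line are my assumptions, not something stated above, so adjust them to whatever the archive actually contains.

```python
import sys
from collections import Counter


def count_labels(lines):
    """Count documents per label in lines shaped like 'label<TAB>text'.

    The tab-separated layout is an assumption about the file format.
    Blank lines and lines without a tab are skipped.
    """
    counts = Counter()
    for line in lines:
        label, sep, _text = line.rstrip("\n").partition("\t")
        if sep and label:
            counts[label] += 1
    return counts


def unbalanced(counts, expected_per_class):
    """Return the labels whose document count differs from the expected one."""
    return {label: n for label, n in counts.items() if n != expected_per_class}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python check_cnews.py <path-to-data-file>
    with open(sys.argv[1], encoding="utf-8") as f:
        label_counts = count_labels(f)
    print(label_counts)
    # 6500 is the per-category figure quoted above; if the archive is split
    # into several files, count over all of them before comparing.
    print(unbalanced(label_counts, 6500))
```

The counting function takes any iterable of lines, so it works on an open file as well as on a small in-memory list when trying it out.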

Sogou news corpus
News data from 18 channels (domestic, international, sports, social, entertainment, etc.) collected from several news sites during June and July 2012, providing URL and text information.
Note: use the IE browser for the download, otherwise it fails.

Corpus of Fudan University
This corpus is provided by Li Ronglu of Fudan University. test_corpus.rar is the test corpus, with 9,833 documents in total; train_corpus.rar is the training corpus, with 9,804 documents in total. Both corpora are divided into the same 20 categories.
Link: extraction code: 36wh
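After extracting the two archives, it is worth confirming that the training and test sets really share the same categories. Below is a minimal sketch; it assumes each extracted archive is a folder with one subfolder per category, which is my guess at the layout and is not confirmed above. The folder names in the usage line are hypothetical too.

```python
from pathlib import Path


def category_names(root):
    """Return the names of the immediate subfolders of root (assumed: one per category)."""
    return {p.name for p in Path(root).iterdir() if p.is_dir()}


def count_documents(root):
    """Map each category subfolder to the number of files found inside it."""
    return {
        p.name: sum(1 for f in p.rglob("*") if f.is_file())
        for p in Path(root).iterdir()
        if p.is_dir()
    }


def compare_splits(train_root, test_root):
    """List the categories that appear in only one of the two splits."""
    train = category_names(train_root)
    test = category_names(test_root)
    return {
        "only_in_train": sorted(train - test),
        "only_in_test": sorted(test - train),
    }
```

For example, compare_splits("train_corpus", "test_corpus") should return two empty lists if both sets use the same 20 categories, and summing the values of count_documents for each folder can be compared with the document totals quoted above.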

Sentiment analysis corpus

Sentiment analysis corpus from the SnowNLP open source package

More than 30,000 comments in total
Link: extraction code: 04p1

Su Shen's open sentiment analysis corpus
More than 20,000 short comments
Link: extraction code: 17m1

Entity analysis corpus

BosonNLP developer corpus
Follow the download instructions for Boson data developers
Download address:
Link: extraction code: 88x3

Detailed NER tagging corpus
I could not find the original source of this corpus.
Please contact me if you know where it comes from.
Link: extraction code: ptad

Garbage classification corpus

Super Chinese corpus sharing:
Many NLP related resources in Chinese:
Examples related to various models:

All of these resources were collected from the internet. If anything infringes your rights, contact me and I will delete it immediately.


Posted on Thu, 06 Feb 2020 05:20:03 -0800 by NNTB