鏂囨湰鑷劧璇█澶勭悊鏁版嵁闆�
=======================

鐩墠鏈」鐩殑鏂囨湰鎯呮劅鍒嗘瀽妯″瀷璇勬祴妯″潡宸查泦鎴�6涓暟鎹泦锛屽嵆 

1. :code:`amazon_reviews_multi` 鐨勪腑鏂囬儴鍒嗭紙璁颁负 :code:`amazon_reviews_zh`锛�
2. :code:`SST`
3. :code:`imdb_reviews` 甯︽爣绛炬暟鎹殑涓€灏忛儴鍒嗭紙璁颁负 :code:`imdb_reviews_tiny`锛�
4. :code:`dianping` 鐨勪竴灏忛儴鍒嗭紙璁颁负 :code:`dianping_tiny`锛�
5. :code:`jd_binary` 鐨勪竴灏忛儴鍒嗭紙璁颁负 :code:`jd_binary_tiny`锛�
6. :code:`jd_full` 鐨勪竴灏忛儴鍒嗭紙璁颁负 :code:`jd_full_tiny`锛�


amazon_reviews_multi
----------------------
璇ユ暟鎹泦鏀堕泦浜嗕骇鐢熶簹椹€婏紙Amazon锛夎嚜2015.11.1鑷�2019.11.1鐨勫晢鍝佽瘎璁猴紝娑电洊涓枃锛岃嫳鏂囩瓑鍏�6绉嶈瑷€銆傛暟鎹泦涓瘡涓€鏉¤褰曞寘鍚簡鍟嗗搧璇勮鐨勬枃鏈紝鏍囬锛屾槦鏍囷紙1-5鏄燂級锛岃瘎璁鸿€呯殑鍖垮悕ID锛屽晢鍝佺殑鍖垮悕ID锛屼互鍙婄矖绮掑害鐨勫晢鍝佸搧绫汇€傛瘡涓€绉嶈瑷€鍦ㄨ缁冮泦銆侀獙璇侀泦銆佹祴璇曢泦涓兘鍒嗗埆鏈�200000銆�5000浠ュ強5000鏉℃暟鎹€備互涓嬫槸鏌愭潯涓枃鐨勬暟鎹細

.. code:: json

    {
        "language": "zh",
        "product_category": "book",
        "product_id": "product_zh_0123483",
        "review_body": "这简直就是太差了！出版社怎么就能出版吗？我以为是百度摘录呢！这到底是哪个鱼目混珠的教授啊？！能给点干货吗？！总算应验了一句话，一本书哪怕只有一句让你感到有意义也算是本好书。哇为了找这本书哪怕一句不是废话的句子都费了我整整一天时间。。",
        "review_id": "zh_0713738",
        "review_title": "简直是废话！",
        "reviewer_id": "reviewer_zh_0518940",
        "stars": 1
    }


amazon_reviews_zh
--------------------
The records whose :code:`language` field is :code:`zh` were extracted (210,000 in total), placed into a :code:`DataFrame`, saved as a :code:`csv` file, and compressed with :code:`gzip`, resulting in a 19.4 MB file. This dataset currently lives under :code:`text/Datasets/amazon_reviews_zh`. The code that generates it with Huggingface's :code:`datasets` package is as follows:

.. code:: python

    import datasets
    import pandas as pd

    # Load the Chinese subset of amazon_reviews_multi from the Huggingface Hub.
    ar_zh = datasets.load_dataset("amazon_reviews_multi", "zh")
    cols = ["set", "product_category", "product_id", "review_body", "review_id", "review_title", "reviewer_id", "stars",]

    # Flatten the three splits into one list of records, tagging each record
    # with the split it came from.
    temp = []
    for s in ["train", "validation", "test"]:
        for item in ar_zh[s]:
            c = {"set": s}
            c.update(item)
            temp.append(c)
    ar_zh = pd.DataFrame(temp)
    ar_zh = ar_zh[cols]  # fix the column order
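
The prose above says the assembled :code:`DataFrame` is saved as a gzip-compressed :code:`csv`; a minimal sketch of that step on a toy frame (the filename here is illustrative, not the project's actual path):

```python
import pandas as pd

# Round-trip a tiny frame through a gzip-compressed CSV; pandas infers
# the gzip codec from the ".gz" extension.
df = pd.DataFrame({"review_id": ["zh_0713738"], "stars": [1]})
df.to_csv("amazon_reviews_zh_sample.csv.gz", index=False, compression="gzip")
roundtrip = pd.read_csv("amazon_reviews_zh_sample.csv.gz")
```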


SST
----
The :code:`SST` dataset, short for the Stanford Sentiment Treebank, is an English movie-review dataset with 11,855 records, each of the form

.. code:: json

    {
        "label": 0.7222200036048889,
        "sentence": "Yet the act is still charming here .",
        "tokens": "Yet|the|act|is|still|charming|here|.",
        "tree": "15|13|13|10|9|9|11|12|10|11|12|14|14|15|0"
    }

The fields have the following meanings:

-   sentence: a single review of a movie
-   label: the degree of positivity of the review, ranging from 0.0 to 1.0
-   tokens: the tokens obtained from the review
-   tree: the parse tree of the review, obtained via syntactic parsing
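
The :code:`tree` field is a parent-pointer encoding: the i-th entry (1-indexed) gives the parent of node i, 0 marks the root, and nodes 1 through ``len(tokens)`` are the leaves. A sketch of decoding the sample record above into a child table:

```python
# Decode the SST "tree" field of the sample record shown earlier.
tree = "15|13|13|10|9|9|11|12|10|11|12|14|14|15|0"
tokens = "Yet|the|act|is|still|charming|here|.".split("|")

# parents[i-1] is the parent of node i; 0 is the (virtual) root marker.
parents = [int(p) for p in tree.split("|")]
children = {}
for node, parent in enumerate(parents, start=1):
    children.setdefault(parent, []).append(node)

root = children[0][0]  # the node whose parent is 0
```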

鐩墠杩欎釜鏁版嵁闆嗕繚瀛樺湪 :code:`text/Datasets/sst` 涓嬨€傚叿浣撲粠 :code:`Huggingface` 鐨勮蒋浠跺寘 :code:`datasets` 鐢熸垚鐨勪唬鐮佸涓嬶細

.. code:: python

    import datasets
    import pandas as pd

    # Load SST from the Huggingface Hub.
    sst = datasets.load_dataset("sst")
    cols = ["set", "label", "sentence", "tokens", "tree",]

    # Flatten the three splits into one list of records, tagging each record
    # with the split it came from.
    temp = []
    for s in ["train", "validation", "test"]:
        for item in sst[s]:
            c = {"set": s}
            c.update(item)
            temp.append(c)
    sst = pd.DataFrame(temp)
    sst = sst[cols]  # fix the column order


imdb_reviews
-------------
璇ユ暟鎹泦涓昏鏄疘MDB涓婄殑褰辫瘎锛屾爣绛句负褰辫瘎涓鸿礋闈㈣瘎浠凤紙0锛夈€佹闈㈣瘎浠凤紙1锛夛紝鍒嗕负浜嗚缁冮泦銆佹祴璇曢泦锛屽垎鍒湁25000鏉℃暟鎹€傛澶栬繕鏈�50000鏉℃湭甯︽爣绛撅紙-1锛夌殑鏁版嵁銆傝娉ㄦ剰鐨勬槸锛屽師濮嬫暟鎹殑鏂囨湰涓彲鑳藉甫鏈塇TML鏍囩銆�

.. code:: json

    {
        "label": 0,
        "text": "Mann photographs the Alberta Rocky Mountains in a superb fashion, and Jimmy Stewart and Walter Brennan give enjoyable performances as they always seem to do. <br /><br />But come on Hollywood - a Mountie telling the people of Dawson City, Yukon to elect themselves a marshal (yes a marshal!) and to enforce the law themselves, then gunfighters battling it out on the streets for control of the town? <br /><br />Nothing even remotely resembling that happened on the Canadian side of the border during the Klondike gold rush. Mr. Mann and company appear to have mistaken Dawson City for Deadwood, the Canadian North for the American Wild West.<br /><br />Canadian viewers be prepared for a Reefer Madness type of enjoyable howl with this ludicrous plot, or, to shake your head in disgust."
    }
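
Because the raw text may contain HTML tags such as :code:`<br />`, a simple cleanup pass can be applied before use. The regex below is a heuristic sketch (not part of the dataset pipeline, and not a full HTML parser):

```python
import re

def strip_html(text: str) -> str:
    """Replace simple HTML tags such as <br /> with a space; a regex
    heuristic that is sufficient for the <...> tags in this corpus."""
    return re.sub(r"<[^>]+>", " ", text)

cleaned = strip_html("enjoyable performances.<br /><br />But come on Hollywood")
```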


imdb_reviews_tiny
------------------
We built a small dataset :code:`imdb_reviews_tiny` by randomly sampling 1,000 positive and 1,000 negative reviews from each of the training and test sets of :code:`imdb_reviews` (4,000 records in total). The code is as follows:

.. code:: python

    import pandas as pd
    import tensorflow_datasets as tfds
    from random import sample

    # Download and prepare the raw dataset.
    ds = tfds.text.IMDBReviews()
    ds.download_and_prepare()

    imdb_train = tfds.load(name="imdb_reviews", split="train")
    imdb_test = tfds.load(name="imdb_reviews", split="test")
    df_imdb_train = tfds.as_dataframe(imdb_train)
    df_imdb_test = tfds.as_dataframe(imdb_test)

    df_imdb_train["set"] = "train"
    df_imdb_test["set"] = "test"
    df_imdb = pd.concat([df_imdb_train, df_imdb_test]).reset_index(drop=True)
    df_imdb = df_imdb[["set", "label", "text",]]

    # Collect the row indices of each (split, label) cell, then draw 1000
    # indices from each cell without replacement.
    train_0 = df_imdb[(df_imdb["set"]=="train") & (df_imdb["label"]==0)].index.tolist()
    train_1 = df_imdb[(df_imdb["set"]=="train") & (df_imdb["label"]==1)].index.tolist()
    test_0 = df_imdb[(df_imdb["set"]=="test") & (df_imdb["label"]==0)].index.tolist()
    test_1 = df_imdb[(df_imdb["set"]=="test") & (df_imdb["label"]==1)].index.tolist()
    imdb_tiny_indices = sorted(sample(train_0, 1000) + sample(train_1, 1000) + sample(test_0, 1000) + sample(test_1, 1000))
    df_imdb_tiny = df_imdb[df_imdb.index.isin(imdb_tiny_indices)].reset_index(drop=True)

    # tensorflow_datasets yields the text as bytes; decode to str.
    df_imdb_tiny["text"] = df_imdb_tiny["text"].apply(lambda s: s.decode())

鐩墠杩欎釜鏁版嵁闆嗕繚瀛樺湪 :code:`text/Datasets/imdb_reviews_tiny` 涓嬨€�


dianping
----------
The :code:`dianping` dataset collects product and shop reviews from Dianping, over two million records in total, with a :code:`label` field (1, 2) and a :code:`text` field; it comes from the `Glyph project <https://github.com/zhangxiangxiao/glyph/>`_ [#glyph]_. A sample record looks like:

.. code:: json

    {
        "label": 0,
        "text": "嗓子疼，super带我去黄振龙喝斑莎凉茶。你说我肯定不能一口气喝下这么苦的东西，我像个小孩子一样，为了表现自己，一口气喝完了，然后撇着嘴找你要橘皮吃。你疼爱地望着我，一边骂我“变态的”一边夸我厉害。\\n事后我说没胃口闹着要吃湘菜，于是你带到这家店，环境好奇怪哦，更像是一个西餐厅，挂着一些线帘，红沙发。已经忘了吃的什么了。吃完饭我们一路走去人民公园，途径7仔还买了一杯思乐冰，我们像两个小孩一样坐在人民公园的长凳上把这杯思乐冰给吸完了。\\nsuper，never ever"
    }


.. _dianping-tiny:

dianping_tiny
----------------------------------
We built a small dataset :code:`dianping_tiny` by randomly sampling 2,500 positive and 2,500 negative reviews from each of the training and test sets of :code:`dianping` (10,000 records in total); it is saved under :code:`text/Datasets/dianping_tiny`. By additionally restricting the length of the :code:`text` field, there are also a :code:`long` version (length ≥ 30) and an :code:`xl` version (length ≥ 100). The code that generates :code:`dianping_tiny` from :code:`dianping` is as follows:

.. code:: python

    from typing import Sequence
    from itertools import product

    import pandas as pd

    def selection(df: pd.DataFrame, cols: Sequence[str], num: int) -> pd.DataFrame:
        """Randomly draw up to ``num`` rows from ``df`` for every combination
        of values of the columns ``cols``."""
        col_sets = product(*[set(df[c].tolist()) for c in cols])
        df_out = pd.DataFrame()
        for combination in col_sets:
            df_tmp = df.copy()
            for c, v in zip(cols, combination):
                df_tmp = df_tmp[df_tmp[c]==v]
            df_tmp = df_tmp.sample(n=min(num, len(df_tmp))).reset_index(drop=True)
            df_out = pd.concat([df_out, df_tmp], ignore_index=True)
        return df_out

    df_dp_train = pd.read_csv("dianping-train.csv.xz", header=None)
    df_dp_test = pd.read_csv("dianping-test.csv.xz", header=None)
    df_dp_train.columns = ["label", "text",]
    df_dp_test.columns = ["label", "text",]
    df_dp_train["set"] = "train"
    df_dp_test["set"] = "test"

    df_dp = pd.concat([df_dp_train, df_dp_test], ignore_index=True)
    df_dp = df_dp[["set", "label", "text",]]
    # Keep only reviews that contain at least one CJK character.
    df_dp = df_dp[df_dp.text.str.contains("[\u4e00-\u9FFF]+")].reset_index(drop=True)

    # 2500 rows per (set, label) cell; .sample(frac=1) shuffles the result.
    df_dp_tiny = selection(df_dp, cols=["set", "label"], num=2500).sample(frac=1)
    df_dp_tiny.to_csv("dianping_filtered_tiny.csv.gz", index=False, compression="gzip")

    df_dp_long_tiny = selection(df_dp[df_dp.text.str.len()>=30], cols=["set", "label"], num=2500).sample(frac=1)
    df_dp_long_tiny.to_csv("dianping_filtered_long_tiny.csv.gz", index=False, compression="gzip")

    df_dp_xl_tiny = selection(df_dp[df_dp.text.str.len()>=100], cols=["set", "label"], num=2500).sample(frac=1)
    df_dp_xl_tiny.to_csv("dianping_filtered_xl_tiny.csv.gz", index=False, compression="gzip")
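
Note that the :code:`long`/:code:`xl` cutoffs count characters, not bytes: pandas' :code:`str.len` returns the number of Unicode characters, so a threshold of 30 means 30 Chinese characters. A toy illustration:

```python
import pandas as pd

# str.len counts Unicode characters, so a multi-byte Chinese character
# contributes 1 to the length, not 3.
df = pd.DataFrame({"text": ["好" * 10, "好" * 30, "好" * 100]})
long_mask = df.text.str.len() >= 30   # the "long" cutoff
xl_mask = df.text.str.len() >= 100    # the "xl" cutoff
```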


jd_full
----------
The :code:`jd_full` dataset collects product reviews from JD.com, over 400,000 records in total, with a :code:`score` field (1 - 5) plus :code:`title` and :code:`content` fields; it comes from the `Glyph project <https://github.com/zhangxiangxiao/glyph/>`_ [#glyph]_. A sample record looks like:

.. code:: json

    {
        "score": 5,
        "title": "旋涡猫的找法：村上朝日堂日记（新版）",
        "content": "马拉松这东西，在某种意义上是相当奇异的体验。我甚至觉得人生本身的色彩都会因体验和没体验过马拉松而大不相同。尽管不能说是宗教体验，但其中仍有某种与人的存在密切相关的东西。实际跑四十二公里的途中，难免相当认真地自己问自己：我何苦这么自找苦吃？不是什么好处都没有吗？或者不如说反倒对身体不利（脱趾甲、起水泡、第二天下楼难受）。可是等到好歹冲进终点、喘一口气接过冰凉的罐装啤酒&ldquo;咕嘟嘟&rdquo;喝下去进而泡进热水里用别针尖刺破胀鼓鼓的水泡的时候，又开始满怀豪情地心想：下次一定再跑！这到底是什么作用呢？莫非人是时不时怀有潜在的愿望，存心要把自己折磨到极点不成？ 其形成原因我不大清楚，反正这种感受是只能在跑完全程马拉松时才能出现的特殊感受。说来奇怪，即使跑半程马拉松也没有如此感受，无非&ldquo;拼命跑完二十一公里&rdquo;而已。诚然，半程说辛苦也够辛苦的，但那是跑完时即可整个消失的辛苦。而跑完全程马拉松时，就有无法简单化解的执著的东西在人的（至少我的）心头挥之不去。解释是解释不好，感觉上就好像不久还将遭遇刚刚尝过的痛苦，因而必须相应做一个&ldquo;善后处理&rdquo;&mdash;&mdash;&ldquo;这个还要重复的，这回得重复得好一些才行！&rdquo;正因如此，前后十二年时间里我才不顾每次都累得气喘吁吁筋疲力尽而不屈不挠坚持跑全程马拉松&mdash;&mdash;当然&ldquo;善后处理&rdquo;是一点也没处理好。 或许有人说是自虐，但我认为绝不是仅仅如此，莫如说类似一种好奇心，类似一种力图通过一次次增加次数一点点提高限度来把自己身上潜在的、自己尚不知晓而想一睹为快的东西一把拉到光天化日之下的心情&hellip;&hellip; 细想之下，这同我平时对长篇小说怀有的心情几乎一模一样。某一天突然动了写长篇小说的念头，于是坐在桌前，数月或数年屏息敛气将精神集中在极限状态，终于写出一部长篇。每次都累得像狠狠拧过的抹布，啊，太累了，累死了！心想再不干那种事了。不料时过不久，再次心血来潮：这回可要大干一场！又死皮赖脸地坐在桌前动笔写长篇。然而无论怎么写无论写多少都仍有凝结物沉甸甸地残留在肚子里。 相比之下，短篇小说就好像十公里赛，再长不过是半程马拉松罢了。不用说，短篇自有短篇的独特作用，自有其相应的文思和愉悦，但缺乏&mdash;&mdash;当然是对我而言&mdash;&mdash;深深触及身体结构本身的那种决定性的致命性质的东西，因而&ldquo;爱憎参半&rdquo;的东西也少于长篇。 马拉松跑完后，去终点附近科普利广场里面的波士顿最有名的海鲜餐厅&ldquo;LEAGAL SEAFOOD&rdquo;喝蛤肉汤，吃一种惟独新英格兰地区才有的我喜欢吃的海贝。女侍应生看着我手中跑完全程的纪念章夸奖道：&ldquo;你跑马拉松了？嗬，好有勇气啊！&rdquo;非我瞎说，被人夸有勇气有生以来差不多是头一次。说实话，我根本没什么勇气。 但不管谁怎么说，有勇气也好没勇气也好，跑完全程马拉松之后吃的足够量的热气腾腾的晚餐，实在是这个世界上最美妙的东西之一。 不管谁怎么说。P10-13\\n"
    }


.. _jd-full-tiny:

jd_full_tiny
------------------------------------
We built a small dataset :code:`jd_full_tiny` by randomly sampling 1,000 records per score value (1-5, the :code:`score` field) from each of the training and test sets of :code:`jd_full` (10,000 records in total); it is saved under :code:`text/Datasets/jd_full_tiny`. By additionally restricting the length of the :code:`content` field, there are also a :code:`long` version (length ≥ 30) and an :code:`xl` version (length ≥ 100). The code that generates :code:`jd_full_tiny` from :code:`jd_full` is as follows:

.. code:: python

    # Reuses pandas (pd) and the selection() helper defined in the
    # dianping_tiny section above.
    df_jd_full_train = pd.read_csv("jd_full_train.csv.xz", header=None)
    df_jd_full_test = pd.read_csv("jd_full_test.csv.xz", header=None)
    df_jd_full_train.columns = ["score", "title", "content",]
    df_jd_full_test.columns = ["score", "title", "content",]
    df_jd_full_train["set"] = "train"
    df_jd_full_test["set"] = "test"

    df_jd_full = pd.concat([df_jd_full_train, df_jd_full_test], ignore_index=True)
    df_jd_full = df_jd_full[["set", "score", "title", "content",]]
    # Keep only reviews whose content contains at least one CJK character.
    df_jd_full = df_jd_full[df_jd_full.content.str.contains("[\u4e00-\u9FFF]+")].reset_index(drop=True)

    # 1000 rows per (set, score) cell; .sample(frac=1) shuffles the result.
    df_jd_full_tiny = selection(df_jd_full, cols=["set", "score"], num=1000).sample(frac=1)
    df_jd_full_tiny.to_csv("jd_full_filtered_tiny.csv.gz", index=False, compression="gzip")

    df_jd_full_long_tiny = selection(df_jd_full[df_jd_full.content.str.len()>=30], cols=["set", "score"], num=1000).sample(frac=1)
    df_jd_full_long_tiny.to_csv("jd_full_filtered_long_tiny.csv.gz", index=False, compression="gzip")

    df_jd_full_xl_tiny = selection(df_jd_full[df_jd_full.content.str.len()>=100], cols=["set", "score"], num=1000).sample(frac=1)
    df_jd_full_xl_tiny.to_csv("jd_full_filtered_xl_tiny.csv.gz", index=False, compression="gzip")


jd_binary
----------
The :code:`jd_binary` dataset is obtained from :code:`jd_full` by collapsing the :code:`score` field (1 - 5) into two classes (1, 2) and replacing it with a :code:`label` field; it comes from the `Glyph project <https://github.com/zhangxiangxiao/glyph/>`_ [#glyph]_.
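
Glyph-style binarization conventionally maps low scores to the negative class and high scores to the positive class, dropping the neutral middle score; the exact mapping below is our assumption and should be checked against the Glyph release. A sketch:

```python
import pandas as pd

# Assumed mapping: scores 1-2 -> label 1 (negative), scores 4-5 -> label 2
# (positive), score 3 dropped. Illustrative only.
df = pd.DataFrame({"score": [1, 2, 3, 4, 5]})
df["label"] = df["score"].map({1: 1, 2: 1, 4: 2, 5: 2})
df_binary = (df.dropna(subset=["label"])
               .astype({"label": int})
               .drop(columns=["score"])
               .reset_index(drop=True))
```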


jd_binary_tiny
---------------
浠庢暟鎹泦 :code:`jd_binary` 鑾峰緱鏁版嵁闆� :code:`jd_binary_tiny` 鐨勬柟寮忥紝涓庢暟鎹泦 :code:`dianping_tiny` 锛� :ref:`dianping-tiny` 锛変互鍙婃暟鎹泦 :code:`jd_full_tiny` 锛� :ref:`jd-full-tiny` 锛夌殑鐢熸垚鏂瑰紡绫讳技銆�


.. [#glyph] Zhang, X. and LeCun, Y. Which encoding is the best for text classification in Chinese, English, Japanese and Korean? arXiv preprint arXiv:1708.02657, 2017.