Text Datasets for Natural Language Processing
=============================================

At present, the text sentiment analysis model evaluation module of this project has integrated 6 datasets, namely

1. the Chinese portion of :code:`amazon_reviews_multi` (denoted :code:`amazon_reviews_zh`)
2. :code:`SST`
3. a small labeled portion of :code:`imdb_reviews` (denoted :code:`imdb_reviews_tiny`)
4. a small portion of :code:`dianping` (denoted :code:`dianping_tiny`)
5. a small portion of :code:`jd_binary` (denoted :code:`jd_binary_tiny`)
6. a small portion of :code:`jd_full` (denoted :code:`jd_full_tiny`)

amazon_reviews_multi
--------------------

This dataset collects Amazon product reviews posted from 2015-11-01 to 2019-11-01, covering 6 languages including Chinese and English. Each record contains the text of the review, its title, the star rating (1-5 stars), an anonymized reviewer ID, an anonymized product ID, and a coarse-grained product category. For each language, the training, validation, and test sets contain 200,000, 5,000, and 5,000 records respectively. Below is one Chinese record:

.. code:: json

   {
     "language": "zh",
     "product_category": "book",
     "product_id": "product_zh_0123483",
     "review_body": "这简直就是太差了！出版社怎么就能出版吗？我以为是百度摘录呢！这到底是哪个鱼目混珠的教授啊？！能给点干货吗？！总算应验了一句话，一本书哪怕只有一句花你感到有意义也算是本好书。唉为了找这本书哪怕一句不是废话的句子都费了我整整一天时间。。",
     "review_id": "zh_0713738",
     "review_title": "简直是废话！",
     "reviewer_id": "reviewer_zh_0518940",
     "stars": 1
   }

amazon_reviews_zh
-----------------

The records whose :code:`language` field is :code:`zh` are extracted, 210,000 in total, into a :code:`DataFrame`, saved as a :code:`csv` file, and compressed with :code:`gzip`, for a file size of 19.4M. This dataset is currently stored under :code:`text/Datasets/amazon_reviews_zh`. The code that generates it using the :code:`Huggingface` package :code:`datasets` is as follows:

.. code:: python

   import datasets
   import pandas as pd

   ar_zh = datasets.load_dataset("amazon_reviews_multi", "zh")
   cols = [
       "set", "product_category", "product_id", "review_body",
       "review_id", "review_title", "reviewer_id", "stars",
   ]
   # flatten the three splits into one table,
   # recording the split name in a "set" column
   temp = []
   for s in ["train", "validation", "test"]:
       for item in ar_zh[s]:
           c = {"set": s}
           c.update(item)
           temp.append(c)
   ar_zh = pd.DataFrame(temp)
   ar_zh = ar_zh[cols]

SST
---

The dataset :code:`SST`, whose full name is the Stanford Sentiment Treebank, is an English movie-review dataset with 11,855 records, each of which looks like
.. code:: json

   {
     "label": 0.7222200036048889,
     "sentence": "Yet the act is still charming here .",
     "tokens": "Yet|the|act|is|still|charming|here|.",
     "tree": "15|13|13|10|9|9|11|12|10|11|12|14|14|15|0"
   }

The meaning of each field is as follows:

- sentence: a single review about a movie
- label: the degree of positivity of the review, with values ranging from 0.0 to 1.0
- tokens: the tokens obtained from the review
- tree: the syntax tree of the review, obtained via syntactic parsing

This dataset is currently stored under :code:`text/Datasets/sst`. The code that generates it using the :code:`Huggingface` package :code:`datasets` is as follows:

.. code:: python

   import datasets
   import pandas as pd

   sst = datasets.load_dataset("sst")
   cols = ["set", "label", "sentence", "tokens", "tree",]
   # flatten the three splits into one table,
   # recording the split name in a "set" column
   temp = []
   for s in ["train", "validation", "test"]:
       for item in sst[s]:
           c = {"set": s}
           c.update(item)
           temp.append(c)
   sst = pd.DataFrame(temp)
   sst = sst[cols]

imdb_reviews
------------

This dataset consists mainly of movie reviews from IMDB, labeled as negative (0) or positive (1). It is split into a training set and a test set, each with 25,000 records. There are an additional 50,000 unlabeled records (label -1). Note that the raw review text may contain HTML tags.

.. code:: json

   {
     "label": 0,
     "text": "Mann photographs the Alberta Rocky Mountains in a superb fashion, and Jimmy Stewart and Walter Brennan give enjoyable performances as they always seem to do. <br /><br />But come on Hollywood - a Mountie telling the people of Dawson City, Yukon to elect themselves a marshal (yes a marshal!) and to enforce the law themselves, then gunfighters battling it out on the streets for control of the town? <br /><br />Nothing even remotely resembling that happened on the Canadian side of the border during the Klondike gold rush. Mr. Mann and company appear to have mistaken Dawson City for Deadwood, the Canadian North for the American Wild West.<br /><br />Canadian viewers be prepared for a Reefer Madness type of enjoyable howl with this ludicrous plot, or, to shake your head in disgust."
   }

imdb_reviews_tiny
-----------------

By randomly drawing 1,000 positive and 1,000 negative samples from each of the training and test sets of :code:`imdb_reviews`, 4,000 records in total, we build a small dataset :code:`imdb_reviews_tiny` with the following code
.. code:: python

   import pandas as pd
   import tensorflow_datasets as tfds
   from random import sample

   # download and prepare the raw IMDB reviews, then load both labeled splits
   ds = tfds.text.IMDBReviews()
   ds.download_and_prepare()
   imdb_train = tfds.load(name="imdb_reviews", split="train")
   imdb_test = tfds.load(name="imdb_reviews", split="test")
   df_imdb_train = tfds.as_dataframe(imdb_train)
   df_imdb_test = tfds.as_dataframe(imdb_test)
   df_imdb_train["set"] = "train"
   df_imdb_test["set"] = "test"
   df_imdb = pd.concat([df_imdb_train, df_imdb_test]).reset_index(drop=True)
   df_imdb = df_imdb[["set", "label", "text",]]

   # indices of each (set, label) group
   train_0 = df_imdb[(df_imdb["set"]=="train") & (df_imdb["label"]==0)].index.tolist()
   train_1 = df_imdb[(df_imdb["set"]=="train") & (df_imdb["label"]==1)].index.tolist()
   test_0 = df_imdb[(df_imdb["set"]=="test") & (df_imdb["label"]==0)].index.tolist()
   test_1 = df_imdb[(df_imdb["set"]=="test") & (df_imdb["label"]==1)].index.tolist()

   # draw 1000 samples from each of the four groups
   imdb_tiny_indices = sorted(
       sample(train_0, 1000) + sample(train_1, 1000)
       + sample(test_0, 1000) + sample(test_1, 1000)
   )
   df_imdb_tiny = df_imdb[df_imdb.index.isin(imdb_tiny_indices)].reset_index(drop=True)
   # the raw text is stored as bytes; decode it to str
   df_imdb_tiny["text"] = df_imdb_tiny["text"].apply(lambda s: s.decode())

This dataset is currently stored under :code:`text/Datasets/imdb_reviews_tiny`.

dianping
--------

The dataset :code:`dianping` collects product and shop reviews from Dianping, more than two million records in total, with the fields :code:`label` (1, 2) and :code:`text`. It comes from the `Glyph project <https://github.com/zhangxiangxiao/glyph/>`_ [#glyph]_. A sample record is shown below

.. code:: json

   {
     "label": 0,
     "text": "嗓子疼，super带我去黄振龙喝斑莎凉茶。你说我肯定不能一口气喝下这么苦的东西，我像个小孩子一样，为了表现自己，一口气喝完了，然后撇着嘴找你要橘皮吃。你疼爱地望着我，一边骂我“变态的”一边夸我厉害。\\n事后我说没胃口闹着要吃湘菜，于是你带到这家店，环境好奇怪哦，更像是一个西餐厅，挂着一些线帘，红沙发。已经忘了吃的什么了。吃完饭我们一路走去人民公园，途径7仔还买了一杯思乐冰，我们像两个小孩一样坐在人民公园的长凳上把这杯思乐冰给吸完了。\\nsuper，never ever"
   }
.. _dianping-tiny:

dianping_tiny
-------------

By randomly drawing 2,500 positive and 2,500 negative samples from each of the training and test sets of :code:`dianping`, 10,000 records in total, we build a small dataset :code:`dianping_tiny`, stored under :code:`text/Datasets/dianping_tiny`. By additionally constraining the length of the :code:`text` field, the dataset also has a :code:`long` version (length ≥ 30) and an :code:`xl` version (length ≥ 100). The code that generates :code:`dianping_tiny` from the dataset :code:`dianping` is as follows

.. code:: python

   from itertools import product
   from typing import Sequence

   import pandas as pd

   def selection(df: pd.DataFrame, cols: Sequence[str], num: int) -> pd.DataFrame:
       """Randomly sample at most ``num`` rows for each combination of values of ``cols``."""
       col_sets = product(*[set(df[c].tolist()) for c in cols])
       df_out = pd.DataFrame()
       for combination in col_sets:
           df_tmp = df.copy()
           for c, v in zip(cols, combination):
               df_tmp = df_tmp[df_tmp[c]==v]
           df_tmp = df_tmp.sample(n=min(num, len(df_tmp))).reset_index(drop=True)
           df_out = pd.concat([df_out, df_tmp], ignore_index=True)
       return df_out

   df_dp_train = pd.read_csv("dianping-train.csv.xz", header=None)
   df_dp_test = pd.read_csv("dianping-test.csv.xz", header=None)
   df_dp_train.columns = ["label", "text",]
   df_dp_test.columns = ["label", "text",]
   df_dp_train["set"] = "train"
   df_dp_test["set"] = "test"
   df_dp = pd.concat([df_dp_train, df_dp_test], ignore_index=True)
   df_dp = df_dp[["set", "label", "text",]]
   # keep only rows that contain at least one CJK character
   df_dp = df_dp[df_dp.text.str.contains(r"[\u4e00-\u9FFF]+")].reset_index(drop=True)

   df_dp_tiny = selection(df_dp, cols=["set", "label"], num=2500).sample(frac=1)
   df_dp_tiny.to_csv("dianping_filtered_tiny.csv.gz", index=False, compression="gzip")
   df_dp_long_tiny = selection(df_dp[df_dp.text.str.len()>=30], cols=["set", "label"], num=2500).sample(frac=1)
   df_dp_long_tiny.to_csv("dianping_filtered_long_tiny.csv.gz", index=False, compression="gzip")
   df_dp_xl_tiny = selection(df_dp[df_dp.text.str.len()>=100], cols=["set", "label"], num=2500).sample(frac=1)
   df_dp_xl_tiny.to_csv("dianping_filtered_xl_tiny.csv.gz", index=False, compression="gzip")

jd_full
-------
The dataset :code:`jd_full` collects product-review data from JD.com, more than 400,000 records in total, with the fields :code:`score` (1-5), :code:`title`, and :code:`content`. It comes from the `Glyph project <https://github.com/zhangxiangxiao/glyph/>`_ [#glyph]_. A sample record is shown below

.. code:: json

   {
     "score": 5,
     "title": "旋涡猫的找法：村上朝日堂日记（新版）",
     "content": "马拉松这东西，在某种意义上是相当奇异的体验。我甚至觉得人生本身的色彩都会因体验和没体验过马拉松而大不相同。尽管不能说是宗教体验，但其中仍有某种与人的存在密切相关的东西。实际跑四十二公里的途中，难免相当认真地自己问自己：我何苦这么自找苦吃？不是什么好处都没有吗？或者不如说反倒对身体不利（脱趾甲、起水泡、第二天下楼难受）。可是等到好歹冲进终点、喘一口气接过冰凉的罐装啤酒“咕嘟嘟”喝下去进而泡进热水里用别针尖刺破胀鼓鼓的水泡的时候，又开始满怀豪情地心想：下次一定再跑！这到底是什么作用呢？莫非人是时不时怀有潜在的愿望，存心要把自己折磨到极点不成？其形成原由我不大清楚，反正这种感受是只能在跑完全程马拉松时才能出现的特殊感受。说来奇怪，即使跑半程马拉松也没有如此感受，无非“拼命跑完二十一公里”而已。诚然，半程说辛苦也够辛苦的，但那是跑完时即可整个消失的辛苦。而跑完全程马拉松时，就有无法简单化解的执著的东西在人的（至少我的）心头挥之不去。解释是解释不好，感觉上就好像不久还将遭遇刚刚尝过的痛苦，因而必须相应做一个“善后处理”——“这个还要重复的，这回得重复得好一些才行！”正因如此，前后十二年时间里我才不顾每次都累得气喘吁吁筋疲力尽而不屈不挠坚持跑全程马拉松——当然“善后处理”是一点也没处理好。或许有人说是自虐，但我认为绝不是仅仅如此，莫如说类似一种好奇心，类似一种力图通过一次次增加次数一点点提高限度来把自己身上潜在的、自己尚不知晓而想一睹为快的东西一把拉到光天化日之下的心情……细想之下，这同我平时对长篇小说怀有的心情几乎一模一样。某一天突然动了写长篇小说的念头，于是坐在桌前，数月或数年屏息敛气将精神集中在极限状态，终于写出一部长篇。每次都累得像狠狠拧过的抹布，啊，太累了，累死了！心想再不干那种事了。不料时过不久，再次心血来潮：这回可要大干一场！又死皮赖脸地坐在桌前动笔写长篇。然而无论怎么写无论写多少都仍有凝结物沉甸甸地残留在肚子里。相比之下，短篇小说就好像十公里赛，再长不过是半程马拉松罢了。不用说，短篇自有短篇的独特作用，自有其相应的文思和愉悦，但缺乏——当然是对我而言——深深触及身体结构本身的那种决定性的致命性质的东西，因而“爱憎参半”的东西也少于长篇。马拉松跑完后，去终点附近科普利广场里面的波士顿最有名的海鲜馆“LEAGAL SEAFOOD”喝蚬肉汤，吃一种惟独新英格兰地区才有的我喜欢吃的海贝。女侍应生看着我手中跑完全程的纪念章夸奖道：“你跑马拉松了？嗬，好有勇气啊！”非我瞎说，被人夸有勇气有生以来差不多是头一次。说实话，我根本没什么勇气。但不管谁怎么说，有勇气也好没勇气也好，跑完全程马拉松之后吃的足够量的热气腾腾的晚餐，实在是这个世界上最美好的东西之一。不管谁怎么说。P10-13\\n"
   }
.. _jd-full-tiny:

jd_full_tiny
------------

By randomly drawing 1,000 samples per :code:`score` value (1-5) from each of the training and test sets of :code:`jd_full`, 10,000 records in total, we build a small dataset :code:`jd_full_tiny`, stored under :code:`text/Datasets/jd_full_tiny`. By additionally constraining the length of the :code:`content` field, the dataset also has a :code:`long` version (length ≥ 30) and an :code:`xl` version (length ≥ 100). The code that generates :code:`jd_full_tiny` from the dataset :code:`jd_full` is as follows

.. code:: python

   df_jd_full_train = pd.read_csv("jd_full_train.csv.xz", header=None)
   df_jd_full_test = pd.read_csv("jd_full_test.csv.xz", header=None)
   df_jd_full_train.columns = ["score", "title", "content",]
   df_jd_full_test.columns = ["score", "title", "content",]
   df_jd_full_train["set"] = "train"
   df_jd_full_test["set"] = "test"
   df_jd_full = pd.concat([df_jd_full_train, df_jd_full_test], ignore_index=True)
   df_jd_full = df_jd_full[["set", "score", "title", "content",]]
   # keep only rows whose content contains at least one CJK character
   df_jd_full = df_jd_full[df_jd_full.content.str.contains(r"[\u4e00-\u9FFF]+")].reset_index(drop=True)

   df_jd_full_tiny = selection(df_jd_full, cols=["set", "score"], num=1000).sample(frac=1)
   df_jd_full_tiny.to_csv("jd_full_filtered_tiny.csv.gz", index=False, compression="gzip")
   df_jd_full_long_tiny = selection(df_jd_full[df_jd_full.content.str.len()>=30], cols=["set", "score"], num=1000).sample(frac=1)
   df_jd_full_long_tiny.to_csv("jd_full_filtered_long_tiny.csv.gz", index=False, compression="gzip")
   df_jd_full_xl_tiny = selection(df_jd_full[df_jd_full.content.str.len()>=100], cols=["set", "score"], num=1000).sample(frac=1)
   df_jd_full_xl_tiny.to_csv("jd_full_filtered_xl_tiny.csv.gz", index=False, compression="gzip")

jd_binary
---------

The dataset :code:`jd_binary` is obtained from :code:`jd_full` by collapsing its :code:`score` field (1-5) into 2 classes (1, 2), replacing it with a :code:`label` field. It also comes from the `Glyph project <https://github.com/zhangxiangxiao/glyph/>`_ [#glyph]_.

jd_binary_tiny
--------------

The way the dataset :code:`jd_binary_tiny` is obtained from the dataset :code:`jd_binary` is analogous to how the datasets :code:`dianping_tiny` (:ref:`dianping-tiny`) and :code:`jd_full_tiny` (:ref:`jd-full-tiny`) are generated.
.. [#glyph] Zhang X, LeCun Y. Which encoding is the best for text classification in Chinese, English, Japanese and Korean? arXiv preprint arXiv:1708.02657, 2017.
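The stratified sampling that produces the various tiny datasets can be exercised end to end on a toy table. Below is a minimal, self-contained sketch of the :code:`selection` helper defined above; the toy data and its column values are invented purely for illustration:

```python
from itertools import product
from typing import Sequence

import pandas as pd


def selection(df: pd.DataFrame, cols: Sequence[str], num: int) -> pd.DataFrame:
    """Randomly keep at most ``num`` rows per combination of values in ``cols``."""
    col_sets = product(*[set(df[c].tolist()) for c in cols])
    df_out = pd.DataFrame()
    for combination in col_sets:
        df_tmp = df.copy()
        for c, v in zip(cols, combination):
            df_tmp = df_tmp[df_tmp[c] == v]
        df_tmp = df_tmp.sample(n=min(num, len(df_tmp))).reset_index(drop=True)
        df_out = pd.concat([df_out, df_tmp], ignore_index=True)
    return df_out


# Toy table: 2 splits x 2 labels, with uneven group sizes.
toy = pd.DataFrame({
    "set": ["train"] * 6 + ["test"] * 4,
    "label": [1, 1, 1, 2, 2, 2, 1, 1, 2, 2],
    "text": list("abcdefghij"),
})
tiny = selection(toy, cols=["set", "label"], num=2)
# Each of the 4 (set, label) groups contributes min(2, group size) = 2 rows.
print(len(tiny))  # 8
```

Because each group is truncated to at most :code:`num` rows, the result is balanced across splits and labels even when the source groups have very different sizes, which is exactly why the tiny datasets draw a fixed count per (set, label) or (set, score) combination.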