Got bored and curious: is there a way to see the token index per word(?), so I looked into it.
[링크 : https://makenow90.tistory.com/59]
[링크 : https://medium.com/the-research-nest/explained-tokens-and-embeddings-in-llms-69a16ba5db33]
Didn't expect AutoTokenizer.from_pretrained() to just download everything on the spot -ㅁ-
$ python3
Python 3.10.12 (main, Mar 3 2026, 11:56:32) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from transformers import AutoTokenizer
Disabling PyTorch because PyTorch >= 2.4 is required but found 2.1.2
PyTorch was not found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
>>>
>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
config.json: 100%|██████████████████████████████| 665/665 [00:00<00:00, 819kB/s]
tokenizer_config.json: 100%|█████████████████| 26.0/26.0 [00:00<00:00, 37.5kB/s]
vocab.json: 1.04MB [00:00, 2.70MB/s]
merges.txt: 456kB [00:00, 1.15MB/s]
tokenizer.json: 1.36MB [00:00, 3.23MB/s]
>>>
>>> tokenizer
GPT2Tokenizer(name_or_path='gpt2', vocab_size=50257, model_max_length=1024, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}, added_tokens_decoder={
    50256: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
})
>>> text="안녕? hello?"
>>> tokens=tokenizer(text)
>>> tokens
{'input_ids': [168, 243, 230, 167, 227, 243, 30, 23748, 30], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
[링크 : https://data-scient2st.tistory.com/224]
[링크 : https://huggingface.co/docs/transformers/model_doc/gpt2]
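The input_ids above don't show which piece of the text each number came from. The tokenizer has tokenize() / convert_tokens_to_ids() / convert_ids_to_tokens() helpers for exactly that; a minimal sketch (the exact splits are illustrative, not re-verified here):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "안녕? hello?"

# split the text into BPE pieces first, then look each piece up in the vocab
pieces = tokenizer.tokenize(text)
ids = tokenizer.convert_tokens_to_ids(pieces)
for piece, idx in zip(pieces, ids):
    print(idx, repr(piece), repr(tokenizer.decode([idx])))

# reverse direction also works: index -> piece
print(tokenizer.convert_ids_to_tokens(ids))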
$ find ./ -name tokenizer.json
./.cache/huggingface/hub/models--gpt2/snapshots/607a30d783dfa663caf39e06633721c8d4cfcd7e/tokenizer.json
$ tree ~/.cache/huggingface/hub/models--gpt2/snapshots/607a30d783dfa663caf39e06633721c8d4cfcd7e/
/home/minimonk/.cache/huggingface/hub/models--gpt2/snapshots/607a30d783dfa663caf39e06633721c8d4cfcd7e/
├── config.json -> ../../blobs/10c66461e4c109db5a2196bff4bb59be30396ed8
├── merges.txt -> ../../blobs/226b0752cac7789c48f0cb3ec53eda48b7be36cc
├── tokenizer.json -> ../../blobs/4b988bccc9dc5adacd403c00b4704976196548f8
├── tokenizer_config.json -> ../../blobs/be4d21d94f3b4687e5a54d84bf6ab46ed0f8defd
└── vocab.json -> ../../blobs/1f1d9aaca301414e7f6c9396df506798ff4eb9a6
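Since vocab.json is just a flat piece-to-index JSON map, the same lookup can be done without transformers at all; a rough sketch against the cached file found above (the snapshot hash will differ per machine):

import json
import os

# snapshot path from the find/tree output above; adjust to your own cache
vocab_path = os.path.expanduser(
    "~/.cache/huggingface/hub/models--gpt2/snapshots/"
    "607a30d783dfa663caf39e06633721c8d4cfcd7e/vocab.json")

with open(vocab_path, encoding="utf-8") as f:
    vocab = json.load(f)      # {"piece": index, ...}

print(len(vocab))             # vocab_size is 50257 per the tokenizer dump above
print(vocab.get("hello"))     # piece without a leading space
print(vocab.get("Ġhello"))    # Ġ marks a leading space; should match the 23748 seen in input_ids above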
+
2026.04.18
Windows

Run through a beautifier + only a partial excerpt
The vocab seems to be built around Latin-script text, and judging by how the input gets cut into tokens,
it's not word-level; even a single word looks like it gets chopped into fragments?
tokenizer.json:
{
  "version": "1.0",
  "truncation": null,
  "padding": null,
  "added_tokens": [ { "id": 50256, "special": true, "content": "<|endoftext|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true } ],
  "normalizer": null,
  "pre_tokenizer": { "type": "ByteLevel", "add_prefix_space": false, "trim_offsets": true },
  "post_processor": { "type": "ByteLevel", "add_prefix_space": true, "trim_offsets": false },
  "decoder": { "type": "ByteLevel", "add_prefix_space": true, "trim_offsets": true },
  "model": {
    "dropout": null,
    "unk_token": null,
    "continuing_subword_prefix": "",
    "end_of_word_suffix": "",
    "fuse_unk": false,
    "vocab": {
      "0": 15, "1": 16, "2": 17, "3": 18, "4": 19, "5": 20, "6": 21, "7": 22, "8": 23, "9": 24,
      "!": 0, "\"": 1, "#": 2, "$": 3, "%": 4, "&": 5, "'": 6, "(": 7, ")": 8, "*": 9, "+": 10, ",": 11, "-": 12, ".": 13, "/": 14,
      ":": 25, ";": 26, "<": 27, "=": 28, ">": 29, "?": 30, "@": 31,
      "A": 32, "B": 33, "C": 34, "D": 35, "E": 36, "F": 37, "G": 38, "H": 39, "I": 40, "J": 41, "K": 42, "L": 43, "M": 44, "N": 45, "O": 46, "P": 47, "Q": 48, "R": 49, "S": 50, "T": 51, "U": 52, "V": 53, "W": 54, "X": 55, "Y": 56, "Z": 57,
      "[": 58, "\\": 59, "]": 60, "^": 61, "_": 62, "`": 63,
      "a": 64, "b": 65, "c": 66, "d": 67, "e": 68, "f": 69, "g": 70, "h": 71, "i": 72, "j": 73, "k": 74, "l": 75, "m": 76, "n": 77, "o": 78, "p": 79, "q": 80, "r": 81, "s": 82, "t": 83, "u": 84, "v": 85, "w": 86, "x": 87, "y": 88, "z": 89,
      "{": 90, "|": 91, "}": 92, "~": 93, "¡": 94, "¢": 95, "£": 96, "¤": 97, "¥": 98, "¦": 99,
      "he": 258, "in": 259, "re": 260, "on": 261, "Ġthe": 262, "er": 263, "Ġs": 264, "at": 265, "Ġw": 266, "Ġo": 267, "en": 268, "Ġc": 269, "it": 270, "is": 271, "an": 272, "or": 273, "es": 274, "Ġb": 275, "ed": 276, "Ġf": 277, "ing": 278, "Ġp": 279, "ou": 280, "Ġan": 281, "al": 282, "ar": 283, "Ġto": 284, "Ġm": 285, "Ġof": 286, "Ġin": 287, "Ġd": 288, "Ġh": 289, "Ġand": 290, "ic": 291, "as": 292, "le": 293, "Ġth": 294, "ion": 295, "om": 296, "ll": 297, "ent": 298, "Ġn": 299, "Ġl": 300,
      "st": 301, "Ġre": 302, "ve": 303, "Ġe": 304, "ro": 305, "ly": 306, "Ġbe": 307, "Ġg": 308, "ĠT": 309, "ct": 310, "ĠS": 311, "id": 312, "ot": 313, "ĠI": 314, "ut": 315, "et": 316, "ĠA": 317, "Ġis": 318, "Ġon": 319, "im": 320, "am": 321, "ow": 322, "ay": 323, "ad": 324, "se": 325, "Ġthat": 326, "ĠC": 327, "ig": 328, "Ġfor": 329, "ac": 330, "Ġy": 331, "ver": 332, "ur": 333, "Ġu": 334, "ld": 335, "Ġst": 336, "ĠM": 337, "'s": 338, "Ġhe": 339, "Ġit": 340, "ation": 341, "ith": 342, "ir": 343, "ce": 344, "Ġyou": 345, "il": 346, "ĠB": 347, "Ġwh": 348, "ol": 349, "ĠP": 350,
      "Ġwith": 351, "Ġ1": 352, "ter": 353, "ch": 354, "Ġas": 355, "Ġwe": 356, "Ġ(": 357, "nd": 358, "ill": 359, "ĠD": 360, "if": 361, "Ġ2": 362, "ag": 363, "ers": 364, "ke": 365, "Ġ\"": 366, "ĠH": 367, "em": 368, "Ġcon": 369, "ĠW": 370, "ĠR": 371, "her": 372, "Ġwas": 373, "Ġr": 374, "od": 375, "ĠF": 376, "ul": 377, "ate": 378, "Ġat": 379, "ri": 380, "pp": 381, "ore": 382, "ĠThe": 383, "Ġse": 384, "us": 385, "Ġpro": 386, "Ġha": 387, "um": 388, "Ġare": 389, "Ġde": 390, "ain": 391, "and": 392,

vocab.json:
{
  "0": 15, "1": 16, "2": 17, "3": 18, "4": 19, "5": 20, "6": 21, "7": 22, "8": 23, "9": 24,
  "!": 0, "\"": 1, "#": 2, "$": 3, "%": 4, "&": 5, "'": 6, "(": 7, ")": 8, "*": 9, "+": 10, ",": 11, "-": 12, ".": 13, "/": 14,
  ":": 25, ";": 26, "<": 27, "=": 28, ">": 29, "?": 30, "@": 31,
  "A": 32, "B": 33, "C": 34, "D": 35, "E": 36, "F": 37, "G": 38, "H": 39, "I": 40, "J": 41, "K": 42, "L": 43, "M": 44, "N": 45, "O": 46, "P": 47, "Q": 48, "R": 49, "S": 50, "T": 51, "U": 52, "V": 53, "W": 54, "X": 55, "Y": 56, "Z": 57,
  "[": 58, "\\": 59, "]": 60, "^": 61, "_": 62, "`": 63,
  "a": 64, "b": 65, "c": 66, "d": 67, "e": 68, "f": 69, "g": 70, "h": 71, "i": 72, "j": 73, "k": 74, "l": 75, "m": 76, "n": 77, "o": 78, "p": 79, "q": 80, "r": 81, "s": 82, "t": 83, "u": 84, "v": 85, "w": 86, "x": 87, "y": 88, "z": 89,
  "{": 90, "|": 91, "}": 92, "~": 93, "¡": 94, "¢": 95, "£": 96, "¤": 97, "¥": 98, "¦": 99,
  "he": 258, "in": 259, "re": 260, "on": 261, "Ġthe": 262, "er": 263, "Ġs": 264, "at": 265, "Ġw": 266, "Ġo": 267, "en": 268, "Ġc": 269, "it": 270, "is": 271, "an": 272, "or": 273, "es": 274, "Ġb": 275, "ed": 276, "Ġf": 277, "ing": 278, "Ġp": 279, "ou": 280, "Ġan": 281, "al": 282, "ar": 283, "Ġto": 284, "Ġm": 285, "Ġof": 286, "Ġin": 287, "Ġd": 288, "Ġh": 289, "Ġand": 290, "ic": 291, "as": 292, "le": 293, "Ġth": 294, "ion": 295, "om": 296, "ll": 297, "ent": 298, "Ġn": 299, "Ġl": 300,
  "st": 301, "Ġre": 302, "ve": 303, "Ġe": 304, "ro": 305, "ly": 306, "Ġbe": 307, "Ġg": 308, "ĠT": 309, "ct": 310, "ĠS": 311, "id": 312, "ot": 313, "ĠI": 314, "ut": 315, "et": 316, "ĠA": 317, "Ġis": 318, "Ġon": 319, "im": 320, "am": 321, "ow": 322, "ay": 323, "ad": 324, "se": 325, "Ġthat": 326, "ĠC": 327, "ig": 328, "Ġfor": 329, "ac": 330, "Ġy": 331, "ver": 332, "ur": 333, "Ġu": 334, "ld": 335, "Ġst": 336, "ĠM": 337, "'s": 338, "Ġhe": 339, "Ġit": 340, "ation": 341, "ith": 342, "ir": 343, "ce": 344, "Ġyou": 345, "il": 346, "ĠB": 347, "Ġwh": 348, "ol": 349, "ĠP": 350,
  "Ġwith": 351, "Ġ1": 352, "ter": 353, "ch": 354, "Ġas": 355, "Ġwe": 356, "Ġ(": 357, "nd": 358, "ill": 359, "ĠD": 360, "if": 361, "Ġ2": 362, "ag": 363, "ers": 364, "ke": 365, "Ġ\"": 366, "ĠH": 367, "em": 368, "Ġcon": 369, "ĠW": 370, "ĠR": 371, "her": 372, "Ġwas": 373, "Ġr": 374, "od": 375, "ĠF": 376, "ul": 377, "ate": 378, "Ġat": 379, "ri": 380, "pp": 381, "ore": 382, "ĠThe": 383, "Ġse": 384, "us": 385, "Ġpro": 386, "Ġha": 387, "um": 388, "Ġare": 389, "Ġde": 390, "ain": 391, "and": 392,
+
>>> tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-E2B")
config.json: 4.91kB [00:00, 8.83MB/s]
C:\Users\minimonk\AppData\Local\Programs\Python\Python313\Lib\site-packages\huggingface_hub\file_download.py:138: UserWarning: `huggingface_hub` cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\Users\minimonk\.cache\huggingface\hub\models--google--gemma-4-E2B. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the `HF_HUB_DISABLE_SYMLINKS_WARNING` environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations. To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
warnings.warn(message)
tokenizer_config.json: 100%|██████████████████████████████████████████████████████████| 906/906 [00:00<00:00, 2.57MB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████| 32.2M/32.2M [00:02<00:00, 13.3MB/s]
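To see whether the Korean part fares any better here than with gpt2, the same text can be pushed through this tokenizer as well; a sketch using the model id from the log above (gated models may additionally require an accepted license / HF_TOKEN, and the resulting ids aren't verified here):

from transformers import AutoTokenizer

# same model id as in the download log above
tok = AutoTokenizer.from_pretrained("google/gemma-4-E2B")

text = "안녕? hello?"
ids = tok(text)["input_ids"]
print(ids)
print(tok.convert_ids_to_tokens(ids))  # compare the piece count with the gpt2 result above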
