{"id":25308,"date":"2025-06-14T19:57:23","date_gmt":"2025-06-14T17:57:23","guid":{"rendered":"https:\/\/www.wjst.de\/blog\/?p=25308"},"modified":"2025-08-07T08:17:12","modified_gmt":"2025-08-07T06:17:12","slug":"how-to-run-llama-on-your-local-pdfs","status":"publish","type":"post","link":"https:\/\/www.wjst.de\/blog\/sciencesurf\/2025\/06\/how-to-run-llama-on-your-local-pdfs\/","title":{"rendered":"How to run LLaMA on your local PDFs"},"content":{"rendered":"<p>I needed this urgently for indexing PDFs, as Spotlight on the Mac has been highly erratic after all these years.<\/p>\n<p><a href=\"https:\/\/github.com\/Mintplex-Labs\/anything-llm\">Anything LLM<\/a> seemed the most promising approach, with an easy-to-use GUI and good documentation. But indexing failed after several hours, so I moved on to LM Studio. This installation, too, turned out to be more complicated than expected due to library &#8220;dependency hell&#8221; and spiralling version mismatches&#8230;<\/p>\n<ol>\n<li>Download and install <a href=\"https:\/\/lmstudio.ai\/\">LM Studio<\/a><\/li>\n<li>From inside LM Studio, download your preferred model<\/li>\n<li>Index your PDFs in batches using the Python script below<\/li>\n<li>Combine the indices and run queries against the full index<\/li>\n<\/ol>\n<p>30,000 PDFs result in a 4 GB index, while the system is unfortunately not very responsive (yet).<\/p>\n<p><!--more--><\/p>\n<p><strong>index.py<\/strong><\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\nimport os\r\nfrom tqdm import tqdm\r\nfrom multiprocessing import get_context, TimeoutError\r\nfrom pdfminer.high_level import extract_text\r\nfrom llama_index.core import VectorStoreIndex, Document, Settings\r\nfrom llama_index.llms.ollama import Ollama\r\nfrom llama_index.embeddings.huggingface import HuggingFaceEmbedding\r\n\r\n# === CONFIG ===\r\nos.environ&#x5B;&quot;TOKENIZERS_PARALLELISM&quot;] = &quot;false&quot;\r\nPDF_DIR = 
&quot;\/Users\/xxx\/Documents&quot;\r\nINDEX_BASE_DIR = &quot;\/Users\/xxx\/Store&quot;\r\nMODEL_NAME = &quot;sentence-transformers\/all-MiniLM-L6-v2&quot;\r\nLLM_API_BASE = &quot;http:\/\/localhost:11434\/v1&quot;\r\nCHUNK_SIZE = 10000\r\nSTART_CHUNK = 1\r\nEND_CHUNK = 1\r\nBAD_LOG = os.path.join(INDEX_BASE_DIR, &quot;bad_files.txt&quot;)\r\nTIMEOUT_SECONDS = 60\r\n\r\n# === Step 1: Get all PDFs ===\r\ndef get_all_pdfs(path):\r\n    pdf_files = &#x5B;]\r\n    for root, _, files in os.walk(path):\r\n        for file in files:\r\n            if file.lower().endswith(&quot;.pdf&quot;):\r\n                pdf_files.append(os.path.join(root, file))\r\n    return sorted(pdf_files)\r\n\r\n# === Step 2: Timeout-safe PDF parsing ===\r\ndef parse_pdf_safe(path):\r\n    try:\r\n        text = extract_text(path)\r\n        return {&quot;success&quot;: True, &quot;doc&quot;: Document(text=text, metadata={&quot;file_path&quot;: path})}\r\n    except Exception as e:\r\n        return {&quot;success&quot;: False, &quot;file_path&quot;: path, &quot;error&quot;: str(e)}\r\n\r\ndef parse_pdfs_parallel(paths, timeout=TIMEOUT_SECONDS):\r\n    # Files are dispatched to the pool one at a time so that each file gets\r\n    # its own timeout; a hung parser can be killed without losing the batch.\r\n    documents = &#x5B;]\r\n    print(f&quot;Parsing {len(paths)} PDFs (timeout: {timeout}s each)...&quot;)\r\n\r\n    ctx = get_context(&quot;spawn&quot;)\r\n    pool = ctx.Pool(processes=os.cpu_count())\r\n\r\n    try:\r\n        for path in tqdm(paths, desc=&quot;Parsing PDFs&quot;):\r\n            async_result = pool.apply_async(parse_pdf_safe, (path,))\r\n            try:\r\n                result = async_result.get(timeout=timeout)\r\n                if result&#x5B;&quot;success&quot;]:\r\n                    documents.append(result&#x5B;&quot;doc&quot;])\r\n                else:\r\n                    print(f&quot;{result&#x5B;&#039;file_path&#039;]}: {result&#x5B;&#039;error&#039;]}&quot;)\r\n                    with open(BAD_LOG, &quot;a&quot;) as log:\r\n                        log.write(f&quot;FAIL: {result&#x5B;&#039;file_path&#039;]} :: 
{result&#x5B;&#039;error&#039;]}\\n&quot;)\r\n            except TimeoutError:\r\n                print(f&quot;Timeout: {path}&quot;)\r\n                with open(BAD_LOG, &quot;a&quot;) as log:\r\n                    log.write(f&quot;TIMEOUT: {path}\\n&quot;)\r\n    finally:\r\n        pool.terminate()\r\n        pool.join()\r\n\r\n    return documents\r\n\r\n# === Step 3: Build and save index ===\r\ndef build_index(documents, index_dir):\r\n    print(f&quot;Indexing {len(documents)} documents \u2192 {index_dir}&quot;)\r\n    embed_model = HuggingFaceEmbedding(model_name=MODEL_NAME)\r\n    # llm = OpenAI(api_base=LLM_API_BASE, api_key=&quot;lm-studio&quot;)\r\n    llm = Ollama(model=&quot;llama3&quot;)\r\n\r\n    Settings.embed_model = embed_model\r\n    Settings.llm = llm\r\n\r\n    index = VectorStoreIndex.from_documents(documents)\r\n    index.storage_context.persist(persist_dir=index_dir)\r\n    print(f&quot;Index saved to {index_dir}&quot;)\r\n\r\n# === MAIN ===\r\nif __name__ == &quot;__main__&quot;:\r\n    all_pdfs = get_all_pdfs(PDF_DIR)\r\n    total = len(all_pdfs)\r\n    total_chunks = (total + CHUNK_SIZE - 1) \/\/ CHUNK_SIZE\r\n\r\n    os.makedirs(INDEX_BASE_DIR, exist_ok=True)\r\n    open(BAD_LOG, &quot;w&quot;).close()  # clear previous log\r\n\r\n    for chunk_num in range(START_CHUNK, END_CHUNK + 1):\r\n        start = (chunk_num - 1) * CHUNK_SIZE\r\n        end = min(start + CHUNK_SIZE, total)\r\n        chunk_paths = all_pdfs&#x5B;start:end]\r\n\r\n        index_dir = os.path.join(INDEX_BASE_DIR, f&quot;part_{chunk_num}&quot;)\r\n        if os.path.exists(index_dir):\r\n            print(f&quot;Chunk {chunk_num} already exists, skipping: {index_dir}&quot;)\r\n            continue\r\n\r\n        print(f&quot;\\nProcessing chunk {chunk_num}\/{total_chunks} \u2192 files {start+1} to {end}&quot;)\r\n        docs = parse_pdfs_parallel(chunk_paths)\r\n        build_index(docs, 
index_dir)\r\n<\/pre>\n<p><strong>query.py<\/strong><\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\nimport os\r\nfrom llama_index.core import StorageContext, load_index_from_storage, VectorStoreIndex, Document, Settings\r\nfrom llama_index.llms.openai import OpenAI\r\nfrom llama_index.embeddings.huggingface import HuggingFaceEmbedding\r\n\r\n# === CONFIG ===\r\nINDEX_PARTS_DIR = &quot;\/Users\/xxx\/Store&quot;\r\nMERGED_INDEX_DIR = os.path.join(INDEX_PARTS_DIR, &quot;merged&quot;)\r\n\r\nPART_DIRS = &#x5B;\r\n    os.path.join(INDEX_PARTS_DIR, d)\r\n    for d in os.listdir(INDEX_PARTS_DIR)\r\n    if d.startswith(&quot;part_&quot;) and os.path.isdir(os.path.join(INDEX_PARTS_DIR, d))\r\n]\r\n\r\n# === Setup LM Studio OpenAI-compatible API ===\r\nos.environ&#x5B;&quot;OPENAI_API_KEY&quot;] = &quot;not-needed&quot;  # dummy value to bypass API key checks\r\nos.environ&#x5B;&quot;OPENAI_API_BASE&quot;] = &quot;http:\/\/localhost:1234\/v1&quot;  # LM Studio API base\r\n\r\n# Embedding + LLM setup using the LM Studio OpenAI-compatible API\r\nSettings.embed_model = HuggingFaceEmbedding(model_name=&quot;sentence-transformers\/all-MiniLM-L6-v2&quot;)\r\nSettings.llm = OpenAI(\r\n    model=&quot;llama-3.2-3b-instruct&quot;,  # or your LM Studio supported model\r\n    api_key=&quot;not-needed&quot;,  # dummy key, LM Studio ignores it\r\n    api_base=&quot;http:\/\/localhost:1234\/v1&quot;,\r\n)\r\n\r\n# === Load &amp; collect all documents from partial indices ===\r\nall_documents = &#x5B;]\r\n\r\nfor part_dir in sorted(PART_DIRS):\r\n    print(f&quot;Loading index from: {part_dir}&quot;)\r\n    storage_context = StorageContext.from_defaults(persist_dir=part_dir)\r\n    index = load_index_from_storage(storage_context)\r\n\r\n    for node in storage_context.docstore.docs.values():\r\n        if isinstance(node, Document):\r\n            all_documents.append(node)\r\n        else:\r\n            if hasattr(node, &quot;get_content&quot;) and hasattr(node, 
&quot;metadata&quot;):\r\n                all_documents.append(Document(text=node.get_content(), metadata=node.metadata))\r\n\r\nprint(f&quot;Total documents to merge: {len(all_documents)}&quot;)\r\n\r\n# === Build and persist merged index ===\r\nmerged_index = VectorStoreIndex.from_documents(all_documents)\r\nmerged_index.storage_context.persist(persist_dir=MERGED_INDEX_DIR)\r\nprint(f&quot;Merged index saved to: {MERGED_INDEX_DIR}&quot;)\r\n\r\n# === Load merged index for querying ===\r\nstorage = StorageContext.from_defaults(persist_dir=MERGED_INDEX_DIR)\r\nindex = load_index_from_storage(storage)\r\nquery_engine = index.as_query_engine()\r\n\r\n# === Interactive query loop ===\r\nprint(&quot;\\nAsk anything about your documents. Type &#039;exit&#039; or &#039;quit&#039; to stop.&quot;)\r\nwhile True:\r\n    query = input(&quot;You: &quot;).strip()\r\n    if query.lower() in (&quot;exit&quot;, &quot;quit&quot;):\r\n        break\r\n    try:\r\n        response = query_engine.query(query)\r\n        print(f&quot;Response: {response}\\n&quot;)\r\n    except Exception as e:\r\n        print(f&quot;Error: {e}\\n&quot;)\r\n<\/pre>\n<p><span style=\"text-decoration: underline;\">7 Aug 2025<\/span><\/p>\n<p>So far, I have used DeepSeek inside LM Studio as my model of choice. Since yesterday, we can also use OpenAI&#8217;s <a href=\"https:\/\/openai.com\/de-DE\/index\/introducing-gpt-oss\">gpt-oss-20b, which is basically OpenAI o4-mini<\/a>.<\/p>\n<div class=\"bottom-note\">\n  <span class=\"mod1\">CC-BY-NC Science Surf, accessed 05.04.2026<\/span>\n <\/div>","protected":false},"excerpt":{"rendered":"<p>I needed this urgently for indexing PDFs, as Spotlight on the Mac has been highly erratic after all these years. Anything LLM seemed the most promising approach, with an easy-to-use GUI and good documentation. But indexing failed after several hours, so I moved on to LM Studio. 
Also this installation turned out to &hellip; <a href=\"https:\/\/www.wjst.de\/blog\/sciencesurf\/2025\/06\/how-to-run-llama-on-your-local-pdfs\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">How to run LLaMA on your local PDFs<\/span> <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[9,3774],"tags":[5054,4253,5053],"class_list":["post-25308","post","type-post","status-publish","format-standard","hentry","category-computer-software","category-tech","tag-deepseek","tag-llm","tag-llama"],"_links":{"self":[{"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/posts\/25308","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/comments?post=25308"}],"version-history":[{"count":13,"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/posts\/25308\/revisions"}],"predecessor-version":[{"id":25504,"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/posts\/25308\/revisions\/25504"}],"wp:attachment":[{"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/media?parent=25308"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/categories?post=25308"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/tags?post=25308"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}