A recent project concept: a serverless application powered by Docling's document ingestion and preparation capabilities.
Introduction
As part of my professional activities, I often help our business partners gain hands-on technical experience with the technologies and tools we recommend to them. What follows is one part of a larger project in which we provided our partner with some code samples to accelerate the first phase of their project.
> The code provided below is meant to be used as a starter or helper and is to be adapted to the real use case. It should not be considered a finished or end-to-end project.
The main idea is:
- An application lets users upload documents to a cloud file system.
- A serverless job application using Docling fetches the documents, prepares them for later use, and drops the result in another cloud file system.
The serverless application, deployed on IBM Code Engine, fetches its source and updates from a private GitHub repository.
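The two steps above can be sketched as a small batch job. This is only an illustration under stated assumptions: local folders stand in for the two cloud file systems, and `run_job`, `SUPPORTED`, and the `convert` callable are hypothetical names, not part of the project's code. The conversion itself is passed in as a function so that the Docling call can be plugged in later.

```python
from pathlib import Path
from typing import Callable

# Hypothetical set of extensions the job picks up; adjust to the real use case.
SUPPORTED = {".pdf", ".docx", ".xlsx", ".html", ".png", ".jpg"}


def run_job(src: Path, dst: Path, convert: Callable[[Path], str]) -> list[str]:
    """Convert every supported file in src that has no result in dst yet.

    Returns the names of the result files written during this run.
    """
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for doc in sorted(src.iterdir()):
        if not doc.is_file() or doc.suffix.lower() not in SUPPORTED:
            continue
        # Keep the full source name so a.pdf and a.docx do not collide.
        target = dst / f"{doc.name}.md"
        if target.exists():
            # Already prepared by an earlier run: skip it.
            continue
        target.write_text(convert(doc), encoding="utf-8")
        written.append(target.name)
    return written
```

Because already-converted files are skipped, the job can be triggered repeatedly (for example on a schedule) without redoing earlier work.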
What is Docling and what is it used for
Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
Features
- Parsing of multiple document formats incl. PDF, DOCX, XLSX, HTML, images, and more
- Advanced PDF understanding incl. page layout, reading order, table structure, code, formulas, image classification, and more
- 🧬 Unified, expressive DoclingDocument representation format
- ↪️ Various export formats and options, including Markdown, HTML, and lossless JSON
- Local execution capabilities for sensitive data and air-gapped environments
- 🤖 Plug-and-play integrations incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
- Extensive OCR support for scanned PDFs and images
- Simple and convenient CLI
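To make the conversion and export features listed above more concrete, here is a minimal sketch. It assumes Docling's Python API as I understand it from its documentation (`DocumentConverter().convert(...)` returning a result whose `document` offers `export_to_markdown()` and `export_to_dict()`); the helper names `export_document` and `convert_file` are hypothetical and only serve the illustration.

```python
import json


def export_document(doc, fmt: str = "md") -> str:
    """Serialize a converted document as Markdown ("md") or JSON ("json")."""
    if fmt == "md":
        return doc.export_to_markdown()
    if fmt == "json":
        return json.dumps(doc.export_to_dict())
    raise ValueError(f"unsupported format: {fmt}")


def convert_file(source: str, fmt: str = "md") -> str:
    """Convert a local path or URL with Docling and return the exported text."""
    # Imported lazily so export_document can be used without Docling installed.
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(source)
    return export_document(result.document, fmt)
```

A call such as `convert_file("report.pdf")` (a hypothetical file name) would then return the Markdown rendering of that document.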
The file uploading application
I proposed two simple applications to upload and store files. The first one uses FastAPI.
File uploading using FastAPI
```python
import os

from fastapi import FastAPI, Request, File, UploadFile
from fastapi.responses import HTMLResponse, RedirectResponse
from fastapi.templating import Jinja2Templates

app = FastAPI()
templates = Jinja2Templates(directory="templates")

UPLOAD_DIR = "uploads"
os.makedirs(UPLOAD_DIR, exist_ok=True)


def get_uploaded_files():
    """Return the sorted names of the files in the upload directory."""
    try:
        files = os.listdir(UPLOAD_DIR)
        files.sort()
        return files
    except FileNotFoundError:
        return []


@app.get("/", response_class=HTMLResponse)
async def read_root(request: Request):
    uploaded_files = get_uploaded_files()
    return templates.TemplateResponse("index.html", {"request": request, "filename": None, "message": None, "uploaded_files": uploaded_files})


@app.post("/upload", response_class=HTMLResponse)
async def upload_file(request: Request, file: UploadFile = File(...)):
    filename = file.filename
    filepath = os.path.join(UPLOAD_DIR, filename)

    if os.path.exists(filepath):
        # Ask the user before overwriting an existing file
        return templates.TemplateResponse("confirm.html", {"request": request, "filename": filename})

    with open(filepath, "wb") as f:
        contents = await file.read()
        f.write(contents)
    uploaded_files = get_uploaded_files()  # Refresh file list
    return templates.TemplateResponse("index.html", {"request": request, "filename": filename, "message": f"File '{filename}' uploaded successfully.", "uploaded_files": uploaded_files})


@app.post("/confirm_replace", response_class=HTMLResponse)
async def confirm_replace(request: Request):
    form = await request.form()
    filename = form.get("filename")
    replace = form.get("replace")

    if not filename or not replace:
        return templates.TemplateResponse("index.html", {"request": request, "message": "Missing filename or replace value."})

    filepath = os.path.join(UPLOAD_DIR, filename)

    if replace == "yes":
        try:
            # Uploaded files are part of the parsed multipart form;
            # the Request object has no files() method.
            file = form.get("file")
            if file is None or not getattr(file, "filename", ""):
                return templates.TemplateResponse("index.html", {"request": request, "message": "No file uploaded for replacement."})

            contents = await file.read()
            with open(filepath, "wb") as f:
                f.write(contents)
            uploaded_files = get_uploaded_files()  # Refresh file list
            return templates.TemplateResponse("index.html", {"request": request, "filename": filename, "message": f"File '{filename}' replaced successfully.", "uploaded_files": uploaded_files})
        except Exception as e:
            return templates.TemplateResponse("index.html", {"request": request, "filename": filename, "message": f"Error replacing file: {e}"})
    elif replace == "no":
        uploaded_files = get_uploaded_files()  # Refresh file list
        return templates.TemplateResponse("index.html", {"request": request, "filename": filename, "message": f"No action taken for '{filename}'. File already exists.", "uploaded_files": uploaded_files})
    else:
        return templates.TemplateResponse("index.html", {"request": request, "filename": filename, "message": "Invalid response."})


@app.post("/delete", response_class=RedirectResponse)
async def delete_files(request: Request):
    form = await request.form()
    files_to_delete = form.getlist("files")

    for file_to_delete in files_to_delete:
        filepath = os.path.join(UPLOAD_DIR, file_to_delete)
        try:
            os.remove(filepath)
        except Exception as e:
            print(f"Error deleting {file_to_delete}: {e}")

    # Redirect back to the file list whether or not anything was deleted
    return RedirectResponse("/", status_code=303)
```
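The handlers above join a client-supplied filename directly onto the upload directory, so a name such as `../../etc/passwd` could point outside of it. The following is a minimal sketch of one way to guard against that, assuming it is acceptable to silently drop any directory components; `safe_upload_path` is a hypothetical helper, not part of the original sample.

```python
import os


def safe_upload_path(upload_dir: str, filename: str) -> str:
    """Return a path inside upload_dir for filename, or raise ValueError."""
    # Normalize Windows separators, then keep only the last path component.
    name = os.path.basename(filename.replace("\\", "/"))
    if name in ("", ".", ".."):
        raise ValueError(f"invalid filename: {filename!r}")
    return os.path.join(upload_dir, name)
```

In the upload, replace, and delete handlers, `os.path.join(UPLOAD_DIR, filename)` could then be swapped for `safe_upload_path(UPLOAD_DIR, filename)`, returning an error message when `ValueError` is raised.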
Index.html
```html
<!-- index.html -->
<!DOCTYPE html>
<html>
<head>
    <title>File Upload</title>
    <style>
        body {
            font-family: sans-serif;
            background-color: #f4f4f4;
            color: #333;
            margin: 20px;
            display: flex;
            flex-direction: column;
            align-items: center; /* Center content horizontally */
        }
        h1 {
            color: #007bff; /* Blue heading */
            margin-bottom: 20px;
        }
        form {
            background-color: #fff;
            padding: 20px;
            border-radius: 8px;
            box-shadow: 0 2px 5px rgba(0, 0, 0, 0.1);
            margin-bottom: 20px;
            width: 400px; /* Set a fixed width for the form */
        }
        input[type="file"] {
            margin-bottom: 10px;
        }
        input[type="submit"] {
            background-color: #007bff;
            color: #fff;
            padding: 10px 15px;
            border: none;
            border-radius: 4px;
            cursor: pointer;
        }
        input[type="submit"]:hover {
            background-color: #0056b3;
        }
        h2 {
            margin-top: 20px;
            color: #343a40; /* Darker heading */
        }
        ul {
            list-style: none;
            padding: 0;
        }
        li {
            margin-bottom: 5px;
            display: flex; /* Align checkbox and label */
            align-items: center;
        }
        input[type="checkbox"] {
            margin-right: 5px;
        }
        p {
            color: #d9534f; /* Red message for errors or feedback */
            margin-top: 10px;
        }
        .uploaded-file-list { /* Style the uploaded files list */
            background-color: #fff;
            padding: 15px;
            border-radius: 8px;
            box-shadow: 0 2px 5px rgba(0, 0, 0, 0.1);
            width: 400px; /* Match the form width */
        }
    </style>
    <script>
        function validateForm() {
            const fileInput = document.querySelector('input[type="file"]');
            if (fileInput.files.length === 0) {
                alert("No files selected!");
                return false; // Prevent form submission
            }
            return true; // Allow form submission
        }

        function validateDeleteForm() {
            const checkboxes = document.querySelectorAll('input[type="checkbox"]:checked');
            if (checkboxes.length === 0) {
                alert("No files selected for deletion!");
                return false;
            }
            return true;
        }
    </script>
</head>
<body>
    <h1>Upload a File</h1>
    <form action="/upload" method="post" enctype="multipart/form-data" onsubmit="return validateForm();">
        <input type="file" name="file">
        <input type="submit" value="Upload">
    </form>

    {% if filename %}
    <h2>Uploaded File: {{ filename }}</h2>
    {% endif %}

    {% if message %}
    <p>{{ message }}</p>
    {% endif %}

    <div class="uploaded-file-list">
        <h2>Uploaded Files:</h2>
        <!-- validateDeleteForm() was defined but never called; it is wired up here -->
        <form action="/delete" method="post" onsubmit="return validateDeleteForm();">
            <ul>
                {% for file in uploaded_files %}
                <li>
                    <input type="checkbox" name="files" value="{{ file }}" id="{{ file }}">
                    <label for="{{ file }}">{{ file }}</label>
                </li>
                {% endfor %}
            </ul>
            <input type="submit" value="Delete Selected">
        </form>
    </div>
</body>
</html>
```
Confirm.html
```html
<!-- confirm.html -->
<!DOCTYPE html>
<html>
<head>
    <title>Confirm Replace</title>
    <style>
        body {
            font-family: sans-serif;
            background-color: #f4f4f4;
            color: #333;
            margin: 20px;
            display: flex;
            flex-direction: column;
            align-items: center; /* Center content horizontally */
        }
        h1 {
            color: #d9534f; /* Red heading for warning */
            margin-bottom: 20px;
        }
        p {
            margin-bottom: 20px;
        }
        form {
            background-color: #fff;
            padding: 20px;
            border-radius: 8px;
            box-shadow: 0 2px 5px rgba(0, 0, 0, 0.1);
            width: 400px; /* Set a fixed width for the form */
        }
        input[type="file"] {
            margin-bottom: 10px;
            width: calc(100% - 10px); /* Ensures the file input doesn't overflow */
        }
        label {
            margin-right: 10px; /* Space between radio button and label */
        }
        input[type="radio"] {
            margin-right: 5px;
        }
        input[type="submit"] {
            background-color: #007bff;
            color: #fff;
            padding: 10px 15px;
            border: none;
            border-radius: 4px;
            cursor: pointer;
            margin-top: 10px; /* Space above the button */
        }
        input[type="submit"]:hover {
            background-color: #0056b3;
        }
    </style>
</head>
<body>
    <h1>File Already Exists</h1>
    <p>The file '{{ filename }}' already exists. Do you want to replace it?</p>
    <form action="/confirm_replace" method="post" enctype="multipart/form-data">
        <input type="hidden" name="filename" value="{{ filename }}">
        <input type="file" name="file" required><br>
        <input type="radio" id="yes" name="replace" value="yes" required>
        <label for="yes">Yes</label>
        <input type="radio" id="no" name="replace" value="no">
        <label for="no">No</label><br>
        <input type="submit" value="Confirm">
    </form>
</body>
</html>
```
The Dockerfile which builds an image for the application.
```dockerfile
# Use a Python base image
FROM python:3.11-slim-buster

# Set the working directory inside the container
WORKDIR /app

# Copy the requirements file (if you have one)
# --- Create this file if you use external packages
COPY requirements.txt .

# Install dependencies from requirements.txt
RUN pip install --no-cache-dir -r requirements.txt

# Or install dependencies directly (if you don't have a requirements.txt file)
# RUN pip install --no-cache-dir fastapi uvicorn Jinja2 python-multipart

# Copy the application code
COPY . .

# Expose the port that Uvicorn will run on
EXPOSE 8000

# Start the Uvicorn server
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```
And a sample YAML file for the deployment part (it does not represent the actual cluster).
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-fastapi-deployment
  namespace: files # Deploy to the "files" namespace
spec:
  replicas: 3 # Number of pods (adjust as needed)
  selector:
    matchLabels:
      app: my-fastapi-app
  template:
    metadata:
      labels:
        app: my-fastapi-app
    spec:
      containers:
        - name: my-fastapi-container
          image: my-fastapi-image:latest # Replace with your Docker image name and tag
          ports:
            - containerPort: 8000
          volumeMounts:
            - name: uploads-volume
              mountPath: /app/uploads # Mount the volume to the uploads directory
          resources: # Resource requests and limits
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
      volumes:
        - name: uploads-volume
          persistentVolumeClaim: # Use a PersistentVolumeClaim for persistent storage
            claimName: my-fastapi-pvc # Create this PVC separately
---
apiVersion: v1
kind: Service
metadata:
  name: my-fastapi-service
  namespace: files
spec:
  selector:
    app: my-fastapi-app
  ports:
    - protocol: TCP
      port: 80 # External port
      targetPort: 8000 # Container port
  type: LoadBalancer # Use a LoadBalancer to expose the service externally
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-fastapi-pvc
  namespace: files
spec:
  accessModes: [ "ReadWriteOnce" ] # Or ReadWriteMany if needed
  resources:
    requests:
      storage: 1Gi # Adjust storage size as needed
```
File uploading using Streamlit
However, a framework like Streamlit turned out to be handier and easier to deploy as a containerized application in a cluster-based deployment.
    import streamlit as st
    from pathlib import Path

    UPLOAD_DIR = Path("uploads")  # Use Path for better path handling
    UPLOAD_DIR.mkdir(exist_ok=True)  # Create uploads directory if it doesn't exist


    def get_uploaded_files():
        return sorted([f.name for f in UPLOAD_DIR.iterdir()])


    st.title("File Upload and Management")

    uploaded_file = st.file_uploader("Choose a file", type=None)  # Allow any file type

    if uploaded_file is not None:
        filepath = UPLOAD_DIR / uploaded_file.name
        if filepath.exists():
            # Default to "No" so nothing is overwritten without an explicit choice
            replace = st.radio(
                f"File '{uploaded_file.name}' already exists. Replace?",
                ("Yes", "No"),
                index=1,
            )
            if replace == "Yes":
                with open(filepath, "wb") as f:
                    f.write(uploaded_file.getbuffer())
                st.success(f"File '{uploaded_file.name}' replaced successfully.")
            else:
                st.info(f"No action taken for '{uploaded_file.name}'. File already exists.")
        else:
            with open(filepath, "wb") as f:
                f.write(uploaded_file.getbuffer())
            st.success(f"File '{uploaded_file.name}' uploaded successfully.")

    st.subheader("Uploaded Files:")
    uploaded_files = get_uploaded_files()

    if uploaded_files:
        for file in uploaded_files:
            if st.checkbox(file):  # Checkbox for each file
                if st.button(f"Delete {file}"):  # Delete button next to checkbox
                    try:
                        (UPLOAD_DIR / file).unlink()  # Delete the file
                    except Exception as e:
                        st.error(f"Error deleting '{file}': {e}")
                    else:
                        # st.rerun() replaces the deprecated st.experimental_rerun();
                        # it restarts the script, so the refreshed list shows the change
                        st.rerun()
    else:
        st.info("No files uploaded yet.")
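The overwrite decision in the upload app can also be kept apart from the Streamlit widgets, which makes it testable without a browser. The following is a minimal sketch and not part of the original project: `save_upload`, its `overwrite` flag and the returned status labels are hypothetical names, and reducing the client-supplied file name to its last path component is an extra precaution assumed here.

```python
from pathlib import Path


def save_upload(data: bytes, filename: str, upload_dir: Path, overwrite: bool = False) -> str:
    """Write uploaded bytes into upload_dir and report what happened.

    Returns "uploaded", "replaced" or "skipped" (hypothetical status labels).
    """
    # Keep only the last path component so a name like "../x.pdf" stays inside upload_dir.
    name = Path(filename).name
    if name in ("", ".."):
        raise ValueError(f"Unusable file name: {filename!r}")

    upload_dir.mkdir(parents=True, exist_ok=True)
    target = upload_dir / name
    existed = target.exists()

    # Mirror the app's behavior: an existing file is only replaced on request.
    if existed and not overwrite:
        return "skipped"

    target.write_bytes(data)
    return "replaced" if existed else "uploaded"
```

In the Streamlit app, the radio button's answer would map to `overwrite`, and the returned status would select between `st.success` and `st.info`.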
Building a container for the code above!
    # Use a Python base image
    FROM python:3.11-slim-buster

    # Set the working directory
    WORKDIR /app

    # Copy requirements.txt (recommended)
    COPY requirements.txt .

    # Install dependencies
    RUN pip install --no-cache-dir -r requirements.txt

    # Copy application code
    COPY . .

    # Expose the Streamlit port (default is 8501)
    EXPOSE 8501

    # Run Streamlit (replace main_st.py with your Streamlit file name).
    # The comment stays on its own line: Dockerfile comments are only recognized at the start of a line.
    CMD ["streamlit", "run", "main_st.py"]
Sample Docling application using the Streamlit framework
Below is starter code that serves as a helper for a Docling web-based application.
    import json
    import logging
    import time
    from pathlib import Path
    import os
    import shutil  # For copying directories

    import streamlit as st
    from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import (
        AcceleratorDevice,
        AcceleratorOptions,
        PdfPipelineOptions,
    )
    from docling.document_converter import DocumentConverter, PdfFormatOption

    _log = logging.getLogger(__name__)

    # Define the mount paths
    KUBERNETES_VOLUME_MOUNT_PATH = "/app/uploads"
    SCRATCH_VOLUME_MOUNT_PATH = "/app/scratch"


    def process_pdf(input_doc_path, scratch_dir, pipeline_options):
        """Processes a single PDF file."""
        doc_converter = DocumentConverter(
            format_options={
                InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
            }
        )
        try:
            conv_result = doc_converter.convert(input_doc_path)
            doc_filename = conv_result.input.file.stem

            with (scratch_dir / f"{doc_filename}.json").open("w", encoding="utf-8") as fp:
                json.dump(conv_result.document.export_to_dict(), fp)
            with (scratch_dir / f"{doc_filename}.txt").open("w", encoding="utf-8") as fp:
                fp.write(conv_result.document.export_to_text())
            with (scratch_dir / f"{doc_filename}.md").open("w", encoding="utf-8") as fp:
                fp.write(conv_result.document.export_to_markdown())
            with (scratch_dir / f"{doc_filename}.doctags").open("w", encoding="utf-8") as fp:
                fp.write(conv_result.document.export_to_document_tokens())
            return True  # Indicate success
        except Exception as e:
            st.error(f"Error processing {input_doc_path}: {e}")
            return False  # Indicate failure


    def main():
        logging.basicConfig(level=logging.INFO)
        st.title("Docling Document Conversion")

        # Kubernetes volume directory
        kubernetes_volume_dir = Path(KUBERNETES_VOLUME_MOUNT_PATH)
        if not kubernetes_volume_dir.exists():
            st.error(f"Kubernetes volume not found at {KUBERNETES_VOLUME_MOUNT_PATH}")
            return

        # Scratch directory
        scratch_dir = Path(SCRATCH_VOLUME_MOUNT_PATH)
        scratch_dir.mkdir(parents=True, exist_ok=True)

        # ... (pipeline options, OCR language, number of threads - same as before)
        # ... (Make sure pipeline_options is defined here)

        if st.button("Convert Documents in Volume"):
            with st.spinner("Converting documents..."):
                start_time = time.time()
                success_count = 0
                fail_count = 0
                for file_path in kubernetes_volume_dir.rglob("*.pdf"):  # Recursive search for PDFs
                    if process_pdf(file_path, scratch_dir, pipeline_options):
                        success_count += 1
                    else:
                        fail_count += 1
                end_time = time.time() - start_time

            st.write(f"Conversion completed in {end_time:.2f} seconds.")
            st.write(f"Successfully converted {success_count} PDFs.")
            st.write(f"Failed to convert {fail_count} PDFs.")
            st.write(f"Files saved to {SCRATCH_VOLUME_MOUNT_PATH}")


    if __name__ == "__main__":
        main()
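For the serverless job variant, the same conversion loop can run without any Streamlit widgets. The sketch below is an outline under stated assumptions rather than the project's actual job: `convert_folder` and the `convert_one` callback are hypothetical names, and the callback stands in for a per-file function in the spirit of `process_pdf`, so no Docling call appears here.

```python
import time
from pathlib import Path
from typing import Callable


def convert_folder(
    source_dir: Path,
    convert_one: Callable[[Path], bool],
    pattern: str = "*.pdf",
) -> dict:
    """Run convert_one on every file matching pattern under source_dir, recursively."""
    start = time.perf_counter()
    succeeded, failed = [], []

    # sorted() gives a stable processing order, which keeps job logs comparable.
    for path in sorted(source_dir.rglob(pattern)):
        try:
            ok = convert_one(path)
        except Exception:
            ok = False  # One bad document should not stop the whole batch.
        (succeeded if ok else failed).append(path.name)

    return {
        "succeeded": succeeded,
        "failed": failed,
        "seconds": time.perf_counter() - start,
    }
```

In a job, `convert_one` could be a Streamlit-free variant of `process_pdf` with the scratch directory and pipeline options bound in advance, for example through `functools.partial`.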
And a Dockerfile to build an image.
    FROM python:3.11-slim-buster

    WORKDIR /app

    # Create a requirements.txt with docling and its dependencies
    COPY requirements.txt .
    RUN pip install -r requirements.txt

    COPY . .

    CMD ["streamlit", "run", "Docling_st.py"]
A YAML helper in case the Docling application is deployed inside a cluster later (for the time being it is a serverless test application).
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: docling-deployment
      namespace: files # Deploy to the same "files" namespace
    spec:
      replicas: 1 # Adjust as needed
      selector:
        matchLabels:
          app: docling-app
      template:
        metadata:
          labels:
            app: docling-app
        spec:
          containers:
            - name: docling-container
              image: docling-image:latest # Replace with your Docling Docker image
              ports:
                - containerPort: 8501 # Streamlit default port
              volumeMounts:
                - name: scratch-volume
                  mountPath: /app/scratch # Mount the scratch volume
                - name: uploads-volume # Mount the existing uploads volume
                  mountPath: /app/uploads # Or another suitable path
              resources:
                requests:
                  cpu: 200m # Adjust as needed
                  memory: 512Mi
                limits:
                  cpu: 1000m
                  memory: 1Gi
          volumes:
            - name: scratch-volume
              persistentVolumeClaim:
                claimName: docling-pvc # Create this PVC separately
            - name: uploads-volume # Use the existing uploads volume
              persistentVolumeClaim:
                claimName: my-fastapi-pvc # The existing PVC
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: docling-service
      namespace: files
    spec:
      selector:
        app: docling-app
      ports:
        - protocol: TCP
          port: 8501 # External port
          targetPort: 8501 # Container port
      type: LoadBalancer # Or ClusterIP if internal access is sufficient
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: docling-pvc
      namespace: files
    spec:
      accessModes: [ "ReadWriteOnce" ] # Or ReadWriteMany if needed
      resources:
        requests:
          storage: 1Gi # Adjust storage size as needed
Conclusion
The code samples provided here are building blocks for a web-based application that prepares a file repository/volume holding document types such as images, PDFs and Word files. Docling ingests these documents and converts them to Markdown files, which makes them ready for a generative AI application.
Again, this is not an end-to-end project, but portions of code to be enhanced, industrialized and deployed.
Thanks for reading 🤟
Useful links
- Docling: https://github.com/DS4SD/docling
- Docling documentation: https://ds4sd.github.io/docling/
Original article: Yet another document ingestion project with Docling and IBM Cloud Code Engine (serverless)