ART·VISION R. QUALLS
System · Live
An AI Docent · Capstone 2026

Art
Vision

A wearable computer-vision companion that turns any wall of paintings into a conversation.

Type
Hardware + ML Capstone
Role
Sole Designer & Engineer
Stack
Pi 5 · YOLOv8 · Flask
Status
Finished Capstone · Working Prototype

Museums are among the last physical places we visit to learn slowly. But the wall label — that flat rectangle of gray text — is a stubborn bottleneck. Art Vision is a pocket-sized device that sees the painting you're standing in front of, recognizes it on-device, and streams a richer, more human story to your phone.

Raspberry Pi 5 YOLOv8 Nano Arducam IMX708 Flask + JSON On-Device Inference Custom-Trained Model 10 Paintings · 0.966 mAP50-95
§ 01 — Overview

The wall label is
a bottleneck.

Art Vision is not a replacement for the museum — it is a second voice in your ear, whispering the gossip and mistakes and rivalries that the plaque never had room to print.

Most museum plaques are written for no one. They hedge between scholar and schoolchild and end up satisfying neither. Visitors want the story — the drama, the mistakes, the rivalries, the hidden symbols. The why, not just the what.

Art Vision is a wearable device, powered by a Raspberry Pi 5 and a custom-trained YOLOv8 model, that identifies paintings in real time and serves a richer, choose-your-own narrative to the visitor's phone: Style. Technique. The Artist's Life. What's Easy to Miss. The whole system runs on-device — no cloud, no tracking, no big-tech in the gallery.

[System diagram, rev. 2026.04: IMX708 (UVC) video in → RPi 5 · 8GB → HTTP on :8080 to the visitor's device; powered over USB-C PD 3.0.]
The Device

A pocket-sized docent that sees.

The final build is a Raspberry Pi 5 in a vented 3D-printed case, paired with an Arducam IMX708 USB camera clipped to the visitor's shirt or hung as a pendant. A PD 3.0 power bank keeps it fed. The whole rig runs a YOLOv8 model in a headless Flask server, broadcasting over local Wi-Fi.

  • Compute: Raspberry Pi 5 · 8GB
  • Sensor: Arducam IMX708 (UVC)
  • Model: YOLOv8 Nano · custom
  • Backend: Flask · JSON cache
  • Power: USB-C PD 3.0, 5V / 5A
  • Cloud: None. On-device only.
§ 02 — By the Numbers

A quiet brag.

Training Dataset
3242 img
Hand-collected, annotated, and augmented across 10 paintings.
mAP50-95
0.97
Validation mean average precision on the held-out set at epoch 25.
Inference Speed
~20 fps
On-device, no cloud. Scalable to 60fps with an AI HAT co-processor.
Detection Confidence
0.95+
Across test paintings in varied lighting and viewing angles.
§ 03 — Process

Four phases,
two pivots, one working device.

I
Hardware · Sensor

The Sensor Crisis.

I started with the Raspberry Pi Camera Module 3 and its delicate CSI ribbon cable. It worked beautifully on the desk and terribly on the body. The ribbon was short, fragile, and comically ill-suited to weaving through a T-shirt. Within a week I had pivoted to an Arducam IMX708 USB module.

The Module 3 was the right camera — for a product that lived on a tripod. Art Vision needed something that could survive a backpack.

The USB path meant slightly bulkier hardware, but standard v4l2 drivers, a replaceable USB-C cable, and a 3D-printed enclosure that could finally be designed around a normal port instead of a precious slot.

II
Hardware · Power

The Portability Wall.

The Pi 5 is a hungry chip. Running YOLO inference on top of a USB camera stream made it hungrier still. Standard power banks caused voltage sag; the CPU would throttle mid-detection and the whole thing would stutter. I tried three power supplies before landing on a PD 3.0 bank capable of a constant 5V / 5A.

This is the kind of problem no tutorial warns you about. It is also the kind of problem that is invisible when it's solved and catastrophic when it isn't.
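Voltage sag like this is easier to catch with the Pi firmware's own telemetry than by watching the stutter. A small sketch, assuming the stock `vcgencmd` tool is on the PATH; the bit meanings for `get_throttled` follow Raspberry Pi's documentation, and the helper names here are my own:

```python
import subprocess

# Bit meanings documented by Raspberry Pi for `vcgencmd get_throttled`.
FLAGS = {
    0: "under-voltage now",
    1: "ARM frequency capped now",
    2: "throttled now",
    3: "soft temperature limit now",
    16: "under-voltage occurred",
    17: "ARM frequency capping occurred",
    18: "throttling occurred",
    19: "soft temperature limit occurred",
}

def decode_throttled(hex_value: str) -> list[str]:
    """Turn the raw `throttled=0x...` value into readable flags."""
    bits = int(hex_value, 16)
    return [msg for bit, msg in FLAGS.items() if bits & (1 << bit)]

def check_pi_power() -> list[str]:
    """Query the firmware and return active or latched throttle conditions."""
    out = subprocess.run(["vcgencmd", "get_throttled"],
                         capture_output=True, text=True).stdout
    return decode_throttled(out.strip().split("=")[1])
```

A healthy PD 3.0 bank reads `0x0`; a sagging one latches the "occurred" bits even after the voltage recovers, which is exactly what makes the problem invisible until you look.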

III
ML · Dataset

A handmade dataset.

No off-the-shelf model knows what a Basquiat looks like in my living room. So I built the training set by hand: 3242 images across ten canonical paintings — Starry Night, The Gulf Stream, Mona Lisa, two Basquiats, and more — shot in every lighting condition I could manufacture. When the Roboflow web uploader choked on my batch, I wrote a Python script against their API to finish the job overnight.
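The overnight script itself isn't reproduced here, but the pattern is plain batching with retries. A hedged sketch: `upload_batch` and its injectable `upload` callable are illustrative names, with the callable standing in for the actual Roboflow SDK upload method:

```python
import time
from pathlib import Path
from typing import Callable

def upload_batch(image_dir: str, upload: Callable[[str], None],
                 max_retries: int = 3, delay: float = 1.0) -> dict:
    """Upload every image in a directory, retrying transient failures.

    `upload` stands in for the SDK call; this wrapper only handles
    batching, retries, and a simple linear backoff.
    """
    results = {"ok": [], "failed": []}
    for path in sorted(Path(image_dir).glob("*.jpg")):
        for attempt in range(max_retries):
            try:
                upload(str(path))
                results["ok"].append(path.name)
                break
            except Exception:
                time.sleep(delay * (attempt + 1))  # back off and try again
        else:
            results["failed"].append(path.name)  # exhausted all retries
    return results
```

Run against a folder of a few thousand images, a loop like this finishes the job unattended, which is all an overnight upload needs.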

Good datasets are not downloaded. They are cooked.

The YOLOv8 Nano model trained for 25 epochs on Colab's free T4. Validation mAP50-95 landed at 0.966. The training curves look like a textbook — which, for once, is a compliment.

IV
Software · Serving

A headless museum server.

The Pi runs a Flask server on port 8080, exposing a single /api/data endpoint. A background thread captures frames, runs inference at conf=0.5, and writes the most recent detection into a shared Python object. The visitor's phone — connected to the Pi's Wi-Fi — polls the endpoint every two seconds and re-renders the active painting.
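Stripped to its essentials, the serving side can be sketched as below. The route and port come from the description above; `current_detection` is the shared object the detection thread overwrites, and the exact structure here is illustrative:

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Written by the detection thread, read here. A single dict assignment
# is atomic in CPython, so this one-writer/one-reader handoff needs no lock.
current_detection = {"title": "No painting detected", "confidence": "0%"}

@app.route("/api/data")
def api_data():
    # The visitor's phone polls this every ~2 s and re-renders.
    return jsonify(current_detection)

def main() -> None:
    # Bind to all interfaces so phones on the Pi's Wi-Fi can reach it.
    app.run(host="0.0.0.0", port=8080)
```

Calling `main()` on the Pi starts the server; the phone only ever sees whatever detection was most recent when it happened to poll.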

The art content itself lives in a local art_data.json file. Each painting has seven fields: description, artist life, impact, inspirations, who they inspired, why it's famous, and what's easy to miss. This is where the project's voice lives.
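The text names the seven fields but not their JSON spelling, so the keys in this sketch are illustrative; only the idea of a local file with a safe fallback comes from the description above:

```python
import json
from pathlib import Path

# Illustrative entry shape: the seven content fields per painting,
# plus title/artist metadata. Key spellings are assumptions.
EXAMPLE_ENTRY = {
    "starry_night": {
        "title": "The Starry Night",
        "artist": "Vincent van Gogh",
        "description": "...",
        "artist_life": "...",
        "impact": "...",
        "inspirations": "...",
        "who_they_inspired": "...",
        "why_famous": "...",
        "easy_to_miss": "...",
    }
}

def load_art_data(path: str = "art_data.json") -> dict:
    """Load the local content file, falling back to an empty dict."""
    p = Path(path)
    if not p.exists():
        return {}
    return json.loads(p.read_text())
```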

§ 04 — Architecture

Camera → Model
JSON → Phone.

01 · Sensor: IMX708 USB · 30 FPS · cv2.VideoCapture
02 · Inference: YOLOv8 Nano · FinalModel.pt · 6.2 MB · class + conf
03 · Lookup: art_data.json · 7 fields per piece
04 · Server: Flask · :8080 · /api/data · GET every 2 s
05 · Client: Visitor UI · HTML · CSS · JS

A one-way data flow. Five stages. No cloud.
Most people don't want history when they read a description plaque. They want a story.
User Research · Round 1 Synthesis
§ 05 — The Core Loop

The detection thread,
in thirty lines.

# run_ai.py — the detection thread that feeds the Flask server
import cv2
from ultralytics import YOLO

# ART_DATABASE (loaded from art_data.json at startup) and
# current_detection are shared with the Flask thread.

def run_ai():
    global current_detection
    model = YOLO('best2.pt')
    cap = cv2.VideoCapture(0)

    while True:
        success, frame = cap.read()
        if not success:
            break  # camera disconnected; end the thread

        # small input size (imgsz=320) keeps on-device inference near real time
        results = model.predict(frame, conf=0.5, imgsz=320, verbose=False)

        if len(results[0].boxes) > 0:
            box   = results[0].boxes[0]
            label = model.names[int(box.cls[0])]
            conf  = float(box.conf[0])
            info  = ART_DATABASE.get(label, ART_DATABASE["default"])

            # hand off to the Flask thread; a single dict assignment is
            # atomic, so this one-writer handoff needs no lock
            current_detection = {
                "title":       info.get("title", label),
                "artist":      info.get("artist", "Unknown"),
                "description": info.get("description", "..."),
                "why_famous":  info.get("why_famous", "..."),
                "easy_to_miss": info.get("easy_to_miss", "..."),
                "confidence":  f"{conf:.2%}",
            }
        else:
            # copy the default entry so the shared template is never mutated
            current_detection = {**ART_DATABASE["default"], "confidence": "0%"}
§ 06 — The Collection

Ten paintings.
3242 captures.

§ 07 — User Testing

The humans weigh in.

"I like how the detected painting is in the background of the app. It felt like the artwork was still the main character."

HX Round 1 · Roommate

"I like how it gave me different options to specialize on, instead of one giant paragraph."

HX Round 1 · Partner

"It's very simple."

HX Round 1 · Mother

"Museums are more of a social thing than I realized. Being able to share and explain and listen is one of the best parts."

HX Round 2 · Synthesis

Three forms.
One clear winner.

Participants were shown three form factors: an eyewear-mounted version, a pendant worn at the chest, and a pocket device with a clip-on camera. Each was tested against a single question: does this form factor match the ideals of the project?

The pendant won on warmth — closest in shape to a museum badge. The clip-on camera won on practicality. The eyewear lost immediately on the "anti-big-tech" axis. The final prototype splits the difference: a small chest-worn enclosure with an external camera.

§ 08 — What I Learned

Three honest lessons.

Lesson · 01

Ship the pivot, not the plan.

I started with a CSI camera, a cloud API, and a dream. I ended with a USB sensor, a local Flask server, and something that actually worked. Every one of the good decisions on this project was a revision of an earlier, worse one. Planning mattered less than being willing to burn the plan.

Lesson · 02

Data is the real work.

Training a model is the glamorous bit. Curating 3242 images, fixing the Roboflow upload with an API script, re-shooting half the Basquiat set because a lampshade was in frame — that's where the project lived. Hand-built datasets are unsexy and irreplaceable.

Lesson · 03

Interfaces are opinions.

The moment I stopped asking "what information should I include?" and started asking "what does the visitor want to feel?" was the moment the UI got good. A museum plaque is not a contract. It is an opening line. Art Vision is my attempt to give that line back to the visitor.

The Next Exhibition

What's next.

Add an AI HAT to push inference to 60fps · Expand the dataset to 50 paintings · Design the final wearable enclosure · Partner with a small museum for a pilot installation.