<rss version="2.0">
<channel>
<title>Brandon Rohrer</title>
<link>https://www.brandonrohrer.com</link>
<description>Brandon Rohrer's blog</description>

            
  <item>
    <title>
    Python packaging with uv
    </title>
    <link>
    https://brandonrohrer.org/python_packaging.html
    </link>
    <pubDate>
    Fri, 17 Apr 2026 06:36:00 EDT
    </pubDate>
    <guid>
    https://brandonrohrer.org/python_packaging.html
    </guid>
    <description><![CDATA[


<p>
For a long time, every time I needed to create a new project or
Python package I copied an existing repository. It worked pretty well, but
never perfectly. It always felt like I was wearing someone else’s shoes.
And then when I went to make changes, I realized quickly how little I
understood about how projects, builds, and distributions work.
</p>

<p>
Here are some questions that I’ve had and some answers I've found.
They focus on the uv toolset of environment management and packaging.
This post doesn’t say anything about setuptools, even though setuptools,
setup.py, and setup.cfg
are still present in a lot of well-built projects, especially pre-2025 ones.
</p>

<p>
I expect this list of questions to grow and evolve over time as I learn.
My primary audience is a clueless future me, but I hope it helps you too.
</p>

<h2><a id="How-do-I-name-projects?-Packages?-Modules?-Repositories?"></a><a href="#How-do-I-name-projects?-Packages?-Modules?-Repositories?">How do I name projects? Packages? Modules? Repositories?</a></h2>

<p>
As the saying goes, naming things is one of the hardest problems in
computer science. This is particularly true when it comes to packaging.
There are a lot of different entities to be named, and it’s unclear sometimes
what names belong to which. There is a repository name, a top level directory
name, a project name, and a package name. All of these can be different.
To add to the confusion, they can get mixed up with module names and function
names as well.
</p>

<p>
The Python interpreter and build tools have no problem knowing whether
a particular name is supposed to reference the project or the package.
They determine this from context. For human brains, especially the ones
new to packaging,
this can be a lot to keep track of. To save yourself unnecessary hassle,
a good trick is to use the same name for all these things. A brief,
memorable, all-lowercase name is ideal. The one way in which these names
will differ is in how they handle multiword names. For project, repository,
and top level directory names, separate words with a hyphen, as in
<code>my-amazing-tool</code>. For Python
packages and modules separate multiple words with an underscore, for example
<code>my_amazing_tool</code>. This keeps
things consistent with the conventions of the various tools and communities.
</p>

<p>
But don’t worry too much if you feel the need to deviate from this.
Plenty of smart people have differing opinions and it’s a matter
of convention only.
The <a href="https://peps.python.org/pep-0008/#package-and-module-names">PEP 8 recommendation</a>
is to give packages single-word names, without underscores, but this can
be challenging to do in a readable way.  Everyone ignores this.
</p>

<p>
If you plan to distribute it publicly on <a href="https://pypi.org">PyPI</a>,
check it first to make sure the package name isn't already taken.
</p>

<h2><a id="What-are-wheels-and-sdists?"></a><a href="#What-are-wheels-and-sdists?">What are wheels and sdists?</a></h2>

<p>
An sdist is a <em>source distribution</em> and a wheel is a <em>binary distribution</em>.
I have no idea why it's called a wheel. Source is the code itself&mdash;<code>.py</code>
files and their supporting cast. It comes in a single <code>.tar.gz</code> archive
which has to be extracted, for example with <code>tar -xzf</code>, before it can be properly read.
<a href="https://packaging.python.org/en/latest/discussions/package-formats/#what-is-a-source-distribution">Detail on sdists here.</a>
</p>

<p>
The wheel is the compiled version of the source, containing
only the files needed to actually run the code. Because compilation
is platform specific, a wheel is tied to a particular operating system,
processor architecture, and Python version. A single project can have many
wheels if it's meant to be run on many platforms and architectures.
The big caveat here is that Python files don't get pre-built into binaries.
The local Python interpreter does that at runtime. So if it's a pure-Python
package, one wheel is usually sufficient for all platforms and Python versions.
<a href="https://packaging.python.org/en/latest/discussions/package-formats/#what-is-a-wheel">Detail on wheels here.</a>
</p>
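<p>
To make this concrete, the compatibility tags show up right in the wheel
filename. These names are made-up examples, not real releases, but the pattern
is name, version, Python tag, ABI tag, and platform tag:
</p>

<p>
<pre>
my_amazing_tool-0.1.0-py3-none-any.whl                        # pure Python: any OS, any architecture
my_amazing_tool-0.1.0-cp312-cp312-manylinux_2_17_x86_64.whl   # compiled extension: CPython 3.12 on x86_64 Linux
my_amazing_tool-0.1.0.tar.gz                                  # the sdist, for comparison
</pre>
</p>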

<h2><a id="What-are-build-tools-and-why-do-they-matter?"></a><a href="#What-are-build-tools-and-why-do-they-matter?">What are build tools and why do they matter?</a></h2>

<p>
Build tools do the work of taking the original files and the information
from <code>pyproject.toml</code> and using them as ingredients for building
the sdist and the wheel.
</p>

<p>
There are two parts to this, a build frontend and a build backend. For the
purposes of this post, the frontend is
<a href="https://docs.astral.sh/uv/concepts/projects/build/#using-uv-build">uv build</a>
it does some gathering and interpretation of the files and prepares them
for the next step. pip and build are other popular build frontends.
</p>

<p>
There are a few common build backend tools, including hatchling, setuptools,
and uv's own uv_build.
<a href="https://pydevtools.com/handbook/explanation/what-is-a-build-backend/#choosing-a-backend">Here's a short guide</a>
for choosing between them, but when in doubt hatchling is a good choice.
The backend needs to be called out in pyproject.toml
<a href="https://pydevtools.com/handbook/explanation/what-is-a-build-backend/#how-the-frontend-finds-the-backend">like this</a>
</p>

<p>
<pre>
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
</pre>
</p>

<p>
<a href="https://packaging.python.org/en/latest/tutorials/packaging-projects/#choosing-a-build-backend">Here are some examples</a>
for the other backends as well.
</p>

<h2>Why have a <code>src</code> directory?</h2>

<p>
There are two common patterns for organizing projects:
<a href="https://packaging.python.org/en/latest/discussions/src-layout-vs-flat-layout/#src-layout-vs-flat-layout">flat and src layouts</a>.
</p>

<p>
A flat layout is intuitive. The package code sits at the top level of the
project and is more straightforward to access. But because of how imports work,
it's easy to lose track of whether your other code, like tests, is
referencing the working copy of your code in the project or a previously
installed version of the package. It can result in maddening bugs.
</p>

<p>
A src layout alleviates this. Because the package sits one level lower,
it's not so easily reachable for direct import. Any import would need
an installed version of the package. Using an editable install makes
sure that your most recent changes to the code are what gets run.
</p>

<p>
The src layout gives the benefit of protecting us from ourselves
a bit more. And it comes at the cost of a slightly more complex file
structure and an extra step to ensure an editable install.
</p>
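<p>
For reference, here is the same hypothetical project sketched in each layout.
</p>

<p>
<pre>
# flat layout
my-amazing-tool
├── pyproject.toml
├── my_amazing_tool
│   ├── __init__.py
│   └── mymodule.py
└── tests
    └── test_mymodule.py

# src layout
my-amazing-tool
├── pyproject.toml
├── src
│   └── my_amazing_tool
│       ├── __init__.py
│       └── mymodule.py
└── tests
    └── test_mymodule.py
</pre>
</p>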

<h2><a id="How-do-I-make-a-package-visible-across-the-project?"></a><a href="#How-do-I-make-a-package-visible-across-the-project?">How do I make a package visible across the project?</a></h2>

<p>
To make a package visible from other locations in a project
that are outside the package directories, for instance in <code>tests/</code>,
the most reliable way is to do an editable install.
From within the top level directory of the project run
</p>

<p>
<code>uv pip install -e .</code>
</p>

<p>
Modules outside your package shouldn't need <code>__init__.py</code> files
in each directory. But now they will be able to <code>import mypackage</code>
and go to town.
</p>
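<p>
As a minimal sketch, assuming a package named <code>mypackage</code> that has
been installed editably as above, a test file might look like this.
</p>

<p>
<pre>
# tests/test_import.py
# Relies on the editable install above; no __init__.py needed in tests/.
import mypackage


def test_package_imports():
    # A placeholder check that the installed package is importable.
    assert mypackage is not None
</pre>
</p>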

<p>
Note that it's also totally valid to include <code>tests/</code> within the
package. In that case they will very much need their <code>__init__.py</code>
files. More detail in
<a href="https://docs.pytest.org/en/latest/explanation/goodpractices.html">pytest best practices</a>.
</p>

<h2><a id="How-do-I-add-other-file-types-to-the-package?-And-how-do-I-access-them-from-the-code?"></a><a href="#How-do-I-add-other-file-types-to-the-package?-And-how-do-I-access-them-from-the-code?">How do I add other file types to the package? And how do I access them from the code?</a></h2>

<p>
The easiest way is to include them under the package directory tree.
By default, hatchling includes in the sdist any non-Python files under
the <code>mypackage</code> directory that are not in
<code>.gitignore</code>. This behavior can be arbitrarily modified for both
types of build targets, sdists and wheels.
<a href="https://hatch.pypa.io/1.16/config/build/#file-selection">Examples here.</a>
They can be instructed to include files from outside the project as well.
</p>

<p>
From within the code, these files can be accessed by their absolute path.
The <code>__file__</code> attribute gives the absolute path of a given module's file.
Its directory can then be used to build the path to the data. For example, for this
structure
</p>

<p>
<pre>
myproject
├── pyproject.toml
└── src
    └── mypackage
        ├── __init__.py
        ├── mymodule.py
        └── data
            └── mydata.json
</pre>
</p>

<p>
within <code>mymodule</code>
</p>

<p>
<pre>import os<br>
# __file__ is the absolute path of mymodule.py; its directory holds data/.
mymodule_dir = os.path.dirname(os.path.abspath(__file__))
mydata_abspath = os.path.join(mymodule_dir, 'data')
mydata_absfilename = os.path.join(mydata_abspath, 'mydata.json')
with open(mydata_absfilename, 'rt') as f:
    mydata = f.read()
</pre>
</p>

<h2>What goes into <code>pyproject.toml</code>?</h2>

<p>
While <code>pyproject.toml</code> files are powerful and flexible and can be quite long,
<a href="https://www.pyopensci.org/python-package-guide/package-structure-code/pyproject-toml-python-package-metadata.html">a minimal pyproject.toml</a>
contains just some basic project and build system information, like this.
</p>

<p>
<pre>
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"<br>
[project]
name = "myproject"
version = "0.1.0"
</pre>
</p>

<p>
There are some other
<a href="https://www.pyopensci.org/python-package-guide/package-structure-code/pyproject-toml-python-package-metadata.html#optional-fields-to-include-in-the-project-table">commonly used fields</a>
for the <code>project</code> table, including description, license, authors, and
keywords. The <code>dependencies</code> field lists other packages that yours depends on.
The classifiers field gives a set of standard tags that can help humans
and software tools alike make better use of your package.
<a href="https://pypi.org/classifiers/">The full list of classifiers</a> is lengthy,
but some especially helpful ones are
</p>

<ul>
<li> Development Status</li>
<li> Intended Audience</li>
<li> Topic</li>
<li> Programming Language</li>
</ul>

<h2><a id="Resources"></a><a href="#Resources">Resources</a></h2>

<p>
These are the references that I find most useful when answering these
questions.
</p>

<ul>
<li> <a href="https://packaging.python.org/en/latest/">python.org packaging</a></li>
<li> <a href="https://docs.astral.sh/uv/concepts/projects/">uv project configuration</a></li>
<li> <a href="https://pydevtools.com/handbook/explanation/what-is-a-build-frontend/">build frontends</a></li>
<li> <a href="https://pydevtools.com/handbook/explanation/what-is-a-build-backend/">build backends</a></li>
<li> <a href="https://hatch.pypa.io/1.16/config/build/">hatch configuration</a></li>
<li> <a href="https://www.pyopensci.org/python-package-guide/package-structure-code/pyproject-toml-python-package-metadata.html">pyproject.toml configuration</a></li>
</ul>
    ]]></description>
  </item>

  <item>
    <title>
    Artisanal Language Models: Define a task and write evals
    </title>
    <link>
    https://brandonrohrer.org/alms_task.html
    </link>
    <pubDate>
    Tue, 14 Apr 2026 08:36:00 EDT
    </pubDate>
    <guid>
    https://brandonrohrer.org/alms_task.html
    </guid>
    <description><![CDATA[
<p>
A defining feature of an ALM is that it is purpose-built for a well-defined
task. So far the task has only been described in general terms: proofreading
English prose of a novel draft.
</p>

<p>
It’s time to get more specific about what this ALM will do, what
its inputs and outputs will be, so that I can actually start building it.
It’s also time to define some tests and performance measures.
I’ll need them when I have something running and I make a change;
I’ll need some way to measure whether it got better.
</p>

<h2><a id="Inputs"></a><a href="#Inputs">Inputs</a></h2>

<p>
After the ALM has been trained, I’ll want to feed it text to proofread.
To start with I’ll plan to do this through the simplest and clunkiest way
I can think of: passing it the path to a text file containing the text
to be proofread. In an actual professional product built for users,
this is probably not ideal, but it’s a nice generalizable front door
that slicker user interfaces can be built around in the future.
</p>

<h2><a id="Outputs"></a><a href="#Outputs">Outputs</a></h2>

<p>
After the proofreader has done its job and identified segments that might
need correction, those segments will be reported as positions in
the input text. Specifically, when the text file is read in as a string,
the index of starting character position and ending character will be used
to tag segments for inspection and correction.
Position of the suspected error will be reported as a pair of indices.
This collection of start/finish pairs will be the output of the proofreader.
</p>

<p>
There are a lot of other things that could be enhanced about this to provide
a good user experience, and I leave the door open to add those later.
For instance, the text file could be modified to include special characters
marking the suspect segments. Or a fancier graphical UI could simply
highlight the potential errors or underline them, as is common in
word processors. An even fancier extension could propose corrections
and offer the user a single keystroke way to select from a number
of suggestions. But all of these could be built on top of a pair of
start and end markers for each proofreading note.
</p>
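<p>
Put in code terms, the contract looks something like this minimal sketch.
The function name and everything inside it are placeholders; only the shape
of the input and output matters here.
</p>

<p>
<pre>
def proofread(text_path):
    """Return a list of (first_char, last_char) index pairs,
    one pair for each segment suspected of containing an error."""
    with open(text_path, 'rt') as f:
        text = f.read()
    flagged_segments = []
    # ... the model's detection logic will go here ...
    return flagged_segments
</pre>
</p>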

<h2><a id="Evaluation"></a><a href="#Evaluation">Evaluation</a></h2>

<p>
I’ll need a way to evaluate the progress. If I try to enhance my
proofreading model, how will I know if it worked? The variety
of all possible text to be proofread is so large, it’s impossible to test
every variation exhaustively. The best I can hope to do is come up with
a representative sample.
</p>

<p>
This falls somewhere in between the traditional software engineering practice
of testing, where the notion of right and wrong answers and what a function
must do is fairly clean cut, and benchmarking, which is a consensus-driven
measure for comparing solutions in a broadly recognized way.
It’s what has come to be known in LLM development as evaluations or,
more affectionately, <strong>evals</strong>.
Evals are a reasonable sample of the space in which a language model
is expected to work. They lack the recognition and respect of
a full-blown benchmark, and also lack the rigor and confidence of carefully
designed tests. But despite having the worst of both worlds,
evals operate in a space that we cannot ignore, and for which there is
no better solution that I know of.
</p>

<p>
In practice, evals are organic. They grow to cover new use cases
and new failure modes during the development process. But it’s helpful
to think through a reasonable initial set. For proofreading there
are several classes of errors that are important to cover.
</p>

<ul>
<li> <strong>Spelling</strong>. Febuary. Febrewary. Februry. Fabuwary.</li>
<li> <strong>Word choice</strong>. Catching when it should be "imply" and when it should be "infer".
To/two/too. There/their/they're.</li>
<li> <strong>Punctuation</strong>. Appropriate sentence termination. Comma usage.
Quotation mark usage.</li>
<li> <strong>White space</strong>. Extra spaces. Spaces around punctuation. Weird indents.</li>
<li> <strong>Capitalization</strong>.</li>
<li> <strong>Grammar</strong>. Verb tense. Pronoun-antecedent agreement.
Preposition choice.</li>
</ul>

<p>
These aren’t exhaustive, but there’s no need for evals to be exhaustive
in order to be useful. I will almost certainly add more later as I
discover new categories that aren’t getting picked up well.
But they are a good place to start.
</p>

<h2><a id="Creating-evals-for-each-category"></a><a href="#Creating-evals-for-each-category">Creating evals for each category</a></h2>

<p>
In practice, to test how well a given language model performs in each of
these areas, I’ll need to create an evaluation data set. For each of the
areas above, I’ll pull five paragraphs arbitrarily from an evaluation text
(Frankenstein by Mary Wollstonecraft (Godwin) Shelley) and throw 10 errors into
the text of a given type. Having five paragraphs full of spelling errors
gives a total of 50 spelling errors to detect. Each paragraph will come
with its own answer key, the beginning and end of each word or phrase
containing the error. After the proofreading model processes the paragraph,
the errors it detects will be compared against the ground truth.
</p>

<ul>
<li> A ground truth error that is overlapped by at least one model-detected error
is considered detected (<strong>true positive</strong>). This is not quite the same thing as</li>
<li> A model-detected error that overlaps at least
one ground truth error. This is considered an accurate detection, but there
may be several of these per ground truth error. I can't use this as the
true positive count because it could result in inflated counts.</li>
<li> A model-detected
error that doesn’t overlap a ground truth error will be considered
a <strong>false positive</strong>.</li>
<li> A ground truth error that is not overlapped by at least
one model detected error will be considered a <strong>false negative</strong>.</li>
</ul>

<p>
<img alt="Examples of true positives, false positives, and false negatives.
" src="https://raw.githubusercontent.com/brohrer/blog_images/refs/heads/main/alms_task/errors_pos_neg.png">
</p>

<p>
<strong>Recall</strong> will be defined as the total number of model-detected ground truth
errors (true positives) over the total number of ground truth errors
(true positives plus false negatives).
</p>

<p>
<strong>Precision</strong> will be the total number of model-detected ground truth errors
(true positives) divided by the total number of true positives
plus false positives. When there are no true positives or false positives,
precision will be undefined.
</p>
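<p>
Here's a minimal sketch of how those counts could be computed, assuming both
ground truth errors and model detections are (first_char, last_char) pairs
with inclusive endpoints.
</p>

<p>
<pre>
def overlaps(a, b):
    # Two inclusive index ranges overlap if neither ends before the other begins.
    return a[1] >= b[0] and b[1] >= a[0]


def score(ground_truth, detections):
    # True positives: ground truth errors overlapped by at least one detection.
    true_pos = sum(
        1 for gt in ground_truth
        if any(overlaps(gt, det) for det in detections))
    false_neg = len(ground_truth) - true_pos
    # False positives: detections that overlap no ground truth error.
    false_pos = sum(
        1 for det in detections
        if not any(overlaps(det, gt) for gt in ground_truth))
    recall = true_pos / (true_pos + false_neg) if ground_truth else None
    precision = (true_pos / (true_pos + false_pos)
                 if (true_pos + false_pos) else None)
    return recall, precision
</pre>
</p>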

<h2><a id="Creating-the-evaluation-dataset"></a><a href="#Creating-the-evaluation-dataset">Creating the evaluation dataset</a></h2>

<p>
Putting this into computer-readable form required creating a Python script
with some error-ridden example text and the locations of the errors.
I created
<a href="https://codeberg.org/brohrer/alms/src/commit/57294820c5035a0be140d54e7f07b6a470c31c7e/data/eval/spelling.py">the initial set of evals for spelling errors</a>,
but held off on creating evals for the other error types (punctuation,
capitalization, etc.) for now.
By the time you read this, there is a good chance it will already have evolved.
If that's the case, you can find
<a href="https://codeberg.org/brohrer/alms/src/branch/main/data/eval/spelling.py">the latest version here</a>.
The evaluation dataset is a list of dicts, each of which contains a paragraph
of text taken from a different chapter of Frankenstein that I modified to
contain ten spelling mistakes. Each entry also contains a list of ten
dicts, one per mistake, each containing
</p>

<ul>
<li> the mis-spelled word</li>
<li> the index of its first and last character</li>
<li> the corrected version of the word</li>
</ul>

<p>
Here's a snippet of the result
</p>

<p>
<pre>
evaluation_dataset = [
    {
        'source': 'Frankenstein',
        'chapter': 'L1',
        'paragraph': """
I am already far north of London, and as I walk in the streets of
Petersburgh, I feel a cold northern breeze play upon my cheeks, which
braces my nerves and fills me with delight. Do you understand this
feeling? This breeze, which has travelled from the regions towards
which I am advancing, gives me a foretaste of those icey climes.
Inspirited by this wind of promise, my daydreams become more fervent
and vivid. I try in vain to be persuaded that the pole is the seat of
frost and desolation; it ever presents itself to my imagniation as the
region of beauty and delight. There, Margaret, the sun is for ever
visible, its broad disk just skirting the horizon and diffusing a
perpettual splendour. There—for with your leave, my sister, I will put
...
requisite; or by ascertaining the secret of the magnet, which, if at
all possible, can only be effected by an undertaking such as mine.
                """,
        'mistakes': [
            {
                'first_char': 322,
                'last_char': 325,
                'wrong_text': 'icey',
                'correct_text': 'icy',
            },
            {
                'first_char': 526,
                'last_char': 536,
                'wrong_text': 'imagniation',
                'correct_text': 'imagination',
            },
            {
                'first_char': 678,
                'last_char': 687,
                'wrong_text': 'perpettual',
                'correct_text': 'perpetual',
            },
            ...
        ]
    },
]
</pre>
</p>
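<p>
A quick way to sanity-check entries like these is to confirm that each
recorded index range actually selects the misspelled word. A minimal sketch,
assuming <code>first_char</code> and <code>last_char</code> are inclusive
indices into the paragraph string:
</p>

<p>
<pre>
for entry in evaluation_dataset:
    paragraph = entry['paragraph']
    for mistake in entry['mistakes']:
        # Slice end is exclusive, so add 1 to the inclusive last_char.
        selected = paragraph[mistake['first_char']:mistake['last_char'] + 1]
        assert selected == mistake['wrong_text'], (selected, mistake['wrong_text'])
</pre>
</p>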

<h2><a id="End-to-end-prototyping"></a><a href="#End-to-end-prototyping">End-to-end prototyping</a></h2>

<p>
It’s fair to ask why I don’t go ahead and complete the evaluation datasets
for the other types of errors. It seems logical to completely finish this
step before moving on to the next. We can imagine this as a breadth-first
solution to the problem, thoroughly working through one stage of development,
putting some polish on it before moving to the next. When this work is
spread across teams, this is called waterfall style development. One team
completes a whole stage of the project like design or backend support
before passing it on to the next.
</p>

<p>
The alternative to this is a depth-first development strategy:
building a bare-bones end-to-end solution and then adding breadth and
sophistication to it in subsequent passes. Starting with an end-to-end
prototype means leaving a lot of things incomplete in the first pass.
It means creating something that you would be embarrassed to show to
your friends. If you’re working across multiple teams, it means a whole lot
more communication up front.
</p>

<p>
In theory, both of these approaches are valid and will produce a good result
in similar timeframes. But in practice, they don’t. The waterfall approach
assumes that all of the work done at each stage gets to remain in its
final form. In fact, every additional stage teaches us things we didn’t
know about what needed to come before. This requires a lot of rework on
stages that we had previously thought were complete. In an end-to-end
prototyping approach this rework happens quickly. The whole project has
a lot less momentum and can pivot more gracefully. It is more agile.
</p>

<p>
This lesson can take a long time to learn, and in some cases, it is in
managers' interest to ignore it, depending on the incentives of
the organization. But since I am all of the engineering teams and all
of the levels of management for this project I get to decide:
We’re going to start with a lightweight end-to-end prototype.
</p>

<p>
So now that the spelling evals are done, onto the next stage&mdash;building a
model to detect misspelled words.
</p>

    ]]></description>
  </item>

  <item>
    <title>
    Blog Highlights
    </title>
    <link>
    https://brandonrohrer.org/blog.html#highlights
    </link>
    <pubDate>
    Sun, 12 Apr 2026 06:36:00 EDT
    </pubDate>
    <guid>
    https://brandonrohrer.org/blog.html#highlights
    </guid>
    <description><![CDATA[
        <p>
          I revamped my blog to include a highlights reel.
          It was a big unfriendly wall of links.
          Now it starts with a small unfriendly wall of links.
        </p>
        <p>
          Tutorials, projects, code, and thoughts
          collected into topic groups I've generously called Book Projects.
        </p>

        <h3>Highlights</h3>
        <p>
          <strong>New Releases</strong>
        </p>
        <ul>
          <li>
            <a href="ds_roles.html">
              Being a Staff+ Data Scientist in 2026
            </a>
          </li>
          <li>
            <a href="alms_tokenizer.html">
              Build a custom tokenizer
            </a>
          </li>
          <li>
            <a href="alms.html">
              Artisanal Language Models
            </a>
          </li>
        </ul>

        <p>
          <strong>Most Visited</strong>
        </p>
        <ul>
          <li>
            <a href="transformers.html">
              Transformers from scratch
            </a>
          </li>
          <li>
            <a href="ssh_at_home.html">
              Setting up an ssh server</a>
          </li>
          <li>
            <a href="convert_rgb_to_grayscale.html">
              How to convert RGB color images to grayscale
            </a>
          </li>
          <li>
            <a href="convolution_one_d.html">
              Convolution in one dimension
            </a>
          </li>
        </ul>

        <p>
          <strong>Most Loved</strong>
        </p>
        <ul>
          <li>
            <a href="professional_path.html">
              Choose your professional path
            </a>
          </li>
          <li>
            <a href="org_response.html">
              What to do when a leader does something wrong
            </a>
          </li>
          <li>
            <a href="microsuffering.html">
              On microsuffering
            </a>
          </li>
        </ul>

        <p>
          <strong>I'm most proud of</strong>
        </p>
        <ul>
          <li>
            <a href="pendulum.html">
               Solving an easy reinforcement learning problem on hard mode:
               Inverting a pendulum
            </a>
          </li>
          <li>
            <a href="cartographer">
              Naive Cartographer: A Markov Decision Process Learner
            </a>
          </li>
          <li>
            <a href="ziptie">
              Ziptie: Learning Useful Features
            </a>
          </li>
        </ul>
    ]]></description>
  </item>



  <item>
    <title>
Being a Staff+ Data Scientist in 2026
    </title>
    <link>
    https://brandonrohrer.org/ds_roles.html
    </link>
    <pubDate>
    Thu, 09 Apr 2026 08:36:00 EDT
    </pubDate>
    <guid>
    https://brandonrohrer.org/ds_roles.html
    </guid>
    <description><![CDATA[

<p>
I became a data scientist in 2013 when the title was young.
It was so new that most companies had no idea what a data scientist should
be doing, only that they desperately needed one or they would be left behind.
Sound familiar?
</p>

<p>
I've tried to survey the job description of data science a couple of times with
varying degrees of success, most recently
to go with
<a href="https://github.com/brohrer/academic_advisory">some informal recommendations</a>
for creating data science degree programs.
Together with a group of colleagues we tried to summarize
<a href="https://github.com/brohrer/academic_advisory/blob/main/what_DS_do.md">what data scientists do</a>
and <a href="https://brandonrohrer.com/data_science_archetypes.html">the data science subtypes of maker, oracle, detective, generalist</a>.
But in the face of changing expectations this doesn't feel like enough
anymore. It's time for a refresh.
</p>

<h2><a id="A-brief-and-biased-history-of-the-Data-Scientist-role"></a><a href="#A-brief-and-biased-history-of-the-Data-Scientist-role">A brief and biased history of the Data Scientist role</a></h2>

<h3><a id="In-the-beginning..."></a><a href="#In-the-beginning...">In the beginning...</a></h3>

<p>
The field of data science was named in 1997,
and the discipline has existed by other names
for a very long time. After all,
people have been answering questions using data for thousands of years.
</p>

<p>
When data science first got huge, organizations expected
data scientists to spin straw into gold&mdash;to transform
unorganized data archives into profit. Big Data, it was believed,
held inherent value, which only needed to be coaxed into cash form.
This rarely panned out, so the
approach evolved into a) data scientists produce "insights" and then
b) "insights" generate profit. This also proved elusive in the end.
Eventually data scientists settled into various niches involving
<em>answering questions using data</em>, and some companies decided they didn't
need as many data scientists as they had originally thought.
</p>

<h3><a id="The-Neural-Network-era"></a><a href="#The-Neural-Network-era">The Neural Network era</a></h3>

<p>
Then came the Neural Network revolution, where the machine learning
hammer of choice became the convolutional neural network and many
problems got recast as image recognition problems. The software
engineering, data engineering, and operations engineering that production
CNNs required merited the job title of Machine Learning Engineer.
Image recognition became synonymous with "modern machine learning".
A data scientist's job description got blurry. Did it include ML
Engineering as a subset? Should a good data scientist candidate have
CNN experience? Every team took their own stance. No consensus emerged.
</p>

<h3><a id="The-age-of-Large-Language-Models"></a><a href="#The-age-of-Large-Language-Models">The age of Large Language Models</a></h3>

<p>
Then the spotlight abruptly shifted to Transformers and Large Language Models.
With their massive scale, the engineering and operations requirements
increased yet again. Only a small handful of enterprises are even
capable of training such a model, and they accomplish this only with
an army of engineers and an unfathomable amount of specialized computing
power. LLMs are effectively black boxes. Their structure is known but their
vast collection of parameters makes their behavior unpredictable and
inexplicable. Using them is a matter of API calls, rather than
training and evaluation. Most importantly, they are generative, rather than inductive. They
don't actually answer questions; they create answer-looking things that
are correct often enough to lull us into complacency.
</p>

<p>
LLMs are far enough removed from a core data scientist's skillset that
their care and feeding isn't part of the data science job description.
But they can still have a big effect on a data scientist's day.
(More on that below.)
</p>

<h3><a id="The-next-Big-Thing"></a><a href="#The-next-Big-Thing">The next Big Thing</a></h3>

<p>
The next hot new trend has not yet emerged, but if the historical 5 year cycle
holds true, it's due any day now.
</p>

<h2><a id="What-is-data-science?"></a><a href="#What-is-data-science?">What is "data science"?</a></h2>

<p>
I define data science as "answering questions using data".
Also note that I am referring to lowercase-d data science here.
This can include job titles of Data Scientist, Data Analyst,
Quantitative Analyst, Data Engineer, and others.
I'm not including data-fueled features like product recommendations
or travel time estimates, which are typically the domain of
Machine Learning Engineers due to their scale and latency requirements.
These are typically dominated by a separate set of constraints, tools,
and skills (although there is plenty of overlap).
</p>

<p>
The questions data scientists get to wrestle with are varied, and they
map closely to a company's org structure. Here are some greatest hits.
</p>

<h4><a id="Product"></a><a href="#Product">Product</a></h4>

<ul>
<li> Personalization: Which one should I show you?</li>
<li> Tip/donation recommendation: What suggestions should I give someone for how
much to give?</li>
<li> Product experience: Which pages do users visit? What buttons do they click?
What features do they use? What does this tell us about how we can improve
their experience?</li>
<li> Demand forecasting: How many people will buy my product next year?</li>
</ul>

<h4><a id="Operations"></a><a href="#Operations">Operations</a></h4>

<ul>
<li> Optimization: Which plan is the best? How can I minimize inventory?</li>
</ul>

<h4><a id="Marketing"></a><a href="#Marketing">Marketing</a></h4>

<ul>
<li> Price optimization: How much should I charge this customer for this thing or
service?</li>
<li> Marketing Mix Modeling: What is the return on investment for each additional
dollar spent in each of my marketing channels?</li>
</ul>

<h4><a id="Finance"></a><a href="#Finance">Finance</a></h4>

<ul>
<li> Forecasting: What will our revenue and expenses be next quarter?</li>
</ul>

<h4><a id="All-organizations"></a><a href="#All-organizations">All organizations</a></h4>

<ul>
<li> Decision support: Which choice should I make?</li>
<li> Experimentation: Which version is better?</li>
</ul>

<h2><a id="What-about-staff+-data-scientists?"></a><a href="#What-about-staff+-data-scientists?">What about staff+ data scientists?</a></h2>

<p>
Well-defined technical problems, including all those listed above, are
an excellent fit for the skills of a new data scientist or one with 3-5
years of experience. These are quite challenging, but they are all
challenges you can train for and practice on.
They have fairly clear demands and success criteria.
</p>

<p>
Just as in every other part of technology, the very hardest part is people.
In data science, this takes the form of understanding what people want,
setting their expectations, and coordinating misaligned or competing incentives.
Familiarity with sophisticated tools and approaches is definitely valuable,
but applying them is often something that a data scientist with a couple of years
of experience can do an excellent job on. As a staff+ you can still expect to tackle some
of the most challenging or time-sensitive technical tasks, but those
will probably not be the toughest part of your job.
It falls to staff+ data folks to navigate
the uncharted and shifting terrain of stakeholder management, cross-team
coordination, and communication.
(I'm using "staff+" to refer to staff level and higher, typically
folks who have been at it 5 to 7 years or more, although the actual time
in the role varies widely.)
</p>

<p>
Most of these are struggles that data scientists have faced since
the beginning. Often, one of the biggest contributions that a staff+
data scientist can make is to be a bridge between data science work
and the rest of the org&mdash;including engineers, marketing, finance, product, and
C-suite officers of all stripes. Staff+ data scientists are expected
to take the ambiguity out of the message, for instance translating what a
probability distribution means for reporting quarterly performance.
</p>

<p>
Here are the stickiest recurring topics I've been hearing about.
</p>

<h2><a id="When-stakeholders-prefer-a-cheap,-fast-wrong-answer-to-a-good-one"></a><a href="#When-stakeholders-prefer-a-cheap,-fast-wrong-answer-to-a-good-one">When stakeholders prefer a cheap, fast wrong answer to a good one</a></h2>

<p>
The biggest impact of AI (large language model) assistants on data science
is the idea that anyone can query data with natural language Q-and-A.
Sadly the reality of such systems is that they are trained on data
that doesn't share the same set of quirks that your
org is working with. They produce confident, plausible answers 100% of the
time, but those answers are accurate only 70% of the time, and the problem is that
you can't know whether a particular answer is in the wrong 30% until you
dive in and re-create the analysis yourself.
</p>

<p>
There is a school of thought that being confident and fast is better than
being cautious and correct.
It has bled over from strategic leadership (where
ambiguity is ubiquitous and one of the greatest risks is indecision)
to analysis and engineering (where incaution can lead to loss of limb,
life, and cash). And in most large organizations individual stakeholders
don't get to feel the effects of being wrong. Those usually take time to
materialize. So they can prioritize being fast and confident, which
their AI-assisted analytics queries are all too happy to help them out with.
</p>

<p>
In almost every case, the overriding concern is not accuracy or rigor,
but rather someone rationally pursuing what is best for them and their team.
One of the hardest things a staff+ data scientist will ever have to do
is build a mental model of these incentives and chart a course that
successfully splits the difference
<a href="https://en.wikipedia.org/wiki/Between_Scylla_and_Charybdis">between Scylla and Charybdis</a>.
</p>

<h2><a id="When-stakeholders-undervalue-the-skill-and-underestimate-the-time-required"></a><a href="#When-stakeholders-undervalue-the-skill-and-underestimate-the-time-required">When stakeholders undervalue the skill and underestimate the time required</a></h2>

<p>
A closely related trend is a common assumption that data analysis has
somehow gotten easier and faster, so much so that it is disposable.
Randy Au calls this
<a href="https://www.counting-stuff.com/data-work-in-the-fast-fashion-code-era/">"data work in the fast fashion code era"</a>.
It's no big deal to extract nuanced insights from your collected data,
just feed it in to NotebookLM and ask, right? You should be able to have
something by this afternoon right? Not the full analysis of course, but
"rough numbers". Right?
</p>

<p>
I can't even come up with a rough number of the times I've had the conversation
that "rough numbers" are very rough indeed. Not just off by a few percent,
but maybe in the completely wrong direction. And there's no way to know for
sure until you go back and do the careful numbers.
</p>

<p>
It doesn't help that questions that seem easy are actually hard to answer well.
What caused this blip on this graph? Why didn't this feature boost engagement
metrics? It seems natural that there should be simple answers to these
questions, and obvious that someone familiar with the data should be able to
pull them out quickly. So when a data scientist starts talking about the
philosophical foundations of causality, and questioning whether we can really
know whether anything causes anything, you can literally feel the eye-rolling
of the product leader even if their camera is off.
</p>

<h2><a id="I-want-it-now"></a><a href="#I-want-it-now">I want it now</a></h2>

<p>
In a separate but related issue, stakeholders are sometimes reluctant to invest
the time required to get high quality results. Bullish demands for unrealistic
timelines have somehow become confused with strong leadership.
Conveying the return on that time investment is a recurring challenge.
How do you communicate the importance of carefully gathered data?
the return on a carefully run experiment? the cost of accurately attributing
a metric shift to a product change?
How do you advocate for six-month-plus time investments in
a company that never looks more than three months ahead?
And heaven help you if you are trying to make a case for improving
the robustness of your pipelines or pre-emptively cleaning your data.
</p>

<p>
Learning things from data can be expensive. A carefully constructed
experiment takes time to plan and execute, particularly when data volumes
are low. There is a certain philosophy amongst some leaders that everything
should be experimented on, and that an experiment with a positive outcome
should be required before any change is made, no matter how small.
Communicating the opportunity cost of running an experiment is a regularly
occurring challenge. Leaders also may be seeking to proactively defend
themselves against challenges to a given decision by having receipts,
a successful experiment to point back to if their judgment is questioned.
</p>

<h2><a id="Real-time-dashboards"></a><a href="#Real-time-dashboards">Real-time dashboards</a></h2>

<p>
One particular example of "I want it now" is common enough to call out on its
own: The up-to-the-minute data dashboard. The sense of power it gives is
intoxicating, so it's no surprise that it is such a commonly requested
data product.
</p>

<p>
How do you communicate the cost of real time data availability?
On the surface, it seems like a reasonable request.  A leader looking at
a dashboard is like a pilot in the cockpit of a fighter jet.
They can see their instruments, look out the windows, and use all
the information at their fingertips to make quick decisions and save the day.
Of course they would want that information to be real time.
If it were delayed, then they might miss critical opportunities
and get shot down.
</p>

<p>
But an analytics dashboard is different than a fighter cockpit in more
than one way. The world it’s representing doesn’t meaningfully change
from one second to the next unless you’re doing high-speed trading.
In most cases, it barely changes from one day to the next. When that’s
the case, real time updates feel useful, but don’t deliver any
actionable information.
</p>

<p>
More importantly, the decisions that get made based on those dashboards
don’t get made minute-to-minute or even hour-to-hour. They are
typically decisions that factor into quarterly planning
or sprint planning&mdash;decisions that occur every few months or weeks.
Maybe every few days. Because of that, having the dashboard update more
frequently does not drive better decisions.
</p>

<p>
The engineering effort required to go from nightly updates to few-second
latency is considerable. Nightly updates, or even hourly, can be done by
some DAG-based pipeline orchestrator. It operates on tables and produces tables,
which are easy to read and write to in code. Real time updates involve
using stream technology, like Kafka or Flink. These are amazing when you
need them, but they have many more moving parts, more things that can break,
more things that you have to keep an eye on, more things that can cause
the pipeline to go down and the dashboard to get wonky and
require laborious backfills and carefully worded responses to frustrated
questions from leaders about why their nerve center dashboard suite
has gone down. To operate at the same reliability, it might require
3 to 5 times the effort and dollars.
</p>

<p>
This is a conversation that most staff+ data scientists end up having
at least once in their careers. And they usually lose.
</p>

<h2><a id="The-perrenial-promise-of-self-serve"></a><a href="#The-perrenial-promise-of-self-serve">The perrenial promise of self-serve</a></h2>

<p>
Every data science organization I've worked in has gone through this cycle:
</p>

<ol>
<li> Data scientists generate useful results</li>
<li> Stakeholders find them valuable</li>
<li> Stakeholders ask for more such results, with increasing frequency</li>
<li> Data scientists get tired of running similar queries over and over</li>
<li> A "self-serve analytics" function is proposed</li>
</ol>

<p>
I've never seen this approach solve the original problem of getting stakeholders
all the information they need without burdening the data scientists.
One of these things happens instead.
</p>

<ul>
<li> A basic self-serve system is fielded, which inevitably leads to follow-up
questions outside of its scope. Data scientists are answering more questions than ever.</li>
<li> A complex self-serve system is fielded, which requires stakeholders to
learn a querying language, like SQL or a simplified version of it.
They don't, and data scientists remain in the role of human user interface.</li>
<li> Data scientists get deep into building an intuitive, highly capable
self-serve system. The project scope is large and occupies all of their
attention and is never quite finished.
They aren't available to field query questions of any sort.</li>
</ul>

<h2><a id="Communicating-uncertainty"></a><a href="#Communicating-uncertainty">Communicating uncertainty</a></h2>

<p>
Statistics, the native language of the data scientist, is all about
distributions. But decisions get made based on concrete values.
A business decision maker may have a rule of thumb in their head like
“If the cost is less than three dollars, buy it, otherwise pass”.
So they ask a data scientist how much it will cost.
</p>

<p>
DS: About three and a half dollars.
</p>

<p>
BDM: What do you mean "about"? What will the actual price be?
</p>

<p>
DS: Between two and five dollars.
</p>

<p>
BDM: That’s a huge range. But you’re saying it definitely won’t be less
than two dollars? And definitely not more than five?
</p>

<p>
DS: Well, no, it might be less than two or more than five.
But it probably won’t.
</p>

<p>
BDM: <em> Reaches for magic eight ball </em>
</p>

<p>
Translating from the fuzzy smear of a probability distribution
to concrete values to support decision making is one of the hardest things
a data scientist has to do. The two representations are fundamentally
mismatched, and the mental models required to reason about them are
nearly irreconcilable.
</p>

<p>
If your audience is familiar with gambling, this gives some useful
footholds like over-under and  odds ratios. Even better if they are
familiar with rolling 20-sided dice. You can also try using percentages,
statistical significance, confidence intervals, and upper/lower bounds,
and see which you have the greatest success with.
</p>

<p>
The more senior you get as a data scientist, the more of these conversations
you end up having, talking with people who haven’t spent years
building mental models of distributions.
This conversation is a recurring one. Some version of it occurs with every
analysis, and every decision. Bridging this gap well is what lets
the hard work of a data science team carry maximum weight in the rest
of the company. Learning to do it well is well worth it.
</p>

<h2><a id="Building-within-team-consensus-on-how-things-are-done"></a><a href="#Building-within-team-consensus-on-how-things-are-done">Building within-team consensus on how things are done</a></h2>

<p>
Closely related is helping data science teams communicate about
uncertainty consistently within themselves. New career data scientists
are usually taught frequentist null hypothesis significance testing.
The <em>p</em> &lt; .05 threshold is drilled into them as an axiom. As a rule
of thumb it’s useful, but as an iron law, it is limiting.
Another very useful thing a staff+ data scientist can do is help
the rest of the team look past the <em>p</em>-value and consider the context
of the decision. Talking through the cost of false positives and
false negatives, the likely distribution of classes in practice,
the opportunity cost of running long experiments in order to reach
significance, alternative ways of evaluating distributions
for decision-making. Helping the whole team to level up and to speak
the same language naturally falls on the staff+ data scientist
as a technical leader.
</p>

<p>
This work extends to coordinating definitions, tools, practices,
and decision criteria more generally. There is tension between the
scientific freedom of allowing everyone to use the analyses, statistical
tools, modeling techniques, presentation formats, and feature definitions
that they prefer, and the chaos that brings about. The larger the data
science team, the greater the chaos. Setting norms for these,
particularly across multiple organizations, is a big undertaking and
difficult to do in a way that doesn't feel heavy handed. Staff+
data scientists that are able to pull this off are rare but their impact
is huge.
</p>

<h2><a id="Collaboration-with-partner-teams"></a><a href="#Collaboration-with-partner-teams">Collaboration with partner teams</a></h2>

<p>
Data scientists work as a specialized piece of a larger machine.
They work with
data analysts to define data models and metric definitions. They work with
data engineers to establish availability, freshness, data types, and
standard transformations. They work with infrastructure engineers to make
sure the data they need is accessible at reasonable latencies.
They work with frontend engineers to make sure important events are
instrumented and logged. They work with backend engineers to make sure
important events are logged there as well and sometimes to pre-compute
expensive features.
</p>

<p>
Data scientists are part of an interconnected web and can only do what they do
because of the other work that goes on around them. Coordinating with
these teams, supporting them in turn, and keeping those communication
lines open and trust high is a core part of the staff+ job.
</p>

<h2><a id="Designing-a-data-science-organization:-Centralized-vs-Distributed"></a><a href="#Designing-a-data-science-organization:-Centralized-vs-Distributed">Designing a data science organization: Centralized vs Distributed</a></h2>

<p>
At a certain level of seniority in a small company, you may be asked to help
grow a data science team from scratch. Once the product team and the
marketing team and the finance team all see how useful a data scientist can be,
they'll each want their own team. A budding data science manager may prefer
to keep the team together under one umbrella, perhaps in the same org as
the analytics or the data engineering team. It is a recurring dilemma
of central vs distributed data science teams.
</p>

<p>
There is no right answer. Either can work <em>if</em> the organization is healthy
and incentives are properly aligned. If not, then <em>neither</em> will work well.
But all else being equal, a central data science team that works closely
with other teams is the easiest to pull off. It's good for getting
data scientists the support they need and building capabilities
that are applicable across the company. It's natural to morph into a
matrixed arrangement from there, where a data scientist reports to their
DS manager, but attends a lot of team meetings with the specific team
they are embedded with.
</p>

<h2><a id="Delivering-Disappointing-Results"></a><a href="#Delivering-Disappointing-Results">Delivering Disappointing Results</a></h2>

<p>
Most leaders don’t like it when numbers disagree with their intuition
or expectations. As much as they profess to be data-driven, more often
than not they seek to be data-validated. If you get the answer they expect,
they will accept and maybe celebrate it. If you come up with something
surprising, disappointing, or uncomplimentary, they may push back,
question assumptions, ask you to revisit it, suggest changes to the approach,
or simply reject the result.
</p>

<p>
There's no one size fits all answer to this, but here are some things to
watch out for.
</p>

<ul>
<li> Don't assume that the stakeholder will be persuaded by a rigorous and
watertight analysis. They are balancing a lot of antagonistic concerns.
Correctness is just one of them.</li>
<li> Don't assume that the stakeholder wants to know the real answer.
Sometimes the incentives to believe a different answer are just too
compelling.</li>
<li> Don't assume that the stakeholder doesn't want to know the real answer.
Pushback and follow up questions are a natural due diligence step.
When a result is surprising and will require a lot of work to acknowledge, some
questioning is appropriate.</li>
<li> Don't assume the stakeholder is questioning your capabilities, motives,
or integrity.</li>
</ul>

<p>
Processing difficult analyses is a conversation, not a homework assignment
where you get to stick it to the professor. It works best when you approach
the leader as another human being who has their own biases, fears, and
goals. Talk it through. Expect to walk them through the numbers, re-run them,
examine your assumptions, and run some follow-up analyses. More times than I
care to admit, these pushback sessions have exposed my blindspots and
resulted in more nuanced conclusions. At the very least, give the stakeholders
space to mourn their lost hopes.
</p>

<p>
<hr>
</p>

<p>
This list isn't exhaustive of course. Every company and team has its own
personality. But it gives a general flavor of the life of a staff+
data scientist in 2026. Have I sold you on it?
</p>

<h2><a id="The-good-bits-are-very-very-good"></a><a href="#The-good-bits-are-very-very-good">The good bits are very very good</a></h2>

<p>
With all of these sticky wickets, you might be wondering whether a
career in data science is even worth it. What I haven’t mentioned yet
is the upside, the marvelous feeling of being in the zone when your team
is working together with other teams in synchrony, bridging the chasm
between ones and zeros and the biggest decisions in the company.
At the staff+ level, this is very similar to the Zen of a software engineer
or a machine learning engineer who is in the zone, for all the same reasons.
There’s nothing quite like it. It’s a satisfaction you feel somewhere
deeply, somewhere just behind your sternum.
You'll have to answer this question for yourself, but for me, yes, it’s worth it.
</p>
    ]]></description>
  </item>



  <item>
    <title>
    Data Engineering for Beginners
    </title>
    <link>
    https://brandonrohrer.org/data_eng_for_beginners.html
    </link>
    <pubDate>
    Wed, 08 Apr 2026 08:36:00 EDT
    </pubDate>
    <guid>
    https://brandonrohrer.org/data_eng_for_beginners.html#The-building-blocks
    </guid>
    <description><![CDATA[

<p>
If you need to climb the Data Engineering learning curve quickly (perhaps in preparation for an interview) I made
<a href="https://brandonrohrer.org/data_eng_for_beginners.html">
a handy one-stop guide</a>.
</p>

<p>
It covers all the basic concepts in a beginner-friendly way. 
The cool part is that most of them aren't too complex, once you strip away
the fancy names. When you're done, you'll know the difference between
strong consistency and eventual consistency, between a flat schema
and a star schema, between OLAP and OLTP, and you'll be able to explain
what makes a data warehouse, a data catalog, a data lake, and a data mart.
</p>
    ]]></description>
  </item>


  <item>
    <title>
    Transformers from the inside out
    </title>
    <link>
    https://brandonrohrer.org/transformers
    </link>
    <pubDate>
    Tue, 17 Mar 2026 08:36:00 EDT
    </pubDate>
    <guid>
    https://brandonrohrer.org/transformers
    </guid>
    <description><![CDATA[

<p>
If your agentic coding environment is a Formula One car then the engine
is the Transformer. The coolest thing to me about transformers is that
they can be broken down into their tiniest pieces and understood,
the way an engine can be disassembled to its smallest parts.
</p>

<p>
I wrote
<a href="https://brandonrohrer.org/transformers">
a step by step walkthrough</a> of the re-assembly of a transformer
starting from its tiniest bits. No machine learning background required.
If you like looking under the hood, this is for you.
</p>

    ]]></description>
  </item>

  <item>
    <title>
    Training your own Language Model Tokenizer
    </title>
    <link>
    https://brandonrohrer.org/alms_tokenizer.html
    </link>
    <pubDate>
    Sun, 22 Feb 2026 06:42:00 EDT
    </pubDate>
    <guid>
    https://brandonrohrer.org/alms_tokenizer.html
    </guid>
    <description><![CDATA[

<p>
The first step in building a small-scale LLM, an
<a href="https://brandonrohrer.com/alms.html">Artisanal Language Model</a>, is to
create a vocabulary of tokens. This converts a long string of text into
a somewhat shorter list of integers. We don't think about this step
often because it happens behind the scenes of LLM creation, but
building a model from scratch gives us the rare opportunity to look
more closely at tokenization.
</p>

<p>
The tokenizer is the first heavy duty processing that input data gets
exposed to when it churns through an LLM. The tokenizer takes the text
or audio or image and breaks it down into a sequence of bite-sized pieces
called tokens.
</p>

<p>
In theory, an LLM could operate on any discrete piece
of information, even fine-grained ones like characters, pixels, and bytes,
but in practice these are too small to carry much information individually
and working with them directly puts a greater burden on the model
to try to combine them and make sense of them.
</p>

<p>
To give the models a leg up, the tiny pieces of input get pre-combined
into more useful bits called tokens. For instance, when processing a
string of English text like <code>How much wood could a woodchuck chuck?</code>, it
might be tokenized into <code>How much</code>, <code>wood</code>, <code>could</code>, <code>a wood</code>, <code>chuck</code>,
<code>chuck</code>, <code>?</code>. Tokens don't fall strictly on word boundaries and can contain
more than one word. Handling 7 tokens instead of 38 characters makes life
a lot easier for the stages that follow. Internally, each of these word
pieces gets assigned a number for efficient handling, so this sentence
ends up looking like <code>[6387, 73, 593, 5365, 837, 837, 24]</code>.
</p>
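
<p>
To make the mechanics concrete, here is a minimal sketch of a greedy
longest-match tokenizer in Python. The token strings and IDs are the ones
from the example above (with leading spaces folded into the tokens, as real
tokenizers often do); the greedy matching scheme is just an illustration,
not necessarily how an ALM's tokenizer would actually be built or trained.
</p>

<pre><code>
def encode(text, vocab):
    """Convert text to token IDs by greedily matching the longest
    vocabulary entry at each position."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, then shorter ones.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            # Nothing matched: fall back to a single character.
            ids.append(vocab.get(text[i], 0))
            i += 1
    return ids


# Toy vocabulary built around the example sentence above.
vocab = {
    "How much": 6387,
    " wood": 73,
    " could": 593,
    " a wood": 5365,
    "chuck": 837,
    " chuck": 837,
    "?": 24,
}

print(encode("How much wood could a woodchuck chuck?", vocab))
# [6387, 73, 593, 5365, 837, 837, 24]
</code></pre>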

<p>
Tokenization is an example of automated feature engineering,
also known as feature learning or
feature discovery. It is understood among machine learning practitioners
that once you get a good set of features, you're 80% of the way to
solving your problem. It is an important step to get right, and has a huge
impact on the quality of results.
</p>

<p>
An advantage of an Artisanal Language Model with a curated data set is
that it gives an opportunity to create a model-specific tokenizer.
The representation of input data can be tailored to the needs of your
specific problem, instead of having to create something that could
potentially cover every conceivable question someone could ask and
every task they could propose.
</p>

<h2>Vocabulary</h2>

<p>
The collection of tokens that a tokenizer works with is called its
vocabulary. The tokenizers used by popular LLMs have vocabularies of
100,000 or more different tokens.
</p>

<p>
The size of the vocabulary is an important design parameter
that has a big effect on how big the model needs to be and how well it can
perform. A larger vocabulary means that word chunks and other pieces
of input can be bigger, saving the model the trouble of splicing them
together. It means that a given sequence of tokens can represent
a longer history of inputs. But it also means that the model has
a larger number of relationships to consider. Any given input token
might be predictive of any given output token, making the size of
that space <em>O(N^2)</em>&mdash;something that grows as the <em>square</em> of the
number of tokens. And when you add in the fact that it's not just
the current token, but the whole context history of tokens, that complexity
goes up further. (Attention is the magic trick that keeps it from
exploding catastrophically. More on that later.)
</p>

<p>
The flip side of this relationship is that reducing a vocabulary to 10%
of its original size means that the demands on
the model are reduced by far more than
that, down to something like 1% or 0.1%. The amount of training data needed,
the computation required, the inference time, all get reduced. The smaller
the vocabulary, the less time and expense to train and use a model.
</p>
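
<p>
As a rough check on that claim, here is the back-of-the-envelope arithmetic,
assuming the dominant cost scales with the square of the vocabulary size,
following the pairwise token-to-token reasoning above. The exact exponent in
practice is an open question.
</p>

<pre><code>
# Illustrative numbers only: a 100,000-token vocabulary cut to 10%.
full_vocab = 100_000
small_vocab = 10_000

pairs_full = full_vocab ** 2    # possible input-output token pairings
pairs_small = small_vocab ** 2

print(pairs_small / pairs_full)  # 0.01, i.e. about 1% of the relationships
</code></pre>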

<p>
Having an ALM focused on a single task with a curated data set makes
this possible. The more targeted the data set, the fewer the tokens
needed to represent it well.
</p>

<p>
All the numbers I mention above are hand-wavy because LLMs are so
expensive to train and evaluate. I'm not aware of any research published
about the exact nature of the relationship between
vocabulary size, number of model
parameters, and model performance. A smaller ALM that is feasible to
re-train a number of times makes this investigation approachable.
It means that the tokenizer for a given ALM can be trial-and-error optimized
for its intended usage.
</p>
    ]]></description>
  </item>

  <item>
    <title>
    Artisanal Language Models
    </title>
    <link>
    https://brandonrohrer.com/alms.html
    </link>
    <pubDate>
    Wed, 18 Feb 2026 08:36:00 EDT
    </pubDate>
    <guid>
    https://brandonrohrer.com/alms.html
    </guid>
    <description><![CDATA[


<p>
Large Language Models are everywhere, and in February 2026
it's hard to imagine what the machine learning world would look like
without them. But it's fun to imagine some alternatives.
</p>

<h2><a id="What-if-an-LLM-were-trained-to-complete-a-specific-task?"></a><a href="#What-if-an-LLM-were-trained-to-complete-a-specific-task?">What if an LLM were trained to complete a specific task?</a></h2>

<p>
LLMs are general purpose tools, trained to be able to do many
different things. As a result they are vast. And they are better at
doing some things than others.
</p>

<p>
If an LLM were instead focused on a particular task, trained to perform it
as well as possible and ignore everything else, then it would have
a better chance at being excellent at that thing, rather than just
being OK at everything.
</p>

<p>
The structure around the model, the preprocessing of the input data and
the postprocessing of the output data, could all be specially built to
give the best results on a single task. Whether for document retrieval
or spell checking or autocomplete or interactive chat, whether for text or
for photos or for music, an Artisanal Language Model could be focused
on doing one thing and doing it well.
</p>

<h2><a id="What-if-we-could-train-an-LLM-on-a-data-set-we-collected-ourselves?"></a><a href="#What-if-we-could-train-an-LLM-on-a-data-set-we-collected-ourselves?">What if we could train an LLM on a data set we collected ourselves?</a></h2>

<p>
LLMs are trained on practically the whole internet.
</p>

<p>
The downside of such a broad swath of training data is that LLMs are also
trained on every stupid thing someone decided to post in a drunken rant.
ML engineers try to correct for this with post-processing filter steps
and prescriptive prompts, but there's no way to completely eliminate
idiocy from the model once it's been trained in.
</p>

<p>
With a set of curated training
data, every inclusion is a deliberate decision, and garbage-in-garbage-out
becomes less of a problem.
Curated training data would also avoid the many legal and ethical violations
called out in the training sets of popular LLMs, including copyright
violation, license violation, nonconsensual sharing of personal images,
and inclusion of CSAM.
</p>

<p>
Another advantage of an ALM focused on a specific task is that it can
limit its training data to what is relevant to the problem at hand.
If building a JavaScript code completion model, we wouldn't need to include
large amounts of prose in every language of the world, or even code in other
computer languages. We wouldn't need to include catalogs of audio recordings
or millions of images. We could limit the training data to JavaScript, making
it orders of magnitude smaller.
</p>

<p>
With smaller training data sets, training itself becomes more feasible.
For LLMs we can count on one hand the companies large enough to train their
own general purpose models from scratch. The investment in data engineering
and computation is huge. But for a training data set that is 0.01% of
the size, the training time and cost come down within the reach of
many more organizations, researchers, and hobbyists.
</p>

<h2><a id="What-if-an-LLM-could-be-trained-on-CPUs?"></a><a href="#What-if-an-LLM-could-be-trained-on-CPUs?">What if an LLM could be trained on CPUs?</a></h2>

<p>
Training a modern LLM requires centuries of GPU time. This gets spread across
many thousands of GPUs so that it completes in a reasonable amount of time,
but it remains massive. And it results in huge hardware and power bills.
</p>

<p>
If a much smaller model were capable of being trained on CPU only, even if
it required many of them, it would remove a huge barrier to language model
training. The rate of experimentation, innovation, diversification, and
specialization would increase tremendously.
</p>

<h2><a id="What-if-an-LLM-could-run-on-your-laptop?"></a><a href="#What-if-an-LLM-could-run-on-your-laptop?">What if an LLM could run on your laptop?</a></h2>

<p>
The current generation of LLMs requires clusters of GPUs for inference.
Even after they are
trained, they require too much computation and hardware to sit comfortably
in anything but a data center.
</p>

<p>
If there were language models small enough to run on a laptop and nimble
enough to return inference results quickly on laptop hardware, their
deployment could be made more robust. No network latency, no reliance on
connectivity, no dependence on external services' uptime. The reliability
engineering would become considerably more straightforward and the
expectations of availability could be raised.
</p>

<p>
This also opens the way for proprietary models and applications requiring
high security. The ability to run isolated from an external network
opens up new domains.
</p>

<h2><a id="What-if-an-LLM-could-continue-to-learn-as-you-use-it?"></a><a href="#What-if-an-LLM-could-continue-to-learn-as-you-use-it?">What if an LLM could continue to learn as you use it?</a></h2>

<p>
Current LLMs are too unwieldy to be updated on the fly. While there are
ways to refine them, such as fine-tuning or reinforcement learning
from human feedback, these are either tweaks to a small part of the model
or post-processing steps. They aren't capable of updating the model as a whole.
</p>

<p>
If it were possible to update the model based on every interaction, every
new bit of input data and user response, that would make it possible for
the model to get better over time in a meaningful way. And most importantly,
it would continue to adapt to the specific users, tasks, and input data
it was exposed to. It would learn the parameters of its job under
the continuous mentorship of its human users.
</p>

<p>
<br>
<br>
What if we had Artisanal Language Models?
</p>

<p>
I hope to find out.
</p>

    ]]></description>
  </item>

  <item>
    <title>
    Graffiti Wall: A commons drawing app
    </title>
    <link>
    https://graffitiwall.nexus/graffiti_wall.html
    </link>
    <pubDate>
    Mon, 09 Feb 2026 08:36:00 EDT
    </pubDate>
    <guid>
    https://graffitiwall.nexus/graffiti_wall.html
    </guid>
    <description><![CDATA[
<p>
I made a public space where anyone and everyone can go and scribble anonymously.
Probably a mistake, but it taught me a lot.
</p>
<p>
<a href="https://graffitiwall.nexus/graffiti_wall.html">
Graffiti Wall</a> (
<a href="https://codeberg.org/brohrer/graffitiwall.nexus">
  frontend code</a>,
<a href="https://codeberg.org/brohrer/graffitiwall-server">
  backend code</a>)
</p>
<p>
It's the latest step on my journey, stitching together everything
I've been learning about
self-hosting, deploying a WSGI server, and backing it with a
Postgres database.
</p>
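<p>
For a sense of what that stack looks like, here is a minimal sketch of a
WSGI app backed by Postgres. The table name, query, and connection settings
are made up for illustration; they aren't taken from the Graffiti Wall code.
</p>
<pre><code>
import json

import psycopg2


def application(environ, start_response):
    # One connection per request keeps the sketch simple; a real
    # deployment would reuse connections or use a pool.
    conn = psycopg2.connect(dbname="graffiti", user="app", host="localhost")
    with conn, conn.cursor() as cur:
        cur.execute("SELECT count(*) FROM strokes;")  # hypothetical table
        (n_strokes,) = cur.fetchone()
    conn.close()

    body = json.dumps({"strokes": n_strokes}).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json")])
    return [body]

# Serve with any WSGI server, for example (assuming this file is app.py):
#   gunicorn app:application
</code></pre>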
<p>
Please enjoy it!
</p>

    ]]></description>
  </item>


  <item>
    <title>
    An addition app
    </title>
    <link>
    https://graffitiwall.nexus/graffiti_adder.html
    </link>
    <pubDate>
    Sun, 28 Dec 2025 08:36:00 EDT
    </pubDate>
    <guid>
    https://graffitiwall.nexus/graffiti_adder.html
    </guid>
    <description><![CDATA[
<p>
I made a little
<a href="https://graffitiwall.nexus/graffiti_adder.html">
Addition App</a> (
<a href="https://codeberg.org/brohrer/graffitiwall.nexus">
  frontend code</a>,
<a href="https://codeberg.org/brohrer/graffitiwall-server">
  backend code</a>)
to stitch together everything I've been learning about
self-hosting, deploying a WSGI server, and backing it with a
Postgres database.
</p>
<p>
Please enjoy it!
</p>

    ]]></description>
  </item>


  </channel>
</rss>
