Automatic categorization of text is a core tool now

This is a big change I noticed in 2018: automatic categorization of text is a tool that companies are using whenever they can.

Let me be specific, because ‘AI’ means a lot of things to a lot of people. I’m talking about taking a database with short freeform text fields and automatically tagging them according to a tagged sample corpus. I’m not talking about text synthesis, anything to do with speech, automatic chat, question answering, or Alexa Skills.
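To make the task concrete, here is a minimal sketch of tagging short text fields from a tagged sample corpus. It assumes a bag-of-words naive Bayes classifier in plain Python, which is one common baseline and not necessarily what any company mentioned here uses. The sample records, the tag names and the function names are all made up for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Crude whitespace tokenizer; real data would need more care.
    return text.lower().split()


def train(tagged_samples):
    """Count tags and per-tag words from (text, tag) pairs."""
    tag_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, tag in tagged_samples:
        tag_counts[tag] += 1
        for word in tokenize(text):
            word_counts[tag][word] += 1
            vocab.add(word)
    return tag_counts, word_counts, vocab


def classify(model, text):
    """Return the most likely tag under naive Bayes with add-one smoothing."""
    tag_counts, word_counts, vocab = model
    total = sum(tag_counts.values())
    best_tag, best_score = None, -math.inf
    for tag, count in tag_counts.items():
        score = math.log(count / total)
        denom = sum(word_counts[tag].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[tag][word] + 1) / denom)
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag


# Hypothetical tagged sample corpus (invented product descriptions).
SAMPLES = [
    ("blue cotton t-shirt size m", "apparel"),
    ("mens leather jacket black", "apparel"),
    ("wool winter socks", "apparel"),
    ("usb-c charging cable 2m", "electronics"),
    ("wireless bluetooth headphones", "electronics"),
    ("laptop power adapter 65w", "electronics"),
]
MODEL = train(SAMPLES)

print(classify(MODEL, "black cotton jacket"))
print(classify(MODEL, "bluetooth cable"))
```

Six samples is far too few for real use; the point is only the shape of the workflow: a tagged sample goes in, and tags for new records come out.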

Example: two companies
Not clear to me what changed
Speech and conversation
Career questions

Example: two companies

Because I’m NY-based, I meet a lot of e-commerce B2B startups. This year I had a weird deja vu moment where a company that was about to launch in 2018 had almost identical DNA to a 2016 company.

Almost identical. Similar B2B positioning (shopping cart plugin), similar set of skills on the founding team, but with one addition: the 2018 company has a founding data scientist and will do the labor-intensive step of their onboarding by machine. This will give them a faster bridge to self-serve.

Onboarding was the thorniest part of the funnel for the 2016 company (not unusual in this sector). It was labor-intensive, buggy and slow. The complex onboarding process delayed their implementation of self-serve, which in theory increased their cost of customer acquisition.

The 2018 company is launching with text classification in their toolbox. Everyone in the company will have to embrace these tools, in the same way that everyone has some awareness of databases now. And their competition will need to get up this curve in order to stay competitive.

Not clear to me what changed

We’ve had some of the building blocks for this kind of text processing for decades, including the stats tools and the training corpora. Does deep learning help? I don’t know, but at minimum it helps by delivering sexy headlines that keep AI in the news, which in turn convinces business stakeholders this is something they can get behind.

It wasn’t magic before and it’s not magic now; the output of these algorithms still requires some amount of quality control and manual inspection. But business leaders are now willing to admit that the old manual way of doing things also had drawbacks.
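One common way to build in that quality control is to accept only confident predictions automatically and send the rest to a person. A rough sketch, assuming the classifier reports a confidence score between 0 and 1 with each tag; the `route` function and the 0.8 threshold are hypothetical and would need tuning against real data.

```python
def route(predictions, threshold=0.8):
    """Split predictions into auto-accepted and manual-review lists.

    predictions: iterable of (record_id, tag, confidence) tuples.
    threshold: assumed cutoff; below it, a human checks the tag.
    """
    auto, review = [], []
    for record_id, tag, confidence in predictions:
        target = auto if confidence >= threshold else review
        target.append((record_id, tag))
    return auto, review


# Invented example predictions.
auto, review = route([
    (1, "apparel", 0.95),
    (2, "electronics", 0.55),
    (3, "apparel", 0.80),
])
print(auto)
print(review)
```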

‘Availability of big data’ is one claim people make but this feels like ‘medium data’ to me. The input data for these classification projects is typically on the scale of the company’s existing business, and the categorization task is something they’re doing by hand now. For example, coding medical claims: the insurance company has hundreds of thousands of these that they’ve done by hand or with a hand-coded classifier.

Investor appetite for ML-based companies probably has some impact here.

Now that we’re using these techniques they don’t feel ‘high tech’, they feel like we could have been doing them for years, and I think that’s what it’s supposed to feel like when a technology becomes mainstream.

In a sense this is the continuation of the software-ization of the service industry. In-person to phone banks to internet to mobile – each of these steps has been about consumer access to new kinds of terminals, not just companies ‘figuring out’ how to sell books online.

Speech and conversation

Speech interfaces, especially conversational speech, feel ‘not there yet’ for now. Alexa is bad because it’s not conversational enough. I don’t have one but while housesitting I got to enjoy conversations like ‘alexa lamp off / which lamp / list lamps / I don’t know how to do that’.

Alexa also witnessed a crime and violated the GDPR while trying to comply with the GDPR, so it’s been a busy year.

I was fascinated and disturbed by the Duplex conversation agent demo Google posted on their blog this summer. The machine learned a ‘guess and check’ conversational pattern that will be familiar to anybody who’s tried to schedule an appointment over a shaky connection. If this is the future, yikes. When the machine was uncertain (which was frequently), it defaulted to browbeating the poor phone clerk with uptalk.

I hope they add the ‘operator’ command.

Duplex feels like an extension of the gig economy; this is about college grads not wanting to waste their time negotiating with grunts. The tech is cool, the social impact troubling. The ‘synthesis’ side of AI (vs recognition / classification) feels generally fraught. Watch for the phantom earring.

Career questions

As a working programmer, should you drop everything and learn NLP now? Will there still be ‘systems programmers’ in 2 years? People are still making money writing COBOL so probably, but who wants to be writing COBOL (figuratively or otherwise).

The parts of the question I can’t answer: how big is the job pool, how long will the bubble last, and how much expertise do you need to make more money than you make now? Grad students are getting snapped up by big tech & hedge funds, but if you wanted to go to grad school, you’d probably be there already.

For myself, I’m learning the basic techniques because they feel core to my industry skillset. I’m staying open to chances to apply them and to work with experts. I’m not even at the midpoint of my career and want to stay ahead of the curve.