Training Data Quality

Argsen has recently been working on a project that requires a text comparison component (comparison by semantic meaning). To speed up development, we explored both commercial and open-source APIs for their suitability in our product. At the same time, we trained our own models. I am not going to discuss the results in detail here, but here is the essence: in the IT hardware world, there is no bad hardware, only inadequately priced hardware. Similarly, in the AI world, there are no bad models, only models that do not fit the purpose.

Here are some examples: “software maintenance” and “grave maintenance”; “software debug” and “software maintenance”. Obviously, the first pair should return a low similarity score and the second pair a high one. Unfortunately, many models cannot give correct results in both cases.
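A check like this is easy to automate. The sketch below is a minimal harness, assuming a hypothetical `similarity(a, b)` function that returns a score between 0 and 1. The thresholds (0.3 and 0.7) and the token-overlap stand-in model are illustrative assumptions, not the models we evaluated.

```python
from typing import Callable, List, Tuple

# Each case: (text_a, text_b, expected), where expected is "low" or "high".
CASES = [
    ("software maintenance", "grave maintenance", "low"),
    ("software debug", "software maintenance", "high"),
]

def token_overlap(a: str, b: str) -> float:
    """Naive stand-in model (an assumption): Jaccard overlap of lowercase tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def check(
    similarity: Callable[[str, str], float],
    low: float = 0.3,
    high: float = 0.7,
) -> List[Tuple[str, str, float, bool]]:
    """Score every case and record whether it landed on the expected side."""
    results = []
    for a, b, expected in CASES:
        score = similarity(a, b)
        passed = score <= low if expected == "low" else score >= high
        results.append((a, b, score, passed))
    return results

for a, b, score, passed in check(token_overlap):
    print(f"{a!r} vs {b!r}: {score:.2f} {'ok' if passed else 'FAIL'}")
```

A purely lexical stand-in like this scores both pairs identically (one shared token out of three), so it fails both checks. Swapping in a real model's similarity function for `token_overlap` would show whether that model handles both cases.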

Semantic comparison is based on context, and most models are built on word vectors. We therefore picked some models whose training data we could access and looked for the underlying causes. This is where things get interesting. The model that matches “grave maintenance” with “software maintenance” sources its explanation / context from Wikipedia. When you search for “grave maintenance” on English Wikipedia, the term does not exist and “software maintenance” comes up as the first possible match. Because the same text was then used in training for both terms, the model gives a very high score when comparing the two.
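The failure mode described above can be reproduced with a toy sketch. It assumes a hypothetical pipeline in which each term's vector is built from the text of its best-matching encyclopedia article, falling back to the closest title when no exact article exists. The `ARTICLES` dictionary, the fallback rule, and the bag-of-words vectors are all illustrative assumptions, not the actual model's implementation.

```python
import math
from collections import Counter

# Hypothetical mini "encyclopedia": title -> article text (illustrative only).
ARTICLES = {
    "software maintenance": (
        "modification of software after delivery to correct faults "
        "and improve performance"
    ),
    "cemetery": "a place where the remains of dead people are buried",
}

def lookup(term: str) -> str:
    """Return the article text for a term.

    If no exact title exists, fall back to the title sharing the most words
    (an assumed stand-in for a "first possible match" search result).
    """
    if term in ARTICLES:
        return ARTICLES[term]
    words = set(term.split())
    best = max(ARTICLES, key=lambda title: len(words & set(title.split())))
    return ARTICLES[best]

def embed(term: str) -> Counter:
    """Bag-of-words vector built from the looked-up context text."""
    return Counter(lookup(term).split())

def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(u[w] * v[w] for w in u)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

score = cosine(embed("grave maintenance"), embed("software maintenance"))
print(f"{score:.2f}")
```

In this sketch “grave maintenance” has no article of its own, so the lookup falls back to the “software maintenance” text. Both terms end up with the same context, and the similarity comes out at the maximum even though the terms are unrelated.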

I guess some people will say that such errors can be ruled out during ongoing tagging / training. While I acknowledge that this is possible, such errors really do affect the end-user experience. I also wonder how much human effort has been put into cleaning the training data sets, especially for complex models. More importantly, without access to the original training data sets, it is almost impossible to understand the root causes.

In short, be cautious when using third-party models.