Etsy is a global marketplace for unique goods. This means that the moment an item becomes popular, it runs the risk of selling out. Machine learning solutions that simply memorize popular items are therefore much less effective, and crafting features that generalize well across the items in our inventory is vital. Additionally, some content features like titles are less informative because they are seller-provided and may be noisy.


Similarly, we consider the purchased item in a particular session to be a global contextual token that applies to the entire sequence of user interactions. The intuition behind this is that a purchase is usually preceded by several actions, so we want to attribute the purchase intent across all the actions that the user took. This is also known as the linear attribution model.
We’re excited about the variety of applications of this model, which range from personalization to ranking to candidate set selection. Stay tuned!
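A minimal sketch of the global context idea, assuming sessions are plain lists of string tokens. The function name `pairs_with_global_context` and the toy tokens are hypothetical and not taken from the production code. Every token in the session is paired with the purchased item, no matter how far apart they are.

```python
def pairs_with_global_context(sentence, purchased, window=2):
    """Build Skip-gram (target, context) pairs for one session and add the
    purchased item as context for every token, regardless of distance."""
    pairs = []
    for i, target in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        context = [sentence[j] for j in range(lo, hi) if j != i]
        # Global context: attach the purchase unless it is already in the window.
        if purchased is not None and target != purchased and purchased not in context:
            context.append(purchased)
        pairs.extend((target, c) for c in context)
    return pairs


# Toy session: a query, two clicked items, then the purchased item.
session = ["query:cookie", "item:1", "item:2", "item:9"]
pairs = pairs_with_global_context(session, purchased="item:9", window=1)
```

With a window of 1, the query would never see the purchased item in plain Skip-gram; here it does.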

What are embeddings?

Additionally, we want to be able to give a user journey that ended in a purchase more importance in the model. We define an importance weight per user interaction (click, dwell, add to cart, and purchase) and incorporate this into our loss function as well.
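One way such a weight could enter the loss, as a hedged sketch: the positive-pair term of the Skip-gram loss is multiplied by a per-interaction weight. The weight values and the names `INTERACTION_WEIGHT` and `weighted_pair_loss` are assumptions made up for illustration; the actual weights are not stated in this post.

```python
import math

# Hypothetical importance weights per interaction type (illustrative values only).
INTERACTION_WEIGHT = {"click": 1.0, "dwell": 1.5, "add_to_cart": 3.0, "purchase": 5.0}


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def weighted_pair_loss(score, interaction):
    """Positive-pair loss -w * log(sigmoid(score)), where `score` is the dot
    product of the target and context vectors and w depends on the interaction."""
    return -INTERACTION_WEIGHT[interaction] * log_sigmoid(score)
```

For the same score, a pair that comes from a purchase contributes more to the loss (and so to the gradient) than a pair that comes from a click.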

This work is a collaboration between Xiaoting Zhao and Nishan Subedi from the Search Ranking team. We would like to thank our manager, Liangjie Hong, for insightful discussions and support; the Recommendation Systems and Search Ranking teams for their input throughout the project, especially Raphael Louca and Adam Henderson for launching products based on the models; Stan Rozenraukh, Allison McKnight and Mohit Nayyar for reviewing this post; and Mihajlo Grbovic, lead author of the semantic embeddings paper, for detailed answers to our questions.
In this blog post, we’ll cover a machine learning technique we are using at Etsy that lets us extract meaning from our data without using content features like titles, modeling only the user journeys across the site. This post assumes a basic understanding of machine learning concepts.

For an item that is a cookie with a steer and cactus design, we see the previous method latch onto content from the term ‘steer’ and ignore ‘cactus’, whereas the semantic embeddings place emphasis on cookies. We find that this has the advantage of not having to guess the importance of a particular attribute of an item, and we can instead rely on user engagement to guide us.
We also found semantic embeddings to provide better similar items for a particular item compared to a candidate set generation model that is based on content. This example comes from a model we released recently to generate similar items.

Semantic embeddings are agnostic to the content of items, such as their titles, tags, and descriptions, and let us leverage aggregate user interactions on the site to extract items that are semantically similar. They give us the ability to embed our search queries, items, shops, categories, and locations. This provides compression, which improves inference speed compared to representing them as one-hot encodings, and enables candidate selection and featurization across multiple machine learning problems. Modeling user journeys gives us information that differs from content-based methods that leverage titles and descriptions of items, and so these methods can be used in conjunction.

We also found a significant improvement in performance by training the model on the past year’s data for the current and upcoming month, which adds some forecasting capability. For example, for a model serving production during the month of December, the past year’s December and January data was added, so our model would see interactions related to Christmas during that time.
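The month selection described here can be sketched as a small helper. This is an illustration under the assumption that "the past year's data" means the same calendar month one year back plus the month that follows it; the function name `seasonal_window` is hypothetical.

```python
def seasonal_window(serving_year, serving_month):
    """Return the (year, month) pairs of past-year data to add for a model
    serving in the given month: the same month one year back, and the month
    right after it."""
    next_month = serving_month % 12 + 1
    # If the following month wraps into January, it falls in the serving year.
    next_year = serving_year if next_month < serving_month else serving_year - 1
    return [(serving_year - 1, serving_month), (next_year, next_month)]


# A model serving in December 2018 would add December 2017 and January 2018.
window = seasonal_window(2018, 12)
```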


This first set of query similarities captures many different animals for the query jaguar. The next set shows that the model is also able to relate queries across different languages.
We aim to learn a vector representation for each unique token, where a token can be a listing id, shop id, query, category, or anything else that is part of a user’s interaction. We were able to train embeddings of up to 100 dimensions on a single box. Our final models take in billions of tokens and are able to produce embeddings for tens of millions of unique tokens.

Training application-specific models gave us better performance. For example, if we’re interested in capturing shop-level embeddings, training on the shop for an item instead of just the item yields better performance than averaging the embeddings of the shop’s items. We are actively experimenting with these models and plan to incorporate session- and user-specific data in the future.
These candidates are generated based on a k-NN search across the semantic representations of items. We were able to run state-of-the-art recall algorithms, unconstrained by memory, on our training boxes themselves.
Note that all of these relations are captured without the model being fed any content features. These are results of the embeddings projected onto TensorBoard and filtered to only search queries.
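To make the k-NN step concrete, here is a brute-force sketch using cosine similarity over a tiny, made-up embedding table. It is an exact search written for clarity and stands in for whichever recall algorithm is used in production, which this post does not name. The function names and vectors are hypothetical.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def nearest(token, embeddings, k=2):
    """Return the k tokens whose vectors are most similar to `token`."""
    q = embeddings[token]
    scored = ((cosine(q, vec), tok) for tok, vec in embeddings.items() if tok != token)
    return [tok for _, tok in heapq.nlargest(k, scored)]


# Toy 2-dimensional embeddings (illustrative values only).
embeddings = {
    "item:1": [1.0, 0.0],
    "item:2": [0.9, 0.1],
    "item:3": [0.0, 1.0],
    "item:4": [-1.0, 0.0],
}
candidates = nearest("item:1", embeddings, k=2)
```

Brute force costs one pass over the whole vocabulary per lookup, which is why approximate methods are usually preferred at the scale of tens of millions of tokens.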

Word2vec is a popular method in natural language processing for learning a semi-supervised model from unsupervised data to discover similarity across words in a corpus, using an unlabelled body of text. This is done by relating the co-occurrence of words, and it relies on the assumption that words that appear together are more related than words that are far apart.

A user action can be broadly defined as any sort of explicit or implicit engagement of the user with the product. We extract user interactions from multiple sources such as the search, category, market, and shop home pages, where these interactions are aggregated rather than tied to a particular user.

We’ve already productionized use of these embeddings across product recommendations and guided search experiences, and they show great promise for our ranking algorithms as well. Outside Etsy, similar semantic embeddings have been used successfully to learn representations for delivering ads as product recommendations via email and for matching relevant ads to queries at Yahoo, and to improve search ranking and derive similar listings for recommendations at Airbnb.
Etsy has over 50 million active items listed on the site from more than 2 million sellers, and tens of millions of unique search queries each month. This amounts to billions of tokens (items or user actions, the equivalent of words in NLP word2vec) for training. We were able to train embeddings by modeling a sequence of user interactions as a plain word2vec model, but we quickly ran into some limitations. The resulting embeddings did not give us satisfactory performance. This gave us further assurance that a number of extensions to the word2vec implementation were necessary, and so we extended the model.

We initially started training the embeddings as a Skip-gram model with the negative sampling method (NEG, as outlined in the original word2vec paper). The Skip-gram model performs better than the Continuous Bag Of Words (CBOW) model for larger vocabularies. It models the context given a target token and attempts to maximize the probability of observing any of the context tokens given that target token. Negative sampling draws a negative token from the entire corpus with a frequency that is directly proportional to the frequency of the token appearing in the corpus.
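As a rough illustration of these two pieces, the sketch below builds (target, context) pairs from one token sequence and draws negatives with probability proportional to raw corpus frequency, as described above. The helper names `skipgram_pairs` and `sample_negatives`, the window size, and the toy counts are assumptions for the example rather than details of the production system.

```python
import random
from collections import Counter


def skipgram_pairs(sentence, window=2):
    """Return (target, context) pairs within a symmetric window."""
    pairs = []
    for i, target in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        pairs.extend((target, sentence[j]) for j in range(lo, hi) if j != i)
    return pairs


def sample_negatives(counts, k, exclude, rng):
    """Draw k tokens with probability proportional to their corpus frequency,
    never returning a token listed in `exclude`."""
    tokens = [t for t in counts if t not in exclude]
    weights = [counts[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=k)


# Toy corpus counts (illustrative values only).
counts = Counter({"item:1": 50, "item:2": 30, "item:3": 5, "query:jaguar": 15})
pairs = skipgram_pairs(["query:jaguar", "item:1", "item:2"], window=1)
negatives = sample_negatives(counts, k=3, exclude={"item:1"}, rng=random.Random(0))
```

Frequent tokens are drawn as negatives more often, so the model sees them as negatives roughly as often as they appear as positives elsewhere in the corpus.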

Skip-gram model and extensions

Training a Skip-gram model on only randomly selected negatives, however, ignores implicit contextual signals that we’ve found to be indicative of user preference in other contexts. For instance, if a user clicks on an item further down the results for a search query, the user most likely saw, but did not like, the items shown above it. We extend the loss function by appending these implicit negative signals to the Skip-gram loss.
The same method can be used to model user interactions on Etsy by modeling user journeys in aggregate as a sequence of user actions. Each user session is analogous to a sentence, and each user action (clicking on an item, visiting a shop’s home page, issuing a search query) is analogous to a word in NLP word2vec parlance. This way of modeling interactions allows us to represent items or other entities (shops, locations, users, queries) as low-dimensional continuous vectors (semantic embeddings), where the similarity between two vectors represents their co-relatedness. This method can be used without knowing anything about any particular user.
Estate pipe refers to tobacco pipes that are previously owned. We find the model able to identify the different materials the pipe is made from (briar, corn cob, meerschaum) and different brands of manufacturers (Dunhill and Peterson), and to identify accessories that are relevant to this particular type of pipe (pipe tamper), while not showing similarity with glass pipes, which are not relevant in this context. Content-based methods have not been effective in dealing with this. The embeddings are also able to capture different styles, with hippie, gypsy, and gypsysoul all being related styles.
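A hedged sketch of what appending implicit negatives to the loss can look like for a single target token. Vectors are plain Python lists, and the function `skipgram_loss` and its argument names are hypothetical. Implicit negatives are treated exactly like random negatives here, which is one reasonable reading of the description and not necessarily the exact form used in production.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def skipgram_loss(target, positives, random_negs, implicit_negs):
    """Negative log-likelihood for one target vector.

    positives:     context vectors observed with the target
    random_negs:   vectors sampled at random from the corpus
    implicit_negs: vectors for items the user saw but skipped
    """
    loss = 0.0
    for c in positives:
        loss -= log_sigmoid(dot(target, c))
    for n in list(random_negs) + list(implicit_negs):
        loss -= log_sigmoid(-dot(target, n))
    return loss
```

Minimizing this loss pushes the target vector toward its observed context and away from both kinds of negatives, so a skipped item that currently sits close to the target adds a noticeable penalty.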
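The session-as-sentence analogy can be sketched as follows, assuming a hypothetical event log of `(session_id, timestamp, action, token)` tuples and a `type:id` naming scheme for tokens. Neither the log format nor the naming scheme comes from the production pipeline.

```python
from collections import defaultdict

# Hypothetical raw interaction log: (session_id, timestamp, action, token).
events = [
    ("s1", 2, "click", "item:101"),
    ("s1", 1, "search", "query:jaguar"),
    ("s1", 3, "visit_shop", "shop:7"),
    ("s2", 1, "search", "query:estate pipe"),
    ("s2", 2, "click", "item:202"),
]


def sessions_to_sentences(events):
    """Group events by session and order them by time, giving one token
    sequence (a "sentence") per session."""
    by_session = defaultdict(list)
    for session_id, ts, _action, token in events:
        by_session[session_id].append((ts, token))
    return {sid: [tok for _, tok in sorted(rows)] for sid, rows in by_session.items()}


sentences = sessions_to_sentences(events)
```

Each resulting list can then be fed to a word2vec-style trainer in place of a sentence of words. Session ids are only used for grouping and are not part of the training tokens.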

The model performed significantly better when we thresholded tokens based on their type. For example, the frequency count and distribution for queries are normally very different from those of items or shops. User queries are unbounded and have a very long tail, and there are orders of magnitude more of them than shops. So we want to capture all of the shops in the embedding vector space, whereas we limit queries or items according to a cutoff.
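A small sketch of type-dependent thresholding, assuming tokens carry a `type:id` prefix. The cutoff values in `MIN_COUNT` are made up for illustration; the only property taken from the text is that shops are always kept while queries and items are cut off by frequency.

```python
from collections import Counter

# Hypothetical minimum counts per token type; shops use 1 so every shop is kept.
MIN_COUNT = {"query": 3, "item": 2, "shop": 1}


def token_type(token):
    """Return the part before the first ':' (assumed naming scheme)."""
    return token.split(":", 1)[0]


def build_vocab(sentences, min_count=MIN_COUNT):
    """Keep a token only if it meets the frequency cutoff for its type."""
    counts = Counter(tok for sentence in sentences for tok in sentence)
    return {t for t, c in counts.items() if c >= min_count[token_type(t)]}


sentences = [
    ["query:jaguar", "item:1", "shop:7"],
    ["query:jaguar", "item:1"],
    ["query:jaguar", "query:rare typo", "item:2"],
]
vocab = build_vocab(sentences)
```

In this toy data the one-off query and the item seen once are dropped, while the shop seen once is kept.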


These are some interesting highlights of what the semantic embeddings are able to capture: