Yury Lifshits | Publications

Research papers

Evolution of Two-Sided Markets

Ravi Kumar, Yury Lifshits, Andrew Tomkins

WSDM, 2010

Show abstract

Hide abstract

Two-sided markets arise when two different types of users may realize gains by interacting with one another through one or more platforms or mediators. We initiate a study of the evolution of such markets. We present an empirical analysis of the value accruing to members of each side of the market, based on the presence of the other side. We codify the range of value curves into a general theoretical model, characterize the equilibrium states of two-sided markets in our model, and prove that each platform will converge to one of these equilibria. We give some early experimental results of the stability of two-sided markets, and close with a theoretical treatment of the formation of different kinds of coalitions in such markets.
Combinatorial Approach to Similarity Search

Yury Lifshits

SISAP, 2009

Show abstract

Hide abstract

We present an overview of the combinatorial framework for similarity search. An algorithm is combinatorial if only direct comparisons between two pairwise similarity values are allowed. Namely, the input dataset is represented by a comparison oracle that given any three points x,y,z answers whether y or z is closer to x. We assume that the similarity order of the dataset satisfies the four variations of the following disorder inequality: if x is the a'th most similar object to y and y is the b'th most similar object to z, then x is among the D(a+b) most similar objects to z, where D is a relatively small disorder constant. Combinatorial algorithms for nearest neighbor search have two important advantages: (1) they do not map similarity values to artificial distance values and do not use triangle inequality for the latter, and (2) they work for arbitrarily complicated data representations and similarity functions.

Ranwalk, the first known combinatorial solution for nearest neighbors, is randomized, exact, zero-error algorithm with query time that is logarithmic in number of objects. But Ranwalk preprocessing time is quadratic. Later on, another solution, called combinatorial nets, was discovered. It is deterministic and exact algorithm with near-linear time and space complexity of preprocessing, and near-logarithmic time complexity of search. Combinatorial nets also have a number of side applications. For near-duplicate detection they lead to the first known deterministic algorithm that requires just near-linear time + time proportional to the size of output. For any dataset with small disorder combinatorial nets can be used to construct a visibility graph: the one in which greedy routing deterministically converges to the nearest neighbor of a target in logarithmic number of steps. The later result is the first known work-around for Navarro's impossibility of generalizing Delaunay graphs.
Conditional Probability Tree Estimation Analysis and Algorithms

Alina Beygelzimer, John Langford, Yury Lifshits, Gregory Sorkin, Alex Strehl

UAI, 2009

Show abstract

Hide abstract

We consider the problem of estimating the conditional probability of a label in time O(log n), where n is the number of possible labels. We analyze a natural reduction of this problem to a set of binary regression problems organized in a tree structure, proving a regret bound that scales with the depth of the tree. Motivated by this analysis, we propose the first online algorithm which provably constructs a logarithmic depth tree on the set of labels to solve this problem. We test the algorithm empirically, showing that it works successfully on a dataset with roughly 10^6 labels.
Combinatorial Algorithms for Nearest Neighbors, Near-Duplicates and Small World Design

Yury Lifshits, Shengyu Zhang

SODA, 2009

Show abstract

Hide abstract

We study the so called combinatorial framework for algorithmic problems in similarity spaces. Namely, the input dataset is represented by a comparison oracle that given three points x, y, z answers whether y or z is closer to x. We assume that the similarity order of the dataset satisfies the four variations of the following disorder inequality: if x is the a'th most similar object to y and y is the b'th most similar object to z, then x is among the D(a + b) most similar objects to z.

Though the oracle gives much less information compared to the standard general metric space model where distance values are given, one can still design very efficient algorithms for various fundamental computational tasks. For nearest neighbor search we present deterministic and exact algorithm with almost linear time and space complexity of preprocessing, and near-logarithmic time complexity of search. Then, for near-duplicate detection we present the first known deterministic algorithm that requires just near-linear time + time proportional to the size of output. Finally, we show that for any dataset satisfying the disorder inequality a visibility graph can be constructed: all out-degrees are near-logarithmic and greedy routing deterministically converges to the nearest neighbor of a target in logarithmic number of steps. The later result is the first known work-around for Navarro's impossibility of generalizing Delaunay graphs.

The technical contribution of the paper consists of handling "false positives" in data structures and an algorithmic technique up-aside-down-filter.
Speeding Up HMM Decoding and Training by Exploiting Sequence Repetitions

Yury Lifshits, Shay Mozes, Oren Weimann, and Michal Ziv-Ukelson

Algorithmica, 2009

Show abstract

Hide abstract

We present a method to speed up the dynamic program algorithms used for solving the HMM decoding and training problems for discrete time-independent HMMs. We discuss the application of our method to Viterbi's decoding and training algorithms [33], as well as to the forward-backward and Baum-Welch [6] algorithms. Our approach is based on identifying repeated substrings in the observed input sequence. Initially, we show how to exploit repetitions of all sufficiently small substrings (this is similar to the Four Russians method). Then, we describe four algorithms based alternatively on run length encoding (RLE), Lempel-Ziv (LZ78) parsing, grammar-based compression (SLP), and byte pair encoding (BPE). Compared to Viterbi's algorithm, we achieve speedups of Theta(log n) using the Four Russians method, Omega(r/log r) using RLE, Omega(log n /k) using LZ78, Omega(r/k) using SLP, and Omega(r) using BPE, where k is the number of hidden states, n is the length of the observed sequence and r is its compression ratio (under each compression scheme). Our experimental results demonstrate that our new algorithms are indeed faster in practice. Furthermore, unlike Viterbi's algorithm, our algorithms are highly parallelizable.
Disorder inequality: A combinatorial approach to nearest neighbor search

Navin Goyal, Yury Lifshits and Hinrich Schütze

WSDM, 2008

Show abstract

Hide abstract

We say that an algorithm for nearest neighbor search is combinatorial if only direct comparisons between two pairwise similarity values are allowed. Combinatorial algorithms for nearest neighbor search have two important advantages: (1) they do not map similarity values to artificial distance values and do not use triangle inequality for the latter, and (2) they work for arbitrarily complicated data representations and similarity functions.

In this paper we introduce a special property of the similarity function on a set S that leads to efficient combinatorial algorithms for S. Disorder constant D(S) of a set S is defined to ensure the following inequality: if x is the a'th most similar object to z and y is the b'th most similar object to z, then x is among the D(S)(a+b)'th most similar objects to y.

Assuming that disorder is small we present the first two known combinatorial algorithms for nearest neighbors whose query time has logarithmic dependence on the size of S. The first one, called Ranwalk, is a randomized zero-error algorithm that always returns the exact nearest neighbor. It uses space quadratic in the input size in preprocessing, but is very efficient in query processing. The second algorithm, called Arwalk, uses near-linear space. It uses random choices in preprocessing, but the query processing is essentially deterministic. For every fixed query, there is only a small probability that the chosen data structure does not support it.

Finally, we show that for Reuters corpus average disorder is, indeed, quite small and that Ranwalk efficiently computes the nearest neighbor in most cases.
Potential theory for mean payoff games

Yury Lifshits and Dmitri Pavlov

Journal of Mathematical Sciences, Springer, 2007

Show abstract

Hide abstract

We present an O(mn2^n log Z) deterministic algorithm for solving the mean payoff game problem, m and n being respectively the number of arcs and vertices in the game graph and Z being the maximum weight (we assume that the weights are integer numbers). The theoretical basis for the algorithm is the potential theory for mean payoff games. This theory allows to restate the problem in terms of solving systems of algebraic equations with minima and maxima. Also we use arc reweighting technique to solve the mean payoff game problem by applying simple modifications to the game graph that do not change the set of winning strategies, obtaining at the end a trivial instance of the problem. We show that any game graph can be simplified by n reweightings.
Decidability of parameterized probabilistic information flow

Daniele Beauquier, Marie Duflot, and Yury Lifshits

CSR, 2007

Show abstract

Hide abstract

In this paper, we consider the decidability of two problems related to information flow in a system with respect to some property. A flow occurs in a system if the conditional probability of the property under some partial observation differs from the a priori probability of that property. For systems modeled as finite Markov chains we prove that the two following problems are decidable: does a system has information flow for a given regular property? is it true that the system has no information flow for any (sequential) property?
Algorithms and complexity analysis for processing compressed texts

Yury Lifshits

PhD thesis, 2007

Show abstract

Hide abstract

(it's in Russian) Как хранить данные, чтобы одновременно иметь быстрый доступ к нужной информации и максимально уменьшить их размер? Для решения первой задачи разработано множество структур данных (сбалансированные деревья, суффиксные массивы и т.д.), ориентированных на быструю обработку конкретных видов запросов. Для уменьшения размера данных используются алгоритмы архивирования (LZ77, LZW, кодирование длин серий и т.д.). Можно ли объединить достоинства этих двух подходов? В представляемой работе построены алгоритмы вычисления ряда свойств сжатых текстов без распаковки, получены нижние оценки на трудоемкость некоторых других задач на сжатых текстах, и, наконец, предложен новый подход к архивированию текстов.
Processing compressed texts: A tractability border

Yury Lifshits

CPM, 2007

Show abstract

Hide abstract

What kind of operations can we perform effectively (without full unpacking) with compressed texts? In this paper we consider three fundamental problems: (1) check the equality of two compressed texts, (2) check whether one compressed text is a substring of another compressed text, and (3) compute the number of different symbols (Hamming distance) between two compressed texts of the same length. We present an algorithm that solves the first problem in O(n^3) time and the second problem in O(n^2m) time. Here n is the size of compressed representation (we consider representations by straight-line programs) of the text and m is the size of compressed representation of the pattern. Next, we prove that the third problem is actually #P-complete. Thus, we indicate a pair of similar problems (equivalence checking, Hamming distance computation) that have radically different complexity on compressed texts. Our algorithmic technique used for problems (1) and (2) helps for computing minimal periods and covers of compressed texts.
Tiling Periodicity

Juhani Karhumaki, Yury Lifshits, and Wojciech Rytter

CPM, 2007

Show abstract

Hide abstract

We contribute to combinatorics and algorithmics of words by introducing new types of periodicities in words. A tiling period of a word w is partial word u such that w can be decomposed into several disjoint parallel copies of u, e.g. a_b is a tiling period of aabb. We investigate properties of tiling periodicities and design an algorithm working in O(n log(n) log log(n)) time which finds a tiling period of minimal size, the number of such periods and their compact representation. The combinatorics of tiling periods differs significantly from that for classical full periods, for example unlike the classical case the same word can have many different primitive tiling periods. We consider also a related new type of periods called in the paper multi-periods. As a side product of the paper we solve an open problem posted by T. Harju.
Estimation of the Click Volume by Large Scale Regression Analysis

Yury Lifshits and Dirk Nowotka

CSR, 2007

Show abstract

Hide abstract

How could one estimate the total number of clicks a new advertisement could potentially receive in the current market? This question, called the click volume estimation problem is investigated in this paper. This constitutes a new research direction for advertising engines. We propose a model of computing an estimation of the click volume. A key component of our solution is the application of linear regression to a large (but sparse) data set. We propose an iterative method in order to achieve a fast approximation of the solution. We prove that our algorithm always converges to optimal parameters of linear regression. To the best of our knowledge, it is the first time when linear regression is considered in such a large scale context.
Maximal Intersection Queries in Randomized Graph Models

Benjamin Hoffmann, Yury Lifshits, and Dirk Nowotka

CSR, 2007

Show abstract

Hide abstract

Consider a family of sets and a single set, called query set. How can one quickly find a member of the family which has a maximal intersection with the query set? Strict time constraints on the query and on a possible preprocessing of the set family make this problem challenging. Such maximal intersection queries arise in a wide range of applications, including web search, recommendation systems, and distributing on-line advertisements. In general, maximal intersection queries are computationally expensive. Therefore, one need to add some assumptions about input in order to get an efficient solution. We investigate two well-motivated distributions over all families of sets and propose an algorithm for each of them. We show that with very high probability an almost optimal solution is found in time logarithmic in the size of the family. In particular, we point out a threshold phenomenon on the probabilities of intersecting sets in each of our two input models which leads to efficient algorithms mentioned above.
Guaranteed slowdown, generalized encryption schemes and function sharing

Yury Lifshits

Proc. of Institute of System Programming of Russian Academy of Science, 2006

Show abstract

Hide abstract

The goal of the paper is to construct mathematical abstractions of different aspects of real life software protection. We introduce three following notions: program slowdown, generalized encryption scheme and function sharing. These schemes allowed to discover new applications of such known ideas as trap-door functions and self-correcting programs.
Window subsequence problems for compressed texts

Patrick Cegielski, Irene Guessarian, Yury Lifshits and Yuri Matiyasevich

CSR, 2006

Show abstract

Hide abstract

Given two strings (a text t of length n and a pattern p) and a natural number w, window subsequence problems consist in deciding whether p occurs as a subsequence of t and/or finding the number of size (at most) w windows of text t which contain pattern p as a subsequence, i.e. the letters of pattern p occur in the text window, in the same order as in p, but not necessarily consecutively (they may be interleaved with other letters). We are searching for subsequences in a text which is compressed using Lempel-Ziv-like compression algorithms, without decompressing the text, and we would like our algorithms to be almost optimal, in the sense that they run in time O(m) where m is the size of the compressed text. The pattern is uncompressed (because the compression algorithms are evolutive: various occurrences of a same pattern look different in the text).
Quering and embedding compressed texts

Yury Lifshits and Markus Lohrey

MFCS, 2006

Show abstract

Hide abstract

In this work the computational complexity of two simple string problems on compressed input strings is considered: the querying problem (What is the symbol at a given position in a given input string?) and the embedding problem (Can the first input string be embedded into the second input string?). Straight-line programs are used for text compression. It is shown that the querying problem becomes P-complete for compressed strings, while the embedding problem becomes hard for the complexity class Theta^p_2.
Partitioning a k-connected graph

Yury Lifshits

Discrete Mathematics and Applications, VSP, 2005

Show abstract

Hide abstract

We study partitions of graphs by a system of disconnecting sets. We find a sharp upper bound for the number of resulting parts and analyse the case where the bound is attained. The result due to D. V. Karpov about the number of parts in a partition is proved under weaker constraints imposed on the graph. We also prove a theorem on bounding parts which yields an upper bound for the number of parts of the partition adjacent to a given vertex.
On the computational complexity of embedding of compressed texts

Yury Lifshits

Diploma thesis, 2005

Show abstract

Hide abstract

In this work we consider a well-known problem of processing of compressed texts. We study the following question (called Embedding): whether one compressed text is a subsequence of another compressed text? In this paper we show that Embedding is NP- and co-NP-hard.
A lower bound on the size of epsilon-free NFA corresponding to a regular expression

Yury Lifshits

Information Processing Letters, Elsevier, 2003

Show abstract

Hide abstract

Hromkovic et al. showed how to transform a regular expression of size n into an epsilon-free nondeterministic finite automaton (which defines the same language as the expression) with O(n) states and O(n log2(n)) transitions. They also established a lower bound (n*log(n)) on the number of transitions. We improve the lower bound to ( n*log^2 n/log log n).

Some of my publications are listed at Scholar.Google, DBLP and Computer Science Bibliographies.