Providing a new model for the distance between query words based on the minimal displacement

Document Type : Persian Original Article

Authors

Department of Electrical and Computer, University of Yazd, Yazd, Iran

Abstract

Based on the researches performed on search engines, most user queries contain more than one word. For queries with more than one word, two models can be presented. In the first model, query words are assumed to be independent of each other, and in the second model, the place and the order of words are assumed to be dependent. Experiments show that there are dependencies among most query words. One of the parameters that can determine the dependency between query words is the distance between the query words in the document. In this paper, a new distance definition based on the minimum displacement of the document words in order to match the query is presented. Also, given that most ranking algorithms use the word frequency in the documents (Term Frequency) to score the documents and since there is no clear definition for this parameter for queries with more than one word; in this paper, the frequency of the occurrence of a phrase (Phrase Frequency) and Inverted Document Frequency are defined according to the new concept of distance and the proper algorithms are presented to calculate them. Also, the results of the proposed algorithm are compared with the algorithm implemented by the open source Lucene indexer, which shows a good increase in the mean accuracy.

Keywords


[1]       A. Z. Bidoki, “Effective Web Ranking and Crawling(in persian),” University of Tehran, 2009.
[2]       R. Baeza-Yates, B. Ribeiro-Neto, and others, Modern information retrieval, vol. 463. ACM press New York, 1999.
[3]       G. Salton and C. Buckley, “Term-weighting approaches in automatic text retrieval,” Inf. Process. Manag., vol. 24, no. 5, pp. 513–523, 1988.
[4]       S. E. Robertson, Overview of the Okapi projects, vol. 53, no. 1. MCB UP Ltd, 1997, pp. 3–7.
[5]       Y. Zhang and A. Moffat, “Some Observations on User Search Behaviour.,” Austr. J. Intell. Inf. Process. Syst., vol. 9, no. 2, pp. 1–8, 2006.
[6]       D. Bahle, H. Williams, and J. Zobel, “Compaction techniques for nextword indexes,” in String Processing and Information Retrieval, International Symposium on, 2001, p. 33.
[7]       H. E. Williams, J. Zobel, and D. Bahle, “Fast phrase querying with combined indexes,” ACM Trans. Inf. Syst., vol. 22, no. 4, pp. 573–594, 2004.
[8]       A. Doucet and H. Ahonen-Myka, “An efficient any language approach for the integration of phrases in document retrieval,” Lang. Resour. Eval., vol. 44, no. 1–2, pp. 159–180, 2010.
[9]       I. H. Witten, A. Moffat, and T. C. Bell, Managing gigabytes: compressing and indexing documents and images. Morgan Kaufmann, 1999.
[10]     D. Bahle, “Efficient Phrase Querying,” Sch. Comput. Sci. Inf. Technol. R. Melb. Inst. Technol., 2003.
[11]     A. Fellinghaug, “Phrase searching in text indexes,” no. June, p. 137, 2008.
[12]     C. J. van Rijsbergen, “A theoretical basis for the use of co-occurrence data in information retrieval,” J. Doc., vol. 33, no. 2, pp. 106–119, 1977.
[13]     R. Nallapati and J. Allan, “Capturing term dependencies using a language model based on sentence trees,” in Proceedings of the eleventh international conference on Information and knowledge management, 2002, pp. 383–390.
[14]     E. M. Keen, “The use of term position devices in ranked output experiments,” J. Doc., vol. 47, no. 1, pp. 1–22, 1991.
[15]     W. B. Croft, H. R. Turtle, and D. D. Lewis, “The use of phrases and structured queries in information retrieval,” in Proceedings of the 14th annual international ACM SIGIR conference on Research and development in information retrieval, 1991, pp. 32–45.
[16]     D. Metzler and W. B. Croft, “A Markov random field model for term dependencies,” in Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, 2005, pp. 472–479.
[17]     E. K. F. Dang, R. W. P. Luk, and J. Allan, “A context-dependent relevance model,” J. Assoc. Inf. Sci. Technol., 2015.
[18]     F. Song and W. B. Croft, “A general language model for information retrieval,” in Proceedings of the eighth international conference on Information and knowledge management, 1999, pp. 316–321.
[19]     J. Gao, J.-Y. Nie, G. Wu, and G. Cao, “Dependence language model for information retrieval,” in Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, 2004, pp. 170–177.
[20]     B. He, J. X. Huang, and X. Zhou, “Modeling term proximity for probabilistic information retrieval models,” Inf. Sci. (Ny)., vol. 181, no. 14, pp. 3017–3031, 2011.
[21]     Y. Rasolofo and J. Savoy, Term proximity scoring for keyword-based retrieval systems. Springer, 2003.
[22]     C. Eickhoff, A. P. de Vries, and T. Hofmann, “Modelling Term Dependence with Copulas,” in Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2015, pp. 783–786.
[23]     S. Büttcher, C. L. A. Clarke, and B. Lushman, “Term proximity scoring for ad-hoc retrieval on very large text collections,” in Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, 2006, pp. 621–622.
[24]     T. Tao and C. Zhai, “An exploration of proximity measures in information retrieval,” in Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, 2007, pp. 295–302.
[25]     J. Zhao and Y. Yun, “A proximity language model for information retrieval,” in Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, 2009, pp. 291–298.
[26]     J. B. P. Vuurens and A. P. de Vries, “Distance matters! Cumulative proximity expansions for ranking documents,” Inf. Retr. Boston., vol. 17, no. 4, pp. 380–406, 2014.
[27]     J. Miao, J. X. Huang, and Z. Ye, “Proximity-based rocchio’s model for pseudo relevance,” in Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, 2012, pp. 535–544.
[28]     C. L. A. Clarke, G. V Cormack, and E. A. Tudhope, “Relevance ranking for one to three term queries,” Inf. Process. Manag., vol. 36, no. 2, pp. 291–311, 2000.
[29]     J. Klekota, F. P. Roth, and S. L. Schreiber, “Query Chem: a Google-powered web search combining text and chemical structures,” Bioinformatics, vol. 22, no. 13, pp. 1670–1673, 2006.
[30]     H. Zaragoza, N. Craswell, M. J. Taylor, S. Saria, and S. E. Robertson, “Microsoft Cambridge at TREC 13: Web and Hard Tracks.,” in TREC, 2004, vol. 4, p. 1.
[31]     M. Blum, R. W. Floyd, V. Pratt, R. L. Rivest, and R. E. Tarjan, “Time bounds for selection,” J. Comput. Syst. Sci., vol. 7, no. 4, pp. 448–461, 1973.
[32]     R. Courant, Differential and integral calculus, vol. 2. John Wiley & Sons, 2011.
[33]     S. E. Robertson and S. Walker, “Some for Simple Effective Approximations to the 2 – Poisson Model Probabilistic Weighted Retrieval The,” in Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval, 1994, vol. 1994, no. 1, pp. 232–241.
[34]     R. Duda  O., P. Hart  E., and D. Stork  G., Pattern Classification. 2000.