Wednesday, November 21, 2007

How does your site get into a search engine?

From DigitalWebMagazine:

A search engine obtains your URL either by you submitting your site directly to the search engine or by others linking to your site. Then, at a time of its choosing, a search engine sends out its spider (or “bot”) to visit your site.


Once there, the spider starts reading all the text in the body of the page, including markup elements, all links to other pages and to external sites, plus elements from the page head including some meta tags (depending on the search engine) and the title tag.


It then copies this information back to its central database for indexing at a later date, which can be up to two or three months later.
The spider then follows the links on the page, repeating the same process. Spiders are, for lack of a better term, dumb. They can only follow the most basic HTML code. If you’ve encased a link in a fancy JavaScript that the spider won’t understand, the spider will simply ignore both the JavaScript and the link. The same thing applies to forms; spiders can’t fill out forms and click “submit.”
To get an understanding of what a spider sees, try accessing your site with the Lynx browser from a Unix server. Lynx is non-graphical, does not support JavaScript, and displays only text and regular <a href> links. This is what the spider can see and therefore index. Does your page work without graphics or JavaScript? If not, then the spidering won’t work either and you’d better head back to the drawing board.
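If Lynx isn't handy, the same idea can be roughly approximated in a few lines of Python. This is only a sketch of the "text plus plain links" view described above, not how any real spider is implemented; the names SpiderView and spider_view are made up for this example, and it assumes that script and style content is invisible and that javascript: links are not followed.

```python
from html.parser import HTMLParser


class SpiderView(HTMLParser):
    """Collects the visible text and plain <a href> links of a page."""

    def __init__(self):
        super().__init__()
        self.text = []     # text fragments in document order
        self.links = []    # href values a simple spider could follow
        self._skip = 0     # depth inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            # Assumption: javascript: pseudo-links are ignored entirely.
            if href and not href.lower().startswith("javascript:"):
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.text.append(data.strip())


def spider_view(html):
    """Return (visible_text, followable_links) for an HTML string."""
    parser = SpiderView()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text), parser.links
```

Feed it the HTML source of one of your pages. If the returned text is empty or the list of links is missing your main navigation, a spider is likely to have the same trouble.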
Once the SE has all your content in its database, it runs an algorithm (a mathematical formula) against the content. These algorithms are unique to each SE and are constantly changing, but, in essence, all the search engines are looking for the important words on your page (based on word density—how often a word or phrase is used in relation to the total amount of text) and they assign a value to these words based on the code surrounding the words.
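The actual ranking algorithms are not public, so the following is only an illustration of the word density ratio mentioned above: how often a word appears relative to the total number of words. The function name word_density and the simple tokenizing rule are assumptions made for this sketch.

```python
import re
from collections import Counter


def word_density(text, top=5):
    """Return the `top` most frequent words with their share of all words."""
    # Assumption: a "word" is a run of letters, digits, or apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    total = len(words)
    if total == 0:
        return []
    counts = Counter(words)
    return [(word, count / total) for word, count in counts.most_common(top)]
```

A real check would probably also drop common filler words ("the", "and") before counting, since they say little about the topic of the page.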


In addition to content, the search engine looks for what other sites, or pages on the same site, are linking to that page. The more links to a given page, the more important that page is.
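As a toy illustration of the "more links, more important" idea, the sketch below counts how many distinct pages link to each page. The link_graph input (a dictionary mapping each page to the pages it links to) and the function name are hypothetical, and real search engines weigh links in far more elaborate ways than a plain count.

```python
from collections import Counter


def inbound_link_counts(link_graph):
    """Count, for each page, how many other pages link to it."""
    counts = Counter()
    for source, targets in link_graph.items():
        # Count each linking page once, and ignore links from a page to itself.
        for target in set(targets):
            if target != source:
                counts[target] += 1
    return counts
```

Run against a map of your own site, this quickly shows which important pages are only reachable from the index page.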


Getting other sites to link to your site is very important, but not part of optimizing your site and will be covered in a future column. From a site optimization standpoint, make sure you link to your important pages from more than just the index page (e.g., create a primary navigation that appears on all pages).


Tip 1
The first rule of SEO is not to design your site in such a way that the code prevents a spider from being able to index it. This means avoiding pages which are 100% graphics and no text, such as pages that contain all images, or are Flash-only. Furthermore, if the first thing a user encounters is a log-in page, before being able to see the site’s content, then that’s what a spider will see and it won’t go any further, either.
If you’re planning to build a Web site entirely in Flash, DON’T. If you have no choice, then read my previous column, Search Engine Optimization and Non-HTML Sites.

Tip 2


To find out what a spider sees on your site, run a spider simulator on a given page. The simulator will show you what text the spider sees and what links it finds. There are many good ones on the market at various prices. If you’re looking for something that’s free, I’d suggest Search Engine Spider Simulator.


Tip 3


Each Web site should have a file called robots.txt. This file tells the spiders what directories they should not spider. Make sure this file is present and that it gives the appropriate permissions to the spiders. This includes access to content and to CSS.
For more information on the robots.txt file, see: Guide to the Robots Exclusion Protocol.
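One way to confirm that robots.txt gives the permissions you intended is to test it with the robots.txt parser in Python's standard library. This is a minimal sketch; the helper name allowed_paths, the bot name, and the example URLs are placeholders, not anything from the original column.

```python
from urllib.robotparser import RobotFileParser


def allowed_paths(robots_txt, user_agent, urls):
    """Report, for each URL, whether robots_txt lets user_agent fetch it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {url: parser.can_fetch(user_agent, url) for url in urls}


# Hypothetical robots.txt: block one directory, leave everything else open.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

report = allowed_paths(
    EXAMPLE_ROBOTS,
    "ExampleBot",
    ["http://example.com/private/page.html", "http://example.com/css/site.css"],
)
for url, ok in report.items():
    print("allowed" if ok else "blocked", url)
```

With this example file, the page under /private/ is reported as blocked while the CSS file stays reachable.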


Page Structure
Once you’ve built an SE-friendly Web site, you then need to be sure each page is also SE-friendly. As I said earlier, good HTML structure is the foundation for building an SEO Web page.


There are two primary areas of a Web page: the area contained between the <head></head> tags and the area contained between the <body></body> tags. What information you place in these areas has a huge impact on how a page is indexed and, to a certain degree, what will appear in the SE results page.


When designing your page, or placing content on your page, remember that spiders read like people. They go from left to right and from top to bottom (though this may be different for other languages). They also assume that the most important information is located at the top of the page. If it’s important, why would you place it at the bottom? When reading specific tags (title, h1, h2, etc.), search engines value words to the left more highly than words to the right.


The Title Tag


Let’s start at one of the first elements in a Web page: the title tag (<title>). This is one of the, if not the, most important elements for SEO on the entire page. All too often, the information contained in this tag is either left blank, has a default value (e.g. “insert title here”), or is simply the company name.
Why is this tag so important? First of all, it is used by every major search engine as a key indicator of the page’s content, and, second, it is used by the search engine as the first line of your listing in the search engine results pages (SERPs).
Give this tag the consideration it deserves.
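A quick way to catch blank or default titles is to scan your pages for them. The sketch below is one possible check; the names TitleFinder and title_problem are invented for this example, and apart from "insert title here" the placeholder list is an assumption.

```python
from html.parser import HTMLParser


class TitleFinder(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Assumed examples of default titles left behind by editors and templates.
PLACEHOLDER_TITLES = {"insert title here", "untitled", "untitled document"}


def title_problem(html):
    """Return a short description of the title problem, or None if it looks fine."""
    finder = TitleFinder()
    finder.feed(html)
    finder.close()
    title = " ".join(finder.title.split())  # collapse whitespace
    if not title:
        return "missing or blank"
    if title.lower() in PLACEHOLDER_TITLES:
        return "placeholder"
    return None
```

This does not catch the third problem mentioned above, a title that is simply the company name. Checking for that would need the company name as an extra input.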

Tip 4
Determine the main topic of the page and use it as the title.

For the rest of this useful page, go here.
