Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duplicate page.url as url.*, to make it searchable #3827

Closed
hmdhk opened this issue May 4, 2020 · 19 comments · Fixed by #3857
Closed

Duplicate page.url as url.*, to make it searchable #3827

hmdhk opened this issue May 4, 2020 · 19 comments · Fixed by #3857

Comments

@hmdhk
Copy link
Contributor

hmdhk commented May 4, 2020

For RUM page load transactions and errors, index the page URL into ECS url fields so it is full-text searchable. For backwards compatibility (up until 8.0), we would need to continue indexing into page.url as well.


Original description

The current Transaction.page.url is not indexed and therefore not searchable (it's only used in the UI). Since searching over the page url has been requested before we should consider indexing this field.

We should, also, clean up the url for indexing. e.g. removing query strings. another possiblity is to add a fullUrl field, that contains the url without any change and set the clean up url on page.url

@vigneshshanmugam
Copy link
Member

What about using the request.url which has the full structure of URL defined properly in the spec.

"url": {
"description": "A complete Url, with scheme, host and path.",
"type": "object",
"properties": {
"raw": {
"type": ["string", "null"],
"description": "The raw, unparsed URL of the HTTP request line, e.g https://example.com:443/search?q=elasticsearch. This URL may be absolute or relative. For more details, see https://www.w3.org/Protocols/rfc2616/rfc2616-sec5.html#sec5.1.2.",
"maxLength": 1024
},
"protocol": {
"type": ["string", "null"],
"description": "The protocol of the request, e.g. 'https:'.",
"maxLength": 1024
},
"full": {
"type": ["string", "null"],
"description": "The full, possibly agent-assembled URL of the request, e.g https://example.com:443/search?q=elasticsearch#top.",
"maxLength": 1024
},
"hostname": {
"type": ["string", "null"],
"description": "The hostname of the request, e.g. 'example.com'.",
"maxLength": 1024
},
"port": {
"type": ["string", "integer","null"],
"description": "The port of the request, e.g. '443'",
"maxLength": 1024
},
"pathname": {
"type": ["string", "null"],
"description": "The path of the request, e.g. '/search'",
"maxLength": 1024
},
"search": {
"description": "The search describes the query string of the request. It is expected to have values delimited by ampersands.",
"type": ["string", "null"],
"maxLength": 1024
},
"hash": {
"type": ["string", "null"],
"description": "The hash of the request URL, e.g. 'top'",
"maxLength": 1024
}
}
},

Not sure if its searchable as well. /cc @elastic/apm-server Please let us know your thoughts.

@moix
Copy link

moix commented May 21, 2020

Hi, I was looking for something similar and found this ticket. I'm quite new to elastic apm and rum and are investigating now documents structure, I also read about ecs recommendations and was wondering if the url shouldn't be under url? url.registered_domain looks more appropiate? (https://www.elastic.co/guide/en/ecs/current/ecs-url.html) or am I misunderstanding something?

Thanks!

@axw
Copy link
Member

axw commented May 27, 2020

I tend to agree, @moix. If we record the page URL as url.full in Elasticsearch, then it will be searchable (using url.full.text) as it's a multi-field.

@vigneshshanmugam In theory you could send request.url, but if you send request.url you also have to send request.method. It's meant for HTTP requests, not browser page loads (which might be stored locally, and have no corresponding HTTP request...)

We can alternatively update the server to store context.page.url as url.full in Elasticsearch. I suppose we would need to duplicate the URL in page.url for backwards compatibility, at least until 8.0. @jahtalab @vigneshshanmugam what do you think?

@vigneshshanmugam
Copy link
Member

We can alternatively update the server to store context.page.url as url.full in Elasticsearch. I suppose we would need to duplicate the URL in page.url for backwards compatibility, at least until 8.0.

This solution sound good to me. Also I am in favour of the URL duplication work to be done in the server as it would reduce the payload size in case if we need to duplicate from RUM side.

@hmdhk
Copy link
Contributor Author

hmdhk commented May 27, 2020

Thanks @axw, duplicating the context.page.url as url.full sounds good to me, the only concern here is that page.url contains all parts of the url (including the query string and fragment) which in practice could be very long. So the server needs to shorten the url, instead of rejecting the payload if the length exceeds the allowed limit.

Also, if we plan to remove context.page.url from 8.0 forward then this requires a change in APM UI to use the new field!

@axw
Copy link
Member

axw commented May 28, 2020

Cool. I'm going to transfer this to apm-server then, and we'll take it from there.

@axw axw transferred this issue from elastic/apm-agent-rum-js May 28, 2020
@axw axw changed the title Consider indexing page url Duplicate page.url as url.*, to make it searchable May 28, 2020
@hmdhk
Copy link
Contributor Author

hmdhk commented May 28, 2020

@axw, two questions:

  1. Do you think we can get this on 7.9 milestone?
  2. Is there a possibility to reindex existing data to duplicate the url when users upgrade? This is not a requirement but only nice to have, if the complexity of reindexing is not huge.

@axw
Copy link
Member

axw commented May 28, 2020

@jahtalab

Do you think we can get this on 7.9 milestone?

I think we can probably manage that.

Is there a possibility to reindex existing data to duplicate the url when users upgrade? This is not a requirement but only nice to have, if the complexity of reindexing is not huge.

It's theoretically possible, but unlikely that we would do that - at least not for 7.9, and likely not in the foreseeable future. We don't have any infrastructure in place for things like this.

We could suggest an _update_by_query script to run manually though.

@hmdhk
Copy link
Contributor Author

hmdhk commented May 28, 2020

Thanks @axw , I think it's fine to duplicate url only on new data for now.

@simitt
Copy link
Contributor

simitt commented May 29, 2020

What is the usecase for collecting page.url.*? Depending on the usecase it might make sense to not only duplicate the field in url.full but split it into it's parts and fill as many of the ECS defined url.* fields as applicable.

Since the url.full is indexed as multifield, I'd also not trim it to 1024 chars but leave it unchanged.

@hmdhk
Copy link
Contributor Author

hmdhk commented May 29, 2020

Good idea @simitt , for now the use-cases have been around trying to search for a particular url. but I think if we start splitting the url into its fields (as well as keeping the full url), it can help address many more use-cases (e.g. ad-hoc search queries based on scheme).

cc @andrewvc @shahzad31 @justinkambic @drewpost

@hmdhk
Copy link
Contributor Author

hmdhk commented May 29, 2020

I've updated the description to apply the same solution for both Transactions and Errors

@drewpost
Copy link

drewpost commented Jun 1, 2020

I like @simitt suggest to break it down as much as possible. Question - would you propose normalising the query strings or storing them as received?

@simitt
Copy link
Contributor

simitt commented Jun 2, 2020

I was not thinking to normalize them; we could reuse the logic we have for jaeger in place https:/elastic/apm-server/blob/master/processor/otel/consumer.go#L508.

@simitt
Copy link
Contributor

simitt commented Jun 3, 2020

While on it, we could also map page.referer to the ECS field http.request.referer.

@jalvz
Copy link
Contributor

jalvz commented Jun 8, 2020

where should the duplicated field be stored?
url is used already for context.request.url

@axw
Copy link
Member

axw commented Jun 8, 2020

@jalvz the (previously unwritten) assumption here is that you only ever pass context.page or context.request. I would suggest we codify that by making the JSON Schema more strict.

In either case they would be recorded in the url.* fields.

@jalvz jalvz self-assigned this Jun 8, 2020
@jalvz jalvz mentioned this issue Jun 8, 2020
10 tasks
@jalvz
Copy link
Contributor

jalvz commented Jun 9, 2020

@jahtalab any reasons to not go with #3827 (comment)? (This was expected at least by 1 customer)

@hmdhk
Copy link
Contributor Author

hmdhk commented Jun 9, 2020

I don't see any reason why not. Other than the fact that ECS field is referrer and the actual http header name is referer wrong spelling but it is the standard 😄

@zube zube bot added [zube]: Done and removed [zube]: Ready labels Jun 10, 2020
@axw axw added ecs and removed [zube]: Done labels Jun 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants