PoisonedPrisonPanda

joined 2 years ago
[–] PoisonedPrisonPanda@discuss.tchncs.de 1 points 2 months ago (1 children)

To paraquote H. L. Mencken: For every problem, there is a solution that’s cheap, fast, easy to implement – and wrong.

This can be the new slogan of our development. :')

I have convinced management to switch to a modern server. In addition we hope refactoring our approach (no random reads, no dedupe processes for a whole table, etc.) will lead us somewhere.

As for how to viably decrease the amount of data in your active set, well, that’s hard to say without knowledge of the data and what you want to do with it. Is it a historical dataset or time series?

Actually now. We are adding a layer of processing products to an already in-production system which handles already multiple millions of products on a daily basis. Since we not only have to process the new/updated products but have to catch up with processing the historical (older) products as well its a massive amount of products. We thought since the order is not important to use a random approach to catch up. But I see now that this is a major bottleneck in our design.

If so, do you need to integrate the entire series back until the dawn of time, or can you narrow the focus to a recent time window and shunt old data off to cold storage?

so no. No narrowing.

Is all the data per sample required at all times, or can details that are only seldom needed be split off into separate detail tables that can be stored on separate physical drives at least?

Also no IMO. since we dont want a product to be processed twice, we want to ensure deduplication - this requires knowledge of all already processed products. Therefore comparing with the whole table everytime.

[–] PoisonedPrisonPanda@discuss.tchncs.de 1 points 2 months ago (1 children)

First question: how many separate tables does your DB have? If less than say 20, you are probably in simple territory.

Currently about ~50. But like 30 of them are the result of splitting them into a common column like "country". In the beginning I assumed this lead to the same as partitioning one large table?

Also, look at your slowest queries

The different queries itself take not long because of the query per se. but due to the limitation of the HDD, SQL reads as much as possible from the disk to go through a table, given that there are now multiple connections all querying multiple tables this leads to a server overload. While I see now the issue with our approach, I hope that migrating the server from SQL server to postgreSQL and to modern hardware + refactoring our approach in general will give us a boost.

They likely say SELECT something FROM this JOIN that JOIN otherthing bla bla bla. How many different JOINs are in that query?

Actually no JOIN. Most "complex" query is INSERT INTO with a WHEN NOT EXIST constraint.

But thank you for your advice. I will incorporate the tips in our new design approach.

[–] PoisonedPrisonPanda@discuss.tchncs.de 1 points 2 months ago (6 children)

with pagination you mean paginating to split the query into chunks during comparison of a give data set with a whole table?

[–] PoisonedPrisonPanda@discuss.tchncs.de 1 points 2 months ago (1 children)

first of all many thanks for the bullets. Good to have some guidance on where to start.

2nd level cache shared between services

I have read about this related to how FB does it. In general this means that fetching from the DB and keep it in memory to work with right? So we assume that the cached data is outdated to some extend?

faster storage/cpu/ram faster storage/cpu/ram faster storage/cpu/ram

I was able to convince management to put money into a new server (SSD thank god). So thank you for your emphasizes. We are also migrating to PostgreSQL from SQL server, and refactor the whole approach and design in general.

generate indexes

How would indeces help me when I want to ensure that no duplicate row is added? Is this some sort of internal SQL constraint or what is the difference to compare a certain list of rows with an existing table (lets say column id)?

Sort of harsh approach, but I get it.

Yeah. To me it feels like we used a powertool as a hammer. Brute force in the wrong way.

As an update: I was able to convince my people to migrate to a modern server - altogether we also switch from SQL server to PostgreSQL. During this migration we also try to refactor our workflow since it was flawed by design.

So, many thanks for the input.

[–] PoisonedPrisonPanda@discuss.tchncs.de 3 points 2 months ago (2 children)

Realistically, this setup is 10 years too old

thanks for this input. This was the winning argument for my boss for migrating to a modern server. While I admit that I see many flaws in our design, we are now working on refactoring our architecture and approach itself.

Thanks to the other numerous answers leading me to the right direction (hopefully).

Why aren’t you de-duping the table before processing?

I need to dedupe the to-be-processed data with the data thats already processed in the "final" table. We are working with hundreds of millions of products therefore we thought about "simply" using random batches from the data to be processed. But thanks to the many replies Ive learned already that our approach was in the beginning already wrong.

[–] PoisonedPrisonPanda@discuss.tchncs.de 3 points 2 months ago (1 children)

es is schowieder jeder krank. 🥲 i dafias überhaupt net.

wennst im bus sitzt kommen influenza vibes hoch.

ich leid mit dir. gute besserung.

des geht si dann nur wieder mit am spesenkonto aus.

yes. the problem is, we are fetching products from an API. and since we thought processing power will be a limiting factor, we thought that sorting out duplicates would reduce load.

but since the different microservices which process the data are taking different times we are using the sql tables as a pool. this should help upcscaling by using multiple microservices.

cloud services are yet not a solution as we are still in development.

[–] PoisonedPrisonPanda@discuss.tchncs.de 0 points 3 months ago (2 children)

haha. relating to a switch to ssd? or in which direction?

wenn nicht jetzt. wann dann. klatsch klatsch

view more: ‹ prev next ›